您好,欢迎来到微智科技网。
搜索
您的当前位置:首页(A

(A

来源:微智科技网
Reprint from the proceedings of the ACM Symposium on Parallel Algorithms and Architectures, SPAA92June 29 - July 1, 1992, San Diego, California.

Supportingthehypercubeprogrammingmodelonmesharchitectures

(AfastsorterforiWarptori)

ThomasM.StrickerCarnegieMellonUniversityPittsburgh,PA15213-30

Abstract

Manycombinatorialproblemshavesimplesolutionsforparallelprocessingonhighly-connectednetworkssuchasthebutterflyorthehypercube,whereasthefastestprocessor-to-processorintercon-nectionsarerealizedinparallelmachineswithlowdimensionalmeshortorustopology.Thispaperpresentsamethodformap-pingbinaryhypercube-algorithmsontolowerdimensionalmeshesandanalyzesthismethodinamodelderivedfromthearchitectureofmodernmeshmachines.Weoutlinethecriteriausedtoevalu-ategraphembeddingsformappingsupercomputercommunicationnetworks.

Ourworkwasmotivatedbytheneedforfastlibraryroutinestodoparallelsorting,fastFouriertransformationandprocessorsyn-chronization.Duringthedesigneffortofthesebuildingblocks,wedevelopedandanalyzedanewtechniquetosupportahypercubenetworkembeddedontoatwodimensionaltorus.Adirectimple-mentationoftheembeddingismadepossiblebylogicalchannelsandpathways.Afastmergesorterbasedonthebitonicnetworkservesasanexampletoshowhowasimplehypercubealgorithmcanoutperformmostoftheasymptoticallyoptimalmeshalgorithmsforpracticalmachinesizes.

Intheconventionalmeshcomputationmodel,processorsareallowedtoexchangeoneunitofdatawithaneighborineachstep.Thismodelneedstoberefinedsincemodernmeshcomputers,suchastheiWarpsystem,havehardwaresupportforfastnon-neighborcommunication.

Thebitonicmergesort,asimplehypercubealgorithm,containsafairamountoffinegrainparallelismnotfoundinstandardmeshalgorithms.Thisformofparallelismincludespipelinedcommuni-cation,computationoverlappedwithcommunication,useofwideinstructionwordsandoperandsdirectlyreadfromthecommunica-tionsystemthroughsystolicgates.

Themeasuredsortingrateofmorethan2106keys/seconaniWarptoruswithjustprocessorsshowstheexcellentabsoluteperformanceofourapproach.Theperformanceresultscompare

1.2Previousworkonsortinginhypercubes

Parallelsortingofelementsonbinaryhypercubesofprocessorshasbeenwidelystudied.Batcherintroducedthebitonicmethodto

sort

elementsinlog2time[Bat68].Thebitonicsortingnetworkisdiscussedinthealgorithmscollectionof[Knu73]andseveraltextbooks.

Inourimplementationwemadeuseofthebitonicsortingnetworkbutdidnotusethebitonicmethodtodothelocalsortwithinaprocessor,becausebitonicsortingrequiresatotaloflog2passeswhichresultsinexcessivelocalbitonicmerges.Aspointedoutinmanyearlierpublications,thelogstagenetworksproposedby[AKS83],andevenmorerecentimprovedversions,donotprovidefeasiblesortingsolutionsforpracticalma-chinesizes[Pat90].

Sortersinlineartimearewidelyknown,forsortingdatalocallywithinoneprocessorofaparallelmachine.Aradixsorter,basedoncountingsort,iswelldescribedinmanyalgorithmstextbookse.g.[CLR90].Angoodlocalsorterbasedoncomparisonsisquicksort[Hoa61]withlogaveragecasecomplexity.

Parallelversionsofradixsortbyrankcountingwereusedsuc-cessfullyontheConnectionMachine[BLM91]andtheCrayYMP[ZB91]tobreaksortingrecords.Probabilisticsortersbasedonsam-pleddistributionsandpermutationroutingperformverywellonlargemachinesforbigproblemsizes[RR][BLM91].Onmeshcomputers,messageroutingofrandompermutationsisofacom-plexitysimilartosorting[Kun91a].Reducingsortingtoroutingisnotsolvingtheproblemonthemeshunlesssignificantlyfasterrouterhardwareisprovided.

1.3Previousworkonsortinginmeshes

Alargenumberofpapershavebeenpublishedonsortingusingthemeshmodel.Wecompareoursortingalgorithmusingthemodelofemulatedhypercubestoseveralsortingalgorithmsbasedontheconventionalmeshmodel.

ThompsonandKungshowthatthecomplexityofsortingonanmeshisandgivea6algorithm[TK77].Unfortunatelyforpracticalmachinesizes,hidessignificantlow

orderterms.Furtherwehavelookedintotheoptimal3

3

4log

.Thebound,derivedfromthebisection

bandwidth,isnotverytightandsuggestsforexamplethatabinary

4-cubecouldbeembeddedina16cellringwithcongestion

1.Theembeddingofbinary-cubesontoan-toruscanbedoneeitherinonestep,directlyreducingthedimensionfromto,orinseveralsteps,bymappingstep-by-stepgroupsofhypercubedimensionsontoonetorusdimension.

Definition3.1Amappingofabinary-cubeintoan-dimensionaltorusisseparableif1...:andthereexistsasetofmappings1...suchthatisamapping,andmapsa-cubetoa2ring.

3.1Asystematicseparableembedding

Earlierworkonmesh–hypercubeembeddingsdealsmostlywiththereverseproblemofembeddingmeshesintohypercubes[CTH][LSBD88].Suchanembeddingismucheasiertofindsincethemeshislessconnectedthanthehypercube.Graycodemappings

weresuggestedforsimplecasesandevenimprovedforirregularsizesby[LSBD88]andothers.

WefoundGraycodemappingstobeagoodstartingpointtosolveourproblemofmappinghypercubesontomeshes.AGraycodeisasimple,systematicwaytogenerateasequenceofnumberswithaHammingdistanceofone.OneoftheGraycodesisthebinaryreflectedGraycode.Itisrecursivelydefined:

1

0

2

isanoperatordesignatesaddingtheaconstantsequencevaluewrittentoeveryinreversenumber.andwhere

16listsas:

0132675412131514101198

Definition3.2AGraycoderingmappingisobtainedbymappingthelogicalprocessorofthe-cubewithcellnumbertothenodeintheringwith.

Definition3.3AGraycodetorusmappingisaseparablemappingwhenalloftheprimitivemappingsareGraycoderingmappings.

(1,1,0)(1,1,1)4675457623(0,1,1)23011(0,0,0)(0,0,1)0Figure1:TheGraycoderingmappingofabinary3-cubeontoaringoflength8

Themappingweusedforthe-processoriWarptorusiscom-posedfromtwoGraycoderingmappings,whereeachofthemmapsa3-cubeontoa8-cellringeach.DuringtheanalysisandevaluationoftheGraycodemappingwecanshowwhichpropertiesareim-portantforagoodmappinginouremulatedhypercubemodelanditsimplementationoniWarp.

Aworkloadofoneisveryimportantunlesstheprocessorsoftheparallelsystemcanbeusedinmultitaskingmodeveryefficiently.Thedilationoftheembeddingisofsecondaryimportanceforourmodelsincethearchitectureprovidesnon-neighborcommunicationatalowlatencyperhop.FortheGraycodemappingthedilationis21,whichisobviousfromtherecursivestructureoftheGraycode.Themostimportantfactorforourmappedhypercube

modelistheedgecongestion.Asweshowlaterintheanalysisofthesorter,theedgecongestionisanimportantdeterminantoftherunningtimeofourcommunicationsteps.Ingeneralwecannotavoidcongestion,butwithmodernmachineslikeiWarpwe

canprovidearchitecturalsupporttohandleacertainamountofcongestionwithoutdegradationofperformance.

Thealgorithmsconsideredinthispaperusethehypercubedi-mensionsoneatatime.Wethereforecomputethecongestionforeveryhypercubedimensionseparately.

Theorem3.1LetbetheGraycodemappingofalllogical

connectionsalongthe-thhypercubedimension.Let

bethemaximaledgecongestionofonedimensionofthemappedhypercubei.e.theedgecongestionof.Thecongestioncanbecomputedas:

recursion

closedform

42111

2122

2

2

1

2

2

Proof3.1ByinductionontherecursivestructureoftheGraycode.WheneverthetwosubsequencesofaGraycodedringoflength2arecombinedintonewGraycodedringoflength21,anumberof22connectionsmustbereroutedand2newconnectionsareintroduced,halfofthemroutedthroughthewraparoundofthering.

Asimilarrecursioncanbeestablishedandsolvedforthetotalcongestion.Edgecongestiondoesnotcapturealloftheresourceconstraintsofourarchitecturalmodel.Logicalconnectionsconsumeresourcesateveryswitchingnode(whichdistinguishesthelogicalchannelsofiWarpfromthegeneralconceptofvirtualchannels).Thevertexcongestionofanembeddingcapturestheseresourceconstraints.ForexampleinaniWarpsystemthenumberofchannelsduetoout-goingconnections,thenumberofchannelsduetovertexcongestion,andthenumberofchannelspermanentlyallocatedtoservicenet-worksmustbesmallerthanthetotalnumberofchannelsavailableintheVLSIcomponent.3.2Non-separablemappingTheproblemofanoptimalwaytoembedabinary6-cubeontoan88torushasbeenstudiedearlier.BesidestheGraycodedmappingdescribedinthispaper,non-separablemappingswithbettercongestionweregivenby[Sch92]and[She91].ThefollowingtableliststhecongestionanddilationfortheGraycodemappingsusedinouralgorithmversusaslightlyimproved,nonseparablemappinggivenby[Sch92].

DilationHDm123

456

1122221

1

1

1

1

2

rayc

oldedg

copiesofthesetoriareinterleavedtoobtainan88meshinthefollowingway:thetorusiscopiedandshifteddiagonallybyoneinthegrid.Theconnectionsbetweenthesetwotoricanberoutedwithacongestionofoneandadilationofuptofour.Thisresultsinanembeddingfor32nodes.Theresultinggridgraphisagaincopied,transposed(mirroredonthegriddiagonal)andtranslatedtoformaproperlyinterleavednodegraph.Theconnectionstolinkthenodesofthegraphwithitstransposedcopyareverysimilartoagridtransposenetwork.Theconnectionsofthisnetworkhavealengthofupto11(dilation)andacongestionofuptotwo.ThemeshwiththemappedprocessorindicesisshowninFigure3.2.

03214035624834456075268339411157104937124513611553142435254327592651392847296331553016341742195818503820462162235422Figure2:Thenon-separable,mappingbasedontwofoldedgrids(26).Thedrawingincludestheroutingforlogicallinksofdimension0.

Weleaveasanopenquestiontowhatextentthecongestioncanbereducedforsmallmeshesandhowtofindsuchembeddings.

4Acommunicationarchitectureformappedhy-percubes

ThearchitectureoftheiWarpsystemhasbeendescribedindetailin[BCC88]and[BCC90].SinceourmethodofemulationandourmodelfortheanalysisarestronglymotivatedbyarchitecturalfeaturesoftheiWarpcomponent,wepresentbrieflythearchitecturalelementsrelevanttomappedhypercubesandtothesorter.

4.1Highspeedprocessorinterconnects

Allprocessorinterconnectsinthe2dimensionaltorusoftheiWarpmachinearerealizedasparallelbuseswitheightdatalines.Everyprocessorhasfourincomingandfouroutgoingports,connectedtoitsmeshneighbors.Wecallthehardwareofsuchaninterconnectaphysicallink.Themaximumaggregatespeedofaphysicallinkis40MBytes/secineachdirection.Aprocessorcanthereforecommunicateoverallofitseightlinkstogetheratratesupto320MBytes/sec.

4.2LogicalchannelsandiWarppathways

TheconceptmostrelevanttoouremulatedhypercubemodelistheiWarppathway[BCC88][Gro].Pathwaysallowpermanentlogicalconnectionsbetweencellsthatarenotphysicalneighborsinthemesh.Logicalchannelsareusedtosharephysicallinksandtobuildpathways.Everylogicalchannelhasitsownbuffer,andthereforeapathwayincludesitsownbuffersateveryintermediatenode.Thebandwidthofasharedphysicallinkisallocatedbyaroundrobinscheduleramongthelogicalchannels.Routing,flowcontrolandexceptionsarealsohandledseparatelyforeachlogicalchannel.

Thehardwareunitthatperformsthisswitchingandschedulingiscalledthecommunicationagent.Theagenthandlesthelinkstothefourneighborsandfurtherinterfacestothecomputationagent.Athirdunit,thememoryagentprovidesdirectaccesstoandfromthecommunicationchannels.Thecommunicationagentisimple-mentedwithafixednumberofbufferresources.Thecurrentimple-mentationofiWarpallows20logicalconnectionsatonetime.Thisaddsaninterestingconstrainttothenetworksthatcanbeembeddedandsetupstaticallywithlonglivedconnection.However,networksarealsodynamicallyreconfigurablewithareasonableoverhead.

4.3

Asynchronousmemorycommunications

(iWarpspools)

Manyalgorithmsuseblocktransferstomovedatafrompathwaystomemoryorviceversa.Tooverlapcommunicationandcomputationthismustbedonewithoutaffectingthecomputationintheprocessor.IniWarp,theinterfacebetweencommunicationandmemoryisdonebyeightindependentagents,calledspools[BCC90].SimilartoaDMAcontroller,aniWarpspoolcopiesdatafrommemorytoapathwayorfromapathwaytomemory.Thedatatransfershappeninthebackgroundasynchronouslywithlittleinfluenceonacomputationgoingonatthesametime.

4.4Processingpowerandmemorybandwidth

TheiWarpinstructionsetincludesalonginstructionwordopera-tionthatcanexecuteabranch,aload,astore,anintegerortwoaddresscomputationsandtwobitIEEEfloatingpointoperationsinaslittleas4cyclesat50ns/cycle.Atpeakrateoneproces-sorcanexecute10millionoftheseinstructionspersecond.Themaximalbandwidthofload-andstore-operationstomemoryisapproximately80MBytes/sec.

4.5

Processingdatadirectlyfromcommunicationchannels(systolicgates)

Asimplethroughput(piping)analysisofinstructionrates,operandsandmemorybandwidthshowsthatthefullprocessingpoweroftheiWarpcommunicationcannotbeusedunlessstreamsofincomingdataareprocessedbythecomputationagentwithoutgoingfirsttomemory.Therefore,someoperandsofthecomputationcanbebroughtinfromthecommunicationunitthroughsystolicgates.Gatesareusedintheinstructionslikearegister[BCC90].

5AfastparallelsorterforiWarp

Inthischapterwedefinetheproblemofparallelsortingmorepre-ciselyandgiveadetaileddescriptionofthealgorithmthatresulted

inthefastestsorterforiWarpsystems.Weaddressthefollowingsortingproblem:

Parallelmachine:processors,organizedinan

meshwith

wrap-aroundlinks.

Inputtobesorted:

keysin-bitIEEEfloatingpointnumberrepresentation.Thekeysareevenlypartitionedamongtheprocessors,i.e.keysperprocessor.

Output:

keys,sortedinascendingorder.Morepreciselythekeysareevenlypartitionedamongprocessorsandeach

processorholdsasortedsub-sequenceoflength

.Thesortedsequencesarepositionedintheprocessorsinsortedorder,i.e.thedonotoverlapandtheirrangesareorderedaccordingtotheGraycodedprocessorindices.

Workspace:Aworkspaceofthesizeoftheinputdataplusaspace

ofconstantsizeforcountersaregranted.Thesortingalgorithmweusedtoevaluateouremulatedhyper-cubemodelformeshesisbasedonthefollowingtwosteps:

Thekeysaresortedbyaradixsortlocallywithinallproces-sors,resultinginmonotonicsequences.Algorithm5.1.

Theglobalsortingisdonewithsequencemergesonthebitonic

sortinggraph.Thisrequireslog2stagesofmerges.Al-gorithm5.2.

Theideaofsortingmultipleelementsperprocessorbyalocalsortfollowedbyaseriesofsequencemergeswasdiscussedin[BS78]withinthecontextofparallelmachines.

5.1Thelocalsort

Thelocal(intra-processor)sortisaradixsortexactlyasdescribedine.g.[CLR90].Inapre-andapost-processorphasethesignbitsaretranslatedbetweentheIEEEfloatingpointrepresentationandaplainbitintegerrepresentation.

Inourdescriptionofthealgorithms,weusethefollowingvariablenamesforthearraystoholdthedata,theworkspaceandthecountersoftheradix-countingsort.data:Theinput/outputarrayof-bitkeys.

tmp:Anarrayof

storagelocationsforintermediateresults.

cnt:Anarrayoffixedsizeforthecountersoftheradixsort.Inthe

currentimplementationthereare210counters,whichallowaradixsortofbitkeysin6passes.Algorithm5.1LocalRadixSort

mask(data,bit

field.

forallprocessorsdo:

bit

field[])]++;

for

2to

field[])

1

5.2Thesequencemerge

Theglobal(inter-processor)sortisperformedwithaseriesofse-quencemergesteps.Asequencemergeisanextensionofthecompareandswapconcepttomonotonicsequencesofelements.Amergecanbeusedinplaceofacompareandswapstepinanysort-ingnetwork.Forthemergestep,apairofprocessors(wecallthemneighbors)mustworktogethertomergethemonotonicsequencesstoredinthoseprocessors.Afteramerge,oneprocessorholdsthelowerhalfofthemergedsequenceandtheotherprocessorholdstheupperhalf.

Thespoolingagentandthecomputationagentarebothinvolvedinthemerge.Theyareprocessingtwostreamsofdataasyn-chronously.Thespoolingagentsendspartofthesequenceoverapathwaytoitsneighborwhilethecomputationagentreceivessomeelementsfromitsneighbor,comparesthemtotheelementsofthelocalsequenceandstorestheresultsasamergedsequencetomemory.Figure3explainsthisprocessinmoredetail.

memorycomputationcomputationmemoryagentagentagentagent(spool)

(fpu)

(fpu)

(spool)

communication0 3 4 7 9agents1 2 5 8 9<<0 1 2 3 45 7 8 9 9memory

CPU

pathwayCPUmemory

Figure3:Finegrainparallelismisusedtooverlapcommunicationandcomputationinthesequencemergestep.

Thealgorithmfortheinter-processorsortingstepcanbestatedasfollows:

Algorithm5.2SequenceMergesend(,)

sendsto.recv(neighbor[i])returnsthereceivedfrom.doinparallel

Inparallelwithineachprocessor.(Finegrainparallelism)

select(...

...)

Selectproperchoiceaccording

tocorrespondingstagesinbitonicgraph.

forallprocessorsdo:for1tologdo

for

1tododoinparallelfor1to

do(memoryagent)

send

j=1;rcv=recv(computationagent)

forselect(1to

to1)

ifselect()

then;

1;

else

;

recv

discardrestof

Thebitonicnetworkisstructuredin

2

for8processors.The-selectorsarecomputedbyaseriesofxoroperationsonselectedbitsoftheprocessornumber,thestageandthegroupnumber.Thedirectionofthemergestepisdeterminedbythispredicate.Theselectorforthedirectionoftheindexinstoreloopisdeterminedbythemergedirectionofthenextmergestep.Thisassuresthatthekeysareintherightorderfortransmissionthroughspoolinginthenextstage.

Figure4:Thebitonicmergegraphaccordingto[Bat68]and[Knuth73].

ThesystolicimplementationofthesequencemergestepinAl-gorithm5.2iscrucialtothesuccessofouralgorithmdesignforiWarpparallelsystems.Thealgorithmforthesequencemergewascarefullychosenandhasmaximalfinegrainparallelism.

Ourimplementationdoesmergeswithin0.5sperelement,whichisonlythreetimesslowerthanasimplecompareoperationoftwofloatingpointnumbersbetweenregisters.Theexecutiontimeincludestheloadfrommemory,andthestoretomemoryaswellasthesendsandreceivesoverthenetworkandthebranchesforcon-ditionalsandloopprocessing.ThemergeispipelinedandproperlyscheduledforthelonginstructionwordoperationsiniWarp.

Interestingly,mergingtwosequenceswithapairofprocessorsismorethantwotimesfasterthanmergingtwosequencesonaniWarpuni-processor.Designedforsystolicalgorithms,theproces-sorsprovidedirectaccesstothecommunicationchannelsthroughsystolicgatesintheregisterfile.Intheparallelversionofthemergetheoperandsofthecomparisonscanbereceivedandprocessedatthesametime.Thissavesonethirdofthememoryaccessesandremovesabottleneckinthesingleprocessorimplementation.

Giventhearchitecturalmodelofprivatememorymeshcomputerswithhighspeedprocessorinterconnects,wefoundthatthebitonicsortingnetworkexecutedonanemulatedhypercubeisthebestmethodpossibleforparallelsortingonthesekindofmeshmachinesforallpracticalmachinesizes.Wewillsupportthisstatementbyacomparativeanalysisofhypercube-andmesh-sortingalgorithmsinthenextchapter.

6Analysisofthemappedbitonicsorter

Wepresenttheanalysisofouralgorithmforthecaseofan-dimensionaltorus,assumingthatall-dimensionshavethesamesizeandarepowersoftwo.Sinceweuseseparablemappings,wecangeneralizetheanalysistodifferentarraysizesbyusingdifferentmappingsfortorusdimensionsofdifferentsize.

thethenumberdimensionofprocessorsofthemeshin-the(hypercubeorthemesh.2foriWarptori.)

bitsfortheelementsand2

log

2

2

stages.

Thebitonicsortingnetworkdoesallcommunicationalonghy-percubeconnections.Thenetworkcanthereforebeembeddedontoaphysicalhypercubenetworkwithload1andcongestion1.Fur-thermoreeverystageofthebitonicsortingnetworkusesthelinksofexactlyonehypercubedimension.Thenumberofparallelmergestepsrequiredforabitonicsortonahypercube-connectedparal-lelcomputerswithprocessorequalsthenumberofstagesinthebitonicsortingnetworkwithinputs.

Theperformanceofthemappedalgorithmisaffectedbythechar-acteristicsofthemapping,whenexecutingthestagesofabitonicnetworkonaphysicalmesh.Thecomputationcostsdonotdifferfromthelogicalhypercube,sinceallmappingsconsideredhave

load

1.Themergestephandleselementsandispipelined.Theeffectofthedilationonthecommunicationlatencycanbeneglectedaslong

intothecalculation.Thisratiolimitstheamountofcongestionthatcanbehandledwithoutslow-down.

BasedonthearchitecturalspecificationofiWarpwepredictthefollowingrunningtimesforthecomputationandcommunicationoperations:Computationtakes14clocksperelement,whilethecommunicationonlytakes4clocksperelement.Theratioisthere-fore2

7

thisisarateof1

tend2isclosetotheactualmeasurementsincethehardwareschedulerstoallocateevenfractionsofbandwidth.

Whenmultiplemergestepsshareaphysicallink,theslowdowndependsonthenumberofmergesinexcessof1

inallmeshdimensionsandthatthe

hypercubedimensionscanbepartitionedevenlyintomeshdimen-sions.Notethatforatorusthewrap-aroundlinksreducethecongestionforthehighestdimensions.

max12

max12

Tocomputethenumberofeffectivemergestagesmergewecarryouttheweightedsumoverallthestagesofthebitonicsortingnetwork:

merge

1

1

Forthesymbolicsummationthesumisrewrittenas:

(1log)

merge

1

(1

)

2

(1)

log)1

2(12

(1

)

1

4

Therunningtimeoftheglobalsort,mergeisthenumberofeffectivemergestagestimesthetimecomplexityofasinglemergestep(Corollary6.1).

merge

merge

2

2

8

((1

log)

4)

1

2

2

1

2

8log

6

inthenumber

ofprocessors.

Corollary6.3Fortwodimensionaltori2,

,

2logand2

4

5log(42

4)log

2loglog7

Forothervaluesof,thetimecomplexitiesare(14:

1563

4log5and1:62004log7.Avalueof1correspondstoperfectlybalancedarchitectureswithacommunication–computationratioof1.

Theasymptoticresultoftheanalysisisslightlyworsethanthe

3

lowerboundfor1-1sortinggivenin[SS88],butinourmappedalgorithmwehavesmallloworderterms.ThisisthereasonwhythismethodresultedinthefastestsorteronourcelliWarpmachines.

CurrentiWarpsystems,whichrunat20106cyclespersecond,require14cyclestoprocessasinglebitkeyinamergestepandatotalof230cyclestoprocessabitkeyduringthelocalradixsort.Togetherwiththethetimecomplexitiesweareabletoestimatetheexecutiontimesofthewholesorter.TheestimatematchthemeasuredexecutiontimeslistedinChapter8verywell.

Comparisons (merge steps) vs. processors for ratio 1/21000000󰀵Rev-Sort󰀵󰀹1-1 Optimal Sort󰀵󰀹N100000NMap. bitonic󰀵cong.󰀵󰀹N󰀹Map. bitonic w/o󰀵N\"cong.󰀹󰀵N󰀹10000󰀵󰀹N󰀵󰀹NN󰀵󰀹1000󰀵󰀹N󰀹N󰀵󰀹N\"\"\"󰀵󰀹N\"\"\"\"\"100󰀹N\"\"󰀹󰀵\"󰀹󰀵N\"\"N\"10\"N󰀵6kkkkk16514652MMMMMGG2166146781421662Figure5:Numberofeffectivemergestepsvs.numberofprocessors

forRevSort,optimalsortandmappedbitonicsort.Assumedis

communication–computationratio

1

2

loglogsteps.The

algorithmissimpleandthereforeagoodcandidateforthefastest

sorteronpracticalmachinesizes,sinceloglog

5inpractice.AsseeninFigure5,RevSortitisbetterthantheoptimalmeshsorterforsmallcasesbutcannotoutperformthemappedbitonicsorter.Forlargetheoptimal1-1sorterisbest.

Thekeytothehighefficiencyofthesorterliesinthegoodbehaviorofsmallercasesandinthefactthatall\"practical\"machines

Comparisons (merge steps) vs. processors for ratio 1/11000000LRev-SortL󰀹1-1 Optimal SortLNN󰀹100000NMap. bitonicLcong.N󰀹LMap. bitonic w/oN󰀹L\"cong.LN󰀹󰀹10000LN󰀹LN󰀹NL󰀹1000NL󰀹N󰀹󰀹NLL󰀹N\"\"\"\"\"\"\"100󰀹\"\"󰀹LNL\"\"\"󰀹N\"\"10L\"N6kkkkk16514652MMMMMGG2166146781421662Figure6:Numberofeffectivemergestepsvs.numberofprocessorsforRevSort,optimalsortandmappedbitonicsort.Assumedisperfectlybalancedcommunication–computationratio1.

sizesarewellwithinthatrangeoptimality.AtthistimeiWarpiscommerciallyavailableuptosizesof1024cells.LargeriWarptoriarehypothetical.

7.3Usingthewraparoundlinksinameshsorter

Averyinterestingaspectofmeshsortingisthattherowandcolumnsortingstepscannotmakeuseofthetorusunlesstheyuseextrael-ementsofbuffer-space.Aresultwithsixextrabuffersismentionedin[Kun91a]andattributedto[MS].

7.4Theh-hmeshsortingmodels

Somerecentliteraturepointsouttheimportanceofthesort-ingmodels[Kun91b],[Kun91a]andgivesa25algorithmforaslightlymodified1-1sortingmodel.Wefoundmostofthemethodsimpracticalbecauseoftheirlargeloworderterms,butmuchmoreworkisneededtocometoaconclusionaboutallmethodsformeshsortingproposedintheliterature.

7.5

Comparisonbetweenmeshandmappedhyper-cubesorters

TheFigures5and6showthatthecomplexitiesforthemappedalgorithmarebetterthanforthetwodirectmeshsorterssortersforallpracticalmachinesizes.

Tobreakevenwithourmappedsorterintheweightednumberofmergesteps,theoptimalalgorithmneedstorunonatwodimen-sionaltorusofsize65536,with4109processors;seeFigure5.

Evenundertheassumptionofan11ratiobetweencommunica-tionandcomputation,amachinethatis34timeslessefficientincommunicationthaniWarp,themappedbitonicmergealgorithm

canoutperformtheoptimalmeshsort,unlessmorethanor16103processorsareused.

128

8Comparisontootherparallelsorterprograms

Toevaluatethespeedofthemappedbitonicmergesorter,wegiveanoverviewofsomepublishedexecutiontimesforsortingonemillion-bitkeysondifferentmachines:

A1024SprintnodeConnectionMachine(CM-2)sortsonemillionelementsin660msatarateof15106keys/sec.usingsamplesort[BLM91].TheConnectionMachineisbasedonphysicalhypercubeinterconnectionsbetweentheSprintnodes.1

AprocessoriWarpsortsonemillionelementsin442msandatarateof23106keys/sec.usingthemappedbitonicmergesorterdiscussedinthispaper.Themachineisbasedonatwodimensionaltorus.

A4096processorMasParMP-1sortssortonemillionelements

usingbitonicsortin1446msatarateof072106

2

onxnet.Thismachineisbasedontwodimensionalgrid,adiagonalgrid(xnet)andageneralpurposerouter[Pri91].

AsingleprocessorCrayYMP3sortsonemillionelementin330msatarateof30106keys/secusingaparallelversionofradixsort.An8processorCrayYMPcanachievea5-6foldspeedupresultinginarateof1516106keys/sec.[ZB91].

ThemeasuredexecutiontimesandthesortingratesfordifferentsizesortingtasksareshowninFigures7and8.ThemeasurementsontheCM-2showtheperformanceforparallelsortingonarealhypercube,whiletherunningtimesoniWarpshowtheperformanceofsortingonalogicalhypercubemappedtoaphysicaltorus.

Theimplementationofsamplingallowsfastsortersonlargeprob-lemsontheCM-2sincetheoverheadoftheroutercanbecompen-sated.IntheFigures8and7itcanbenotedthatiWarpdelivershighsortingperformancealsoforsmallproblemsizes.Thisisdueagoodmatchofthealgorithmandthebalancedarchitecturewithhighspeedlowlatencycommunication.

9Conclusion

Wehaveestablishedamethodandamodeltomaphypercubecom-putationsefficientlytolowerdimensionaltori.Themethodisbasedonpast-neighborcommunicationoverfastprocessorinterconnec-tionswithlogicalchannels.Weexpectthesefeaturestobeincludedinmanymeshmachinesofthefuture.Themappedhypercubemodelprovidestheprogrammerwiththeviewofthemachineasahypercubeandletshimworkwiththesimplerhypercubealgorithms.ThehypercubenetworkisembeddedontothetorususingasimpleseparableGraycodemapping.TheGraycodemappingisgeneralandresultsacceptablecongestion,alsoslightlybettermappingswerefoundforspecialcases.Andetailedanalysisofthetimecomplexityandanimplementationwasdoneforamappedbitonicmergesorterinourmappedhypercubemodel.Thisallowedusto

helpfuldiscussionwithourtheorycommunityatCMU,particularlywithAnjaFeldmann,GaryMillerandEricSchwabe.

References

[AKS83]

M.Ajtai,J.Komlos,andE.Szemeredi.Anlogsortingnetwork.InProceedingsofthe15thAnnualACMSymposiumonTheoryofComputing,pages1–9,April1983.

[Bat68]

K.Batcher.Sortingnetworksandtheirapplications.InProceedingsoftheAFIPSSpringJointComputingConference,volume32,pages307–314,1968.

[BCC88]S.Borkar,R.Cohn,G.Cox,S.Gleason,T.Gross,

H.T.Kung,M.Lam,B.Moore,C.Peterson,J.Pieper,L.Rankin,P.S.Tseng,J.Sutton,J.Urbanski,andJ.Webb.iWarp:Anintegratedsolutiontohigh-speedparallelcomputing.InProceedingsofSupercomputing’88,pages330–339,Orlando,FL,USA,November1988.IEEEComputerSocietyandACMSIGARCH.

[BCC90]S.Borkar,R.Cohn,G.Cox,T.Gross,H.T.Kung,

M.Lam,B.Moore,W.Moore,C.Peterson,J.Susman,J.Sutton,J.Urbanski,andJ.Webb.SupportingsystolicandmemorycommunicationiniWarp.InProceedingsof17thIntl’SymposiumonComputerArchitecture,May1990.

[Bla90]

T.Blank.TheMasParMP-1architecture.In35thIEEEComputerSocietyInternationalConference,pages20–40,Spring1990.

[BLM91]G.E.Blelloch,C.E.Leiserson,B.M.Maggs,C.G.

PlanxtonandS.J.Smith,andM.Zagha.AcomparisonofsortingalgorithmsfortheconnectionmachineCM-2.In1991ACMSymposiumonParallelAlgorithmsandArchitectures,June1991.

[BS78]

G.BaudetandD.Stevenson.Optimalsortingalgo-rithmsforparallelcomputers.IEEETransactionsonComputers,C–27:84–87,1978.

[CLR90]

T.H.Cormen,C.E.Leiserson,andR.L.Rivest.Intro-ductiontoAlgorithms.TheMITPressandMcGraw-Hill,1990.

[CTH]

S.L.JohnsonC.T.Ho.Embeddingmeshesinbooleancubesbygraphdecomposition.Computersciencetech-nicalreport,YaleUniversity,NewHaven,CT06520,19.

[Gro]

ThomasGross.CommunicationiniWarpsystems.IninProceedingsofSupercomputing’,pages436–445,Reno,NV,USA,November19.IEEEComputerSo-cietyandACMSIGARCH.

[Hoa61]C.A.R.Hoare.Quicksort.ComputingJournal,5,1961.

[KH88]

B.KahleandD.Hillis.Theconnectionmachinemodelcm-1architecture.IEEESystems,Man,andCybernet-icsSpecialIssue,March1988.

[Knu73]

D.E.Knuth.SortingandSearching.TheArtofCom-puterProgramming.Vol3.Wesley,ReadingMas-sachusetts,1973.

[Kun91a]

M.Kunde.Routingandsortingongrids.Habilitation-sschrift,FakultaetfuerMathematikundInformatikderTUMuenchen,1991.

[Kun91b]

M.Kunde.Sortingonmeshes.InProceedingsofthe32stAnnualSymposiumonFoundationsofComputerScience,October1991.

[LSBD88]S.LakshmivarahanL.S.BaraschandS.K.Dhall.Gen-eralizedgraycodesandtheirproperties.InThirdInter-nationalConferenceonSupercomputing,May1988.[MS]

Y.MansourandL.Schulman.Sortingonaringofpro-cessors.Technicalreport,LaboratoryofComputerSci-ence,MassachusettsInst.ofTechnology,CambridgeMA,19.

[Pat90]M.S.Paterson.Imporovedsortingnetworkswithlogdepth.Algorithmica,5:75–92,1990.

[Pri91]

J.Prins.EfficientbitonicsortingoflargearraysontheMasParMP-1.TechnicalReportTR91-041,UniversityofNorthCarolina,ChapelHillNC,1991.

[RR]

S.RajasekaranandJ.H.Reif.Optimalandsublogarith-mictimerandomizedparallelsortingalgorithms.SIAMJournalonComputing,18(3):594–607,June19.[Sch92]EricSchwabe.PersonalCommunication,January

1992.

[She91]JonathanShewchuk.PersonalCommunication,August1991.

[SS88]

C.P.SchnorrandA.Shamir.Anoptimalsortingalgo-rithmformesh-connectedcomputers.InProceedingsofthe20thAnnualACMSymposiumonTheoryofCom-puting,pages255–263,1988.

[TK77]

C.D.ThompsonandH.T.Kung.Sortingonameshconnectedparallelcomputer.CommunicationsoftheACM,20:263–271,1977.

[ZB91]

MarcoZaghaandGuyE.Blelloch.Radixsortforvec-tormultiprocessors.InProceedingsSupercomputing’91,pages712–721,November1991.

因篇幅问题不能全部显示,请点此查看更多更全内容

Copyright © 2019- 7swz.com 版权所有 赣ICP备2024042798号-8

违法及侵权请联系:TEL:199 18 7713 E-MAIL:2724546146@qq.com

本站由北京市万商天勤律师事务所王兴未律师提供法律服务