Supportingthehypercubeprogrammingmodelonmesharchitectures
(AfastsorterforiWarptori)
ThomasM.StrickerCarnegieMellonUniversityPittsburgh,PA15213-30
Abstract
Manycombinatorialproblemshavesimplesolutionsforparallelprocessingonhighly-connectednetworkssuchasthebutterflyorthehypercube,whereasthefastestprocessor-to-processorintercon-nectionsarerealizedinparallelmachineswithlowdimensionalmeshortorustopology.Thispaperpresentsamethodformap-pingbinaryhypercube-algorithmsontolowerdimensionalmeshesandanalyzesthismethodinamodelderivedfromthearchitectureofmodernmeshmachines.Weoutlinethecriteriausedtoevalu-ategraphembeddingsformappingsupercomputercommunicationnetworks.
Ourworkwasmotivatedbytheneedforfastlibraryroutinestodoparallelsorting,fastFouriertransformationandprocessorsyn-chronization.Duringthedesigneffortofthesebuildingblocks,wedevelopedandanalyzedanewtechniquetosupportahypercubenetworkembeddedontoatwodimensionaltorus.Adirectimple-mentationoftheembeddingismadepossiblebylogicalchannelsandpathways.Afastmergesorterbasedonthebitonicnetworkservesasanexampletoshowhowasimplehypercubealgorithmcanoutperformmostoftheasymptoticallyoptimalmeshalgorithmsforpracticalmachinesizes.
Intheconventionalmeshcomputationmodel,processorsareallowedtoexchangeoneunitofdatawithaneighborineachstep.Thismodelneedstoberefinedsincemodernmeshcomputers,suchastheiWarpsystem,havehardwaresupportforfastnon-neighborcommunication.
Thebitonicmergesort,asimplehypercubealgorithm,containsafairamountoffinegrainparallelismnotfoundinstandardmeshalgorithms.Thisformofparallelismincludespipelinedcommuni-cation,computationoverlappedwithcommunication,useofwideinstructionwordsandoperandsdirectlyreadfromthecommunica-tionsystemthroughsystolicgates.
Themeasuredsortingrateofmorethan2106keys/seconaniWarptoruswithjustprocessorsshowstheexcellentabsoluteperformanceofourapproach.Theperformanceresultscompare
1.2Previousworkonsortinginhypercubes
Parallelsortingofelementsonbinaryhypercubesofprocessorshasbeenwidelystudied.Batcherintroducedthebitonicmethodto
sort
elementsinlog2time[Bat68].Thebitonicsortingnetworkisdiscussedinthealgorithmscollectionof[Knu73]andseveraltextbooks.
Inourimplementationwemadeuseofthebitonicsortingnetworkbutdidnotusethebitonicmethodtodothelocalsortwithinaprocessor,becausebitonicsortingrequiresatotaloflog2passeswhichresultsinexcessivelocalbitonicmerges.Aspointedoutinmanyearlierpublications,thelogstagenetworksproposedby[AKS83],andevenmorerecentimprovedversions,donotprovidefeasiblesortingsolutionsforpracticalma-chinesizes[Pat90].
Sortersinlineartimearewidelyknown,forsortingdatalocallywithinoneprocessorofaparallelmachine.Aradixsorter,basedoncountingsort,iswelldescribedinmanyalgorithmstextbookse.g.[CLR90].Angoodlocalsorterbasedoncomparisonsisquicksort[Hoa61]withlogaveragecasecomplexity.
Parallelversionsofradixsortbyrankcountingwereusedsuc-cessfullyontheConnectionMachine[BLM91]andtheCrayYMP[ZB91]tobreaksortingrecords.Probabilisticsortersbasedonsam-pleddistributionsandpermutationroutingperformverywellonlargemachinesforbigproblemsizes[RR][BLM91].Onmeshcomputers,messageroutingofrandompermutationsisofacom-plexitysimilartosorting[Kun91a].Reducingsortingtoroutingisnotsolvingtheproblemonthemeshunlesssignificantlyfasterrouterhardwareisprovided.
1.3Previousworkonsortinginmeshes
Alargenumberofpapershavebeenpublishedonsortingusingthemeshmodel.Wecompareoursortingalgorithmusingthemodelofemulatedhypercubestoseveralsortingalgorithmsbasedontheconventionalmeshmodel.
ThompsonandKungshowthatthecomplexityofsortingonanmeshisandgivea6algorithm[TK77].Unfortunatelyforpracticalmachinesizes,hidessignificantlow
orderterms.Furtherwehavelookedintotheoptimal3
3
4log
.Thebound,derivedfromthebisection
bandwidth,isnotverytightandsuggestsforexamplethatabinary
4-cubecouldbeembeddedina16cellringwithcongestion
1.Theembeddingofbinary-cubesontoan-toruscanbedoneeitherinonestep,directlyreducingthedimensionfromto,orinseveralsteps,bymappingstep-by-stepgroupsofhypercubedimensionsontoonetorusdimension.
Definition3.1Amappingofabinary-cubeintoan-dimensionaltorusisseparableif1...:andthereexistsasetofmappings1...suchthatisamapping,andmapsa-cubetoa2ring.
3.1Asystematicseparableembedding
Earlierworkonmesh–hypercubeembeddingsdealsmostlywiththereverseproblemofembeddingmeshesintohypercubes[CTH][LSBD88].Suchanembeddingismucheasiertofindsincethemeshislessconnectedthanthehypercube.Graycodemappings
weresuggestedforsimplecasesandevenimprovedforirregularsizesby[LSBD88]andothers.
WefoundGraycodemappingstobeagoodstartingpointtosolveourproblemofmappinghypercubesontomeshes.AGraycodeisasimple,systematicwaytogenerateasequenceofnumberswithaHammingdistanceofone.OneoftheGraycodesisthebinaryreflectedGraycode.Itisrecursivelydefined:
1
0
2
isanoperatordesignatesaddingtheaconstantsequencevaluewrittentoeveryinreversenumber.andwhere
16listsas:
0132675412131514101198
Definition3.2AGraycoderingmappingisobtainedbymappingthelogicalprocessorofthe-cubewithcellnumbertothenodeintheringwith.
Definition3.3AGraycodetorusmappingisaseparablemappingwhenalloftheprimitivemappingsareGraycoderingmappings.
(1,1,0)(1,1,1)4675457623(0,1,1)23011(0,0,0)(0,0,1)0Figure1:TheGraycoderingmappingofabinary3-cubeontoaringoflength8
Themappingweusedforthe-processoriWarptorusiscom-posedfromtwoGraycoderingmappings,whereeachofthemmapsa3-cubeontoa8-cellringeach.DuringtheanalysisandevaluationoftheGraycodemappingwecanshowwhichpropertiesareim-portantforagoodmappinginouremulatedhypercubemodelanditsimplementationoniWarp.
Aworkloadofoneisveryimportantunlesstheprocessorsoftheparallelsystemcanbeusedinmultitaskingmodeveryefficiently.Thedilationoftheembeddingisofsecondaryimportanceforourmodelsincethearchitectureprovidesnon-neighborcommunicationatalowlatencyperhop.FortheGraycodemappingthedilationis21,whichisobviousfromtherecursivestructureoftheGraycode.Themostimportantfactorforourmappedhypercube
modelistheedgecongestion.Asweshowlaterintheanalysisofthesorter,theedgecongestionisanimportantdeterminantoftherunningtimeofourcommunicationsteps.Ingeneralwecannotavoidcongestion,butwithmodernmachineslikeiWarpwe
canprovidearchitecturalsupporttohandleacertainamountofcongestionwithoutdegradationofperformance.
Thealgorithmsconsideredinthispaperusethehypercubedi-mensionsoneatatime.Wethereforecomputethecongestionforeveryhypercubedimensionseparately.
Theorem3.1LetbetheGraycodemappingofalllogical
connectionsalongthe-thhypercubedimension.Let
bethemaximaledgecongestionofonedimensionofthemappedhypercubei.e.theedgecongestionof.Thecongestioncanbecomputedas:
recursion
closedform
42111
2122
2
2
1
2
2
Proof3.1ByinductionontherecursivestructureoftheGraycode.WheneverthetwosubsequencesofaGraycodedringoflength2arecombinedintonewGraycodedringoflength21,anumberof22connectionsmustbereroutedand2newconnectionsareintroduced,halfofthemroutedthroughthewraparoundofthering.
Asimilarrecursioncanbeestablishedandsolvedforthetotalcongestion.Edgecongestiondoesnotcapturealloftheresourceconstraintsofourarchitecturalmodel.Logicalconnectionsconsumeresourcesateveryswitchingnode(whichdistinguishesthelogicalchannelsofiWarpfromthegeneralconceptofvirtualchannels).Thevertexcongestionofanembeddingcapturestheseresourceconstraints.ForexampleinaniWarpsystemthenumberofchannelsduetoout-goingconnections,thenumberofchannelsduetovertexcongestion,andthenumberofchannelspermanentlyallocatedtoservicenet-worksmustbesmallerthanthetotalnumberofchannelsavailableintheVLSIcomponent.3.2Non-separablemappingTheproblemofanoptimalwaytoembedabinary6-cubeontoan88torushasbeenstudiedearlier.BesidestheGraycodedmappingdescribedinthispaper,non-separablemappingswithbettercongestionweregivenby[Sch92]and[She91].ThefollowingtableliststhecongestionanddilationfortheGraycodemappingsusedinouralgorithmversusaslightlyimproved,nonseparablemappinggivenby[Sch92].
DilationHDm123
456
1122221
1
1
1
1
2
rayc
oldedg
copiesofthesetoriareinterleavedtoobtainan88meshinthefollowingway:thetorusiscopiedandshifteddiagonallybyoneinthegrid.Theconnectionsbetweenthesetwotoricanberoutedwithacongestionofoneandadilationofuptofour.Thisresultsinanembeddingfor32nodes.Theresultinggridgraphisagaincopied,transposed(mirroredonthegriddiagonal)andtranslatedtoformaproperlyinterleavednodegraph.Theconnectionstolinkthenodesofthegraphwithitstransposedcopyareverysimilartoagridtransposenetwork.Theconnectionsofthisnetworkhavealengthofupto11(dilation)andacongestionofuptotwo.ThemeshwiththemappedprocessorindicesisshowninFigure3.2.
03214035624834456075268339411157104937124513611553142435254327592651392847296331553016341742195818503820462162235422Figure2:Thenon-separable,mappingbasedontwofoldedgrids(26).Thedrawingincludestheroutingforlogicallinksofdimension0.
Weleaveasanopenquestiontowhatextentthecongestioncanbereducedforsmallmeshesandhowtofindsuchembeddings.
4Acommunicationarchitectureformappedhy-percubes
ThearchitectureoftheiWarpsystemhasbeendescribedindetailin[BCC88]and[BCC90].SinceourmethodofemulationandourmodelfortheanalysisarestronglymotivatedbyarchitecturalfeaturesoftheiWarpcomponent,wepresentbrieflythearchitecturalelementsrelevanttomappedhypercubesandtothesorter.
4.1Highspeedprocessorinterconnects
Allprocessorinterconnectsinthe2dimensionaltorusoftheiWarpmachinearerealizedasparallelbuseswitheightdatalines.Everyprocessorhasfourincomingandfouroutgoingports,connectedtoitsmeshneighbors.Wecallthehardwareofsuchaninterconnectaphysicallink.Themaximumaggregatespeedofaphysicallinkis40MBytes/secineachdirection.Aprocessorcanthereforecommunicateoverallofitseightlinkstogetheratratesupto320MBytes/sec.
4.2LogicalchannelsandiWarppathways
TheconceptmostrelevanttoouremulatedhypercubemodelistheiWarppathway[BCC88][Gro].Pathwaysallowpermanentlogicalconnectionsbetweencellsthatarenotphysicalneighborsinthemesh.Logicalchannelsareusedtosharephysicallinksandtobuildpathways.Everylogicalchannelhasitsownbuffer,andthereforeapathwayincludesitsownbuffersateveryintermediatenode.Thebandwidthofasharedphysicallinkisallocatedbyaroundrobinscheduleramongthelogicalchannels.Routing,flowcontrolandexceptionsarealsohandledseparatelyforeachlogicalchannel.
Thehardwareunitthatperformsthisswitchingandschedulingiscalledthecommunicationagent.Theagenthandlesthelinkstothefourneighborsandfurtherinterfacestothecomputationagent.Athirdunit,thememoryagentprovidesdirectaccesstoandfromthecommunicationchannels.Thecommunicationagentisimple-mentedwithafixednumberofbufferresources.Thecurrentimple-mentationofiWarpallows20logicalconnectionsatonetime.Thisaddsaninterestingconstrainttothenetworksthatcanbeembeddedandsetupstaticallywithlonglivedconnection.However,networksarealsodynamicallyreconfigurablewithareasonableoverhead.
4.3
Asynchronousmemorycommunications
(iWarpspools)
Manyalgorithmsuseblocktransferstomovedatafrompathwaystomemoryorviceversa.Tooverlapcommunicationandcomputationthismustbedonewithoutaffectingthecomputationintheprocessor.IniWarp,theinterfacebetweencommunicationandmemoryisdonebyeightindependentagents,calledspools[BCC90].SimilartoaDMAcontroller,aniWarpspoolcopiesdatafrommemorytoapathwayorfromapathwaytomemory.Thedatatransfershappeninthebackgroundasynchronouslywithlittleinfluenceonacomputationgoingonatthesametime.
4.4Processingpowerandmemorybandwidth
TheiWarpinstructionsetincludesalonginstructionwordopera-tionthatcanexecuteabranch,aload,astore,anintegerortwoaddresscomputationsandtwobitIEEEfloatingpointoperationsinaslittleas4cyclesat50ns/cycle.Atpeakrateoneproces-sorcanexecute10millionoftheseinstructionspersecond.Themaximalbandwidthofload-andstore-operationstomemoryisapproximately80MBytes/sec.
4.5
Processingdatadirectlyfromcommunicationchannels(systolicgates)
Asimplethroughput(piping)analysisofinstructionrates,operandsandmemorybandwidthshowsthatthefullprocessingpoweroftheiWarpcommunicationcannotbeusedunlessstreamsofincomingdataareprocessedbythecomputationagentwithoutgoingfirsttomemory.Therefore,someoperandsofthecomputationcanbebroughtinfromthecommunicationunitthroughsystolicgates.Gatesareusedintheinstructionslikearegister[BCC90].
5AfastparallelsorterforiWarp
Inthischapterwedefinetheproblemofparallelsortingmorepre-ciselyandgiveadetaileddescriptionofthealgorithmthatresulted
inthefastestsorterforiWarpsystems.Weaddressthefollowingsortingproblem:
Parallelmachine:processors,organizedinan
meshwith
wrap-aroundlinks.
Inputtobesorted:
keysin-bitIEEEfloatingpointnumberrepresentation.Thekeysareevenlypartitionedamongtheprocessors,i.e.keysperprocessor.
Output:
keys,sortedinascendingorder.Morepreciselythekeysareevenlypartitionedamongprocessorsandeach
processorholdsasortedsub-sequenceoflength
.Thesortedsequencesarepositionedintheprocessorsinsortedorder,i.e.thedonotoverlapandtheirrangesareorderedaccordingtotheGraycodedprocessorindices.
Workspace:Aworkspaceofthesizeoftheinputdataplusaspace
ofconstantsizeforcountersaregranted.Thesortingalgorithmweusedtoevaluateouremulatedhyper-cubemodelformeshesisbasedonthefollowingtwosteps:
Thekeysaresortedbyaradixsortlocallywithinallproces-sors,resultinginmonotonicsequences.Algorithm5.1.
Theglobalsortingisdonewithsequencemergesonthebitonic
sortinggraph.Thisrequireslog2stagesofmerges.Al-gorithm5.2.
Theideaofsortingmultipleelementsperprocessorbyalocalsortfollowedbyaseriesofsequencemergeswasdiscussedin[BS78]withinthecontextofparallelmachines.
5.1Thelocalsort
Thelocal(intra-processor)sortisaradixsortexactlyasdescribedine.g.[CLR90].Inapre-andapost-processorphasethesignbitsaretranslatedbetweentheIEEEfloatingpointrepresentationandaplainbitintegerrepresentation.
Inourdescriptionofthealgorithms,weusethefollowingvariablenamesforthearraystoholdthedata,theworkspaceandthecountersoftheradix-countingsort.data:Theinput/outputarrayof-bitkeys.
tmp:Anarrayof
storagelocationsforintermediateresults.
cnt:Anarrayoffixedsizeforthecountersoftheradixsort.Inthe
currentimplementationthereare210counters,whichallowaradixsortofbitkeysin6passes.Algorithm5.1LocalRadixSort
mask(data,bit
field.
forallprocessorsdo:
bit
field[])]++;
for
2to
field[])
1
5.2Thesequencemerge
Theglobal(inter-processor)sortisperformedwithaseriesofse-quencemergesteps.Asequencemergeisanextensionofthecompareandswapconcepttomonotonicsequencesofelements.Amergecanbeusedinplaceofacompareandswapstepinanysort-ingnetwork.Forthemergestep,apairofprocessors(wecallthemneighbors)mustworktogethertomergethemonotonicsequencesstoredinthoseprocessors.Afteramerge,oneprocessorholdsthelowerhalfofthemergedsequenceandtheotherprocessorholdstheupperhalf.
Thespoolingagentandthecomputationagentarebothinvolvedinthemerge.Theyareprocessingtwostreamsofdataasyn-chronously.Thespoolingagentsendspartofthesequenceoverapathwaytoitsneighborwhilethecomputationagentreceivessomeelementsfromitsneighbor,comparesthemtotheelementsofthelocalsequenceandstorestheresultsasamergedsequencetomemory.Figure3explainsthisprocessinmoredetail.
memorycomputationcomputationmemoryagentagentagentagent(spool)
(fpu)
(fpu)
(spool)
communication0 3 4 7 9agents1 2 5 8 9<<0 1 2 3 45 7 8 9 9memory
CPU
pathwayCPUmemory
Figure3:Finegrainparallelismisusedtooverlapcommunicationandcomputationinthesequencemergestep.
Thealgorithmfortheinter-processorsortingstepcanbestatedasfollows:
Algorithm5.2SequenceMergesend(,)
sendsto.recv(neighbor[i])returnsthereceivedfrom.doinparallel
Inparallelwithineachprocessor.(Finegrainparallelism)
select(...
...)
Selectproperchoiceaccording
tocorrespondingstagesinbitonicgraph.
forallprocessorsdo:for1tologdo
for
1tododoinparallelfor1to
do(memoryagent)
send
j=1;rcv=recv(computationagent)
forselect(1to
to1)
ifselect()
then;
1;
else
;
recv
discardrestof
Thebitonicnetworkisstructuredin
2
for8processors.The-selectorsarecomputedbyaseriesofxoroperationsonselectedbitsoftheprocessornumber,thestageandthegroupnumber.Thedirectionofthemergestepisdeterminedbythispredicate.Theselectorforthedirectionoftheindexinstoreloopisdeterminedbythemergedirectionofthenextmergestep.Thisassuresthatthekeysareintherightorderfortransmissionthroughspoolinginthenextstage.
Figure4:Thebitonicmergegraphaccordingto[Bat68]and[Knuth73].
ThesystolicimplementationofthesequencemergestepinAl-gorithm5.2iscrucialtothesuccessofouralgorithmdesignforiWarpparallelsystems.Thealgorithmforthesequencemergewascarefullychosenandhasmaximalfinegrainparallelism.
Ourimplementationdoesmergeswithin0.5sperelement,whichisonlythreetimesslowerthanasimplecompareoperationoftwofloatingpointnumbersbetweenregisters.Theexecutiontimeincludestheloadfrommemory,andthestoretomemoryaswellasthesendsandreceivesoverthenetworkandthebranchesforcon-ditionalsandloopprocessing.ThemergeispipelinedandproperlyscheduledforthelonginstructionwordoperationsiniWarp.
Interestingly,mergingtwosequenceswithapairofprocessorsismorethantwotimesfasterthanmergingtwosequencesonaniWarpuni-processor.Designedforsystolicalgorithms,theproces-sorsprovidedirectaccesstothecommunicationchannelsthroughsystolicgatesintheregisterfile.Intheparallelversionofthemergetheoperandsofthecomparisonscanbereceivedandprocessedatthesametime.Thissavesonethirdofthememoryaccessesandremovesabottleneckinthesingleprocessorimplementation.
Giventhearchitecturalmodelofprivatememorymeshcomputerswithhighspeedprocessorinterconnects,wefoundthatthebitonicsortingnetworkexecutedonanemulatedhypercubeisthebestmethodpossibleforparallelsortingonthesekindofmeshmachinesforallpracticalmachinesizes.Wewillsupportthisstatementbyacomparativeanalysisofhypercube-andmesh-sortingalgorithmsinthenextchapter.
6Analysisofthemappedbitonicsorter
Wepresenttheanalysisofouralgorithmforthecaseofan-dimensionaltorus,assumingthatall-dimensionshavethesamesizeandarepowersoftwo.Sinceweuseseparablemappings,wecangeneralizetheanalysistodifferentarraysizesbyusingdifferentmappingsfortorusdimensionsofdifferentsize.
thethenumberdimensionofprocessorsofthemeshin-the(hypercubeorthemesh.2foriWarptori.)
bitsfortheelementsand2
log
2
2
stages.
Thebitonicsortingnetworkdoesallcommunicationalonghy-percubeconnections.Thenetworkcanthereforebeembeddedontoaphysicalhypercubenetworkwithload1andcongestion1.Fur-thermoreeverystageofthebitonicsortingnetworkusesthelinksofexactlyonehypercubedimension.Thenumberofparallelmergestepsrequiredforabitonicsortonahypercube-connectedparal-lelcomputerswithprocessorequalsthenumberofstagesinthebitonicsortingnetworkwithinputs.
Theperformanceofthemappedalgorithmisaffectedbythechar-acteristicsofthemapping,whenexecutingthestagesofabitonicnetworkonaphysicalmesh.Thecomputationcostsdonotdifferfromthelogicalhypercube,sinceallmappingsconsideredhave
load
1.Themergestephandleselementsandispipelined.Theeffectofthedilationonthecommunicationlatencycanbeneglectedaslong
intothecalculation.Thisratiolimitstheamountofcongestionthatcanbehandledwithoutslow-down.
BasedonthearchitecturalspecificationofiWarpwepredictthefollowingrunningtimesforthecomputationandcommunicationoperations:Computationtakes14clocksperelement,whilethecommunicationonlytakes4clocksperelement.Theratioisthere-fore2
7
thisisarateof1
tend2isclosetotheactualmeasurementsincethehardwareschedulerstoallocateevenfractionsofbandwidth.
Whenmultiplemergestepsshareaphysicallink,theslowdowndependsonthenumberofmergesinexcessof1
inallmeshdimensionsandthatthe
hypercubedimensionscanbepartitionedevenlyintomeshdimen-sions.Notethatforatorusthewrap-aroundlinksreducethecongestionforthehighestdimensions.
max12
max12
Tocomputethenumberofeffectivemergestagesmergewecarryouttheweightedsumoverallthestagesofthebitonicsortingnetwork:
merge
1
1
Forthesymbolicsummationthesumisrewrittenas:
(1log)
merge
1
(1
)
2
(1)
log)1
2(12
(1
)
1
4
Therunningtimeoftheglobalsort,mergeisthenumberofeffectivemergestagestimesthetimecomplexityofasinglemergestep(Corollary6.1).
merge
merge
2
2
8
((1
log)
4)
1
2
2
1
2
8log
6
inthenumber
ofprocessors.
Corollary6.3Fortwodimensionaltori2,
,
2logand2
4
5log(42
4)log
2loglog7
Forothervaluesof,thetimecomplexitiesare(14:
1563
4log5and1:62004log7.Avalueof1correspondstoperfectlybalancedarchitectureswithacommunication–computationratioof1.
Theasymptoticresultoftheanalysisisslightlyworsethanthe
3
lowerboundfor1-1sortinggivenin[SS88],butinourmappedalgorithmwehavesmallloworderterms.ThisisthereasonwhythismethodresultedinthefastestsorteronourcelliWarpmachines.
CurrentiWarpsystems,whichrunat20106cyclespersecond,require14cyclestoprocessasinglebitkeyinamergestepandatotalof230cyclestoprocessabitkeyduringthelocalradixsort.Togetherwiththethetimecomplexitiesweareabletoestimatetheexecutiontimesofthewholesorter.TheestimatematchthemeasuredexecutiontimeslistedinChapter8verywell.
Comparisons (merge steps) vs. processors for ratio 1/21000000Rev-Sort1-1 Optimal SortN100000NMap. bitoniccong.NMap. bitonic w/oN\"cong.N10000NNN1000NNN\"\"\"N\"\"\"\"\"100N\"\"\"N\"\"N\"10\"N6kkkkk16514652MMMMMGG2166146781421662Figure5:Numberofeffectivemergestepsvs.numberofprocessors
forRevSort,optimalsortandmappedbitonicsort.Assumedis
communication–computationratio
1
2
loglogsteps.The
algorithmissimpleandthereforeagoodcandidateforthefastest
sorteronpracticalmachinesizes,sinceloglog
5inpractice.AsseeninFigure5,RevSortitisbetterthantheoptimalmeshsorterforsmallcasesbutcannotoutperformthemappedbitonicsorter.Forlargetheoptimal1-1sorterisbest.
Thekeytothehighefficiencyofthesorterliesinthegoodbehaviorofsmallercasesandinthefactthatall\"practical\"machines
Comparisons (merge steps) vs. processors for ratio 1/11000000LRev-SortL1-1 Optimal SortLNN100000NMap. bitonicLcong.NLMap. bitonic w/oNL\"cong.LN10000LNLNNL1000NLNNLLN\"\"\"\"\"\"\"100\"\"LNL\"\"\"N\"\"10L\"N6kkkkk16514652MMMMMGG2166146781421662Figure6:Numberofeffectivemergestepsvs.numberofprocessorsforRevSort,optimalsortandmappedbitonicsort.Assumedisperfectlybalancedcommunication–computationratio1.
sizesarewellwithinthatrangeoptimality.AtthistimeiWarpiscommerciallyavailableuptosizesof1024cells.LargeriWarptoriarehypothetical.
7.3Usingthewraparoundlinksinameshsorter
Averyinterestingaspectofmeshsortingisthattherowandcolumnsortingstepscannotmakeuseofthetorusunlesstheyuseextrael-ementsofbuffer-space.Aresultwithsixextrabuffersismentionedin[Kun91a]andattributedto[MS].
7.4Theh-hmeshsortingmodels
Somerecentliteraturepointsouttheimportanceofthesort-ingmodels[Kun91b],[Kun91a]andgivesa25algorithmforaslightlymodified1-1sortingmodel.Wefoundmostofthemethodsimpracticalbecauseoftheirlargeloworderterms,butmuchmoreworkisneededtocometoaconclusionaboutallmethodsformeshsortingproposedintheliterature.
7.5
Comparisonbetweenmeshandmappedhyper-cubesorters
TheFigures5and6showthatthecomplexitiesforthemappedalgorithmarebetterthanforthetwodirectmeshsorterssortersforallpracticalmachinesizes.
Tobreakevenwithourmappedsorterintheweightednumberofmergesteps,theoptimalalgorithmneedstorunonatwodimen-sionaltorusofsize65536,with4109processors;seeFigure5.
Evenundertheassumptionofan11ratiobetweencommunica-tionandcomputation,amachinethatis34timeslessefficientincommunicationthaniWarp,themappedbitonicmergealgorithm
canoutperformtheoptimalmeshsort,unlessmorethanor16103processorsareused.
128
8Comparisontootherparallelsorterprograms
Toevaluatethespeedofthemappedbitonicmergesorter,wegiveanoverviewofsomepublishedexecutiontimesforsortingonemillion-bitkeysondifferentmachines:
A1024SprintnodeConnectionMachine(CM-2)sortsonemillionelementsin660msatarateof15106keys/sec.usingsamplesort[BLM91].TheConnectionMachineisbasedonphysicalhypercubeinterconnectionsbetweentheSprintnodes.1
AprocessoriWarpsortsonemillionelementsin442msandatarateof23106keys/sec.usingthemappedbitonicmergesorterdiscussedinthispaper.Themachineisbasedonatwodimensionaltorus.
A4096processorMasParMP-1sortssortonemillionelements
usingbitonicsortin1446msatarateof072106
2
onxnet.Thismachineisbasedontwodimensionalgrid,adiagonalgrid(xnet)andageneralpurposerouter[Pri91].
AsingleprocessorCrayYMP3sortsonemillionelementin330msatarateof30106keys/secusingaparallelversionofradixsort.An8processorCrayYMPcanachievea5-6foldspeedupresultinginarateof1516106keys/sec.[ZB91].
ThemeasuredexecutiontimesandthesortingratesfordifferentsizesortingtasksareshowninFigures7and8.ThemeasurementsontheCM-2showtheperformanceforparallelsortingonarealhypercube,whiletherunningtimesoniWarpshowtheperformanceofsortingonalogicalhypercubemappedtoaphysicaltorus.
Theimplementationofsamplingallowsfastsortersonlargeprob-lemsontheCM-2sincetheoverheadoftheroutercanbecompen-sated.IntheFigures8and7itcanbenotedthatiWarpdelivershighsortingperformancealsoforsmallproblemsizes.Thisisdueagoodmatchofthealgorithmandthebalancedarchitecturewithhighspeedlowlatencycommunication.
9Conclusion
Wehaveestablishedamethodandamodeltomaphypercubecom-putationsefficientlytolowerdimensionaltori.Themethodisbasedonpast-neighborcommunicationoverfastprocessorinterconnec-tionswithlogicalchannels.Weexpectthesefeaturestobeincludedinmanymeshmachinesofthefuture.Themappedhypercubemodelprovidestheprogrammerwiththeviewofthemachineasahypercubeandletshimworkwiththesimplerhypercubealgorithms.ThehypercubenetworkisembeddedontothetorususingasimpleseparableGraycodemapping.TheGraycodemappingisgeneralandresultsacceptablecongestion,alsoslightlybettermappingswerefoundforspecialcases.Andetailedanalysisofthetimecomplexityandanimplementationwasdoneforamappedbitonicmergesorterinourmappedhypercubemodel.Thisallowedusto
helpfuldiscussionwithourtheorycommunityatCMU,particularlywithAnjaFeldmann,GaryMillerandEricSchwabe.
References
[AKS83]
M.Ajtai,J.Komlos,andE.Szemeredi.Anlogsortingnetwork.InProceedingsofthe15thAnnualACMSymposiumonTheoryofComputing,pages1–9,April1983.
[Bat68]
K.Batcher.Sortingnetworksandtheirapplications.InProceedingsoftheAFIPSSpringJointComputingConference,volume32,pages307–314,1968.
[BCC88]S.Borkar,R.Cohn,G.Cox,S.Gleason,T.Gross,
H.T.Kung,M.Lam,B.Moore,C.Peterson,J.Pieper,L.Rankin,P.S.Tseng,J.Sutton,J.Urbanski,andJ.Webb.iWarp:Anintegratedsolutiontohigh-speedparallelcomputing.InProceedingsofSupercomputing’88,pages330–339,Orlando,FL,USA,November1988.IEEEComputerSocietyandACMSIGARCH.
[BCC90]S.Borkar,R.Cohn,G.Cox,T.Gross,H.T.Kung,
M.Lam,B.Moore,W.Moore,C.Peterson,J.Susman,J.Sutton,J.Urbanski,andJ.Webb.SupportingsystolicandmemorycommunicationiniWarp.InProceedingsof17thIntl’SymposiumonComputerArchitecture,May1990.
[Bla90]
T.Blank.TheMasParMP-1architecture.In35thIEEEComputerSocietyInternationalConference,pages20–40,Spring1990.
[BLM91]G.E.Blelloch,C.E.Leiserson,B.M.Maggs,C.G.
PlanxtonandS.J.Smith,andM.Zagha.AcomparisonofsortingalgorithmsfortheconnectionmachineCM-2.In1991ACMSymposiumonParallelAlgorithmsandArchitectures,June1991.
[BS78]
G.BaudetandD.Stevenson.Optimalsortingalgo-rithmsforparallelcomputers.IEEETransactionsonComputers,C–27:84–87,1978.
[CLR90]
T.H.Cormen,C.E.Leiserson,andR.L.Rivest.Intro-ductiontoAlgorithms.TheMITPressandMcGraw-Hill,1990.
[CTH]
S.L.JohnsonC.T.Ho.Embeddingmeshesinbooleancubesbygraphdecomposition.Computersciencetech-nicalreport,YaleUniversity,NewHaven,CT06520,19.
[Gro]
ThomasGross.CommunicationiniWarpsystems.IninProceedingsofSupercomputing’,pages436–445,Reno,NV,USA,November19.IEEEComputerSo-cietyandACMSIGARCH.
[Hoa61]C.A.R.Hoare.Quicksort.ComputingJournal,5,1961.
[KH88]
B.KahleandD.Hillis.Theconnectionmachinemodelcm-1architecture.IEEESystems,Man,andCybernet-icsSpecialIssue,March1988.
[Knu73]
D.E.Knuth.SortingandSearching.TheArtofCom-puterProgramming.Vol3.Wesley,ReadingMas-sachusetts,1973.
[Kun91a]
M.Kunde.Routingandsortingongrids.Habilitation-sschrift,FakultaetfuerMathematikundInformatikderTUMuenchen,1991.
[Kun91b]
M.Kunde.Sortingonmeshes.InProceedingsofthe32stAnnualSymposiumonFoundationsofComputerScience,October1991.
[LSBD88]S.LakshmivarahanL.S.BaraschandS.K.Dhall.Gen-eralizedgraycodesandtheirproperties.InThirdInter-nationalConferenceonSupercomputing,May1988.[MS]
Y.MansourandL.Schulman.Sortingonaringofpro-cessors.Technicalreport,LaboratoryofComputerSci-ence,MassachusettsInst.ofTechnology,CambridgeMA,19.
[Pat90]M.S.Paterson.Imporovedsortingnetworkswithlogdepth.Algorithmica,5:75–92,1990.
[Pri91]
J.Prins.EfficientbitonicsortingoflargearraysontheMasParMP-1.TechnicalReportTR91-041,UniversityofNorthCarolina,ChapelHillNC,1991.
[RR]
S.RajasekaranandJ.H.Reif.Optimalandsublogarith-mictimerandomizedparallelsortingalgorithms.SIAMJournalonComputing,18(3):594–607,June19.[Sch92]EricSchwabe.PersonalCommunication,January
1992.
[She91]JonathanShewchuk.PersonalCommunication,August1991.
[SS88]
C.P.SchnorrandA.Shamir.Anoptimalsortingalgo-rithmformesh-connectedcomputers.InProceedingsofthe20thAnnualACMSymposiumonTheoryofCom-puting,pages255–263,1988.
[TK77]
C.D.ThompsonandH.T.Kung.Sortingonameshconnectedparallelcomputer.CommunicationsoftheACM,20:263–271,1977.
[ZB91]
MarcoZaghaandGuyE.Blelloch.Radixsortforvec-tormultiprocessors.InProceedingsSupercomputing’91,pages712–721,November1991.
因篇幅问题不能全部显示,请点此查看更多更全内容
Copyright © 2019- 7swz.com 版权所有 赣ICP备2024042798号-8
违法及侵权请联系:TEL:199 18 7713 E-MAIL:2724546146@qq.com
本站由北京市万商天勤律师事务所王兴未律师提供法律服务