您的当前位置：首页 (A

(A

来源：微智科技网

Reprint from the proceedings of the ACM Symposium on Parallel Algorithms and Architectures, SPAA92June 29 - July 1, 1992, San Diego, California.

Supportingthehypercubeprogrammingmodelonmesharchitectures

(AfastsorterforiWarptori)

ThomasM.StrickerCarnegieMellonUniversityPittsburgh,PA15213-30

Abstract

Manycombinatorialproblemshavesimplesolutionsforparallelprocessingonhighly-connectednetworkssuchasthebutterﬂyorthehypercube,whereasthefastestprocessor-to-processorintercon-nectionsarerealizedinparallelmachineswithlowdimensionalmeshortorustopology.Thispaperpresentsamethodformap-pingbinaryhypercube-algorithmsontolowerdimensionalmeshesandanalyzesthismethodinamodelderivedfromthearchitectureofmodernmeshmachines.Weoutlinethecriteriausedtoevalu-ategraphembeddingsformappingsupercomputercommunicationnetworks.

Ourworkwasmotivatedbytheneedforfastlibraryroutinestodoparallelsorting,fastFouriertransformationandprocessorsyn-chronization.Duringthedesigneffortofthesebuildingblocks,wedevelopedandanalyzedanewtechniquetosupportahypercubenetworkembeddedontoatwodimensionaltorus.Adirectimple-mentationoftheembeddingismadepossiblebylogicalchannelsandpathways.Afastmergesorterbasedonthebitonicnetworkservesasanexampletoshowhowasimplehypercubealgorithmcanoutperformmostoftheasymptoticallyoptimalmeshalgorithmsforpracticalmachinesizes.

Intheconventionalmeshcomputationmodel,processorsareallowedtoexchangeoneunitofdatawithaneighborineachstep.Thismodelneedstobereﬁnedsincemodernmeshcomputers,suchastheiWarpsystem,havehardwaresupportforfastnon-neighborcommunication.

Thebitonicmergesort,asimplehypercubealgorithm,containsafairamountofﬁnegrainparallelismnotfoundinstandardmeshalgorithms.Thisformofparallelismincludespipelinedcommuni-cation,computationoverlappedwithcommunication,useofwideinstructionwordsandoperandsdirectlyreadfromthecommunica-tionsystemthroughsystolicgates.

Themeasuredsortingrateofmorethan2106keys/seconaniWarptoruswithjustprocessorsshowstheexcellentabsoluteperformanceofourapproach.Theperformanceresultscompare

1.2Previousworkonsortinginhypercubes

Parallelsortingofelementsonbinaryhypercubesofprocessorshasbeenwidelystudied.Batcherintroducedthebitonicmethodto

sort

elementsinlog2time[Bat68].Thebitonicsortingnetworkisdiscussedinthealgorithmscollectionof[Knu73]andseveraltextbooks.

Inourimplementationwemadeuseofthebitonicsortingnetworkbutdidnotusethebitonicmethodtodothelocalsortwithinaprocessor,becausebitonicsortingrequiresatotaloflog2passeswhichresultsinexcessivelocalbitonicmerges.Aspointedoutinmanyearlierpublications,thelogstagenetworksproposedby[AKS83],andevenmorerecentimprovedversions,donotprovidefeasiblesortingsolutionsforpracticalma-chinesizes[Pat90].

Sortersinlineartimearewidelyknown,forsortingdatalocallywithinoneprocessorofaparallelmachine.Aradixsorter,basedoncountingsort,iswelldescribedinmanyalgorithmstextbookse.g.[CLR90].Angoodlocalsorterbasedoncomparisonsisquicksort[Hoa61]withlogaveragecasecomplexity.

Parallelversionsofradixsortbyrankcountingwereusedsuc-cessfullyontheConnectionMachine[BLM91]andtheCrayYMP[ZB91]tobreaksortingrecords.Probabilisticsortersbasedonsam-pleddistributionsandpermutationroutingperformverywellonlargemachinesforbigproblemsizes[RR][BLM91].Onmeshcomputers,messageroutingofrandompermutationsisofacom-plexitysimilartosorting[Kun91a].Reducingsortingtoroutingisnotsolvingtheproblemonthemeshunlesssigniﬁcantlyfasterrouterhardwareisprovided.

1.3Previousworkonsortinginmeshes

Alargenumberofpapershavebeenpublishedonsortingusingthemeshmodel.Wecompareoursortingalgorithmusingthemodelofemulatedhypercubestoseveralsortingalgorithmsbasedontheconventionalmeshmodel.

ThompsonandKungshowthatthecomplexityofsortingonanmeshisandgivea6algorithm[TK77].Unfortunatelyforpracticalmachinesizes,hidessigniﬁcantlow

orderterms.Furtherwehavelookedintotheoptimal3

4log

.Thebound,derivedfromthebisection

bandwidth,isnotverytightandsuggestsforexamplethatabinary

4-cubecouldbeembeddedina16cellringwithcongestion

1.Theembeddingofbinary-cubesontoan-toruscanbedoneeitherinonestep,directlyreducingthedimensionfromto,orinseveralsteps,bymappingstep-by-stepgroupsofhypercubedimensionsontoonetorusdimension.

Deﬁnition3.1Amappingofabinary-cubeintoan-dimensionaltorusisseparableif1...:andthereexistsasetofmappings1...suchthatisamapping,andmapsa-cubetoa2ring.

3.1Asystematicseparableembedding

Earlierworkonmesh–hypercubeembeddingsdealsmostlywiththereverseproblemofembeddingmeshesintohypercubes[CTH][LSBD88].Suchanembeddingismucheasiertoﬁndsincethemeshislessconnectedthanthehypercube.Graycodemappings

weresuggestedforsimplecasesandevenimprovedforirregularsizesby[LSBD88]andothers.

WefoundGraycodemappingstobeagoodstartingpointtosolveourproblemofmappinghypercubesontomeshes.AGraycodeisasimple,systematicwaytogenerateasequenceofnumberswithaHammingdistanceofone.OneoftheGraycodesisthebinaryreﬂectedGraycode.Itisrecursivelydeﬁned:

isanoperatordesignatesaddingtheaconstantsequencevaluewrittentoeveryinreversenumber.andwhere

16listsas:

0132675412131514101198

Deﬁnition3.2AGraycoderingmappingisobtainedbymappingthelogicalprocessorofthe-cubewithcellnumbertothenodeintheringwith.

Deﬁnition3.3AGraycodetorusmappingisaseparablemappingwhenalloftheprimitivemappingsareGraycoderingmappings.

(1,1,0)(1,1,1)4675457623(0,1,1)23011(0,0,0)(0,0,1)0Figure1:TheGraycoderingmappingofabinary3-cubeontoaringoflength8

Themappingweusedforthe-processoriWarptorusiscom-posedfromtwoGraycoderingmappings,whereeachofthemmapsa3-cubeontoa8-cellringeach.DuringtheanalysisandevaluationoftheGraycodemappingwecanshowwhichpropertiesareim-portantforagoodmappinginouremulatedhypercubemodelanditsimplementationoniWarp.

Aworkloadofoneisveryimportantunlesstheprocessorsoftheparallelsystemcanbeusedinmultitaskingmodeveryefﬁciently.Thedilationoftheembeddingisofsecondaryimportanceforourmodelsincethearchitectureprovidesnon-neighborcommunicationatalowlatencyperhop.FortheGraycodemappingthedilationis21,whichisobviousfromtherecursivestructureoftheGraycode.Themostimportantfactorforourmappedhypercube

modelistheedgecongestion.Asweshowlaterintheanalysisofthesorter,theedgecongestionisanimportantdeterminantoftherunningtimeofourcommunicationsteps.Ingeneralwecannotavoidcongestion,butwithmodernmachineslikeiWarpwe

canprovidearchitecturalsupporttohandleacertainamountofcongestionwithoutdegradationofperformance.

Thealgorithmsconsideredinthispaperusethehypercubedi-mensionsoneatatime.Wethereforecomputethecongestionforeveryhypercubedimensionseparately.

Theorem3.1LetbetheGraycodemappingofalllogical

connectionsalongthe-thhypercubedimension.Let

bethemaximaledgecongestionofonedimensionofthemappedhypercubei.e.theedgecongestionof.Thecongestioncanbecomputedas:

recursion

closedform

42111

2122

Proof3.1ByinductionontherecursivestructureoftheGraycode.WheneverthetwosubsequencesofaGraycodedringoflength2arecombinedintonewGraycodedringoflength21,anumberof22connectionsmustbereroutedand2newconnectionsareintroduced,halfofthemroutedthroughthewraparoundofthering.

Asimilarrecursioncanbeestablishedandsolvedforthetotalcongestion.Edgecongestiondoesnotcapturealloftheresourceconstraintsofourarchitecturalmodel.Logicalconnectionsconsumeresourcesateveryswitchingnode(whichdistinguishesthelogicalchannelsofiWarpfromthegeneralconceptofvirtualchannels).Thevertexcongestionofanembeddingcapturestheseresourceconstraints.ForexampleinaniWarpsystemthenumberofchannelsduetoout-goingconnections,thenumberofchannelsduetovertexcongestion,andthenumberofchannelspermanentlyallocatedtoservicenet-worksmustbesmallerthanthetotalnumberofchannelsavailableintheVLSIcomponent.3.2Non-separablemappingTheproblemofanoptimalwaytoembedabinary6-cubeontoan88torushasbeenstudiedearlier.BesidestheGraycodedmappingdescribedinthispaper,non-separablemappingswithbettercongestionweregivenby[Sch92]and[She91].ThefollowingtableliststhecongestionanddilationfortheGraycodemappingsusedinouralgorithmversusaslightlyimproved,nonseparablemappinggivenby[Sch92].

DilationHDm123

456

1122221

rayc

oldedg

copiesofthesetoriareinterleavedtoobtainan88meshinthefollowingway:thetorusiscopiedandshifteddiagonallybyoneinthegrid.Theconnectionsbetweenthesetwotoricanberoutedwithacongestionofoneandadilationofuptofour.Thisresultsinanembeddingfor32nodes.Theresultinggridgraphisagaincopied,transposed(mirroredonthegriddiagonal)andtranslatedtoformaproperlyinterleavednodegraph.Theconnectionstolinkthenodesofthegraphwithitstransposedcopyareverysimilartoagridtransposenetwork.Theconnectionsofthisnetworkhavealengthofupto11(dilation)andacongestionofuptotwo.ThemeshwiththemappedprocessorindicesisshowninFigure3.2.

03214035624834456075268339411157104937124513611553142435254327592651392847296331553016341742195818503820462162235422Figure2:Thenon-separable,mappingbasedontwofoldedgrids(26).Thedrawingincludestheroutingforlogicallinksofdimension0.

Weleaveasanopenquestiontowhatextentthecongestioncanbereducedforsmallmeshesandhowtoﬁndsuchembeddings.

4Acommunicationarchitectureformappedhy-percubes

ThearchitectureoftheiWarpsystemhasbeendescribedindetailin[BCC88]and[BCC90].SinceourmethodofemulationandourmodelfortheanalysisarestronglymotivatedbyarchitecturalfeaturesoftheiWarpcomponent,wepresentbrieﬂythearchitecturalelementsrelevanttomappedhypercubesandtothesorter.

4.1Highspeedprocessorinterconnects

Allprocessorinterconnectsinthe2dimensionaltorusoftheiWarpmachinearerealizedasparallelbuseswitheightdatalines.Everyprocessorhasfourincomingandfouroutgoingports,connectedtoitsmeshneighbors.Wecallthehardwareofsuchaninterconnectaphysicallink.Themaximumaggregatespeedofaphysicallinkis40MBytes/secineachdirection.Aprocessorcanthereforecommunicateoverallofitseightlinkstogetheratratesupto320MBytes/sec.

4.2LogicalchannelsandiWarppathways

TheconceptmostrelevanttoouremulatedhypercubemodelistheiWarppathway[BCC88][Gro].Pathwaysallowpermanentlogicalconnectionsbetweencellsthatarenotphysicalneighborsinthemesh.Logicalchannelsareusedtosharephysicallinksandtobuildpathways.Everylogicalchannelhasitsownbuffer,andthereforeapathwayincludesitsownbuffersateveryintermediatenode.Thebandwidthofasharedphysicallinkisallocatedbyaroundrobinscheduleramongthelogicalchannels.Routing,ﬂowcontrolandexceptionsarealsohandledseparatelyforeachlogicalchannel.

Thehardwareunitthatperformsthisswitchingandschedulingiscalledthecommunicationagent.Theagenthandlesthelinkstothefourneighborsandfurtherinterfacestothecomputationagent.Athirdunit,thememoryagentprovidesdirectaccesstoandfromthecommunicationchannels.Thecommunicationagentisimple-mentedwithaﬁxednumberofbufferresources.Thecurrentimple-mentationofiWarpallows20logicalconnectionsatonetime.Thisaddsaninterestingconstrainttothenetworksthatcanbeembeddedandsetupstaticallywithlonglivedconnection.However,networksarealsodynamicallyreconﬁgurablewithareasonableoverhead.

4.3

Asynchronousmemorycommunications

(iWarpspools)

Manyalgorithmsuseblocktransferstomovedatafrompathwaystomemoryorviceversa.Tooverlapcommunicationandcomputationthismustbedonewithoutaffectingthecomputationintheprocessor.IniWarp,theinterfacebetweencommunicationandmemoryisdonebyeightindependentagents,calledspools[BCC90].SimilartoaDMAcontroller,aniWarpspoolcopiesdatafrommemorytoapathwayorfromapathwaytomemory.Thedatatransfershappeninthebackgroundasynchronouslywithlittleinﬂuenceonacomputationgoingonatthesametime.

4.4Processingpowerandmemorybandwidth

TheiWarpinstructionsetincludesalonginstructionwordopera-tionthatcanexecuteabranch,aload,astore,anintegerortwoaddresscomputationsandtwobitIEEEﬂoatingpointoperationsinaslittleas4cyclesat50ns/cycle.Atpeakrateoneproces-sorcanexecute10millionoftheseinstructionspersecond.Themaximalbandwidthofload-andstore-operationstomemoryisapproximately80MBytes/sec.

4.5

Processingdatadirectlyfromcommunicationchannels(systolicgates)

Asimplethroughput(piping)analysisofinstructionrates,operandsandmemorybandwidthshowsthatthefullprocessingpoweroftheiWarpcommunicationcannotbeusedunlessstreamsofincomingdataareprocessedbythecomputationagentwithoutgoingﬁrsttomemory.Therefore,someoperandsofthecomputationcanbebroughtinfromthecommunicationunitthroughsystolicgates.Gatesareusedintheinstructionslikearegister[BCC90].

5AfastparallelsorterforiWarp

Inthischapterwedeﬁnetheproblemofparallelsortingmorepre-ciselyandgiveadetaileddescriptionofthealgorithmthatresulted

inthefastestsorterforiWarpsystems.Weaddressthefollowingsortingproblem:

Parallelmachine:processors,organizedinan

meshwith

wrap-aroundlinks.

Inputtobesorted:

keysin-bitIEEEﬂoatingpointnumberrepresentation.Thekeysareevenlypartitionedamongtheprocessors,i.e.keysperprocessor.

Output:

keys,sortedinascendingorder.Morepreciselythekeysareevenlypartitionedamongprocessorsandeach

processorholdsasortedsub-sequenceoflength

.Thesortedsequencesarepositionedintheprocessorsinsortedorder,i.e.thedonotoverlapandtheirrangesareorderedaccordingtotheGraycodedprocessorindices.

Workspace:Aworkspaceofthesizeoftheinputdataplusaspace

ofconstantsizeforcountersaregranted.Thesortingalgorithmweusedtoevaluateouremulatedhyper-cubemodelformeshesisbasedonthefollowingtwosteps:

Thekeysaresortedbyaradixsortlocallywithinallproces-sors,resultinginmonotonicsequences.Algorithm5.1.

Theglobalsortingisdonewithsequencemergesonthebitonic

sortinggraph.Thisrequireslog2stagesofmerges.Al-gorithm5.2.

Theideaofsortingmultipleelementsperprocessorbyalocalsortfollowedbyaseriesofsequencemergeswasdiscussedin[BS78]withinthecontextofparallelmachines.

5.1Thelocalsort

Thelocal(intra-processor)sortisaradixsortexactlyasdescribedine.g.[CLR90].Inapre-andapost-processorphasethesignbitsaretranslatedbetweentheIEEEﬂoatingpointrepresentationandaplainbitintegerrepresentation.

Inourdescriptionofthealgorithms,weusethefollowingvariablenamesforthearraystoholdthedata,theworkspaceandthecountersoftheradix-countingsort.data:Theinput/outputarrayof-bitkeys.

tmp:Anarrayof

storagelocationsforintermediateresults.

cnt:Anarrayofﬁxedsizeforthecountersoftheradixsort.Inthe

currentimplementationthereare210counters,whichallowaradixsortofbitkeysin6passes.Algorithm5.1LocalRadixSort

mask(data,bit

ﬁeld.

forallprocessorsdo:

bit

ﬁeld[])]++;

for

2to

ﬁeld[])

5.2Thesequencemerge

Theglobal(inter-processor)sortisperformedwithaseriesofse-quencemergesteps.Asequencemergeisanextensionofthecompareandswapconcepttomonotonicsequencesofelements.Amergecanbeusedinplaceofacompareandswapstepinanysort-ingnetwork.Forthemergestep,apairofprocessors(wecallthemneighbors)mustworktogethertomergethemonotonicsequencesstoredinthoseprocessors.Afteramerge,oneprocessorholdsthelowerhalfofthemergedsequenceandtheotherprocessorholdstheupperhalf.

Thespoolingagentandthecomputationagentarebothinvolvedinthemerge.Theyareprocessingtwostreamsofdataasyn-chronously.Thespoolingagentsendspartofthesequenceoverapathwaytoitsneighborwhilethecomputationagentreceivessomeelementsfromitsneighbor,comparesthemtotheelementsofthelocalsequenceandstorestheresultsasamergedsequencetomemory.Figure3explainsthisprocessinmoredetail.

memorycomputationcomputationmemoryagentagentagentagent(spool)

(fpu)

(spool)

communication0 3 4 7 9agents1 2 5 8 9<<0 1 2 3 45 7 8 9 9memory

CPU

pathwayCPUmemory

Figure3:Finegrainparallelismisusedtooverlapcommunicationandcomputationinthesequencemergestep.

Thealgorithmfortheinter-processorsortingstepcanbestatedasfollows:

Algorithm5.2SequenceMergesend(,)

sendsto.recv(neighbor[i])returnsthereceivedfrom.doinparallel

Inparallelwithineachprocessor.(Finegrainparallelism)

select(...

...)

Selectproperchoiceaccording

tocorrespondingstagesinbitonicgraph.

forallprocessorsdo:for1tologdo

for

1tododoinparallelfor1to

do(memoryagent)

send

j=1;rcv=recv(computationagent)

forselect(1to

to1)

ifselect()

then;

else

;

recv

discardrestof

Thebitonicnetworkisstructuredin

for8processors.The-selectorsarecomputedbyaseriesofxoroperationsonselectedbitsoftheprocessornumber,thestageandthegroupnumber.Thedirectionofthemergestepisdeterminedbythispredicate.Theselectorforthedirectionoftheindexinstoreloopisdeterminedbythemergedirectionofthenextmergestep.Thisassuresthatthekeysareintherightorderfortransmissionthroughspoolinginthenextstage.

Figure4:Thebitonicmergegraphaccordingto[Bat68]and[Knuth73].

ThesystolicimplementationofthesequencemergestepinAl-gorithm5.2iscrucialtothesuccessofouralgorithmdesignforiWarpparallelsystems.Thealgorithmforthesequencemergewascarefullychosenandhasmaximalﬁnegrainparallelism.

Ourimplementationdoesmergeswithin0.5sperelement,whichisonlythreetimesslowerthanasimplecompareoperationoftwoﬂoatingpointnumbersbetweenregisters.Theexecutiontimeincludestheloadfrommemory,andthestoretomemoryaswellasthesendsandreceivesoverthenetworkandthebranchesforcon-ditionalsandloopprocessing.ThemergeispipelinedandproperlyscheduledforthelonginstructionwordoperationsiniWarp.

Interestingly,mergingtwosequenceswithapairofprocessorsismorethantwotimesfasterthanmergingtwosequencesonaniWarpuni-processor.Designedforsystolicalgorithms,theproces-sorsprovidedirectaccesstothecommunicationchannelsthroughsystolicgatesintheregisterﬁle.Intheparallelversionofthemergetheoperandsofthecomparisonscanbereceivedandprocessedatthesametime.Thissavesonethirdofthememoryaccessesandremovesabottleneckinthesingleprocessorimplementation.

Giventhearchitecturalmodelofprivatememorymeshcomputerswithhighspeedprocessorinterconnects,wefoundthatthebitonicsortingnetworkexecutedonanemulatedhypercubeisthebestmethodpossibleforparallelsortingonthesekindofmeshmachinesforallpracticalmachinesizes.Wewillsupportthisstatementbyacomparativeanalysisofhypercube-andmesh-sortingalgorithmsinthenextchapter.

6Analysisofthemappedbitonicsorter

Wepresenttheanalysisofouralgorithmforthecaseofan-dimensionaltorus,assumingthatall-dimensionshavethesamesizeandarepowersoftwo.Sinceweuseseparablemappings,wecangeneralizetheanalysistodifferentarraysizesbyusingdifferentmappingsfortorusdimensionsofdifferentsize.

thethenumberdimensionofprocessorsofthemeshin-the(hypercubeorthemesh.2foriWarptori.)

bitsfortheelementsand2

log

stages.

Thebitonicsortingnetworkdoesallcommunicationalonghy-percubeconnections.Thenetworkcanthereforebeembeddedontoaphysicalhypercubenetworkwithload1andcongestion1.Fur-thermoreeverystageofthebitonicsortingnetworkusesthelinksofexactlyonehypercubedimension.Thenumberofparallelmergestepsrequiredforabitonicsortonahypercube-connectedparal-lelcomputerswithprocessorequalsthenumberofstagesinthebitonicsortingnetworkwithinputs.

Theperformanceofthemappedalgorithmisaffectedbythechar-acteristicsofthemapping,whenexecutingthestagesofabitonicnetworkonaphysicalmesh.Thecomputationcostsdonotdifferfromthelogicalhypercube,sinceallmappingsconsideredhave

load

1.Themergestephandleselementsandispipelined.Theeffectofthedilationonthecommunicationlatencycanbeneglectedaslong

intothecalculation.Thisratiolimitstheamountofcongestionthatcanbehandledwithoutslow-down.

BasedonthearchitecturalspeciﬁcationofiWarpwepredictthefollowingrunningtimesforthecomputationandcommunicationoperations:Computationtakes14clocksperelement,whilethecommunicationonlytakes4clocksperelement.Theratioisthere-fore2

thisisarateof1

tend2isclosetotheactualmeasurementsincethehardwareschedulerstoallocateevenfractionsofbandwidth.

Whenmultiplemergestepsshareaphysicallink,theslowdowndependsonthenumberofmergesinexcessof1

inallmeshdimensionsandthatthe

hypercubedimensionscanbepartitionedevenlyintomeshdimen-sions.Notethatforatorusthewrap-aroundlinksreducethecongestionforthehighestdimensions.

max12

Tocomputethenumberofeffectivemergestagesmergewecarryouttheweightedsumoverallthestagesofthebitonicsortingnetwork:

merge

Forthesymbolicsummationthesumisrewrittenas:

(1log)

merge

)

(1)

log)1

2(12

)

Therunningtimeoftheglobalsort,mergeisthenumberofeffectivemergestagestimesthetimecomplexityofasinglemergestep(Corollary6.1).

merge

((1

log)

8log

inthenumber

ofprocessors.

Corollary6.3Fortwodimensionaltori2,

2logand2

5log(42

4)log

2loglog7

Forothervaluesof,thetimecomplexitiesare(14:

1563

4log5and1:62004log7.Avalueof1correspondstoperfectlybalancedarchitectureswithacommunication–computationratioof1.

Theasymptoticresultoftheanalysisisslightlyworsethanthe

lowerboundfor1-1sortinggivenin[SS88],butinourmappedalgorithmwehavesmallloworderterms.ThisisthereasonwhythismethodresultedinthefastestsorteronourcelliWarpmachines.

CurrentiWarpsystems,whichrunat20106cyclespersecond,require14cyclestoprocessasinglebitkeyinamergestepandatotalof230cyclestoprocessabitkeyduringthelocalradixsort.Togetherwiththethetimecomplexitiesweareabletoestimatetheexecutiontimesofthewholesorter.TheestimatematchthemeasuredexecutiontimeslistedinChapter8verywell.

Comparisons (merge steps) vs. processors for ratio 1/21000000󰀵Rev-Sort󰀵󰀹1-1 Optimal Sort󰀵󰀹N100000NMap. bitonic󰀵cong.󰀵󰀹N󰀹Map. bitonic w/o󰀵N\"cong.󰀹󰀵N󰀹10000󰀵󰀹N󰀵󰀹NN󰀵󰀹1000󰀵󰀹N󰀹N󰀵󰀹N\"\"\"󰀵󰀹N\"\"\"\"\"100󰀹N\"\"󰀹󰀵\"󰀹󰀵N\"\"N\"10\"N󰀵6kkkkk16514652MMMMMGG2166146781421662Figure5:Numberofeffectivemergestepsvs.numberofprocessors

forRevSort,optimalsortandmappedbitonicsort.Assumedis

communication–computationratio

loglogsteps.The

algorithmissimpleandthereforeagoodcandidateforthefastest

sorteronpracticalmachinesizes,sinceloglog

5inpractice.AsseeninFigure5,RevSortitisbetterthantheoptimalmeshsorterforsmallcasesbutcannotoutperformthemappedbitonicsorter.Forlargetheoptimal1-1sorterisbest.

Thekeytothehighefﬁciencyofthesorterliesinthegoodbehaviorofsmallercasesandinthefactthatall\"practical\"machines

Comparisons (merge steps) vs. processors for ratio 1/11000000LRev-SortL󰀹1-1 Optimal SortLNN󰀹100000NMap. bitonicLcong.N󰀹LMap. bitonic w/oN󰀹L\"cong.LN󰀹󰀹10000LN󰀹LN󰀹NL󰀹1000NL󰀹N󰀹󰀹NLL󰀹N\"\"\"\"\"\"\"100󰀹\"\"󰀹LNL\"\"\"󰀹N\"\"10L\"N6kkkkk16514652MMMMMGG2166146781421662Figure6:Numberofeffectivemergestepsvs.numberofprocessorsforRevSort,optimalsortandmappedbitonicsort.Assumedisperfectlybalancedcommunication–computationratio1.

sizesarewellwithinthatrangeoptimality.AtthistimeiWarpiscommerciallyavailableuptosizesof1024cells.LargeriWarptoriarehypothetical.

7.3Usingthewraparoundlinksinameshsorter

Averyinterestingaspectofmeshsortingisthattherowandcolumnsortingstepscannotmakeuseofthetorusunlesstheyuseextrael-ementsofbuffer-space.Aresultwithsixextrabuffersismentionedin[Kun91a]andattributedto[MS].

7.4Theh-hmeshsortingmodels

Somerecentliteraturepointsouttheimportanceofthesort-ingmodels[Kun91b],[Kun91a]andgivesa25algorithmforaslightlymodiﬁed1-1sortingmodel.Wefoundmostofthemethodsimpracticalbecauseoftheirlargeloworderterms,butmuchmoreworkisneededtocometoaconclusionaboutallmethodsformeshsortingproposedintheliterature.

7.5

Comparisonbetweenmeshandmappedhyper-cubesorters

TheFigures5and6showthatthecomplexitiesforthemappedalgorithmarebetterthanforthetwodirectmeshsorterssortersforallpracticalmachinesizes.

Tobreakevenwithourmappedsorterintheweightednumberofmergesteps,theoptimalalgorithmneedstorunonatwodimen-sionaltorusofsize65536,with4109processors;seeFigure5.

Evenundertheassumptionofan11ratiobetweencommunica-tionandcomputation,amachinethatis34timeslessefﬁcientincommunicationthaniWarp,themappedbitonicmergealgorithm

canoutperformtheoptimalmeshsort,unlessmorethanor16103processorsareused.

128

8Comparisontootherparallelsorterprograms

Toevaluatethespeedofthemappedbitonicmergesorter,wegiveanoverviewofsomepublishedexecutiontimesforsortingonemillion-bitkeysondifferentmachines:

A1024SprintnodeConnectionMachine(CM-2)sortsonemillionelementsin660msatarateof15106keys/sec.usingsamplesort[BLM91].TheConnectionMachineisbasedonphysicalhypercubeinterconnectionsbetweentheSprintnodes.1

AprocessoriWarpsortsonemillionelementsin442msandatarateof23106keys/sec.usingthemappedbitonicmergesorterdiscussedinthispaper.Themachineisbasedonatwodimensionaltorus.

A4096processorMasParMP-1sortssortonemillionelements

usingbitonicsortin1446msatarateof072106

onxnet.Thismachineisbasedontwodimensionalgrid,adiagonalgrid(xnet)andageneralpurposerouter[Pri91].

AsingleprocessorCrayYMP3sortsonemillionelementin330msatarateof30106keys/secusingaparallelversionofradixsort.An8processorCrayYMPcanachievea5-6foldspeedupresultinginarateof1516106keys/sec.[ZB91].

ThemeasuredexecutiontimesandthesortingratesfordifferentsizesortingtasksareshowninFigures7and8.ThemeasurementsontheCM-2showtheperformanceforparallelsortingonarealhypercube,whiletherunningtimesoniWarpshowtheperformanceofsortingonalogicalhypercubemappedtoaphysicaltorus.

Theimplementationofsamplingallowsfastsortersonlargeprob-lemsontheCM-2sincetheoverheadoftheroutercanbecompen-sated.IntheFigures8and7itcanbenotedthatiWarpdelivershighsortingperformancealsoforsmallproblemsizes.Thisisdueagoodmatchofthealgorithmandthebalancedarchitecturewithhighspeedlowlatencycommunication.

9Conclusion

Wehaveestablishedamethodandamodeltomaphypercubecom-putationsefﬁcientlytolowerdimensionaltori.Themethodisbasedonpast-neighborcommunicationoverfastprocessorinterconnec-tionswithlogicalchannels.Weexpectthesefeaturestobeincludedinmanymeshmachinesofthefuture.Themappedhypercubemodelprovidestheprogrammerwiththeviewofthemachineasahypercubeandletshimworkwiththesimplerhypercubealgorithms.ThehypercubenetworkisembeddedontothetorususingasimpleseparableGraycodemapping.TheGraycodemappingisgeneralandresultsacceptablecongestion,alsoslightlybettermappingswerefoundforspecialcases.Andetailedanalysisofthetimecomplexityandanimplementationwasdoneforamappedbitonicmergesorterinourmappedhypercubemodel.Thisallowedusto

helpfuldiscussionwithourtheorycommunityatCMU,particularlywithAnjaFeldmann,GaryMillerandEricSchwabe.

References

[AKS83]

M.Ajtai,J.Komlos,andE.Szemeredi.Anlogsortingnetwork.InProceedingsofthe15thAnnualACMSymposiumonTheoryofComputing,pages1–9,April1983.

[Bat68]

K.Batcher.Sortingnetworksandtheirapplications.InProceedingsoftheAFIPSSpringJointComputingConference,volume32,pages307–314,1968.

[BCC88]S.Borkar,R.Cohn,G.Cox,S.Gleason,T.Gross,

H.T.Kung,M.Lam,B.Moore,C.Peterson,J.Pieper,L.Rankin,P.S.Tseng,J.Sutton,J.Urbanski,andJ.Webb.iWarp:Anintegratedsolutiontohigh-speedparallelcomputing.InProceedingsofSupercomputing’88,pages330–339,Orlando,FL,USA,November1988.IEEEComputerSocietyandACMSIGARCH.

[BCC90]S.Borkar,R.Cohn,G.Cox,T.Gross,H.T.Kung,

M.Lam,B.Moore,W.Moore,C.Peterson,J.Susman,J.Sutton,J.Urbanski,andJ.Webb.SupportingsystolicandmemorycommunicationiniWarp.InProceedingsof17thIntl’SymposiumonComputerArchitecture,May1990.

[Bla90]

T.Blank.TheMasParMP-1architecture.In35thIEEEComputerSocietyInternationalConference,pages20–40,Spring1990.

[BLM91]G.E.Blelloch,C.E.Leiserson,B.M.Maggs,C.G.

PlanxtonandS.J.Smith,andM.Zagha.AcomparisonofsortingalgorithmsfortheconnectionmachineCM-2.In1991ACMSymposiumonParallelAlgorithmsandArchitectures,June1991.

[BS78]

G.BaudetandD.Stevenson.Optimalsortingalgo-rithmsforparallelcomputers.IEEETransactionsonComputers,C–27:84–87,1978.

[CLR90]

T.H.Cormen,C.E.Leiserson,andR.L.Rivest.Intro-ductiontoAlgorithms.TheMITPressandMcGraw-Hill,1990.

[CTH]

S.L.JohnsonC.T.Ho.Embeddingmeshesinbooleancubesbygraphdecomposition.Computersciencetech-nicalreport,YaleUniversity,NewHaven,CT06520,19.

[Gro]

ThomasGross.CommunicationiniWarpsystems.IninProceedingsofSupercomputing’,pages436–445,Reno,NV,USA,November19.IEEEComputerSo-cietyandACMSIGARCH.

[Hoa61]C.A.R.Hoare.Quicksort.ComputingJournal,5,1961.

[KH88]

B.KahleandD.Hillis.Theconnectionmachinemodelcm-1architecture.IEEESystems,Man,andCybernet-icsSpecialIssue,March1988.

[Knu73]

D.E.Knuth.SortingandSearching.TheArtofCom-puterProgramming.Vol3.Wesley,ReadingMas-sachusetts,1973.

[Kun91a]

M.Kunde.Routingandsortingongrids.Habilitation-sschrift,FakultaetfuerMathematikundInformatikderTUMuenchen,1991.

[Kun91b]

M.Kunde.Sortingonmeshes.InProceedingsofthe32stAnnualSymposiumonFoundationsofComputerScience,October1991.

[LSBD88]S.LakshmivarahanL.S.BaraschandS.K.Dhall.Gen-eralizedgraycodesandtheirproperties.InThirdInter-nationalConferenceonSupercomputing,May1988.[MS]

Y.MansourandL.Schulman.Sortingonaringofpro-cessors.Technicalreport,LaboratoryofComputerSci-ence,MassachusettsInst.ofTechnology,CambridgeMA,19.

[Pat90]M.S.Paterson.Imporovedsortingnetworkswithlogdepth.Algorithmica,5:75–92,1990.

[Pri91]

J.Prins.EfﬁcientbitonicsortingoflargearraysontheMasParMP-1.TechnicalReportTR91-041,UniversityofNorthCarolina,ChapelHillNC,1991.

[RR]

S.RajasekaranandJ.H.Reif.Optimalandsublogarith-mictimerandomizedparallelsortingalgorithms.SIAMJournalonComputing,18(3):594–607,June19.[Sch92]EricSchwabe.PersonalCommunication,January

1992.

[She91]JonathanShewchuk.PersonalCommunication,August1991.

[SS88]

C.P.SchnorrandA.Shamir.Anoptimalsortingalgo-rithmformesh-connectedcomputers.InProceedingsofthe20thAnnualACMSymposiumonTheoryofCom-puting,pages255–263,1988.

[TK77]

C.D.ThompsonandH.T.Kung.Sortingonameshconnectedparallelcomputer.CommunicationsoftheACM,20:263–271,1977.

[ZB91]

MarcoZaghaandGuyE.Blelloch.Radixsortforvec-tormultiprocessors.InProceedingsSupercomputing’91,pages712–721,November1991.

因篇幅问题不能全部显示，请点此查看更多更全内容

查看全文