
Comparing the Performance of Centralized and Distributed Coordination on Systems with Improved Combining Switches

NYU Computer Science Technical Report TR2003-849

Eric Freudenthal and Allan Gottlieb
Department of Computer Science
Courant Institute of Mathematical Sciences
New York University
{freudenthal,gottlieb}@nyu.edu

Abstract—Memory system congestion due to serialization of hot spot accesses can adversely affect the performance of interprocess coordination algorithms. Hardware and software techniques have been proposed to reduce this congestion and thereby provide superior system performance. The combining networks of Gottlieb et al. automatically parallelize concurrent hot spot memory accesses, improving the performance of algorithms that poll a small number of shared variables. The widely used "MCS" distributed-spin algorithms take a software approach: they reduce hot spot congestion by polling only variables stored locally.

Our investigations detected performance problems in existing designs for combining networks, and we propose mechanisms that alleviate them. Simulation studies described herein indicate that centralized readers-writers algorithms executed on the improved combining networks have performance at least competitive with the MCS algorithms.

I. INTRODUCTION

It is well known that the scalability of interprocess coordination can limit the performance of shared-memory computers. Since the latency required for coordination algorithms such as barriers or readers-writers increases with the available parallelism, their impact is especially important for large-scale systems. A common software technique used to minimize this effect is distributed local-spinning, in which processors repeatedly access variables stored locally (in so-called NUMA systems, the shared memory is physically distributed among the processors).

A less common technique is to utilize special purpose coordination hardware such as the barrier network of [2], the CM5 Control Network [3], or the "combining network" of [1] and have the processors reference centralized memory. The idea behind combining is that when references to the same memory location meet at a network switch, they are combined into one reference that proceeds to memory. When the response to the combined message reaches the switch, data held in the "wait buffer" is used to generate the needed second response.

The early work at NYU on combining networks showed their great advantage for certain classes of memory traffic, especially those with a significant portion of hot-spot accesses (a disproportionately large percentage of the references to one or a few locations). It is perhaps surprising that this work did not simulate the traffic generated when all the processors engage in busy-wait polling, i.e., 100% hot-spot accesses (but see the comments on [7] in Section III). When completing studies begun a number of years ago of what we expected to be very fast centralized algorithms for barriers and readers-writers, we were particularly surprised to find that the combining network performed poorly in this situation. While it did not exhibit the disastrous serialization characteristic of accesses to a single location without combining, the improvement was much less than expected and our algorithms were not nearly competitive with those based on distributed local-spinning (MCS) [11].

In this paper we briefly review combining networks and present the surprising data just mentioned. We then present two fairly simple changes to the NYU combining switches that enable the system to perform much better. The first change is to increase the wait-buffer size. The second change is more subtle: The network is output-buffered and a trade-off exists involving the size of the output queues. Large queues are well known to improve performance for random traffic. However, we found that large queues cause poor polling performance. We therefore propose adapting the queue size to the traffic encountered: as more combined messages are present, the queue capacity is reduced. Together, these two simple changes have a dramatic effect on polling, and our centralized barrier and readers-writers algorithms become competitive with the commonly used local-spin algorithms of Mellor-Crummey and Scott (some of which also benefit from the availability of combining).

II. BACKGROUND

Large-scale, shared-memory computation requires memory systems with bandwidth that scales with the number of processors. Multi-stage interconnection fabrics and interleaving of memory addresses among multiple memory units can provide scalable memory bandwidth for memory reference patterns whose addresses are uniformly distributed. Many variants of this architecture have been implemented in commercial and other research systems [9], [10], [18]. However, the serialization of memory transactions at each memory unit is problematic for reference patterns whose mapping to memory units is unevenly distributed. An important cause of non-uniform memory access patterns is hot-spot memory accesses generated by centralized busy-waiting coordination algorithms. The Ultracomputer architecture includes network switches [16] with logic to reduce this congestion by combining into a single request multiple memory transactions (e.g. loads, stores, fetch-and-adds) that reference the same memory address.
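As a toy illustration of interleaving (our own sketch, not a detail of the cited systems), the low-order address bits can select the memory unit, so consecutive addresses land on different MMs:

    /* Interleaving sketch: the low-order d bits of an address select the MM,
       so consecutive addresses are spread across all 2^d memory units. */
    #define LOG_MMS 3  /* d = 3 gives the eight MMs of Figure 1 */

    static unsigned mm_of(unsigned addr)        { return addr & ((1u << LOG_MMS) - 1u); }
    static unsigned offset_in_mm(unsigned addr) { return addr >> LOG_MMS; }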


Fig. 1. Eight PE System with Hot-Spot Congestion to MM3.

The Ultracomputer combining switch design utilizes a variant of cut-through routing [13] that imposes a latency of one clock cycle when there is no contention for an outgoing network link. When there is contention for an output port, messages are buffered on queues associated with each output port. Investigations by Dias and Jump [4], Dickey [5], Liu [14], and others indicate that these queues significantly increase network bandwidth for large systems with uniformly distributed memory access patterns.

Systems with high degrees of parallelism can be constructed using these switches: Figure 1 illustrates an eight-processor system with d = 3 stages of routing switches interconnected by a shuffle-exchange [19] routing pattern. References to MM3 are communicated via components denoted using bold outlines.
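For concreteness, the following sketch gives the textbook definition of the shuffle-exchange connections (a standard formulation, not taken from the report): the perfect shuffle left-rotates the d-bit index and the exchange flips its low bit.

    /* Perfect shuffle on a d-bit index: rotate left by one bit. */
    static unsigned shuffle(unsigned x, int d) {
        unsigned msb = (x >> (d - 1)) & 1u;
        return ((x << 1) | msb) & ((1u << d) - 1u);
    }

    /* Exchange: flip the low-order bit. */
    static unsigned exchange(unsigned x) { return x ^ 1u; }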

1) An Overview of Combining: We assume that a single memory module (MM) can initiate at most one request per cycle.¹ Thus unbalanced memory access patterns, such as hot spot polling of a coordination variable, can generate network congestion. Figure 1 illustrates contention among references to MM3. When the rate of requests to one MM exceeds its bandwidth, the switch queues feeding it will fill. Since a switch cannot accept messages when its output buffers are full, a funnel-of-congestion will spread to the network stages that feed the overloaded MM and interfere with transactions destined for other MMs as well.²

Ultracomputer switches combine pairs of memory requests accessing the same location into a single request to reduce the congestion generated by hot spot memory traffic. When the memory response subsequently arrives at this switch, it is de-combined into a pair of responses that are routed to the requesting PEs. To enable this de-combination, the switch uses an internal wait buffer to hold information found in the request until it is needed to generate the second response. Since combined messages can themselves be combined, this technique has the potential to reduce hot spot contention by a factor of two at each network stage.

2) Combining of Fetch-and-add: Our fetch-and-add based centralized coordination algorithms poll a small number (typically one) of "hot spot" shared variables whose values are modified using fetch-and-add. Thus, as indicated above, it is crucial in a design supporting large numbers of processors not to serialize this activity. The solution employed is to include adders in the MMs (thus guaranteeing atomicity) and to combine concurrent fetch-and-add operations at the switches.

When two fetch-and-add operations referencing the same shared variable, say FAA(X,e) and FAA(X,f), meet at a switch, the combined request FAA(X,e+f) is transmitted and the value e is stored in the wait buffer. Load and fetch-and-add operations directed towards the same hot spot variable can be combined if loads are transmitted as fetch-and-adds whose addends are zero.

Upon receiving FAA(X,e+f), the MM updates X (to X+e+f) and responds with X.³ When the response arrives at the combining switch, the latter transmits X to satisfy the request FAA(X,e) and transmits X+e to satisfy the request FAA(X,f), thus achieving the same effect as if FAA(X,e) was followed immediately by FAA(X,f). This process is illustrated in the upper portion of Figure 2. The cascaded combining of 4 requests at two network stages is illustrated in the lower portion of the same figure.

Fig. 2. Combining of Fetch-And-Adds at a single switch (above) and at multiple switches (below).

Figure 3 illustrates an Ultracomputer combining switch. Each switch contains
• Two dual-input forward-path combining queues: Entries are inserted and deleted in a FIFO manner and matching entries are combined, which necessitates an ALU to compute the sum e+f.
• Two dual-input reverse-path queues: Entries are inserted and deleted in a FIFO manner.
• Two wait buffers: Entries are inserted and associative searches are performed with matched entries removed. An included ALU computes X+e.

Fig. 3. Block Diagram of Combining 2-by-2 Switch. Notation: RQ: Reverse (To PE) Queue, WB: Wait Buffer, FCQ: Forward (To MM) Combining Queue.

A. When Combining Can Occur

Network latency is proportional to switch cycle times and grows with queuing delays. VLSI simulations showed that the critical path in a proposed Ultracomputer switch included the adder to form e+f and the output drivers. To reduce cycle time, at the cost of restricting the circumstances in which combining could occur, the chosen design did not combine requests that were at the head of the output queue (and hence might be transmitted the same cycle as combined). This modification reduced the critical path timing to the max of the adder and driver rather than their sum.

¹ This restriction applies to each bank if the MM is itself composed of banks that can be accessed concurrently. The MM design simulated in our experiments can accept one request every four cycles.

² Pfister and Norton [8] called this funnel tree saturation and observed that access patterns containing only 5% hot spot traffic substantially increase memory latency.

³ Recall that FAA(X,e) is defined to return the old value of X and atomically increment X by the value e.
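To make the bookkeeping concrete, here is a minimal C sketch of one pairwise combine and the matching decombine; the types and names are ours, and the hardware performs these steps in the FCQ and wait buffer:

    /* One pairwise combine of FAA(X,e) and FAA(X,f) at a switch. */
    typedef struct { int addend; } Request;

    typedef struct { int first_addend; } WaitEntry;  /* decombine info */

    static Request combine(Request a, Request b, WaitEntry *wb) {
        wb->first_addend = a.addend;              /* save e for decombining */
        return (Request){ a.addend + b.addend };  /* forward FAA(X, e+f)    */
    }

    /* The MM replies with x, the value X held before the combined add. */
    static void decombine(int x, WaitEntry wb, int *resp_e, int *resp_f) {
        *resp_e = x;                    /* response to FAA(X,e) */
        *resp_f = x + wb.first_addend;  /* response to FAA(X,f) */
    }

Using the lower portion of Figure 2: combining FAA(X,3) with FAA(X,12) forwards FAA(X,15); when the MM (holding X = 0) replies with 0, the switch returns 0 and 3, exactly as if the two operations had executed serially.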


Fig. 4. Memory Latency for Ultra switches with Wait Buffer Capacities of 8 and 100 messages for 10% and 100% Hotspot Traffic, 1 Outstanding Request/PE.

We call the modified design "decoupled" because the adder and driver are in a sense decoupled, and call the original design "coupled". Since the head entry cannot be combined, we note that a decoupled queue requires at least three requests for combining to occur. We shall see that this trivial-sounding observation is important.

To enable the dual-input queues to accept items on each input in one cycle, the queue is constructed from two independent single-input queues whose outputs are multiplexed. To achieve the maximum combining rate, we therefore require at least three requests in each of the single-input combining queues, which implies at least six in each dual-input combining queue. A more complicated dual-input decoupled combining queue, dubbed type A in Dickey [5], requires only three messages to achieve the maximum combining rate rather than the six required in the "type B" design we are assuming.
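A structural sketch of the type B queue just described (field names are ours, not the hardware specification):

    #define SLOTS 4  /* each slot holds one plain or one combined message */

    typedef struct { int addend[SLOTS]; int head, count; } SingleInputFifo;

    /* Type B dual-input combining queue: one FIFO per input port with
       multiplexed outputs. A request can combine only with a non-head
       entry of the FIFO it entered on, so the maximum combining rate
       needs >= 3 entries per FIFO, i.e. >= 6 per dual-input queue. */
    typedef struct { SingleInputFifo port[2]; int next_port; } TypeBQueue;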

III. IMPROVING THE PERFORMANCE OF BUSY-WAIT POLLING

Figure 4 plots memory latency for simulated systems of four to 2048 PEs with memory traffic containing 10% and 100% hotspot references (the latter typifies the traffic when processors are engaged in busy-wait polling and the local caches filter out instruction and private data references). The simulated multiprocessor approximates the Ultracomputer design. Observe that for 10% hotspot references the latency is only slightly greater than the minimum network transit time (1 cycle per stage per direction) plus the simulated memory latency of 2 cycles.

On larger systems, memory latency is substantially greater for the 100% hotspot load and can exceed 10 times the minimum. Since the combining switches simulated were expected to perform well for this traffic, the results were surprising, especially to the senior author who was heavily involved with the Ultracomputer project throughout its duration. It is clear that our current objective of exhibiting high-performance centralized coordination algorithms cannot be achieved using these simulated switches.

We have discovered two design flaws in the combining switch design. The first is that the wait buffer was too small; the second is that, in a sense to be explained below, the combining queues were too large.

A. Increasing the Wait Buffer Capacity

The fabricated switches had wait buffers capable of holding 8 entries. We see in Figure 4 that increasing the capacity to 100 entries (feasible with today's technology) reduces the latency for a 2048 PE system from 306 to 168 cycles, an improvement of 45%. While this improvement certainly helps, the latency of hotspot polling traffic is still seven times the latency of uniformly distributed reference patterns (24 cycles).

B. Adaptive Combining Queues

In order to supply high bandwidth for typical uniformly distributed traffic (i.e., 0% hotspot), it is important for the switch queues to be large. However, as observed by [7], busy wait polling (100% hotspot) is poorly served by these large queues, as we now describe.

For busy-wait polling, each processor always has one outstanding request directed at the same location.⁴ Our expectation was that with N processors and hence log N stages of switches, pairs would combine at each stage, resulting in just one (or perhaps more realistically a few) requests reaching memory.

⁴ This is not quite correct: When the response arrives it takes a few cycles before the next request is generated. Our simulations accurately account for this delay.

Fig. 5. Combining rate, by stage, for simulated polling on systems of 2^2 to 2^11 PEs. Wait buffers have capacity 100 and combining queues can hold 4 combined or uncombined messages. In the right plot the combining queues are declared full if two combined requests are present.

What happens instead is that the queues in switches near memory fill to capacity and the queues in the remainder of the switches are nearly empty. Since combining requires multiple entries to be present, it can only occur near memory. However, a single switch cannot combine an unbounded number of requests into one. Those fabricated for the Ultracomputer could combine only pairs so, if, for example, eight requests are queued for the same location, (at least) four requests will depart.⁵

Figure 5 illustrates this effect. Both plots are for hotspot polling. The left plot is for simulated switches modeled after those fabricated for the Ultracomputer, but with the wait buffer size increased to 100. Each queue has four slots, each of which can hold a request received by the switch or one formed by combining two received requests. The plot on the right is for the same switches with the queue capacity restricted so that if 2 combined requests are present the queue is declared full even if empty slots remain. We compare the graphs labeled 10 (representing a system with 1024 PEs) in each plot. In the left plot with the original switches, we find that combines occur at the maximal rate for the four (out of 10) stages closest to memory, occur at nearly the maximal rate for the fifth stage, and do not occur for the remaining five stages. The restricted switches do better, combining at maximal rate for five stages and at 1/4 of the maximum for the sixth stage. Note that for uniformly distributed traffic without hotspots, combining will very rarely occur and the artificial limit of 2 combined requests per queue will not be invoked.

⁵ Alternate designs could combine more than two requests into one, but, as observed by [7], when this "combining degree" increases, congestion arises at the point where the single response is de-combined into many.

Fig. 6. Memory latency for simulated hotspot polling traffic, 4-2048 PEs.

We call this new combining queue design adaptive since the queues are full size for uniformly distributed traffic and adapt to busy wait polling by artificially reducing their size.
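The adaptive rule changes only the queue's "full" test; a minimal sketch (our naming, not the hardware specification):

    #include <stdbool.h>

    #define QUEUE_SLOTS    4  /* physical capacity, as in Figure 5      */
    #define COMBINED_LIMIT 2  /* adaptive threshold on combined entries */

    typedef struct {
        int occupied;  /* slots currently in use                   */
        int combined;  /* how many occupants are combined messages */
    } CombiningQueue;

    /* Declared full either when all slots are taken (original rule) or
       when two combined messages are present (adaptive rule), even if
       empty slots remain; upstream switches must then hold, and hence
       combine, their own traffic. */
    static bool queue_full(const CombiningQueue *q) {
        return q->occupied >= QUEUE_SLOTS || q->combined >= COMBINED_LIMIT;
    }

For uniformly distributed traffic, combined stays near zero, so the limit is never invoked and the queue behaves like the original four-slot design.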

We see in Figure 6 that the increased combining rate achieved by the adaptive queues dramatically lowers the latency experienced during busy wait polling. For a 2048 PE system the reduction is from 168 cycles, for a system with large (100 entry) wait buffers and the original queues, to 118 cycles (five times the latency of uniform traffic) with the same wait buffers but adaptive combining queues. This is a reduction of about 30% and gives a total reduction of 61% when compared with the 306 cycles needed by the baseline switches. The bottom two plots are for more aggressive switch designs that are discussed below. In Section IV we shall see that centralized coordination algorithms executed on systems with adaptive queues and large wait buffers are competitive with their distributed local-spinning alternatives.

Figure 7 compares the latency for 1024 PE systems with various switch designs and a range of accepted loads (i.e., processors can have multiple outstanding requests, unlike the situation above for busy wait polling). These results confirm our assertion that adaptive queues have very little effect for low hotspot rates and are a considerable improvement for high rates.

Fig. 7. Simulated Round-trip Latency over a Range of Offered Loads for 1% (left), 20% (middle) and 100% (right) Hot Spot Traffic.

C. More Aggressive Combining Queues

Recall that we have been simulating decoupled type B switches in which combining is disabled for the head entry (to "decouple" the ALU and output drivers) and the dual-input combining queues are composed of two independent single-input combining queues with multiplexed outputs. We started with a "baseline design", used in the Ultracomputer, and produced what we refer to as the "improved design" having a larger wait buffer and adaptive combining queues. We also applied the same two improvements to type A switches having coupled ALUs and refer to the result as the "aggressive design" or "aggressive switches". For example, the lowest plot in Figure 6 is for aggressive switches. Other experiments not presented here have shown that aggressive switches permit significant rates of combining to occur in network stages near the processors. Also, as we will show in Section IV, the centralized coordination algorithms perform exceptionally well on this architecture. Although aggressive switches are the best performing, we caution the reader that our measurements are given in units of a switch cycle time and, without a more detailed design study, we cannot estimate the degradation in cycle time such aggressive switches might entail.

D. Applicability of Results to Modern Systems

The research described above investigates systems whose components have similar speeds, as was typical when this project began. During the intervening decade, however, logic and communication rates have increased by more than two orders of magnitude while DRAM latency has improved by less than a factor of two.

In order to better model the performance obtainable with modern hardware, we increased the memory latency from two to thirty-eight cycles, and the interval between accepting requests from four to forty cycles. These results are plotted in Figure 8 and indicate that the advantage achieved by the adaptive switch design is even greater than before.

Fig. 8. Memory latency for hotspot polling on systems with MMs that can accept one message every 40 cycles.

Various techniques can mitigate the effect of slow memory. For example, having multiple sub-banks per MM would increase memory bandwidth and improve the performance on uniformly distributed traffic, and caches at the MM are well suited to the temporal locality of hotspot accesses.

IV. PERFORMANCE EVALUATION OF CENTRALIZED AND MCS READER-WRITER COORDINATION

Many algorithms for coordinating readers and writers have appeared. Courtois and Heymans [17] introduced the problem and presented a solution that serializes the entry of readers as well as writers. More importantly, these algorithms generate hot spot polling traffic and therefore suffer slowdowns due to memory system congestion on systems that do not combine hot spot references.

A centralized algorithm is presented in [1] that, on systems capable of combining fetch-and-add operations, eliminates the serialization of readers in the absence of writers. However, no commercial systems with this capability have been constructed, and in their absence, alternative "distributed local-spin" MCS algorithms have been developed [11] that minimize hot spot traffic by having each processor busy-wait on a shared variable stored in memory co-located with this processor. This NUMA memory organization results in local busy waiting that does not contribute to or encounter network congestion.⁶

The most likely cause of unbounded waiting in any reader-writer algorithm is that a continual stream of readers can starve all writers. The standard technique of giving writers priority eliminates this possibility.⁷ In this section we present a performance comparison of "best of breed" centralized and MCS writer-priority reader-writer algorithms, each executed on a simulated system with the architectural features it exploits.

Roughly similar results hold for barrier algorithms; the interested reader is referred to [6].

A. Centralized Algorithms for Readers and Writers

Figure 9 presents a centralized reader-writer lock using fetch-and-add. This algorithm issues only a single shared-memory reference (a fetch-and-add) when the lock is uncontested. We know of no other algorithm with this desirable property. A more complete description of the techniques employed by this algorithm appears in [6].

shared int C = 0;  // K: a constant exceeding the maximum possible number of simultaneous readers

bool faaReaderLock() {
    for (;;) {
        int c = faa(C, 1);            // request
        if (c < K) return true;       // no writers? succeeded!
        c = faa(C, -1);               // cancel request
        while (c >= K) {              // while a writer is active...
            c = C;                    // read C
        }
    }
}

void faaReaderUnlock() { faa(C, -1); }

bool faaWriterLock() {
    for (;;) {
        int c = faa(C, K);            // request
        if (c < K) {                  // no other writers?
            if (c)                    // any readers?
                while ((C % K) != 0); // wait for readers
            return true;              // succeeded
        }
        c = faa(C, -K) - K;           // conflict: must decrement & retry
        while (c >= K) {              // while a writer is active...
            c = C;                    // read C
        }
    }
}

void faaWriterUnlock() { faa(C, -K); }

Fig. 9. Fetch-and-add Readers-Writers Lock
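Figure 9 assumes an atomic faa primitive and the constant K described above, so that C = K·(writers requesting or active) + (readers present). A minimal software rendering of faa using C11 atomics (a stand-in for the combining network and MM adders, not the hardware itself):

    #include <stdatomic.h>

    /* Atomically add e to *x and return the prior value of *x. */
    static inline int faa(atomic_int *x, int e) {
        return atomic_fetch_add(x, e);
    }

With this encoding, a reader's faa(C,1) returns a value below K exactly when no writer is requesting or active, which is why the uncontested path costs a single shared-memory reference.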

B. Hybrid Algorithm of Mellor-Crummey and Scott

The writer-priority readers-writers algorithm of Mellor-Crummey and Scott [11] is commonly used on large SMP systems. This algorithm, commonly referred to as MCS, is a hybrid of centralized and distributed approaches. Central state variables, manipulated with various synchronization primitives, are used to count the number and type of locks granted at any time and to head the lists of waiting processors. NUMA memory is used for busy waiting, which eliminates network contention.

C. Overview of Experimental Results

A series of micro-benchmark experiments were performed to compare the performance of centralized fetch-and-add based coordination with the hybrid algorithm of Mellor-Crummey and Scott. The simulated hardware includes combining switches, which improve the performance of MCS and are crucial for the centralized algorithm, as well as NUMA memory, which is important for MCS and not used by the centralized algorithm. All the switch designs described above plus others were simulated in the junior author's dissertation [6]. A summary of the results is presented below.

Neither algorithm dominates the other: The centralized algorithms are superior except when only writers are present. Recall that an ideal reader lock, in the absence of contention, yields linear speedup, whereas an ideal writer lock exhibits no slowdown as parallelism increases. When there are large numbers of readers present, the centralized algorithm, with its complete lack of reader serialization, thus gains an advantage, which is greater for the aggressive architecture.

⁶ Some authors use the term NUMA to simply mean that memory access is non-uniform: certain locations are further away than others. We use it to signify that (at least a portion of) the shared memory is distributed among the processors, with each processor having direct access to the portion stored locally.

⁷ But naturally permits writers starving readers.

D. Experimental Framework

The scalability of locking primitives is unimportant when they are executed infrequently with low contention. Our experiments consider the more interesting case of systems that frequently request reader and writer locks. For all experiments each process repeatedly:
• Stochastically chooses (at a fixed, but parameterized, probability) whether to obtain a reader or writer lock.⁸
• Issues a fixed sequence of non-combinable shared memory references distributed among multiple MMs (the "simulated work").
• Releases the lock.
• Waits a fixed delay.

Each experiment measures the rate at which locks are granted over a range of system sizes (higher values are superior). In order to generate equivalent contention from writers in each plot, the expected number of writers EW is held constant for all system sizes. Thus, the probability of a process becoming a writer is EW/Parallelism.

Each micro-benchmark has four parameters:
• PAR: the number of processors in the simulated system.
• EW: the expected number of writers.
• Work: the number of shared accesses executed while holding a lock.
• Delay: the number of cycles a process waits after releasing a lock.

Two classes of experiments were performed: those classified "I" measure the cost of intense synchronization, in which each processor requests and releases locks at the highest rate possible (Work = Delay = 0); those classified "R" are somewhat more realistic (Work = 10 and Delay = 100).
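A minimal sketch of the per-process loop just described (reader_lock, do_shared_access, wait_cycles, and the other primitives are hypothetical stand-ins for the simulator's own, which the report does not name):

    #include <stdbool.h>
    #include <stdlib.h>

    /* Hypothetical simulator primitives. */
    extern void reader_lock(void), reader_unlock(void);
    extern void writer_lock(void), writer_unlock(void);
    extern void do_shared_access(int i);   /* one non-combinable reference */
    extern void wait_cycles(int n);

    typedef struct {
        int par;       /* PAR: processors in the simulated system     */
        double ew;     /* EW: expected number of writers              */
        int work;      /* shared accesses made while holding the lock */
        int delay;     /* cycles to wait after releasing the lock     */
    } Params;

    void benchmark_process(const Params *p) {
        double p_writer = p->ew / p->par;  /* probability of being a writer */
        for (;;) {
            bool writer = ((double)rand() / RAND_MAX) < p_writer;
            if (writer) writer_lock(); else reader_lock();
            for (int i = 0; i < p->work; i++)
                do_shared_access(i);       /* the "simulated work" */
            if (writer) writer_unlock(); else reader_unlock();
            wait_cycles(p->delay);
        }
    }

Experiment class I sets work = delay = 0; class R sets work = 10 and delay = 100.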

1) All-Reader Experiments, EW = 0: The left-side charts in Figures 10 and 11 present results from experiments where all processes are readers. The centralized algorithm requires a single hot-spot memory reference to grant a reader lock in the absence of writers. In contrast, the MCS algorithm generates accesses to centralized state variables and linked lists of requesting readers. Not surprisingly, the centralized algorithms have significantly superior performance in this experiment. For some configurations, an order of magnitude difference is seen. We also observe that MCS does benefit significantly from combining. Even when only MCS uses the aggressive switches, the centralized algorithm is superior. Naturally, the differences are greater in the "intense" experiments.

⁸ The simulated random number generator executes in a single cycle.

Fig. 10. Experiment R, All Readers (left), All Writers (right).

Fig. 11. Experiment I, All Readers (left), All Writers (right).

2) All-Writer Experiments: The right-side charts in Figures 10 and 11 present results from experiments where all processes are writers, which must serialize and therefore typically spend a substantial period of time busy-waiting. The MCS algorithm has superior performance in these experiments. Since writers enforce mutual exclusion, no speedup is possible as the system size increases. Indeed, one expects a slowdown due to the increased average distance to memory. The anomalous speedup observed for MCS between 4 and 16 processors is due to the algorithm's decoupling of queue insertion and lock passing. This effect is described in [11]. As explained in [6], the MCS algorithm issues very little hotspot traffic when no readers are present and thus does not benefit from combining in these experiments.

3) Mixed Reader-Writer Experiments: Figures 12 through 15 present results of experiments with both readers and writers. In the first set, there is on average 1 writer present (this lock will have substantial contention from writers) and in the second set 0.1 (a somewhat less contended lock).

Fig. 12. Experiment I, EW = 1.

Fig. 13. Experiment R, EW = 1.

Fig. 14. Experiment I, EW = 0.1.

Fig. 15. Experiment R, EW = 0.1.

The rate at which the centralized algorithm grants reader locks increases linearly with system size for all these experiments and, as a result, significantly exceeds the rate granted by MCS for all large system experiments. Even without the aggressive switches, the difference normally exceeds an order of magnitude.

Even though writers have priority over readers, there are vastly fewer writer locks granted with these values of EW. Here also the centralized algorithm outperforms MCS, often by an order of magnitude. For some MCS experiments, the writer lock granting rates were so low that no grants were observed.

V. OPEN QUESTIONS

A. Extending the Adaptive Technique

Our adaptive technique sharply reduces queue capacity when a crude detector of hotspot traffic is triggered. While this technique reduces network latency for hotspot polling, it might also be triggered by mixed traffic patterns that would perform better with longer queues. We have neither witnessed nor investigated this effect, which may be mitigated by more gradual adaptive designs that variably adjust queue capacity as a function of a continuously measured rate of combining.

B. Generalization of Combining to Internet Traffic

The tree saturation problem due to hotspot access patterns is not unique to shared memory systems. Congestion generated by flood attacks and flash crowds [12] presents similar challenges for Internet Service Providers. In [15] Mahajan et al. propose a technique to limit the disruption generated by hotspot congestion on network traffic with overlapping communication routes. In their scheme, enhanced servers and routers incorporate mechanisms to characterize hotspot reference patterns. As with adaptive combining, upstream routers are instructed to throttle the hotspot traffic in order to reduce downstream congestion.

Hotspot requests do not benefit from this approach; combining, however, may provide an alternative to throttling. For example, the detection of hotspot congestion could trigger deployment of proxies near network entry points, potentially reducing downstream load and increasing hotspot performance. This type of combining is service-type specific, and therefore service-specific strategies must be employed. Dynamic deployment of such edge servers requires protocols for communicating the characteristics of hotspot aggregates to servers, and secure mechanisms to dynamically install and activate upstream proxies.

C. Combining and Cache-Coherency

Cache coherence protocols typically manage shared (read-only) and exclusive (read-write) copies of shared variables. Despite the obvious correspondence between cache coherence and the readers-writers coordination problem, coherence protocols typically serialize the transmission of line contents to individual caches. The SCI cache coherence protocol specifies a variant of combining fetch-and-store to efficiently enqueue requests. However, data distribution and line invalidation on network-connected systems is strictly serialized. Extensions of combining may be able to parallelize cache fill operations. Challenges for such schemes would include the development of an appropriate scalable directory structure that is amenable to (de)combinable transactions.

VI. CONCLUSIONS

Having been surprised at the comparatively poor performance attained by the Ultracomputer's combining network when presented with the 100% hotspot traffic that typifies busy wait polling of coordination variables, we identified two improvements to the combining switch design that increase its performance significantly. The first improvement is to simply increase the size of one of the buffers present. The more surprising second improvement is to artificially decrease the capacity of combining queues during periods of heavy combining. These adaptive combining queues better distribute the memory requests across the stages of the network, thereby increasing the overall combining rates.

We then compared the performance of a centralized algorithm for the readers-writers problem with that of the widely used MCS algorithm, which reduces hotspots by busy wait spinning only on variables stored in the portion of shared memory that is co-located with the processor in question.

Our simulation studies of these two algorithms have yielded several results: First, the performance of the MCS algorithm is improved by the availability of combining. Second, when no readers are present, the MCS algorithm outperforms the centralized algorithm. Finally, when readers are present, the results are reversed. For many of these last experiments, an order of magnitude improvement is seen.

A switch capable of combining memory references is more complex than non-combining switches. An objective of the previous design efforts was to permit a cycle time comparable to that of a similar non-combining switch. In order to maximize switch clock frequency, a (type B, decoupled) design was selected that can combine messages only if they arrive on the same input port and is unable to combine a request at the head of an output queue. We also simulated an aggressive (type A, coupled) design without these two restrictions. As expected, it performed very well, but we have not estimated the cycle-time penalty that may occur.

REFERENCES

[1] Allan Gottlieb, Boris Lubachevsky, and Larry Rudolph. Basic Techniques for the Efficient Coordination of Very Large Numbers of Cooperating Sequential Processors. ACM TOPLAS, April 1983.
[2] Carl J. Beckmann and Constantine D. Polychronopoulos. Fast barrier synchronization hardware. In Proc. 1990 Conference on Supercomputing. IEEE Computer Society Press, 1990.
[3] Thinking Machines Corp. The Connection Machine CM-5 Technical Summary, 1991.
[4] Daniel M. Dias and J. Robert Jump. Analysis and simulation of buffered delta networks. IEEE Trans. Comp., C-30(4):273–282, April 1981.
[5] Susan R. Dickey. Systolic Combining Switch Designs. PhD thesis, Courant Institute, New York University, New York, 1994.
[6] Eric Freudenthal. Comparing and Improving Centralized and Distributed Techniques for Coordinating Massively Parallel Shared-Memory Systems. PhD thesis, NYU, New York, June 2003.
[7] Gyungho Lee, C. P. Kruskal, and D. J. Kuck. On the Effectiveness of Combining in Resolving 'Hot Spot' Contention. Journal of Parallel and Distributed Computing, 20(2), February 1985.
[8] Gregory F. Pfister and V. Alan Norton. "Hot Spot" Contention and Combining in Multistage Interconnection Networks. IEEE Transactions on Computers, C-34(10), October 1985.
[9] Gregory F. Pfister, William C. Brantley, David A. George, Steve L. Harvey, Wally J. Kleinfelder, Kevin P. McAuliffe, Evelin S. Melton, V. Alan Norton, and Jodi Weiss. The IBM Research Parallel Processor Prototype (RP3). In Proc. ICPP, pages 764–771, 1985.
[10] James Laudon and Daniel Lenoski. The SGI Origin: a ccNUMA highly scalable server. ACM SIGARCH Computer Architecture News, 1997.
[11] John M. Mellor-Crummey and Michael L. Scott. Scalable Reader-Writer Synchronization for Shared Memory Multiprocessors. ACM Trans. Comput. Systems, 9(1):21–65, 1991.
[12] J. Jung, B. Krishnamurthy, and M. Rabinovich. Flash crowds and denial of service attacks: Characterization and implications for CDNs and web sites. In Proc. International World Wide Web Conference, pages 252–262, May 2002.
[13] P. Kermani and Leonard Kleinrock. Virtual Cut-through: A new computer communication switching technique. Computer Networks, 3:267–286, 1979.
[14] Yue-Sheng Liu. Architecture and Performance of Processor-Memory Interconnection Networks for MIMD Shared Memory Parallel Processing Systems. PhD thesis, New York University, 1990.
[15] R. Mahajan, S. Bellovin, S. Floyd, J. Ioannidis, V. Paxson, and S. Shenker. Controlling high bandwidth aggregates in the network, 2001.
[16] Jon A. Solworth and Marc Snir. Ultracomputer Note 39: The Ultraswitch - A VLSI Network Node for Parallel Processing. Technical report, Courant Institute, New York University, 1984.
[17] P. Courtois, F. Heymans, and D. Parnas. Concurrent control with "readers" and "writers". Comm. ACM, 14(10):667–668, October 1971.
[18] Randall D. Rettberg, William R. Crowther, and Phillip P. Carvey. The Monarch Parallel Processor Hardware Design. IEEE Computer, pages 18–30, April 1990.
[19] Harold Stone. Parallel processing with the perfect shuffle. IEEE Trans. Computers, C-20(2):153–161, 1971.

