Combining Switches
NYU Computer Science Technical Report TR2003-849
Eric Freudenthal and Allan Gottlieb
Department of Computer Science
Courant Institute of Mathematical Sciences
New York University
{freudenthal,gottlieb}@nyu.edu
Abstract—Memory system congestion due to serialization of hot spot accesses can adversely affect the performance of interprocess coordination algorithms. Hardware and software techniques have been proposed to reduce this congestion and thereby provide superior system performance. The combining networks of Gottlieb et al. automatically parallelize concurrent hot spot memory accesses, improving the performance of algorithms that poll a small number of shared variables. The widely used "MCS" distributed-spin algorithms take a software approach: they reduce hot spot congestion by polling only variables stored locally.

Our investigations detected performance problems in existing designs for combining networks, and we propose mechanisms that alleviate them. Simulation studies described herein indicate that centralized readers-writers algorithms executed on the improved combining networks have performance at least competitive with the MCS algorithms.

I. INTRODUCTION

It is well known that the scalability of interprocess coordination can limit the performance of shared-memory computers. Since the latency required for coordination algorithms such as barriers or readers-writers increases with the available parallelism, their impact is especially important for large-scale systems. A common software technique used to minimize this effect is distributed local-spinning, in which processors repeatedly access variables stored locally (in so-called NUMA systems, the shared memory is physically distributed among the processors).

A less common technique is to utilize special-purpose coordination hardware, such as the barrier network of [2], the CM-5 Control Network [3], or the "combining network" of [1], and have the processors reference centralized memory. The idea behind combining is that when references to the same memory location meet at a network switch, they are combined into one reference that proceeds to memory. When the response to the combined messages reaches the switch, data held in the "wait buffer" is used to generate the needed second response.

The early work at NYU on combining networks showed their great advantage for certain classes of memory traffic, especially those with a significant portion of hot-spot accesses (a disproportionately large percentage of the references to one or a few locations). It is perhaps surprising that this work did not simulate the traffic generated when all the processors engage in busy-wait polling, i.e., 100% hot-spot accesses (but see the comments on [7] in Section III). When completing studies, begun a number of years ago, of what we expected to be very fast centralized algorithms for barriers and readers-writers, we were particularly surprised to find that the combining network performed poorly in this situation. While it did not exhibit the disastrous serialization characteristic of accesses to a single location without combining, the improvement was much less than expected and our algorithms were not nearly competitive with those based on distributed local-spinning (MCS) [11].

In this paper we briefly review combining networks and present the surprising data just mentioned. We then present two fairly simple changes to the NYU combining switches that enable the system to perform much better. The first change is to increase the wait-buffer size. The second change is more subtle: the network is output-buffered, and a trade-off exists involving the size of the output queues. Large queues are well known to improve performance for random traffic. However, we found that large queues cause poor polling performance. We therefore propose adapting the queue size to the traffic encountered: as more combined messages are present, the queue capacity is reduced. Together, these two simple changes have a dramatic effect on polling, and our centralized barrier and readers-writers algorithms become competitive with the commonly used local-spin algorithms of Mellor-Crummey and Scott (some of which also benefit from the availability of combining).
II. BACKGROUND
Large-scale, shared-memory computation requires memory systems with bandwidth that scales with the number of processors. Multi-stage interconnection fabrics and interleaving of memory addresses among multiple memory units can provide scalable memory bandwidth for memory reference patterns whose addresses are uniformly distributed. Many variants of this architecture have been implemented in commercial and other research systems [9], [10], [18]. However, the serialization of memory transactions at each memory unit is problematic for reference patterns whose mapping to memory units is unevenly distributed. An important cause of non-uniform memory access patterns is hot-spot memory accesses generated by centralized busy-waiting coordination algorithms. The Ultracomputer architecture includes network switches [16] with logic to reduce this congestion by combining into a single request multiple memory transactions (e.g., loads, stores, fetch-and-adds) that reference the same memory address.
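To make the interleaving concrete, here is a minimal sketch (ours; the mapping shown is the common low-order scheme, not the layout of any particular machine) in which consecutive addresses are spread round-robin across the memory modules, so a uniformly distributed reference stream loads all MMs evenly:

    /* Low-order interleaving: consecutive word addresses map to
     * successive memory modules (MMs). */
    unsigned moduleOf(unsigned wordAddr, unsigned numMMs)     { return wordAddr % numMMs; }
    unsigned offsetWithin(unsigned wordAddr, unsigned numMMs) { return wordAddr / numMMs; }

A hot spot defeats this balancing: every poll of a single coordination variable maps to the same module regardless of how the remaining addresses are spread.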
Fig. 1. Eight PE System with Hot-Spot Congestion to MM3.

The Ultracomputer combining switch design utilizes a variant of cut-through routing [13] that imposes a latency of one clock cycle when there is no contention for an outgoing network link. When there is contention for an output port, messages are buffered on queues associated with each output port. Investigations by Dias and Jump [4], Dickey [5], Liu [14], and others indicate that these queues significantly increase network bandwidth for large systems with uniformly distributed memory access patterns.
Systems with high degrees of parallelism can be constructed using these switches: Figure 1 illustrates an eight-processor system with d = 3 stages of routing switches interconnected by a shuffle-exchange [19] routing pattern. References to MM3 are communicated via the components denoted using bold outlines.
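The path a reference takes can be illustrated with destination-tag routing, sketched below (our illustration; we assume one destination bit selects the output port at each stage, and the Ultracomputer's actual port conventions may differ):

    #include <stdio.h>

    /* Destination-tag routing: at stage s (counting from the PE side),
     * a 2-by-2 switch forwards the message on the port selected by one
     * bit of the destination MM number. */
    int outputPort(int destMM, int stage, int d) {
        return (destMM >> (d - 1 - stage)) & 1;   /* 0 = upper, 1 = lower */
    }

    int main(void) {
        int d = 3;                                /* eight PEs and MMs, as in Figure 1 */
        for (int s = 0; s < d; s++)               /* route a request to MM3 = binary 011 */
            printf("stage %d: port %d\n", s, outputPort(3, s, d));
        return 0;
    }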
1) An Overview of Combining: We assume that a single memory module (MM) can initiate at most one request per cycle. [Footnote: This restriction applies to each bank if the MM is itself composed of banks that can be accessed concurrently. The MM design simulated in our experiments can accept one request every four cycles.] Thus unbalanced memory access patterns, such as hot spot polling of a coordination variable, can generate network congestion. Figure 1 illustrates contention among references to MM3. When the rate of requests to one MM exceeds its bandwidth, the switch queues feeding it will fill. Since a switch cannot accept messages when its output buffers are full, a funnel-of-congestion will spread to the network stages that feed the overloaded MM and interfere with transactions destined for other MMs as well. [Footnote: Pfister and Norton [8] called this funnel tree saturation and observed that access patterns containing only 5% hot spot traffic substantially increase memory latency.]
Ultracomputer switches combine pairs of memory requests accessing the same location into a single request to reduce the congestion generated by hot spot memory traffic. When the memory response subsequently arrives at this switch, it is de-combined into a pair of responses that are routed to the requesting PEs. To enable this de-combination, the switch uses an internal wait buffer to hold information found in the request until it is needed to generate the second response. Since combined messages can themselves be combined, this technique has the potential to reduce hot spot contention by a factor of two at each network stage.

2) Combining of Fetch-and-add: Our fetch-and-add based centralized coordination algorithms poll a small number (typically one) of "hot spot" shared variables whose values are modified using fetch-and-add. Thus, as indicated above, it is crucial in a design supporting large numbers of processors not to serialize this activity. The solution employed is to include adders in the MMs (thus guaranteeing atomicity) and to combine concurrent fetch-and-add operations at the switches.

When two fetch-and-add operations referencing the same shared variable, say FAA(X,e) and FAA(X,f), meet at a switch, the combined request FAA(X,e+f) is transmitted and the value e is stored in the wait buffer. Load and fetch-and-add operations directed towards the same hot spot variable can be combined if loads are transmitted as fetch-and-adds whose addends are zero.

Upon receiving FAA(X,e+f), the MM updates X (to X+e+f) and responds with the original value t. [Footnote: Recall that FAA(X,e) is defined to return the old value of X and atomically increment X by the value e.] When the response arrives at the combining switch, the latter transmits t to satisfy the request FAA(X,e) and transmits t+e to satisfy the request FAA(X,f), thus achieving the same effect as if FAA(X,e) was followed immediately by FAA(X,f). This process is illustrated in the upper portion of Figure 2. The cascaded combining of 4 requests at two network stages is illustrated in the lower portion of the same figure.

Fig. 2. Combining of Fetch-And-Adds at a single switch (above) and at multiple switches (below).

Figure 3 illustrates an Ultracomputer combining switch. Each switch contains:
• Two dual-input forward-path combining queues: Entries are inserted and deleted in a FIFO manner and matching entries are combined, which necessitates an ALU to compute the sum e+f.
• Two dual-input reverse-path queues: Entries are inserted and deleted in a FIFO manner.
• Two wait buffers: Entries are inserted, and associative searches are performed with matched entries removed. An included ALU computes X+e.

Fig. 3. Block Diagram of Combining 2-by-2 Switch. Notation: RQ: Reverse (To PE) Queue, WB: Wait Buffer, FCQ: Forward (To MM) Combining Queue.
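The following sketch (ours; the hardware's message traffic is collapsed into ordinary function calls) traces the single-switch example of Figure 2: requests FAA(X,e) and FAA(X,f) are combined into FAA(X,e+f), the addend e is retained in the wait buffer, and the single memory response t is decombined into the two responses t and t+e:

    #include <stdio.h>

    int X = 0;                            /* the hot spot variable, at the MM */

    int mmFetchAndAdd(int addend) {       /* MM-side FAA; atomic in hardware */
        int old = X;
        X += addend;
        return old;                       /* FAA returns the old value */
    }

    int main(void) {
        int e = 1, f = 2;
        int waitBufferAddend = e;         /* the switch records e when combining */
        int t = mmFetchAndAdd(e + f);     /* one combined request, FAA(X, e+f) */
        int respE = t;                    /* decombined response for FAA(X,e) */
        int respF = t + waitBufferAddend; /* decombined response for FAA(X,f) */
        printf("FAA(X,e) -> %d, FAA(X,f) -> %d, X = %d\n", respE, respF, X);
        return 0;   /* same outcome as FAA(X,e) followed immediately by FAA(X,f) */
    }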
A. When Combining Can Occur

Network latency is proportional to switch cycle times and grows with queuing delays. VLSI simulations showed that the critical path in a proposed Ultracomputer switch included the adder to form e+f and the output drivers. To reduce cycle time, at the cost of restricting the circumstances in which combining could occur, the chosen design did not combine requests that were at the head of the output queue (and hence might be transmitted the same cycle as combined). This modification reduced the critical path timing to the maximum of the adder and driver delays rather than their sum.
Fig. 4. Memory Latency for Ultraswitches with Wait Buffer Capacities of 8 ("Baseline") and 100 ("Waitbuf100") messages for 10% and 100% Hotspot Traffic, 1 Outstanding Request/PE. (Round-trip latency is plotted against log(# processors).)
We call the modified design "decoupled" because the adder and driver are in a sense decoupled, and call the original design "coupled". Since the head entry cannot be combined, we note that a decoupled queue requires at least three requests for combining to occur. We shall see that this trivial-sounding observation is important.
To enable the dual-input queues to accept items on each input in one cycle, each queue is constructed from two independent single-input queues whose outputs are multiplexed. To achieve the maximum combining rate, we therefore require at least three requests in each of the single-input combining queues, which implies at least six in each dual-input combining queue. A more complicated dual-input decoupled combining queue, dubbed type A in Dickey [5], requires only three messages to achieve the maximum combining rate rather than the six required by the "type B" design we are assuming.
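The decoupled matching rule, and the three-request observation above, can be made concrete with a sketch (ours; the fabricated queues perform this match with systolic hardware, not a software scan):

    typedef struct { unsigned addr; int addend; } Request;
    typedef struct { Request slot[8]; int count; } CombiningQueue;  /* slot[0] is the head */

    /* Return the index of a queued request that may combine with a new
     * arrival destined for 'addr', or -1 if none. The scan starts at 1
     * because the head may be transmitted this very cycle (the
     * "decoupled" restriction), so combining first becomes possible
     * when a third request arrives and matches a non-head entry. */
    int findCombinable(const CombiningQueue *q, unsigned addr) {
        for (int i = 1; i < q->count; i++)
            if (q->slot[i].addr == addr)
                return i;
        return -1;
    }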
III. IMPROVING THE PERFORMANCE OF BUSY-WAIT POLLING

Figure 4 plots memory latency for simulated systems of four to 2048 PEs with memory traffic containing 10% and 100% hot spot references (the latter typifies the traffic when processors are engaged in busy-wait polling and the local caches filter out instruction and private data references). The simulated multiprocessor approximates the Ultracomputer design. Observe that for 10% hot spot references the latency is only slightly greater than the minimum network transit time (1 cycle per stage per direction) plus the simulated memory latency of 2 cycles.
On larger systems, memory latency is substantially greater for the 100% hot spot load and can exceed 10 times the minimum. Since the combining switches simulated were expected to perform well for this traffic, the results were surprising, especially to the senior author, who was heavily involved with the Ultracomputer project throughout its duration. It is clear that our current objective of exhibiting high-performance centralized coordination algorithms cannot be achieved using these simulated switches.
We have discovered two flaws in the combining switch design. The first is that the wait buffer was too small; the second is that, in a sense to be explained below, the combining queues were too large.
A. Increasing the Wait Buffer Capacity
The fabricated switches had wait buffers capable of holding 8 entries. We see in Figure 4 that increasing the capacity to 100 entries (feasible with today's technology) reduces the latency for a 2048 PE system from 306 to 168 cycles, an improvement of 45%. While this improvement certainly helps, the latency of hot spot polling traffic remains seven times the latency of uniformly distributed reference patterns (24 cycles).
B. Adaptive Combining Queues
In order to supply high bandwidth for typical uniformly distributed traffic (i.e., 0% hot spot), it is important for the switch queues to be large. However, as observed by [7], busy-wait polling (100% hot spot) is poorly served by these large queues, as we now describe.
For busy-wait polling, each processor always has one outstanding request directed at the same location. [Footnote: This is not quite correct: when the response arrives, it takes a few cycles before the next request is generated. Our simulations accurately account for this delay.] Our expectation was that, with N processors and hence log N stages of switches, pairs would combine at each stage; halving the traffic at each of the log N stages would result in just one (or perhaps more realistically a few) requests reaching memory.
Fig. 5. Combining rate, by stage, for simulated polling on systems of 2^2 to 2^11 PEs. Wait buffers have capacity 100 and combining queues can hold 4 combined or uncombined messages. In the right plot the combining queues are declared full if two combined requests are present. (Combines per kilocycle are plotted against distance to memory.)
What happens instead is that the queues in switches near memory fill to capacity and the queues in the remainder of the switches are nearly empty. Since combining requires multiple entries to be present, it can only occur near memory. However, a single switch cannot combine an unbounded number of requests into one. Those fabricated for the Ultracomputer could combine only pairs, so if, for example, eight requests are queued for the same location, (at least) four requests will depart. [Footnote: Alternate designs could combine more than two requests into one but, as observed by [7], when this "combining degree" increases, congestion arises at the point where the single response is de-combined into many.]

Figure 5 illustrates this effect. Both plots are for hot spot polling. The left plot is for simulated switches modeled after those fabricated for the Ultracomputer, but with the wait buffer size increased to 100. Each queue has four slots, each of which can hold a request received by the switch or one formed by combining two received requests. The plot on the right is for the same switches with the queue capacity restricted so that, if 2 combined requests are present, the queue is declared full even if empty slots remain. We compare the graphs labeled 10 (representing a system with 1024 PEs) in each plot. In the left plot, with the original switches, we find that combines occur at the maximal rate for the four (out of 10) stages closest to memory, occur at nearly the maximal rate for the fifth stage, and do not occur for the remaining five stages. The restricted switches do better, combining at the maximal rate for five stages and at 1/4 of the maximum for the sixth stage. Note that for uniformly distributed traffic without hot spots, combining will very rarely occur and the artificial limit of 2 combined requests per queue will not be invoked. We call this new combining queue design adaptive since the queues are full size for uniformly distributed traffic and adapt to busy-wait polling by artificially reducing their size.
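In sketch form (ours; the hardware realizes this as a simple occupancy test on each queue), the adaptive rule changes only the queue's notion of "full":

    typedef struct {
        int count;           /* occupied slots; physical capacity is 4 */
        int combinedCount;   /* occupants that are combined requests */
    } AdaptiveQueue;

    /* The queue refuses further input when physically full OR as soon as
     * two combined requests are present, even if empty slots remain.
     * Without hot spots, combined requests are rare, so the artificial
     * limit is never invoked and the full capacity remains available. */
    int queueIsFull(const AdaptiveQueue *q) {
        return q->count >= 4 || q->combinedCount >= 2;
    }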
Fig. 6. Memory latency for simulated hotspot polling traffic, 4-2048 PEs. (Curves: Baseline, Waitbuf100, Improved, Aggressive.)

We see in Figure 6 that the increased combining rate achieved by the adaptive queues dramatically lowers the latency experienced during busy-wait polling. For a 2048 PE system the reduction is from 168 cycles, for a system with large (100-entry) wait buffers and the original queues, to 118 cycles (five times the latency of uniform traffic) with the same wait buffers but adaptive combining queues. This is a reduction of over 30% and gives a total reduction of 61% when compared with the 306 cycles needed by the baseline switches. The bottom two plots are for more aggressive switch designs that are discussed below. In Section IV we shall see that centralized coordination algorithms executed on systems with adaptive queues and large wait buffers are competitive with their distributed local-spinning alternatives.
Figure 7 compares the latency for 1024 PE systems with various switch designs and a range of accepted loads (i.e., processors can have multiple outstanding requests, unlike the situation above for busy-wait polling). These results confirm our assertion that adaptive queues have very little effect for low hot spot rates and are a considerable improvement for high rates.

Fig. 7. Simulated Round-trip Latency over a Range of Offered Loads for 1% (left), 20% (middle) and 100% (right) Hot Spot Traffic. (Curves: Baseline, Waitbuf100, Improved, Aggressive.)

C. More Aggressive Combining Queues

Recall that we have been simulating decoupled type B switches, in which combining is disabled for the head entry (to "decouple" the ALU and output drivers) and the dual-input combining queues are composed of two independent single-input combining queues with multiplexed outputs. We started with a "baseline design", used in the Ultracomputer, and produced what we refer to as the "improved design", having a larger wait buffer and adaptive combining queues. We also applied the same two improvements to type A switches having coupled ALUs and refer to the result as the "aggressive design" or "aggressive switches". For example, the lowest plot in Figure 6 is for aggressive switches. Other experiments not presented here have shown that aggressive switches permit significant rates of combining to occur in network stages near the processors. Also, as we will show in Section IV, the centralized coordination algorithms perform exceptionally well on this architecture. Although aggressive switches are the best performing, we caution the reader that our measurements are given in units of a switch cycle time and, without a more detailed design study, we cannot estimate the degradation in cycle time such aggressive switches might entail.

D. Applicability of Results to Modern Systems

The research described above investigates systems whose components have similar speeds, as was typical when this project began. During the intervening decade, however, logic and communication rates have increased by more than two orders of magnitude while DRAM latency has improved by less than a factor of two.

In order to better model the performance obtainable with modern hardware, we increased the memory latency from two to thirty-eight cycles, and the interval between accepting requests from four to forty cycles. These results are plotted in Figure 8 and indicate that the advantage achieved by the adaptive switch design is even greater than before.

Fig. 8. Memory latency for hotspot polling on systems with MMs that can accept one message every 40 cycles. (Curves: Waitbuf100 MMWait40, Improved MMWait40, Aggressive MMWait40.)

Various techniques can mitigate the effect of slow memory. For example, having multiple sub-banks per MM would increase memory bandwidth and improve the performance on uniformly distributed traffic, and caches at the MM are well suited to the temporal locality of hot spot accesses.
IV. PERFORMANCE EVALUATION OF CENTRALIZED AND MCS READER-WRITER COORDINATION

Many algorithms for coordinating readers and writers have appeared. Courtois and Heymans [17] introduced the problem and presented a solution that serializes the entry of readers as well as writers. More importantly, these algorithms generate hot spot polling traffic and therefore suffer slowdowns due to memory system congestion on systems that do not combine hot spot references.
A centralized algorithm is presented in [1] that, on systems capable of combining fetch-and-add operations, eliminates the serialization of readers in the absence of writers. However, no commercial systems with this capability have been constructed and, in their absence, alternative "distributed local-spin" MCS algorithms have been developed [11] that minimize hot spot traffic by having each processor busy-wait on a shared variable stored in memory co-located with this processor. This NUMA memory organization results in local busy waiting that does not contribute to or encounter network congestion. [Footnote: Some authors use the term NUMA to simply mean that memory access is non-uniform: certain locations are further away than others. We use it to signify that (at least a portion of) the shared memory is distributed among the processors, with each processor having direct access to the portion stored locally.]

The most likely cause of unbounded waiting in any reader-writer algorithm is that a continual stream of readers can starve all writers. The standard technique of giving writers priority eliminates this possibility. [Footnote: But naturally permits writers to starve readers.] In this section we present a performance comparison of "best of breed" centralized and MCS writer-priority reader-writer algorithms, each executed on a simulated system with the architectural features it exploits.
Roughly similar results hold for barrier algorithms; the interested reader is referred to [6].

A. Centralized Algorithms for Readers and Writers

Figure 9 presents a centralized reader-writer lock using fetch-and-add. This algorithm issues only a single shared-memory reference (a fetch-and-add) when the lock is uncontested. We know of no other algorithm with this desirable property. A more complete description of the techniques employed by this algorithm appears in [6].
    shared int C = 0;   // K: a constant exceeding the maximum possible number of readers

    faaReaderLock() {
        for (;;) {
            int c = faa(C, 1);         // request
            if (c < K) return;         // no writers? succeeded!
            faa(C, -1);                // cancel request
            while (c >= K)             // while a writer is active...
                c = C;                 //   ...read C
        }
    }

    faaReaderUnlock() { faa(C, -1); }

    bool faaWriterLock() {
        for (;;) {
            int c = faa(C, K);         // request
            if (c < K) {
                while ((C % K) != 0);  // wait for readers
                return true;           // succeeded
            }
            c = faa(C, -K) - K;        // conflict: must decrement & retry
            while (c >= K)             // while a writer is active...
                c = C;                 //   ...read C
        }
    }

    faaWriterUnlock() { faa(C, -K); }

Fig. 9. Fetch-and-add Readers-Writers Lock.

B. Hybrid Algorithm of Mellor-Crummey and Scott

The writer-priority readers-writers algorithm of Mellor-Crummey and Scott [11] is commonly used on large SMP systems. This algorithm, commonly referred to as MCS, is a hybrid of centralized and distributed approaches. Central state variables, manipulated with various synchronization primitives, are used to count the number and type of locks granted at any time and to head the lists of waiting processors. NUMA memory is used for busy waiting, which eliminates network contention.

C. Overview of Experimental Results

A series of micro-benchmark experiments were performed to evaluate the performance of centralized fetch-and-add based coordination against the hybrid algorithm of Mellor-Crummey and Scott. The simulated hardware includes combining switches, which improve the performance of MCS and are crucial for the centralized algorithm, as well as NUMA memory, which is important for MCS and not used by the centralized algorithm. All the switch designs described above, plus others, were simulated in the junior author's dissertation [6]. A summary of the results is presented below.

Neither algorithm dominates the other: the centralized algorithms are superior except when only writers are present. Recall that an ideal reader lock, in the absence of contention, yields linear speedup, whereas an ideal writer exhibits no slowdown as parallelism increases. When there are large numbers of readers present, the centralized algorithm, with its complete lack of reader serialization, thus gains an advantage, which is greater for the aggressive architecture.

D. Experimental Framework

The scalability of locking primitives is unimportant when they are executed infrequently with low contention. Our experiments consider the more interesting case of systems that frequently request reader and writer locks. For all experiments, each process repeatedly executes the following steps (sketched in code below):
• Stochastically chooses (at a fixed, but parameterized, probability) whether to obtain a reader or writer lock. [Footnote: The simulated random number generator executes in a single cycle.]
• Issues a fixed sequence of non-combinable shared memory references distributed among multiple MMs—the "simulated work".
• Releases the lock.
• Waits a fixed delay.

Each experiment measures the rate at which locks are granted over a range of system sizes (higher values are superior). In order to generate equivalent contention from writers in each plot, the expected number of writers EW is held constant for all system sizes. Thus, the probability of a process becoming a writer is EW/Parallelism.

Each micro-benchmark has four parameters:
• PAR: the number of processors in the simulated system.
• EW: the expected number of writers.
• Work: the number of shared accesses executed while holding a lock.
• Delay: the number of cycles a process waits after releasing a lock.

Two classes of experiments were performed: those classified "I" measure the cost of intense synchronization, in which each processor requests and releases locks at the highest rate possible, Work = Delay = 0. Those classified "R" are somewhat more realistic, Work = 10 and Delay = 100.
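In outline (our paraphrase; the helper names are hypothetical stand-ins for the simulator's primitives, not its actual interface), each simulated process runs a loop of this shape:

    #include <stdbool.h>

    extern bool randomChance(double p);   /* true with probability p; one cycle */
    extern void simulatedWork(int refs);  /* issue 'refs' non-combinable references across MMs */
    extern void idleCycles(int n);        /* wait the fixed post-release delay */
    extern void faaReaderLock(void);  extern void faaReaderUnlock(void);
    extern bool faaWriterLock(void);  extern void faaWriterUnlock(void);

    void benchmarkProcess(double pWriter, int work, int delay) {
        for (;;) {
            bool writer = randomChance(pWriter);   /* pWriter = EW / PAR */
            if (writer) faaWriterLock(); else faaReaderLock();
            simulatedWork(work);                   /* the Work parameter */
            if (writer) faaWriterUnlock(); else faaReaderUnlock();
            idleCycles(delay);                     /* the Delay parameter */
        }
    }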
1) All-Reader Experiments, EW = 0: The left-side charts in Figures 10 and 11 present results from experiments where all processes are readers. The centralized algorithm requires a single hot-spot memory reference to grant a reader lock in the absence of writers. In contrast, the MCS algorithm generates accesses to centralized state variables and linked lists of requesting readers. Not surprisingly, the centralized algorithms have significantly superior performance in this experiment. For some configurations, an order of magnitude difference is seen. We also observe that MCS does benefit significantly from combining. Even when only MCS uses the aggressive switches, the centralized algorithm is superior. Naturally, the differences are greater in the "intense" experiments.

Fig. 10. Experiment R, All Readers (left), All Writers (right).
Fig. 11. Experiment I, All Readers (left), All Writers (right).

2) All-Writer Experiments: The right-side charts in Figures 10 and 11 present results from experiments where all processes are writers, which must serialize and therefore typically spend a substantial period of time busy-waiting. The MCS algorithm has superior performance in these experiments. Since writers enforce mutual exclusion, no speedup is possible as the system size increases. Indeed, one expects a slowdown due to the increased average distance to memory. The anomalous speedup observed for MCS between 4 and 16 processors is due to the algorithm's decoupling of queue insertion and lock passing. This effect is described in [11]. As explained in [6], the MCS algorithm issues very little hot spot traffic when no readers are present and thus does not benefit from combining in these experiments.

3) Mixed Reader-Writer Experiments: Figures 12 through 15 present results of experiments with both readers and writers. In the first set, there is on average 1 writer present (this lock will have substantial contention from writers) and in the second set 0.1 (a somewhat less contended lock).

Fig. 12. Experiment I, EW = 1.
Fig. 13. Experiment R, EW = 1.
Fig. 14. Experiment I, EW = 0.1.
Fig. 15. Experiment R, EW = 0.1.

The rate at which the centralized algorithm grants reader locks increases linearly with system size for all these experiments and, as a result, significantly exceeds the rate granted by MCS for all large system experiments. Even without the aggressive switches, the difference normally exceeds an order of magnitude.

Even though writers have priority over readers, there are vastly fewer writer locks granted with these values of EW. Here also the centralized algorithm outperforms MCS, often by an order of magnitude. For some MCS experiments, writer lock granting rates are so low that no grants were observed.

V. OPEN QUESTIONS

A. Extending the Adaptive Technique

Our adaptive technique sharply reduces queue capacity when a crude detector of hot spot traffic is triggered. While this technique reduces network latency for hot spot polling, it might also be triggered by mixed traffic patterns that would perform better with longer queues. We have neither witnessed nor investigated this effect, which may be mitigated by more gradual adaptive designs that variably adjust queue capacity as a function of a continuously measured rate of combining; one possible shape for such a design is sketched below.
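As a purely speculative illustration (ours; nothing of this form was simulated, and the constants are arbitrary), such a gradual design might derive the permitted queue capacity from a smoothed combining rate rather than a two-level trigger:

    typedef struct { double combineRate; } AdaptiveController;  /* smoothed combines per cycle */

    void observeCycle(AdaptiveController *c, int combinedThisCycle) {
        const double alpha = 1.0 / 64.0;   /* smoothing constant, chosen arbitrarily */
        c->combineRate += alpha * ((double)combinedThisCycle - c->combineRate);
    }

    /* Capacity shrinks linearly from the full size toward a small floor
     * as the measured combining rate rises. */
    int capacityLimit(const AdaptiveController *c, int fullCapacity) {
        const int minCapacity = 2;         /* illustrative floor */
        return fullCapacity - (int)(c->combineRate * (fullCapacity - minCapacity));
    }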
B. Generalization of Combining to Internet Traffic

The tree saturation problem due to hot spot access patterns is not unique to shared memory systems. Congestion generated by flood attacks and flash crowds [12] presents similar challenges for Internet Service Providers. In [15], Mahajan et al. propose a technique to limit the disruption generated by hot spot congestion on network traffic with overlapping communication routes. In their scheme, enhanced servers and routers incorporate mechanisms to characterize hot spot reference patterns. As with adaptive combining, upstream routers are instructed to throttle the hot spot traffic in order to reduce downstream congestion.

Hot spot requests do not benefit from this approach; however, combining may provide an alternative to throttling. For example, the detection of hot spot congestion could trigger deployment of proxies near network entry points, potentially reducing downstream load and increasing hot spot performance. This type of combining is service-type specific, and therefore service-specific strategies must be employed. Dynamic deployment of such edge servers requires protocols for communicating the characteristics of hot spot aggregates to servers, and secure mechanisms to dynamically install and activate upstream proxies.

C. Combining and Cache-Coherency

Cache coherence protocols typically manage shared (read-only) and exclusive (read-write) copies of shared variables. Despite the obvious correspondence between cache coherence and the readers-writers coordination problem, coherence protocols typically serialize the transmission of line contents to individual caches. The SCI cache coherence protocol specifies a variant of combining fetch-and-store to efficiently enqueue requests. However, data distribution and line invalidation on network-connected systems is strictly serialized. Extensions of combining may be able to parallelize cache fill operations. Challenges for such schemes would include the development of an appropriate scalable directory structure that is amenable to (de)combinable transactions.
VI. CONCLUSIONS

Having been surprised at the comparatively poor performance attained by the Ultracomputer's combining network when presented with the 100% hot spot traffic that typifies busy-wait polling of coordination variables, we identified two improvements to the combining switch design that increase its performance significantly. The first improvement is to simply increase the size of one of the buffers present. The more surprising second improvement is to artificially decrease the capacity of the combining queues during periods of heavy combining. These adaptive combining queues better distribute the memory requests across the stages of the network, thereby increasing the overall combining rates.

We then compared the performance of a centralized algorithm for the readers-writers problem with that of the widely used MCS algorithm, which reduces hot spots by busy-wait spinning only on variables stored in the portion of shared memory that is co-located with the processor in question.

Our simulation studies of these two algorithms have yielded several results: First, the performance of the MCS algorithm is improved by the availability of combining. Second, when no readers are present, the MCS algorithm outperforms the centralized algorithm. Finally, when readers are present, the results are reversed. For many of these last experiments, an order of magnitude improvement is seen.

A switch capable of combining memory references is more complex than non-combining switches. An objective of the previous design efforts was to permit a cycle time comparable to that of a similar non-combining switch. In order to maximize switch clock frequency, a (type B, decoupled) design was selected that can combine messages only if they arrive on the same input port and is unable to combine a request at the head of an output queue. We also simulated an aggressive (type A, coupled) design without these two restrictions. As expected it performed very well, but we have not estimated the cycle-time penalty that may occur.

REFERENCES

[1] Allan Gottlieb, Boris Lubachevsky, and Larry Rudolph. Basic Techniques for the Efficient Coordination of Very Large Numbers of Cooperating Sequential Processors. ACM TOPLAS, April 1983.
[2] Carl J. Beckmann and Constantine D. Polychronopoulos. Fast Barrier Synchronization Hardware. In Proc. 1990 Conference on Supercomputing. IEEE Computer Society Press, 1990.
[3] Thinking Machines Corp. The Connection Machine CM-5 Technical Summary, 1991.
[4] Daniel M. Dias and J. Robert Jump. Analysis and Simulation of Buffered Delta Networks. IEEE Trans. Computers, C-30(4):273–282, April 1981.
[5] Susan R. Dickey. Systolic Combining Switch Designs. PhD thesis, Courant Institute, New York University, New York, 1994.
[6] Eric Freudenthal. Comparing and Improving Centralized and Distributed Techniques for Coordinating Massively Parallel Shared-Memory Systems. PhD thesis, NYU, New York, June 2003.
[7] Gyungho Lee, C. P. Kruskal, and D. J. Kuck. On the Effectiveness of Combining in Resolving 'Hot Spot' Contention. Journal of Parallel and Distributed Computing, 20(2), February 1985.
[8] Gregory F. Pfister and V. Alan Norton. "Hot Spot" Contention and Combining in Multistage Interconnection Networks. IEEE Transactions on Computers, C-34(10), October 1985.
[9] Gregory F. Pfister, William C. Brantley, David A. George, Steve L. Harvey, Wally J. Kleinfielder, Kevin P. McAuliffe, Evelin S. Melton, V. Alan Norton, and Jodi Weiss. The IBM Research Parallel Processor Prototype (RP3). In Proc. ICPP, 1985.
[10] James Laudon and Daniel Lenoski. The SGI Origin: A ccNUMA Highly Scalable Server. ACM SIGARCH Computer Architecture News, 1997.
[11] John M. Mellor-Crummey and Michael L. Scott. Scalable Reader-Writer Synchronization for Shared Memory Multiprocessors. ACM Trans. Comput. Systems, 9(1):21–65, 1991.
[12] J. Jung, B. Krishnamurthy, and M. Rabinovich. Flash Crowds and Denial of Service Attacks: Characterization and Implications for CDNs and Web Sites. In Proc. International World Wide Web Conference, pages 252–262, May 2002.
[13] P. Kermani and Leonard Kleinrock. Virtual Cut-Through: A New Computer Communication Switching Technique. Computer Networks, 3:267–286, 1979.
[14] Yue-Sheng Liu. Architecture and Performance of Processor-Memory Interconnection Networks for MIMD Shared Memory Parallel Processing Systems. PhD thesis, New York University, 1990.
[15] R. Mahajan, S. Bellovin, S. Floyd, J. Ioannidis, V. Paxson, and S. Shenker. Controlling High Bandwidth Aggregates in the Network, 2001.
[16] Marc Snir and Jon A. Solworth. Ultracomputer Note 39: The Ultraswitch, a VLSI Network Node for Parallel Processing. Technical report, Courant Institute, New York University, 1984.
[17] P. Courtois, F. Heymans, and D. Parnas. Concurrent Control with Readers and Writers. Comm. ACM, 14(10):667–668, October 1971.
[18] Randall D. Rettberg, William R. Crowther, and Phillip P. Carvey. The Monarch Parallel Processor Hardware Design. IEEE Computer, pages 18–30, April 1990.
[19] Harold Stone. Parallel Processing with the Perfect Shuffle. IEEE Trans. Computers, C-20(2):153–161, 1971.