Combining Switches
NYU Computer Science Technical Report TR2003-849
Eric Freudenthal and Allan Gottlieb
Department of Computer Science
Courant Institute of Mathematical Sciences
New York University
{freudenthal,gottlieb}@nyu.edu
Abstract—Memory system congestion due to serialization of hot spot accesses can adversely affect the performance of interprocess coordination algorithms. Hardware and software techniques have been proposed to reduce this congestion and thereby provide superior system performance. The combining networks of Gottlieb et al. automatically parallelize concurrent hot spot memory accesses, improving the performance of algorithms that poll a small number of shared variables. The widely used "MCS" distributed-spin algorithms take a software approach: they reduce hot spot congestion by polling only variables stored locally.

Our investigations detected performance problems in existing designs for combining networks, and we propose mechanisms that alleviate them. Simulation studies described herein indicate that centralized readers-writers algorithms executed on the improved combining networks have performance at least competitive with the MCS algorithms.

I. INTRODUCTION

It is well known that the scalability of interprocess coordination can limit the performance of shared-memory computers. Since the latency required for coordination algorithms such as barriers or readers-writers increases with the available parallelism, their impact is especially important for large-scale systems. A common software technique used to minimize this effect is distributed local-spinning, in which processors repeatedly access variables stored locally (in so-called NUMA systems, the shared memory is physically distributed among the processors).

A less common technique is to utilize special-purpose coordination hardware, such as the barrier network of [2], the CM-5 Control Network [3], or the "combining network" of [1], and have the processors reference centralized memory. The idea behind combining is that when references to the same memory location meet at a network switch, they are combined into one reference that proceeds to memory. When the response to the combined messages reaches the switch, data held in the "wait buffer" is used to generate the needed second response.

The early work at NYU on combining networks showed their great advantage for certain classes of memory traffic, especially those with a significant portion of hot-spot accesses (a disproportionately large percentage of the references to one or a few locations). It is perhaps surprising that this work did not simulate the traffic generated when all the processors engage in busy-wait polling, i.e., 100% hot-spot accesses (but see the comments on [7] in Section III). When completing studies, begun a number of years ago, of what we expected to be very fast centralized algorithms for barriers and readers-writers, we were particularly surprised to find that the combining network performed poorly in this situation. While it did not exhibit the disastrous serialization characteristic of accesses to a single location without combining, the improvement was much less than expected and our algorithms were not nearly competitive with those based on distributed local-spinning (MCS) [11].

In this paper we briefly review combining networks and present the surprising data just mentioned. We then present two fairly simple changes to the NYU combining switches that enable the system to perform much better. The first change is to increase the wait-buffer size. The second change is more subtle: the network is output-buffered, and a trade-off exists involving the size of the output queues. Large queues are well known to improve performance for random traffic. However, we found that large queues cause poor polling performance. We therefore propose adapting the queue size to the traffic encountered: as more combined messages are present, the queue capacity is reduced. Together, these two simple changes have a dramatic effect on polling, and our centralized barrier and readers-writers algorithms become competitive with the commonly used local-spin algorithms of Mellor-Crummey and Scott (some of which also benefit from the availability of combining).
II. BACKGROUND
Large-scale, shared-memory computation requires memory systems with bandwidth that scales with the number of processors. Multi-stage interconnection fabrics and interleaving of memory addresses among multiple memory units can provide scalable memory bandwidth for memory reference patterns whose addresses are uniformly distributed. Many variants of this architecture have been implemented in commercial and other research systems [9], [10], [18]. However, the serialization of memory transactions at each memory unit is problematic for reference patterns whose mapping to memory units is unevenly distributed. An important cause of non-uniform memory access patterns is hot-spot memory accesses generated by centralized busy-waiting coordination algorithms. The Ultracomputer architecture includes network switches [16] with logic to reduce this congestion by combining into a single request multiple memory transactions (e.g., loads, stores, fetch-and-adds) that reference the same memory address.
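To make the interleaving concrete, here is a minimal sketch (ours; the mapping shown is the common low-order scheme, not the layout of any particular machine) in which consecutive addresses are spread round-robin across the memory modules, so a uniformly distributed reference stream loads all MMs evenly:

    /* Low-order interleaving: consecutive word addresses map to
     * successive memory modules (MMs). */
    unsigned moduleOf(unsigned wordAddr, unsigned numMMs)     { return wordAddr % numMMs; }
    unsigned offsetWithin(unsigned wordAddr, unsigned numMMs) { return wordAddr / numMMs; }

A hot spot defeats this balancing: every poll of a single coordination variable maps to the same module regardless of how the remaining addresses are spread.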
Fig. 1. Eight PE System with Hot-Spot Congestion to MM3.

The Ultracomputer combining switch design utilizes a variant of cut-through routing [13] that imposes a latency of one clock cycle when there is no contention for an outgoing network link. When there is contention for an output port, messages are buffered on queues associated with each output port. Investigations by Dias and Jump [4], Dickey [5], Liu [14], and others indicate that these queues significantly increase network bandwidth for large systems with uniformly distributed memory access patterns.
Systems with high degrees of parallelism can be constructed using these switches: Figure 1 illustrates an eight-processor system with d = 3 stages of routing switches interconnected by a shuffle-exchange [19] routing pattern. References to MM3 are communicated via the components denoted using bold outlines.
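The path a reference takes can be illustrated with destination-tag routing, sketched below (our illustration; we assume one destination bit selects the output port at each stage, and the Ultracomputer's actual port conventions may differ):

    #include <stdio.h>

    /* Destination-tag routing: at stage s (counting from the PE side),
     * a 2-by-2 switch forwards the message on the port selected by one
     * bit of the destination MM number. */
    int outputPort(int destMM, int stage, int d) {
        return (destMM >> (d - 1 - stage)) & 1;   /* 0 = upper, 1 = lower */
    }

    int main(void) {
        int d = 3;                                /* eight PEs and MMs, as in Figure 1 */
        for (int s = 0; s < d; s++)               /* route a request to MM3 = binary 011 */
            printf("stage %d: port %d\n", s, outputPort(3, s, d));
        return 0;
    }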
1) An Overview of Combining: We assume that a single memory module (MM) can initiate at most one request per cycle. [Footnote: This restriction applies to each bank if the MM is itself composed of banks that can be accessed concurrently. The MM design simulated in our experiments can accept one request every four cycles.] Thus unbalanced memory access patterns, such as hot spot polling of a coordination variable, can generate network congestion. Figure 1 illustrates contention among references to MM3. When the rate of requests to one MM exceeds its bandwidth, the switch queues feeding it will fill. Since a switch cannot accept messages when its output buffers are full, a funnel-of-congestion will spread to the network stages that feed the overloaded MM and interfere with transactions destined for other MMs as well. [Footnote: Pfister and Norton [8] called this funnel tree saturation and observed that access patterns containing only 5% hot spot traffic substantially increase memory latency.]
Ultracomputer switches combine pairs of memory requests accessing the same location into a single request to reduce the congestion generated by hot spot memory traffic. When the memory response subsequently arrives at this switch, it is de-combined into a pair of responses that are routed to the requesting PEs. To enable this de-combination, the switch uses an internal wait buffer to hold information found in the request until it is needed to generate the second response. Since combined messages can themselves be combined, this technique has the potential to reduce hot spot contention by a factor of two at each network stage.

2) Combining of Fetch-and-add: Our fetch-and-add based centralized coordination algorithms poll a small number (typically one) of "hot spot" shared variables whose values are modified using fetch-and-add. Thus, as indicated above, it is crucial in a design supporting large numbers of processors not to serialize this activity. The solution employed is to include adders in the MMs (thus guaranteeing atomicity) and to combine concurrent fetch-and-add operations at the switches.

When two fetch-and-add operations referencing the same shared variable, say FAA(X,e) and FAA(X,f), meet at a switch, the combined request FAA(X,e+f) is transmitted and the value e is stored in the wait buffer. Load and fetch-and-add operations directed towards the same hot spot variable can be combined if loads are transmitted as fetch-and-adds whose addends are zero.

Upon receiving FAA(X,e+f), the MM updates X (to X+e+f) and responds with the original value t. [Footnote: Recall that FAA(X,e) is defined to return the old value of X and atomically increment X by the value e.] When the response arrives at the combining switch, the latter transmits t to satisfy the request FAA(X,e) and transmits t+e to satisfy the request FAA(X,f), thus achieving the same effect as if FAA(X,e) was followed immediately by FAA(X,f). This process is illustrated in the upper portion of Figure 2. The cascaded combining of 4 requests at two network stages is illustrated in the lower portion of the same figure.

Fig. 2. Combining of Fetch-And-Adds at a single switch (above) and at multiple switches (below).

Figure 3 illustrates an Ultracomputer combining switch. Each switch contains:
• Two dual-input forward-path combining queues: Entries are inserted and deleted in a FIFO manner and matching entries are combined, which necessitates an ALU to compute the sum e+f.
• Two dual-input reverse-path queues: Entries are inserted and deleted in a FIFO manner.
• Two wait buffers: Entries are inserted, and associative searches are performed with matched entries removed. An included ALU computes X+e.

Fig. 3. Block Diagram of Combining 2-by-2 Switch. Notation: RQ: Reverse (To PE) Queue, WB: Wait Buffer, FCQ: Forward (To MM) Combining Queue.
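The following sketch (ours; the hardware's message traffic is collapsed into ordinary function calls) traces the single-switch example of Figure 2: requests FAA(X,e) and FAA(X,f) are combined into FAA(X,e+f), the addend e is retained in the wait buffer, and the single memory response t is decombined into the two responses t and t+e:

    #include <stdio.h>

    int X = 0;                            /* the hot spot variable, at the MM */

    int mmFetchAndAdd(int addend) {       /* MM-side FAA; atomic in hardware */
        int old = X;
        X += addend;
        return old;                       /* FAA returns the old value */
    }

    int main(void) {
        int e = 1, f = 2;
        int waitBufferAddend = e;         /* the switch records e when combining */
        int t = mmFetchAndAdd(e + f);     /* one combined request, FAA(X, e+f) */
        int respE = t;                    /* decombined response for FAA(X,e) */
        int respF = t + waitBufferAddend; /* decombined response for FAA(X,f) */
        printf("FAA(X,e) -> %d, FAA(X,f) -> %d, X = %d\n", respE, respF, X);
        return 0;   /* same outcome as FAA(X,e) followed immediately by FAA(X,f) */
    }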
A. When Combining Can Occur

Network latency is proportional to switch cycle times and grows with queuing delays. VLSI simulations showed that the critical path in a proposed Ultracomputer switch included the adder to form e+f and the output drivers. To reduce cycle time, at the cost of restricting the circumstances in which combining could occur, the chosen design did not combine requests that were at the head of the output queue (and hence might be transmitted the same cycle as combined). This modification reduced the critical path timing to the maximum of the adder and driver delays rather than their sum.
Fig. 4. Memory Latency for Ultraswitches with Wait Buffer Capacities of 8 ("Baseline") and 100 ("Waitbuf100") messages for 10% and 100% Hotspot Traffic, 1 Outstanding Request/PE. (Round-trip latency is plotted against log(# processors).)
We call the modified design "decoupled" because the adder and driver are in a sense decoupled, and call the original design "coupled". Since the head entry cannot be combined, we note that a decoupled queue requires at least three requests for combining to occur. We shall see that this trivial-sounding observation is important.
To enable the dual-input queues to accept items on each input in one cycle, each queue is constructed from two independent single-input queues whose outputs are multiplexed. To achieve the maximum combining rate, we therefore require at least three requests in each of the single-input combining queues, which implies at least six in each dual-input combining queue. A more complicated dual-input decoupled combining queue, dubbed type A in Dickey [5], requires only three messages to achieve the maximum combining rate rather than the six required by the "type B" design we are assuming.
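The decoupled matching rule, and the three-request observation above, can be made concrete with a sketch (ours; the fabricated queues perform this match with systolic hardware, not a software scan):

    typedef struct { unsigned addr; int addend; } Request;
    typedef struct { Request slot[8]; int count; } CombiningQueue;  /* slot[0] is the head */

    /* Return the index of a queued request that may combine with a new
     * arrival destined for 'addr', or -1 if none. The scan starts at 1
     * because the head may be transmitted this very cycle (the
     * "decoupled" restriction), so combining first becomes possible
     * when a third request arrives and matches a non-head entry. */
    int findCombinable(const CombiningQueue *q, unsigned addr) {
        for (int i = 1; i < q->count; i++)
            if (q->slot[i].addr == addr)
                return i;
        return -1;
    }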
III. IMPROVING THE PERFORMANCE OF BUSY-WAIT POLLING

Figure 4 plots memory latency for simulated systems of four to 2048 PEs with memory traffic containing 10% and 100% hot spot references (the latter typifies the traffic when processors are engaged in busy-wait polling and the local caches filter out instruction and private data references). The simulated multiprocessor approximates the Ultracomputer design. Observe that for 10% hot spot references the latency is only slightly greater than the minimum network transit time (1 cycle per stage per direction) plus the simulated memory latency of 2 cycles.
On larger systems, memory latency is substantially greater for the 100% hot spot load and can exceed 10 times the minimum. Since the combining switches simulated were expected to perform well for this traffic, the results were surprising, especially to the senior author, who was heavily involved with the Ultracomputer project throughout its duration. It is clear that our current objective of exhibiting high-performance centralized coordination algorithms cannot be achieved using these simulated switches.
We have discovered two flaws in the combining switch design. The first is that the wait buffer was too small; the second is that, in a sense to be explained below, the combining queues were too large.
A. Increasing the Wait Buffer Capacity
The fabricated switches had wait buffers capable of holding 8 entries. We see in Figure 4 that increasing the capacity to 100 entries (feasible with today's technology) reduces the latency for a 2048 PE system from 306 to 168 cycles, an improvement of 45%. While this improvement certainly helps, the latency of hot spot polling traffic remains seven times the latency of uniformly distributed reference patterns (24 cycles).
B. Adaptive Combining Queues
In order to supply high bandwidth for typical uniformly distributed traffic (i.e., 0% hot spot), it is important for the switch queues to be large. However, as observed by [7], busy-wait polling (100% hot spot) is poorly served by these large queues, as we now describe.
For busy-wait polling, each processor always has one outstanding request directed at the same location. [Footnote: This is not quite correct: when the response arrives, it takes a few cycles before the next request is generated. Our simulations accurately account for this delay.] Our expectation was that, with N processors and hence log N stages of switches, pairs would combine at each stage; halving the traffic at each of the log N stages would result in just one (or perhaps more realistically a few) requests reaching memory.
Fig. 5. Combining rate, by stage, for simulated polling on systems of 2^2 to 2^11 PEs. Wait buffers have capacity 100 and combining queues can hold 4 combined or uncombined messages. In the right plot the combining queues are declared full if two combined requests are present. (Combines per kilocycle are plotted against distance to memory.)
What happens instead is that the queues in switches near memory fill to capacity and the queues in the remainder of the switches are nearly empty. Since combining requires multiple entries to be present, it can only occur near memory. However, a single switch cannot combine an unbounded number of requests into one. Those fabricated for the Ultracomputer could combine only pairs, so if, for example, eight requests are queued for the same location, (at least) four requests will depart. [Footnote: Alternate designs could combine more than two requests into one but, as observed by [7], when this "combining degree" increases, congestion arises at the point where the single response is de-combined into many.]

Figure 5 illustrates this effect. Both plots are for hot spot polling. The left plot is for simulated switches modeled after those fabricated for the Ultracomputer, but with the wait buffer size increased to 100. Each queue has four slots, each of which can hold a request received by the switch or one formed by combining two received requests. The plot on the right is for the same switches with the queue capacity restricted so that, if 2 combined requests are present, the queue is declared full even if empty slots remain. We compare the graphs labeled 10 (representing a system with 1024 PEs) in each plot. In the left plot, with the original switches, we find that combines occur at the maximal rate for the four (out of 10) stages closest to memory, occur at nearly the maximal rate for the fifth stage, and do not occur for the remaining five stages. The restricted switches do better, combining at the maximal rate for five stages and at 1/4 of the maximum for the sixth stage. Note that for uniformly distributed traffic without hot spots, combining will very rarely occur and the artificial limit of 2 combined requests per queue will not be invoked. We call this new combining queue design adaptive since the queues are full size for uniformly distributed traffic and adapt to busy-wait polling by artificially reducing their size.
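In sketch form (ours; the hardware realizes this as a simple occupancy test on each queue), the adaptive rule changes only the queue's notion of "full":

    typedef struct {
        int count;           /* occupied slots; physical capacity is 4 */
        int combinedCount;   /* occupants that are combined requests */
    } AdaptiveQueue;

    /* The queue refuses further input when physically full OR as soon as
     * two combined requests are present, even if empty slots remain.
     * Without hot spots, combined requests are rare, so the artificial
     * limit is never invoked and the full capacity remains available. */
    int queueIsFull(const AdaptiveQueue *q) {
        return q->count >= 4 || q->combinedCount >= 2;
    }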
Fig. 6. Memory latency for simulated hotspot polling traffic, 4-2048 PEs. (Curves: Baseline, Waitbuf100, Improved, Aggressive.)

We see in Figure 6 that the increased combining rate achieved by the adaptive queues dramatically lowers the latency experienced during busy-wait polling. For a 2048 PE system the reduction is from 168 cycles, for a system with large (100-entry) wait buffers and the original queues, to 118 cycles (five times the latency of uniform traffic) with the same wait buffers but adaptive combining queues. This is a reduction of over 30% and gives a total reduction of 61% when compared with the 306 cycles needed by the baseline switches. The bottom two plots are for more aggressive switch designs that are discussed below. In Section IV we shall see that centralized coordination algorithms executed on systems with adaptive queues and large wait buffers are competitive with their distributed local-spinning alternatives.
Figure 7 compares the latency for 1024 PE systems with various switch designs and a range of accepted loads (i.e., processors can have multiple outstanding requests, unlike the situation above for busy-wait polling). These results confirm our assertion that adaptive queues have very little effect for low hot spot rates and are a considerable improvement for high rates.

Fig. 7. Simulated Round-trip Latency over a Range of Offered Loads for 1% (left), 20% (middle) and 100% (right) Hot Spot Traffic. (Curves: Baseline, Waitbuf100, Improved, Aggressive.)

C. More Aggressive Combining Queues

Recall that we have been simulating decoupled type B switches, in which combining is disabled for the head entry (to "decouple" the ALU and output drivers) and the dual-input combining queues are composed of two independent single-input combining queues with multiplexed outputs. We started with a "baseline design", used in the Ultracomputer, and produced what we refer to as the "improved design", having a larger wait buffer and adaptive combining queues. We also applied the same two improvements to type A switches having coupled ALUs and refer to the result as the "aggressive design" or "aggressive switches". For example, the lowest plot in Figure 6 is for aggressive switches. Other experiments not presented here have shown that aggressive switches permit significant rates of combining to occur in network stages near the processors. Also, as we will show in Section IV, the centralized coordination algorithms perform exceptionally well on this architecture. Although aggressive switches are the best performing, we caution the reader that our measurements are given in units of a switch cycle time and, without a more detailed design study, we cannot estimate the degradation in cycle time such aggressive switches might entail.

D. Applicability of Results to Modern Systems

The research described above investigates systems whose components have similar speeds, as was typical when this project began. During the intervening decade, however, logic and communication rates have increased by more than two orders of magnitude while DRAM latency has improved by less than a factor of two.

In order to better model the performance obtainable with modern hardware, we increased the memory latency from two to thirty-eight cycles, and the interval between accepting requests from four to forty cycles. These results are plotted in Figure 8 and indicate that the advantage achieved by the adaptive switch design is even greater than before.

Fig. 8. Memory latency for hotspot polling on systems with MMs that can accept one message every 40 cycles. (Curves: Waitbuf100 MMWait40, Improved MMWait40, Aggressive MMWait40.)

Various techniques can mitigate the effect of slow memory. For example, having multiple sub-banks per MM would increase memory bandwidth and improve the performance on uniformly distributed traffic, and caches at the MM are well suited to the temporal locality of hot spot accesses.
IV. PERFORMANCE EVALUATION OF CENTRALIZED AND MCS READER-WRITER COORDINATION

Many algorithms for coordinating readers and writers have appeared. Courtois and Heymans [17] introduced the problem and presented a solution that serializes the entry of readers as well as writers. More importantly, these algorithms generate hot spot polling traffic and therefore suffer slowdowns due to memory system congestion on systems that do not combine hot spot references.
A centralized algorithm is presented in [1] that, on systems capable of combining fetch-and-add operations, eliminates the serialization of readers in the absence of writers. However, no commercial systems with this capability have been constructed and, in their absence, alternative "distributed local-spin" MCS algorithms have been developed [11] that minimize hot spot traffic by having each processor busy-wait on a shared variable stored in memory co-located with this processor. This NUMA memory organization results in local busy waiting that does not contribute to or encounter network congestion. [Footnote: Some authors use the term NUMA to simply mean that memory access is non-uniform: certain locations are further away than others. We use it to signify that (at least a portion of) the shared memory is distributed among the processors, with each processor having direct access to the portion stored locally.]

The most likely cause of unbounded waiting in any reader-writer algorithm is that a continual stream of readers can starve all writers. The standard technique of giving writers priority eliminates this possibility. [Footnote: But naturally permits writers to starve readers.] In this section we present a performance comparison of "best of breed" centralized and MCS writer-priority reader-writer algorithms, each executed on a simulated system with the architectural features it exploits.
Roughly similar results hold for barrier algorithms; the interested reader is referred to [6].

A. Centralized Algorithms for Readers and Writers

Figure 9 presents a centralized reader-writer lock using fetch-and-add. This algorithm issues only a single shared-memory reference (a fetch-and-add) when the lock is uncontested. We know of no other algorithm with this desirable property. A more complete description of the techniques employed by this algorithm appears in [6].
    shared int C = 0;   // K: a constant exceeding the maximum possible number of readers

    faaReaderLock() {
        for (;;) {
            int c = faa(C, 1);         // request
            if (c < K) return;         // no writers? succeeded!
            faa(C, -1);                // cancel request
            while (c >= K)             // while a writer is active...
                c = C;                 //   ...read C
        }
    }

    faaReaderUnlock() { faa(C, -1); }

    bool faaWriterLock() {
        for (;;) {
            int c = faa(C, K);         // request
            if (c < K) {
                while ((C % K) != 0);  // wait for readers
                return true;           // succeeded
            }
            c = faa(C, -K) - K;        // conflict: must decrement & retry
            while (c >= K)             // while a writer is active...
                c = C;                 //   ...read C
        }
    }

    faaWriterUnlock() { faa(C, -K); }

Fig. 9. Fetch-and-add Readers-Writers Lock.

B. Hybrid Algorithm of Mellor-Crummey and Scott

The writer-priority readers-writers algorithm of Mellor-Crummey and Scott [11] is commonly used on large SMP systems. This algorithm, commonly referred to as MCS, is a hybrid of centralized and distributed approaches. Central state variables, manipulated with various synchronization primitives, are used to count the number and type of locks granted at any time and to head the lists of waiting processors. NUMA memory is used for busy waiting, which eliminates network contention.

C. Overview of Experimental Results

A series of micro-benchmark experiments were performed to evaluate the performance of centralized fetch-and-add based coordination against the hybrid algorithm of Mellor-Crummey and Scott. The simulated hardware includes combining switches, which improve the performance of MCS and are crucial for the centralized algorithm, as well as NUMA memory, which is important for MCS and not used by the centralized algorithm. All the switch designs described above, plus others, were simulated in the junior author's dissertation [6]. A summary of the results is presented below.

Neither algorithm dominates the other: the centralized algorithms are superior except when only writers are present. Recall that an ideal reader lock, in the absence of contention, yields linear speedup, whereas an ideal writer exhibits no slowdown as parallelism increases. When there are large numbers of readers present, the centralized algorithm, with its complete lack of reader serialization, thus gains an advantage, which is greater for the aggressive architecture.

D. Experimental Framework

The scalability of locking primitives is unimportant when they are executed infrequently with low contention. Our experiments consider the more interesting case of systems that frequently request reader and writer locks. For all experiments, each process repeatedly executes the following steps (sketched in code below):
• Stochastically chooses (at a fixed, but parameterized, probability) whether to obtain a reader or writer lock. [Footnote: The simulated random number generator executes in a single cycle.]
• Issues a fixed sequence of non-combinable shared memory references distributed among multiple MMs—the "simulated work".
• Releases the lock.
• Waits a fixed delay.

Each experiment measures the rate at which locks are granted over a range of system sizes (higher values are superior). In order to generate equivalent contention from writers in each plot, the expected number of writers EW is held constant for all system sizes. Thus, the probability of a process becoming a writer is EW/Parallelism.

Each micro-benchmark has four parameters:
• PAR: the number of processors in the simulated system.
• EW: the expected number of writers.
• Work: the number of shared accesses executed while holding a lock.
• Delay: the number of cycles a process waits after releasing a lock.

Two classes of experiments were performed: those classified "I" measure the cost of intense synchronization, in which each processor requests and releases locks at the highest rate possible, Work = Delay = 0. Those classified "R" are somewhat more realistic, Work = 10 and Delay = 100.
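In outline (our paraphrase; the helper names are hypothetical stand-ins for the simulator's primitives, not its actual interface), each simulated process runs a loop of this shape:

    #include <stdbool.h>

    extern bool randomChance(double p);   /* true with probability p; one cycle */
    extern void simulatedWork(int refs);  /* issue 'refs' non-combinable references across MMs */
    extern void idleCycles(int n);        /* wait the fixed post-release delay */
    extern void faaReaderLock(void);  extern void faaReaderUnlock(void);
    extern bool faaWriterLock(void);  extern void faaWriterUnlock(void);

    void benchmarkProcess(double pWriter, int work, int delay) {
        for (;;) {
            bool writer = randomChance(pWriter);   /* pWriter = EW / PAR */
            if (writer) faaWriterLock(); else faaReaderLock();
            simulatedWork(work);                   /* the Work parameter */
            if (writer) faaWriterUnlock(); else faaReaderUnlock();
            idleCycles(delay);                     /* the Delay parameter */
        }
    }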
1) All-Reader Experiments, EW = 0: The left-side charts in Figures 10 and 11 present results from experiments where all processes are readers. The centralized algorithm requires a single hot-spot memory reference to grant a reader lock in the absence of writers. In contrast, the MCS algorithm generates accesses to centralized state variables and linked lists of requesting readers. Not surprisingly, the centralized algorithms have significantly superior performance in this experiment. For some configurations, an order of magnitude difference is seen. We also observe that MCS does benefit significantly from combining. Even when only MCS uses the aggressive switches, the centralized algorithm is superior. Naturally, the differences are greater in the "intense" experiments.

Fig. 10. Experiment R, All Readers (left), All Writers (right).
Fig. 11. Experiment I, All Readers (left), All Writers (right).

2) All-Writer Experiments: The right-side charts in Figures 10 and 11 present results from experiments where all processes are writers, which must serialize and therefore typically spend a substantial period of time busy-waiting. The MCS algorithm has superior performance in these experiments. Since writers enforce mutual exclusion, no speedup is possible as the system size increases. Indeed, one expects a slowdown due to the increased average distance to memory. The anomalous speedup observed for MCS between 4 and 16 processors is due to the algorithm's decoupling of queue insertion and lock passing. This effect is described in [11]. As explained in [6], the MCS algorithm issues very little hot spot traffic when no readers are present and thus does not benefit from combining in these experiments.

3) Mixed Reader-Writer Experiments: Figures 12 through 15 present results of experiments with both readers and writers. In the first set, there is on average 1 writer present (this lock will have substantial contention from writers) and in the second set 0.1 (a somewhat less contended lock).

Fig. 12. Experiment I, EW = 1.
Fig. 13. Experiment R, EW = 1.
Fig. 14. Experiment I, EW = 0.1.
Fig. 15. Experiment R, EW = 0.1.

The rate at which the centralized algorithm grants reader locks increases linearly with system size for all these experiments and, as a result, significantly exceeds the rate granted by MCS for all large system experiments. Even without the aggressive switches, the difference normally exceeds an order of magnitude.

Even though writers have priority over readers, there are vastly fewer writer locks granted with these values of EW. Here also the centralized algorithm outperforms MCS, often by an order of magnitude. For some MCS experiments, writer lock granting rates are so low that no grants were observed.

V. OPEN QUESTIONS

A. Extending the Adaptive Technique

Our adaptive technique sharply reduces queue capacity when a crude detector of hot spot traffic is triggered. While this technique reduces network latency for hot spot polling, it might also be triggered by mixed traffic patterns that would perform better with longer queues. We have neither witnessed nor investigated this effect, which may be mitigated by more gradual adaptive designs that variably adjust queue capacity as a function of a continuously measured rate of combining; one possible shape for such a design is sketched below.
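As a purely speculative illustration (ours; nothing of this form was simulated, and the constants are arbitrary), such a gradual design might derive the permitted queue capacity from a smoothed combining rate rather than a two-level trigger:

    typedef struct { double combineRate; } AdaptiveController;  /* smoothed combines per cycle */

    void observeCycle(AdaptiveController *c, int combinedThisCycle) {
        const double alpha = 1.0 / 64.0;   /* smoothing constant, chosen arbitrarily */
        c->combineRate += alpha * ((double)combinedThisCycle - c->combineRate);
    }

    /* Capacity shrinks linearly from the full size toward a small floor
     * as the measured combining rate rises. */
    int capacityLimit(const AdaptiveController *c, int fullCapacity) {
        const int minCapacity = 2;         /* illustrative floor */
        return fullCapacity - (int)(c->combineRate * (fullCapacity - minCapacity));
    }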
B. Generalization of Combining to Internet Traffic

The tree saturation problem due to hot spot access patterns is not unique to shared memory systems. Congestion generated by flood attacks and flash crowds [12] presents similar challenges for Internet Service Providers. In [15], Mahajan et al. propose a technique to limit the disruption generated by hot spot congestion on network traffic with overlapping communication routes. In their scheme, enhanced servers and routers incorporate mechanisms to characterize hot spot reference patterns. As with adaptive combining, upstream routers are instructed to throttle the hot spot traffic in order to reduce downstream congestion.

Hot spot requests do not benefit from this approach; however, combining may provide an alternative to throttling. For example, the detection of hot spot congestion could trigger deployment of proxies near network entry points, potentially reducing downstream load and increasing hot spot performance. This type of combining is service-type specific, and therefore service-specific strategies must be employed. Dynamic deployment of such edge servers requires protocols for communicating the characteristics of hot spot aggregates to servers, and secure mechanisms to dynamically install and activate upstream proxies.

C. Combining and Cache-Coherency

Cache coherence protocols typically manage shared (read-only) and exclusive (read-write) copies of shared variables. Despite the obvious correspondence between cache coherence and the readers-writers coordination problem, coherence protocols typically serialize the transmission of line contents to individual caches. The SCI cache coherence protocol specifies a variant of combining fetch-and-store to efficiently enqueue requests. However, data distribution and line invalidation on network-connected systems is strictly serialized. Extensions of combining may be able to parallelize cache fill operations. Challenges for such schemes would include the development of an appropriate scalable directory structure that is amenable to (de)combinable transactions.
VI. CONCLUSIONS

Having been surprised at the comparatively poor performance attained by the Ultracomputer's combining network when presented with the 100% hot spot traffic that typifies busy-wait polling of coordination variables, we identified two improvements to the combining switch design that increase its performance significantly. The first improvement is to simply increase the size of one of the buffers present. The more surprising second improvement is to artificially decrease the capacity of the combining queues during periods of heavy combining. These adaptive combining queues better distribute the memory requests across the stages of the network, thereby increasing the overall combining rates.

We then compared the performance of a centralized algorithm for the readers-writers problem with that of the widely used MCS algorithm, which reduces hot spots by busy-wait spinning only on variables stored in the portion of shared memory that is co-located with the processor in question.

Our simulation studies of these two algorithms have yielded several results: First, the performance of the MCS algorithm is improved by the availability of combining. Second, when no readers are present, the MCS algorithm outperforms the centralized algorithm. Finally, when readers are present, the results are reversed. For many of these last experiments, an order of magnitude improvement is seen.

A switch capable of combining memory references is more complex than non-combining switches. An objective of the previous design efforts was to permit a cycle time comparable to that of a similar non-combining switch. In order to maximize switch clock frequency, a (type B, decoupled) design was selected that can combine messages only if they arrive on the same input port and is unable to combine a request at the head of an output queue. We also simulated an aggressive (type A, coupled) design without these two restrictions. As expected it performed very well, but we have not estimated the cycle-time penalty that may occur.

REFERENCES

[1] Allan Gottlieb, Boris Lubachevsky, and Larry Rudolph. Basic Techniques for the Efficient Coordination of Very Large Numbers of Cooperating Sequential Processors. ACM TOPLAS, April 1983.
[2] Carl J. Beckmann and Constantine D. Polychronopoulos. Fast Barrier Synchronization Hardware. In Proc. 1990 Conference on Supercomputing. IEEE Computer Society Press, 1990.
[3] Thinking Machines Corp. The Connection Machine CM-5 Technical Summary, 1991.
[4] Daniel M. Dias and J. Robert Jump. Analysis and Simulation of Buffered Delta Networks. IEEE Trans. Computers, C-30(4):273–282, April 1981.
[5] Susan R. Dickey. Systolic Combining Switch Designs. PhD thesis, Courant Institute, New York University, New York, 1994.
[6] Eric Freudenthal. Comparing and Improving Centralized and Distributed Techniques for Coordinating Massively Parallel Shared-Memory Systems. PhD thesis, NYU, New York, June 2003.
[7] Gyungho Lee, C. P. Kruskal, and D. J. Kuck. On the Effectiveness of Combining in Resolving 'Hot Spot' Contention. Journal of Parallel and Distributed Computing, 20(2), February 1985.
[8] Gregory F. Pfister and V. Alan Norton. "Hot Spot" Contention and Combining in Multistage Interconnection Networks. IEEE Transactions on Computers, C-34(10), October 1985.
[9] Gregory F. Pfister, William C. Brantley, David A. George, Steve L. Harvey, Wally J. Kleinfielder, Kevin P. McAuliffe, Evelin S. Melton, V. Alan Norton, and Jodi Weiss. The IBM Research Parallel Processor Prototype (RP3). In Proc. ICPP, 1985.
[10] James Laudon and Daniel Lenoski. The SGI Origin: A ccNUMA Highly Scalable Server. ACM SIGARCH Computer Architecture News, 1997.
[11] John M. Mellor-Crummey and Michael L. Scott. Scalable Reader-Writer Synchronization for Shared Memory Multiprocessors. ACM Trans. Comput. Systems, 9(1):21–65, 1991.
[12] J. Jung, B. Krishnamurthy, and M. Rabinovich. Flash Crowds and Denial of Service Attacks: Characterization and Implications for CDNs and Web Sites. In Proc. International World Wide Web Conference, pages 252–262, May 2002.
[13] P. Kermani and Leonard Kleinrock. Virtual Cut-Through: A New Computer Communication Switching Technique. Computer Networks, 3:267–286, 1979.
[14] Yue-Sheng Liu. Architecture and Performance of Processor-Memory Interconnection Networks for MIMD Shared Memory Parallel Processing Systems. PhD thesis, New York University, 1990.
[15] R. Mahajan, S. Bellovin, S. Floyd, J. Ioannidis, V. Paxson, and S. Shenker. Controlling High Bandwidth Aggregates in the Network, 2001.
[16] Marc Snir and Jon A. Solworth. Ultracomputer Note 39: The Ultraswitch, a VLSI Network Node for Parallel Processing. Technical report, Courant Institute, New York University, 1984.
[17] P. Courtois, F. Heymans, and D. Parnas. Concurrent Control with Readers and Writers. Comm. ACM, 14(10):667–668, October 1971.
[18] Randall D. Rettberg, William R. Crowther, and Phillip P. Carvey. The Monarch Parallel Processor Hardware Design. IEEE Computer, pages 18–30, April 1990.
[19] Harold Stone. Parallel Processing with the Perfect Shuffle. IEEE Trans. Computers, C-20(2):153–161, 1971.