
Parser Framework for Information Extraction

Csaba Dezsényi
Budapest University of Technology and Economics
dezsenyi@mit.bme.hu

Tamás Mészáros
Budapest University of Technology and Economics
meszaros@mit.bme.hu

Tadeusz Dobrowiecki
Budapest University of Technology and Economics
dobrowiecki@mit.bme.hu

Abstract

Various document parsing methods are required in applications that perform complex information extraction. The development of such parsing schemes can be simplified by decomposing the complex extraction process into simple steps that can be realized by elementary parser modules. The authors present a general framework for the development of IE applications that applies different parsers independently and produces a semantically joint result.

Keywords: Information extraction, parser network, planning for information retrieval

1 Introduction

Techniques of Information Extraction (IE) are intensely used nowadays in an ever growing spectrum of applications [1]. We can find them in the smallest information processing tools (e.g. email filtering, web assistants), but also in the largest, corporation-wide solutions (e.g. knowledge management and decision support systems, search engines). IE tools, generally called parsers, are based on several distinct techniques and theories, such as statistics [2], pattern fitting [3], machine learning [4], natural language processing (NLP) [5], and many other methods. In simple commercial products these techniques do not appear in the core services; they are confined locally to particular functions.

In applications that perform complex information and knowledge discovery, however, multiple parsing methods are required. Assume, for example, that we would like to develop an application that retrieves economic articles from news portals, extracts simple facts about companies and persons, and puts them into the application's knowledge base for further processing. Several parsing tasks are required to extract such information. We have to extract the article from the HTML page, we have to identify the company and person names in the text, we have to use a sentence parser to extract simple subject-predicate-object triplets, and, for example, we can use statistical analysis to calculate relevance measurements for controlling local search. It is relatively simple to implement and execute each of these parsers separately. However, the results of the independent parsers should be semantically linked, e.g. we may need to identify whether the subject of a sentence is a company. What we require, then, is to discover and map semantic associations between the results of various parsers and to provide services, such as querying or navigation, among them. Moreover, parsers should use each other's results, which supports the reuse of already written modules and makes it easier to create parsing schemes.

To summarize the main objectives: we want to solve a complex document parsing problem by first decomposing the process into simple parser operations and then fusing together the particular results. The parsers would work as separate modules, but together their results should form a coherent and complete result of the whole parsing process.

To realize this aim, we have first designed the theoretical basis of the framework, which involves general principles, architectural considerations, and methodological guidelines. We have created a data model that implicitly maps semantic associations between the elements of independent parser results. We are also working on techniques for the automated planning of parser configurations that would be required for a particular application. We have started to develop an implementation of the abstract framework: a general, XML-based parser framework system. Our aim is to realize the system as a fundamental and standard tool that can be used to develop arbitrary IE applications. In this paper we present the theoretical basis of the abstract framework and the key challenges of the research.

The parser framework is a part of the Information and Knowledge Fusion (IKF) project [6], in which a complex knowledge-based information retrieval application is under development that provides intelligent decision support for financial institutions. However, we conceptualize and design the framework to be general and completely application independent.

2 From Source Documents to Extracted Information

The process of information extraction can be divided into three phases: retrieval, parsing and extraction (Fig. 1). First, a document has to be retrieved from the source environment of the application. The result is an initially structured document. For example, in the IKF project we apply an intelligent document retrieval agent that uses models of document sources for recognizing basic semantic structures in the incoming documents [7]. In the simplest case, an application may only require text files read from a local directory. The retrieval phase can also include format conversion processes, document merging or splitting, and other required preprocessing.

The second step is to parse the document in several ways to produce various semantically structured representations of the original source, which we call views. These views should contain the required information elements, and they provide the input of the third step, where the information is extracted from them and stored in a local knowledge repository of the application.

Figure 1: Phases of information extraction

The retrieval system feeds the parser framework with initial documents, while the IE process specifies what kind of logical representations of the source (views) are required to accomplish the extraction. The task of the framework system is to apply suitable parsers, to plan the running sequence by combining them, to execute and control the parsing, and to produce a semantically linked and complete result for the extraction.

3 Views and Parsers

A traditional parser is usually considered a complete processing unit. Here, parsers are basic building blocks that can be used to construct a complex parsing scheme. We should analyze IE parsers to find their general characteristics and to define an abstract parser model. This ensures that the framework can handle parsing problems uniformly and can operate without referring to the concrete implementations. We also need to define an abstraction of parser results to handle them independently of concrete realizations.

The task of a parser is to recognize certain parts or elements in its input document and transform them into its output view. The resulting view of a given type of parser is a kind of information projection of the source that contains the transformed information in a new, semantically structured form. Each view belongs to a type that defines what kind of information it contains and specifies its structure by schema definitions. A parser produces views, and the input of the parsing operation also consists of one or more views. In this way, parsers can use each other's results to create new views. The set of all views created in the framework is the complete result of a parsing operation. The input of the framework is the initial view that contains the original source document in its initial structure.
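The abstract model described above can be illustrated with a minimal Python sketch. It is not the framework's actual interface: the class names `View` and `Parser`, their fields, and the two toy parsers are assumptions made for illustration only, and plain Python objects stand in for the structured view content.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class View:
    """A typed, structured representation of the source (hypothetical model)."""
    view_type: str   # e.g. "initial" or "sentences"
    content: object  # structured payload; any Python object in this sketch


@dataclass
class Parser:
    """An elementary parser: reads existing views and writes one new view."""
    name: str
    input_types: List[str]                     # view types this parser needs
    output_type: str                           # view type this parser produces
    run: Callable[[Dict[str, View]], object]   # the parsing logic itself

    def apply(self, views: Dict[str, View]) -> View:
        # Hand over only the declared inputs; a missing view raises KeyError.
        inputs = {t: views[t] for t in self.input_types}
        return View(self.output_type, self.run(inputs))


def split_sentences(inputs: Dict[str, View]) -> List[str]:
    # Toy sentence splitter working on the initial view.
    text = inputs["initial"].content
    return [s.strip() for s in text.split(".") if s.strip()]


def count_words(inputs: Dict[str, View]) -> List[int]:
    # Toy parser that reuses the result of the sentence splitter.
    return [len(s.split()) for s in inputs["sentences"].content]


# The initial view holds the source document; every parser adds one view.
views = {"initial": View("initial", "MOL bought shares. The price rose.")}
for parser in (Parser("splitter", ["initial"], "sentences", split_sentences),
               Parser("counter", ["sentences"], "word-counts", count_words)):
    views[parser.output_type] = parser.apply(views)
```

After the loop, `views` holds the initial view plus one view per parser, which corresponds to the complete result of the parsing operation in the sense used above.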

We decided to implement views as XML¹ documents, because XML fits the demand of carrying semantically marked information well. XML is also a widespread standard; thus, we can benefit from the various standard tools that are freely available (e.g. DOM, XML Schema, XPath, XSLT, XQuery, etc.²).

¹ More details about XML on http://www.w3c.org/XML/

3.1 View Networks

Independent parsers can produce several different views, but one of the key challenges is to map associations between them to enable composite queries and navigation by linking semantically related information pieces. To keep the framework general, the parsing process should be incremental: a parser can only access the already existing views (including the initial view with the original source), and it can only write its own output view. Therefore, we can map associations between elements of the new view and elements of the existing ones. The main idea is that, during their execution, parsers create links between their source and target elements that are considered identical entities. In this way, views are linked together (by their elements) and form a complete structure called a View Network (VN). The relations between elements of two independent views are established without creating explicit (direct) links between them, because they can be associated through the links to their sources.

Figure 2: View Network (the global and the detailed structure)

² Information about several XML standards can be found on http://www.w3c.org

The internal structure of a view can be represented as a tree (Fig. 2, right). Each node in this tree corresponds to an information element. Such elements are the atomic components that can carry semantically marked information. An element consists of content and references called information element source references (IESR). The content can be textual or can encapsulate other child element nodes, as in standard XML. The IESRs designate the source elements that serve as the information sources of the given element; thus they can be considered identical entities. The IESRs establish relations between elements of different views. Suppose, for example (see Fig. 2), that V1 is the result of a parser that marks word and sentence boundaries, V2 is an index with terms and weights, and V3 contains some recognized two-word expressions. We can easily obtain the relation between, e.g., the index terms and the expressions through the references that connect them in V1.
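The V1, V2, V3 example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Element` class, the word identifiers `w1` to `w4`, and the sample terms are invented here, and the IESRs are reduced to plain sets of V1 element identifiers rather than XML references.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class Element:
    """One information element of a view (hypothetical model)."""
    view: str                               # owning view, e.g. "V2"
    content: str                            # textual content of the element
    sources: FrozenSet[str] = frozenset()   # IESRs: ids of source elements in V1


def related(a: List[Element], b: List[Element]) -> List[Tuple[str, str]]:
    """Pair elements of two views that share at least one source element."""
    return [(x.content, y.content) for x in a for y in b if x.sources & y.sources]


# V1 (words and sentences) is represented only by its element ids w1..w4.
# V2: index terms; weights are omitted to keep the sketch short.
v2 = [Element("V2", "oil", frozenset({"w1"})),
      Element("V2", "profit", frozenset({"w4"}))]
# V3: recognized two-word expressions.
v3 = [Element("V3", "oil company", frozenset({"w1", "w2"}))]

pairs = related(v2, v3)
print(pairs)
```

No direct link between V2 and V3 is stored anywhere; the pairing is derived purely from the source references that both views hold into V1.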

On the left side of Figure 2, the global VN can be seen, if we consider views as atomic nodes and shrink the IESRs to view source references (VSR) that designate the source views that serve as the information sources of a given view. At this level, the VN is a directed acyclic graph (DAG), due to the incremental character of the parsing process. The sink node is the initial view, which should contain the initially structured source document that is the root information source for all parsing operations in the VN.

3.2 Planning the Parsing

The task of the framework is to produce the views that are required for the extraction of relevant information. It applies the parser modules, but it must plan the right running sequence or graph, because some parsers may need the outputs of other ones, some can work on several views so that the required one has to be constrained, etc. The framework can partially or fully automate the creation of parsing plans by adopting standard AI planning algorithms (e.g. STRIPS-based partial-order planning [8]).

Figure 3: Parser attributes

There are two factors that constrain a parsing plan: (1) the nature of the parsers, and (2) the conditions of the current application requirements. Parsers are the elementary steps of a parsing operation. A parser can produce a given type of view by processing certain other views. Thus, parsers have to declare (a) conditions on their required input views, and (b) definitions of their resulting output views (Fig. 3). In addition, parsers may provide definitions of their parameter lists and possible values, and certain quality and work measurements³ that support the generation of optimal plans.

The requirements of the actual application (what we call a parsing task) can be defined in the same way as for the parsers: we should declare conditions on the required views. Moreover, we may define additional conditions that support the internal conflict resolution of the plan, e.g. constraining the input view of an indexer module, or the parameters of tunable parsers, etc.

The aim of the planner subsystem is to plan the parser configuration by using parser modules to produce views that satisfy the conditions of the parsing task. The running sequence can be derived by matching the input and output conditions of the parsers. The key challenge here is the optimal resolution of conflicts in the plan (e.g. one type of view can be produced by several parsers, or one parser can produce a view from several input views, etc.).
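To make the matching of input and output conditions concrete, the following Python sketch plans a running sequence by simple backward chaining. It is not the STRIPS-based partial-order planning mentioned above, and it is not the framework's planner: the `ParserSpec` class, the numeric `cost` field (a stand-in for the quality and work measurements), the parser names, and the greedy "cheapest producer wins" rule are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class ParserSpec:
    """Declared attributes of a parser module (hypothetical model)."""
    name: str
    inputs: Tuple[str, ...]   # required input view types
    output: str               # produced view type
    cost: float = 1.0         # assumed stand-in for quality/work measurements


def plan(required: Sequence[str], specs: Sequence[ParserSpec],
         available: Sequence[str] = ("initial",)) -> List[str]:
    """Return parser names in a runnable order for the required view types."""
    done = set(available)
    order: List[str] = []

    def produce(view_type: str, pending: frozenset) -> None:
        if view_type in done:
            return
        if view_type in pending:
            raise ValueError(f"cyclic dependency on view type {view_type!r}")
        candidates = sorted((s for s in specs if s.output == view_type),
                            key=lambda s: s.cost)
        if not candidates:
            raise ValueError(f"no parser produces view type {view_type!r}")
        chosen = candidates[0]  # greedy conflict resolution: cheapest producer
        for needed in chosen.inputs:
            produce(needed, pending | {view_type})
        order.append(chosen.name)
        done.add(view_type)

    for view_type in required:
        produce(view_type, frozenset())
    return order


specs = [
    ParserSpec("word_sentence_marker", ("initial",), "words-and-sentences"),
    ParserSpec("sentence_parser", ("words-and-sentences",), "parsed-sentences"),
    ParserSpec("heuristic_names", ("words-and-sentences",), "names", cost=2.0),
    ParserSpec("lexicon_names", ("words-and-sentences",), "names", cost=1.0),
    ParserSpec("indexer", ("words-and-sentences",), "index"),
]
parsing_plan = plan(["names", "parsed-sentences"], specs)
print(parsing_plan)
```

The greedy rule is the weakest part of this sketch: it never backtracks, so it can fail when the cheapest producer has unreachable inputs even though another producer would work. Resolving such conflicts well is exactly the challenge named in the text.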

4 Example Scenario

We continue the example described in the introduction, where we would like to extract facts from articles (see Section 1). Assume that we have a fact base where we can define what kind of facts we are interested in, like the one in the figure below (Fig. 4). Here, we define that we would like to gather recent facts about financial operations of the corporation MOL, and that the description of the operation should refer to the exact amount of currency.

Figure 4: Parser attributes

First, we have to find out what types of views could contain the defined information elements. Such views form the request of the current application (the parsing task). For example, the name of the corporation could be found in the view that contains extracted person and company names. Given the list of the required views, the planner subsystem can create a suitable parsing plan to produce them by applying several parser modules to the source documents. The plan for this example could be as follows (Fig. 5):

First, we need to identify word and sentence boundaries in the source document. The resulting view will be the input of the sentence parser, the name recognizer and also the indexer, which ensures that the output views of these parsers will be linked through the (semantic) level of sentences and words. We also apply a pattern fitter module, which recognizes common patterns, like currency, date, address, email, etc. Additionally, we may have several parsers for one task, e.g. two types of name recognizer: a heuristic one and a lexicon-based one. The optimal result can be selected by evaluating both.

³ E.g. approximated information quality of the resulting view, complexity of the algorithms, required resources, reliability measurements, etc.

Figure 5: Parser attributes

The extraction of relevant facts is then performed as follows: (1) first, we need to identify whether this document could be about a company that we are interested in. We can search for exact company names, and we can use statistical relevance calculation by applying the index view. Then, (2) we should identify which sentences contain the company or person names, hoping that these sentences will contain some interesting facts. We can easily select them by following the source references back from the names to the words-and-sentences view and then forward to the parsed-sentences view. The next step (3) is to retrieve the predicate and object values from the sentences and check whether they are other entities in the names or the patterns views (e.g. the object is another company). This can also be done by navigating back and forth along the source-target references. Finally (4), we should identify known concepts in the sentence by using an ontology or a lexicon and create the output fact list, which is done by the last module in the figure, the relevant fact extractor. We represent the fact extraction as a parser module in Figure 5 (in the parsing plan), but it will rather be the result of a query performed with a proper logic-based query language than a unique parser module implementation.
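Steps (2) and (3) amount to walking the source references backward and then forward. The Python sketch below illustrates this navigation on invented toy data; the dictionary layout, the identifiers (`w1`, `s1`, ...), the company name "Acme", and the helper names are assumptions, and a real implementation would query the XML views instead.

```python
from typing import Dict, List, Optional, Set

# words-and-sentences view: each word id maps to the id of its parent sentence.
word_to_sentence: Dict[str, str] = {
    "w1": "s1", "w2": "s1", "w3": "s1",   # s1: "MOL pays Acme"
    "w4": "s2", "w5": "s2",               # s2: "Prices rose"
}

# names view: recognized name -> ids of its source words (the IESRs).
names: Dict[str, List[str]] = {"MOL": ["w1"], "Acme": ["w3"]}

# parsed-sentences view: one triplet per sentence, with its source sentence id
# and the source word ids of the object.
parsed = [
    {"subject": "MOL", "predicate": "pays", "object": "Acme",
     "source": "s1", "object_words": ["w3"]},
    {"subject": "Prices", "predicate": "rose", "object": None,
     "source": "s2", "object_words": []},
]


def sentences_of(name: str) -> Set[str]:
    """Step back: from a name to the sentences that contain its source words."""
    return {word_to_sentence[w] for w in names[name]}


def triplets_for(name: str) -> List[dict]:
    """Step forward: from those sentences to the parsed-sentences view."""
    hits = sentences_of(name)
    return [t for t in parsed if t["source"] in hits]


def object_entity(triplet: dict) -> Optional[str]:
    """Check whether the object of a triplet is itself a recognized name."""
    for name, words in names.items():
        if set(words) & set(triplet["object_words"]):
            return name
    return None


for t in triplets_for("MOL"):
    print(t["subject"], t["predicate"], t["object"], "->", object_entity(t))
```

Because the names view and the parsed-sentences view both refer to the same word and sentence elements, no direct link between them has to be stored; the shared sources are enough to relate them.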

5 Conclusions

We are in an early phase of the research; however, based on several years of experience with the subject, we have found the designed theoretical framework promising. Our next step is to complete the development of a prototype that will implement the core architecture and services of the abstract framework and will serve as a workbench to develop and try various parser schemes.

References

[1] R. Grishman (1997). Information extraction: Techniques and challenges. Information Extraction: A Multidisciplinary Approach to an Emerging Information Technology, Vol. 1299: 10-27, June 1997.

[2] J. L. Neto, A. D. Santos, C. A. Kaestner, A. A. Freitas (2000). Document Clustering and Text Summarization. In Proceedings of 4th Int. Conference on Practical Applications of Knowledge Discovery and Data Mining, pp. 41-55, London, UK, 2000.

[3] S. Kuhlins, R. Tredwell (2002). Toolkits for Generating Wrappers. Net.ObjectDays-2002, Erfurt, Germany, October 2002.

[4] N. Kushmerick, B. Thomas (2003). Adaptive Information Extraction: Core Technologies for Information Agents. In Intelligent Information Agents R&D in Europe: An AgentLink perspective, Lecture Notes in Computer Science 2586, Springer, 2003.

[5] R. Mitkov (ed.) (2003). The Oxford Handbook of Computational Linguistics. Oxford University Press, Oxford, 2003.

[6] T. Mészáros, Z. Barczikay, F. Bodon, T. Dobrowiecki, G. Strausz (2001). Building an Information and Knowledge Fusion System. In Proceedings of IEA/AIE-2001, Springer Lect. Notes in Comp. Sc., 2001.

[7] P. Varga, T. Mészáros, Cs. Dezsényi, T. Dobrowiecki (2003). An Ontology-based Information Retrieval System. In Proceedings of IEA/AIE-2003, Loughborough, UK, Lecture Notes in Computer Science vol. 2718, Springer-Verlag, 2003.

[8] S. Russell, P. Norvig (1997). Artificial Intelligence: A Modern Approach. Prentice Hall Inc., 1997.
