Parser Framework for Information Extraction

Csaba Dezsényi
Budapest University of Technology and Economics
dezsenyi@mit.bme.hu

Tamás Mészáros
Budapest University of Technology and Economics
meszaros@mit.bme.hu

Tadeusz Dobrowiecki
Budapest University of Technology and Economics
dobrowiecki@mit.bme.hu

Abstract

Various document parsing methods are required in applications that perform complex information extraction (IE). The development of such parsing schemes can be simplified by decomposing the complex extraction process into simple steps that can be realized by elementary parser modules. The authors present a general framework for the development of IE applications by applying different parsers independently and producing a semantically joint result.

Keywords: Information extraction, parser network, planning for information retrieval

1 Introduction

Techniques of Information Extraction (IE) are used intensively nowadays in an ever-growing spectrum of applications [1]. We can find them in the smallest information processing tools (e.g. email filtering, web assistants), but also in the largest, corporation-wide solutions (e.g. knowledge management and decision support systems, search engines). IE tools, generally called parsers, are based on several distinct techniques and theories, such as statistics [2], pattern fitting [3], machine learning [4], natural language processing (NLP) [5], and many other methods. In simple commercial products these techniques do not appear in the core services; they are confined locally to particular functions.

In applications that perform complex information and knowledge discovery, however, multiple parsing methods are required. Assume, for example, that we would like to develop an application that retrieves economic articles from news portals, extracts simple facts about companies and persons, and puts them into the application's knowledge base for further processing. Several parsing tasks are required to extract such information. We have to extract the article from the HTML page, we have to identify the company and person names in the text, we have to use a sentence parser to extract simple subject-predicate-object triplets, and we can, for example, use statistical analysis to calculate relevance measurements for controlling local search. It is relatively simple to implement and execute each of these parsers separately. However, the results of the independent parsers should be semantically linked; e.g. we may need to identify whether the subject of a sentence is a company. What we require, then, is to discover and map semantic associations between the results of the various parsers and to provide services, such as querying or navigation, among them. Moreover, parsers should use each other's results, which supports the reuse of already written modules and makes it easier to create parsing schemes.

To summarize the main objectives: we want to solve a complex document parsing problem by first decomposing the process into simple parser operations and then fusing together the particular results. The parsers would work as separate modules, but their results should form a coherent and complete result of the whole parsing process.

To realize this aim, we first designed the theoretical basis of the framework, which involves general principles, architectural considerations, and methodological guidelines. We have created a data model that implicitly maps semantic associations between the elements of independent parser results. We are also working on techniques for the automated planning of the parser configurations required for a particular application. We have started to develop an implementation of the abstract framework: a general, XML-based parser framework system. Our aim is to realize the system as a fundamental and standard tool that can be used to develop arbitrary IE applications. In this paper we present the theoretical basis of the abstract framework and the key challenges of the research.

The parser framework is a part of the Information and Knowledge Fusion (IKF) project [6], in which a complex knowledge-based information retrieval application is under development that provides intelligent decision support for financial institutions. However, we conceptualize and design the framework to be general and completely application independent.

2 From Source Documents to Extracted Information

The process of information extraction can be divided into three phases: retrieval, parsing and extraction (Fig. 1). First, a document has to be retrieved from the source environment of the application. The result is an initially structured document. In the IKF project, for example, we apply an intelligent document retrieval agent that uses models of document sources to recognize basic semantic structures in the incoming documents [7]. In the simplest case, an application may only require text files read from a local directory. The retrieval phase can also include format conversion, document merging or splitting, and other required preprocessing.

The second step is to parse the document in several ways to produce various semantically structured representations of the original source, which we call views. These views should contain the required information elements, and they provide the input of the third step, where the information is extracted from them and stored in a local knowledge repository of the application.

Figure 1: Phases of information extraction

The retrieval system feeds the parser framework with initial documents, while the IE process specifies what kind of logical representations of the source (views) are required to accomplish the extraction. The task of the framework system is to apply suitable parsers, to plan the running sequence by combining them, to execute and control the parsing, and to produce a semantically linked and complete result for the extraction.

3 Views and Parsers

A traditional parser is usually considered a complete processing unit. Here, parsers are basic building blocks that can be used to construct a complex parsing scheme. We should analyze IE parsers to find their general characteristics and to define an abstract parser model. This ensures that the framework can handle parsing problems uniformly and can operate without referring to concrete implementations. We also need to define an abstraction of parser results to handle them independently of concrete realizations.

The task of a parser is to recognize certain parts or elements in its input document and transform them into its output view. The resulting view of a given type of parser is a kind of information projection of the source that contains the transformed information in a new, semantically structured form. Each view belongs to a type that defines what kind of information it contains and specifies its structure by schema definitions. A parser produces views, and the input of a parsing operation also consists of one or more views. In this way, parsers can use each other's results to create new views. The set of all views created in the framework is the complete result of a parsing operation. The input of the framework is the initial view, which contains the original source document in its initial structure.
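The abstract model described above can be sketched in a few lines of Python. This is only an illustration under our own assumptions, not the framework's actual interface: the names `View`, `Parser`, `Framework`, `load` and `apply`, the view type strings, and the two toy parsers are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class View:
    """A typed, structured projection of the source document."""
    view_type: str
    content: object


@dataclass
class Parser:
    """An elementary parser: reads views of given types, writes one new view."""
    name: str
    input_types: List[str]
    output_type: str
    run: Callable[[List[View]], object]


@dataclass
class Framework:
    """Holds every view created so far, keyed by view type (hypothetical design)."""
    views: Dict[str, View] = field(default_factory=dict)

    def load(self, document: str) -> None:
        # The initial view wraps the original source document.
        self.views["initial"] = View("initial", document)

    def apply(self, parser: Parser) -> View:
        # A parser may only read views that already exist ...
        inputs = [self.views[t] for t in parser.input_types]
        # ... and it only writes its own output view.
        view = View(parser.output_type, parser.run(inputs))
        self.views[view.view_type] = view
        return view


# Two toy parsers; the second one consumes the result of the first.
sentence_splitter = Parser(
    name="sentence_splitter",
    input_types=["initial"],
    output_type="sentences",
    run=lambda vs: [s.strip() for s in vs[0].content.split(".") if s.strip()],
)
word_counter = Parser(
    name="word_counter",
    input_types=["sentences"],
    output_type="word_counts",
    run=lambda vs: [len(s.split()) for s in vs[0].content],
)

framework = Framework()
framework.load("MOL buys shares. Profit rises.")
framework.apply(sentence_splitter)
framework.apply(word_counter)
```

After both calls, `framework.views` holds the initial view plus the two derived views, which together form the complete result of the parsing operation in the sense used above.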

We decided to implement views as XML documents (see footnote 1), because XML fits the demand of carrying semantically marked information well. XML is also a widespread standard, so we can benefit from the various standard tools that are freely available (e.g. DOM, XSchema, XPath, XSLT, XQuery, etc.; see footnote 2).

Footnote 1: More details about XML at http://www.w3c.org/XML/

3.1 View Networks

Independent parsers can produce several different views, but one of the key challenges is to map associations between them to enable composite queries and navigation by linking semantically related pieces of information. To keep the framework general, the parsing process should be incremental: a parser can only access the already existing views (including the initial view with the original source), and it can only write its own output view. Therefore, we can map associations between elements of the new view and elements of the existing ones. The main idea is that parsers create links during their execution between those source and target elements that are considered identical entities. In this way, views are linked together (by their elements) and form a complete structure called a View Network (VN). Relations between elements of two independent views are established without creating explicit (direct) links between them, because they can be associated through the links to their sources.

Figure 2: View Network (the global and the detailed structure)

The internal structure of a view can be represented as a tree (Fig. 2, right). Each node in this tree corresponds to an information element. Such elements are the atomic components that can carry semantically marked information. An element consists of content and references called information element source references (IESR). The content can be textual or can encapsulate other child element nodes, as in standard XML. The IESRs designate the source elements that serve as the information sources of the given element; thus an element and its sources can be considered identical entities. The IESRs establish relations between elements of different views. Suppose, for example (see Fig. 2), that V1 is the result of a parser that marks word and sentence boundaries, V2 is an index with terms and weights, and V3 contains some recognized two-word expressions. We can easily obtain the relation between, e.g., the index terms and the expressions through the references that connect them in V1.

Footnote 2: Information about several XML standards can be found at http://www.w3c.org
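The V1, V2, V3 example can be sketched as follows. This is a minimal illustration with made-up data: the `Element` class, its field names, the identifiers and the weights are our own assumptions, and the real views would be XML documents rather than Python lists.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass(frozen=True)
class Element:
    """An information element: content plus its source references (IESR)."""
    view: str
    ident: str
    content: str
    sources: Tuple[str, ...] = ()  # ids of source elements in earlier views


# V1: words marked by the boundary parser (toy data).
v1 = [
    Element("V1", "w1", "interest"),
    Element("V1", "w2", "rate"),
    Element("V1", "w3", "rises"),
]
# V2: index terms with weights, each pointing back to a word in V1.
v2 = [
    Element("V2", "t1", "interest:0.8", ("w1",)),
    Element("V2", "t2", "rises:0.3", ("w3",)),
]
# V3: a recognized two-word expression, pointing back to two words in V1.
v3 = [Element("V3", "e1", "interest rate", ("w1", "w2"))]


def related(a: List[Element], b: List[Element]) -> Set[Tuple[str, str]]:
    """Pairs (a.ident, b.ident) whose IESRs share at least one source element."""
    return {
        (x.ident, y.ident)
        for x in a
        for y in b
        if set(x.sources) & set(y.sources)
    }


pairs = related(v2, v3)
```

No direct link between V2 and V3 is ever stored; the index term `t1` and the expression `e1` are related only because both refer to the word `w1` in V1.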

On the left side of Figure 2, the global VN can be seen: here we consider views as atomic nodes and collapse the IESRs into view source references (VSR), which designate the source views that serve as the information sources of a given view. At this level, the VN is a directed acyclic graph (DAG), due to the incremental character of the parsing process. The sink node is the initial view, which contains the initially structured source document and is the root information source for all parsing operations in the VN.

3.2 Planning the Parsing

The task of the framework is to produce the views that are required for the extraction of relevant information. It applies the parser modules, but it has to plan the right running sequence or graph, because some parsers may need the outputs of other ones, some can work on several views, in which case the required one should be constrained, etc. The framework can partially or fully automate the creation of parsing plans by adopting standard AI planning algorithms (e.g. STRIPS-based partial-order planning [8]).

Figure 3: Parser attributes

There are two factors that constrain a parsing plan: (1) the nature of the parsers, and (2) the conditions of the current application requirements. Parsers are the elementary steps of a parsing operation. A parser can produce a given type of view by processing certain other views. Thus, parsers have to declare (a) conditions on their required input views, and (b) definitions of their resulting output views (Fig. 3). In addition, parsers may provide definitions of their parameter lists and possible values, and certain quality and work measurements (see footnote 3) that support the generation of optimal plans.

The requirements of the actual application (what we call a parsing task) can be defined in the same way as for the parsers: we declare conditions on the required views. Moreover, we may define additional conditions that support the internal conflict resolution of the plan, e.g. constraining the input view of an indexer module, or the parameters of tunable parsers.

The aim of the planner subsystem is to plan the parser configuration by using parser modules to produce views that satisfy the conditions of the parsing task. The running sequence can be derived by matching the input and output conditions of the parsers. The key challenge here is the optimal resolution of conflicts in the plan (e.g. one type of view can be produced by several parsers, or one parser can produce a view from several input views).
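Matching input and output conditions can be illustrated with a very small planner. Note that this sketch uses plain backward chaining over view types, not the STRIPS-based partial-order planning mentioned earlier, and it resolves conflicts naively by keeping the first parser registered for each view type. `ParserSpec`, `plan` and all parser names are hypothetical.

```python
from typing import Dict, List, NamedTuple


class ParserSpec(NamedTuple):
    """Declared attributes of a parser: required input views, produced view."""
    name: str
    inputs: List[str]
    output: str


def plan(required: List[str], parsers: List[ParserSpec],
         available=("initial",)) -> List[str]:
    """Return parser names in an order that produces all required view types."""
    producers: Dict[str, ParserSpec] = {}
    for p in parsers:
        # Naive conflict resolution (assumption): the first producer wins.
        producers.setdefault(p.output, p)

    have = set(available)
    order: List[str] = []

    def ensure(view_type: str, pending: frozenset) -> None:
        if view_type in have:
            return
        if view_type in pending:
            raise ValueError(f"cyclic dependency on {view_type!r}")
        if view_type not in producers:
            raise ValueError(f"no parser produces {view_type!r}")
        spec = producers[view_type]
        # First make sure every input view of the chosen parser exists.
        for needed in spec.inputs:
            ensure(needed, pending | {view_type})
        order.append(spec.name)
        have.add(view_type)

    for view_type in required:
        ensure(view_type, frozenset())
    return order


parsers = [
    ParserSpec("tokenizer", ["initial"], "tokens"),
    ParserSpec("indexer", ["tokens"], "index"),
    ParserSpec("lexicon_names", ["tokens"], "names"),
    ParserSpec("heuristic_names", ["tokens"], "names"),  # competing producer
]
steps = plan(["index", "names"], parsers)
```

A real planner would replace the "first producer wins" rule with the quality and work measurements that parsers declare, which is exactly the conflict resolution problem named above.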

4 Example Scenario

We continue the example described in the introduction, where we would like to extract facts from articles (see Section 1). Assume that we have a fact base where we can define what kind of facts we are interested in, like the one in the figure below (Fig. 4). Here, we specify that we would like to gather recent facts about financial operations of the corporation MOL, and that the description of the operation should refer to the exact amount of currency.

Figure 4: Parser attributes

First, we have to find out what types of views could contain the defined information elements. Such views form the request of the current application (the parsing task). For example, the name of the corporation could be found in the view that contains extracted person and company names. Given the list of the required views, the planner subsystem can create a suitable parsing plan to produce them by applying several parser modules to the source documents. The plan for this example could be as follows (Fig. 5):

First, we need to identify word and sentence boundaries in the source document. The resulting view will be the input of the sentence parser, the name recognizer and also the indexer, which ensures that the output views of these parsers will be linked through the (semantic) level of sentences and words. We also apply a pattern fitter module, which recognizes common patterns such as currency, date, address, email, etc. Additionally, we may have several parsers for one task, e.g. two types of name recognizer: a heuristic one and a lexicon-based one. The optimal result can be selected by evaluating both.

Footnote 3: E.g. approximated information quality of the resulting view, complexity of the algorithms, required resources, reliability measurements, etc.
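The dependencies of this example plan can be written down as a small graph and ordered with Python's standard `graphlib` module (Python 3.9 or later). The view names are our own labels, and the input of the pattern fitter is an assumption, since the text does not state which view it reads.

```python
from graphlib import TopologicalSorter

# Each view type maps to the set of view types its parser reads.
plan_graph = {
    "words_and_sentences": {"initial"},
    "parsed_sentences": {"words_and_sentences"},
    "names": {"words_and_sentences"},
    "index": {"words_and_sentences"},
    "patterns": {"initial"},  # assumption: the source does not name this input
}

# static_order() yields every view after all of the views it depends on.
order = list(TopologicalSorter(plan_graph).static_order())
position = {view: i for i, view in enumerate(order)}
```

Any order produced this way runs the word-and-sentence boundary parser before the sentence parser, the name recognizer and the indexer, so their three output views share the same source view and stay linked through it.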

Figure 5: Parser attributes

The extraction of relevant facts is then performed as follows. (1) First, we need to identify whether this document could be about a company that we are interested in. We can search for exact company names, and we can use statistical relevance calculation by applying the index view. (2) Then we should identify which sentences contain the company or person names, hoping that these sentences will contain some interesting facts. We can easily select them by following the source references back from the names to the words-and-sentences view and then forward to the parsed-sentences view. (3) The next step is to retrieve the predicate and object values from the sentences and check whether they are other entities in the names or the patterns views (e.g. the object is another company). This can also be done by navigating back and forward along the source-target references. (4) Finally, we should identify known concepts in the sentence by using an ontology or a lexicon and create the output facts list, which is done by the last module in the figure, the relevant fact extractor. We represent the fact extraction as a parser module in Figure 5 (in the parsing plan), but it will be the result of a query performed with a suitable logic-based query language rather than a separate parser module implementation.
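Steps (2) and (3) amount to walking the source references backward and forward. The sketch below shows this on toy data; the dictionaries, the identifiers, the company name ExampleCorp and the function `facts_about` are all hypothetical, and a real system would run a query over the XML views instead.

```python
# Words-and-sentences view (toy data): words and the sentences containing them.
words = {"w1": "MOL", "w2": "buys", "w3": "ExampleCorp",
         "w4": "Analysts", "w5": "expect", "w6": "growth"}
sentences = {"s1": ["w1", "w2", "w3"], "s2": ["w4", "w5", "w6"]}

# Names view: each name refers back to the word it was recognized from.
names = [
    {"text": "MOL", "kind": "company", "source": "w1"},
    {"text": "ExampleCorp", "kind": "company", "source": "w3"},
]

# Parsed-sentences view: each triplet refers back to a sentence and its words.
parsed = [
    {"source": "s1", "subject": "w1", "predicate": "w2", "object": "w3"},
    {"source": "s2", "subject": "w4", "predicate": "w5", "object": "w6"},
]

# Lookup tables built from the references stored in the views.
word_to_sentence = {w: s for s, ws in sentences.items() for w in ws}
parsed_by_sentence = {p["source"]: p for p in parsed}
name_by_word = {n["source"]: n for n in names}


def facts_about(company):
    """Find parsed sentences that mention `company` and classify their object."""
    facts = []
    for name in names:
        if name["text"] != company:
            continue
        # Back: from the name to its word, then to the enclosing sentence.
        sentence_id = word_to_sentence[name["source"]]
        # Forward: from the sentence to its parsed triplet, if any.
        triplet = parsed_by_sentence.get(sentence_id)
        if triplet is None:
            continue
        # Step (3): is the object itself an entity in the names view?
        obj_name = name_by_word.get(triplet["object"])
        facts.append((
            sentence_id,
            words[triplet["predicate"]],
            words[triplet["object"]],
            obj_name["kind"] if obj_name else None,
        ))
    return facts


mol_facts = facts_about("MOL")
```

Here the only sentence selected is `s1`, and its object is recognized as another company because the object word `w3` is also the source of an element in the names view.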

5 Conclusions

We are in an early phase of the research; however, based on several years of experience with the subject, we find the designed theoretical framework promising. Our next step is to complete the development of a prototype that will implement the core architecture and services of the abstract framework and will serve as a workbench to develop and try various parser schemes.

References

[1] R. Grishman (1997). Information extraction: Techniques and challenges. Information Extraction: A Multidisciplinary Approach to an Emerging Information Technology, Vol. 1299: 10-27, June 1997.

[2] J. L. Neto, A. D. Santos, C. A. Kaestner, A. A. Freitas (2000). Document Clustering and Text Summarization. In Proceedings of 4th Int. Conference on Practical Applications of Knowledge Discovery and Data Mining, pp. 41-55, London, UK, 2000.

[3] S. Kuhlins, R. Tredwell (2002). Toolkits for Generating Wrappers. Net.ObjectDays-2002, Erfurt, Germany, October 2002.

[4] N. Kushmerick, B. Thomas (2003). Adaptive Information Extraction: Core Technologies for Information Agents. In Intelligent Information Agents R&D in Europe: An AgentLink perspective, Lecture Notes in Computer Science 2586, Springer, 2003.

[5] R. Mitkov (ed.) (2003). The Oxford Handbook of Computational Linguistics. Oxford University Press, Oxford, 2003.

[6] T. Mészáros, Z. Barczikay, F. Bodon, T. Dobrowiecki, G. Strausz (2001). Building an Information and Knowledge Fusion System. In Proceedings of IEA/AIE-2001, Springer Lect. Notes in Comp. Sc., 2001.

[7] P. Varga, T. Mészáros, Cs. Dezsényi, T. Dobrowiecki (2003). An Ontology-based Information Retrieval System. In Proceedings of IEA/AIE-2003, Loughborough, UK, Lecture Notes in Computer Science vol. 2718, Springer-Verlag, 2003.

[8] S. Russell, P. Norvig (1997). Artificial Intelligence. A Modern Approach. Prentice Hall Inc., 1997.
