Csaba Dezsényi
Budapest University of Technology and Economics
dezsenyi@mit.bme.hu

Tamás Mészáros
Budapest University of Technology and Economics
meszaros@mit.bme.hu

Tadeusz Dobrowiecki
Budapest University of Technology and Economics
dobrowiecki@mit.bme.hu
Abstract
Various document parsing methods are required in applications that perform complex information extraction. The development of such parsing schemes can be simplified by decomposing the complex extraction process into simple steps that can be realized by elementary parser modules. The authors present a general framework for the development of IE applications by applying different parsers independently and producing a semantically joint result.
Keywords: Information extraction, parser network, planning for information retrieval
1 Introduction
Techniques of Information Extraction (IE) are used intensively nowadays in an ever-growing spectrum of applications [1]. We can find them in the smallest information processing tools (e.g. email filtering, web assistants), but also in the largest, corporation-wide solutions (e.g. knowledge management and decision support systems, search engines). IE tools, generally called parsers, are based on several distinct techniques and theories, such as statistics [2], pattern fitting [3], machine learning [4], natural language processing (NLP) [5], and many other methods. In simple commercial products these techniques do not appear in the core services; they are confined locally to particular functions.

In applications that perform complex information and knowledge discovery, however, multiple parsing methods are required. Assume, for example, that we would like to develop an application that retrieves economic articles from news portals, extracts simple facts about companies and persons, and puts them into the application's knowledge base for further processing. Several parsing tasks are required to extract such information. We have to extract the article from the HTML page, identify the company and person names in the text, use a sentence parser to extract simple subject-predicate-object triplets, and, for example, use statistical analysis to calculate relevance measurements for controlling local search. It is relatively simple to implement and execute each of these parsers individually. However, the results of the independent parsers should be semantically linked; e.g. we may need to identify whether the subject of a sentence is a company. What we require, then, is to discover and map semantic associations between the results of various parsers and to provide services such as querying or navigation among them. Moreover, parsers should use each other's results, which supports the reusability of already written modules and makes it easier to create parsing schemes.

Thus, summarizing the main objectives: we want to solve a complex document parsing problem by first decomposing the process into simple parser operations and then fusing together the particular results. The parsers would work as separate modules, but together their results should form a coherent and complete result of the whole parsing process.

To realize this aim, we have first designed the theoretical basis of the framework, which involves general principles, architectural considerations, and methodological guidelines. We have created a data model that implicitly maps semantic associations between the elements of independent parser results. We are also working on techniques for the automated planning of parser configurations that would be required for a particular application. We have started to develop an implementation of the abstract framework: a general, XML-based parser framework system. Our aim is to realize the system as a fundamental and standard tool that can be used to develop arbitrary IE applications. In this paper we present the theoretical basis of the abstract framework and the key challenges of the research.
The parser framework is a part of the Information and Knowledge Fusion (IKF) project [6], where a complex knowledge-based information retrieval application is under development that provides intelligent decision support for financial institutions. However, we conceptualize and design the framework to be general and completely application independent.
2 From Source Documents to Extracted Information
The process of information extraction can be divided into three phases: retrieval, parsing and extraction (Fig. 1). First, a document has to be retrieved from the source environment of the application. The result is an initially structured document. E.g. in the IKF project we apply an intelligent document retrieval agent that uses models of document sources for recognizing basic semantic structures in the incoming documents [7]. In the simplest case, an application may only require text files read from a local directory. The retrieval phase can also include format conversion processes, document merging or splitting, and other required preprocessing.
The second step is to parse the document in several ways to produce various semantically structured representations of the original source, which we call views. These views should contain the required information elements, and they provide the input of the third step, where the information is extracted from them and stored in a local knowledge repository of the application.
Figure 1: Phases of information extraction
The retrieval system feeds the parser framework with initial documents, while the IE process specifies what kinds of logical representations of the source (views) are required to accomplish the extraction. The task of the framework system is to apply suitable parsers, to plan the running sequence by combining them, to execute and control the parsing, and to produce a semantically linked and complete result for the extraction.
3 Views and Parsers
A traditional parser is usually considered a complete processing unit. Here, parsers are basic building blocks that can be used to construct a complex parsing scheme. We should analyze IE parsers to find their general characteristics and to define an abstract parser model. This ensures that the framework can handle parsing problems uniformly and can operate without referring to concrete implementations. We also need to define an abstraction of parser results to handle them independently of concrete realizations.
The task of a parser is to recognize certain parts or elements in its input document and transform them into its output view. The resulting view of a given type of parser is a kind of information projection of the source that contains the transformed information in a new, semantically structured form. Each view belongs to a type that defines what kind of information it contains and specifies its structure by schema definitions. A parser produces views, and the input of the parsing operation also consists of one or more views. In this way, parsers can use each other's results to create new views. The set of all views created in the framework is the complete result of a parsing operation. The input of the framework is the initial view that contains the original source document in its initial structure.
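As an illustration of the abstract parser model, the following minimal Python sketch treats a parser as a module that declares the types of its input views and the type of its output view. It is not the framework's implementation: the class names `View` and `Parser`, the view type names, the two toy parsers and the sample sentence are all assumptions made for this example, and view content is simplified to a list of strings instead of XML.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class View:
    """A typed, semantically structured projection of the source (simplified)."""
    view_type: str       # hypothetical type name, e.g. "words"
    content: List[str]   # simplified payload; the framework itself uses XML


@dataclass
class Parser:
    """An elementary parser: reads views of declared types, writes one new view."""
    name: str
    input_types: List[str]
    output_type: str
    run: Callable[[Dict[str, View]], List[str]]

    def apply(self, views: Dict[str, View]) -> View:
        # A parser may only read views that already exist.
        missing = [t for t in self.input_types if t not in views]
        if missing:
            raise ValueError(f"{self.name} is missing input views: {missing}")
        inputs = {t: views[t] for t in self.input_types}
        return View(self.output_type, self.run(inputs))


# Two toy parsers, invented for this example only.
tokenizer = Parser("tokenizer", ["initial"], "words",
                   lambda v: v["initial"].content[0].split())
capitalized = Parser("capitalized", ["words"], "names",
                     lambda v: [w for w in v["words"].content if w[:1].isupper()])

# The initial view holds the source document (an invented sentence).
views = {"initial": View("initial", ["MOL signed a contract with Acme"])}
for parser in (tokenizer, capitalized):
    result = parser.apply(views)
    views[result.view_type] = result   # the set of all views is the result
```

After the loop, `views` holds the initial view plus one view per parser, and the second parser has consumed the output of the first.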
We decided to implement views as XML¹ documents, because XML fits the demand of carrying semantically marked information well. XML is also a widespread standard, so we can benefit from the various standard tools that are freely available (e.g. DOM, XML Schema, XPath, XSLT, XQuery, etc.²).

¹ More details about XML at http://www.w3c.org/XML/

3.1 View Networks
Independent parsers can produce several different views, but one of the key challenges is to map associations between them to enable composite queries and navigation by linking semantically related pieces of information. To keep the framework general, the parsing process should be incremental: a parser can only access the already existing views (including the initial view with the original source), and it can only write its own output view. Therefore, we can map associations between elements of the new view and elements of the existing ones. The main idea is that, during their execution, parsers create links between those source and target elements that are considered identical entities. In this way, views are linked together (through their elements) and form a complete structure called a View Network (VN). Relations between elements of two independent views are established without creating explicit (direct) links between them, because they can be associated through the links to their sources.
Figure 2: View Network (the global and the detailed structure)
The internal structure of a view can be represented as a tree (Fig. 2, right). Each node in this tree corresponds to an information element. Such elements are the atomic components that can carry semantically marked information. An element consists of content and references called information element source references (IESRs). The content can be textual or can encapsulate other child element nodes, as in standard XML. The IESRs designate the source elements that serve as the information sources of the given element; thus an element and its sources can be considered identical entities. The IESRs establish relations between elements of different views. Suppose, for example (see Fig. 2), that V1 is the result of a parser that marks word and sentence boundaries, V2 is an index with terms and weights, and V3 contains some recognized two-word expressions. We can easily obtain the relation between, e.g., the index terms and the expressions through the references that connect them in V1.

² Information about several XML standards can be found at http://www.w3c.org
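The indirect association between V2 and V3 through V1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: elements are addressed by a hypothetical `(view name, index)` pair, the element contents are invented, and the flat dictionary stands in for the XML trees of the actual views.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

ElementId = Tuple[str, int]   # (view name, element index): a hypothetical address


@dataclass
class Element:
    content: str
    sources: List[ElementId] = field(default_factory=list)   # the IESRs


def sources_in_view(network: Dict[ElementId, Element],
                    eid: ElementId, view: str) -> Set[ElementId]:
    """Collect the elements of `view` reachable from `eid` by following IESRs."""
    if eid[0] == view:
        return {eid}
    found: Set[ElementId] = set()
    for src in network[eid].sources:
        found |= sources_in_view(network, src, view)
    return found


def related(network: Dict[ElementId, Element], eid: ElementId,
            other_view: str, via_view: str) -> List[ElementId]:
    """Elements of `other_view` that share a source in `via_view` with `eid`."""
    anchor = sources_in_view(network, eid, via_view)
    return [other for other in network
            if other[0] == other_view
            and anchor & sources_in_view(network, other, via_view)]


# Invented content: V1 = words, V2 = index terms, V3 = two-word expressions.
network = {
    ("V1", 0): Element("stock"),
    ("V1", 1): Element("exchange"),
    ("V1", 2): Element("closed"),
    ("V2", 0): Element("stock", [("V1", 0)]),
    ("V2", 1): Element("closed", [("V1", 2)]),
    ("V3", 0): Element("stock exchange", [("V1", 0), ("V1", 1)]),
}

# Index terms related to the expression, found without any direct V2-V3 link.
terms = related(network, ("V3", 0), "V2", "V1")
```

The recursion terminates because the incremental parsing process only allows references to already existing views, so the references cannot form a cycle.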
On the left side of Figure 2, the global VN can be seen, where we consider views as atomic nodes and shrink the IESRs to view source references (VSRs) that designate the source views serving as the information sources of a given view. At this level, the VN is a directed acyclic graph (DAG), due to the incremental character of the parsing process. The sink node is the initial view, which contains the initially structured source document that is the root information source for all parsing operations in the VN.

3.2 Planning the Parsing
The task of the framework is to produce the views that are required for the extraction of relevant information. It applies the parser modules, but it has to plan the right running sequence or graph, because some parsers may need the outputs of other ones, some can work on several views, so the required one has to be constrained, etc. The framework can partially or fully automate the creation of parsing plans by adopting standard AI planning algorithms (e.g. STRIPS-based partial-order planning [8]).
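When the dependencies between views are already known, a valid running sequence is simply a topological order of the view-level graph. The sketch below illustrates this with Python's standard `graphlib` module (Python 3.9 or later). The function names, view names and references are assumptions made for the example, and element-level references are given as plain `(view, index)` pairs.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Set, Tuple

ElementId = Tuple[str, int]   # (view name, element index): a hypothetical address


def view_source_references(iesrs: Dict[ElementId, List[ElementId]]) -> Dict[str, Set[str]]:
    """Shrink element-level references to view-level ones (view -> source views)."""
    vsr: Dict[str, Set[str]] = {}
    for (target_view, _), sources in iesrs.items():
        vsr.setdefault(target_view, set()).update(view for view, _ in sources)
    return vsr


def running_order(vsr: Dict[str, Set[str]]) -> List[str]:
    """Return the views so that every view comes after all of its source views."""
    try:
        return list(TopologicalSorter(vsr).static_order())
    except CycleError:
        raise ValueError("view network is not acyclic") from None


# Invented element-level references between four views.
iesrs = {
    ("words", 0): [("initial", 0)],
    ("words", 1): [("initial", 0)],
    ("index", 0): [("words", 0)],
    ("names", 0): [("words", 0), ("words", 1)],
}
vsr = view_source_references(iesrs)
order = running_order(vsr)   # "initial" first, then "words", then the rest
```

The relative order of `index` and `names` is not fixed by the graph, which corresponds to parsers that could run in either order or in parallel.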
Figure 3: Parser attributes
There are two factors that constrain a parsing plan: (1) the nature of the parsers, and (2) the conditions of the current application requirements. Parsers are the elementary steps of a parsing operation. A parser can produce a given type of view by processing certain other views. Thus, parsers have to declare (a) conditions on their required input views, and (b) definitions of their resulting output views (Fig. 3). In addition, parsers may provide definitions of their parameter lists and possible values, and certain quality and work measurements³ that support the generation of optimal plans.
The requirements of the actual application (what we call a parsing task) can be defined in the same way as for the parsers: we should declare conditions on the required views. Moreover, we may define additional conditions that support the internal conflict resolution of the plan, e.g. constraining the input view of an indexer module, or the parameters of tunable parsers, etc.
The aim of the planner subsystem is to plan the parser configuration by using parser modules to produce views that satisfy the conditions of the parsing task. The running sequence can be derived by matching the input and output conditions of the parsers. The key challenge here is the optimal resolution of conflicts in the plan (e.g. one type of view can be produced by several parsers, or one parser can produce a view from several input views, etc.).
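The matching of input and output conditions can be illustrated with a deliberately simple planner. The Python sketch below uses greedy backward chaining and resolves the "several parsers produce the same view type" conflict by picking the cheapest candidate. It is not the STRIPS-based partial-order planning mentioned above and not the framework's planner; the parser names, view types and the `cost` attribute are assumptions made for the example.

```python
from typing import FrozenSet, Iterable, List, NamedTuple, Sequence, Tuple


class ParserSpec(NamedTuple):
    name: str
    inputs: Tuple[str, ...]   # required input view types
    output: str               # produced view type
    cost: float               # hypothetical work measurement


def plan(goals: Iterable[str], parsers: Sequence[ParserSpec],
         available: Iterable[str] = ("initial",)) -> List[str]:
    """Return parser names in an order that produces every goal view type."""
    produced = set(available)
    steps: List[str] = []

    def achieve(view_type: str, pending: FrozenSet[str]) -> None:
        if view_type in produced:
            return
        if view_type in pending:
            raise ValueError(f"cyclic dependency on {view_type}")
        candidates = [p for p in parsers if p.output == view_type]
        if not candidates:
            raise ValueError(f"no parser produces {view_type}")
        # Greedy conflict resolution: take the cheapest producer.
        chosen = min(candidates, key=lambda p: p.cost)
        for needed in chosen.inputs:
            achieve(needed, pending | {view_type})
        steps.append(chosen.name)
        produced.add(view_type)

    for goal in goals:
        achieve(goal, frozenset())
    return steps


# Invented parser declarations.
parsers = [
    ParserSpec("tokenizer", ("initial",), "words", 1.0),
    ParserSpec("heuristic_names", ("words",), "names", 2.0),
    ParserSpec("lexicon_names", ("words",), "names", 3.0),
    ParserSpec("sentence_parser", ("words",), "parsed_sentences", 5.0),
    ParserSpec("indexer", ("words",), "index", 2.0),
]

steps = plan(["names", "parsed_sentences"], parsers)
```

A greedy choice like this can pick a cheap parser whose own inputs turn out to be unobtainable; a real planner would have to backtrack or search over alternatives, which is where the conflict resolution problem described above becomes hard.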
4 Example Scenario
We continue the example described in the introduction, where we would like to extract facts from articles (see Section 1). Assume that we have a fact base where we can define what kinds of facts we are interested in, like the one in the figure below (Fig. 4). Here, we define that we would like to gather recent facts about financial operations of the corporation MOL, and that the description of the operation should refer to an exact amount of currency.
Figure 4: Parser attributes
First, we have to find out what types of views could contain the defined information elements. Such views form the request of the current application (the parsing task). For example, the name of the corporation could be found in the view that contains extracted person and company names. Given the list of required views, the planner subsystem can create a suitable parsing plan to produce them by applying several parser modules to the source documents. The plan for this example could be as follows (Fig. 5):
First, we need to identify word and sentence boundaries in the source document. The resulting view will be the input of the sentence parser, the name recognizer and also the indexer, which ensures that the output views of these parsers will be linked through the (semantic) level of sentences and words. We also apply a pattern fitter module, which recognizes common patterns such as currency, date, address, email, etc. Additionally, we may have several parsers for one task, e.g. two types of name recognizer: a heuristic one and a lexicon-based one. The optimal result can be selected by evaluating both.

³ E.g. approximated information quality of the resulting view, complexity of the algorithms, required resources, reliability measurements, etc.
Figure 5: Parser attributes
The extraction of relevant facts is then performed as follows. (1) First, we need to identify whether the document could be about a company that we are interested in. We can search for exact company names, and we can use statistical relevance calculation by applying the index view. (2) Then we should identify which sentences contain the company or person names, hoping that these sentences will contain some interesting facts. We can easily select them by following the source references back from the names to the words-and-sentences view and then forward to the parsed-sentences view. (3) The next step is to retrieve the predicate and object values from the sentences and check whether they are other entities in the names or the patterns views (e.g. the object is another company). This can also be done by navigating back and forth along the source-target references. (4) Finally, we should identify known concepts in the sentence by using an ontology or a lexicon and create the output fact list, which is done by the last module in the figure, the relevant fact extractor. We represent the fact extraction as a parser module in Figure 5 (in the parsing plan), but it will more likely be the result of a query expressed in a suitable logic-based query language than a dedicated parser module implementation.
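Step (2), following references back from the names view and then forward to the parsed sentences, can be sketched as follows. This Python fragment is only an illustration: the three views are reduced to plain dictionaries and lists, and all of their content, field names and identifiers are invented for the example.

```python
from typing import Dict, List, Optional, Set, Tuple

# Words-and-sentences view: word id -> (text, sentence id). Invented content.
words: Dict[int, Tuple[str, int]] = {
    0: ("MOL", 0), 1: ("acquired", 0), 2: ("shares", 0),
    3: ("Prices", 1), 4: ("rose", 1),
}

# Names view: each name refers back to the word ids it was recognized from.
names: List[dict] = [{"text": "MOL", "kind": "company", "sources": [0]}]

# Parsed-sentences view: one subject-predicate-object triplet per sentence id.
parsed: List[dict] = [
    {"sentence": 0, "subject": "MOL", "predicate": "acquired", "object": "shares"},
    {"sentence": 1, "subject": "Prices", "predicate": "rose", "object": None},
]


def sentences_with_names(names: List[dict],
                         words: Dict[int, Tuple[str, int]]) -> Set[int]:
    """Go back from names to their source words and collect the sentence ids."""
    return {words[w][1] for name in names for w in name["sources"]}


def candidate_facts(names: List[dict], words: Dict[int, Tuple[str, int]],
                    parsed: List[dict]) -> List[Tuple[str, str, Optional[str]]]:
    """Go forward from the selected sentences to their parsed triplets."""
    hits = sentences_with_names(names, words)
    return [(p["subject"], p["predicate"], p["object"])
            for p in parsed if p["sentence"] in hits]


facts = candidate_facts(names, words, parsed)
```

Only the triplet of the sentence that mentions a recognized name is kept; the checks of steps (3) and (4) would then be applied to these candidates.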
5 Conclusions
We are in an early phase of the research; however, based on several years of experience with the subject, we have found the designed theoretical framework promising. Our next step is to complete the development of a prototype that will implement the core architecture and services of the abstract framework and will serve as a workbench for developing and trying out various parsing schemes.
References
[1] R. Grishman (1997). Information extraction: Techniques and challenges. Information Extraction: A Multidisciplinary Approach to an Emerging Information Technology, Vol. 1299: 10-27, June 1997.
[2] J. L. Neto, A. D. Santos, C. A. Kaestner, A. A. Freitas (2000). Document Clustering and Text Summarization. In Proceedings of 4th Int. Conference on Practical Applications of Knowledge Discovery and Data Mining, pp. 41-55, London, UK, 2000.
[3] S. Kuhlins, R. Tredwell (2002). Toolkits for Generating Wrappers. Net.ObjectDays-2002, Erfurt, Germany, October 2002.
[4] N. Kushmerick, B. Thomas (2003). Adaptive Information Extraction: Core Technologies for Information Agents. In Intelligent Information Agents R&D in Europe: An AgentLink perspective, Lecture Notes in Computer Science 2586, Springer, 2003.
[5] R. Mitkov (ed.) (2003). The Oxford Handbook of Computational Linguistics. Oxford University Press, Oxford, 2003.
[6] T. Mészáros, Z. Barczikay, F. Bodon, T. Dobrowiecki, G. Strausz (2001). Building an Information and Knowledge Fusion System. In Proceedings of IEA/AIE-2001, Springer Lect. Notes in Comp. Sc., 2001.
[7] P. Varga, T. Mészáros, Cs. Dezsényi, T. Dobrowiecki (2003). An Ontology-based Information Retrieval System. In Proceedings of IEA/AIE-2003, Loughborough, UK, Lecture Notes in Computer Science vol. 2718, Springer-Verlag, 2003.
[8] S. Russell, P. Norvig (1997). Artificial Intelligence: A Modern Approach. Prentice Hall Inc., 1997.