Cross-Language Information Retrieval Techniques

Uploaded by: xins****2008 | Document ID: 165756928 | Upload date: 2022-10-29 | Format: PPTX | Pages: 78 | Size: 3.16 MB

Cross Language Information Retrieval

Road Map
- Cross Lingual IR
- Motivation
- Definition
- General Issues With CLIR
- Basic Approaches to CLIR
- CLIR evaluation
- CLIR applications

Information Retrieval
- Single language: both the user's query and the documents to be searched are in the same language.
- Cross language: the documents are written in a language different from the language of the user's query.
[Figure: documents and query]

Growth rate of Internet language use by world region, 2000-2010 (data updated June 30, 2010)

The Internet Big Picture: World Internet Users and 2015 Population Stats

World Region        Population       Internet Users   Penetration (% population)   Users % of Table   Growth 2000-2015
Africa              1,158,355,663    313,257,074      27.0%                        9.6%               6,839%
Asia                4,032,466,882    1,563,208,143    38.8%                        47.8%              1,268%
Europe              821,555,904      604,122,380      73.5%                        18.5%              475%
Middle East         236,137,235      115,823,882      49.0%                        3.5%               3,426%
North America       357,172,209      313,862,863      87.9%                        9.6%               191%
Latin America       617,776,105      333,115,908      53.9%                        10.2%              1,743%
Oceania/Australia   37,157,120       27,100,334       72.9%                        0.8%               256%
World Total         7,260,621,118    3,270,490,584    45%                          100%               806%

Usage of content languages for websites

2002                   2015
English      72%       English      54.5%
German       7%        Russian      5.9%
Japanese     6%        German       5.7%
Spanish      3%        Japanese     5.0%
French       3%        Spanish      4.7%
Italian      2%        French       4.1%
Dutch        2%        Portuguese   2.6%
Chinese      2%        Chinese      2.2%
Korean       1%        Italian      2.1%
Russian      1%        Polish       1.9%
Portuguese   1%        Turkish      1.6%

Cross Language IR
- Motivation
  - Information unavailability in some languages
  - Language barrier
- Definition:
  - Cross-language information retrieval (CLIR) is a subfield of information retrieval dealing with retrieving information written in a language different from the language of the user's query (Wikipedia).
- Example:
  - A user may pose a query in Chinese but retrieve relevant documents written in English.

Why do we need CLIR systems?
- We need technologies that enable access to information regardless of geographic or language barriers.
- To find, retrieve and understand relevant information in whatever language or form.
- CLIR has become one of the key factors affecting knowledge sharing all over the world.

General Issues With CLIR
- Multilingual text access (character sets, etc.)
- Differences between languages: stemming, compound words, breaks between words, etc.
- Term ambiguity between languages
- What to translate (query vs. document) and how

Matching strategies
- No translation
  - (1) Cognate matching
- Translation
  - (2) Query translation
  - (3) Document translation
  - (4) Interlingual techniques

Cognate matching
- In the most naive form of cognate matching, untranslatable terms such as proper nouns or technical terminology are left unchanged through the translation stage.
- The unchanged term can be expected to match a corresponding term in the other language if the two languages have a close linguistic relationship (for example, "generation" in English and French).
- When two languages are very different, cognate matching may still be made feasible by measuring the similarity between a transliteration and its original word.

Query translation
[Figure: query translation architecture. Labels: Chinese query, translation system, French query, search engine, French document collection, French documents, results, select and browse]
Process:
- Translate the Chinese query into French.
- Search the French document collection.
- Translate the retrieval results into Chinese.

Query translation
- Query translation is the most widely used matching strategy for CLIR because of its tractability.
- The retrieval system does not have to change its inverted files of index terms in any way to handle queries in any language.
- Translating a query is less computationally costly than translating a large set of documents.
- Challenge: term ambiguity
  - Queries are often short, and short queries provide little context for disambiguation.
- Term disambiguation is discussed later.

Pros and cons of query translation
- Pros
  - Simple
  - Easy to operate
  - Flexible
  - Saves time and space; efficient
- Cons
  - Lacks context
  - High translation ambiguity for short queries

Document translation
[Figure: document translation architecture. Labels: Chinese query, French document collection, search engine, translation system, Chinese document collection, results, select and browse]
Process:
- Translate the entire French document collection into Chinese.
- Search the Chinese documents directly.

Document translation
- Document translation has the opposite advantages and disadvantages from query translation.
- In CLIR experiments this approach is not usually used, and query translation is dominant.
- However, some researchers have used it to translate large sets of documents, since more varied context within each document is available for translation, which can improve translation quality.
- Oard and Hackett (1998) reported that automatic machine translation of a set of documents using a commercial MT system outperformed query translation in a German-to-English CLIR experiment.

Pros and cons of document translation
- Pros
  - Translation is done only once
  - Documents provide richer context
  - Documents can be translated offline in advance
- Cons
  - Translation is slow
  - Consumes a large amount of space and time; inefficient
  - Depends on the quality of the machine translation system

Query translation vs. document translation
- The choice depends on the language resources available
- Query translation is generally more widely used
- Both approaches pose challenges for "interaction"

Interlingual approach
- An intermediate space of subject representation, into which both the query and the documents are converted, is used to compare them.
- One type of interlingual approach uses the synsets provided in WordNet, a well-known machine-readable thesaurus.
- For example, Diekema, Oroumchian, Sheridan, and Liddy (1999) employed WordNet synset numbers as language-independent representations for CLIR.
- Since a synset number (label) representing a concept corresponds to a set of concrete words in each supported language (e.g., English and French), a query term in the source language can be linked to words in the target language via the synset number.

Translation techniques

Dictionary-based methods
- Use a bilingual machine-readable dictionary (MRD).
- Most retrieval systems are still based on so-called bag-of-words architectures, in which both query statements and document texts are decomposed into a set of words (or phrases) through a process of indexing.
- Thus a query can be translated easily by replacing each query term with its translation equivalents from a bilingual dictionary or a bilingual term list.
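The term-replacement idea above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy English-French term list `EN_FR`, the three French documents, and the overlap-count ranking are hypothetical stand-ins, not the dictionary or retrieval model of any particular CLIR system.

```python
from collections import defaultdict

# Hypothetical toy bilingual term list (English -> French), for illustration only.
EN_FR = {
    "oil": ["pétrole", "huile"],
    "survey": ["enquête", "sondage"],
    "price": ["prix"],
}

# Hypothetical target-language (French) document collection.
FR_DOCS = {
    "d1": "le prix du pétrole augmente",
    "d2": "une enquête sur le prix de l' huile",
    "d3": "le sondage politique",
}


def translate_query(query, dictionary):
    """Replace each query term with all of its dictionary equivalents.

    Terms missing from the dictionary are passed through unchanged.
    """
    translated = []
    for term in query.lower().split():
        translated.extend(dictionary.get(term, [term]))
    return translated


def build_inverted_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search(index, terms):
    """Rank documents by how many of the given terms they contain."""
    scores = defaultdict(int)
    for term in terms:
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    return sorted(scores, key=lambda d: (-scores[d], d))


french_terms = translate_query("oil price survey", EN_FR)
ranking = search(build_inverted_index(FR_DOCS), french_terms)
print(french_terms)
print(ranking)
```

Note that the inverted index is built once over the target-language documents and is not touched by translation, which is the tractability argument made for query translation. Keeping every dictionary equivalent (both "pétrole" and "huile" for "oil") is also where the term ambiguity problem comes from.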

Bilingual dictionary
[Figure: bilingual dictionary]

Term translation
[Figure: term translation of the query "oil petroleum probe survey take samples". Labels: "Which translation to choose?", "No translation!", "restrain", "cymbidium goeringii", "Segmentation error"]

Some issues in term translation
- Compound words, for example in German
  - Decomposition
- No boundary between words, e.g. in Chinese
  - Segmentation
- Specialized vocabulary not contained in the dictionary, e.g. named entities

Examples
- Compound decomposition
- Chinese word segmentation
  - 新西兰花
  - 新西兰 花: New Zealand flowers
  - 新 西兰花: fresh broccoli

Corpus-based methods
- Parallel or comparable corpora are useful resources from which information beneficial to CLIR can be extracted.
- For example, to translate English queries into Spanish, Davis and Dunning (1995) extracted moderately frequent Spanish terms from Spanish documents aligned with English documents that had been retrieved using an English query (the source query).

Parallel corpora
- A parallel corpus (pl. corpora) is a document collection composed of two or more disjoint subsets, each written in a different language, such that documents in each subset are translations of documents in each other subset.
- Very high accuracy
[Figure: Rosetta Stone. Labels: hieroglyphs, ancient Egyptian script, Greek]

Rosetta Stone
- The Rosetta Stone, 1.14 m high and 0.73 m wide, is a granodiorite stele made in 196 BC and originally inscribed with a decree of the Egyptian king Ptolemy V. The same content is carved on it in Greek, in ancient Egyptian hieroglyphs, and in the demotic script of the time. Because the stone carries three versions of the same text, modern archaeologists were able to compare them and decipher the meaning and structure of Egyptian hieroglyphs, which had been lost for more than a thousand years. It has become an important milestone in the study of ancient Egyptian history.

More parallel corpora
- News:
  - DE-News (German-English)
  - Hong Kong News, Xinhua News (Chinese-English)
- Government documents:
  - Canadian Hansards (French-English)
  - Europarl (Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portuguese, Spanish, Swedish)
  - UN Treaties (Russian, English, Arabic, ...)
- Bible (many, many languages)

Examples
English: Diverging opinions about planned tax reform
German: Unterschiedliche Meinungen zur geplanten Steuerreform

English: The discussion around the envisaged major tax reform continues.
German: Die Diskussion um die vorgesehene grosse Steuerreform dauert an.

English: The FDP economics expert, Graf Lambsdorff, today came out in favor of advancing the enactment of significant parts of the overhaul, currently planned for 1999.
German: Der FDP-Wirtschaftsexperte Graf Lambsdorff sprach sich heute dafuer aus, wesentliche Teile der fuer 1999 geplanten Reform vorzuziehen.

Comparable corpora
- A comparable corpus is a pair of corpora in two different languages that come from the same domain.
- The two corpora discuss the same topics.
- Parallel sentences may also be mined from comparable corpora, such as news stories written on the same topic in different languages.
- Some researchers extract phrase pairs from comparable corpora using a classifier approach.
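One simple way a sentence-aligned parallel corpus can yield translation candidates is to score target-language words by how strongly they co-occur with a source term across aligned sentence pairs. The sketch below uses the Dice coefficient for that score; this is an illustrative choice, not the method of any work cited above, and the three English-German pairs in `PAIRS` are hypothetical, simplified sentences loosely modeled on the tax-reform examples.

```python
from collections import Counter

# Hypothetical, simplified sentence-aligned pairs (English, German).
PAIRS = [
    ("the tax reform continues", "die steuerreform dauert an"),
    ("opinions about the tax reform", "meinungen zur steuerreform"),
    ("the discussion continues", "die diskussion dauert an"),
]


def dice_candidates(aligned_pairs, source_term, top_n=3):
    """Rank target words by Dice association with source_term.

    Dice(s, t) = 2 * cooc(s, t) / (freq(s) + freq(t)), where all counts
    are numbers of aligned sentence pairs.
    """
    src_freq = 0
    tgt_freq = Counter()
    cooc = Counter()
    for src, tgt in aligned_pairs:
        src_tokens = set(src.lower().split())
        tgt_tokens = set(tgt.lower().split())
        tgt_freq.update(tgt_tokens)
        if source_term in src_tokens:
            src_freq += 1
            cooc.update(tgt_tokens)
    scores = {t: 2 * c / (src_freq + tgt_freq[t]) for t, c in cooc.items()}
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


candidates = dice_candidates(PAIRS, "reform")
print(candidates)
```

On real corpora, frequent function words and tokenization differences (for example the German compound "Steuerreform" covering both "tax" and "reform") make the scores much noisier than in this toy case.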

Example
- The WWW can provide rich and ubiquitous machine-readable resources, from which information useful for CLIR may be extracted automatically.
- For example, Chen (2002) and Chen and Gey (2003) used a general Internet search engine to find English translation equivalents of Chinese or Japanese terms (mainly proper nouns) by analyzing the contexts of these terms in the Chinese and Japanese Web documents returned by the engine.

Term disambiguation techniques
- Disambiguation among multiple alternative term translations: how should one of several translations be chosen? E.g., Apple, Bank
- Use of part-of-speech (POS) tags.
- Use of a parallel corpus.
- Use of co-occurrence statistics in the target corpus.
- Use of the query expansion technique.

Use of part-of-speech tags
- The basic idea of using part-of-speech (POS) tags for translation disambiguation is to select only translations that have the same POS as the source query term.
- This method requires POS tagging software for both languages.

Parallel corpus-based disambiguation
- Davis (1997, 1998) used a parallel corpus to determine the best translation or set of translations: a single translation for each source term was selected from the set of translations listed in an MRD according to the result of searching a parallel corpus.

Translation probability
[Figure: multiple translations and their translation probabilities. Labels: 探测, survey, 试探, 样品, 测量; p = 0.4, p = 0.3, p = 0.25, p = 0.05]
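The POS filter and the translation-probability selection described above can be combined in a small sketch. The candidate list, its POS tags, and its probabilities are hypothetical values invented for illustration (in practice the probabilities might be estimated from a parallel corpus), and falling back to all candidates when no POS matches is an assumption of this sketch rather than something the slides specify.

```python
# Hypothetical French candidates for the English term "bank":
# (translation, POS tag, translation probability).
BANK_CANDIDATES = [
    ("banque", "NOUN", 0.6),
    ("rive", "NOUN", 0.3),
    ("incliner", "VERB", 0.1),
]


def select_translation(source_pos, candidates):
    """Pick a single translation for a source term.

    Step 1: keep only candidates whose POS tag equals the POS of the
            source query term; if none match, keep all candidates
            (fallback assumed for this sketch).
    Step 2: return the remaining candidate with the highest probability.
    """
    same_pos = [c for c in candidates if c[1] == source_pos]
    pool = same_pos or candidates
    return max(pool, key=lambda c: c[2])[0]


print(select_translation("NOUN", BANK_CANDIDATES))
print(select_translation("VERB", BANK_CANDIDATES))
```

The POS step needs a tagger for both languages, as noted above; here the tags are simply given as input.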

Disambiguation based on co-occurrence statistics
- The correct translations of query terms should co-occur in target-language documents, and incorrect translations should tend not to co-occur.
- First, the two most related terms in the query are determined from co-occurrence statistics in the source-language corpus; then the best translations are selected from all pairs of translations of these two terms according to co-occurrence statistics in the target-language corpus.
- Note that these two corpora do not have to be parallel or comparable.
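The two steps above can be sketched as follows. This is a minimal illustration under stated assumptions: documents are represented as sets of tokens, co-occurrence is counted at document level, and the English source corpus, French target corpus, and dictionary are all hypothetical toy data. Real systems typically use association measures such as mutual information rather than raw counts.

```python
from itertools import combinations, product

# Hypothetical source-language (English) corpus, one token set per document.
SOURCE_DOCS = [
    {"bank", "interest", "loan"},
    {"bank", "interest", "rate"},
    {"river", "water"},
]

# Hypothetical target-language (French) corpus; it is neither parallel
# nor comparable to the source corpus.
TARGET_DOCS = [
    {"banque", "intérêt", "taux"},
    {"banque", "intérêt", "crédit"},
    {"rive", "fleuve"},
    {"curiosité", "science"},
]

# Hypothetical bilingual dictionary with ambiguous entries.
EN_FR = {
    "bank": ["banque", "rive"],
    "interest": ["intérêt", "curiosité"],
}


def cooccurrence(docs, a, b):
    """Number of documents that contain both a and b."""
    return sum(1 for d in docs if a in d and b in d)


def most_related_pair(query_terms, source_docs):
    """Step 1: the two query terms that co-occur most in the source corpus."""
    return max(combinations(query_terms, 2),
               key=lambda pair: cooccurrence(source_docs, *pair))


def best_translation_pair(term_a, term_b, dictionary, target_docs):
    """Step 2: the pair of translations that co-occur most in the target corpus."""
    pairs = product(dictionary[term_a], dictionary[term_b])
    return max(pairs, key=lambda pair: cooccurrence(target_docs, *pair))


term_a, term_b = most_related_pair(["bank", "interest", "rate"], SOURCE_DOCS)
print(term_a, term_b)
print(best_translation_pair(term_a, term_b, EN_FR, TARGET_DOCS))
```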

Query expansion for disambiguation
- Pseudo relevance feedback (PRF), also known as blind feedback, is widely recognized as an effective technique for enhancing the performance of information retrieval. PRF also works effectively for CLIR tasks.
- In the case of CLIR, two kinds of PRF are feasible:
  - Pre-translation feedback
  - Post-translation feedback

Pre-translation feedback
- If a corpus in the source language is available, documents from it can be retrieved prior to translation in order to add a set of new terms to the source query (pre-translation feedback).
- Pre-translation feedback may contribute to improved precision, because PRF is basically done using the entire query, not each source term separately. That is, synonyms or related terms corresponding to the correct meaning of each source term in the context of the query are expected to be added automatically through the PRF process.
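A minimal sketch of pseudo relevance feedback as query expansion is shown below. It assumes a deliberately crude retrieval model (documents ranked by the number of query-term occurrences) and picks expansion terms by raw frequency in the top-ranked documents; the source-language documents in `DOCS` and the parameters `top_k` and `n_new_terms` are hypothetical. For pre-translation feedback the expanded query would then be translated; the same function could in principle be run on the target collection after translation.

```python
from collections import Counter

# Hypothetical source-language (English) documents, already tokenized.
DOCS = [
    ["oil", "price", "crude", "barrel", "crude"],
    ["oil", "price", "barrel", "market"],
    ["olive", "oil", "recipe"],
    ["football", "match"],
]


def prf_expand(query_terms, docs, top_k=2, n_new_terms=2):
    """Expand a query with frequent terms from the top-ranked documents.

    The whole query is used to rank documents, so the added terms reflect
    the query context rather than each source term in isolation.
    """
    query = set(query_terms)
    # Rank by number of query-term occurrences (stand-in for a real model).
    ranked = sorted(docs, key=lambda d: -sum(1 for t in d if t in query))
    counts = Counter(t for d in ranked[:top_k] for t in d if t not in query)
    by_frequency = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return list(query_terms) + [t for t, _ in by_frequency[:n_new_terms]]


print(prf_expand(["oil", "price"], DOCS))
```

In this toy run the top two documents are about crude oil, so the added terms steer the query away from the cooking-oil sense before any translation takes place.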

Post-translation feedback
- After translation, standard PRF can be applied using the target document collection (post-translation feedback).
- Post-translation feedback can be considered a device for improving recall, as shown in standard experiments on monolingual retrieval.
- In CLIR, two well-known methods for weighting terms in the top-ranked documents are often used to select good terms: the Rocchio method and the probabilistic method.

Bi-directional translation
- Boughanem et al. (2002) explored a bi-directional translation technique in which a form of backward translation is used for ranking translation candidates. Suppose that we need to translate English query terms into French ones. In bi-directional translation, first a set of French equivalents for an English term is found in an English-French dictionary. Next, using a French-English dictionary, each French equivalent is translated back into a set of English terms. Basically, if the set includes the original source term, the French translation equivalent is chosen as a preferred translation.

Cross-language retrieval evaluation
- Information retrieval evaluation
  - Given: a search topic, a document collection, and a set of manually judged relevant documents
  - The retrieval results returned by the system are judged against them
- TREC CLIR (96-02): English to other languages
- CLEF (00-): between European languages
- NTCIR (99-): Asian languages and English
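The evaluation setup above (a topic, a document collection, and manual relevance judgments) can be made concrete with a short sketch. The slides do not name specific measures; precision, recall, and average precision are assumed here as common choices, and the ranked list and judgments are hypothetical.

```python
def precision_recall_at_k(ranked, relevant, k):
    """Precision and recall over the top-k results for one topic."""
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / k, hits / len(relevant)


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant document appears.

    Relevant documents that are never retrieved contribute zero.
    """
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


# Hypothetical system output and manual relevance judgments for one topic.
RANKED = ["d3", "d1", "d7", "d2"]
RELEVANT = {"d1", "d2", "d9"}

print(precision_recall_at_k(RANKED, RELEVANT, 2))
print(average_precision(RANKED, RELEVANT))
```

In a CLIR campaign the topics are in one language and the judged documents in another, but the scoring itself works on document identifiers and is the same as in monolingual evaluation.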

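Cross-language retrieval evaluation model
[Figure: cross-language retrieval evaluation model]

Applications of CLIR

2.1 Cross language Search Engine
- April 25, 2006: European search engine "Quaero"
  - The French President announced 90 million euros of support.
- May 16, 2007: Google Translate
  - Provides CLIR for 12 languages
  - Goal: take all the Web and translate it into multiple languages.
- May 5, 2008: Yahoo Babel Fish
  - Provides CLIR between 12 languages
  - It was AltaVista's project, later bought by Yahoo

Google Translate
http:/

Yahoo Babel Fish
http:/

Question
- Compare the cross-language search engines of Google and Yahoo!, and analyze the strengths and weaknesses of each.
- Google: one step (translate & search); the results are translated back into the source language. Pros: fast, and the results are easy for users to understand. Cons: users cannot modify the translation.
- Yahoo!: two steps (translate + search); the results are not translated. Pros: there is an intermediate step, so users can modify the translation. Cons: more complex, and users cannot read the results.

2.2 Cross-language retrieval in digital libraries
- WorldWideScience, a global cross-language science search platform, was launched on June 11, 2010 at the ICSTI (International Council for Scientific and Technical Information) summer conference in Helsinki, Finland.

WorldWideScience
http://worldwidescience.org/multilingual
- The members of the alliance are all professional library and information institutions or leading scientific and technical information organizations, such as the U.S. Department of Energy Office of Scientific and Technical Information (OSTI), the Library of Congress, the British Library, the Canada Institute for Scientific and Technical Information, the Korea Institute of Science and Technology Information, and the Institute of Scientific and Technical Information of China.
- The platform can also perform cross-language, cross-database searches automatically.

2.3 Cross-language patent retrieval
- According to the World Intellectual Property Organization (WIPO), patent documents contain 90%-95% of the world's scientific research results, while other technical documents (papers, journals, etc.) contain only 5%-10%.
- Making good use of patent search in research work can shorten R&D time by 60% and reduce R&D costs by 40%.
- In May 2010, WIPO released a beta version of the cross-language patent retrieval system PATENTSCOPE, marking the move of CLIR in patent retrieval from the laboratory to practical use.
- The system provides cross-language patent retrieval only among five languages: English, French, German, Japanese, and Spanish.

2.4 Cross-language image retrieval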

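2.5 Applications in e-commerce
- CINDOR is currently a relatively successful commercial cross-language information retrieval system.
- The CINDOR system has three core technologies: Conceptual Interlingua, Language Analysis, and Search Management.
- CINDOR currently supports English, French, and Spanish; Simplified Chinese, Russian, and Arabic are under development.

Reference
- Kazuaki Kishida. Technical issues of cross-language information retrieval: a review. Information Processing and Management, 2005 (41), pp. 433-455.
- Ge Yundong (葛运东). Research on query translation techniques for cross-language information retrieval [D]. Soochow University, 2010.
- Wang Xuwen (王序文). Research on cross-language information retrieval based on topic pseudo relevance feedback [D]. Beijing University of Posts and Telecommunications, 2014.
- Peng Lin (彭琳). Research on measuring semantic similarity of Chinese words and its application in cross-language information retrieval [D]. Fudan University, 2010.

Challenges for "interaction"
- CLIR poses some unique challenges for interaction
  - How do you help users select translated query terms?
  - How do you help users select document terms for query refinement?
  - How do you compensate for poor translation quality?

Cross-Language Information Access (CLIA)
[Figure: CLIR system and CLIA system. Labels: Result Processing, Result Presentation, Query Formulation, Question Analysis, Request Generation, Need Negotiation, Need Identification, Source Selection, Result Selection, Result Examination, Information Extraction, Result Classification, Result Visualization, Result Summarization, Query Reformulation, Relevance Feedback, Need Clarification, Need Instantiation]

CLIA vs. CLIR
- Cross-Language Information Retrieval
  - A narrow view of CLIA
  - CLIR is limited; good for developing matching techniques
- Cross-Language Information Access
  - Aims to help users find the information they want
  - Concerned with more than the ranking of results

Cross-language information access
- User-centered
  - Focuses on the interaction between the user and the system
  - Relevance depends on the specific "user" and the specific "situation"
- Interaction
  - Information needs cannot be fully understood
  - Language ambiguity
- A broader range of needs and uses
  - Multimedia: images, audio
  - Focused information: passage retrieval, question answering
  - Condensed information: summarization, information extraction

Life cycle of cross-language information access
[Figure: life cycle. Labels: search, translated query, result list, document selection, documents to browse, document browsing, query translation, query formulation, query, documents to deliver, query reformulation, translation reselection, document reselection]

Supporting query (re)formulation
- Problems
  - Term mismatch between query translations and terms in docs
  - Translations are in a foreign language
    - How to display, interpret and control
  - Is query translation an extra step?
- Query reformulation
  - Where and how to get information

User-assisted query translation
[Figure: user-assisted query translation]

Supporting document (re)selection
- Selection needs translated surrogates
  - How to generate surrogates?
  - How to translate surrogates?
- Examination needs translated documents

Summary generation
- How to generate surrogates
  - First N words in docs (good for news articles)
  - Key Word In Context, automatic summarization
  - Passage retrieval
- How to translate surrogates
  - Gloss translation: term-by-term translation
  - Phrase translation: translate only the phrases in docs
  - Machine translation

Summaries help users select and judge documents
[Figure: summaries helping users select and judge documents]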
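Three of the surrogate options listed above (first N words, Key Word In Context, and gloss translation) are simple enough to sketch directly. The German sentence is taken from the parallel-corpus example earlier in the deck; the German-English term list `DE_EN` is a hypothetical toy dictionary, and the whitespace tokenization is a simplifying assumption.

```python
DOC = "Die Diskussion um die vorgesehene grosse Steuerreform dauert an."

# Hypothetical toy German -> English term list, for illustration only.
DE_EN = {
    "die": "the",
    "diskussion": "discussion",
    "grosse": "major",
    "steuerreform": "tax reform",
}


def first_n_words(text, n):
    """Surrogate made of the first n whitespace-separated words."""
    return " ".join(text.split()[:n])


def kwic(text, keyword, window=2):
    """Key Word In Context: each keyword occurrence with `window` words per side."""
    tokens = text.split()
    snippets = []
    for i, token in enumerate(tokens):
        if token.lower().strip(".,;:!?") == keyword.lower():
            start = max(0, i - window)
            snippets.append(" ".join(tokens[start:i + window + 1]))
    return snippets


def gloss_translate(text, dictionary):
    """Term-by-term translation; words not in the dictionary are left as-is."""
    return " ".join(dictionary.get(tok.lower(), tok) for tok in text.split())


print(first_n_words(DOC, 3))
print(kwic(DOC, "Steuerreform"))
print(gloss_translate("Die Diskussion um die Steuerreform", DE_EN))
```

The gloss output keeps the source word order and leaves "um" untranslated, which shows why gloss translation is cheap but can only support a rough relevance judgment.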
