Cross-Language Information Retrieval Technology

  • Resource ID: 165756928       File size: 3.16 MB        Pages: 78
  • Format: PPTX

Cross-Language Information Retrieval

Road Map
  • Cross-lingual IR
  • Motivation
  • Definition
  • General issues with CLIR
  • Basic approaches to CLIR
  • CLIR evaluation
  • CLIR applications

Information Retrieval
  • Single language: both the user's query and the documents to be searched are in the same language.
  • Cross language: the documents are written in a language different from that of the user's query.
  [Figure labels: documents, query]

The Internet Big Picture
Growth rate of online language use across the world's continents, 2000-2010 (data updated June 30, 2010)
World Internet Users and 2015 Population Stats

  World Regions       Population      Internet Users   Penetration      Users %    Growth
                                                       (% population)   of Table   2000-2015
  Africa              1,158,355,663     313,257,074    27.0%             9.6%      6,839%
  Asia                4,032,466,882   1,563,208,143    38.8%            47.8%      1,268%
  Europe                821,555,904     604,122,380    73.5%            18.5%        475%
  Middle East           236,137,235     115,823,882    49.0%             3.5%      3,426%
  North America         357,172,209     313,862,863    87.9%             9.6%        191%
  Latin America         617,776,105     333,115,908    53.9%            10.2%      1,743%
  Oceania/Australia      37,157,120      27,100,334    72.9%             0.8%        256%
  World Total         7,260,621,118   3,270,490,584    45%             100%          806%

Usage of content languages for websites

  2002                  2015
  English     72%       English     54.5%
  German       7%       Russian      5.9%
  Japanese     6%       German       5.7%
  Spanish      3%       Japanese     5.0%
  French       3%       Spanish      4.7%
  Italian      2%       French       4.1%
  Dutch        2%       Portuguese   2.6%
  Chinese      2%       Chinese      2.2%
  Korean       1%       Italian      2.1%
  Russian      1%       Polish       1.9%
  Portuguese   1%       Turkish      1.6%

Cross Language IR
  • Motivation
      • Information unavailability in some languages
      • Language barrier
  • Definition
      • Cross-language information retrieval (CLIR) is a subfield of information retrieval dealing with retrieving information written in a language different from the language of the user's query (Wikipedia).
  • Example
      • A user may pose a query in Chinese but retrieve relevant documents written in English.

Why do we need CLIR systems?
  • We need technologies that enable access to information regardless of geographic or language barriers.
  • The goal is to find, retrieve, and understand relevant information in whatever language or form.
  • CLIR has become one of the key factors affecting knowledge sharing all over the world.

General Issues With CLIR
  • Multilingual text access (character sets, etc.)
  • Differences between languages: stemming, compound words, breaks between words, etc.
  • Term ambiguity between languages
  • What to translate (query vs. document) and how

Matching strategies
  • No translation
      • (1) Cognate matching
  • Translation
      • (2) Query translation
      • (3) Document translation
      • (4) Interlingual techniques

Cognate matching
  • In the most naive form of cognate matching, untranslatable terms such as proper nouns or technical terminology are left unchanged through the translation stage.
  • The unchanged term can be expected to match a corresponding term in the other language if the two languages are closely related (for example, "generation" in English and French).
  • When the two languages are very different, cognate matching may still be feasible by measuring the similarity between a transliteration and its original word.

Query translation
  [Figure labels: Chinese query, translation system, French query, search engine, French document collection, French documents, results, select and browse]
  Process:
  • Translate the Chinese query into French.
  • Search the French document collection.
  • Translate the retrieved results into Chinese.

Query translation
  • Query translation is the most widely used matching strategy for CLIR because of its tractability.
  • The retrieval system does not have to change its inverted index of terms in any way to handle queries in any language.
  • Translating a query is less computationally costly than translating a large set of documents.
  • Challenge: term ambiguity
      • Queries are often short, and short queries provide little context for disambiguation.
      • Term disambiguation is discussed later.

Advantages and disadvantages of query translation
  • Advantages
      • Simple
      • Easy to operate
      • Flexible
      • Saves time and space; efficient
  • Disadvantages
      • Lacks context
      • For short queries, translation ambiguity is high

Document translation
  [Figure labels: Chinese query, French document collection, translation system, Chinese document collection, search engine, results, select and browse]
  Process:
  • Translate the entire French document collection into Chinese.
  • Search the Chinese documents directly.

Document translation
  • Document translation has the opposite advantages and disadvantages from query translation.
  • In CLIR experiments this approach is not usually used, and query translation is dominant.
  • However, some researchers have used it to translate large sets of documents, since the more varied context within each document is available for translation and can improve translation quality.
  • Oard and Hackett (1998) reported that automatic machine translation of a set of documents with a commercial MT system outperformed query translation in a German-to-English CLIR experiment.

Advantages and disadvantages of document translation
  • Advantages
      • Documents are translated only once
      • Documents provide richer context
      • Documents can be translated offline in advance
  • Disadvantages
      • Translation is slow
      • Consumes a large amount of space and time; inefficient
      • Depends on the quality of the machine translation system

Query translation vs. document translation
  • The choice depends on the language resources available.
  • Query translation is generally more widely used.
  • Both approaches raise challenges for interaction.

Interlingual approach
  • An intermediate space of subject representation, into which both the query and the documents are converted, is used to compare them.
  • One type of interlingual approach uses the synsets provided in WordNet, a well-known machine-readable thesaurus.
  • For example, Diekema, Oroumchian, Sheridan, and Liddy (1999) employed WordNet synset numbers as language-independent representations for CLIR.
  • Since a synset number (label) representing a concept corresponds to a set of concrete words in each supported language (e.g., English and French), a query term in the source language can be linked to words in the target language via the synset number.

Translation techniques

Dictionary-based methods
  • Use a bilingual machine-readable dictionary (MRD).
  • Most retrieval systems are still based on so-called bag-of-words architectures, in which both query statements and document texts are decomposed into a set of words (or phrases) through indexing.
  • Thus a query can be translated easily by replacing each query term with its translation equivalents from a bilingual dictionary or bilingual term list.

Bilingual dictionary
  [Figure: bilingual dictionary]

Term translation
  [Figure: candidate translations of segmented query terms: oil / petroleum; probe / survey / take samples; restrain; cymbidium goeringii. Annotations: "Which translation should be chosen?", "No translation available!", "Segmentation error".]

Some issues in term translation
  • Compound words, for example in German
      • Decomposition
  • No boundary between words, e.g. in Chinese
      • Segmentation
  • Specialized vocabulary not contained in the dictionary, e.g. named entities

Examples
  • Compound decomposition
  • Chinese word segmentation
      • 新西兰花
      • 新西兰 花: New Zealand flowers
      • 新 西兰花: fresh broccoli

Corpora-based method
  • Parallel or comparable corpora are useful resources from which information beneficial for CLIR can be extracted.
  • For example, in order to translate English queries into Spanish, Davis and Dunning (1995) extracted moderately frequent Spanish terms from Spanish documents aligned with English documents that had been retrieved using an English query (the source query).

Parallel corpora
  • A parallel corpus (pl. corpora) is a document collection composed of two or more disjoint subsets, each written in a different language, such that the documents in each subset are translations of the documents in every other subset.
  • Very high accuracy

The Rosetta Stone
  [Figure labels: hieroglyphs, ancient Egyptian script, Greek]
  • The Rosetta Stone, 1.14 m high and 0.73 m wide, is a granodiorite stele made in 196 BC, originally inscribed with a decree of the Egyptian king Ptolemy V. The same content is carved on it in Greek, ancient Egyptian hieroglyphs, and the demotic script in common use at the time. Because the stone carries the same text in three versions, modern scholars were able to compare them and decipher the meaning and structure of Egyptian hieroglyphs, which had been lost for more than a thousand years. The stone has become an important milestone in the study of ancient Egyptian history.

More parallel corpora
  • News:
      • DE-News (German-English)
      • Hong Kong News, Xinhua News (Chinese-English)
  • Government documents:
      • Canadian Hansards (French-English)
      • Europarl (Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portuguese, Spanish, Swedish)
      • UN Treaties (Russian, English, Arabic, ...)
  • Bible (many, many languages)

Examples

  English: Diverging opinions about planned tax reform
  German:  Unterschiedliche Meinungen zur geplanten Steuerreform

  English: The discussion around the envisaged major tax reform continues.
  German:  Die Diskussion um die vorgesehene grosse Steuerreform dauert an.

  English: The FDP economics expert, Graf Lambsdorff, today came out in favor of advancing the enactment of significant parts of the overhaul, currently planned for 1999.
  German:  Der FDP-Wirtschaftsexperte Graf Lambsdorff sprach sich heute dafuer aus, wesentliche Teile der fuer 1999 geplanten Reform vorzuziehen.

Comparable corpora
  • A comparable corpus is a pair of corpora in two different languages that come from the same domain.
  • The texts cover the same topic.
  • Parallel sentences may also be mined from comparable corpora, such as news stories written on the same topic in different languages.
  • Some researchers extract phrase pairs from comparable corpora using a classifier approach.

Example
  • The WWW provides rich and ubiquitous machine-readable resources from which information useful for CLIR may be extracted automatically.
  • For example, Chen (2002) and Chen and Gey (2003) used a general Internet search engine to find English translation equivalents of Chinese or Japanese terms (mainly proper nouns) by analyzing the contexts of these terms in the Chinese and Japanese Web documents returned by the engine.

Term disambiguation techniques
  • Disambiguation among multiple alternative term translations: how should one translation be chosen from several? (e.g., Apple, Bank)
  • Use of part-of-speech (POS) tags
  • Use of a parallel corpus
  • Use of co-occurrence statistics in the target corpus
  • Use of the query expansion technique

Use of part-of-speech tags
  • The basic idea of using part-of-speech (POS) tags for translation disambiguation is to select only translations that have the same POS as the source query term.
  • This method requires that POS tagging software be available for both languages.

Parallel corpus-based disambiguation
  • Davis (1997, 1998) used a parallel corpus to determine the best translation or set of translations: a single translation for each source term was selected from the set of translations listed in an MRD according to the result of searching a parallel corpus.

Translation probability
  [Figure: multiple candidate translations with their translation probabilities. Labels: 探测, survey, 试探, 样品, 测量; probabilities: p=0.4, p=0.3, p=0.25, p=0.05]

Disambiguation based on co-occurrence statistics
  • The correct translations of query terms should co-occur in target-language documents, and incorrect translations should tend not to co-occur.
  • First, the two most related terms in the query are determined from co-occurrence statistics in the source-language corpus; then the best translations are selected from all pairs of translations of these two terms according to co-occurrence statistics in the target-language corpus.
  • Note that these two corpora do not have to be parallel or comparable.

Query expansion for disambiguation
  • Pseudo relevance feedback (PRF), also known as blind feedback, is widely recognized as an effective technique for enhancing information retrieval performance. PRF also works effectively for CLIR tasks.
  • In CLIR, two kinds of PRF are feasible:
      • Pre-translation feedback
      • Post-translation feedback

Pre-translation feedback
  • If a source-language corpus is available, documents can be retrieved from it prior to translation in order to add a set of new terms to the source query (pre-translation feedback).
  • Pre-translation feedback may improve precision, because PRF is performed on the entire query rather than on each source term separately. That is, synonyms or related terms corresponding to the correct meaning of each source term within the context of the query are expected to be added automatically through the PRF process.

Post-translation feedback
  • After translation, standard PRF can be applied using the target document collection (post-translation feedback).
  • Post-translation feedback can be considered a device for improving recall, as shown in standard experiments on monolingual retrieval.
  • In CLIR, two well-known methods for weighting terms in the top-ranked documents are often used to select good terms: the Rocchio method and the probabilistic method.

Bi-directional translation
  • Boughanem et al. (2002) explored a bi-directional translation technique in which a form of backward translation is used for ranking translation candidates. Suppose that English query terms must be translated into French. In bi-directional translation, a set of French equivalents for an English term is first found in an English-French dictionary. Next, using a French-English dictionary, each French equivalent is translated back into a set of English terms. Basically, if that set includes the original source term, the French translation equivalent is chosen as a preferred translation.

CLIR evaluation
  • Information retrieval evaluation
      • Given: a search topic, a document collection, and a set of documents judged relevant by human assessors
      • The results returned by the system are judged against these
  • TREC CLIR (96-02): English to other languages
  • CLEF (00-): between European languages
  • NTCIR (99-): Asian languages and English

CLIR evaluation model
  [Figure: CLIR evaluation model]

Applications of CLIR

2.1 Cross-language search engines
  • April 25, 2006: European search engine "Quaero"
      • The French President announced 90 million euros of support.
  • May 16, 2007: Google Translate
      • Provides CLIR for 12 languages
      • Goal: take all of the Web and translate it into multiple languages
  • May 5, 2008: Yahoo Babel Fish
      • Provides CLIR between 12 languages
      • It was AltaVista's project, later bought by Yahoo

Google Translate
  http:/

Yahoo Babel Fish
  http:/

Question
  • Compare the cross-language search engines of Google and Yahoo!, and analyze the advantages and disadvantages of each.
  • Google: one step (translate & search); the search results are translated back into the source language. Advantages: fast, and users can easily understand the results. Disadvantage: users cannot modify the translation.
  • Yahoo!: two steps (translate + search); the search results are not translated. Advantage: the intermediate step lets users modify the translation. Disadvantages: more complex, and users cannot read the results.

2.2 Cross-language retrieval in digital libraries
  • WorldWideScience, a cross-language search platform for world science, was released on June 11, 2010, at the summer meeting of ICSTI (International Council for Scientific and Technical Information) in Helsinki, Finland.

WorldWideScience
  http://worldwidescience.org/multilingual
  • The alliance's member organizations are all professional library and information institutions or leading bodies in scientific and technical information, such as the US Department of Energy Office of Scientific and Technical Information (OSTI), the Library of Congress, the British Library, the Canada Institute for Scientific and Technical Information, the Korea Institute of Science and Technology Information, and the Institute of Scientific and Technical Information of China.
  • The platform can also automatically perform cross-language, cross-database searches.

2.3 Cross-language patent retrieval
  • According to the World Intellectual Property Organization (WIPO), patent documents contain 90%-95% of the world's scientific research results, while other technical documents (papers, journals, etc.) contain only 5%-10% of R&D results.
  • Making good use of patent searching in research work can shorten R&D time by 60% and reduce R&D costs by 40%.
  • In May 2010, WIPO released a beta version of PATENTSCOPE, its cross-language patent retrieval system, marking the move of cross-language information retrieval in patent search from the laboratory to practical use.
  • The system offers cross-language patent retrieval only among five languages: English, French, German, Japanese, and Spanish.

2.4 Cross-language image retrieval

2.5 Applications in e-commerce
  • CINDOR is currently a relatively successful commercial cross-language information retrieval system.
  • CINDOR has three core technologies: Conceptual Interlingua, Language Analysis, and Search Management.
  • CINDOR currently supports English, French, and Spanish; Simplified Chinese, Russian, and Arabic are under development.

Reference
  • Kazuaki Kishida. Technical issues of cross-language information retrieval: a review. Information Processing and Management, 2005 (41), pp. 433-455.
  • Ge Yundong. Research on query translation techniques for cross-language information retrieval [D]. Soochow University, 2010.
  • Wang Xuwen. Research on cross-language information retrieval based on topic pseudo relevance feedback [D]. Beijing University of Posts and Telecommunications, 2014.
  • Peng Lin. Research on measuring the semantic similarity of Chinese words and its application in cross-language information retrieval [D]. Fudan University, 2010.

Challenges for interaction
  • CLIR poses some unique challenges for interaction:
      • How do you help users select translated query terms?
      • How do you help users select document terms for query refinement?
      • How do you compensate for poor translation quality?

Cross-Language Information Access (CLIA)
  [Figure: the CLIR system inside the CLIA system. Labels: Need Identification, Need Negotiation, Need Clarification, Need Instantiation, Question Analysis, Query Formulation, Request Generation, Source Selection, Result Processing, Result Presentation, Result Selection, Result Examination, Result Classification, Result Visualization, Result Summarization, Information Extraction, Query Reformulation, Relevance Feedback]

CLIA vs. CLIR
  • Cross-Language Information Retrieval
      • A narrow view of CLIA
      • CLIR is limited, but good for developing matching techniques
  • Cross-Language Information Access
      • Aims to help users find the information they want
      • Concerned with more than just the ranking of results

Cross-language information access
  • User-centered
      • Focuses on the interaction between the user and the system
      • Relevance depends on the particular user and the particular situation
  • Interaction
      • Information needs cannot be completely understood
      • Language ambiguity
  • A broader range of needs and uses
      • Multimedia: images, audio
      • Focused information: passage retrieval, question answering
      • Condensed information: summarization, information extraction

Life cycle of cross-language information access
  [Figure labels: query formulation, query, query translation, translated query, retrieval, ranked result list, document selection, documents to examine, document examination, documents to deliver, query reformulation, translation reselection, document reselection]

Supporting query (re)formulation
  • Problems
      • Term mismatch: query translations may not match the terms in the documents
      • Translations are in a foreign language
          • How to display, interpret, and control them
          • Is query translation an extra step?
  • Query reformulation
      • Where and how to get information

User-assisted query translation
  [Figure: user-assisted query translation]

Supporting document (re)selection
  • Selection needs translated surrogates
      • How to generate surrogates?
      • How to translate surrogates?
  • Examination needs translated documents

Surrogate generation
  • How to generate surrogates
      • First N words of the document (good for news articles)
      • Key Word In Context, automatic summarization
      • Passage retrieval
  • How to translate surrogates
      • Gloss translation: term-by-term translation
      • Phrase translation: translate only the phrases in the documents
      • Machine translation

Surrogates help users select and judge documents
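
The dictionary-based query translation and the bi-directional translation check described in the slides above can be illustrated with a minimal sketch. The two toy dictionaries and the function names below are assumptions made for illustration only; they do not come from Boughanem et al. (2002) or from any real lexicon, and untranslatable terms are simply passed through unchanged, as in naive cognate matching.

```python
# Toy bilingual dictionaries: hypothetical entries for illustration only.
EN_FR = {
    "bank": ["banque", "rive", "banc"],
    "oil": ["pétrole", "huile"],
}
FR_EN = {
    "banque": ["bank"],
    "rive": ["shore", "bank"],
    "banc": ["bench", "shoal"],
    "pétrole": ["oil", "petroleum"],
    "huile": ["oil"],
}


def bidirectional_candidates(term, forward, backward):
    """Split the forward translations of `term` into two lists:
    those whose backward translations include `term` (preferred),
    and the remaining candidates."""
    preferred, other = [], []
    for candidate in forward.get(term, []):
        if term in backward.get(candidate, []):
            preferred.append(candidate)
        else:
            other.append(candidate)
    return preferred, other


def translate_query(terms, forward, backward):
    """Bag-of-words query translation: replace each term with its
    preferred translations, fall back to the other candidates, and
    leave terms missing from the dictionary unchanged."""
    translated = []
    for term in terms:
        preferred, other = bidirectional_candidates(term, forward, backward)
        if preferred:
            translated.extend(preferred)
        elif other:
            translated.extend(other)
        else:
            translated.append(term)  # untranslatable, e.g. a proper noun
    return translated


if __name__ == "__main__":
    print(bidirectional_candidates("bank", EN_FR, FR_EN))
    print(translate_query(["oil", "bank", "Quaero"], EN_FR, FR_EN))
```

In this sketch the backward check only filters candidates; a fuller system would typically combine it with translation probabilities or co-occurrence statistics rather than rely on it alone.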
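
The idea behind disambiguation based on co-occurrence statistics (correct translations of query terms should co-occur in target-language documents) can be sketched as follows. This is a simplified illustration under stated assumptions: the source terms, candidate lists, and the five-document "target corpus" are hypothetical toy data, documents are reduced to sets of words, and co-occurrence is counted at the document level.

```python
from itertools import product

# Hypothetical target-language corpus: each document is a set of terms.
TARGET_DOCS = [
    {"oil", "exploration", "survey", "field"},
    {"petroleum", "survey", "drilling"},
    {"oil", "survey", "seismic"},
    {"oil", "painting", "canvas"},
    {"probe", "space", "mars"},
]

# Hypothetical dictionary candidates for two related source query terms.
CANDIDATES = {
    "石油": ["oil", "petroleum"],
    "勘探": ["probe", "survey"],
}


def cooccurrence(a, b, docs):
    """Number of documents that contain both terms."""
    return sum(1 for doc in docs if a in doc and b in doc)


def best_pair(term1, term2, candidates, docs):
    """Among all pairs of candidate translations for the two source terms,
    return the pair that co-occurs most often in the target corpus."""
    pairs = product(candidates[term1], candidates[term2])
    return max(pairs, key=lambda pair: cooccurrence(pair[0], pair[1], docs))


if __name__ == "__main__":
    print(best_pair("石油", "勘探", CANDIDATES, TARGET_DOCS))
```

Raw document counts are used here only for brevity; an association measure such as mutual information is a common alternative, and, as the slide notes, the target corpus does not need to be parallel or comparable to the source corpus.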
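
The segmentation ambiguity in the 新西兰花 example can be reproduced with a small sketch that enumerates every way of splitting a string into dictionary words. The four-entry lexicon is an assumption chosen to match the slide; real segmenters use much larger lexicons plus statistical models to rank the alternatives.

```python
# Hypothetical mini-lexicon covering only the words in the slide's example.
LEXICON = {"新", "新西兰", "西兰花", "花"}


def segmentations(text, lexicon):
    """Return every segmentation of `text` into words from `lexicon`,
    as a list of word lists (empty list if no segmentation exists)."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in lexicon:
            for rest in segmentations(text[end:], lexicon):
                results.append([word] + rest)
    return results


if __name__ == "__main__":
    for words in segmentations("新西兰花", LEXICON):
        print(" ".join(words))
```

Both readings from the slide (新 西兰花 and 新西兰 花) are valid against this lexicon, which is why a dictionary lookup alone cannot decide which query terms to translate.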
