Chinese Word Auto-Confirmation Agent

Jia-Lin Tsai, Cheng-Lung Sung and Wen-Lian Hsu
Institute of Information Science, Academia Sinica
Nankang, Taipei, Taiwan, R.O.C.
tsaijl, clsung,

Abstract

In various Asian languages, including Chinese, there is no white space to delimit the words in a text. Thus, most Chinese natural language processing (NLP) systems must first perform word segmentation (sentence tokenization). However, successful word segmentation depends on having a sufficiently large lexicon, and on average about 3% of the words in a text are not contained in a lexicon. Unknown word identification has therefore become one of the bottlenecks for Chinese NLP systems.

In this paper, we present a method to simulate a Chinese word auto-confirmation (CWAC) agent. A CWAC agent uses a hybrid approach that takes advantage of both statistical methods and linguistic knowledge. The task of a CWAC agent is to auto-confirm whether an n-gram input (n >= 2) is a Chinese word. We design our CWAC agent to satisfy two criteria: (1) a precision greater than 98% and a recall greater than 75%, and (2) domain-independent performance (F-measure). These criteria assure that CWAC agents can work automatically, without human intervention. Furthermore, by combining several CWAC agents designed on different principles, a multi-CWAC agent can be constructed through a building-block approach.

Three experiments were conducted in this study. The first two show that our method is able to simulate a CWAC agent when the n-gram frequency is at least 4 and a large-scale corpus is used. The third shows that the word precision of our method is corpus-independent, while its word recall is corpus-dependent. Taken together, the results demonstrate that our CWAC agent satisfies the two criteria, achieving 96.31% and 97.82% word precision, 77.18% and 77.11% word recall, and 85.69% and 86.24% domain-independent word F-measure for n-gram frequencies of at least 3 and at least 4, respectively. No existing system achieves such high precision together with a domain-independent F-measure; our F-measure is better than that of other systems by at least 2-3%. The proposed method is our first attempt at simulating a CWAC agent. We will continue developing other CWAC agents and integrating them into a multi-CWAC agent system.

Keywords: natural language processing, word segmentation, unknown word, agent

1. Introduction

For human beings, efficient word segmentation (in Chinese) and word sense disambiguation (WSD) both arise naturally while a sentence is being understood. Until now, however, it has remained a very difficult research problem to get computers to process and understand natural language equally well. One of the reasons is that computers are much weaker than humans at handling and creating unseen knowledge while reading texts [Dreyfus 1992]. Here, unseen knowledge refers to contextual meaning, affective meaning, connotative meaning, emotive meaning, unknown lexicon, etc. From our observation, the most universal type of unseen knowledge for any natural language processing (NLP) system is the unknown lexicon. Generally, the problem of unknown lexicon identification is to identify (1) unknown words, (2) unknown word senses, (3) the unknown part-of-speech (POS) of a word and (4) unknown word pronunciations. Unknown word identification (UWI) is the most essential step in dealing with unknown lexicons; however, UWI is still a difficult problem for Chinese NLP systems. According to [Lin et al. 1993; Chang et al. 1997; Lai et al. 2000; Chen et al. 2002; Sun et al. 2002], the difficulty of Chinese UWI arises from the following problems. First, just as in other Asian languages, Chinese sentences are composed of strings of characters without blank spaces to mark word boundaries. Second, every Chinese character can be either a morpheme or a word; take the Chinese character 花 as an example: it can be either a free morpheme or a word. Third, unknown words, which are usually compound words and proper names, are too numerous to list in a machine-readable dictionary (MRD).

To resolve these unknown word issues in Chinese, statistical approaches, linguistic approaches and hybrid approaches have been developed and investigated. In statistical approaches, researchers use common statistical features, such as maximum entropy [Yu et al. 1998; Chieu et al. 2002], association strength [Smadja 1993; Dunning 1993], mutual information [Florian et al. 1999; Church 2000], ambiguous matching [Chen & Liu 1992; Sproat et al. 1996], and multi-statistical features [Chang et al. 1997], for unknown word detection and extraction. In linguistic approaches, three major types of linguistic rules (knowledge), i.e. morphology, syntax, and semantics, are used to identify unknown words. Recently, one important trend in UWI has been to follow a hybrid approach, so as to take advantage of both the merits of statistical methods and linguistic knowledge. In general, statistical approaches are simple and efficient, whereas linguistic approaches are effective in identifying low-frequency unknown words [Chang et al. 1997; Chen et al. 2002].

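To make the statistical side concrete, the sketch below (ours, not the paper's method) scores character bigrams in a toy corpus by pointwise mutual information, one of the association features cited above; high-scoring bigrams become unknown-word candidates.

```python
import math
from collections import Counter

def pmi_scores(corpus):
    """Score each character bigram by pointwise mutual information,
    PMI(ab) = log2(P(ab) / (P(a) * P(b))); high-PMI bigrams are
    candidate words. Toy illustration only."""
    chars, bigrams = Counter(), Counter()
    n_chars = n_bigrams = 0
    for sent in corpus:
        chars.update(sent)
        n_chars += len(sent)
        for a, b in zip(sent, sent[1:]):
            bigrams[a + b] += 1
            n_bigrams += 1
    return {bg: math.log2((n / n_bigrams) /
                          ((chars[bg[0]] / n_chars) * (chars[bg[1]] / n_chars)))
            for bg, n in bigrams.items()}

# 腸病毒 recurs, so its internal bigram 腸病 outscores the bigram 毒流
# that straddles a word boundary.
corpus = ["腸病毒流行", "預防腸病毒", "腸病毒疫情", "市場流行"]
scores = pmi_scores(corpus)
print(round(scores["腸病"], 2), round(scores["毒流"], 2))
```

In a real system the scores would be computed over a large corpus and combined with other features before any confirmation decision.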
Auto-detection and auto-confirmation are the two major steps in most UWI systems. Auto-detection detects possible n-grams (candidates) in running texts, so that in the auto-confirmation stage the system needs to focus only on this set of possible n-grams. In most cases, recall and precision are affected by auto-detection and auto-confirmation, respectively. Since a trade-off occurs between recall and precision, deriving a hybrid approach with precision-recall optimization has become one of the major challenges in this field [Chang et al. 1997; Chen et al. 2002].

In this paper, we introduce a Chinese word auto-confirmation (CWAC) agent, which uses a hybrid approach to eliminate the need for human intervention. A CWAC agent is a program that automatically confirms whether an n-gram input is a Chinese word. We design our CWAC agent to satisfy two criteria: (1) a precision greater than 98% and a recall greater than 75%, and (2) domain-independent performance (F-measure). These criteria assure that CWAC agents can work automatically, without human intervention. To our knowledge, no existing system has yet achieved these criteria.

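The F-measure here is the standard harmonic mean of precision and recall. As a quick sanity check (our arithmetic, not the paper's code), the two criteria and the reported precision/recall pairs reproduce the reported F-measures:

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# The two criteria (precision > 0.98, recall > 0.75) bound F from below:
print(round(f_measure(0.98, 0.75), 4))    # 0.8497
# The abstract's pairs reproduce its F-measures (85.69% and 86.24%):
print(round(f_measure(0.9631, 0.7718), 4))  # 0.8569
print(round(f_measure(0.9782, 0.7711), 4))  # 0.8624
```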
Furthermore, by combining several CWAC agents designed on different principles, a multi-CWAC agent can be constructed through a building-block approach and a service-oriented architecture (such as web services [Graham et al. 2002]). Figure 1 illustrates a multi-CWAC agent system combining three CWAC agents. Since the number of words identified by a multi-CWAC agent should be greater than that of any single CWAC agent, we believe a multi-CWAC agent will be able to maintain the 98% precision rate and increase its recall merely by integrating more CWAC agents.

Figure 1. An illustration of a multi-CWAC agent system

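The building-block idea can be sketched as follows. The interface and the any-member-confirms combination rule are our illustration (the paper does not define a concrete API), with two toy member agents standing in for real ones.

```python
from typing import Callable

# A CWAC agent maps an n-gram to True (word) or False (not a word).
# This interface and the any-member-confirms rule are our reading of
# the building-block idea, not an API defined in the paper.
CwacAgent = Callable[[str], bool]

def multi_cwac(agents: list) -> CwacAgent:
    """Combine CWAC agents: an n-gram is confirmed if any member
    confirms it, so recall can only grow as agents are added, while
    precision stays high if every member is high-precision."""
    def combined(ngram: str) -> bool:
        return any(agent(ngram) for agent in agents)
    return combined

# Toy members: a dictionary-lookup agent and a frequency-threshold agent.
system_dictionary = {"計程車", "食品公司"}
corpus_freq = {"腸病毒": 12}
by_dictionary = lambda ng: ng in system_dictionary
by_threshold = lambda ng: corpus_freq.get(ng, 0) >= 10
confirm = multi_cwac([by_dictionary, by_threshold])
print(confirm("計程車"), confirm("腸病毒"), confirm("市場指出"))  # True True False
```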
This article is structured as follows. In Section 2, we present a method for simulating a CWAC agent. Experimental results and analyses of the CWAC agent are presented in Section 3. Finally, conclusions and future directions are discussed in Section 4.

2. Development of the CWAC agent

To develop a CWAC agent, we use the CKIP lexicon [CKIP 1995] as our system dictionary: the top 50,000 words, in descending order of word frequency, were selected from the CKIP lexicon to create the system dictionary. From this lexicon, we use only the word and its POS in our algorithm.

2.1 Major Processes of the CWAC Agent

The task of a CWAC agent is to identify automatically whether an n-gram input (an n-character string) is a Chinese word. In this paper, an n-gram extractor is developed to extract all n-grams (n >= 2 and n-gram frequency >= 3) from the testing sentences as input for our CWAC agent (see Figure 2). (Note that n-gram frequencies vary widely with the testing sentences.)

Figure 2. An illustration of the n-gram extractor and a CWAC agent

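A minimal reading of the n-gram extractor described above can be sketched as follows; the cap on n is our practical assumption, since the paper does not state one.

```python
from collections import Counter

def extract_ngrams(sentences, min_n=2, min_freq=3, max_n=6):
    """Extract every character n-gram (n >= min_n) whose frequency in
    the testing sentences reaches min_freq, as CWAC-agent input.
    The cap max_n is our assumption; the paper states none."""
    counts = Counter()
    for sent in sentences:
        for n in range(min_n, min(max_n, len(sent)) + 1):
            for i in range(len(sent) - n + 1):
                counts[sent[i:i + n]] += 1
    return {ng: c for ng, c in counts.items() if c >= min_freq}

sentences = ["腸病毒流行", "預防腸病毒", "小心腸病毒"]
ngrams = extract_ngrams(sentences)
print(ngrams)  # {'腸病': 3, '病毒': 3, '腸病毒': 3}
```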
Figure 3 is the flow chart of the CWAC agent, in which the seven major processes are labeled (1) to (7). The confirmation types, brief descriptions and examples for the CWAC agent are given in Table 1. We apply an LFSL (linguistic first, statistical last) strategy to develop the CWAC agent. (Note: in Figure 3, the last two processes are statistical methods, and the remaining processes are developed from linguistic knowledge.) An LFSL approach combines a linguistic process (such as process 4) with a statistical process (such as process 6). The details of these major processes are described below.

Figure 3. Flow chart of the CWAC agent

Table 1. Confirmation results, types, brief descriptions and examples of the CWAC agent (the symbol / stands for a word boundary according to the system dictionary using RL-LWF)

Result | Type | Brief Description | Input → Output
Word | K0 | N-gram exists in system dictionary | 計程車 → 計程車 (1)
Word | K1 | Both polysyllabic words exist in online dictionary | 接駁公車 → 接駁/公車 (1)
Word | K2 | Two polysyllabic word compounds | 食品公司 → 食品/公司 (1)
Word | K3 | First and last words of the segmented n-gram are polysyllabic and N >= 3 | 東港黑鮪魚 → 東港/黑/鮪魚 (1)
Word | K4 | Segmentation ambiguity <= 50% | 腸病毒 → 腸/病毒 (1)
Word | K5 | N-gram frequency >= 10 | 阿爾巴尼亞裔 → 阿/爾/巴/尼/亞裔 (1)
Not word | D1 | Two polysyllabic word compounds with at least one function word | 問題一直 → 問題/一直 (2)
Not word | D2 | N-gram contains a function word | 市場指出 → 市場/指出 (2)
Not word | D5 | Segmentation ambiguity > 50% | 台北市立 → 台北市/立 (2)
Not word | D6 | Suffix is a Chinese digit string | 隊伍 → 隊/伍 (1)
Not word | D7 | Digit prefixing a polysyllabic word | 5火鍋 → 5/火鍋 (2)
Not word | D8 | N-gram is a classifier-noun phrase | 名學生 → 名/學生 (2)
Not word | D9 | N-gram includes an unknown symbol | 公司 → /公司 (2)
Not word | D0 | Unknown reason | (3)

(1) These n-grams were manually confirmed as words in this study.
(2) These n-grams were manually confirmed as non-words in this study.
(3) There were no auto-confirmation types D0 or K0 in this study.

Process 1. System dictionary checking: If the n-gram input can be found in the system dictionary, it is labeled as type K0, which means the n-gram exists in the system dictionary. For example, in Table 1 the n-gram 計程車 is a system word. (Note: we use the top 50,000 CKIP words as the system dictionary.)

Process 2. Segmentation by system dictionary: If the n-gram input cannot be found in the system dictionary, it is segmented by two strategies, left-to-right longest word first (LR-LWF) and right-to-left longest word first (RL-LWF), using the system dictionary. If the LR-LWF and RL-LWF segmentations of the n-gram input are different, the CWAC agent computes the product of the word lengths of each segmentation. The segmentation with the greater product is selected as the segmentation output of this stage; if both products are equal, the RL-LWF segmentation is selected. According to our experiments, the segmentation precision of RL-LWF is on average 1% greater than that of LR-LWF. Take the n-gram input 將軍用的毛毯 as an example: its LR-LWF and RL-LWF segmentations are 將軍/用/的/毛毯 and 將/軍用/的/毛毯, respectively. Since both products are equal (2×1×1×2 = 1×2×1×2), the selected segmentation output of this process is the RL-LWF one, 將/軍用/的/毛毯. Note that the two Process 2 boxes shown in Figure 3 perform the same task.

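Process 2 can be sketched as follows, assuming a toy lexicon; this reproduces the 將軍用的毛毯 example, where the two products tie (2×1×1×2 = 1×2×1×2) and the RL-LWF segmentation wins.

```python
def lr_lwf(text, lexicon, max_len=6):
    """Left-to-right longest word first; unmatched characters become
    single-character words."""
    out, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            if n == 1 or text[i:i + n] in lexicon:
                out.append(text[i:i + n])
                i += n
                break
    return out

def rl_lwf(text, lexicon, max_len=6):
    """Right-to-left longest word first."""
    out, j = [], len(text)
    while j > 0:
        for n in range(min(max_len, j), 0, -1):
            if n == 1 or text[j - n:j] in lexicon:
                out.append(text[j - n:j])
                j -= n
                break
    return out[::-1]

def product(seg):
    p = 1
    for w in seg:
        p *= len(w)
    return p

def segment(text, lexicon):
    """Process 2 as we read it: keep the segmentation with the larger
    product of word lengths; prefer RL-LWF on a tie."""
    lr, rl = lr_lwf(text, lexicon), rl_lwf(text, lexicon)
    return rl if product(rl) >= product(lr) else lr

lexicon = {"將軍", "軍用", "的", "毛毯"}
seg = segment("將軍用的毛毯", lexicon)
print(seg)  # ['將', '軍用', '的', '毛毯']
```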
Process 3. Stop word checking: To avoid confusion, the segmentation output of Process 2 is referred to as segmentation2 in the following. In this process, all the words in segmentation2 are checked against the stop word list, word by word. There are three types of stop words: beginning, middle, and end. The stop word list used in this study is given in Appendix A; these stop words are the single-character words computed to have at least a 1% chance of being the beginning, middle, or end word of an entry in HowNet [Dong 1999]. If the first or last word of segmentation2 can be found in the beginning or end stop word list, it is eliminated from segmentation2. For cases in which the word number of segmentation2 is greater than 2, middle stop word checking is triggered: if a middle word of segmentation2 can be found in the middle stop word list, the n-gram input is split into new strings at every matched stop word, and these new strings are re-sent to Process 1 as new n-gram inputs. For example, the segmentation2 of the n-gram input 可怕的腸病毒 is 可怕/的/腸/病毒. Since 的 is a middle stop word in this segmentation2, the new strings 可怕 and 腸病毒 are re-sent to Process 1 as new n-gram inputs.

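The trimming and splitting of Process 3 can be sketched like this, with toy stop lists (的 as the middle stop word, as in the example above); the real lists are in Appendix A.

```python
def stop_word_filter(seg, begin_stops, middle_stops, end_stops):
    """Process 3 as we read it: trim a beginning/end stop word, then
    split the remaining words at any middle stop word; each resulting
    chunk is re-joined and would be re-sent to Process 1."""
    if seg and seg[0] in begin_stops:
        seg = seg[1:]
    if seg and seg[-1] in end_stops:
        seg = seg[:-1]
    chunks, current = [], []
    for w in seg:
        if len(seg) > 2 and w in middle_stops:
            if current:
                chunks.append("".join(current))
            current = []
        else:
            current.append(w)
    if current:
        chunks.append("".join(current))
    return chunks

# Toy stop lists; only the middle list is populated here.
chunks = stop_word_filter(["可怕", "的", "腸", "病毒"], set(), {"的"}, set())
print(chunks)  # ['可怕', '腸病毒']
```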
Process 4. Part-of-speech (POS) pattern checking: The output of Process 3 is called segmentation3. If the word number of segmentation3 is 2, POS pattern checking is triggered. The CWAC agent first generates all possible POS combinations of the two words using the system dictionary. If only one POS combination is generated and it matches one of the following POS patterns: N/V, V/N, N/N, V/V, Adj/N, Adv/N, Adj/V, Adv/V, Adj/Adv, Adv/Adj, Adv/Adv and Adj/Adj, the 2-word string is tagged as a word and sent to Process 5. This is a rule-based linguistic approach that mixes syntactic knowledge with heuristic observation in order to identify compound words. For example, since the only POS combination generated for the segmentation3 食品/公司 is N/N, 食品公司 is sent to Process 5.

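Process 4 can be sketched as follows, with a toy POS lookup standing in for the system dictionary; the accepted pattern list is the one enumerated above.

```python
# POS patterns accepted by Process 4 (from the paper's list).
PATTERNS = {"N/V", "V/N", "N/N", "V/V", "Adj/N", "Adv/N", "Adj/V",
            "Adv/V", "Adj/Adv", "Adv/Adj", "Adv/Adv", "Adj/Adj"}

def pos_pattern_check(w1, w2, pos_dict):
    """Tag a 2-word string as a compound-word candidate only when the
    dictionary yields exactly one POS combination and it matches an
    accepted pattern. pos_dict is a toy stand-in for the system
    dictionary's POS lookup."""
    combos = {f"{p1}/{p2}" for p1 in pos_dict.get(w1, ())
                           for p2 in pos_dict.get(w2, ())}
    return len(combos) == 1 and combos <= PATTERNS

pos_dict = {"食品": ["N"], "公司": ["N"], "問題": ["N"], "一直": ["Adv"]}
print(pos_pattern_check("食品", "公司", pos_dict),
      pos_pattern_check("問題", "一直", pos_dict))  # True False
```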
Process 5. Online auto-learned dictionary checking: If the n-gram input can be found in the online auto-learned dictionary (which stores all the words auto-confirmed by the CWAC agent), it is labeled as type K1, which means the n-gram exists in the online auto-learned dictionary. Note that, in this paper, the online auto-learned words were not exported into the system dictionary dynamically.

Process 6. Segmentation ambiguity checking: This stage consists of four steps. 1) Thirty randomly selected sentences that include the n-gram input are extracted from either a large-scale or a fixed-size corpus; for example, the Chinese sentence 人人做環保 is a selected sentence that includes the n-gram input 人人. (Note that the number of selected sentences may be less than thirty, and may even be zero, because of corpus sparseness.) 2) The selected sentences are segmented by RL-LWF and LR-LWF, using the system dictionary together with the n-gram input. 3) For each selected sentence, if its RL-LWF and LR-LWF segmentations are different, it is regarded as a segmentation-ambiguous sentence; the Chinese sentence 人人做環保, for example, is not an ambiguous sentence. 4) Finally, the ambiguity ratio (# of segmentation-ambiguous sentences) / (# of selected sentences) is computed. If the ambiguity ratio is less than or equal to 50%, the n-gram input is confirmed as a word and labeled as type K1, K2 or K4 (see Figure 3); otherwise, it is labeled as type D1 or D2 (see Figure 3). According to our observation, the ambiguity ratios of non-word n-grams are usually greater than 50%. Note that the three Process 6 boxes shown in Figure 3 perform the same task. In addition, once the number of selected sentences is zero, the n-gram input is labeled as type D2.

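The four steps above can be sketched as follows; the compact longest-word-first segmenter and the sampling details are our simplified assumptions, not the paper's implementation.

```python
import random

def lwf(text, lexicon, rtl=False, max_len=6):
    """Longest word first segmentation, scanning left-to-right or
    (with rtl=True) right-to-left over the reversed string."""
    out, s, i = [], text[::-1] if rtl else text, 0
    while i < len(s):
        for n in range(min(max_len, len(s) - i), 0, -1):
            piece = s[i:i + n][::-1] if rtl else s[i:i + n]
            if n == 1 or piece in lexicon:
                out.append(piece)
                i += n
                break
    return out[::-1] if rtl else out

def ambiguity_ratio(ngram, sentences, lexicon, sample_size=30, seed=7):
    """Process 6 as we read it: sample up to 30 sentences containing
    the n-gram, segment each by LR-LWF and RL-LWF, and return the
    share of sentences whose two segmentations disagree."""
    pool = [s for s in sentences if ngram in s]
    if not pool:
        return None  # zero selected sentences; handled separately
    sample = random.Random(seed).sample(pool, min(sample_size, len(pool)))
    ambiguous = sum(lwf(s, lexicon) != lwf(s, lexicon, rtl=True)
                    for s in sample)
    return ambiguous / len(sample)

lexicon = {"人人", "做", "環保"}
sentences = ["人人做環保", "環保人人做"]
ratio = ambiguity_ratio("人人", sentences, lexicon)
print(ratio)  # 0.0: no ambiguity, so 人人 would be confirmed (<= 50%)
```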
Process 7. Threshold value checking: If the frequency of the n-gram input is greater than or equal to 10, it is labeled as type K5, which means the n-gram input is auto-confirmed as a Chinese word by threshold value checking. According to our experiments, if the CWAC agent directly confirms an n-gram input as a word whenever its frequency reaches a certain threshold value, the trade-off frequency for a 99% precision rate occurs at threshold value 7. From this observation, we set the system threshold value to 10 (higher than 7).

3. Experimental Results

The objective of the follow
