第09章文本处理TextProcessing

上传人:沈*** 文档编号:174993320 上传时间:2022-12-18 格式:PPT 页数:28 大小:773KB
收藏 版权申诉 举报 下载
第09章文本处理TextProcessing_第1页
第1页 / 共28页
第09章文本处理TextProcessing_第2页
第2页 / 共28页
第09章文本处理TextProcessing_第3页
第3页 / 共28页
资源描述:

《第09章文本处理TextProcessing》由会员分享,可在线阅读,更多相关《第09章文本处理TextProcessing(28页珍藏版)》请在装配图网上搜索。

1、12/18/2022 2:53 AMText Processing1wChapter 9:Text Processing12/18/2022 2:53 AMText Processing2Outline and ReadingwStrings and Pattern Matching(9.1)wTries(9.2)w Text Compression(9.3)wOptional:Text Similarity(9.4).No Slides.12/18/2022 2:53 AMText Processing3Texts&Pattern Matching1abacaab234abacababaca

2、b12/18/2022 2:53 AMText Processing4StringswA string is a sequence of characterswExamples of strings:nJava programnHTML documentnDNA sequencenDigitized imagewAn alphabet S S is the set of possible characters for a family of stringswExample of alphabets:nASCIInUnicoden0,1nA,C,G,TwLet P be a string of

3、size m nA substring Pi.j of P is the subsequence of P consisting of the characters with ranks between i and jnA prefix of P is a substring of the type P0.inA suffix of P is a substring of the type Pi.m-1 wGiven strings T(text)and P(pattern),the pattern matching problem consists of finding a substrin

4、g of T equal to PwApplications:nText editorsnSearch enginesnBiological research12/18/2022 2:53 AMText Processing5Brute-Force AlgorithmwThe brute-force pattern matching algorithm compares the pattern P with the text T for each possible shift of P relative to T,until eitherna match is found,ornall pla

5、cements of the pattern have been triedwBrute-force pattern matching runs in time O(nm)wExample of worst case:nT=aaa ahnP=aaahnmay occur in images and DNA sequencesnunlikely in English textAlgorithm BruteForceMatch(T,P)Input text T of size n and pattern P of size mOutput starting index of a substring

6、 of T equal to P or-1 if no such substring exists for i 0 to n-m test shift i of the pattern j 0while j n-1return -1 no match 12/18/2022 2:53 AMText Processing8Example1abacaabadcabacabaabb234567891012abacababacababacababacababacababacab111312/18/2022 2:53 AMText Processing9AnalysiswBoyer-Moores algo

7、rithm runs in time O(nm+s)wExample of worst case:nT=aaa anP=baaawThe worst case may occur in images and DNA sequences but is unlikely in English textwBoyer-Moores algorithm is significantly faster than the brute-force algorithm on English text111aaaaaaaaa23456baaaaabaaaaabaaaaabaaaaa7891012131415161

8、71819202122232412/18/2022 2:53 AMText Processing10The KMP Algorithm-MotivationwKnuth-Morris-Pratts algorithm compares the pattern to the text in left-to-right,but shifts the pattern more intelligently than the brute-force algorithm.wWhen a mismatch occurs,what is the most we can shift the pattern so

9、 as to avoid redundant comparisons?wAnswer:the largest prefix of P0.j that is a suffix of P1.jxj.a b a a b.a b a a b aa b a a b aNo need torepeat thesecomparisonsResumecomparinghere12/18/2022 2:53 AMText Processing11KMP Failure FunctionwKnuth-Morris-Pratts algorithm preprocesses the pattern to find

10、matches of prefixes of the pattern with the pattern itselfwThe failure function F(j)is defined as the size of the largest prefix of P0.j that is also a suffix of P1.jwKnuth-Morris-Pratts algorithm modifies the brute-force algorithm so that if a mismatch occurs at Pj Ti we set j F(j-1)j012345Pjabaaba

11、F(j)001123xj.a b a a b.a b a a b aF(j-1)a b a a b a12/18/2022 2:53 AMText Processing12The KMP AlgorithmwThe failure function can be represented by an array and can be computed in O(m)timewAt each iteration of the while-loop,eitherni increases by one,ornthe shift amount i-j increases by at least one(

12、observe that F(j-1)j)wHence,there are no more than 2n iterations of the while-loopwThus,KMPs algorithm runs in optimal time O(m+n)Algorithm KMPMatch(T,P)F failureFunction(P)i 0j 0while i 0j Fj-1elsei i+1return -1 no match 12/18/2022 2:53 AMText Processing13Computing the Failure FunctionwThe failure

13、function can be represented by an array and can be computed in O(m)timewThe construction is similar to the KMP algorithm itselfwAt each iteration of the while-loop,eitherni increases by one,ornthe shift amount i-j increases by at least one(observe that F(j-1)j)wHence,there are no more than 2m iterat

14、ions of the while-loopAlgorithm failureFunction(P)F0 0i 1j 0while i 0 thenuse failure function to shift Pj Fj-1elseFi 0 no match i i+112/18/2022 2:53 AMText Processing14Example1a b aca a b aca b aca b a a b b7819181715a b aca b161413234569a b aca ba b aca ba b aca ba b aca b10 11 12cj012345PjabacabF

15、(j)00101212/18/2022 2:53 AMText Processing15Triesenimizenimizezezeimimizenimizeze12/18/2022 2:53 AMText Processing16Preprocessing StringswPreprocessing the pattern speeds up pattern matching queriesnAfter preprocessing the pattern,KMPs algorithm performs pattern matching in time proportional to the

16、text sizewIf the text is large,immutable and searched for often(e.g.,works by Shakespeare),we may want to preprocess the text instead of the patternwA trie is a compact data structure for representing a set of strings,such as all the words in a textnA tries supports pattern matching queries in time

17、proportional to the pattern size12/18/2022 2:53 AMText Processing17Standard Trie(1)wThe standard trie for a set of strings S is an ordered tree such that:nEach node but the root is labeled with a characternThe children of a node are alphabetically orderednThe paths from the external nodes to the roo

18、t yield the strings of SwExample:standard trie for the set of stringsS=bear,bell,bid,bull,buy,sell,stock,stop aebrllsullyetllockpid12/18/2022 2:53 AMText Processing18Standard Trie(2)wA standard trie uses O(n)space and supports searches,insertions and deletions in time O(dm),where:n total size of the

19、 strings in Sm size of the string parameter of the operationd size of the alphabet aebrllsullyetllockpid12/18/2022 2:53 AMText Processing19Word Matching with a TriewWe insert the words of the text into a triewEach leaf stores the occurrences of the associated word in the text s e eb e ar?s ellsto c

20、k!s e eb ull?b u ysto c k!bidsto c k!aah eth eb ell?sto p!bidsto c k!0123456789 10 11 12 13 14 15 16 17 18 19 20 21 22 2324 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 4647 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 6869 70 71 72 73 74 75 76 77 78 79 80 81 82 83 8

21、4 85 86ar87 88aeblsulete0,24ocilr6l78d47,58l30y36l12k17,40,51,62p84her69a12/18/2022 2:53 AMText Processing20Compressed TriewA compressed trie has internal nodes of degree at least twowIt is obtained from standard trie by compressing chains of“redundant”nodesebarllsullyelltockpidaebrllsullyetllockpid

22、12/18/2022 2:53 AMText Processing21Compact RepresentationwCompact representation of a compressed trie for an array of strings:nStores at the nodes ranges of indices instead of substringsnUses O(s)space,where s is the number of strings in the arraynServes as an auxiliary index structures e eb e a rs

23、e l ls t o c kb u l lb u yb i dh eb e l ls t o p0 1 2 3 4a rS0=S1=S2=S3=S4=S5=S6=S7=S8=S9=0 1 2 30 1 2 31,1,11,0,00,0,04,1,10,2,23,1,21,2,38,2,36,1,24,2,35,2,22,2,33,3,49,3,37,0,30,1,112/18/2022 2:53 AMText Processing22Suffix Trie(1)wThe suffix trie of a string X is the compressed trie of all the su

24、ffixes of Xenimizenimizezezeimimizenimizezem iniz em i0 1 2 3 4 5 6 712/18/2022 2:53 AMText Processing23Suffix Trie(2)wCompact representation of the suffix trie for a string X of size n from an alphabet of size dnUses O(n)spacenSupports arbitrary pattern matching queries in X in O(dm)time,where m is

25、 the size of the pattern7,72,72,76,76,74,72,76,71,10,1m iniz em i0 1 2 3 4 5 6 712/18/2022 2:53 AMText Processing24Encoding Trie(1)wA code is a mapping of each character of an alphabet to a binary code-wordwA prefix code is a binary code such that no code-word is the prefix of another code-wordwAn e

26、ncoding trie represents a prefix codenEach leaf stores a characternThe code word of a character is given by the path from the root to the leaf storing the character(0 for a left child and 1 for a right childabcde000100111011abcde12/18/2022 2:53 AMText Processing25Encoding Trie(2)wGiven a text string

27、 X,we want to find a prefix code for the characters of X that yields a small encoding for XnFrequent characters should have long code-wordsnRare characters should have short code-wordswExamplenX=abracadabranT1 encodes X into 29 bitsnT2 encodes X into 24 bitscardbacdbrT1T212/18/2022 2:53 AMText Proce

28、ssing26Text Compression12/18/2022 2:53 AMText Processing27Huffmans AlgorithmwGiven a string X,Huffmans algorithm construct a prefix code the minimizes the size of the encoding of XwIt runs in timeO(n+d log d),where n is the size of X and d is the number of distinct characters of XwA heap-based prior

29、ity queue is used as an auxiliary structureAlgorithm HuffmanEncoding(X)Input string X of size nOutput optimal encoding trie for XC distinctCharacters(X)computeFrequencies(C,X)Q new empty heap for all c CT new single-node tree storing cQ.insert(getFrequency(c),T)while Q.size()1f1 Q.minKey()T1 Q.removeMin()f2 Q.minKey()T2 Q.removeMin()T join(T1,T2)Q.insert(f1+f2,T)return Q.removeMin()12/18/2022 2:53 AMText Processing28Exampleabcdr52112X=abracadabraFrequenciescardb52112cardb2522cabdr254cabdr2546cabdr24611

展开阅读全文
温馨提示:
1: 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
2: 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
3.本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
5. 装配图网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
关于我们 - 网站声明 - 网站地图 - 资源地图 - 友情链接 - 网站客服 - 联系我们

copyright@ 2023-2025  zhuangpeitu.com 装配图网版权所有   联系电话:18123376007

备案号:ICP2024067431-1 川公网安备51140202000466号


本站为文档C2C交易模式,即用户上传的文档直接被用户下载,本站只是中间服务平台,本站所有文档下载所得的收益归上传人(含作者)所有。装配图网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私,请立即通知装配图网,我们立即给予删除!