GappedBLASTandPSIBLASTanewgenerationofproteindatabasesearchprograms

上传人:痛*** 文档编号:134237359 上传时间:2022-08-12 格式:DOC 页数:57 大小:268KB
收藏 版权申诉 举报 下载
GappedBLASTandPSIBLASTanewgenerationofproteindatabasesearchprograms_第1页
第1页 / 共57页
GappedBLASTandPSIBLASTanewgenerationofproteindatabasesearchprograms_第2页
第2页 / 共57页
GappedBLASTandPSIBLASTanewgenerationofproteindatabasesearchprograms_第3页
第3页 / 共57页
资源描述:

《GappedBLASTandPSIBLASTanewgenerationofproteindatabasesearchprograms》由会员分享,可在线阅读,更多相关《GappedBLASTandPSIBLASTanewgenerationofproteindatabasesearchprograms(57页珍藏版)》请在装配图网上搜索。

1、Gapped BLAST and PSI-BLAST: a new generation of protein database search programsStephen F. Altschul*, Thomas L. Madden, Alejandro A. Schffer1, Jinghui Zhang, Zheng Zhang2, Webb Miller2 and David J. Lipman National Center for Biotechnology Information, National Library of Medicine, National Institute

2、s of Health, Bethesda, MD 20894, USA, 1Laboratory of Genetic Disease Research, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA and 2Department of Computer Science and Engineering, Pennsylvania State University, University Park, PA 16802, USA Received

3、June 20, 1997; Revised and Accepted July 16, 1997 ABSTRACT The BLAST programs are widely used tools for searching protein and DNA databases for sequence similarities. For protein comparisons, a variety of definitional, algorithmic and statistical refinements described here permits the execution time

4、 of the BLAST programs to be decreased substantially while enhancing their sensitivity to weak similarities. A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times

5、 the speed of the original. In addition, a method is introduced for automatically combining statistically significant alignments produced by BLAST into a position-specific score matrix, and searching the database using this matrix. The resulting Position-Specific Iterated BLAST (PSI-BLAST) program r

6、uns at approximately the same speed per iteration as gapped BLAST, but in many cases is much more sensitive to weak but biologically relevant sequence similarities. PSI-BLAST is used to uncover several new and interesting members of the BRCT superfamily. INTRODUCTIONVariations of the BLAST algorithm

7、 (1 ) have been incorporated into several popular programs for searching protein and DNA databases for sequence similarities. BLAST programs have been written to compare protein or DNA queries with protein or DNA databases in any combination, with DNA sequences often undergoing conceptual translatio

8、n before any comparison is performed. We will use the blastp program, which compares protein queries to protein databases, as a prototype for BLAST, although the ideas presented extend immediately to other versions that involve the translation of a DNA query or database. Some of the refinements desc

9、ribed are applicable as well to DNA-DNA comparison, but have yet to be implemented. BLAST is a heuristic that attempts to optimize a specific similarity measure. It permits a tradeoff between speed and sensitivity, with the setting of a threshold parameter, T. A higher value of T yields greater spee

10、d, but also an increased probability of missing weak similarities. The BLAST program requires time proportional to the product of the lengths of the query sequence and the database searched. Since the rate of change in database sizes currently exceeds that of processor speeds, computers running BLAS

11、T are subjected to increasing load. However, the conjunction of several new algorithmic ideas allow a new version of BLAST to achieve improved sensitivity at substantially augmented speed. This paper describes three major refinements to BLAST. (i) For increased speed, the criterion for extending wor

12、d pairs has been modified. The original BLAST program seeks short word pairs whose aligned score is at least T. Each such hit is then extended, to test whether it is contained within a high-scoring alignment. For the default T value, this extension step consumes most of the processing time. The new

13、two-hit method requires the existence of two non-overlapping word pairs on the same diagonal, and within a distance A of one another, before an extension is invoked. To achieve comparable sensitivity, the threshold parameter T must be lowered, yielding more hits than previously. However, because onl

14、y a small fraction of these hits are extended, the average amount of computation required decreases. (ii) The ability to generate gapped alignments has been added. The original BLAST program often finds several alignments involving a single database sequence which, when considered together, are stat

15、istically significant. Overlooking any one of these alignments can compromise the combined result. By introducing an algorithm for generating gapped alignments, it becomes necessary to find only one rather than all the ungapped alignments subsumed in a significant result. This allows the T parameter

16、 to be raised, increasing the speed of the initial database scan. The new gapped alignment algorithm uses dynamic programming to extend a central pair of aligned residues in both directions. For speed, earlier heuristic methods (2 ,3 ) confined the alignments produced to a predefined strip of the dy

17、namic programming path graph (4 ). Our approach considers only alignments that drop in score no more than Xg below the best score yet seen. The algorithm is able thereby to adapt the region of the path graph it explores to the data. (iii) BLAST searches may be iterated, with a position-specific scor

18、e matrix generated from significant alignments found in round i used for round i + 1. Motif or profile search methods frequently are much more sensitive than pairwise comparison methods at detecting distant relationships. However, creating a set of motifs or a profile that describes a protein family

19、, and searching a database with them, typically has involved running several different programs, with substantial user intervention at various stages. The BLAST algorithm is easily generalized to use an arbitrary position-specific score matrix in place of a query sequence and associated substitution

20、 matrix. Accordingly, we have automated the procedure of generating such a matrix from the output produced by a BLAST search, and adapted the BLAST algorithm to take this matrix as input. The resulting Position-Specific Iterated BLAST, or PSI-BLAST, program may not be as sensitive as the best availa

21、ble motif search programs, but its speed and ease of operation can bring the power of these methods into more common use. After describing these refinements to BLAST in greater detail, we consider several biological examples for which the sensitivity and speed of the program are greatly enhanced. ST

22、ATISTICAL PRELIMINARIESTo analyze the BLAST algorithm and its refinements, we need first to review the statistics of high-scoring local alignments. BLAST employs a substitution matrix, which specifies a score sij for aligning each pair of amino acids i and j. Given two sequences to compare, the orig

23、inal BLAST program seeks equal-length segments within each that, when aligned to one another without gaps, have maximal aggregate score. Not only the single best segment pair may be found, but also other locally optimal pairs (3 ,5 -7 ), whose scores cannot be improved by extension or trimming. Such

24、 locally optimal alignments are called high-scoring segment pairs or HSPs. For the sake of the statistical theory, we assume a simple protein model in which the amino acids occur randomly at all positions with background probabilities Pi. We require that the expected score for two random amino acids

25、 sum from i , j P sub roman i P sub roman j s sub roman i j be negative. Given the Pi and sij, the basic theory (8 ,9 ) yields two calculable parameters, lambda and K, which can be used to convert nominal HSP scores to normalized scores, thereby rendering all scoring systems directly comparable from

26、 a statistical perspective. The normalized score S for an HSP is given by the equation: Sprime = lambdaS - ln K / ln 21In this paper, a nominal score is given without units, while a score normalized by equation 1 is said to be expressed in bits (10 ,11 ). When two random protein sequences of suffici

27、ent lengths m and n are compared, the number E of distinct HSPs with normalized score at least S expected to occur by chance is well approximated by: E = N/2Sprime2where N = mn is the search space size (8 -9 ). If a protein is compared to a whole database rather than a single sequence, n is the data

28、base length in residues. Equation 2 may be inverted to yield S = log2(N/E), the normalized score required to achieve a particular E-value. In a typical current database search, a protein of length 250 might be compared to a protein database of 50 000 000 total residues. To achieve a marginally signi

29、ficant E-value of 0.05, a normalized score of 38 bits is necessary. While the theory just outlined has not been proved for gapped local alignments and their associated scores, computational experiments strongly suggest that it remains valid (3 ,12 -15 ). The statistical parameters lambda and K, howe

30、ver, are no longer supplied by theory but must be estimated using comparisons of either simulated or real but unrelated sequences. To distinguish below whether a given set of parameters lambda and K refer to gapped or ungapped alignments, we use the subscripts g and u respectively. When gaps are not

31、 allowed, a further important theorem states that within HSPs the aligned pair of letters (i,j) tends to occur with the target frequency: qij = PiPjelambdauSij3The qij of equation 3 sum to 1; indeed, lambdau is calculated to be the unique positive number for which this is the case (8 ,9 ). The score

32、s sij are optimal for detecting alignments with these particular target frequencies (8 ,10 ), and by inverting equation 3 to sij = ln(qij/PiPj)/lambdau, scores may be chosen, with arbitrary scale, that correspond to any desired set of qij. The popular PAM (16 ,17 ) and BLOSUM (18 ) substitution matr

33、ices are constructed with explicit use of this log-odds formula. No corresponding result has been established for gapped alignment scoring systems. However, if the gap costs used are sufficiently large, it is expected that the target frequencies observed in high-scoring local alignments of random se

34、quences will not differ greatly from those for the no-gap case. REFINEMENT OF THE BASIC ALGORITHM: THE TWO-HIT METHODThe central idea of the BLAST algorithm is that a statistically significant alignment is likely to contain a high-scoring pair of aligned words. BLAST first scans the database for wor

35、ds (typically of length three for proteins) that score at least T when aligned with some word within the query sequence. Any aligned word pair satisfying this condition is called a hit. The second step of the algorithm checks whether each hit lies within an alignment with score sufficient to be repo

36、rted. This is done by extending a hit in both directions, until the running alignments score has dropped more than X below the maximum score yet attained. This extension step is computationally quite costly; with the T and X parameters necessary to attain reasonable sensitivity to weak alignments, t

37、he extension step typically accounts for 90% of BLASTs execution time. It is therefore desirable to reduce the number of extensions performed. Our refined algorithm is based upon the observation that an HSP of interest is much longer than a single word pair, and may therefore entail multiple hits on

38、 the same diagonal and within a relatively short distance of one another. (The diagonal of a hit involving words starting at positions (x1, x2) of the database and query sequences may be defined as x1 - x2. The distance between two hits on the same diagonal is the difference between their first coor

39、dinates.) This signature may be used to locate HSPs more efficiently. Specifically, we choose a window length A, and invoke an extension only when two non-overlapping hits are found within distance A of one another on the same diagonal. Any hit that overlaps the most recent one is ignored. Efficient

40、 execution requires an array to record, for each diagonal, the first coordinate of the most recent hit found. Since database sequences are scanned sequentially, this coordinate always increases for successive hits. The idea of seeking multiple hits on the same diagonal was first used in the context

41、of biological database searches by Wilbur and Lipman (19 ). Figure 1. The sensitivity of the two-hit and one-hit heuristics as a function of HSP score. Using the BLOSUM-62 amino acid substitution matrix (18), and the target frequencies qij implied by equation 3 and the background amino acid frequenc

42、ies Pi of Robinson and Robinson (20), 100 000 model HSPs were generated for each of the nominal scores 37-92, corresponding to normalized scores 19.9-45.1 bits. It was determined by inspection whether each HSP failed to contain two non-overlapping length-3 word pairs with nominal score at least 11,

43、and within a distance 40 of one another, or a single length-3 word pair with nominal score at least 13. The corresponding probabilities of missing an HSP using the two-hit heuristic with T = 11, and the one-hit heuristic with T = 13, are plotted as a function of normalized HSP score. The two-hit met

44、hod is more sensitive for HSPs with score at least 33 bits. Because we require two hits rather than one to invoke an extension, the threshold parameter T must be lowered to retain comparable sensitivity. The effect is that many more single hits are found, but only a small fraction have an associated

45、 second hit on the same diagonal that triggers an extension. The great majority of hits may be dismissed after the minor calculation of looking up, for the appropriate diagonal, the coordinate of the most recent hit, checking whether it is within distance A of the current hits coordinate, and finall

46、y replacing the old with the new coordinate. Empirically, the computation saved by requiring fewer extensions more than offsets the extra computation required to process the larger number of hits. To study the relative abilities of the one-hit and two-hit methods to detect HSPs of varying score, we

47、model proteins using the background amino acid frequencies of Robinson and Robinson (20 ), and use the BLOSUM-62 substitution matrix (18 ) for sequence comparison. Given these Pi and sij, the statistical parameters for ungapped local alignments are calculated to be lambdau = 0.3176 and Ku = 0.134. U

48、sing equation 3 above, we may calculate the qij for which the scoring system is optimized, and employ these target frequencies to generate model HSPs. Finally, we evaluate the sensitivity of the one-hit and two-hit BLAST heuristics using these HSPs. The one-hit method will detect an HSP if it somewh

49、ere contains a length-W word of score at least T. For W = 3 and T = 13, Figure 1 shows the empirically estimated probability that an HSP is missed by this method, as a function of its normalized score. The two-hit method will detect an HSP if it contains two non-overlapping length-W words of score a

50、t least T, with starting positions that differ by no more than A residues. For W = 3, T = 11 and A = 40, Figure 1 shows the estimated probability that an HSP is missed by this method, as a function of its normalized score. For HSPs with score at least 33 bits, the two-hit heuristic is more sensitive

51、. To analyze the relative speeds of the one-hit and two-hit methods, using the parameters studied above, we note that the two-hit method generates on average 3.2 times as many hits, but only 0.14 times as many hit extensions (Fig. 2 ). Because it takes approximately one ninth as long to decide wheth

52、er a hit need be extended as actually to extend it, the hit-processing component of the two-hit method is approximately twice as rapid as the same component of the one-hit method. Figure 2. The BLAST comparison of broad bean leghemoglobin I (87) (SWISS-PROT accession no. P02232) and horse beta-globi

53、n (88) (SWISS-PROT accession no. P02062). The 15 hits with score at least 13 are indicated by plus signs. An additional 22 non-overlapping hits with score at least 11 are indicated by dots. Of these 37 hits, only the two indicated pairs are on the same diagonal and within distance 40 of one another.

54、 Thus the two-hit heuristic with T = 11 triggers two extensions, in place of the 15 extensions invoked by the one-hit heuristic with T = 13. Because this is just one example, the relative numbers of hits and extensions at the various settings of T correspond only roughly to the ratios found in a ful

55、l database search. An ungapped extension of the leftward of the two hit pairs yields an HSP with nominal score 45, or 23.6 bits, calculated using lambdau and Ku.TRIGGERING THE GENERATION OF GAPPED ALIGNMENTSFigure 1 shows that even when using the original one-hit method with threshold parameter T =

56、13, there is generally no greater than a 4% chance of missing an HSP with score 38 bits. While this would appear sufficient for most purposes, the one-hit default T parameter has typically been set as low as 11, yielding an execution time nearly three times that for T = 13. Why pay this price for wh

57、at appears at best marginal gains in sensitivity? The reason is that the original BLAST program treats gapped alignments implicitly by locating, in many cases, several distinct HSPs involving the same database sequence, and calculating a statistical assessment of the combined result (21 ,22 ). This

58、means that two or more HSPs with scores well below 38 bits can, in combination, rise to statistical significance. If any one of these HSPs is missed, so may be the combined result. The approach taken here allows BLAST to simultaneously produce gapped alignments and run significantly faster than prev

59、iously. The central idea is to trigger a gapped extension for any HSP that exceeds a moderate score Sg, chosen so that no more than about one extension is invoked per 50 database sequences. (By equation 2, for a typical-length protein query, Sg should be set at 22 bits.) A gapped extension takes muc

60、h longer to execute than an ungapped extension, but by performing very few of them the fraction of the total running time they consume can be kept relatively low. By seeking a single gapped alignment, rather than a collection of ungapped ones, only one of the constituent HSPs need be located for the

61、 combined result to be generated successfully. This means that we may tolerate a much higher chance of missing any single moderately scoring HSP. For example, consider a result involving two HSPs, each with the same probability P of being missed at the hit-stage of the BLAST algorithm, and suppose that we desire to find the combined result with probability at least 0.95. The original algorithm, needing to find both HSPs, requires 2P - P2 = 0.05, or P less than 0.025. In contrast, the new algorithm requires only that P2 50 ti

展开阅读全文
温馨提示:
1: 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
2: 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
3.本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
5. 装配图网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
关于我们 - 网站声明 - 网站地图 - 资源地图 - 友情链接 - 网站客服 - 联系我们

copyright@ 2023-2025  zhuangpeitu.com 装配图网版权所有   联系电话:18123376007

备案号:ICP2024067431-1 川公网安备51140202000466号


本站为文档C2C交易模式,即用户上传的文档直接被用户下载,本站只是中间服务平台,本站所有文档下载所得的收益归上传人(含作者)所有。装配图网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私,请立即通知装配图网,我们立即给予删除!