MEGA Operating Procedure: Multiple Sequence Alignment and Phylogenetic Trees (.ppt)

Uploaded by: xiao****1972  Document ID: 15577459  Uploaded: 2020-08-21  Format: PPT  Pages: 173  Size: 4.11 MB

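Basic Bioinformatics and Its Applications
Wang Xingping

Contents
Multiple sequence alignment
Molecular evolution analysis and phylogenetic tree construction
Prediction and identification of nucleic acid sequences
Restriction map construction and primer design

Multiple Sequence Alignment
Contents:
Multiple sequence alignment
Multiple sequence alignment programs and their applications

Section 1. Multiple Sequence Alignment
Concept
Significance of multiple sequence alignment
Scoring functions for multiple sequence alignment
Methods of multiple sequence alignment

1. Concept
Multiple sequence alignment: aligning multiple related sequences to achieve optimal matching of the sequences.
For ease of description, the multiple sequence alignment process can be defined as follows: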

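Think of a multiple sequence alignment as a two-dimensional table in which each row represents a sequence and each column represents a residue position. The sequences are entered into the table according to the following rules:
(a) the relative positions of all residues within a sequence remain unchanged;
(b) identical or similar residues from different sequences are placed in the same column, that is, identical or similar residues are aligned vertically as far as possible (see the table below).

Table 1. Definition of multiple sequence alignment
Alignment of five short sequences (I-V). By inserting gaps, most identical or similar residues in the five sequences are placed in the same column while the residue order of each sequence is preserved.

2. Significance of multiple sequence alignment
Describing the similarity relationships within a group of sequences, in order to understand the basic features of a molecular family and to find motifs, conserved regions, and so on.
Describing how closely related a group of homologous sequences are, for use in molecular evolution analysis.
Sequence homology analysis: the sequence under study is added to a group of homologous sequences from different species and all are compared simultaneously, to determine the degree of homology between that sequence and the others.
Other applications, such as building profiles and scoring matrices.

3. Methods of multiple sequence alignment
Manual alignment
Starting from the output of a tested and reasonably reliable computer program (with helper editors such as BioEdit, SeaView, or GeneDoc), it is often necessary to refine the alignment by hand in light of experimental results or the literature.
To make interactive manual alignment easier, residues with different properties are usually shown in different colors, which helps in judging the similarity between sequences.
Automatic alignment by computer program
A specific algorithm (exhaustive, heuristic, and so on) is used to search automatically for the best multiple alignment.

Exhaustive method
The exhaustive alignment method extends the two-dimensional dynamic programming matrix used for pairwise alignment to a multidimensional matrix, so that the number of dimensions equals the number of sequences being aligned. The computation is very heavy and demands substantial computer resources, so the method is generally used only for a small number of short sequences.
DCA (Divide-and-Conquer Alignment): a web-based program that is semiexhaustive
http://bibiserv.techfak.uni-bielefeld.de/dca/

Heuristic algorithms
Most practical multiple sequence alignment programs use heuristic algorithms to reduce computational complexity.
Complexity grows as the number of sequences grows. The complexity of aligning n sequences is $O(m_1 m_2 m_3 \cdots m_n)$, where $m_n$ is the length of the last sequence. If the sequences have similar lengths, this simplifies to $O(m^n)$, where n is the number of sequences and m is the sequence length. The complexity of sequence alignment therefore grows exponentially with the number of sequences.

Section 2. Multiple Sequence Alignment Programs and Their Applications
Progressive Alignment Method
Iterative Alignment
Block-Based Alignment
DNASTAR
DNAMAN

1. Progressive Alignment Method
Clustal:
The progressive alignment approach used by Clustal was proposed by Feng and Doolittle in 1987.
Clustal exists in many versions.
ClustalW (Thompson et al., 1994) is currently the most widely used multiple sequence alignment program.
ClustalX is its graphical-interface version.
As part of its output, Clustal can produce data for building phylogenetic trees.

The ClustalW program:
ClustalW is free to use. The software package can be downloaded from the NCBI/EBI FTP servers. ClustalW guides the user step by step through menus; the user can choose the scoring matrix, set gap penalties, and so on as needed.
ftp://ftp.ebi.ac.uk/pub/software/
The EBI home page also offers a web-based ClustalW service. The user submits sequences and settings through a form, and the server returns the results by e-mail (or interactively online).
http://www.ebi.ac.uk/clustalw/

Progressive Alignment Method: the ClustalW program
ClustalW is flexible about input format: it accepts FASTA as well as PIR, SWISS-PROT, GDE, Clustal, GCG/MSF, and RSF.
The output format can also be chosen from ALN, GCG, PHYLIP, and GDE, according to the user's needs.
In a ClustalW alignment, all sequences are laid out together and symbols mark the conservation of the residues at each position: "*" marks a very highly conserved position, and "." marks a somewhat less conserved position.

Progressive Alignment Method: using Clustal W
Go to http://www.ebi.ac.uk/clustalw/
Set the options.

Progressive Alignment Method: using Clustal W, notes on some options
PHYLOGENETIC TREE has three options.
TREE TYPE: the algorithm used to build the phylogenetic tree. There are four choices: none, nj (neighbour joining), phylip, and dist.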

CORRECT DIST: whether to apply a distance correction. For small sequence divergence (<10%), the choice makes no difference; for large divergence, a correction is needed because the observed distance is lower than the true evolutionary distance.
IGNORE GAPS: when set to on, alignment positions containing gaps are ignored.
For details see http://www.ebi.ac.uk/clustalw/clustalw_frame.html

Progressive Alignment Method: using Clustal W
Enter five 16S rRNA gene sequences:
AF310602
AF308147
AF283499
AF012090
AF447394
Click "RUN".

Progressive Alignment Method
T-Coffee (Tree-based Consistency Objective Function for alignment Evaluation): a progressive alignment method
www.ch.embnet.org/software/TCoffee.html
In processing a query, T-Coffee performs both global and local pairwise alignment for all possible pairs involved. A distance matrix is built to derive a guide tree, which is then used to direct a full multiple alignment using the progressive approach.
Outperforms Clustal when aligning moderately divergent sequences.
Slower than Clustal.

Progressive Alignment Method
PRALINE: web-based: http://ibivu.cs.vu.nl/programs/pralinewww/
First builds a profile for each sequence using PSI-BLAST database searching. Each profile is then used for multiple alignment using the progressive approach.
The closest neighbor to be joined to a larger alignment is found by comparing profile scores; no guide tree is used.
Incorporates protein secondary structure information to modify the profile scores.
Perhaps the most sophisticated and accurate alignment program available.
Extremely slow computation.

Progressive Alignment Method
DbClustal: http://igbmc.u-strasbg.fr:8080/DbClustal/dbclustal.html
Poa (Partial order alignments): http://www.bioinformatics.ucla.edu/poa/

2. Iterative Alignment
PRRN: web-based program http://prrn.ims.u-tokyo.ac.jp/
Uses a double nested iterative strategy for multiple alignment.
Based on the idea that an optimal solution can be found by repeatedly modifying existing suboptimal solutions.

Block-Based Alignment
DIALIGN2: a web-based program http://bioweb.pasteur.fr/seqanal/interfaces/dialign2.html
It places emphasis on block-to-block comparison rather than residue-to-residue comparison. The sequence regions between the blocks are left unaligned. The program has been shown to be especially suitable for aligning divergent sequences with only local similarity.

Block-Based Alignment
Match-Box: web-based server http://www.fundp.ac.be/sciences/biologie/bms/matchbox_submit.shtml
Aims to identify conserved blocks (or boxes) among sequences. The server requires the user to submit a set of sequences in the FASTA format, and the results are returned by e-mail.

Software: DNASTAR, DNAMAN

Molecular Evolution Analysis and Phylogenetic Tree Construction
Contents of this chapter:
Introduction to molecular evolution analysis
Methods of phylogenetic tree construction
Examples of phylogenetic tree construction

Section 1. Introduction to Molecular Evolution Analysis
Basic concepts:
Phylogeny is the history of the origin and evolution of organisms.
Phylogenetics studies the evolutionary relationships between species.
A phylogenetic tree is a representation that describes the evolutionary relationships between species.

Aim of molecular evolution research
To start from molecular characteristics of species and infer the phylogenetic relationships between them.
Protein and nucleic acid sequences: comparing sequence homology reveals how genes evolve and the underlying regularities of phylogeny.

Basis of molecular evolution research
Basic theory: across different lineages and over sufficiently long evolutionary time scales, the evolutionary rate of many sequences is nearly constant (molecular clock theory, 1965).
In practice: although it is often still disputed, molecular evolution does explain some of the underlying regularities of phylogeny.

Orthologs and paralogs
Orthologs: homologous sequences in different species that arose from a common ancestral gene during speciation; they may or may not be responsible for a similar function.
Paralogs: homologous sequences within a single species that arose by gene duplication.
These two concepts correspond to two different evolutionary events. Sequences used in molecular evolution analysis must be orthologous in order to reflect the evolutionary process faithfully.

Phylogenetic tree: also called an evolutionary tree. The study of phylogenetic trees has developed into an interdisciplinary field at the boundary of several disciplines.

This field draws on evolutionary theory, genetics, taxonomy, molecular biology, biochemistry, biophysics, and ecology from the life sciences, and on probability and statistics, graph theory, computer science, and group theory from the mathematical sciences.
In 1987 the internationally renowned Cold Spring Harbor Symposium on Quantitative Biology devoted a special session to evolutionary trees, marking the field as one of the frontiers of modern biology. It remains very active today.

Structure of a phylogenetic tree
The lines in the tree are called branches.
At the tips of the branches are present-day species or sequences known as taxa (the singular form is taxon) or operational taxonomic units.
The connecting point where two adjacent branches join is called a node, which represents an inferred ancestor of extant taxa.
The bifurcating point at the very bottom of the tree is the root node, which represents the common ancestor of all members of the tree.
A group of taxa descended from a single common ancestor is defined as a clade or monophyletic group.
The branching pattern in a tree is called the tree topology.

Rooted and unrooted trees
The root represents the common ancestor of a group of taxa.

How to determine the root
By outgroup: one way is to use an outgroup, which is a sequence that is homologous to the sequences under consideration but separated from those sequences at an early evolutionary time.
By midpoint: in the absence of a good outgroup, a tree can be rooted using the midpoint rooting approach, in which the midpoint of the two most divergent groups, judged by overall branch lengths, is assigned as the root.
Figure: tree rooted by outgroup.

Tree forms
Phylograms: carry both branching and branch-length information.
Cladograms: carry branching information only, with no branch lengths.

Section 2. Methods of Phylogenetic Tree Construction
Molecular phylogenetic tree construction can be divided into five steps:
(1) choosing molecular markers;
(2) performing multiple sequence alignment;
(3) choosing a model of evolution;
(4) determining a tree building method;
(5) assessing tree reliability.

Section 3. Examples of Phylogenetic Tree Construction
Commonly used phylogenetic analysis software
(1) PHYLIP (2) PAUP (3) TREE-PUZZLE (4) MEGA (5) PAML (6) TreeView (7) VOSTORG (8) Fitch programs (9) Phylo_win (10) ARB (11) DAMBE (12) PAL (13) Bionumerics
For other programs see: http://evolution.genetics.washington.edu/phylip/software.html

MEGA 3 download address

Discrete character data: the data take two or more discrete values. For example:
whether or not a given position in a DNA sequence is a splice site (a binary character);
which of the four possible bases A, T, G, C occupies a given position in a sequence (a multistate character).
Similarity and distance data: the relationships among taxa are expressed as pairwise similarities or distances.

Prediction and Identification of Nucleic Acid Sequences
Contents:
Statistical models of sequence probability information
Prediction and identification of nucleic acid sequences
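Looking back at the distance data described for tree building in the preceding chapter, the following is a minimal Python sketch of how pairwise distances might be computed from an existing alignment. It is not MEGA's or Clustal's implementation. The function names, the toy sequences, and the choice of the Jukes-Cantor formula as the distance correction are assumptions made for illustration; the slides do not say which correction any particular program applies.

```python
import math
from itertools import combinations


def p_distance(seq_a: str, seq_b: str) -> float:
    """Fraction of differing sites between two aligned sequences.

    Columns where either sequence has a gap ('-') are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return differing / compared


def jukes_cantor(p: float) -> float:
    """Jukes-Cantor corrected distance for nucleotide sequences.

    The observed proportion p underestimates the true distance because
    repeated substitutions at one site are counted at most once.
    """
    if p >= 0.75:
        raise ValueError("p-distance too large for the Jukes-Cantor correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def distance_table(aligned: dict[str, str]) -> dict[tuple[str, str], float]:
    """Corrected pairwise distances for every pair of named sequences."""
    return {
        (x, y): jukes_cantor(p_distance(aligned[x], aligned[y]))
        for x, y in combinations(sorted(aligned), 2)
    }


# Hypothetical toy alignment, for illustration only.
toy = {"I": "ACGTACGT-A", "II": "ACGTACGTTA", "III": "ACGAACCTTA"}
table = distance_table(toy)
```

A table like this is the kind of input a distance-based tree method such as neighbour joining starts from. For small p the corrected and uncorrected values are almost identical, which matches the earlier remark that the correction matters mainly for strongly diverged sequences.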

Section 1. Statistical Models of Sequence Probability Information
One application of multiple sequence alignments in identifying related sequences in databases is the construction of statistical models:
Position-specific scoring matrices (PSSMs)
Profiles
Hidden Markov models (HMMs)

The process of recognizing "functional" and "non-functional" sequences (flowchart)
Collect known examples of functional and non-functional sequences (the sequences are unrelated to one another) and divide them into a training set and a test (control) set.
Build a model to perform the recognition task.
Train the prediction model so that, through learning, it can process and discriminate the sequences correctly.
Check the correctness of the model: make "functional" versus "non-functional" calls and compute the accuracy of the pattern recognition from the results.

Figure: profile HMM workflow. Steps shown: selection of related sequences, multiple sequence alignment, model construction, model training, parameter adjustment, establishing the profile HMM, application. Programs shown: ClustalX, Hmmbuild, Hmmt, Hmmcalibrate.

Hidden Markov Model: applications
HMMs have more predictive power than profiles.
An HMM is able to differentiate between insertion and deletion states.
In profile calculation, a single gap penalty score, often subjectively determined, represents either an insertion or a deletion.

Hidden Markov Model: applications
Once an HMM is established based on the training sequences:
It can be used to determine how well an unknown sequence matches the model.
It can be used for the construction of multiple alignments of related sequences.
HMMs can be used for database searching to detect distant sequence homologs.
HMMs are also used in:
protein family classification through motif and pattern identification;
advanced gene and promoter prediction;
transmembrane protein prediction;
protein fold recognition.

Section 2. Prediction and Identification of Nucleic Acid Sequences
Contents of this section:
Concept of nucleic acid sequence prediction
Gene prediction
Promoter and regulatory element prediction
Restriction site analysis and primer design

1. Concept of nucleic acid sequence prediction
The process of using computational methods (computer programs) to find the location and structure of genes and their expression regulatory elements in genomic sequence. It includes:
Gene prediction
Promoter and regulatory element prediction

Figure: Structure of Eukaryotic Genes (shown over a block of raw DNA sequence).

2. Gene Prediction
Concept and significance of gene prediction
Prokaryotic gene identification
Difficulty of eukaryotic gene prediction
Basis of eukaryotic gene prediction
Basic steps and strategies of eukaryotic gene prediction
Eukaryotic gene prediction methods and their principles

2.1 Concept and significance of gene prediction
Concept:
Gene prediction: given an uncharacterized DNA sequence, find out:
Where does the gene start and end? Detection of the location of open reading frames (ORFs).
Which regions code for a protein? Delineation of the structures of introns as well as exons (eukaryotic).

Significance:
Computational gene finding (gene prediction) is one of the most challenging and interesting problems in bioinformatics at the moment.
Computational gene finding is important because:
So many genomes are being sequenced so rapidly.
Purely biological means are time consuming and costly.
Finding genes in DNA sequences is the foundation for all further investigation (knowledge of the protein-coding regions underpins functional genomics).

2.2 Prokaryotic gene identification
The main task in prokaryotic gene identification is to recognize open reading frames, that is, long coding regions.
An open reading frame (ORF) is a sequence of codons that contains no stop codon.

Tools for prokaryotic gene prediction
ORF Finder
HMM-based gene finding programs:
GeneMark
Glimmer
FGENESB
RBSfinder

ORF Finder (Open Reading Frame Finder)
http://www.ncbi.nlm.nih.gov/gorf/gorf.html
Figure: example output for a zinc-binding alcohol dehydrogenase, novicida (Francisella).

HMM-based gene finding programs
GeneMark: trained on a number of complete microbial genomes
http://opal.biology.gatech.edu/GeneMark/
Glimmer (Gene Locator and Interpolated Markov Modeler): a UNIX program
www.tigr.org/softlab/glimmer/glimmer.html
FGENESB: web-based program, trained for bacterial sequences
RBSfinder: UNIX program for predicting start sites
ftp://ftp.tigr.org/pub/software/RBSfinder/

2.3 Difficulty of eukaryotic gene prediction
Why is gene prediction challenging?
Coding density: as the coding/non-coding length ratio decreases, exon prediction becomes more complex.
Some facts about the human genome:
Coding regions comprise less than 3% of the genome.
There is a gene of 2,400,000 bp of which only 14,000 bp are CDS (<1%).
Figure: coding density (labels: worm, E. coli).

Splicing of genes: finding multiple (short) exons is harder than finding a single (long) exon.
Some facts about the human genome:
Average of 5-6 exons per gene
Average exon length: 200 bp
Average intron length: 2000 bp
8% of genes have a single exon
Some exons can be as small as 3 bp.
Alternative splicing is very difficult to predict.

Basis of eukaryotic gene prediction

Functional sites: splicing site signals
Splice donor and acceptor sites: the splice junctions of introns and exons follow the GT-AG rule, in which an intron has a consensus motif of GTAAGT at the 5' splice junction (donor) and a consensus motif of (Py)12NCAG at the 3' splice junction (acceptor).
Figure: Nucleotide Distribution Probabilities around Donor Sites.
Figure: Nucleotide Distribution Probabilities around non Donor Sites.
Figure: Nucleotide Distribution around Splicing Sites.

Functional sites: translation signals
Translation initiation site signal (translation start codon): most vertebrate genes use ATG as the translation start codon and have a uniquely conserved flanking sequence called a Kozak sequence (CCGCCATGG).
Translation termination site signal (translation stop codon): for example TGA.

Functional sites: transcription start signals
CpG island: used to identify the transcription initiation site of a eukaryotic gene. Most of these genes have a high density of CG dinucleotides near the transcription start site. This region is referred to as a CpG island.
Table: dinucleotide frequencies in the yeast genome.
CpG occurs at only 20% of the frequency expected by chance, but in the promoter regions of eukaryotic genes the CpG density reaches the randomly expected level. Such regions are a few hundred bp long. The human genome contains about 45,000 CpG islands, half of them associated with housekeeping genes and the rest with the promoters of tissue-specific genes.

Functional sites: transcription stop signals
The poly-A signal can also help locate the final coding sequence.

Compositional features of coding and non-coding regions
Codon usage preference
Exon length
Isochores

Codon usage preference
Statistical results show that some codons are used with different frequencies in coding and non-coding regions, for example:
hexamer frequencies
codon usage frequency
Figure: frequencies for coding region and for non-coding region.

Codon usage preference: hexamer (di-codon usage) frequencies
Comparing the frequencies of hexamers (six consecutive nucleotides) is the best single indicator of whether a window belongs to a coding or a non-coding region.

Codon usage preference: codon usage frequency
Because the genetic code is degenerate, each amino acid corresponds to at least one codon and at most six codons.
Within genes, synonymous codons are not used equally.
Codon usage differs greatly between the genes of different species and organisms.
Across species, genes of the same type show similar synonymous codon usage bias.
For genes of the same type, the differences in synonymous codon usage bias caused by species are small.
Figure: Codon Usage Frequency, for coding region.

Exon length
Figure: Length Distribution of Internal Exons of Human Genes.

Isochores
Definition: long regions of uniform base composition
Length over 1,000,000 bp
GC content is fairly even within an isochore but differs markedly between isochores.
The human genome is divided into five isochore classes:
L1: GC 39%
L2: GC 42%
L1 and L2 contain 80% of tissue-specific genes.
H1: GC 46%
H2: GC 49%
H3: GC 54%. Contains 80% of housekeeping genes.
Figure: The Dependence of Codon Usage Score on CG Content.

2.5 Steps and strategies of eukaryotic gene prediction
The main issue in prediction of eukaryotic genes is the identification of exons, introns, and splicing sites.

Basic steps
Identify vector contamination in the sequence
Mask repetitive sequences
Find genes
Evaluate the results
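The CpG island signal listed among the evidence for eukaryotic gene prediction can be illustrated with a short Python sketch that computes the observed/expected CpG ratio in sliding windows. The function names, the window size of 200 bp, and the thresholds (GC fraction above 0.5, observed/expected ratio above 0.6) are assumptions chosen for illustration. They follow commonly cited CpG island criteria, are not taken from the slides, and are not the method of any program named here.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in the sequence."""
    s = seq.upper()
    return (s.count("G") + s.count("C")) / len(s) if s else 0.0


def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio: (#CG * length) / (#C * #G).

    A value near 1 means CpG occurs about as often as base composition
    alone predicts; bulk vertebrate DNA is typically well below 1.
    """
    s = seq.upper()
    c, g = s.count("C"), s.count("G")
    if c == 0 or g == 0:
        return 0.0
    return s.count("CG") * len(s) / (c * g)


def cpg_windows(seq: str, window: int = 200, step: int = 100,
                min_gc: float = 0.5, min_ratio: float = 0.6):
    """Return (start, end) for each window passing both thresholds."""
    hits = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        if gc_fraction(chunk) > min_gc and cpg_obs_exp(chunk) > min_ratio:
            hits.append((start, start + window))
    return hits


# Hypothetical toy sequence: AT-rich flanks around a CG-rich core.
toy = "AT" * 100 + "CG" * 100 + "AT" * 100
islands = cpg_windows(toy)
```

On real genomic sequence, adjacent passing windows would normally be merged into one candidate island, and the candidate would then be weighed together with the other signals (splice sites, start codon context, poly-A) rather than used alone.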

Steps and strategies of eukaryotic gene prediction
Contamination and repetitive elements must be removed from the sequence first.
Sources of sequence contamination:
Vectors
Adapters and PCR primers
Transposons and insertion sequences
Impure DNA/RNA samples
Repetitive elements: interspersed repeats, satellite DNA, simple sequence repeats, low-complexity sequences, and so on.

Gene-finding strategies
The current gene prediction methods can be classified into two major categories:
Ab initio-based approaches (statistically based methods): predict genes based on the given sequence alone.
Homology-based approaches (sequence alignment based methods): make predictions based on significant matches of the query sequence with sequences of known genes.
Figure: choosing a gene-finding strategy.

2.6 Eukaryotic gene prediction methods and their principles
Methods for identifying vector contamination
Repeat analysis programs
Gene prediction programs (eukaryotic)

Identifying vector contamination
Methods:
Similarity search against vector databases
Searching the sequence for restriction enzyme sites
Tools:
VecScreen: NCBI
Blast2 EVEC: EMBL www.ebi.ac.uk/blastall/vectors.html

Masking repetitive sequences
Repeat analysis programs:
RepeatMasker: for primates, rodents, Arabidopsis, grasses, and Drosophila
ftp.genome.washington.edu/cgi-bin/RepeatMasker
XBLAST: applicable to any species
bioweb.pasteur.fr/seqanal/interfaces/xblast.html#-data/

Gene Prediction Programs (Eukaryotic)
Ab Initio-Based Programs
Homology-Based Programs
Consensus-Based Programs
Performance Evaluation

Ab Initio-Based Programs
The goal of the ab initio gene prediction programs is to discriminate exons from noncoding sequences and subsequently join the exons together in the correct order.
The algorithms rely on two features:
gene signals
gene content
To derive an assessment for these features, HMMs or neural network-based algorithms can be used.
The frequently used ab initio programs are described next.

Ab Initio-Based Programs: GENSCAN
Web based: http://genes.mit.edu/GENSCAN.html
Makes predictions based on fifth-order HMMs.
It combines hexamer frequencies with coding signals (initiation codons, TATA box, cap site, poly-A, etc.) in prediction.
Putative exons are assigned a probability score (P) of being a true exon. Only predictions with P > 0.5 are deemed reliable.
This program is trained for sequences from vertebrates, Arabidopsis, and maize. It has been used extensively in annotating the human genome.

Ab Initio-Based Programs: GRAIL (Gene Recognition and Assembly Internet Link)
A web-based program: http://compbio.ornl.gov/public/tools/
Based on a neural network algorithm.
The program is trained on several statistical features such as splice junctions, start and stop codons, poly-A sites, promoters, and CpG islands.
The program scans the query sequence with windows of variable lengths, scores them for coding potential, and finally outputs a list of exon candidates.
The program is currently trained for human, mouse, Arabidopsis, Drosophila, and Escherichia coli sequences.

Ab Initio-Based Programs: FGENES (FindGenes)
Web-based program.
Uses linear discriminant analysis (LDA) to determine whether a signal is an exon.
In addition to FGENES, there are many variants of the program:
FGENESH: makes use of HMMs.
FGENESH_C: similarity based.
FGENESH+: combines both ab initio and similarity-based approaches.

Ab Initio-Based Programs: MZEF (Michael Zhang's Exon Finder)
Web based: http://argon.cshl.org/genefinder/
Uses quadratic discriminant analysis (QDA) for exon prediction.
Any advantage of this approach has not been obvious in actual gene prediction.

Ab Initio-Based Programs: HMMgene
Web based: www.cbs.dtu.dk/services/HMMgene
HMM-based program.
The unique feature of the program is that it uses a criterion called the conditional maximum likelihood to discriminate coding from noncoding features.
If a sequence already has a subregion identified as coding, for example from similarity with cDNAs or proteins in a database, that region is locked as a coding region.
An HMM prediction is subsequently made with a bias toward the locked region and is extended from the locked region to predict the rest of the gene's coding regions and even neighboring genes.
The program is in a way a hybrid algorithm that uses both ab initio-based and homology-based criteria.

Homology-Based Programs
Homology-based programs are based on the fact that exon structures and exon sequences of related species are highly conserved.
When potential coding frames in a query sequence are translated and aligned with the closest protein homologs found in databases, near perfectly matched regions can be used to reveal the exon boundaries in the query.
This approach assumes that the database sequences are correct. It is a reasonable assumption in light of the fact that many homologous sequences to be compared with are derived f
