DNA1Lastweek'stake-homelessons

上传人:xx****x 文档编号:240562425 上传时间:2024-04-15 格式:PPT 页数:53 大小:430KB
收藏 版权申诉 举报 下载
DNA1Lastweek'stake-homelessons_第1页
第1页 / 共53页
DNA1Lastweek'stake-homelessons_第2页
第2页 / 共53页
DNA1Lastweek'stake-homelessons_第3页
第3页 / 共53页
资源描述:

《DNA1Lastweek'stake-homelessons》由会员分享,可在线阅读,更多相关《DNA1Lastweek'stake-homelessons(53页珍藏版)》请在装配图网上搜索。

1、DNA1:Last weeks take-home lessonsTypes of mutantsMutation,drift,selection Binomial for eachAssociation studies c2 statisticLinked&causative allelesAlleles,Haplotypes,genotypesComputing the first genome,the second.New technologiesRandom and systematic errors1DNA2:Todays story and goalsMotivation and

2、connection to DNA1Comparing types of alignments&algorithms Dynamic programming Multi-sequence alignmentSpace-time-accuracy tradeoffsFinding genes-motif profilesHidden Markov Model for CpG Islands2DNA 2figureIntro2:Common&simpleDNA1:the last 5000 generations3Applications of Dynamic ProgrammingzTo seq

3、uence analysisShotgun sequence assemblyMultiple alignmentsDispersed&tandem repeatsBird song alignmentsGene Expression time-warpingzThrough HMMsRNA gene search&structure predictionDistant protein homologiesSpeech recognition4Alignments&ScoresGlobal(e.g.haplotype)ACCACACA:xx:x:ACACCATAScore=5(+1)+3(-1

4、)=2Suffix(shotgun assembly)ACCACACA :ACACCATAScore=3(+1)=3 Local(motif)ACCACACA :ACACCATAScore=4(+1)=45Increasingly complex(accurate)searchesExact(StringSearch)CGCGRegular expression(PrositeSearch)CGN0-9CG =CGAACGSubstitution matrix(BlastN)CGCG =CACG Pro(PSI-blast)CGc(g/a)=CACGGaps (Gap-Blast)CGCG =

5、CGAACGDynamic Programming(NW,SM)CGCG =CAGACGHidden Markov Models(HMMER)WU6Hardness of(multi-)sequence alignmentAlign 2 sequences of length N allowing gaps.ACCAC-ACA ACCACACA :x:x:x:xxxxxx:AC-ACCATA ,A-CACCATA ,etc.2N gap positions,gap lengths of 0 to N each:A nave algorithm might scale by O(N2N).For

6、 N=3x109 this is rather large.Now,what about k2 sequences?or rearrangements other than gaps?7Separate Training set and Testing setsNeed databases of non-redundant sets.Need evaluation criteria(programs)Sensistivity and Specificity (false negatives&positives)sensitivity(true_predicted/true)specificit

7、y(true_predicted/all_predicted)Where do training sets come from?More expensive experiments:crystallography,genetics,biochemistryTesting search&classification algorithms8Pearson WR Protein Sci 1995 Jun;4(6):1145-60 Comparison of methods for searching protein sequence databases.Methods Enzymol 1996;26

8、6:227-58 Effective protein sequence comparison.Algorithm:FASTA,Blastp,BlitzSubstitution matrix:PAM120,PAM250,BLOSUM50,BLOSUM62Database:PIR,SWISS-PROT,GenPeptComparisons of homology scores9Switch to protein searches when possibleMAdjacent mRNA codonsF 3 uac 5.aug 3aag uuu.10A Multiple Alignment of Im

9、munoglobulins11Scoring matrix based on large set of distantly related blocks:Blosum6212Scoring Functions and AlignmentszScoring function:(match)=+1;(mismatch)=-1;(indel)=-2;(other)=0.zAlignment score:sum of columns.zOptimal alignment:maximum score.substitution matrix13Calculating Alignment Scores14D

10、NA2:Todays story and goalsMotivation and connection to DNA1Comparing types of alignments&algorithms Dynamic programming Multi-sequence alignmentSpace-time-accuracy tradeoffsFinding genes-motif profilesHidden Markov Model for CpG Islands15What is dynamic programming?A dynamic programming algorithm so

11、lves every subsubproblem just once and then saves its answer in a table,avoiding the work of recomputing the answer every time the subsubproblem is encountered.-Cormen et al.Introduction to Algorithms,The MIT Press.16Recursion of Optimal Global Alignments17Recursion of Optimal Local Alignments18Comp

12、uting Row-by-Rowmin=-109919Traceback Optimal Global Alignment20Local and Global Alignments21Time and Space Complexity of Computing Alignments22Time and Space ProblemszComparing two one-megabase genomes.zSpace:An entry:4 bytes;Table:4*106*106=4 G bytes memory.zTime:1000 MHz CPU:1M entries/second;1012

13、 entries:1M seconds=10 days.23Time&Space Improvement for w-band Global AlignmentszTwo sequences differ by at most w bps(wn).zw-band algorithm:O(wn)time and space.zExample:w=3.24SummaryDynamic programmingStatistical interpretation of alignmentsComputing optimal global alignmentComputing optimal local

14、 alignmentTime and space complexityImprovement of time and spaceScoring functions25DNA2:Todays story and goalsMotivation and connection to DNA1Comparing types of alignments&algorithms Dynamic programming Multi-sequence alignmentSpace-time-accuracy tradeoffsFinding genes-motif profilesHidden Markov M

15、odel for CpG Islands26A Multiple Alignment of Immunoglobulins27A multiple alignment Dynamic programming on a hyperlatticeFrom G.Fullen,1996.28Multiple Alignment vs Pairwise AlignmentOptimal Multiple AlignmentNon-Optimal Pairwise Alignment29Computing a Node on HyperlatticeVSAk=3 2k 1=730Challenges of

16、 Optimal Multiple AlignmentszSpace complexity(hyperlattice size):O(nk)for k sequences each n long.zComputing a hyperlattice node:O(2k).zTime complexity:O(2knk).zFind the optimal solution is exponential in k(non-polynomial,NP-hard).31Methods and Heuristics for Optimal Multiple AlignmentszOptimal:dyna

17、mic programmingPruning the hyperlattice(MSA)zHeuristics:tree alignments(ClustalW)star alignmentssampling(Gibbs)(discussed in RNA2)local profiling with iteration(PSI-Blast,.)32ClustalW:Progressive Multiple AlignmentAll PairwiseAlignmentsCluster AnalysisSimilarity MatrixDendrogramFrom Higgins(1991)and

18、 Thompson(1994).33Star AlignmentsPairwise AlignmentFind the CentralSequence s1Pairwise AlignmentMultiple AlignmentCombine intoMultiple Alignment34DNA2:Todays story and goalsMotivation and connection to DNA1Comparing types of alignments&algorithms Dynamic programming Multi-sequence alignmentSpace-tim

19、e-accuracy tradeoffsFinding genes-motif profilesHidden Markov Model for CpG Islands35What is distinctive?Failure to find edges?0.Promoters&CGs islandsVariety&combinations1.Preferred codonsTiny proteins(&RNAs)2.RNA splice signals Alternatives&weak motifs3.Frame across splicesAlternatives4.Inter-speci

20、es conservationGene too close or distant 5.cDNA for splice edgesRare transcriptAccurately finding genes&their edges36Annotated Protein Sizes in Yeast&Mycoplasmax=Protein size in#aa%of proteins at length xYeast37Predicting small proteins(ORFs)minmaxYeast38Mutations in domain II of 23 S rRNA facilitat

21、e translation of a 23 S rRNA-encoded pentapeptide conferring erythromycin resistance.Dam et al.1996 J Mol Biol 259:1-6Trp(W)leader peptide,14 codons:MKAIFVLKGWWRTS Phe(F)leader peptide,15 codons:MKHIPFFFAFFFTFPHis(H)leader peptide,16 codons:MTRVQFKHHHHHHHPDSmall coding regionsSTOPSTOPSTOPOther examp

22、les in proteomics lectures39Motif Matrices a a t g c a t g g a t g t g t g a 1 3 0 0c 1 0 0 0g 1 1 0 4t 1 0 4 0Align and calculate frequencies.Note:Higher order correlations lost.40Protein startsGeneMark41Motif Matrices a a t g 1+3+4+4=12 c a t g 1+3+4+4=12 g a t g 1+3+4+4=12 t g t g 1+1+4+4=10a 1 3

23、 0 0c 1 0 0 0g 1 1 0 4t 1 0 4 0Align and calculate frequencies.Note:Higher order correlations lost.Score test sets:a c c c 1+0+0+0=142DNA2:Todays story and goalsMotivation and connection to DNA1Comparing types of alignments&algorithms Dynamic programming Multi-sequence alignmentSpace-time-accuracy t

24、radeoffsFinding genes-motif profilesHidden Markov Model for CpG Islands43Why probabilistic models in sequence analysis?z Recognition-Is this sequence a protein start?z Discrimination-Is this protein more like a hemoglobin or a myoglobin?z Database search-What are all of sequences in SwissProt that l

25、ook like a serine protease?44A Basic ideaAssign a number to every possible sequence such that sP(s|M)=1P(s|M)is a probability of sequence s given a model M.45Sequence recognition Recognition question-What is the probability that the sequence s is from the start site model M?P(M|s)=P(M)*P(s|M)/P(s)(B

26、ayes theorem)P(M)and P(s)are prior probabilities and P(M|s)is posterior probability.46Database searchz N=null model(random bases or AAs)z Report all sequences with logP(s|M)-logP(s|N)logP(N)-logP(M)z Example,say a/b hydrolase fold is rare in the database,about 10 in 10,000,000.The threshold is 20 bi

27、ts.If considering 0.05 as a significant level,then the threshold is 20+4.4=24.4 bits.47C rare due to lack of uracil glycosylase(cytidine deamination)TT rare due to lack of UV repair enzymes.CG rare due to 5methylCG to TG transitions(cytidine deamination)AGG rare due to low abundance of the correspon

28、ding Arg-tRNA.CTAG rare in bacteria due to error-prone repair of CTAGG to C*CAGG.AAAA excess due to polyA pseudogenes and/or polymerase slippage.AmAcid Codon Number /1000 Fraction Arg AGG 3363.00 1.93 0.03Arg AGA 5345.00 3.07 0.06Arg CGG 10558.00 6.06 0.11Arg CGA 6853.00 3.94 0.07Arg CGT 34601.00 19

29、.87 0.36Arg CGC 36362.00 20.88 0.37Plausible sources of mono,di,tri,&tetra-nucleotide biases48C+A+G+T+P(G+|C+)P(A+|A+)CpG Island+in a ocean of-First order Markov ModelMM=16,HMM=64 transition probabilities(adjacent bp)C-A-G-T-P(C-|A+)Hidden49Estimate transistion probabilities-an exampleTraining setP(

30、G|C)=#(CG)/N#(CN)Laplace pseudocount:Add+1 count to each observed.(p.9,108,321 Dirichlet)50Estimated transistion probabilities from 48 known islandsTraining setP(G|C)=#(CG)/N#(CN)51Viterbi:dynamic programming for HMMRecursion:vl(i+1)=el(xi+1)max(vk(i)akl)si=Most probable pathl,k=2statesa=table in slide 51e=emit si in state l (Durbin p.56)1/8*.27 52DNA2:Todays story and goalsMotivation and connection to DNA1Comparing types of alignments&algorithms Dynamic programming Multi-sequence alignmentSpace-time-accuracy tradeoffsFinding genes-motif profilesHidden Markov Model for CpG Islands53

展开阅读全文
温馨提示:
1: 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
2: 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
3.本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
5. 装配图网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
关于我们 - 网站声明 - 网站地图 - 资源地图 - 友情链接 - 网站客服 - 联系我们

copyright@ 2023-2025  zhuangpeitu.com 装配图网版权所有   联系电话:18123376007

备案号:ICP2024067431-1 川公网安备51140202000466号


本站为文档C2C交易模式,即用户上传的文档直接被用户下载,本站只是中间服务平台,本站所有文档下载所得的收益归上传人(含作者)所有。装配图网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私,请立即通知装配图网,我们立即给予删除!