DNA1Lastweek'stake-homelessons
《DNA1Lastweek'stake-homelessons》由会员分享,可在线阅读,更多相关《DNA1Lastweek'stake-homelessons(53页珍藏版)》请在装配图网上搜索。
1、DNA1:Last weeks take-home lessonsTypes of mutantsMutation,drift,selection Binomial for eachAssociation studies c2 statisticLinked&causative allelesAlleles,Haplotypes,genotypesComputing the first genome,the second.New technologiesRandom and systematic errors1DNA2:Todays story and goalsMotivation and
2、connection to DNA1Comparing types of alignments&algorithms Dynamic programming Multi-sequence alignmentSpace-time-accuracy tradeoffsFinding genes-motif profilesHidden Markov Model for CpG Islands2DNA 2figureIntro2:Common&simpleDNA1:the last 5000 generations3Applications of Dynamic ProgrammingzTo seq
3、uence analysisShotgun sequence assemblyMultiple alignmentsDispersed&tandem repeatsBird song alignmentsGene Expression time-warpingzThrough HMMsRNA gene search&structure predictionDistant protein homologiesSpeech recognition4Alignments&ScoresGlobal(e.g.haplotype)ACCACACA:xx:x:ACACCATAScore=5(+1)+3(-1
4、)=2Suffix(shotgun assembly)ACCACACA :ACACCATAScore=3(+1)=3 Local(motif)ACCACACA :ACACCATAScore=4(+1)=45Increasingly complex(accurate)searchesExact(StringSearch)CGCGRegular expression(PrositeSearch)CGN0-9CG =CGAACGSubstitution matrix(BlastN)CGCG =CACG Pro(PSI-blast)CGc(g/a)=CACGGaps (Gap-Blast)CGCG =
5、CGAACGDynamic Programming(NW,SM)CGCG =CAGACGHidden Markov Models(HMMER)WU6Hardness of(multi-)sequence alignmentAlign 2 sequences of length N allowing gaps.ACCAC-ACA ACCACACA :x:x:x:xxxxxx:AC-ACCATA ,A-CACCATA ,etc.2N gap positions,gap lengths of 0 to N each:A nave algorithm might scale by O(N2N).For
6、 N=3x109 this is rather large.Now,what about k2 sequences?or rearrangements other than gaps?7Separate Training set and Testing setsNeed databases of non-redundant sets.Need evaluation criteria(programs)Sensistivity and Specificity (false negatives&positives)sensitivity(true_predicted/true)specificit
7、y(true_predicted/all_predicted)Where do training sets come from?More expensive experiments:crystallography,genetics,biochemistryTesting search&classification algorithms8Pearson WR Protein Sci 1995 Jun;4(6):1145-60 Comparison of methods for searching protein sequence databases.Methods Enzymol 1996;26
8、6:227-58 Effective protein sequence comparison.Algorithm:FASTA,Blastp,BlitzSubstitution matrix:PAM120,PAM250,BLOSUM50,BLOSUM62Database:PIR,SWISS-PROT,GenPeptComparisons of homology scores9Switch to protein searches when possibleMAdjacent mRNA codonsF 3 uac 5.aug 3aag uuu.10A Multiple Alignment of Im
9、munoglobulins11Scoring matrix based on large set of distantly related blocks:Blosum6212Scoring Functions and AlignmentszScoring function:(match)=+1;(mismatch)=-1;(indel)=-2;(other)=0.zAlignment score:sum of columns.zOptimal alignment:maximum score.substitution matrix13Calculating Alignment Scores14D
10、NA2:Todays story and goalsMotivation and connection to DNA1Comparing types of alignments&algorithms Dynamic programming Multi-sequence alignmentSpace-time-accuracy tradeoffsFinding genes-motif profilesHidden Markov Model for CpG Islands15What is dynamic programming?A dynamic programming algorithm so
11、lves every subsubproblem just once and then saves its answer in a table,avoiding the work of recomputing the answer every time the subsubproblem is encountered.-Cormen et al.Introduction to Algorithms,The MIT Press.16Recursion of Optimal Global Alignments17Recursion of Optimal Local Alignments18Comp
12、uting Row-by-Rowmin=-109919Traceback Optimal Global Alignment20Local and Global Alignments21Time and Space Complexity of Computing Alignments22Time and Space ProblemszComparing two one-megabase genomes.zSpace:An entry:4 bytes;Table:4*106*106=4 G bytes memory.zTime:1000 MHz CPU:1M entries/second;1012
13、 entries:1M seconds=10 days.23Time&Space Improvement for w-band Global AlignmentszTwo sequences differ by at most w bps(wn).zw-band algorithm:O(wn)time and space.zExample:w=3.24SummaryDynamic programmingStatistical interpretation of alignmentsComputing optimal global alignmentComputing optimal local
14、 alignmentTime and space complexityImprovement of time and spaceScoring functions25DNA2:Todays story and goalsMotivation and connection to DNA1Comparing types of alignments&algorithms Dynamic programming Multi-sequence alignmentSpace-time-accuracy tradeoffsFinding genes-motif profilesHidden Markov M
15、odel for CpG Islands26A Multiple Alignment of Immunoglobulins27A multiple alignment Dynamic programming on a hyperlatticeFrom G.Fullen,1996.28Multiple Alignment vs Pairwise AlignmentOptimal Multiple AlignmentNon-Optimal Pairwise Alignment29Computing a Node on HyperlatticeVSAk=3 2k 1=730Challenges of
16、 Optimal Multiple AlignmentszSpace complexity(hyperlattice size):O(nk)for k sequences each n long.zComputing a hyperlattice node:O(2k).zTime complexity:O(2knk).zFind the optimal solution is exponential in k(non-polynomial,NP-hard).31Methods and Heuristics for Optimal Multiple AlignmentszOptimal:dyna
17、mic programmingPruning the hyperlattice(MSA)zHeuristics:tree alignments(ClustalW)star alignmentssampling(Gibbs)(discussed in RNA2)local profiling with iteration(PSI-Blast,.)32ClustalW:Progressive Multiple AlignmentAll PairwiseAlignmentsCluster AnalysisSimilarity MatrixDendrogramFrom Higgins(1991)and
18、 Thompson(1994).33Star AlignmentsPairwise AlignmentFind the CentralSequence s1Pairwise AlignmentMultiple AlignmentCombine intoMultiple Alignment34DNA2:Todays story and goalsMotivation and connection to DNA1Comparing types of alignments&algorithms Dynamic programming Multi-sequence alignmentSpace-tim
19、e-accuracy tradeoffsFinding genes-motif profilesHidden Markov Model for CpG Islands35What is distinctive?Failure to find edges?0.Promoters&CGs islandsVariety&combinations1.Preferred codonsTiny proteins(&RNAs)2.RNA splice signals Alternatives&weak motifs3.Frame across splicesAlternatives4.Inter-speci
20、es conservationGene too close or distant 5.cDNA for splice edgesRare transcriptAccurately finding genes&their edges36Annotated Protein Sizes in Yeast&Mycoplasmax=Protein size in#aa%of proteins at length xYeast37Predicting small proteins(ORFs)minmaxYeast38Mutations in domain II of 23 S rRNA facilitat
21、e translation of a 23 S rRNA-encoded pentapeptide conferring erythromycin resistance.Dam et al.1996 J Mol Biol 259:1-6Trp(W)leader peptide,14 codons:MKAIFVLKGWWRTS Phe(F)leader peptide,15 codons:MKHIPFFFAFFFTFPHis(H)leader peptide,16 codons:MTRVQFKHHHHHHHPDSmall coding regionsSTOPSTOPSTOPOther examp
22、les in proteomics lectures39Motif Matrices a a t g c a t g g a t g t g t g a 1 3 0 0c 1 0 0 0g 1 1 0 4t 1 0 4 0Align and calculate frequencies.Note:Higher order correlations lost.40Protein startsGeneMark41Motif Matrices a a t g 1+3+4+4=12 c a t g 1+3+4+4=12 g a t g 1+3+4+4=12 t g t g 1+1+4+4=10a 1 3
23、 0 0c 1 0 0 0g 1 1 0 4t 1 0 4 0Align and calculate frequencies.Note:Higher order correlations lost.Score test sets:a c c c 1+0+0+0=142DNA2:Todays story and goalsMotivation and connection to DNA1Comparing types of alignments&algorithms Dynamic programming Multi-sequence alignmentSpace-time-accuracy t
24、radeoffsFinding genes-motif profilesHidden Markov Model for CpG Islands43Why probabilistic models in sequence analysis?z Recognition-Is this sequence a protein start?z Discrimination-Is this protein more like a hemoglobin or a myoglobin?z Database search-What are all of sequences in SwissProt that l
25、ook like a serine protease?44A Basic ideaAssign a number to every possible sequence such that sP(s|M)=1P(s|M)is a probability of sequence s given a model M.45Sequence recognition Recognition question-What is the probability that the sequence s is from the start site model M?P(M|s)=P(M)*P(s|M)/P(s)(B
26、ayes theorem)P(M)and P(s)are prior probabilities and P(M|s)is posterior probability.46Database searchz N=null model(random bases or AAs)z Report all sequences with logP(s|M)-logP(s|N)logP(N)-logP(M)z Example,say a/b hydrolase fold is rare in the database,about 10 in 10,000,000.The threshold is 20 bi
27、ts.If considering 0.05 as a significant level,then the threshold is 20+4.4=24.4 bits.47C rare due to lack of uracil glycosylase(cytidine deamination)TT rare due to lack of UV repair enzymes.CG rare due to 5methylCG to TG transitions(cytidine deamination)AGG rare due to low abundance of the correspon
28、ding Arg-tRNA.CTAG rare in bacteria due to error-prone repair of CTAGG to C*CAGG.AAAA excess due to polyA pseudogenes and/or polymerase slippage.AmAcid Codon Number /1000 Fraction Arg AGG 3363.00 1.93 0.03Arg AGA 5345.00 3.07 0.06Arg CGG 10558.00 6.06 0.11Arg CGA 6853.00 3.94 0.07Arg CGT 34601.00 19
29、.87 0.36Arg CGC 36362.00 20.88 0.37Plausible sources of mono,di,tri,&tetra-nucleotide biases48C+A+G+T+P(G+|C+)P(A+|A+)CpG Island+in a ocean of-First order Markov ModelMM=16,HMM=64 transition probabilities(adjacent bp)C-A-G-T-P(C-|A+)Hidden49Estimate transistion probabilities-an exampleTraining setP(
30、G|C)=#(CG)/N#(CN)Laplace pseudocount:Add+1 count to each observed.(p.9,108,321 Dirichlet)50Estimated transistion probabilities from 48 known islandsTraining setP(G|C)=#(CG)/N#(CN)51Viterbi:dynamic programming for HMMRecursion:vl(i+1)=el(xi+1)max(vk(i)akl)si=Most probable pathl,k=2statesa=table in slide 51e=emit si in state l (Durbin p.56)1/8*.27 52DNA2:Todays story and goalsMotivation and connection to DNA1Comparing types of alignments&algorithms Dynamic programming Multi-sequence alignmentSpace-time-accuracy tradeoffsFinding genes-motif profilesHidden Markov Model for CpG Islands53
- 温馨提示:
1: 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
2: 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
3.本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
5. 装配图网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- 卡通可爱绿色小学生家长会模板课件
- 卡通可爱老师教育教学模板课件
- 卡通可爱幼儿园大班家长会模板课件
- 卡通夏日暑假班会家长会模板课件
- 卡通可爱创意爱情告白求婚婚礼婚庆策划方案模板课件
- 卡通可爱军人动态模板通用模板课件
- 卡通可爱五一劳动最光荣主题班会模板课件
- 卡通可爱小学生常用急救知识模板课件
- 卡通动画小乌龟Franklin_02_02【声音字幕同步】课件
- 卡通儿童预防冬季流感科普宣传模板课件
- 卡通动漫动物人物绘制课件
- 卡通可爱儿童节主题活动策划方案模板课件
- 卡通儿童珍爱生命防溺水主题班会模板课件
- 卡通动漫教育教学课程设计教师说课模板课件
- 身体工作动态静心资料来源Osho的静心与健康若欲详解敬请课件