Cluster Analysis of Microarray Data


Dan Nettleton, 4/13/2009 (Copyright 2009 Dan Nettleton)

Clustering
- Group objects that are similar to one another together in a cluster.
- Separate objects that are dissimilar from each other into different clusters.
- The similarity or dissimilarity of two objects is determined by comparing the objects with respect to one or more attributes that can be measured for each object.

Data for Clustering

                     attribute
   object     1     2     3    ...    m
      1      4.7   3.8   5.9   ...   1.3
      2      5.2   6.9   3.8   ...   2.9
      3      5.8   4.2   3.9   ...   4.4
     ...     ...   ...   ...   ...   ...
      n      6.3   1.6   4.7   ...   2.0
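In R, such a table is just a numeric matrix, and the pairwise dissimilarities used by the methods below come from dist(). A minimal sketch using the example values visible in the table (the middle attributes are elided on the slide, so only four columns are used here):

```r
# Objects-by-attributes matrix with the values shown in the table above.
x <- rbind(c(4.7, 3.8, 5.9, 1.3),
           c(5.2, 6.9, 3.8, 2.9),
           c(5.8, 4.2, 3.9, 4.4),
           c(6.3, 1.6, 4.7, 2.0))
rownames(x) <- c("object1", "object2", "object3", "objectN")

d <- dist(x)  # Euclidean dissimilarity between every pair of objects
print(d)
```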

Microarray Data for Clustering
The same objects-by-attributes layout, with entries the estimated expression levels, covers several microarray settings:
- objects = genes, attributes = time points
- objects = genes, attributes = tissue types
- objects = genes, attributes = treatment conditions
- objects = samples, attributes = genes

Clustering: An Example Experiment
- Researchers were interested in studying gene expression patterns in developing soybean seeds.

- Seeds were harvested from soybean plants at 25, 30, 40, 45, and 50 days after flowering (daf).
- One RNA sample was obtained for each level of daf.

An Example Experiment (continued)
- Each of the 5 samples was measured on two two-color cDNA microarray slides using a loop design.
- The entire process was repeated on a second occasion to obtain a total of two independent biological replications.

[Figure: Diagram Illustrating the Experimental Design, a loop over the daf levels 25, 30, 40, 45, 50 for each of Rep 1 and Rep 2.]

- The daf means estimated for each gene from a mixed linear model analysis provide a useful summary of the data for cluster analysis.

[Figure: Normalized Data for One Example Gene, normalized log signal vs. daf, with estimated means + or - 1 SE.]

- 400 genes exhibited significant evidence of differential expression across time (p-value = ...).

[Intervening slides on K-medoids clustering of these genes and on the gap statistic survive only as figures. The selection rule is to choose the smallest k with G(k) ≥ G(k+1) - SE; here G(3) ≥ G(4) - SE.]

The Gap Statistic Suggests K=3 Clusters

Gap Analysis for Two-Color Array Data (N=100)
[Figure: log Wk* and log Wk vs. k, and G(k) = log Wk* - log Wk vs. k (+ or - 1 standard error), where k = number of clusters.]

Gap Analysis Estimates K=11 Clusters
[Figure: "zoomed in" version of the previous G(k) plot.]
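Such a gap analysis is straightforward to reproduce. A hedged sketch: clusGap() in R's cluster package computes log Wk, its average over B reference data sets, and G(k) with standard errors, while pam() supplies the k-medoid clusterings; the matrix x below is a random stand-in for the 400 x 5 matrix of estimated daf means, not the actual data.

```r
library(cluster)

set.seed(1)
x <- matrix(rnorm(400 * 5), nrow = 400, ncol = 5)  # stand-in for the real data

# Gap statistic: compare log Wk to its average over B reference data sets.
gap <- clusGap(x,
               FUNcluster = function(x, k) list(cluster = pam(x, k, cluster.only = TRUE)),
               K.max = 12, B = 50)

# Choose the smallest k with G(k) >= G(k+1) - SE(k+1).
maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "Tibs2001SEmax")
```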

[Figure slides, concluding with a Plot of Cluster Medoids.]

Principal Components
- Principal components can be useful for providing low-dimensional views of high-dimensional data.
- The data matrix (or data set) X has one row per observation (object) and one column per variable (attribute):

      X = [ x11  x12  ...  x1m ]
          [ x21  x22  ...  x2m ]
          [  .    .    .    .  ]
          [ xn1  xn2  ...  xnm ]

  (n = number of observations, m = number of variables)

Principal Components (continued)
- Each principal component of a data set is a variable obtained by taking a linear combination of the original variables in the data set.
- A linear combination of m variables x1, x2, ..., xm is given by c1x1 + c2x2 + ... + cmxm.
- For the purpose of constructing principal components, the vector of coefficients is restricted to have unit length, i.e., c1² + c2² + ... + cm² = 1.

Principal Components (continued)
- The first principal component is the linear combination of the variables that has maximum variation across the observations in the data set.
- The jth principal component is the linear combination of the variables that has maximum variation across the observations in the data set, subject to the constraint that its vector of coefficients be orthogonal to the coefficient vectors of principal components 1, ..., j-1.
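These definitions map directly onto R's prcomp(), which centers the variables by default; a minimal sketch, with x again a random stand-in for the n = 400 by m = 5 data matrix:

```r
x <- matrix(rnorm(400 * 5), nrow = 400, ncol = 5)  # assumed data matrix

pc <- prcomp(x)   # principal components of the column-centered data
head(pc$x)        # scores: PC1, ..., PC5 for each observation

# The coefficient vectors have unit length and are mutually orthogonal:
colSums(pc$rotation^2)                      # each equals 1
round(t(pc$rotation) %*% pc$rotation, 10)   # identity matrix
```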

The Simple Data Example
[Figure: scatterplot of x1 vs. x2.]

The First Principal Component Axis / The First Principal Component
[Figures: the 1st PC for a point is the signed distance between its projection onto the 1st PC axis and the origin.]

The Second Principal Component Axis / The Second Principal Component
[Figures: likewise, the 2nd PC for a point is the signed distance between its projection onto the 2nd PC axis and the origin.]

Plot of PC1 vs. PC2
[Figure.] Compare the PC plot to the plot of the original data: because there are only two variables here, the plot of PC2 vs. PC1 is just a rotation of the original plot.

- There is more to be gained when the number of variables is greater than 2.
- Consider the principal components for the 400 significant genes from our two-color microarray experiment. Our data matrix has n=400 rows and m=5 columns.
- We have looked at these data using parallel coordinate plots. What would they look like if we projected the data points to 2 dimensions?

Projection of Two-Color Array Data with 11-Medoid Clustering
[Figures: projections onto pairs of principal components (PC1 vs. PC2, PC1 vs. PC3, and related views), with points labeled by cluster: a=1, b=2, c=3, d=4, e=5, f=6, g=7, h=8, i=9, j=10, k=11.]
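A sketch of how such a projection plot can be drawn, assuming the stand-in matrix x from the sketches above and an 11-medoid clustering from pam(); the letter coding a = 1, ..., k = 11 mirrors the slides:

```r
library(cluster)

pc  <- prcomp(x)        # principal component scores
med <- pam(x, k = 11)   # 11-medoid clustering

plot(pc$x[, 1], pc$x[, 2], type = "n", xlab = "PC1", ylab = "PC2",
     main = "Projection with 11-Medoid Clustering")
text(pc$x[, 1], pc$x[, 2],
     labels = letters[med$clustering])  # a = cluster 1, ..., k = cluster 11
```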

Hierarchical Clustering Methods
- Hierarchical clustering methods build a nested sequence of clusters that can be displayed using a dendrogram.
- We will begin with some simple illustrations and then move on to a more general discussion.

The Simple Example Data with Observation Numbers
[Figure: x1 vs. x2 scatterplot with points labeled by observation number.]

Dendrogram for the Simple Example Data: Tree Structure
[Figure: a dendrogram with its parts labeled: the root node, parent nodes, and the terminal nodes or leaves corresponding to objects. Daughter nodes with the same parent are sister nodes.]

A Hierarchical Clustering of the Simple Example Data
[Figure: scatterplot of the data alongside its dendrogram, showing clusters within clusters within clusters.]

- The height of a node represents the dissimilarity between the two clusters merged together at the node. (In the example dendrogram, the two top-level clusters have a dissimilarity of about 1.75.)

The appearance of a dendrogram is not unique:
- Any two sister nodes could trade places without changing the meaning of the dendrogram. Thus 14 appearing next to 7 does not imply that these objects are similar.
- By convention, R dendrograms show the lower sister node on the left; ties are broken by observation number (e.g., 13 is to the left of 14).
- The lengths of the branches leading to terminal nodes have no particular meaning in R dendrograms.

- Cutting the tree at a given height corresponds to a partitioning of the data into k clusters; cutting the example tree at successively lower heights yields k = 2, 3, 4, ..., 10 clusters.

Agglomerative (Bottom-Up) Hierarchical Clustering
- Define a measure of distance between any two clusters. (An individual object is considered a cluster of size one.)
- Find the two nearest clusters and merge them together to form a new cluster.
- Repeat until all objects have been merged into a single cluster. (See the R sketch below.)
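This procedure, the dendrogram, and the tree cutting are all available through hclust() and cutree(); a minimal sketch, assuming a data matrix x like the simple example:

```r
d  <- dist(x)                        # Euclidean distances; each object starts as its own cluster
hc <- hclust(d, method = "average")  # repeatedly merge the two nearest clusters
plot(hc)                             # dendrogram; node heights are merge dissimilarities

cutree(hc, k = 2)     # cut the tree to obtain k = 2 clusters
cutree(hc, h = 1.75)  # or cut at a given height instead
```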

Common Measures of Between-Cluster Distance
- Single Linkage, a.k.a. Nearest Neighbor: the distance between any two clusters A and B is the minimum of all distances from an object in cluster A to an object in cluster B.
- Complete Linkage, a.k.a. Farthest Neighbor: the distance between any two clusters A and B is the maximum of all distances from an object in cluster A to an object in cluster B.
- Average Linkage: the distance between any two clusters A and B is the average of all distances from an object in cluster A to an object in cluster B.
- Centroid Linkage: the distance between any two clusters A and B is the distance between the centroids of clusters A and B. (The centroid of a cluster is the componentwise average of the objects in the cluster.) All four measures are compared in the sketch below.
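All four measures correspond to values of hclust()'s method argument; a sketch comparing them on one distance matrix (R's documentation advises supplying squared Euclidean distances for centroid linkage):

```r
d  <- dist(x)                 # x as in the earlier sketches
op <- par(mfrow = c(2, 2))
for (m in c("single", "complete", "average", "centroid")) {
  dd <- if (m == "centroid") d^2 else d  # centroid linkage expects squared distances
  plot(hclust(dd, method = m), main = paste(m, "linkage"))
}
par(op)
```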

Agglomerative Clustering Using Average Linkage for the Simple Example Data Set
[Figure: scatterplot of the data and the dendrogram, with merges labeled A through P in order: A. 1-2, B. 9-10, C. 3-4, D. 5-6, E. 7-(5,6), F. 13-14, G. 11-12, H. (1,2)-(3,4), I. (9,10)-(11,12), etc.]

Agglomerative Clustering Using Single, Complete, and Centroid Linkage for the Simple Example Data Set
[Figures: the corresponding dendrograms.]

- Centroid linkage is not monotone, in the sense that later cluster merges can involve clusters that are more similar to each other than earlier merges. In the example, the merge between 4 and (1,2,3,5) creates a cluster whose centroid is closer to the (6,7) centroid than 4 was to the centroid of (1,2,3,5).

Agglomerative Clustering of the Two-Color Microarray Data Set
[Figures: dendrograms under single, complete, average, and centroid linkage.]

Which Between-Cluster Distance is Best?
- It depends, of course, on what is meant by "best".
- Single linkage tends to produce "long stringy" clusters.
- Complete linkage produces compact spherical clusters but might result in some objects that are closer to objects in clusters other than their own. (See the next example.)
- Average linkage is a compromise between single and complete linkage.
- Centroid linkage is not monotone.

Exercise
1. Conduct agglomerative hierarchical clustering for this data using Euclidean distance and complete linkage.
2. Display your results using a dendrogram.
3. Identify the k=2 clustering using your results. (An R sketch of these steps follows.)
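A sketch of the three steps, assuming the exercise data are in a matrix x:

```r
hc <- hclust(dist(x), method = "complete")  # 1. Euclidean distance, complete linkage
plot(hc)                                    # 2. display the dendrogram
cutree(hc, k = 2)                           # 3. identify the k = 2 clustering
```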

Results of Complete-Linkage Clustering
[Figure: the dendrogram and the resulting k=2 clusters.]

Divisive (Top-Down) Hierarchical Clustering
- Start with all data in one cluster and divide it into two clusters (using, e.g., 2-means or 2-medoids clustering).
- At each subsequent step, choose one of the existing clusters and divide it into two clusters.
- Repeat until there are n clusters, each containing a single object.

Potential Problem with Divisive Clustering
[Figure.]

Macnaughton-Smith et al. (1965)
1. Start with all objects in one cluster A.
2. Find the object with the largest average dissimilarity to all other objects in A and move that object to a new cluster B.
3. Find the object in cluster A whose average dissimilarity to the other objects in cluster A minus its average dissimilarity to the objects in cluster B is maximum. If this difference is positive, move the object to cluster B.
4. Repeat step 3 until no objects satisfying step 3 are found.
5. Apply steps 1 through 4 to one of the existing clusters (e.g., the one with the largest average within-cluster dissimilarity) until n clusters of 1 object each are obtained.

[Figures: the split illustrated on the example data, with the splinter group B growing out of cluster A. Steps 1-4 are sketched in R below.]
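Steps 1 through 4, a single Macnaughton-Smith split, are short enough to sketch directly. This is an illustrative implementation, not the original authors' code; it assumes a full dissimilarity matrix d such as as.matrix(dist(x)):

```r
# One Macnaughton-Smith split of the objects 1..n, given an n x n
# dissimilarity matrix d. Returns the two resulting clusters.
ms_split <- function(d) {
  A <- seq_len(nrow(d))
  # Step 2: seed cluster B with the object having the largest average
  # dissimilarity to all other objects in A.
  avg <- sapply(A, function(i) mean(d[i, -i]))
  B <- A[which.max(avg)]
  A <- setdiff(A, B)
  # Steps 3 and 4: repeatedly move the object whose average dissimilarity to
  # the rest of A exceeds its average dissimilarity to B by the most,
  # stopping when no difference is positive.
  while (length(A) > 1) {
    diffs <- sapply(A, function(i) mean(d[i, setdiff(A, i)]) - mean(d[i, B]))
    if (max(diffs) <= 0) break
    mv <- A[which.max(diffs)]
    B  <- c(B, mv)
    A  <- setdiff(A, mv)
  }
  list(A = A, B = B)
}

# Example: ms_split(as.matrix(dist(x)))
```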

- Next, continue to split each of these clusters until each object is in a cluster by itself.

Dendrogram for the Macnaughton-Smith Approach
[Figure.]

Agglomerative vs. Divisive Clustering
- Divisive clustering has not been studied as extensively as agglomerative clustering.
- Divisive clustering may be preferred if only a small number of large clusters is desired.
- Agglomerative clustering may be preferred if a large number of small clusters is desired.
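For a ready-made divisive method, diana() in the cluster package (DIvisive ANAlysis, Kaufman and Rousseeuw) builds a complete divisive hierarchy from this same splinter-group idea; a sketch, again assuming a data matrix x:

```r
library(cluster)

dv <- diana(x)                 # divisive hierarchy from Euclidean dissimilarities
plot(as.hclust(dv))            # display as a dendrogram
cutree(as.hclust(dv), k = 2)   # e.g., a small number of large clusters
```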
