Cluster Analysis of Microarray Data
4/13/2009
Copyright 2009 Dan Nettleton
Clustering

Group objects that are similar to one another together in a cluster. Separate objects that are dissimilar from each other into different clusters. The similarity or dissimilarity of two objects is determined by comparing the objects with respect to one or more attributes that can be measured for each object.

Data for Clustering

                  attribute
  object    1     2     3   ...    m
    1      4.7   3.8   5.9  ...   1.3
    2      5.2   6.9   3.8  ...   2.9
    3      5.8   4.2   3.9  ...   4.4
    .       .     .     .          .
    n      6.3   1.6   4.7  ...   2.0

Microarray Data for Clustering

With microarray data the entries of this table are estimated expression levels, and the same layout arises in several ways:

- objects = genes, attributes = time points
- objects = genes, attributes = tissue types
- objects = genes, attributes = treatment conditions
- objects = samples, attributes = genes

Clustering: An Example Experiment

Researchers were interested in studying gene expression patterns in developing soybean seeds. Seeds were harvested from soybean plants at 25, 30, 40, 45, and 50 days after flowering (daf). One RNA sample was obtained for each level of daf.

An Example Experiment (continued)

Each of the 5 samples was measured on two two-color cDNA microarray slides using a loop design. The entire process was repeated on a second occasion to obtain a total of two independent biological replications.

[Figure: diagram illustrating the experimental design, with two loops (Rep 1 and Rep 2) each connecting the 25, 30, 40, 45, and 50 daf samples.]

The daf means estimated for each gene from a mixed linear model analysis provide a useful summary of the data for cluster analysis.

[Figure: normalized data for one example gene, showing normalized log signal vs. daf and estimated means plus or minus 1 SE vs. daf.]

400 genes exhibited significant evidence of differential expression across time (p-value below a fixed cutoff).

The number of clusters K can be estimated with the gap statistic: choose the smallest k for which G(k) ≥ G(k+1) - SE(k+1), where SE(k+1) is the standard error of G(k+1).

The Gap Statistic Suggests K=3 Clusters

[Figure: G(k) vs. k; k = 3 is the smallest k with G(k) ≥ G(k+1) - SE, here G(3) ≥ G(4) - SE.]

Gap Analysis for Two-Color Array Data (N=100)

[Figure: log Wk and log Wk* vs. k = number of clusters, and G(k) = log Wk* - log Wk vs. k with plus or minus 1 standard error bars. Here log Wk measures within-cluster dispersion for k clusters, and log Wk* is its expected value under a reference distribution with no cluster structure.]

Gap Analysis for Two-Color Array Data (N=100)

[Figure: "zoomed in" version of the previous G(k) plot. The gap analysis estimates K=11 clusters.]

Plot of Cluster Medoids

[Figure: the medoids of the 11 clusters.]
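A gap analysis like the one above can be carried out in R with the clusGap() function from the cluster package. The sketch below is an illustration only, not the analysis behind these slides: the matrix x is a made-up stand-in for the 400 x 5 matrix of estimated daf means, and k-medoids clustering (pam) plays the role of the clustering routine.

```r
# Minimal gap-statistic sketch; `x` is placeholder data standing in for the
# 400 x 5 matrix of estimated daf means.
library(cluster)  # provides pam() and clusGap()

set.seed(1)
x <- matrix(rnorm(400 * 5), nrow = 400, ncol = 5)

# Gap statistic with k-medoids (PAM) as the clustering routine.
gap <- clusGap(x,
               FUNcluster = function(x, k)
                 list(cluster = pam(x, k, cluster.only = TRUE)),
               K.max = 15,  # largest number of clusters considered
               B = 100)     # number of reference data sets

plot(gap)  # G(k) vs. k with +/- 1 standard error bars

# Smallest k with G(k) >= G(k+1) - SE(k+1).
maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "Tibs2001SEmax")
```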
Principal Components

Principal components can be useful for providing low-dimensional views of high-dimensional data. The data matrix (or data set) X has one row per observation (object) and one column per variable (attribute), with n observations and m variables:

         x11  x12  ...  x1m
         x21  x22  ...  x2m
  X  =    .    .          .
          .    .          .
         xn1  xn2  ...  xnm

Principal Components (continued)

Each principal component of a data set is a variable obtained by taking a linear combination of the original variables in the data set. A linear combination of m variables x1, x2, ..., xm is given by

  c1x1 + c2x2 + ... + cmxm.

For the purpose of constructing principal components, the vector of coefficients is restricted to have unit length, i.e.,

  c1² + c2² + ... + cm² = 1.

Principal Components (continued)

The first principal component is the linear combination of the variables that has maximum variation across the observations in the data set. The jth principal component is the linear combination of the variables that has maximum variation across the observations in the data set, subject to the constraint that its vector of coefficients is orthogonal to the coefficient vectors of principal components 1, ..., j-1.

The Simple Data Example

[Figures: the simple example data in the (x1, x2) plane; the first principal component axis; the first PC of a point, which is the signed distance between its projection onto the first PC axis and the origin; the second principal component axis; the second PC of a point, which is the signed distance between its projection onto the second PC axis and the origin; and a plot of PC1 vs. PC2.]

Compare the PC plot to the plot of the original data: because there are only two variables here, the plot of PC2 vs. PC1 is just a rotation of the original plot. There is more to be gained when the number of variables is greater than 2.

Consider the principal components for the 400 significant genes from our two-color microarray experiment. Our data matrix has n=400 rows and m=5 columns. We have looked at these data using parallel coordinate plots. What would they look like if we projected the data points to 2 dimensions?

Projection of Two-Color Array Data with 11-Medoid Clustering

[Figures: the 400 genes projected onto pairs of the first three principal components (PC1 vs. PC2 and PC1 vs. PC3), with plotting symbols a, b, ..., k marking membership in clusters 1, 2, ..., 11.]
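In R, such projections can be obtained with prcomp(). A minimal sketch, where the matrix x and the 11-medoid cluster labels cl are again made-up stand-ins for the real data:

```r
# Project the genes onto the first two principal components and label each
# point by its cluster; `x` and `cl` are placeholder stand-ins.
library(cluster)

set.seed(1)
x  <- matrix(rnorm(400 * 5), nrow = 400, ncol = 5)
cl <- pam(x, k = 11, cluster.only = TRUE)  # 11-medoid clustering

pc <- prcomp(x)  # principal components of the m = 5 variables
plot(pc$x[, 1], pc$x[, 2], type = "n", xlab = "PC1", ylab = "PC2")
text(pc$x[, 1], pc$x[, 2], labels = letters[cl])  # a = cluster 1, ..., k = cluster 11
```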
Hierarchical Clustering Methods

Hierarchical clustering methods build a nested sequence of clusters that can be displayed using a dendrogram. We will begin with some simple illustrations and then move on to a more general discussion.

The Simple Example Data with Observation Numbers

[Figure: the simple example data in the (x1, x2) plane, with each point labeled by its observation number.]

Dendrogram for the Simple Example Data: Tree Structure

A dendrogram is a tree made up of nodes. The node at the top is the root node; the terminal nodes (or leaves) at the bottom correspond to the objects. The node formed when two nodes are merged is their parent node, and they are its daughter nodes; daughter nodes with the same parent are sister nodes.

A Hierarchical Clustering of the Simple Example Data

[Figure: scatterplot of the data alongside its dendrogram, showing clusters within clusters within clusters.]

The height of a node represents the dissimilarity between the two clusters merged together at the node. In the example dendrogram, for instance, two of the clusters merge at a dissimilarity of about 1.75.

The appearance of a dendrogram is not unique: any two sister nodes could trade places without changing the meaning of the dendrogram. Thus, observation 14 appearing next to observation 7 does not imply that these objects are similar. By convention, R dendrograms show the lower sister node on the left, with ties broken by observation number (e.g., 13 is drawn to the left of 14). The lengths of the branches leading to terminal nodes have no particular meaning in R dendrograms.

Cutting the tree at a given height corresponds to a partitioning of the data into k clusters.

[Figures: cutting the example dendrogram at successively lower heights yields k = 2, 3, 4, and 10 clusters.]

Agglomerative (Bottom-Up) Hierarchical Clustering

- Define a measure of distance between any two clusters. (An individual object is considered a cluster of size one.)
- Find the two nearest clusters and merge them together to form a new cluster.
- Repeat until all objects have been merged into a single cluster.

Common Measures of Between-Cluster Distance

- Single linkage (a.k.a. nearest neighbor): the distance between any two clusters A and B is the minimum of all distances from an object in cluster A to an object in cluster B.
- Complete linkage (a.k.a. farthest neighbor): the distance between any two clusters A and B is the maximum of all distances from an object in cluster A to an object in cluster B.
- Average linkage: the distance between any two clusters A and B is the average of all distances from an object in cluster A to an object in cluster B.
- Centroid linkage: the distance between any two clusters A and B is the distance between the centroids of clusters A and B. (The centroid of a cluster is the componentwise average of the objects in the cluster.)

Agglomerative Clustering Using Average Linkage for the Simple Example Data Set

[Figure: scatterplot of the data and the corresponding dendrogram, with the merges labeled A through P in order.]

The first merges are:

A. 1-2
B. 9-10
C. 3-4
D. 5-6
E. 7-(5,6)
F. 13-14
G. 11-12
H. (1,2)-(3,4)
I. (9,10)-(11,12)

and so on through merge P. An R sketch of this procedure appears below.
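Agglomerative clustering with any of the linkage measures above is available in R through hclust(). A minimal sketch, with x standing in for the simple example data:

```r
# Agglomerative hierarchical clustering of a small data set; `x` is made-up
# placeholder data standing in for the simple example data.
set.seed(1)
x <- matrix(rnorm(16 * 2), nrow = 16, ncol = 2)

d  <- dist(x)                        # pairwise Euclidean distances
hc <- hclust(d, method = "average")  # also "single", "complete", "centroid"
plot(hc)                             # dendrogram; lower sister node drawn on the left

cutree(hc, k = 3)  # cut the tree to obtain the k = 3 clustering
```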
Agglomerative Clustering Using Single, Complete, and Centroid Linkage for the Simple Example Data Set

[Figures: dendrograms for the simple example data under single, complete, and centroid linkage.]

Centroid linkage is not monotone, in the sense that later cluster merges can involve clusters that are more similar to each other than earlier merges. In the example, the merge between 4 and (1,2,3,5) creates a cluster whose centroid is closer to the (6,7) centroid than 4 was to the centroid of (1,2,3,5).

Agglomerative Clustering for the Two-Color Microarray Data Set

[Figures: dendrograms for the two-color microarray data using single, complete, average, and centroid linkage.]

Which Between-Cluster Distance is Best?

It depends, of course, on what is meant by "best".

- Single linkage tends to produce long, stringy clusters.
- Complete linkage produces compact, spherical clusters but might result in some objects that are closer to objects in clusters other than their own. (See the next example.)
- Average linkage is a compromise between single and complete linkage.
- Centroid linkage is not monotone.

Exercise

1. Conduct agglomerative hierarchical clustering for this data using Euclidean distance and complete linkage.
2. Display your results using a dendrogram.
3. Identify the k=2 clustering using your results.

[Figures: results of the complete-linkage clustering and the corresponding k=2 clusters.]

Divisive (Top-Down) Hierarchical Clustering

Start with all data in one cluster and divide it into two clusters (using, e.g., 2-means or 2-medoids clustering). At each subsequent step, choose one of the existing clusters and divide it into two clusters. Repeat until there are n clusters, each containing a single object.

Potential Problem with Divisive Clustering

[Figure: an example data set illustrating the problem.]

Macnaughton-Smith et al. (1965)

1. Start with all objects in one cluster A.
2. Find the object with the largest average dissimilarity to all other objects in A and move that object to a new cluster B.
3. Find the object in cluster A whose average dissimilarity to the other objects in cluster A minus its average dissimilarity to the objects in cluster B is maximum. If this difference is positive, move that object to cluster B.
4. Repeat step 3 until no objects satisfying step 3 are found.
5. Apply steps 1 through 4 to one of the existing clusters (e.g., the one with the largest average within-cluster dissimilarity) until n clusters of 1 object each are obtained.

[Figures: the Macnaughton-Smith procedure applied to the example data, with objects moving one at a time from cluster A to the new cluster B; each resulting cluster is then split in the same way until each object is in a cluster by itself.]

Dendrogram for the Macnaughton-Smith Approach

[Figure: dendrogram for the divisive hierarchy.]

Agglomerative vs. Divisive Clustering

- Divisive clustering has not been studied as extensively as agglomerative clustering.
- Divisive clustering may be preferred if only a small number of large clusters is desired.
- Agglomerative clustering may be preferred if a large number of small clusters is desired.
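For reference, the diana() function in R's cluster package implements a divisive hierarchical clustering whose splitting step follows the Macnaughton-Smith idea. A minimal sketch with made-up data:

```r
# Divisive (top-down) hierarchical clustering via diana(); `x` is made-up
# placeholder data.
library(cluster)

set.seed(1)
x <- matrix(rnorm(30 * 2), nrow = 30, ncol = 2)

dv <- diana(x, metric = "euclidean")  # divisive analysis clustering
plot(dv, which.plots = 2)             # dendrogram of the divisive hierarchy
cutree(as.hclust(dv), k = 2)          # extract the k = 2 clustering
```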