How to do clustering
ssls sddd ▴ 260
@ssls-sddd-2202
Last seen 6.6 years ago
An embedded and charset-unspecified text was scrubbed... Name: not available Url: https://stat.ethz.ch/pipermail/bioconductor/attachments/20070607/ 677da71d/attachment.pl
@william-shannon-1787
Last seen 6.6 years ago
An embedded and charset-unspecified text was scrubbed... Name: not available Url: https://stat.ethz.ch/pipermail/bioconductor/attachments/20070607/ aa595cd1/attachment.pl
Here is an example that shows one way of doing this:

# Generate a sample matrix
y <- matrix(rnorm(50), 10, 5, dimnames=list(paste("g", 1:10, sep=""), paste("t", 1:5, sep="")))

# Transpose the matrix if necessary like this:
y <- t(y)

# Use the following step if you want to use Pearson correlations as distance method
# instead of the default Euclidean distances.
mydist <- as.dist(1-cor(t(y), method="pearson"))

# PAM clustering, which is a more robust relative of k-means (it partitions around medoids).
# The basic k-means function is kmeans()
library(cluster)
pamy <- pam(mydist, k=3)
pamy$clustering # provides the cluster assignments
plot(pamy) # plots the results

# MDS to obtain 'meaningful' coordinates for a scatter plot
loc <- cmdscale(mydist)

# Generate a scatter plot for the MDS results where the PAM clusters are labeled by color
mycol <- as.vector(pamy$clustering)
mycol <- rainbow(length(unique(mycol)), start=0.1, end=0.9)[mycol] # color selection steps
plot(loc[,1], loc[,2], pch=20, col=mycol, xlab="", ylab="", main="Scatter Plot")

# Scatter plot with sample labels
plot(loc[,1], loc[,2], type="n", xlab="", ylab="", main="Scatter Plot")
text(loc[,1], loc[,2], col=mycol, rownames(loc), cex=0.8)

More detailed instructions on basic clustering methods in R can be found on this page:
http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/R_BioCondManual.html#R_clustering

Thomas

On Sun 06/10/07 02:55, ssls sddd wrote:
> Dear Bill,
>
> I am new to R so would you please elaborate further on how to
> extract the names of the snp's in each of the K clusters? In addition,
> is it possible for me to get the scatter plot of the clusters?
>
> Thanks a lot!
>
> Sincerely,
> Alex
>
> On 6/9/07, William Shannon <william.shannon at sbcglobal.net> wrote:
> >
> > It depends on your goal for the analysis.
> >
> > If you want to find snp's whose log2(ratio's) are similar across
> > the samples then you are done with the analysis after k-means (though you
> > should read the literature on k-means for various ways to select the optimal
> > k). In this case you can extract the names of the snp's in each of the K
> > clusters directly from the kmeans object.
> >
> > If however you want to go one step further and see how these clusters
> > separate the samples then you could try what we did a long time ago in the
> > paper cited below (I can email you a copy on Monday if you can't access it).
> >
> > In this paper we took the k-means cluster centers and sorted them by
> > their log2(ratio) and looked to see how well they separated 2 (or maybe it
> > was 3) classes of skin samples.
> >
> > A. M. Bowcock, W. Shannon, F. Du, J. Duncan, K. Cao, K. Aftergut, J.
> > Catier, M. A. Fernandez-Vina, and A. Menter
> > *Insights into psoriasis and other inflammatory diseases from large-scale
> > gene expression studies*
> > Hum. Mol. Genet., August 1, 2001; 10(17): 1793 - 1805.
> >
> > Bill
> >
> > *ssls sddd <ssls.sddd at gmail.com>* wrote:
> >
> > Dear Bill,
> >
> > Thanks a lot for the suggestions. Yes, they are Affy SNP data.
> > I used the MantelCorr Package. It worked well. Specifically, the commands
> > I ran are:
> >
> > library(MantelCorr)
> > kmeans.result <- GetClusters(x, 500, 100)
> > DistMatrices.result <- DistMatrices(x, kmeans.result$clusters)
> > MantelCorrs.result <- MantelCorrs(DistMatrices.result$Dfull,
> >     DistMatrices.result$Dsubsets)
> > permuted.pval <- PermutationTest(DistMatrices.result$Dfull,
> >     DistMatrices.result$Dsubsets, 100, 49, 0.05)
> > ClusterLists <- ClusterList(permuted.pval, kmeans.result$cluster.sizes,
> >     MantelCorrs.result)
> > ClusterGenes <- ClusterGeneList(kmeans.result$clusters,
> >     ClusterLists$SignificantClusters, data)
> >
> > Can you suggest how I can view the result? Is there a way to visualize the
> > clusters?
> >
> > Thanks a lot!
> >
> > Sincerely,
> >
> > Alex
> >
> > On 6/7/07, William Shannon wrote:
> > >
> > > You may want to consider a k-means cluster. The pvclust appears to be a
> > > hierarchical clustering algorithm (with subsequent p value estimation)
> > > which is causing the problem.
> > >
> > > Hierarchical clustering uses a pairwise distance matrix to form the tree
> > > dendrogram. With N = 238804 this will require a matrix with N(N-1)/2 or
> > > about (238804^2)/2 elements. That's what causes the memory problem.
> > >
> > > K-means is not so intensive and will result in clustering the 238804 rows
> > > (I assume they are snp's) and each cluster will be represented by a mean
> > > vector for the 49 variables.
> > >
> > > If on the other hand you want to cluster the 49 columns you may need to
> > > transpose the data matrix and then run a hierarchical clustering, but I
> > > would look into kmeans first.
> > >
> > > Bill Shannon
> > > Washington Univ. School of Medicine
> > >
> > > *ssls sddd* wrote:
> > >
> > > Dear List,
> > >
> > > I have a question to bother you about how to do clustering.
> > > My data consists of 49 columns (49 variables) and 238804 rows.
> > > I would like to do hierarchical clustering (unsupervised clustering
> > > and PCA). So far I tried pvclust (www.is.titech.ac.jp/~shimo/prog/pvclust/)
> > > but I always had a problem with R like "cannot allocate the memory".
> > >
> > > I am curious about what other packages can perform the clustering
> > > analysis while being memory efficient.
> > >
> > > Meanwhile, is there any way that I can extract the features of each
> > > cluster?
> > >
> > > In other words, I would like to identify which are responsible for
> > > classifying these variables (samples).
> > >
> > > Thanks a lot!
> > >
> > > Sincerely,
> > >
> > > Alex
> > >
> > > _______________________________________________
> > > Bioconductor mailing list
> > > Bioconductor at stat.math.ethz.ch
> > > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > > Search the archives:
> > > http://news.gmane.org/gmane.science.biology.informatics.conductor

--
Dr. Thomas Girke
Assistant Professor of Bioinformatics
Director, IIGB Bioinformatic Facility
Center for Plant Cell Biology (CEPCEB)
Institute for Integrative Genome Biology (IIGB)
Department of Botany and Plant Sciences
1008 Noel T. Keen Hall
University of California
Riverside, CA 92521
E-mail: thomas.girke at ucr.edu
Website: http://faculty.ucr.edu/~tgirke
Ph: 951-827-2469
Fax: 951-827-4437
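Bill's two points in the quoted messages (cluster membership can be read straight off the kmeans object, and a full pairwise distance matrix is what exhausts memory) can be sketched as follows. This is a minimal illustration on made-up data, not the poster's actual analysis: the toy matrix, the row names "snp1" to "snp200", the choice of k = 4, and the helper n_pairs() are all assumptions for the example.

```r
# Toy stand-in for the SNP matrix: 200 rows (features) x 5 columns (samples).
set.seed(1)
x <- matrix(rnorm(200 * 5), nrow = 200, ncol = 5,
            dimnames = list(paste0("snp", 1:200), paste0("s", 1:5)))

# k-means on the rows; k = 4 is arbitrary here.
km <- kmeans(x, centers = 4, nstart = 5)

# km$cluster holds one cluster id per row of x, in row order, so splitting
# the row names by that vector lists the members of each cluster.
members <- split(rownames(x), km$cluster)
sizes <- lengths(members)

# Hypothetical helper: number of pairwise distances needed for n rows.
n_pairs <- function(n) n * (n - 1) / 2

# Rough memory for the thread's N = 238804, at 8 bytes per double.
gib_needed <- n_pairs(238804) * 8 / 2^30
```

Here members[["1"]] gives the row names assigned to cluster 1, and gib_needed comes out on the order of 200 GiB, which is why hierarchical clustering of all rows fails while k-means, which never forms that matrix, does not.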
ssls sddd wrote:
> Dear Dr.Thomas Girke,
>
> Thank you very much for the info.
>
> I tried mydist <- as.dist(1-cor(t(y), method="pearson")) on my data but
> it failed. My 'y' consists of 238000 observations (rows) and 49 samples
> (columns) and R said:
>
> error in cor(t(x), method = "pearson") : allocMatrix: too many elements
>
> Do you think I can make this work out in another way?

For folks relatively new to R and Bioconductor, it is worthwhile to read the help pages for ALL new commands used. In this case, the help page for cor() states that it will compute the correlation between all COLUMNS of the matrix, if given a matrix. You have a matrix with 49 columns and 238,000 rows. If you were to run cor() on that matrix, it would produce a matrix of size 49x49 containing all pairwise correlations between samples. However, in this case, a transpose is applied first, so R is going to try to compute a 238,000x238,000 matrix of correlation coefficients. I'm assuming that is not what you want and that you really want to cluster the samples. Drop the t() and all will be well.

Sean
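The column-wise behaviour of cor() that Sean describes is easy to check on a small matrix. The sketch below uses made-up data with 1000 rows and 6 columns as a stand-in for the 238,000 x 49 matrix; the object names and the choice of average linkage are assumptions for illustration only.

```r
# Toy data: rows are features, columns are samples.
set.seed(1)
y <- matrix(rnorm(1000 * 6), nrow = 1000, ncol = 6,
            dimnames = list(paste0("g", 1:1000), paste0("s", 1:6)))

# cor() correlates COLUMNS, so this is a 6 x 6 sample-by-sample matrix.
sample_cor <- cor(y, method = "pearson")

# Correlation-based distance between samples, then hierarchical clustering.
sample_dist <- as.dist(1 - sample_cor)
hc <- hclust(sample_dist, method = "average")

# With t(y) the result would instead have nrow(y)^2 cells.
row_cor_cells <- nrow(y)^2
```

Calling plot(hc) afterwards draws the sample dendrogram. For the real data the same call without t() yields a 49 x 49 matrix, which fits in memory easily.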
ssls sddd wrote:
> Dear Dr.Thomas Girke,
>
> I have one more question for you. I tried pvclust in the session of
> 'Obtain significant clusters by pvclust bootstrap analysis' for my data, x.
>
> But I have a problem with:
>
> heatmap(x, Rowv=dend_colored, Colv=as.dendrogram(hc), col=my.colorFct(),
>     scale="row", RowSideColors=mycolhc)
>
> the error was:
>
> error in heatmap(x, Rowv = dend_colored, Colv = as.dendrogram(hc), col =
> my.colorFct(), :
>     'x' must be a numeric matrix
>
> I ran 'x[1:3,1:3]' and it produced the following:
>
>               AIRNS_A09 AIRNS_A11 AIRNS_A12
> SNP_A-1780271   1.85642   1.50956   1.73154
> SNP_A-1780274   1.72140   1.83712   1.85948
> SNP_A-1780277   2.04241   1.53458   1.65270
>
> I think the x is a numeric matrix. Do you think where I may get wrong?

Try coercing the x into a matrix directly:

heatmap(as.matrix(x), Rowv=dend_colored, Colv=as.dendrogram(hc), col=my.colorFct(), scale="row", RowSideColors=mycolhc)

Does this fix the problem? You can always check the class of an object by doing something like:

class(x)

which should report:

[1] "matrix"

Hope that helps.

Sean
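A data frame prints just like a matrix, which is why the x[1:3,1:3] output above looks fine even though heatmap() rejects it. The sketch below rebuilds those three rows as a data frame (that x was a data frame is an assumption, though a later reply in the thread suggests it) and shows the check and the coercion. Note that in R 4.0.0 and later class() on a matrix returns both "matrix" and "array", so is.matrix() is the more robust test.

```r
# The three rows shown in the post, stored as a data frame.
x_df <- data.frame(AIRNS_A09 = c(1.85642, 1.72140, 2.04241),
                   AIRNS_A11 = c(1.50956, 1.83712, 1.53458),
                   AIRNS_A12 = c(1.73154, 1.85948, 1.65270),
                   row.names = c("SNP_A-1780271", "SNP_A-1780274",
                                 "SNP_A-1780277"))

# Prints like a matrix but is not one.
df_is_matrix <- is.matrix(x_df)

# Coerce; all columns are numeric, so the result is a numeric matrix
# that keeps the row and column names.
x_mat <- as.matrix(x_df)
mat_ok <- is.matrix(x_mat) && is.numeric(x_mat)
```

If any column of the data frame held text, as.matrix() would return a character matrix instead, and heatmap() would still fail, so checking is.numeric() after coercion is worthwhile.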
Naomi Altman ★ 6.0k
@naomi-altman-380
Last seen 4 days ago
United States
An embedded and charset-unspecified text was scrubbed... Name: not available Url: https://stat.ethz.ch/pipermail/bioconductor/attachments/20070610/ d5f4f353/attachment.pl
@martin-morgan-1513
Last seen 9 hours ago
United States
Seth Falcon ★ 7.4k
@seth-falcon-992
Last seen 6.6 years ago
"ssls sddd" <ssls.sddd at="" gmail.com=""> writes: > Thanks Martin. Converting my data frame to matrix soon solved > the problem. Also, for non-specific filtering see the nsFilter function in the Category package (latest release only) which provides an easy to use interface for variance-based filtering. + seth -- Seth Falcon | Computational Biology | Fred Hutchinson Cancer Research Center http://bioconductor.org
Seth Falcon ★ 7.4k
@seth-falcon-992
Last seen 6.6 years ago
Seth Falcon <sfalcon at fhcrc.org> writes:
> "ssls sddd" <ssls.sddd at gmail.com> writes:
>
>> Thanks Martin. Converting my data frame to matrix soon solved
>> the problem.
>
> Also, for non-specific filtering see the nsFilter function in the
> Category package (latest release only) which provides an easy to use
> interface for variance-based filtering.

Sorry, nsFilter is actually in the genefilter package.

--
Seth Falcon | Computational Biology | Fred Hutchinson Cancer Research Center
http://bioconductor.org
Yolande Tra ▴ 160
@yolande-tra-1821
Last seen 6.6 years ago
An embedded and charset-unspecified text was scrubbed... Name: not available Url: https://stat.ethz.ch/pipermail/bioconductor/attachments/20070620/ 1b6fd6b2/attachment.pl
Seth Falcon ★ 7.4k
@seth-falcon-992
Last seen 6.6 years ago
"ssls sddd" <ssls.sddd at="" gmail.com=""> writes: > Thanks Seth! I got a chance to check with nsFilter in genefilter package but > there is some problem with my dataset. > > My codes are: > > library(genefilter) > > ans <- nsFilter(as.matrix(x), require.entrez = FALSE, require.symbol = > FALSE, require.GOBP = FALSE, require.GOCC = FALSE, require.GOMF = FALSE, > remove.dupEntrez = FALSE, var.func = IQR, var.cutoff = 0.75, var.filter = > TRUE) > > ans$eset > ans$filter.log > > and the error message is : > > error in function (classes, fdef, mtable) : > unable to find an inherited method for function "nsFilter", for > signature "matrix" Yes, nsFilter only supports ExpressionSet objects -- this is clearly stated in the man page for the function. > And I did the following: > >> library(genefilter) >> args(nsFilter) > function (eset, require.entrez = TRUE, require.symbol = TRUE, > require.GOBP = FALSE, require.GOCC = FALSE, require.GOMF = FALSE, > remove.dupEntrez = TRUE, var.func = IQR, var.cutoff = 0.5, > var.filter = TRUE) > NULL >> showMethods("nsFilter") > Function: nsFilter (package genefilter) > eset="ExpressionSet" > > Perhaps my matrix x is not compatible with "ExpressionSet"? Any suggestions > on this? A matrix object in R is certainly not equivalent to an ExpressionSet object. I would recommend reading over the first vignette in the Biobase package: _An introduction to Biobase and ExpressionSets_. You can get to it by doing: library("Biobase") openVignette(packge="Biobase") ## choose the first item + seth -- Seth Falcon | Computational Biology | Fred Hutchinson Cancer Research Center http://bioconductor.org