How to do clustering
ssls sddd ▴ 260
Here is an example that shows one way of doing this:

  # Generate a sample matrix
  y <- matrix(rnorm(50), 10, 5, dimnames=list(paste("g", 1:10, sep=""), paste("t", 1:5, sep="")))

  # Transpose the matrix if necessary like this: y <- t(y)

  # Use the following step if you want to use Pearson correlations as distance method
  # instead of the default Euclidean distances.
  mydist <- as.dist(1-cor(t(y), method="pearson"))

  # PAM (partitioning around medoids) clustering, a more robust relative of k-means in R.
  # The basic k-means function is kmeans()
  library(cluster)
  pamy <- pam(mydist, k=3)
  pamy$clustering # provides the cluster assignments
  plot(pamy) # plots the results

  # Multidimensional scaling (MDS) to obtain 'meaningful' coordinates for a scatter plot
  loc <- cmdscale(mydist)

  # Generate a scatter plot for the MDS results where the PAM clusters are labeled by color
  mycol <- as.vector(pamy$clustering)
  mycol <- rainbow(length(unique(mycol)), start=0.1, end=0.9)[mycol] # color selection steps
  plot(loc[,1], loc[,2], pch=20, col=mycol, xlab="", ylab="", main="Scatter Plot")

  # Scatter plot with sample labels
  plot(loc[,1], loc[,2], type="n", xlab="", ylab="", main="Scatter Plot")
  text(loc[,1], loc[,2], col=mycol, rownames(loc), cex=0.8)

More detailed instructions on basic clustering methods in R can be found on this page:
http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/R_BioCondManual.html#R_clustering

Thomas

On Sun 06/10/07 02:55, ssls sddd wrote:
> Dear Bill,
>
> I am new to R so would you please elaborate further on how to
> extract the names of the snp's in each of the K clusters? In addition,
> is it possible for me to get the scatter plot of the clusters?
>
> Thanks a lot!
>
> Sincerely,
> Alex
>
> On 6/9/07, William Shannon <william.shannon at sbcglobal.net> wrote:
> >
> > It depends on your goal for the analysis.
> >
> > If you are wanting to find snp's whose log2(ratio's) are similar across
> > the samples then you are done with the analysis after k-means (though you
> > should read the literature on k-means for various ways to select the optimal
> > k). In this case you can extract the names of the snp's in each of the K
> > clusters directly from the kmeans object.
> >
> > If however you want to go one step further and see how these clusters
> > separate the samples then you could try what we did a long time ago in the
> > paper cited below (I can email you a copy on Monday if you can't access it).
> >
> > In this paper we took the k-means cluster centers and sorted them by
> > their log2(ratio) and looked to see how well they separated 2 (or maybe it
> > was 3) classes of skin samples.
> >
> > A. M. Bowcock, W. Shannon, F. Du, J. Duncan, K. Cao, K. Aftergut, J.
> > Catier, M. A. Fernandez-Vina, and A. Menter
> > Insights into psoriasis and other inflammatory diseases from large-scale
> > gene expression studies
> > Hum. Mol. Genet., August 1, 2001; 10(17): 1793 - 1805.
> >
> > Bill
> >
> > ssls sddd <ssls.sddd at gmail.com> wrote:
> >
> > Dear Bill,
> >
> > Thanks a lot for the suggestions. Yes, they are Affy SNP data.
> > I used the MantelCorr package. It worked well. Specifically, the commands
> > I ran are:
> >
> > library(MantelCorr)
> > kmeans.result <- GetClusters(x, 500, 100)
> > DistMatrices.result <- DistMatrices(x, kmeans.result$clusters)
> > MantelCorrs.result <- MantelCorrs(DistMatrices.result$Dfull,
> >     DistMatrices.result$Dsubsets)
> > permuted.pval <- PermutationTest(DistMatrices.result$Dfull,
> >     DistMatrices.result$Dsubsets, 100, 49, 0.05)
> > ClusterLists <- ClusterList(permuted.pval, kmeans.result$cluster.sizes,
> >     MantelCorrs.result)
> > ClusterGenes <- ClusterGeneList(kmeans.result$clusters,
> >     ClusterLists$SignificantClusters, data)
> >
> > Can you suggest how to view the result? Is there a way to visualize the
> > clusters?
> >
> > Thanks a lot!
> >
> > Sincerely,
> >
> > Alex
> >
> > On 6/7/07, William Shannon wrote:
> > >
> > > You may want to consider a k-means cluster. The pvclust appears to be a
> > > hierarchical clustering algorithm (with subsequent p value estimation)
> > > which is causing the problem.
> > >
> > > Hierarchical clustering uses a pairwise distance matrix to form the tree
> > > dendrogram. With N = 238804 this will require a matrix with N(N-1)/2 or
> > > about (238804^2)/2 elements. That's what causes the memory problem.
> > >
> > > K-means is not so intensive and will result in clustering the 238804 rows
> > > (I assume they are snp's) and each cluster will be represented by a mean
> > > vector for the 49 variables.
> > >
> > > If on the other hand you want to cluster the 49 columns you may need to
> > > transpose the data matrix and then run a hierarchical clustering, but I
> > > would look into kmeans first.
> > >
> > > Bill Shannon
> > > Washington Univ. School of Medicine
> > >
> > > ssls sddd wrote:
> > >
> > > Dear List,
> > >
> > > I have a question to bother you about how to do clustering.
> > > My data consists of 49 columns (49 variables) and 238804 rows.
> > > I would like to do hierarchical clustering (unsupervised clustering
> > > and PCA). So far I tried pvclust
> > > (www.is.titech.ac.jp/~shimo/prog/pvclust/)
> > > but R always fails with an error like "cannot allocate the memory".
> > >
> > > I am curious about what other packages can perform the clustering
> > > analysis while being memory efficient.
> > >
> > > Meanwhile, is there any way that I can extract the features of each
> > > cluster?
> > >
> > > In other words, I would like to identify which are responsible for
> > > classifying these variables (samples).
> > >
> > > Thanks a lot!
> > >
> > > Sincerely,
> > >
> > > Alex
> > >
> > > _______________________________________________
> > > Bioconductor mailing list
> > > Bioconductor at stat.math.ethz.ch
> > > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > > Search the archives:
> > > http://news.gmane.org/gmane.science.biology.informatics.conductor

--
Dr. Thomas Girke
Assistant Professor of Bioinformatics
Director, IIGB Bioinformatic Facility
Center for Plant Cell Biology (CEPCEB)
Institute for Integrative Genome Biology (IIGB)
Department of Botany and Plant Sciences
1008 Noel T. Keen Hall
University of California
Riverside, CA 92521

E-mail: thomas.girke at ucr.edu
Website: http://faculty.ucr.edu/~tgirke
Ph: 951-827-2469
Fax: 951-827-4437
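Bill's remark that the SNP names in each of the K clusters can be read directly from the kmeans object can be sketched roughly as follows. This is only an illustration on simulated data: the matrix `x`, its row names, and the choice of k = 3 are assumptions made for the example and are not taken from Alex's data.

```r
# Toy stand-in for the SNP matrix: 200 rows (SNPs) x 5 columns (samples).
# The data, the names and k = 3 are all made up for illustration.
set.seed(1)
x <- matrix(rnorm(200 * 5), nrow = 200, ncol = 5,
            dimnames = list(paste0("snp", 1:200), paste0("s", 1:5)))

# k-means on the rows; nstart > 1 reruns the algorithm from several random starts.
km <- kmeans(x, centers = 3, nstart = 10)

# km$cluster holds one cluster label per row of x, in row order,
# so splitting the row names by that label gives one name vector per cluster.
snps_by_cluster <- split(rownames(x), km$cluster)

lengths(snps_by_cluster)      # cluster sizes
head(snps_by_cluster[["1"]])  # first few SNP names assigned to cluster 1
km$centers                    # one mean vector (across samples) per cluster
```

The same `split()` idiom should work for `pamy$clustering` from the PAM example, since it is also a vector of cluster labels in row order.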
ssls sddd wrote:
> Dear Dr. Thomas Girke,
>
> Thank you very much for the info.
>
> I tried mydist <- as.dist(1-cor(t(y), method="pearson")) on my data but
> it failed. My 'y' consists of 238000 observations (rows) and 49 samples
> (columns) and R said:
>
> error in cor(t(x), method = "pearson") : allocMatrix: too many elements
>
> Do you think I can make this work out in another way?

For folks relatively new to R and Bioconductor, it is worthwhile to read the help pages for ALL new commands used. In this case, the help page for cor() states that it will compute the correlation between all COLUMNS of the matrix, if given a matrix. You have a matrix with 49 columns and 238,000 rows. If you were to run cor() on that matrix, it would produce a matrix of size 49x49 containing all pairwise correlations between samples. However, in this case, a transpose is applied first, so R is going to try to compute a 238,000x238,000 matrix of correlation coefficients. I'm assuming that is not what you want and that you really want to cluster the samples. Drop the t() and all will be well.

Sean
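Sean's point about cor() working on columns can be checked on a small simulated matrix. The sketch below assumes a made-up 1000 x 6 matrix `y` (rows as features, columns as samples); the hclust linkage and the cut into two groups are arbitrary choices for illustration.

```r
# Small simulated matrix: 1000 features (rows) x 6 samples (columns).
set.seed(1)
y <- matrix(rnorm(1000 * 6), nrow = 1000, ncol = 6,
            dimnames = list(paste0("g", 1:1000), paste0("t", 1:6)))

# cor() correlates COLUMNS, so the result is samples x samples.
dim(cor(y))

# With a transpose the result is features x features; fine for 1000 rows,
# but 238000 rows would need 238000^2 elements, more than a single R
# matrix could hold at the time (limit of 2^31 - 1 elements).
dim(cor(t(y)))
238000^2 > .Machine$integer.max

# Correlation-based distance between samples, then hierarchical clustering.
sample_dist <- as.dist(1 - cor(y, method = "pearson"))
hc <- hclust(sample_dist, method = "average")
groups <- cutree(hc, k = 2)   # arbitrary cut into two sample groups
groups
```

Calling `plot(hc)` would draw the sample dendrogram.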
ssls sddd wrote:
> Dear Dr. Thomas Girke,
>
> I have one more question for you. I tried pvclust in the section
> 'Obtain significant clusters by pvclust bootstrap analysis' for my data, x.
>
> But I have a problem with:
>
> heatmap(x, Rowv=dend_colored, Colv=as.dendrogram(hc), col=my.colorFct(),
>     scale="row", RowSideColors=mycolhc)
>
> the error was:
>
> error in heatmap(x, Rowv = dend_colored, Colv = as.dendrogram(hc), col =
> my.colorFct(), :
>     'x' must be a numeric matrix
>
> I ran 'x[1:3,1:3]' and it produced the following:
>
>               AIRNS_A09 AIRNS_A11 AIRNS_A12
> SNP_A-1780271   1.85642   1.50956   1.73154
> SNP_A-1780274   1.72140   1.83712   1.85948
> SNP_A-1780277   2.04241   1.53458   1.65270
>
> I think x is a numeric matrix. Where do you think I went wrong?

Try coercing the x into a matrix directly:

  heatmap(as.matrix(x), Rowv=dend_colored, Colv=as.dendrogram(hc),
      col=my.colorFct(), scale="row", RowSideColors=mycolhc)

Does this fix the problem? You can always check the class of an object by doing something like:

  class(x)

which should report:

  [1] "matrix"

Hope that helps.

Sean
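A data frame prints just like a matrix, which is probably why the x[1:3,1:3] output looked fine. The sketch below rebuilds those nine values as a data frame (the object names `x_df` and `x_mat` are made up for the example) and shows the check and the coercion. Note that in R 4.0.0 and later, class() on a matrix reports both "matrix" and "array", so is.matrix() is the more portable test.

```r
# The nine values printed in the question, stored as a data frame.
x_df <- data.frame(
  AIRNS_A09 = c(1.85642, 1.72140, 2.04241),
  AIRNS_A11 = c(1.50956, 1.83712, 1.53458),
  AIRNS_A12 = c(1.73154, 1.85948, 1.65270),
  row.names = c("SNP_A-1780271", "SNP_A-1780274", "SNP_A-1780277")
)

class(x_df)       # "data.frame": prints like a matrix but is not one
is.matrix(x_df)   # FALSE, so heatmap() rejects it

# Coerce explicitly; row and column names are carried over.
x_mat <- as.matrix(x_df)
is.matrix(x_mat)
is.numeric(x_mat) # stays TRUE only if every column of the data frame is numeric
```

If any column of the data frame were character or factor, as.matrix() would return a character matrix, and heatmap() would still complain.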
Dear Alex,

In addition to Sean's advice, I would like to point out that the sample you are giving below indicates that you are trying to pass on to the heatmap function a column dendrogram plus a row dendrogram. With your matrix of 238,000 rows by 49 columns you should have only a column dendrogram, because the row dendrogram would take more than 200 GB of memory to calculate. You can still use the heatmap or heatmap.2 functions by turning off the row sorting by setting the Rowv argument to NA.

In addition to this, I would consider filtering your rows in a meaningful manner to a much smaller number, perhaps by using R's IQR function to remove all rows with very low variability. I am suggesting this because you won't see any patterns in the heatmap when you have so many rows. If the row filtering works then you could generate a dendrogram for the row dimension as well. Remember: hclust will require ~4 GB of memory to cluster ~30,000 items and < 1 GB for 10,000 items, and pvclust, which uses hclust internally, will need considerably more than this.

As more general advice, when working with large data sets in R always subset your data to something very small to test out your strategy first, because this will save you a lot of time. In your case, this could be done by selecting just the first 100 rows of your matrix like this:

  my_matrix <- my_matrix[1:100, ]

Once you have tested things out then just remove the '[1:100,]' part in your script/protocol.

Best,

Thomas
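The memory figures quoted above follow from the size of a pairwise distance matrix, which stores n(n-1)/2 values. A rough back-of-the-envelope check, assuming 8 bytes per stored distance and ignoring any working copies that hclust or pvclust make (the helper name `dist_bytes` is made up for this sketch):

```r
# Bytes needed for the lower triangle of an n x n distance matrix,
# assuming each distance is stored as an 8-byte double.
dist_bytes <- function(n, bytes_per_value = 8) {
  n * (n - 1) / 2 * bytes_per_value
}

n_items <- c(10000, 30000, 238000)
gb <- dist_bytes(n_items) / 1e9

# Roughly 0.4 GB, 3.6 GB and 227 GB respectively.
data.frame(items = n_items, approx_GB = round(gb, 1))
```

These are lower bounds for the distance object alone, which is consistent with "< 1 GB for 10,000 items", "~4 GB for ~30,000 items" and "more than 200 GB" for all 238,000 rows.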
Alex,

  > library(Biobase)
  [snip]
  > args(rowQ)
  function (imat, which)
  NULL
  > showMethods("rowQ")
  Function: rowQ (package Biobase)
  imat="ExpressionSet", which="numeric"
  imat="exprSet", which="numeric"
  imat="matrix", which="numeric"

so it looks like x should be a matrix rather than a data frame.

Martin

"ssls sddd" <ssls.sddd at gmail.com> writes:

> Hi Thomas,
>
> Thanks! Sorry for getting back to it late because I was out
> of town for a couple of days.
>
> I like the idea of 'removing all rows with low variability across
> samples'. I searched around and found an online tutorial,
> http://www.economia.unimi.it/projects/marray/2006/material/Lab3/MachineLearning/ML-lab.pdf
> which is doing a very similar thing: it teaches how to filter out
> non-differentially expressed genes.
>
> It takes the simplistic approach of using the 75th percentile of the
> interquartile range (IQR) as the cut-off point and computes quantiles
> using rowQ.
>
> I followed their method and my code is:
>
> library("Biobase")
> lowQ = rowQ(x, floor(0.25 * 49)) # 49 for 49 samples
> upQ = rowQ(x, ceiling(0.75 * 49))
> iqrs = upQ - lowQ
> giqr = iqrs > quantile(iqrs, probs = 0.75)
> sum(giqr)
> xsub = x[giqr, ]
> dim(xsub)
>
> But the error message is like:
>
> function (classes, fdef, mtable) :
>   unable to find an inherited method for function "rowQ", for
>   signature "data.frame", "numeric"
>
> Perhaps you have any experience in using 'rowQ'? If I want to use the IQR
> function, how should I approach this?
>
> I really appreciate your help!
>
> Thank you very much!
>
> Sincerely,
>
> Alex

--
Martin Morgan
Bioconductor / Computational Biology
http://bioconductor.org
Alex,

I guess Martin answered your question.

A similar result, but with slower computation, can be obtained by applying the IQR function like this:

  apply(iris[,1:3], 1, IQR)

Thomas
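Putting the pieces of this exchange together, the row filter Alex was after can be sketched in base R alone. The simulated 1000 x 49 matrix and the 75th-percentile cut-off below are assumptions made for illustration (the cut-off mirrors the tutorial Alex cites), not a recommendation for real SNP data.

```r
# Simulated stand-in: 1000 rows (SNPs) x 49 columns (samples).
set.seed(1)
x <- matrix(rnorm(1000 * 49), nrow = 1000, ncol = 49,
            dimnames = list(paste0("snp", 1:1000), paste0("s", 1:49)))

# If x came from read.table() it would be a data frame; coerce it first.
x <- as.matrix(x)

# Interquartile range of every row (slower than rowQ, but base R only).
iqrs <- apply(x, 1, IQR)

# Keep the rows whose IQR lies above the 75th percentile of all row IQRs.
keep <- iqrs > quantile(iqrs, probs = 0.75)
xsub <- x[keep, , drop = FALSE]

sum(keep)   # number of rows retained
dim(xsub)   # retained rows x 49 samples
```

The reduced matrix could then go to `heatmap(xsub, Rowv = NA, scale = "row")`, as suggested earlier in the thread, or to hclust if the number of retained rows is small enough.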
Alex, If you post a message on a new topic to this list, then please start a new email thread instead of replying to an old thread that deals with a different topic. The answer to your question is that the method for accessing the wilcoxon p-values from mas5calls has been changed with the latest BioConductor release 2.0 from se.exprs(eset_pma) to assayDataElement(eset_pma, "se.exprs") In the provided example you would type the following: my_frame <- data.frame( exprs(eset_rma), exprs(eset_pma), assayDataElement(eset_pma, "se.exprs")) I have updated this change now in the exercise code you are referring to. Best, Thomas On Tue 06/26/07 21:58, ssls sddd wrote: > Hi Thomas, > > I have another question and need your help. I followed the link > http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/R_BioCondManual.h tml#biocon_limmaaffy > and tried the code presented in the session of 'BioConductor Exercises'. > > I first downloaded 'workshop.zip' file and unpack the files to my computer. > I also tired > six files of my Affy arrays but found the code would not work with *.CEL > files. I manually > changed .CEL to .cel and I can play with the code well. > > The problem is that when I ran the code: > > *my_frame <- data.frame(exprs(eset_rma), exprs(eset_pma), se.exprs > (eset_pma))* # Combine RMA intensities, P/M/A calls plus their wilcoxon > p-values in one data frame. > > The error message popped up as: > > >my_frame <- data.frame(exprs(eset_rma), exprs(eset_pma), > >se.exprs(eset_pma)) > > error in function (classes, fdef, mtable) : > unable to find an inherited method for function "se.exprs", for > signature "ExpressionSet" > > > > This also happened for the files from 'workshop.zip'. Can you suggest me how > to > correct this? > > > Thanks a lot! > > Sincerely, > > Alex > > > On 6/19/07, Thomas Girke <thomas.girke at="" ucr.edu=""> wrote: > > > >Alex, > > > >I guess Martin answered your question. 
> > > >A similar result, but with slower computation, can obtained by applying > >the IQR function like this: > > > > apply(iris[,1:3], 1, IQR) > > > >Thomas > > > >On Tue 06/19/07 21:10, Martin Morgan wrote: > >> Alex, > >> > >> > library(Biobase) > >> [snip] > >> > args(rowQ) > >> function (imat, which) > >> NULL > >> > showMethods("rowQ") > >> Function: rowQ (package Biobase) > >> imat="ExpressionSet", which="numeric" > >> imat="exprSet", which="numeric" > >> imat="matrix", which="numeric" > >> > >> so it looks like x should be a matrix rather than a data frame. > >> > >> Martin > >> > >> "ssls sddd" <ssls.sddd at="" gmail.com=""> writes: > >> > >> > Hi Thomas, > >> > > >> > Thanks! Sorry for getting back to it late because I was out > >> > of town for a couple of days. > >> > > >> > I like the idea of 'removing all rows with low variability across > >> > samples'. I searched around and found an online tutorial > >> > > >http://www.economia.unimi.it/projects/marray/2006/material/Lab3/Mac hineLearning/ML-lab.pdfis > >> > doing very similar thing which teaches how to filter some > >> > undifferentially > >> > expressed genes. > >> > > >> > It takes the simplistic approach of using the 75th percentile of the > >> > interquartile range > >> > (IQR) as the cut-off point and computes quantiles using rowQ. > >> > > >> > I followed their method and my code is: > >> > > >> > library("Biobase") > >> > lowQ = rowQ(x, floor(0.25 * 49))#49 for 49 samples > >> > upQ = rowQ(x, ceiling(0.75 * 49)) > >> > iqrs = upQ - lowQ > >> > giqr = iqrs > quantile(iqrs, probs = 0.75) > >> > sum(giqr) > >> > xsub = x[giqr, ] > >> > dim(xsub) > >> > > >> > But the error message is like: > >> > > >> > function (classes, fdef, mtable) : > >> > unable to find an inherited method for function "rowQ", for > >> > signature "data.frame", "numeric" > >> > > >> > Perhaps you can any experience in using 'rowQ'? If I want to use IQR > >> > function, how should I approach this? 
> >> >
> >> > I really appreciate your help!
> >> >
> >> > Thank you very much!
> >> >
> >> > Sincerely,
> >> >
> >> > Alex
> >> >
> >> > On 6/13/07, Thomas Girke <thomas.girke at ucr.edu> wrote:
> >> >>
> >> >> Dear Alex,
> >> >>
> >> >> In addition to Sean's advice, I would like to point out that the
> >> >> sample you are giving below indicates that you are trying to pass on
> >> >> to the heatmap function a column dendrogram plus a row dendrogram.
> >> >> With your matrix of 238,000 rows by 49 columns you should have only a
> >> >> column dendrogram, because the row dendrogram would take more than
> >> >> 200 GB of memory to calculate. You can still use the heatmap or
> >> >> heatmap.2 functions by turning off the row sorting, that is, by
> >> >> setting the Rowv argument to NA. In addition to this, I would consider
> >> >> filtering your rows in a meaningful manner down to a much smaller
> >> >> number, perhaps by using R's IQR function to remove all rows with very
> >> >> low variability. I am suggesting this because you won't see any
> >> >> patterns in the heatmap when you have so many rows. If the row
> >> >> filtering works, then you could generate a dendrogram for the row
> >> >> dimension as well. Remember: hclust will require ~4 GB of memory to
> >> >> cluster ~30,000 items and < 1 GB for 10,000 items, and pvclust, which
> >> >> uses hclust internally, will need much more than this.
> >> >>
> >> >> As more general advice, when working with large data sets in R always
> >> >> subset your data to something very small to test out your strategy
> >> >> first, because this will save you a lot of time.
> >> >> In your case, this could be done by selecting just the first 100 rows
> >> >> of your matrix like this:
> >> >> my_matrix <- my_matrix[1:100, ]
> >> >>
> >> >> Once you have tested things out, just remove the '[1:100,]' part
> >> >> from your script/protocol.
> >> >>
> >> >> Best,
> >> >>
> >> >> Thomas
> >> >>
> >> >> On Wed 06/13/07 06:02, Sean Davis wrote:
> >> >> > ssls sddd wrote:
> >> >> > > Dear Dr.Thomas Girke,
> >> >> > >
> >> >> > > I have one more question for you. I tried pvclust in the session of
> >> >> > > 'Obtain significant clusters by pvclust bootstrap analysis' for my data, x.
> >> >> > >
> >> >> > > But I have a problem with:
> >> >> > >
> >> >> > > heatmap(x, Rowv=dend_colored, Colv=as.dendrogram(hc), col=my.colorFct(),
> >> >> > > scale="row", RowSideColors=mycolhc)
> >> >> > >
> >> >> > > the error was:
> >> >> > >
> >> >> > > error in heatmap(x, Rowv = dend_colored, Colv = as.dendrogram(hc), col = my.colorFct(), :
> >> >> > > 'x' must be a numeric matrix
> >> >> > >
> >> >> > > I ran 'x[1:3,1:3]' and it produced the following:
> >> >> > >
> >> >> > >               AIRNS_A09 AIRNS_A11 AIRNS_A12
> >> >> > > SNP_A-1780271   1.85642   1.50956   1.73154
> >> >> > > SNP_A-1780274   1.72140   1.83712   1.85948
> >> >> > > SNP_A-1780277   2.04241   1.53458   1.65270
> >> >> > >
> >> >> > > I think the x is a numeric matrix. Do you think where I may get wrong?
> >> >> >
> >> >> > Try coercing the x into a matrix directly:
> >> >> >
> >> >> > heatmap(as.matrix(x), Rowv=dend_colored, Colv=as.dendrogram(hc),
> >> >> > col=my.colorFct(), scale="row", RowSideColors=mycolhc)
> >> >> >
> >> >> > Does this fix the problem? You can always check the class of an object
> >> >> > by doing something like:
> >> >> >
> >> >> > class(x)
> >> >> >
> >> >> > which should report:
> >> >> >
> >> >> > [1] "matrix"
> >> >> >
> >> >> > Hope that helps.
> >> >> >
> >> >> > Sean
> >> >>
> >> >> --
> >> >> Dr. Thomas Girke
> >> >> Assistant Professor of Bioinformatics
> >> >> Director, IIGB Bioinformatic Facility
> >> >> Center for Plant Cell Biology (CEPCEB)
> >> >> Institute for Integrative Genome Biology (IIGB)
> >> >> Department of Botany and Plant Sciences
> >> >> 1008 Noel T. Keen Hall
> >> >> University of California
> >> >> Riverside, CA 92521
> >> >>
> >> >> E-mail: thomas.girke at ucr.edu
> >> >> Website: http://faculty.ucr.edu/~tgirke <http://faculty.ucr.edu/%7Etgirke>
> >> >> Ph: 951-827-2469
> >> >> Fax: 951-827-4437
> >> >
> >> > [[alternative HTML version deleted]]
> >> >
> >> > _______________________________________________
> >> > Bioconductor mailing list
> >> > Bioconductor at stat.math.ethz.ch
> >> > https://stat.ethz.ch/mailman/listinfo/bioconductor
> >> > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> >>
> >> --
> >> Martin Morgan
> >> Bioconductor / Computational Biology
> >> http://bioconductor.org

--
Thomas Girke
Assistant Professor of Bioinformatics
Director, IIGB Bioinformatic Facility
Center for Plant Cell Biology (CEPCEB)
Institute for Integrative Genome Biology (IIGB)
Department of Botany and Plant Sciences
1008 Noel T. Keen Hall
University of California
Riverside, CA 92521

E-mail: thomas.girke at ucr.edu
Website: http://faculty.ucr.edu/~tgirke
Ph: 951-827-2469
Fax: 951-827-4437
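The row filtering and the Rowv=NA setting discussed in this message can be sketched in base R roughly as follows. This is only an illustration under stated assumptions: the data are simulated rather than the 238,000 x 49 SNP matrix from the thread, the object names (x, iqrs, keep, xsub) are hypothetical, and apply() with IQR is used in place of rowQ, as suggested above.

```r
# Simulated stand-in for the real data (assumption): 200 rows by 6 samples,
# held in a data frame, which is what triggered the errors in the thread.
set.seed(1)
x <- as.data.frame(matrix(rnorm(200 * 6), nrow = 200, ncol = 6,
                          dimnames = list(paste("row", 1:200, sep = ""),
                                          paste("s", 1:6, sep = ""))))

# heatmap() and rowQ() expect a numeric matrix, so coerce the data frame first.
x <- as.matrix(x)
stopifnot(is.matrix(x), is.numeric(x))

# Interquartile range of every row, then keep only the rows whose IQR lies
# above the 75th percentile of all row IQRs.
iqrs <- apply(x, 1, IQR)
keep <- iqrs > quantile(iqrs, probs = 0.75)
xsub <- x[keep, , drop = FALSE]
dim(xsub)

# Heat map with row clustering switched off (Rowv = NA), so no row
# dendrogram is computed; the columns are still clustered.
heatmap(xsub, Rowv = NA, scale = "row")
```

On real data the same steps apply after replacing the simulated x; trying them first on a small subset such as x[1:100, ] follows the advice given in the message.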
Seth Falcon ★ 7.4k
@seth-falcon-992
Last seen 9.6 years ago
"ssls sddd" <ssls.sddd at gmail.com> writes:
> Thanks Martin. Converting my data frame to matrix soon solved
> the problem.

Also, for non-specific filtering see the nsFilter function in the Category package (latest release only) which provides an easy to use interface for variance-based filtering.

+ seth

--
Seth Falcon | Computational Biology | Fred Hutchinson Cancer Research Center
http://bioconductor.org
Seth Falcon ★ 7.4k
@seth-falcon-992
Last seen 9.6 years ago
Seth Falcon <sfalcon at fhcrc.org> writes:
> "ssls sddd" <ssls.sddd at gmail.com> writes:
>
>> Thanks Martin. Converting my data frame to matrix soon solved
>> the problem.
>
> Also, for non-specific filtering see the nsFilter function in the
> Category package (latest release only) which provides an easy to use
> interface for variance-based filtering.

Sorry, nsFilter is actually in the genefilter package.

--
Seth Falcon | Computational Biology | Fred Hutchinson Cancer Research Center
http://bioconductor.org
Yolande Tra ▴ 160
@yolande-tra-1821
Last seen 9.6 years ago
An embedded and charset-unspecified text was scrubbed... Name: not available Url: https://stat.ethz.ch/pipermail/bioconductor/attachments/20070620/ 1b6fd6b2/attachment.pl
Seth Falcon ★ 7.4k
@seth-falcon-992
Last seen 9.6 years ago
"ssls sddd" <ssls.sddd at gmail.com> writes:
> Thanks Seth! I got a chance to check with nsFilter in genefilter package but
> there is some problem with my dataset.
>
> My codes are:
>
> library(genefilter)
>
> ans <- nsFilter(as.matrix(x), require.entrez = FALSE, require.symbol = FALSE,
> require.GOBP = FALSE, require.GOCC = FALSE, require.GOMF = FALSE,
> remove.dupEntrez = FALSE, var.func = IQR, var.cutoff = 0.75, var.filter = TRUE)
>
> ans$eset
> ans$filter.log
>
> and the error message is :
>
> error in function (classes, fdef, mtable) :
> unable to find an inherited method for function "nsFilter", for
> signature "matrix"

Yes, nsFilter only supports ExpressionSet objects -- this is clearly stated in the man page for the function.

> And I did the following:
>
>> library(genefilter)
>> args(nsFilter)
> function (eset, require.entrez = TRUE, require.symbol = TRUE,
> require.GOBP = FALSE, require.GOCC = FALSE, require.GOMF = FALSE,
> remove.dupEntrez = TRUE, var.func = IQR, var.cutoff = 0.5,
> var.filter = TRUE)
> NULL
>> showMethods("nsFilter")
> Function: nsFilter (package genefilter)
> eset="ExpressionSet"
>
> Perhaps my matrix x is not compatible with "ExpressionSet"? Any suggestions
> on this?

A matrix object in R is certainly not equivalent to an ExpressionSet object. I would recommend reading over the first vignette in the Biobase package: _An introduction to Biobase and ExpressionSets_. You can get to it by doing:

    library("Biobase")
    openVignette(package="Biobase") ## choose the first item

+ seth

--
Seth Falcon | Computational Biology | Fred Hutchinson Cancer Research Center
http://bioconductor.org
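As a minimal sketch of the distinction drawn in this reply, a numeric matrix can be wrapped in an ExpressionSet with the constructor from Biobase before it is handed to functions that only accept that class. The data below are simulated and the names x and eset are hypothetical. Whether nsFilter then runs on such an object without annotation information is not covered in the thread, so that step is deliberately left out.

```r
library(Biobase)

# Simulated stand-in (assumption): 10 features by 3 samples in a data frame.
set.seed(1)
x <- as.data.frame(matrix(rnorm(30), nrow = 10, ncol = 3,
                          dimnames = list(paste("SNP", 1:10, sep = "_"),
                                          paste("s", 1:3, sep = ""))))

# A data frame or matrix is not an ExpressionSet; wrap the numeric matrix
# explicitly. Rows become features and columns become samples.
eset <- ExpressionSet(assayData = as.matrix(x))

class(eset)        # an ExpressionSet object
dim(exprs(eset))   # the matrix can be recovered with exprs()
```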