Question: Re: [S] Error in clustering procedure
0
14.8 years ago by
cstrato3.9k
Austria
cstrato3.9k wrote:
Sorry, but I cannot resist:

Any comments from the microarray community on the usefulness of hierarchical clustering of 7000 genes?

Best regards
Christian
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
C.h.r.i.s.t.i.a.n. .S.t.r.a.t.o.w.a
V.i.e.n.n.a. .A.u.s.t.r.i.a
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-

Prof Brian Ripley wrote:

> A distance matrix on 7000 objects alone takes up 187Mb. I don't know how
> your machine is set up re swap space, but you should use your task manager
> to monitor memory usage. Almost certainly you are running out of memory.
>
> However, I have never seen an agglomerative clustering of 7000 objects
> make sense scientifically (not that that stops the bioinformatics people).
> I think you need either to work in smaller subsets or to combine objects
> into clusters before starting.
>
> On Tue, 7 Sep 2004, Joao Baptista de O. e Souza Filho wrote:
>
>> I am working with SPLUS 2000 using Windows 2000 SP4, 512 MBytes RAM,
>> 3 GBytes of free space in HD.
>>
>> When I try to do an agglomerative clustering upon a matrix of
>> dimensions 7000 x 5, the program, after some time spent in
>> calculations, returns the following error message:
>>
>> ====================================================================
>> Error in disv == -1: Unable to obtain requested dynamic memory (this
>> request is for 200194252 bytes, 0 bytes already in use)
>> ====================================================================
>>
>> First, I have used the command: "options(object.size=300e6)", since the
>> program presented the message:
>>
>> ====================================================================
>> Error in double(1 + (n * (n - 1))/2): Cannot allocate 200194208 bytes:
>> options("object.size") is 100000000: see options help file
>> ====================================================================
>>
>> Does someone know how should I proceed?
>>
>> Thanks in advance
>>
>> Joao Baptista Filho
>>
>> --------------------------------------------------------------------
>> This message was distributed by s-news@lists.biostat.wustl.edu. To
>> ...(s-news.. clipped)...
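The memory figures quoted above can be checked with a little arithmetic. The following is a minimal sketch in Python (not S-PLUS), assuming the distance object stores n(n-1)/2 values as 8-byte doubles, as the `double(1 + (n * (n - 1))/2)` in the error message suggests. The function names are hypothetical helpers made up for this illustration.

```python
import math

def condensed_dist_bytes(n, bytes_per_value=8):
    """Bytes needed to hold the n*(n-1)/2 pairwise distances as doubles."""
    return n * (n - 1) // 2 * bytes_per_value

def rows_from_request(nbytes, bytes_per_value=8):
    """Invert double(1 + n*(n-1)/2): recover n from the requested byte count."""
    pairs = nbytes // bytes_per_value - 1          # drop the extra "1 +" element
    return (1 + math.isqrt(1 + 8 * pairs)) // 2    # positive root of n*(n-1)/2 = pairs

size = condensed_dist_bytes(7000)
print(size, round(size / 2**20, 1))   # 195972000 bytes, about 186.9 MiB
print(rows_from_request(200194208))   # 7075
```

Under these assumptions, 7000 objects need about 187 MiB, which matches the figure Prof. Ripley gives. The 200194208 bytes in the error message would correspond to 7075 rows, so "7000 x 5" was probably a rounded description. Either way, a single object of that size is large relative to 512 MBytes of RAM.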
clustering • 625 views
modified 14.8 years ago by Stephen Henderson1.0k • written 14.8 years ago by cstrato3.9k
Answer: Re: [S] Error in clustering procedure
0
14.8 years ago by
United States
James W. MacDonald50k wrote:
cstrato wrote:
> Sorry, but I cannot resist:
>
> Any comments from the microarray community on the usefulness of
> hierarchical clustering of 7000 genes?

I think this would be almost completely useless. First off, clustering is not an inferential technique, so its use has been completely oversold IMO to the biological community. Secondly, clustering is usually done to produce a 'heat map' to put in a paper or flash on the screen during a presentation. How on earth would this be of any use? You couldn't even read any of the gene names!

Of course you could use the heatmap to impress friends and colleagues with the fact that you rate a computer powerful enough to *do* a heatmap with a 7000 x 5 matrix ;-D

Jim

--
James W. MacDonald
Affymetrix and cDNA Microarray Core
University of Michigan Cancer Center
1500 E. Medical Center Drive
7410 CCGC
Ann Arbor MI 48109
Dear all

First of all, I want to apologize to Prof. Ripley, since I forgot to ask him for permission before publishing his comment.

Personally, I agree that this would be useless, as Prof. Ripley already told me some years ago. However, almost everybody still seems to do it and publish the corresponding results. Companies such as Spotfire are proud that you can do hierarchical clustering with more than 20,000 genes. I have never seen a publication where it was done differently.

I think that the Bioconductor list would be the best forum to discuss this issue and to provide solutions (besides the obvious suggestion to filter non-varying genes).

Best regards
Christian
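The "obvious suggestion" of filtering non-varying genes can be sketched as follows. This is only an illustration in Python: the gene names, the toy expression values, the function `filter_by_variance`, and the choice of population variance as the ranking criterion are all assumptions made for the example, not something proposed in the thread.

```python
import statistics

def filter_by_variance(profiles, keep):
    """Return the `keep` gene names with the largest variance across arrays."""
    ranked = sorted(
        profiles,
        key=lambda gene: statistics.pvariance(profiles[gene]),
        reverse=True,
    )
    return ranked[:keep]

# Hypothetical toy data: gene name -> expression values over 5 arrays.
profiles = {
    "geneA": [1.0, 1.1, 0.9, 1.0, 1.0],   # nearly flat
    "geneB": [0.2, 2.5, 0.1, 2.8, 0.3],   # strongly varying
    "geneC": [5.0, 5.0, 5.0, 5.0, 5.0],   # constant
    "geneD": [1.0, 2.0, 3.0, 4.0, 5.0],   # steadily increasing
}

print(filter_by_variance(profiles, keep=2))   # ['geneD', 'geneB']
```

Only the retained genes would then be passed to the clustering step, which shrinks the distance matrix quadratically: keeping 1000 of 7000 genes cuts the number of pairwise distances by a factor of roughly 49.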
Answer: Re: [S] Error in clustering procedure
0
14.8 years ago by
David K Pritchard70 wrote:
Christian,

I think it is overstating the matter to say it is useless to hierarchically cluster 7000 genes. In most studies where one is comparing only two or a few different conditions, there is generally not a lot of structure in the data and clustering is not useful. However, I have been involved with rare experiments where there is a lot of structure in the data, and clustering the whole dataset (10 or 20K genes) is useful to see that structure. I am presently analyzing an experiment where overexpression of a gene is compared to overexpression of a number of mutant forms of the gene. In this study, hierarchically clustering the data (20K genes) revealed structure that would have been hard to see otherwise.

Clearly there is no good way to look at all of this data at one time. However, programs like MEV from TIGR do a good job of presenting a useful interface for browsing that much data. I also believe that MEV will hierarchically cluster ~20K genes and is freely available from the TIGR website.

David Pritchard
Answer: Re: [S] Error in clustering procedure
0
14.8 years ago by
michael watson IAH-C3.4k wrote:
-----Original Message-----
From: Prof Brian Ripley [mailto:ripley@stats.ox.ac.uk]

> But MDS-like methods (note, not algorithms) are better for your stated
> purpose.

Hi

Just thinking out loud here, which can be a painful process...

So MDS/PCA is an exercise in dimension reduction. Therefore, if we reduce the dimensionality of the dataset to few(er) dimensions which explain most of the variability, then order the data set by those dimensions, then that will place together genes (in the list) which are behaving similarly - is that what you are suggesting?
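One reading of the idea described above is: project each gene onto the leading principal component and sort the gene list by that score. The sketch below does this in plain Python with power iteration on the small arrays-by-arrays covariance matrix. It covers PCA only (not MDS in general), and the toy profiles, the gene names, and the function `first_pc_scores` are hypothetical, so treat it as an illustration of the question rather than as what Prof. Ripley had in mind.

```python
import math

def first_pc_scores(rows, iterations=200):
    """Project each row onto the first principal component (power iteration)."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    # p x p covariance matrix; p is the number of arrays, so this stays small.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    v = [1.0 / (k + 1) for k in range(p)]   # arbitrary non-zero start vector
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return [sum(c[j] * v[j] for j in range(p)) for c in centered]

# Hypothetical toy data: two rising genes, two falling genes, one flat gene.
genes = {
    "up1":   [1.0, 2.0, 3.0, 4.0, 5.0],
    "down1": [5.0, 4.0, 3.0, 2.0, 1.0],
    "up2":   [1.1, 2.1, 2.9, 4.2, 5.1],
    "down2": [4.9, 4.1, 3.0, 1.9, 0.8],
    "flat":  [3.0, 3.0, 3.0, 3.0, 3.0],
}

names = list(genes)
scores = first_pc_scores([genes[g] for g in names])
order = [g for _, g in sorted(zip(scores, names))]
print(order)
```

In this toy case the two rising genes end up next to each other, as do the two falling genes, with the flat gene in between. Which group comes first depends on the arbitrary sign of the component. Ordering by a single dimension can only capture one axis of variation, so genes that differ mainly along later components may still sit side by side.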
"Dimension reduction" brings up another important issue: I had discussions with quite a few scientists who believe that dimension reduction is not allowed, since you are loosing worthwile information. With respect to gene expression I believe hat it makes sense to filter first non-variant genes to reduce the number of dimensions. But..., these people are using hierarchical clustering to cluster chemical compound libraries in "chemical space", and there are no compounds to eliminate. So, another question is, which method would be best to cluster about one million compounds in chemical space in order to be able reduce the number of compounds used in screening by selecting only representative members of a certain cluster. Best regards Christian michael watson (IAH-C) wrote: > -----Original Message----- > From: Prof Brian Ripley [mailto:ripley@stats.ox.ac.uk] > > >>But MDS-like methods (note, not algorithms) are better for your stated >>purpose. > > > Hi > > Just thinking out-loud here, which can be a painful process... > > So MDS/PCA is an exercise in dimension reduction. Therefore, if we > reduce the dimensionality of the dataset to few(er) dimensions which > explain most of the variability, then order the data set by those > dimensions, then that will place together genes (in the list) which are > behaving similarly - is that what you are suggesting? > >
Answer: Re: [S] Error in clustering procedure
0
14.8 years ago by
Liaw, Andy360
Liaw, Andy360 wrote:
> From: cstrato
>
> "Dimension reduction" brings up another important issue:
> I had discussions with quite a few scientists who believe
> that dimension reduction is not allowed, since you are
> losing worthwhile information.

Eh? By this logic, we shouldn't believe any conclusions drawn in any paper that does not contain the rawest of raw data? Part of data analysis is summarizing data into the bare essentials (have you heard of 'sufficient statistics'? If not, it might be worth your while) and extracting useful information from data that contain noise. People who make statements like that probably believe there's no such thing as noise in their data. May God have mercy on them.

> With respect to gene expression I believe that it makes
> sense to filter first non-variant genes to reduce the
> number of dimensions.
>
> But..., these people are using hierarchical clustering
> to cluster chemical compound libraries in "chemical space",
> and there are no compounds to eliminate.

Who are 'these people' now? Seems like you're changing the subject to one that's probably off-topic for BioC.

> So, another question is, which method would be best to
> cluster about one million compounds in chemical space in
> order to be able reduce the number of compounds used in
> screening by selecting only representative members of a
> certain cluster.

There's quite a bit of work done on this subject in the computational chemistry literature. The context is really quite different from gene expression. Molecules are clustered based on their chemical structures (which are known), and those data are not measured (usually), but computed, so there are no measurement errors. The goal is also quite different. I have not heard of anyone trying to find 'representative genes' (but I'm not familiar with bioinformatics; maybe someone _would_ be interested in that?).

Andy
Liaw, Andy wrote:
>> From: cstrato
>>
>> "Dimension reduction" brings up another important issue:
>> I had discussions with quite a few scientists who believe
>> that dimension reduction is not allowed, since you are
>> losing worthwhile information.
>
> Eh? By this logic, we shouldn't believe any conclusions drawn in any paper
> that does not contain the rawest of raw data? Part of data analysis is
> summarizing data into the bare essentials (have you heard of 'sufficient
> statistics'? If not, it might be worth your while) and extracting useful
> information from data that contain noise. People who make statements like
> that probably believe there's no such thing as noise in their data. May God
> have mercy on them.

I mentioned this only to show that it is still sometimes hard to argue; mentioning "sufficient statistics" could be helpful.

>> With respect to gene expression I believe that it makes
>> sense to filter first non-variant genes to reduce the
>> number of dimensions.
>>
>> But..., these people are using hierarchical clustering
>> to cluster chemical compound libraries in "chemical space",
>> and there are no compounds to eliminate.
>
> Who are 'these people' now? Seems like you're changing the subject to one
> that's probably off-topic for BioC.

I would not consider this off-topic but a natural extension:
"expression profiling -> compound profiling -> compound activity profiling -> compound structure profiling"

All these steps share the same problem: what is the best clustering algorithm to use (if there is any)? Furthermore, it is my belief that in the future these steps will be analyzed together, resulting in a much deeper understanding.

P.S.: Looking at the BioC packages, BioC is already expanding to include proteomics analysis. It would be a natural step for BioC to expand further to cover chemoinformatics.

>> So, another question is, which method would be best to
>> cluster about one million compounds in chemical space in
>> order to be able reduce the number of compounds used in
>> screening by selecting only representative members of a
>> certain cluster.
>
> There's quite a bit of work done on this subject in the computational
> chemistry literature. The context is really quite different from gene
> expression. Molecules are clustered based on their chemical structures
> (which are known), and those data are not measured (usually), but computed,
> so there are no measurement errors. The goal is also quite different. I have
> not heard of anyone trying to find 'representative genes' (but I'm not
> familiar with bioinformatics; maybe someone _would_ be interested in
> that?).
>
> Andy

Christian
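The "representative members of a cluster" idea discussed above can be made concrete with medoids: for each cluster, pick the member with the smallest total distance to the other members. The sketch below assumes the cluster labels already exist and uses Euclidean distance on made-up two-dimensional descriptor vectors. Both choices, and the function names, are purely illustrative assumptions; the thread does not say how compounds are described or compared.

```python
import math

def medoid(points):
    """Index of the point with the smallest total distance to all the others."""
    return min(
        range(len(points)),
        key=lambda i: sum(math.dist(points[i], q) for q in points),
    )

def representatives(points, labels):
    """Map each cluster label to the index (into `points`) of its medoid."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    return {
        label: members[medoid([points[i] for i in members])]
        for label, members in clusters.items()
    }

# Hypothetical descriptor vectors for six compounds in two known clusters.
points = [(0, 0), (0, 1), (0, 2), (10, 10), (10, 11), (10, 13)]
labels = [0, 0, 0, 1, 1, 1]

print(representatives(points, labels))   # {0: 1, 1: 4}
```

The medoid search is quadratic in the cluster size, so for something on the scale of a million compounds it would only be practical within clusters that are already reasonably small.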
Answer: Re: [S] Error in clustering procedure
0
14.8 years ago by
Stephen Henderson1.0k wrote: