clustering question

Kimpel, Mark W ▴ 890

@kimpel-mark-w-727

Last seen 9.6 years ago

I have a general question about clustering of genomic data. The heatmaps that are generated are usually scaled row-wise so that variations are apparent within rows but not between rows. In looking at the documentation of heatmap and hclust, however, it appears that this scaling is done after the actual clustering is performed. If heatmap is performed on the hclust object with scale="none", it is apparent that most of the row clustering is based on overall gene expression levels, not on similar column-wise behavior between rows.

Wouldn't it make sense to scale row-wise before clustering so that the row clusters are based more on the correlation of the behavior of rows between columns, i.e. two genes would be near each other if the genes behaved similarly across samples? I realize that some of this effect may be achieved with unscaled data, but it seems to me that the large overall expression differences may minimize that.

I hope this makes sense; I may not have used all of the correct nomenclature.

Thanks,

Mark

Mark W. Kimpel MD
Department of Psychiatry
Indiana University School of Medicine
Biotechnology, Research, & Training Center
1345 W. 16th Street
Indianapolis, IN 46202
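Editor's sketch: the row-wise scaling proposed in the question can be tried directly on a matrix before it is handed to hclust. The example below is only an illustration under stated assumptions: `expr` is a made-up matrix with genes in rows and samples in columns, and none of the names come from the thread. Because scale() standardizes columns, the matrix is transposed before and after. The last lines check a standard identity: for z-scored rows, squared Euclidean distance equals 2(n-1)(1-r), so clustering scaled rows with Euclidean distance orders gene pairs the same way as 1-correlation.

```r
# Toy data (assumption): 6 hypothetical genes x 10 samples,
# each gene shifted to a different overall expression level.
set.seed(7)
expr <- matrix(rnorm(60), nrow = 6,
               dimnames = list(paste0("gene", 1:6), paste0("sample", 1:10)))
expr <- expr + c(2, 4, 6, 8, 10, 12)  # recycled down columns: one offset per row

# scale() works on columns, so transpose to z-score each gene (row).
z <- t(scale(t(expr)))

# Clustering on raw values versus row-scaled values.
hc_raw    <- hclust(dist(expr))
hc_scaled <- hclust(dist(z))

# Identity for z-scored rows: squared Euclidean distance = 2 * (n - 1) * (1 - r)
n  <- ncol(expr)
d2 <- as.matrix(dist(z))^2
r  <- cor(t(expr))
identity_holds <- isTRUE(all.equal(d2, 2 * (n - 1) * (1 - r),
                                   check.attributes = FALSE))
```

Calling plot(hc_raw) and plot(hc_scaled) side by side shows how much the tree changes once the overall expression level is removed.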

Clustering Clustering • 1.5k views

ADD COMMENT • link updated 18.2 years ago by Naomi Altman ★ 6.0k • written 18.2 years ago by Kimpel, Mark W ▴ 890

Sean Davis 21k

@sean-davis-490

Last seen 3 months ago

United States

Mark,

If I understand you correctly, you might want to look at the "distfun" argument to heatmap. The distfun argument allows you to use any dissimilarity function that you like, including 1-correlation.

Sean
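Editor's sketch of the distfun suggestion, assuming a toy matrix `expr` with genes in rows (the matrix and the helper name `cor_dist` are hypothetical). heatmap() calls distfun on a matrix whose rows are the items being clustered, and cor() correlates columns, so the helper transposes its input first.

```r
# Toy expression matrix (assumption): 20 hypothetical genes x 10 samples.
set.seed(1)
expr <- matrix(rnorm(200), nrow = 20,
               dimnames = list(paste0("gene", 1:20), paste0("sample", 1:10)))

# heatmap() passes x when clustering rows and t(x) when clustering columns,
# so the rows of m are always the items to cluster. cor() works on columns,
# hence the transpose.
cor_dist <- function(m) as.dist(1 - cor(t(m)))

# Cluster on 1 - correlation; scale = "row" only affects the colours drawn.
hm <- heatmap(expr, distfun = cor_dist, scale = "row")
```

The same helper is used for both dendrograms here; a different hclustfun can be supplied alongside it if another linkage is wanted.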

ADD COMMENT • link 18.2 years ago Sean Davis 21k

Kimpel, Mark W ▴ 890

@kimpel-mark-w-727

Last seen 9.6 years ago

Sean,

Thank you for the reply. Would you be able to provide a brief code chunk for the 1-correlation function you describe?

Also, would anyone like to comment on the more bioinformatic slant of my question, i.e. do you gain more knowledge about the system by clustering using 1-correlation or by hclust? As a biologist, it seems to me that we are more often interested in finding genes that behave similarly between samples rather than genes with similar mean expression.

If this is true, I wonder if the developers of GOClust could comment on the clustering algorithms included as options in their package, which are Clara, Hclust, Kmeans, and Pam. Again, as a biologist, I believe that GO clustering would be most appropriately done on genes that behave similarly between samples, rather than on genes that have similar mean expression. If, for example, two genes are exactly inversely proportional, then they should cluster right next to each other as they may be co-regulated.

I feel fairly confident in my assertions as a biologist, but I am not a mathematician, and, if I am misunderstanding how clustering works under these various algorithms, please correct me.

Thanks,

Mark

Mark W. Kimpel, M.D.

ADD COMMENT • link 18.2 years ago Kimpel, Mark W ▴ 890

On 2/20/06 9:58 AM, "Kimpel, Mark William" <mkimpel at iupui.edu> wrote:

> Sean,
>
> Thank you for the reply. Would you be able to provide a brief code chunk for the 1-correlation function you describe?
>
> Also, would anyone like to comment on the more bioinformatic slant of my question, i.e. do you gain more knowledge about the system by clustering using 1-correlation or by hclust? As a biologist, it seems to me that we are more often interested in finding genes that behave similarly between samples rather than genes with similar mean expression.

Clustering algorithms often take as input a matrix of dissimilarities (how different the things are that one is clustering). Hclust and friends all have default measures of dissimilarity; for hclust, this is euclidean distance. You can use any distance metric you like such as:

    plot(hclust(as.dist(1-cor(mymatrix))))

If you want to get both correlated and anticorrelated genes, then use 1-abs(cor(...)). Hopefully, you get the idea.

> If this is true, I wonder if the developers of GOClust could comment on the clustering algorithms included as options in their package, which are Clara, Hclust, Kmeans, and Pam. Again, as a biologist, I believe that GO clustering would most be most appropriately done on genes that behave similarly between samples, rather than have similar mean expression. If, for example, two genes are exactly inversely proportional, then they should cluster right next to each other as they may be co-regulated.

You probably need to read the help pages for each of these different clustering methods carefully if you are concerned about the details. If I am not mistaken, Goclust simply uses the clara, kmeans, etc. from other packages to perform the clustering, so reading the corresponding help pages will likely be enlightening.
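Editor's sketch comparing the two dissimilarities mentioned in this reply, 1-cor and 1-abs(cor), on a small made-up matrix (all gene names and values are assumptions for illustration). Note that cor() computes correlations between columns, so with genes in rows the matrix is transposed before the call.

```r
# Toy data (assumption): four hypothetical genes over 12 samples.
set.seed(42)
base <- rnorm(12)
expr <- rbind(
  geneA = 10 + base,                              # reference profile, high level
  geneB = 2 + 0.5 * base + rnorm(12, sd = 0.05),  # same shape, low level
  geneC = 6 - base + rnorm(12, sd = 0.05),        # mirror image of geneA
  geneD = 6 + rnorm(12)                           # unrelated noise
)

# Genes are rows, so transpose before cor().
r <- cor(t(expr))

d_cor <- as.dist(1 - r)        # anticorrelated genes end up far apart (near 2)
d_abs <- as.dist(1 - abs(r))   # anticorrelated genes end up close (near 0)

hc_cor <- hclust(d_cor)
hc_abs <- hclust(d_abs)

# Distances from geneA under each dissimilarity.
from_A_cor <- as.matrix(d_cor)["geneA", ]
from_A_abs <- as.matrix(d_abs)["geneA", ]
```

With 1-cor, geneA and geneB sit together despite very different expression levels, while geneC is pushed away; with 1-abs(cor), geneC joins them. plot(hc_cor) and plot(hc_abs) draw the two trees.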

ADD REPLY • link 18.2 years ago Sean Davis 21k

Naomi Altman ★ 6.0k

@naomi-altman-380

Last seen 3.0 years ago

United States

An embedded and charset-unspecified text was scrubbed...
Name: not available
Url: https://stat.ethz.ch/pipermail/bioconductor/attachments/20060221/ccc071e0/attachment.pl

ADD COMMENT • link 18.2 years ago Naomi Altman ★ 6.0k

Kimpel, Mark W ▴ 890

@kimpel-mark-w-727

Last seen 9.6 years ago

An embedded and charset-unspecified text was scrubbed...
Name: not available
Url: https://stat.ethz.ch/pipermail/bioconductor/attachments/20060223/100fb09c/attachment.pl

ADD COMMENT • link 18.2 years ago Kimpel, Mark W ▴ 890

Naomi Altman ★ 6.0k

@naomi-altman-380

Last seen 3.0 years ago

United States

Dear Mark,

Thanks for your kind remarks.

I think all you want to do is use Euclidean distance after removing the gene mean.

    ExprNoMean <- exprs(myData) - rowMeans(exprs(myData))

hclust(dist(ExprNoMean)) or heatmap(ExprNoMean) will then do complete linkage clustering on your data using Euclidean distance.

There are so many clustering options, and I am not very knowledgeable about the pros and cons of each, but complete linkage often seems reasonable.

Sorry that I do not have time to look at your data - too much of my own right now.

--Naomi

At 04:15 PM 2/23/2006, Kimpel, Mark William wrote:

> Naomi,
>
> Thanks for addressing my clustering question. I wanted to follow up because I'm still not clear on what I need to do and how I need to do it.
>
> Basically, I would like genes to cluster together that behave similarly between two samples. I don't care what their absolute level of expression is, but, over 12 samples, genes that go up and down a similar amount should cluster together.
>
> I have looked at heatmap, hclust, and dist functions and am still unsure how to proceed.
>
> If, for example, I have an eset that I'm working with, could you provide me a code example of how to get what I want?
>
> I really appreciate your help, not only with this, but all the BioC posts that I've learned from.
>
> Mark
>
> Mark W. Kimpel, M.D.
>
> From: Naomi Altman
> Sent: Tuesday, February 21, 2006 11:46 AM
> To: Kimpel, Mark William; Sean Davis; Bioconductor
> Subject: Re: [BioC] clustering question
>
> hclust can use any distance metric.
>
> 1-correlation is one of several metrics you could use.
>
> 1-correlation focuses on the "up and down" behavior, but scales each gene to have the same standard deviation.
> Euclidean distance focuses more on the overall level of expression.
> Euclidean distance with the mean or median removed focuses on the "up and down" behavior, but also considers the magnitude of that behavior.
>
> There are many other choices of metric and clustering method. I don't think you can really state that one method or metric produces results that are more "biologically meaningful". I think that depends on what you mean by "biologically meaningful".
>
> If you have replicate arrays for the same biological condition, these should be averaged before clustering, as you want to cluster based on the response to the condition, not on the noise.
>
> The paper below is very readable and sheds a lot of light on these issues:
>
> Jenny Bryan, "Problems in gene clustering based on gene expression data", Journal of Multivariate Analysis 90 (2004) 44-66.
>
> --Naomi

Naomi S. Altman
Associate Professor
Dept. of Statistics
Penn State University
University Park, PA 16802-2111
814-865-3791 (voice)
814-863-7114 (fax)
814-865-1348 (Statistics)
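Editor's sketch of the two steps suggested in this exchange, averaging replicates and then removing the gene mean before Euclidean clustering. It uses a plain matrix rather than an ExpressionSet so that it runs without Biobase; the matrix `expr`, the `conditions` vector, and the three-condition design are all assumptions made for illustration.

```r
# Toy design (assumption): 8 hypothetical genes, 3 conditions x 4 replicates.
set.seed(3)
conditions <- rep(c("ctrl", "trtA", "trtB"), each = 4)
expr <- matrix(rnorm(8 * 12), nrow = 8,
               dimnames = list(paste0("gene", 1:8),
                               paste0(conditions, "_", 1:4)))
expr <- expr + seq(4, 11)  # recycled down columns: one baseline level per gene

# Average replicate arrays within each condition (genes x conditions).
cond_means <- sapply(split(seq_len(ncol(expr)), conditions),
                     function(idx) rowMeans(expr[, idx, drop = FALSE]))

# Remove each gene's mean. A vector of length nrow() is recycled down the
# columns, so every row has its own mean subtracted.
centered <- cond_means - rowMeans(cond_means)

# Euclidean distance on centered profiles; hclust() defaults to complete linkage.
hc <- hclust(dist(centered))
```

For an ExpressionSet, the matrix would come from exprs() as in the reply above; plot(hc) draws the resulting dendrogram.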

ADD COMMENT • link 18.2 years ago Naomi Altman ★ 6.0k
