Heatmap with 7120x500 array


Gaston Fiore ▴ 40

@gaston-fiore-4224


Hello everyone,

I'm trying to produce a heat map that clusters 7120 genes into 6 groups based on 500 conditions. I'm using kmeans and then image, but I have two problems. The first is that kmeans sometimes doesn't converge even with 10 restarts, and the second is that the image produced is basically all read (I'm using the standard color scheme), not to mention that its size is massive and very hard to deal with. Does anyone have any suggestions on how I could accomplish this task efficiently, or is this data just too big to cluster?

Thanks a lot,

-Gaston


ADD COMMENT • link updated 13.7 years ago by Paul Leo ▴ 970 • written 13.7 years ago by Gaston Fiore ▴ 40
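One thing that may be worth checking, since the original kmeans call is not shown: in R's kmeans the number of random restarts (nstart, default 1) and the iteration cap per run (iter.max, default 10) are separate arguments, so a "did not converge" warning can persist with many restarts if iter.max is left at its default. The following is only a sketch on simulated data (the matrix x and its six-group structure are assumptions, not the poster's data):

```r
# Simulated stand-in for the expression matrix: genes in rows, conditions in columns.
set.seed(1)
n_genes <- 600
n_cond  <- 20
group   <- rep(1:6, each = n_genes / 6)
# Adding 'group' shifts every row by its group number (recycled down the columns).
x <- matrix(rnorm(n_genes * n_cond), nrow = n_genes) + group

# iter.max caps the iterations of each run; nstart is the number of random restarts.
fit <- kmeans(x, centers = 6, iter.max = 100, nstart = 10)

fit$iter           # iterations used by the best run
fit$ifault         # 0 means the algorithm reported no problem
table(fit$cluster) # cluster sizes
```

If the warning remains with a generous iter.max, the data themselves (scale, outliers, noise genes) are the more likely cause.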


Gerhard Thallinger ▴ 180

@gerhard-thallinger-1552


Austria

Hi Gaston,

> I'm trying to produce a heat map that clusters 7120 genes
> into 6 groups based on 500 conditions. I'm using kmeans and
> then image, but I've two problems. The first one is that
> kmeans sometimes doesn't converge even with 10 restarts, and
> the second one is that the image produced is basically all
> read (I'm using the standard color scheme), not to mention
> it's size is massive and very hard to deal with. Does anyone
> have any suggestions on how I could accomplish this task
> efficiently, or is this data just too big to cluster?

Genesis should be able to handle datasets that large
(http://genome.tugraz.at/genesisclient/genesisclient_description.shtml).
Adapting the color scale is very easy.

I can't comment on the convergence of k-means; this could depend
on the data.

Regards,

Gerhard

ADD COMMENT • link 13.7 years ago Gerhard Thallinger ▴ 180


On 08/28/2010 10:52 AM, Gerhard Thallinger wrote:
> Hi Gaston,
>
>> I'm trying to produce a heat map that clusters 7120 genes
>> into 6 groups based on 500 conditions. I'm using kmeans and
>> then image, but I've two problems. The first one is that
>> kmeans sometimes doesn't converge even with 10 restarts, and
>> the second one is that the image produced is basically all
>> read (I'm using the standard color scheme), not to mention
>> it's size is massive and very hard to deal with. Does anyone
>> have any suggestions on how I could accomplish this task
>> efficiently, or is this data just too big to cluster?
>
> Genesis should be able to handle datasets that large
> (http://genome.tugraz.at/genesisclient/genesisclient_description.shtml)
> Adapting the color scale is very easy.
>
> I can't comment on the convergence of k-means, this could depend
> on the data.

Hi Gaston,

I'd guess the 'all read' (? red) is due to a few extreme values driving the color palette. Perhaps you intend to log-transform or otherwise pre-process the data before clustering / display, which might also help convergence? Likewise, consider applying a filter like varFilter in the genefilter package to reduce the number of genes being clustered; most will not be contributing anything meaningful to the clustering algorithm.

I think what you want to do is to separate the steps of clustering, reordering rows / columns, and displaying the image. See ?dendrogram, ?reorder, ?heatmap. heatmap should be doing little more than plotting an image (no sense in printing the dendrograms, as they'll be too dense to make sense of).

Martin

> Regards,
>
> Gerhard

--
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793

ADD REPLY • link 13.7 years ago Martin Morgan 25k
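The separation of steps suggested above (pre-process, cluster, reorder rows, then display) could look roughly like the sketch below. Everything specific in it is an assumption made for illustration: the simulated matrix expr, the log2(x + 1) transform, row centering, the clipping limit of 2, the blue-white-red palette, and the output file name. It uses kmeans for the ordering, as in the original question, rather than a dendrogram.

```r
set.seed(42)
# Simulated right-skewed data standing in for the real matrix (genes x conditions).
n_genes <- 300
n_cond  <- 40
group   <- rep(1:6, each = n_genes / 6)
# Each gene group is elevated in its own block of conditions.
signal <- outer(group, seq_len(n_cond),
                function(g, j) ifelse(ceiling(j * 6 / n_cond) == g, 2, 0))
expr <- matrix(rlnorm(n_genes * n_cond, meanlog = 2 + signal, sdlog = 0.5),
               nrow = n_genes)

# 1. Pre-process: log2 tames extreme values; centering shows deviation per gene.
logx <- log2(expr + 1)
z <- logx - rowMeans(logx)

# 2. Cluster the genes.
km <- kmeans(z, centers = 6, iter.max = 100, nstart = 10)

# 3. Reorder rows so genes in the same cluster are adjacent.
ord   <- order(km$cluster)
z_ord <- z[ord, ]

# 4. Display: clip to a symmetric range so a few outliers cannot saturate the palette.
lim    <- 2
z_clip <- pmin(pmax(z_ord, -lim), lim)
pal    <- colorRampPalette(c("blue", "white", "red"))(64)

out_file <- file.path(tempdir(), "heatmap.pdf")
pdf(out_file, width = 6, height = 8)  # png() could be used instead for a bitmap
image(x = seq_len(ncol(z_clip)), y = seq_len(nrow(z_clip)), z = t(z_clip),
      col = pal, zlim = c(-lim, lim), useRaster = TRUE,
      xlab = "Condition", ylab = "Gene (ordered by cluster)")
invisible(dev.off())
```

Drawing with useRaster = TRUE stores the matrix as a single raster image instead of one rectangle per cell, which tends to keep the output file manageable for large matrices.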


Paul Leo ▴ 970

@paul-leo-2092


I think you need to filter those genes, as you almost certainly have too much noise. If you are doing class discovery, perhaps use the ratio of mean/sd: choose the genes with the most variation and use about 500-1000 of them. The heatmap won't be readable in any case. I would perhaps try principal components / spectral decomposition, like:

    the.pca <- prcomp(data, scale = TRUE)  # for samples/genes (try using attributes(the.pca))
    dim(the.pca$x)

    ### estimate PCA's you need
    the.pca.var <- round(the.pca$sdev^2 / sum(the.pca$sdev^2) * 100, 2)
    plot(c(1:length(the.pca.var)), the.pca.var, type="b",
         xlab="# components", ylab="% variance",
         main="Scree Plot for Hits", col="red", cex=1.5, cex.lab=1.5)
    savePlot("scree plot.jpeg", type="jpeg")  # needs a windows/X11 screen device

    centers <- 15
    the.cl <- kmeans(the.pca$x[,1:2], centers=centers, iter.max=1000)  # Do kmeans
    colours <- rainbow(centers)

    ## 2D
    plot(range(the.pca$x[,1]), range(the.pca$x[,2]),
         xlab="PCA1", ylab="PCA2",
         main="Spectral clustering of differential hits")
    text(the.pca$x[,1], the.pca$x[,2], label=rownames(the.pca$x),
         col=colours[the.cl$cluster], cex=0.75)

    ### 3D
    library(scatterplot3d)
    s3d <- scatterplot3d(range(the.pca$x[,1]), range(the.pca$x[,2]),
                         range(the.pca$x[,3]),
                         xlab="PCA1", ylab="PCA2", zlab="PCA3",
                         main="Spectral clustering of differential hits", angle=120)
    text(s3d$xyz.convert(the.pca$x[,1], the.pca$x[,2], the.pca$x[,3]),
         label=rownames(the.pca$x), col=colours[the.cl$cluster], cex=0.75)
    # 'wanted' (rows to highlight) and 'color' are not defined above; set them first
    points(s3d$xyz.convert(the.pca$x[wanted,1], the.pca$x[wanted,2],
                           the.pca$x[wanted,3]), col=color, cex=5.0)

Otherwise, if you have class labels, use SAM or PAM.

Hope that helps

Cheers
Paul

-----Original Message-----
From: Gaston Fiore <gaston.fiore@gmail.com>
To: bioconductor@stat.math.ethz.ch
Subject: [BioC] Heatmap with 7120x500 array
Date: Fri, 27 Aug 2010 15:39:37 -0400

Hello everyone, I'm trying to produce a heat map that clusters 7120 genes into 6 groups based on 500 conditions. I'm using kmeans and then image, but I've two problems. The first one is that kmeans sometimes doesn't converge even with 10 restarts, and the second one is that the image produced is basically all read (I'm using the standard color scheme), not to mention it's size is massive and very hard to deal with. Does anyone have any suggestions on how I could accomplish this task efficiently, or is this data just too big to cluster? Thanks a lot, -Gaston

ADD COMMENT • link 13.7 years ago Paul Leo ▴ 970
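The gene filtering step recommended in this thread is described only in prose, so here is a rough sketch of one way it might be done in base R. Note the substitution: it ranks genes by their plain per-gene standard deviation, not by the mean/sd ratio mentioned above, and the simulated matrix x, the inflated first 800 rows, and the cutoff of 500 are all assumptions for illustration. The varFilter function in genefilter, mentioned earlier in the thread, is a packaged alternative.

```r
set.seed(7)
# Simulated stand-in for the expression matrix: 7120 genes x 50 conditions.
n_genes <- 7120
n_cond  <- 50
x <- matrix(rnorm(n_genes * n_cond), nrow = n_genes,
            dimnames = list(paste0("gene", seq_len(n_genes)), NULL))
# Make the first 800 genes genuinely more variable (illustration only).
x[1:800, ] <- x[1:800, ] * 3

# Per-gene variability across conditions.
gene_sd <- apply(x, 1, sd)

# Keep the n_keep most variable genes.
n_keep <- 500
keep   <- order(gene_sd, decreasing = TRUE)[seq_len(n_keep)]
x_filt <- x[keep, ]

dim(x_filt)
```

The filtered matrix x_filt can then be passed to prcomp or kmeans in place of the full data.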
