Clustering in R

wmak@brandeis.edu ▴ 10

Dear list members,

I'm an undergrad and I work in a lab at Brandeis. I am trying to cluster around 14,000 genes across 6 microarray experiments. Two of these experiments are replicates. I have decided to use R since it seems to be the most complete and flexible software package for normalization and clustering of microarray data.

The problem is that I am new to clustering and to R. Just to mention a few of the problems I'm having: the dendrogram that R draws from the agnes object is far too dense to see any of the gene names; kmeans won't work, returning an error saying that my data has NAs in it (there weren't any missing values in the original table, though); and I'd like to be able to see a heatmap or a cumulative plot of expression profiles for genes that are clustered together or are on the same branch of the dendrogram.

I know that these questions are probably very simple, but I can't seem to find the answers online or in the documentation. If anyone can answer these questions or direct me toward resources that deal with clustering in R or BioConductor, such as a basic tutorial that takes a practical approach, I would really appreciate it. Any other reading material on clustering that isn't too heavy on statistics would also be very helpful.

Thank you in advance,
Wayne Mak

Microarray Normalization Clustering • 1.8k views

updated 19.8 years ago by michael watson IAH-C ★ 3.4k • written 19.8 years ago by wmak@brandeis.edu ▴ 10

Johan Lindberg ▴ 270

Hi Wayne.

A couple of months ago I was dealing with the same issues you are facing now. I found no good way to cluster a large set of genes in R and get something useful out of it. To see any detail in a dendrogram with 14,000 genes, you would need a screen as large as a house. I suggest you use MEV from TIGR, or some other freeware tool, to do the job for you. I use MEV myself after normalizing and preparing my data in R.

// Johan

Johan Lindberg
Royal Institute of Technology
AlbaNova University Center
Stockholm Center for Physics, Astronomy and Biotechnology
Department of Molecular Biotechnology
106 91 Stockholm, Sweden
http://www.biotech.kth.se/molbio/microarray/index.html

(In reply to wmak@brandeis.edu, "[BioC] Clustering in R", sent 16 June 2004 22:26.)

written 19.8 years ago by Johan Lindberg ▴ 270

michael watson IAH-C ★ 3.4k

OK, admittedly it is not incredibly simple, but it is not *that* difficult. If you are familiar with R, it should take you an hour or two; if unfamiliar, perhaps a day or two.

The commands you want (and need to read the help on) are:

    hclust
    plclust
    dendrogram
    as.dendrogram
    cutree
    heatmap

With intelligent use of hclust -> cutree -> subsetting -> hclust (in that order) you will be able to drill down into your dendrogram and create sub-trees, until you get to the level where you can see your gene names.

An important message to take home here is that if you have 14000 genes and therefore 14000 labels, it's going to be difficult to display your tree in ANY software, including the expensive commercial products.

Let me know how you get on.

Thanks
Mick

(In reply to wmak@brandeis.edu, "[BioC] Clustering in R", sent 16 June 2004 21:26.)
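
The hclust -> cutree -> subsetting -> hclust drill-down could look roughly like the sketch below. Everything specific in it is an assumption made for illustration: `expr` is simulated stand-in data (200 genes by 6 arrays, with made-up gene names), and Euclidean distance, average linkage and k = 5 are arbitrary choices, not recommendations from this thread. It uses plot() rather than plclust, since plclust is no longer available in current R releases.

```r
# Simulated stand-in for a normalized expression matrix:
# genes in rows, arrays in columns (hypothetical names).
set.seed(1)
expr <- matrix(rnorm(200 * 6), nrow = 200,
               dimnames = list(paste0("gene", 1:200), paste0("array", 1:6)))

# kmeans() refuses data containing NAs, so count them after import.
n_missing <- sum(is.na(expr))

# Step 1: cluster all genes.
hc_all <- hclust(dist(expr), method = "average")

# Step 2: cut the tree into k groups; cutree() returns one group id per gene.
groups <- cutree(hc_all, k = 5)
print(table(groups))

# Step 3: subset the matrix to one group (here, the largest one).
biggest <- as.integer(names(which.max(table(groups))))
sub_expr <- expr[groups == biggest, , drop = FALSE]

# Step 4: re-cluster the subset and plot the smaller tree.
hc_sub <- hclust(dist(sub_expr), method = "average")
plot(hc_sub, cex = 0.6, main = paste("Genes in group", biggest))
```

If the labels in the sub-tree are still too dense, steps 2 to 4 can be repeated on `sub_expr` until the gene names become legible.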

written 19.8 years ago by michael watson IAH-C ★ 3.4k

Thanks a lot, Michael!

I cc to R-help, where this question really belongs {as the 'Subject' suggests itself...} -- please drop 'bioconductor' from CC'ing further replies.

>>>>> "michael" == michael watson (IAH-C) <michael.watson@bbsrc.ac.uk>
>>>>> on Thu, 17 Jun 2004 09:16:59 +0100 writes:

    michael> OK, admittedly it is not incredibly simple, but it
    michael> is not *that* difficult.

    michael> If you are familiar with R, it should take you an
    michael> hour or two; if unfamiliar, perhaps a day or two.

    michael> The commands you want (and need to read the help on) are:

    michael> hclust
    michael> plclust
    michael> cutree

and I would add identify.hclust() {and rect.hclust()}, a very neat but not known / used enough function, a link to which I have just added to the help(hclust) page. Look at its examples {not with example(), since they are "dontrun"}, correcting the extraneous "." in the last (and coolest!) example!

    michael> dendrogram
    michael> as.dendrogram
    michael> heatmap

where you use "dendrogram"s produced from "hclust" objects via

    as.dendrogram(<hc-obj>)

or also "twins" objects produced from package cluster's agnes() or diana() via

    as.dendrogram(as.hclust( <twins-obj> ) )

help(dendrogram) also mentions "[[" (and shows examples) and cut() for cutting dendrograms, and shows how you can split a dendrogram into its parts.

    michael> With intelligent use of hclust -> cutree -> subsetting -> hclust
    michael> (in that order) you will be able to drill down
    michael> into your dendrogram and create sub-trees - until
    michael> you get to the level where you can see your gene
    michael> names.

or also hclust -> as.dendrogram -> cut -> .. -> [[ ->

Note that there also is reorder.dendrogram() for reordering dendrogram nodes ``sensibly'' --- something that heatmap() does, but you can play with quite a bit.

Further, note Catherine Hurley's "gclus" package, which orders/reorders 'hclust' objects directly, but with a more interesting algorithm.

Note that I'd strongly recommend using R 1.9.1 beta for these, since I know which bugs in the dendrogram code I have fixed since R 1.9.0...

    michael> An important message to take home here is that if
    michael> you have 14000 genes and therefore 14000 labels,
    michael> it's going to be difficult to display your tree in
    michael> ANY software, including the expensive commercial products.

Not showing the labels, and using identify.hclust() and the command line to extract the indices of observations in clusters (and subclusters) and visualize them in other, non-dendrogram plots, might well be feasible.

    michael> Let me know how you get on

    michael> Thanks
    michael> Mick
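
The hclust -> as.dendrogram -> cut -> [[ route, together with label-free plotting and a heatmap of a single branch, might be sketched as follows. The data are simulated and the names (`expr`, `gene1`, ...), the average linkage and the choice of four branches are all assumptions for illustration. identify.hclust() is interactive, so the sketch uses rect.hclust() to get cluster memberships without mouse input.

```r
# Simulated stand-in data: 150 genes by 6 arrays (hypothetical names).
set.seed(2)
expr <- matrix(rnorm(150 * 6), nrow = 150,
               dimnames = list(paste0("gene", 1:150), paste0("array", 1:6)))

hc <- hclust(dist(expr), method = "average")

# Plot the full tree without labels, then box and extract 4 clusters.
plot(hc, labels = FALSE, hang = -1)
clusters <- rect.hclust(hc, k = 4)   # list of index vectors, one per cluster

# Convert to a dendrogram and cut it between the 3rd and 4th highest
# merges, which leaves four branches below the cut.
dend <- as.dendrogram(hc)
ht <- sort(hc$height, decreasing = TRUE)
parts <- cut(dend, h = mean(ht[3:4]))   # list with $upper and $lower

# "[[" walks the tree directly: dend[[1]] is the first branch below the root.
first_branch <- dend[[1]]
n_first <- length(labels(first_branch))

# Take the largest branch and pull out its genes in leaf order.
sizes <- sapply(parts$lower, function(b) length(labels(b)))
branch <- parts$lower[[which.max(sizes)]]
genes <- labels(branch)

# Heatmap and profile plot for just those genes; Rowv/Colv = NA keeps
# the rows and columns in the order supplied.
heatmap(expr[genes, , drop = FALSE], Rowv = NA, Colv = NA, scale = "row")
matplot(t(expr[genes, , drop = FALSE]), type = "l", lty = 1,
        xlab = "array", ylab = "expression")
```

Passing the subset matrix in leaf order with Rowv = NA avoids handing heatmap() a sub-dendrogram whose stored leaf indices still refer to rows of the full matrix.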

written 19.8 years ago by Martin Maechler ▴ 330
