Fwd: How to decide which distance metric to use for microarray data clustering?

Peng Yu ▴ 940

@peng-yu-3586

Last seen 9.7 years ago

On Wed, Oct 7, 2009 at 10:04 AM, Sean Davis <seandavi at gmail.com> wrote:
>
> On Wed, Oct 7, 2009 at 10:49 AM, Peng Yu <pengyu.ut at gmail.com> wrote:
>>
>> Besides the distance metrics, there are other things that may also be
>> important. For example, multiple probesets map to a same gene. I can
>> do clustering on probeset values or on averaged probeset values of
>> genes. What factors should I consider when I make this kind of
>> decisions?
>>
>
> It is generally best not to average probes. You could choose one to be
> representative of each gene, but averaging is not the best way to go.

Is there any justification why it is not good to average probes?

>> bioDist says something about two popular metrics, but the description
>> is distilled. I am wondering whether there are some more detailed
>> comparisons between metrics.
>
> Often, the metrics produce highly compatible pictures of the data. The
> actual metric you will use may be directed somewhat by the goals of the
> analysis but, at least for hierarchical clustering, I think it is difficult
> to argue for one "best" or "recommended" metric.
>
> In practice, you may want to try a few to see how they behave on your data.

If different metrics give different results, how do I decide which one
I should use?

>> On Wed, Oct 7, 2009 at 12:35 AM, Tim Triche <tim.triche at gmail.com> wrote:
>> > look at the bioDist package for some suggestions.
>> >
>> > the metric to use depends on your task.
>> >
>> >
>> > On Tue, Oct 6, 2009 at 8:52 PM, Peng Yu <pengyu.ut at gmail.com> wrote:
>> >>
>> >> Hi,
>> >>
>> >> I am looking for the most appropriate distance metric for the
>> >> clustering of a set of microarray data. I read Chapter 12 of
>> >> Bioinformatics and Computational Biology Solutions Using R and
>> >> Bioconductor, but I'm still not clear what the general guideline is
>> >> for choosing an appropriate distance metric out of the many listed in
>> >> that chapter. Could somebody let me know how to choose an appropriate
>> >> distance metric?
>> >>
>> >> Regards,
>> >> Peng
>> >>
>> >> _______________________________________________
>> >> Bioconductor mailing list
>> >> Bioconductor at stat.math.ethz.ch
>> >> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> >> Search the archives:
>> >> http://news.gmane.org/gmane.science.biology.informatics.conductor
>> >
>> >
>> > --
>> > Statisticians, like artists, have a bad habit of falling in love with
>> > their models.
>> > --George Box
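To make "try a few and see how they behave" concrete, here is a minimal sketch in plain Python (not R or bioDist) that compares Euclidean distance with a correlation-based distance on three made-up expression profiles. The gene names, the values, and the helper functions are all hypothetical and exist only for illustration. The point is that the two metrics answer different questions: Euclidean distance groups profiles with similar absolute levels, while 1 minus the Pearson correlation groups profiles with a similar shape across samples.

```python
import math


def euclidean(x, y):
    """Euclidean distance: sensitive to absolute expression level."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_distance(x, y):
    """1 - Pearson correlation: sensitive to profile shape, not level."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)


def nearest_neighbour(profiles, name, dist):
    """Name of the profile closest to `name` under the metric `dist`."""
    others = (k for k in profiles if k != name)
    return min(others, key=lambda k: dist(profiles[name], profiles[k]))


# Hypothetical expression profiles over four samples (invented values).
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [10.0, 20.0, 30.0, 40.0],  # same shape as geneA, 10x the level
    "geneC": [1.5, 1.5, 3.5, 3.0],      # similar level to geneA, other shape
}

nn_euclid = nearest_neighbour(profiles, "geneA", euclidean)
nn_corr = nearest_neighbour(profiles, "geneA", pearson_distance)
print("Nearest to geneA (Euclidean):  ", nn_euclid)
print("Nearest to geneA (correlation):", nn_corr)
```

Under these assumptions the two metrics disagree about which gene is closest to geneA, and neither is wrong. Which one to prefer depends on whether co-regulation (shape) or absolute abundance (level) matters for the analysis, which is consistent with the advice that the metric depends on the task.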

Microarray Clustering bioDist • 1.1k views

updated 14.6 years ago by Steve Lianoglou ★ 13k • written 14.6 years ago by Peng Yu ▴ 940

Steve Lianoglou ★ 13k

@steve-lianoglou-2771

Last seen 14 months ago

United States

Hi Peng,

On Oct 7, 2009, at 11:54 AM, Peng Yu wrote:

> On Wed, Oct 7, 2009 at 10:04 AM, Sean Davis <seandavi at gmail.com> wrote:
>>
>> On Wed, Oct 7, 2009 at 10:49 AM, Peng Yu <pengyu.ut at gmail.com> wrote:
>>>
>>> Besides the distance metrics, there are other things that may also be
>>> important. For example, multiple probesets map to a same gene. I can
>>> do clustering on probeset values or on averaged probeset values of
>>> genes. What factors should I consider when I make this kind of
>>> decisions?
>>>
>>
>> It is generally best not to average probes. You could choose one to be
>> representative of each gene, but averaging is not the best way to go.
>
> Is there any justification why it is not good to average probes?

There is a very informative discussion that touches this topic on the
BioC list from back in April 2009. I have it flagged with the intention
of going back to it to work out some examples myself, but alas, haven't
yet done so.

Anyway, this is the thread:

http://thread.gmane.org/gmane.science.biology.informatics.conductor/22758

While I recommend you read the whole thing, if you go ~9 messages deep,
you'll find a post by James MacDonald (April 24th) with the following
comment:

"""Yes. You are missing the fact that the data from Affy probes usually are
not normally distributed. In fact, it is not uncommon for a given
probeset to have widely divergent intensity levels for its component
probes. Because of the fact that the mean is not robust to outliers,
people long ago abandoned methods based on a normal distribution."""

Hope that's helpful,
-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
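The remark that the mean is not robust to outliers can be illustrated with a small sketch in plain Python. The intensities below are invented log2-scale values for the component probes of one hypothetical probeset, with a single divergent probe. They are not taken from any real array, and real summarization methods are considerably more involved than a bare mean or median.

```python
import statistics

# Hypothetical log2 intensities for the probes of one probeset.
# The last probe diverges widely from the rest (invented values).
probe_intensities = [7.9, 8.1, 8.0, 8.2, 7.8, 8.0, 8.1, 7.9, 8.0, 8.0, 14.0]
clean = probe_intensities[:-1]  # the same probes without the outlier

# How far does one divergent probe move each summary?
mean_shift = statistics.mean(probe_intensities) - statistics.mean(clean)
median_shift = statistics.median(probe_intensities) - statistics.median(clean)

print(f"Shift in mean caused by one outlier:   {mean_shift:.3f}")
print(f"Shift in median caused by one outlier: {median_shift:.3f}")
```

With these made-up numbers a single divergent probe moves the mean by about half a log2 unit and leaves the median where it was. Note that this concerns probes within a probeset, which is what the quoted comment addresses; whether the same reasoning carries over to combining several probesets of one gene is a separate question.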

written 14.6 years ago by Steve Lianoglou ★ 13k

On Wed, Oct 7, 2009 at 11:13 AM, Steve Lianoglou <mailinglist.honeypot at gmail.com> wrote:
> Hi Peng,
>
> On Oct 7, 2009, at 11:54 AM, Peng Yu wrote:
>
>> On Wed, Oct 7, 2009 at 10:04 AM, Sean Davis <seandavi at gmail.com> wrote:
>>>
>>> On Wed, Oct 7, 2009 at 10:49 AM, Peng Yu <pengyu.ut at gmail.com> wrote:
>>>>
>>>> Besides the distance metrics, there are other things that may also be
>>>> important. For example, multiple probesets map to a same gene. I can
>>>> do clustering on probeset values or on averaged probeset values of
>>>> genes. What factors should I consider when I make this kind of
>>>> decisions?
>>>>
>>>
>>> It is generally best not to average probes. You could choose one to be
>>> representative of each gene, but averaging is not the best way to go.
>>
>> Is there any justification why it is not good to average probes?
>
> There is a very informative discussion that touches this topic on the BioC
> list from back in April 2009. I have it flagged with the intention of going
> back to it to work out some examples myself, but alas, haven't yet done so.
>
> Anyway, this is the thread:
>
> http://thread.gmane.org/gmane.science.biology.informatics.conductor/22758
>
> While I recommend you read the whole thing, if you go ~9 Messages deep,
> you'll find a post by James MacDonald (April 24th) with the following
> comment:
>
> """Yes. You are missing the fact that the data from Affy probes usually are
> not normally distributed. In fact, it is not uncommon for a given
> probeset to have widely divergent intensity levels for its component
> probes. Because of the fact that the mean is not robust to outliers,
> people long ago abandoned methods based on a normal distribution."""

Then I can use median instead of mean for all the probesets of a gene,
right? But the choice of probeset level vs. gene level is still
arbitrary to me. Is there a guideline on when probeset level data
should be used and when gene level data should be used?

Regards,
Peng
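For readers weighing the two options raised in this thread, the following plain-Python sketch shows both ways of going from probeset-level to gene-level data: taking the per-sample median across a gene's probesets, and choosing one representative probeset per gene. Everything here is hypothetical: the probeset IDs, gene names, and values are invented, and the rule used to pick the representative (highest variance across samples) is only an assumption for illustration, since the thread does not say how a representative should be chosen.

```python
import statistics
from collections import defaultdict

# Hypothetical probeset-to-gene mapping and expression values (4 samples).
probeset_to_gene = {"ps1": "GENE1", "ps2": "GENE1", "ps3": "GENE2"}
expr = {
    "ps1": [5.0, 5.1, 4.9, 5.0],
    "ps2": [6.0, 8.0, 6.5, 9.0],
    "ps3": [7.0, 7.2, 6.8, 7.1],
}


def group_by_gene(mapping):
    """Invert the probeset -> gene mapping into gene -> list of probesets."""
    groups = defaultdict(list)
    for probeset, gene in mapping.items():
        groups[gene].append(probeset)
    return dict(groups)


def median_per_gene(expr, mapping):
    """Gene-level profile: per-sample median over the gene's probesets."""
    result = {}
    for gene, probesets in group_by_gene(mapping).items():
        per_sample = zip(*(expr[ps] for ps in probesets))
        result[gene] = [statistics.median(values) for values in per_sample]
    return result


def representative_probeset(expr, mapping):
    """Pick one probeset per gene; the highest-variance rule is an assumption."""
    return {
        gene: max(probesets, key=lambda ps: statistics.pvariance(expr[ps]))
        for gene, probesets in group_by_gene(mapping).items()
    }


gene_level = median_per_gene(expr, probeset_to_gene)
reps = representative_probeset(expr, probeset_to_gene)
print("Median per gene:        ", gene_level)
print("Representative probeset:", reps)
```

In this toy example the two probesets for GENE1 behave quite differently, so the median profile resembles neither of them. That is one way to see why collapsing to gene level is a judgement call rather than a mechanical step; the sketch does not settle which level to cluster on.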

written 14.6 years ago by Peng Yu ▴ 940

Hi,

On Oct 7, 2009, at 12:31 PM, Peng Yu wrote:

> On Wed, Oct 7, 2009 at 11:13 AM, Steve Lianoglou <snip>
>> There is a very informative discussion that touches this topic on the BioC
>> list from back in April 2009. I have it flagged with the intention of going
>> back to it to work out some examples myself, but alas, haven't yet done so.
>>
>> Anyway, this is the thread:
>>
>> http://thread.gmane.org/gmane.science.biology.informatics.conductor/22758
>>
>> While I recommend you read the whole thing, if you go ~9 Messages deep,
>> you'll find a post by James MacDonald (April 24th) with the following
>> comment:
>>
>> """Yes. You are missing the fact that the data from Affy probes usually are
>> not normally distributed. In fact, it is not uncommon for a given
>> probeset to have widely divergent intensity levels for its component
>> probes. Because of the fact that the mean is not robust to outliers,
>> people long ago abandoned methods based on a normal distribution."""
>
> Then I can use median instead of mean for all the probesets of a gene,
> right?

I'm not sure that you'll get a direct answer to this question. It
depends on what you're trying to do, right?

If you can appreciate what Sean mentioned earlier, and some of the
things that came up in that thread I linked to, then you would be in a
better position to (i) make a judgement call yourself, and (ii) justify
it if someone wonders why you did what you did.

> But the choice of probeset level vs. gene level is still
> arbitrary to me.

Do you understand the difference between the two? Some figures (and
perhaps even the text) in here help:

http://www.biomedcentral.com/1471-2105/7/276

Just fished out a sentence from the discussion section that you might
find disheartening, given your hunt to find meaning in clustering:

"""For this reason, particular care must be taken when analysing
expression data using correlation-based approaches"""

> Is there a guideline on when probeset level data
> should be used and when gene level data should be used?

There's a whole mess load of papers dealing with:

1. microarrays
2. their design
3. the problems with their design
4. how to normalize them considering (2) and (3)
5. the flaws in papers dealing with (4)
6. why a different type of microarray is needed (double vs. single channel)
7. go to 2
... etc ...

Now imagine for a moment that there was such a guideline that you're
asking for, what kind of info would be in it? Perhaps equally important
given the pseudo-list I made above: what info would you exclude?

I think you're looking for easy answers to difficult problems (e.g. "I
can just use the median, right?"). As I said before, I don't think
you'll get any, sorry[1].

As mentioned above, I'm guessing the best you can do is to try to
appreciate issues dealing with microarray data and make an informed
decision.

HTH,
-steve

[1] Although it would be great if some seasoned practitioner will chime
in on the contrary, at which point I'd gladly eat my hat.

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

written 14.6 years ago by Steve Lianoglou ★ 13k
