How to decide which distance metric to use for microarray data clustering?


Peng Yu ▴ 940

@peng-yu-3586

Last seen 9.6 years ago

Hi,

I am looking for the most appropriate distance metric for clustering a set of microarray data. I read Chapter 12 of Bioinformatics and Computational Biology Solutions Using R and Bioconductor, but I'm still not clear on what the general guideline is for choosing among the many metrics listed in that chapter. Could somebody let me know how to choose an appropriate distance metric?

Regards,
Peng

Microarray Microarray • 1.7k views

ADD COMMENT • link updated 14.5 years ago by Sean Davis 21k • written 14.5 years ago by Peng Yu ▴ 940


anna freni sterrantino ▴ 120

@anna-freni-sterrantino-2847

Last seen 9.6 years ago

Hi Peng,

As far as your question is concerned, I think you can choose any metric whose results you can interpret easily once the analysis is done. For clustering microarray genes, though, a good solution is to cluster on a correlation measure (I believe this is also suggested in the book you mention, which I don't have at hand), because correlation has a straightforward interpretation. You may also want to take a look at R packages such as pamr.

Hope it helps.

Cheers,
A

Anna Freni Sterrantino
Ph.D. Student, Department of Statistics
University of Bologna, Italy
via Belle Arti 41, 40124 BO

In reply to Peng Yu <pengyu.ut@gmail.com>, message to bioconductor <bioconductor@stat.math.ethz.ch> sent Wed 7 October 2009, 5:52:19, subject: [BioC] How to decide which distance metric to use for microarray data clustering?

ADD COMMENT • link 14.5 years ago anna freni sterrantino ▴ 120
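To make the point about interpretability concrete, here is a minimal sketch (in plain Python rather than R, and not taken from the book or from pamr) comparing a correlation-based distance, defined here as 1 - r, with Euclidean distance. The gene profiles and function names are made up for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_distance(x, y):
    """1 - r: near 0 for profiles with the same shape, near 2 for mirrored ones."""
    return 1.0 - pearson(x, y)

def euclidean_distance(x, y):
    """Straight-line distance; sensitive to expression level and amplitude."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical log-expression profiles across five samples.
gene_a = [1.0, 2.0, 3.0, 2.0, 1.0]
gene_b = [6.0, 8.0, 10.0, 8.0, 6.0]  # same shape as gene_a, higher level and amplitude
gene_c = [3.0, 2.0, 1.0, 2.0, 3.0]   # mirror image of gene_a at a similar level

for name, other in [("gene_b", gene_b), ("gene_c", gene_c)]:
    print(name,
          "correlation distance:", round(correlation_distance(gene_a, other), 3),
          "Euclidean distance:", round(euclidean_distance(gene_a, other), 3))
```

On these toy profiles the correlation distance treats gene_b as the close match to gene_a (same pattern across samples), while Euclidean distance ranks gene_c closer (similar absolute level). Which behaviour is wanted depends on whether co-regulation patterns or absolute expression levels matter for the analysis.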


Peng Yu ▴ 940

@peng-yu-3586

Last seen 9.6 years ago

Besides the distance metric, other things may also be important. For example, multiple probesets can map to the same gene. I can cluster on the individual probeset values or on probeset values averaged per gene. What factors should I consider when making this kind of decision?

bioDist says something about two popular metrics, but the description is terse. I am wondering whether there are more detailed comparisons between metrics.

On Wed, Oct 7, 2009 at 12:35 AM, Tim Triche <tim.triche at gmail.com> wrote:
> Look at the bioDist package for some suggestions.
>
> The metric to use depends on your task.

ADD COMMENT • link 14.5 years ago Peng Yu ▴ 940


On Wed, Oct 7, 2009 at 10:49 AM, Peng Yu <pengyu.ut@gmail.com> wrote:
> Besides the distance metrics, there are other things that may also be important. For example, multiple probesets map to the same gene. I can do clustering on probeset values or on averaged probeset values of genes. What factors should I consider when I make this kind of decision?

It is generally best not to average probes. You could choose one to be representative of each gene, but averaging is not the best way to go.

> bioDist says something about two popular metrics, but the description is distilled. I am wondering whether there are some more detailed comparisons between metrics.

Often, the metrics produce highly compatible pictures of the data. The actual metric you use may be directed somewhat by the goals of the analysis but, at least for hierarchical clustering, I think it is difficult to argue for one "best" or "recommended" metric.

In practice, you may want to try a few to see how they behave on your data.

Sean

ADD REPLY • link 14.5 years ago Sean Davis 21k
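One simple way to "try a few and see how they behave" is to check how often different metrics agree on each gene's nearest neighbour. The sketch below is only an illustration in plain Python (not bioDist, and not a method proposed in this thread); the profiles, the three metrics chosen, and all function names are assumptions.

```python
import math
from itertools import combinations

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def correlation_distance(x, y):
    """1 - Pearson r; undefined for constant profiles, which the toy data avoids."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 1.0 - sxy / math.sqrt(sxx * syy)

def nearest_neighbours(profiles, metric):
    """Map each gene name to its closest other gene under the given metric."""
    nearest = {}
    for name, x in profiles.items():
        candidates = [(metric(x, y), other)
                      for other, y in profiles.items() if other != name]
        nearest[name] = min(candidates)[1]
    return nearest

# Hypothetical expression profiles (four samples per gene).
profiles = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [1.2, 2.1, 3.3, 3.9],
    "g3": [5.0, 6.1, 6.9, 8.2],
    "g4": [4.0, 3.0, 2.2, 1.0],
    "g5": [8.0, 7.1, 6.0, 5.2],
}
metrics = {
    "euclidean": euclidean,
    "manhattan": manhattan,
    "correlation": correlation_distance,
}

neighbours = {m: nearest_neighbours(profiles, f) for m, f in metrics.items()}
for m1, m2 in combinations(metrics, 2):
    agree = sum(neighbours[m1][g] == neighbours[m2][g] for g in profiles)
    print(f"{m1} vs {m2}: same nearest neighbour for {agree} of {len(profiles)} genes")
```

High agreement suggests the choice of metric matters little for that data set. Where the metrics disagree, the disagreeing genes are a good place to look at what each metric is actually rewarding (level versus shape, for instance).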


Sean Davis 21k

@sean-davis-490

Last seen 3 months ago

United States

On Wed, Oct 7, 2009 at 11:53 AM, Peng Yu <pengyu.ut at gmail.com> wrote:
> Is there any justification why it is not good to average probes?

It is pretty simple, actually. Different probes for the same gene do not measure the same thing. In statistical terms, they are not drawn from the same distribution.

> If the results by different metrics are different, how do I decide which one I should use?

If you have a gold standard or another source of information about how samples/genes should be measured, you can justify your choice based on whether the subjects that are supposed to be most similar actually are. Lacking such information, there are other techniques, such as looking at cluster stability under resampling, that might be useful to think about. Others might have more concrete suggestions about how to go about measuring clustering effectiveness; it is a research topic of its own.

Sean

ADD COMMENT • link 14.5 years ago Sean Davis 21k
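As a rough illustration of "cluster stability under resampling", the sketch below bootstraps the genes, re-clusters the samples each time, and records how often each pair of samples ends up in the same cluster. It is plain Python with a small average-linkage routine written for this example; the data, the choice of k = 2, and every function name are assumptions, not something specified in the thread.

```python
import math
import random
from itertools import combinations

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def agglomerate(points, k, metric):
    """Average-linkage agglomerative clustering down to k clusters.
    Returns a list of clusters, each a list of row indices."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            total = sum(metric(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
            d = total / (len(clusters[a]) * len(clusters[b]))
            if best is None or d < best[0]:
                best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge b into a (a < b)
        del clusters[b]
    return clusters

def co_clustering_frequency(samples, k, metric, n_boot=100, seed=0):
    """Fraction of bootstrap replicates in which each pair of samples shares
    a cluster. Each replicate resamples the genes (columns) with replacement."""
    rng = random.Random(seed)
    n_genes = len(samples[0])
    counts = {pair: 0 for pair in combinations(range(len(samples)), 2)}
    for _ in range(n_boot):
        cols = [rng.randrange(n_genes) for _ in range(n_genes)]
        resampled = [[s[c] for c in cols] for s in samples]
        for cluster in agglomerate(resampled, k, metric):
            for pair in combinations(sorted(cluster), 2):
                counts[pair] += 1
    return {pair: c / n_boot for pair, c in counts.items()}

# Hypothetical data: 6 samples (rows) x 8 genes (columns), two clear groups.
samples = [
    [1.0, 1.2, 0.9, 1.1, 5.0, 5.2, 4.9, 5.1],
    [1.1, 1.0, 1.0, 1.2, 5.1, 4.8, 5.0, 5.3],
    [0.9, 1.1, 1.2, 1.0, 4.9, 5.0, 5.2, 4.8],
    [5.0, 5.1, 4.8, 5.2, 1.0, 1.1, 0.9, 1.2],
    [5.2, 4.9, 5.0, 5.1, 1.2, 0.9, 1.0, 1.1],
    [4.9, 5.0, 5.1, 4.8, 1.1, 1.0, 1.2, 0.9],
]

freq = co_clustering_frequency(samples, k=2, metric=euclidean)
for pair, f in sorted(freq.items()):
    print(pair, f)
```

Pairs with a frequency near 1 or near 0 are assigned consistently across replicates; intermediate values flag unstable assignments. Running the same procedure with a different `metric` gives one way to compare metrics when no gold standard is available.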


On Wed, Oct 7, 2009 at 11:06 AM, Sean Davis <seandavi at gmail.com> wrote:
> Lacking such information, there are other techniques such as looking at the cluster stability under resampling that might be useful to think about. Others might have more concrete suggestions about how to go about measuring clustering effectiveness; it is a research topic of its own.

Do you have a good reference so that I can trace the current research frontier?

ADD REPLY • link 14.5 years ago Peng Yu ▴ 940
