Correct use of a distance measure when clustering gene expression data
@michael-watson-iah-c-378
Hi,

I have two different data sets, both time-courses. One uses a common reference for the Cy3 channel; the other performs direct comparisons between treated and untreated samples at each time-point. In both cases the actual data is log2(Cy5/Cy3).

After a bit of thought, I've come to the conclusion that as a distance measure for the first dataset I will use "1 - Pearson correlation coefficient". However, for the second dataset, as we performed direct comparisons at each time-point, using the correlation coefficient is not appropriate, so I have decided to use Euclidean distance.

Does anyone have experience of what the best distance measure is for time-courses where direct comparisons are made at each time-point?

Cheers
Mick
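For reference, the two measures named in the question can be written down in a few lines. This is only a minimal Python sketch: the thread contains no code, so the language, the function names (`pearson_distance`, `euclidean_distance`, `distance_matrix`) and the example values are all assumptions made for illustration, not anything the poster used.

```python
import math


def pearson_distance(x, y):
    """Return 1 - Pearson correlation of two equal-length profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dev_x = [a - mean_x for a in x]
    dev_y = [b - mean_y for b in y]
    cov = sum(a * b for a, b in zip(dev_x, dev_y))
    norm = math.sqrt(sum(a * a for a in dev_x) * sum(b * b for b in dev_y))
    if norm == 0:
        # A completely flat profile has no defined correlation.
        raise ValueError("correlation is undefined for a constant profile")
    return 1.0 - cov / norm


def euclidean_distance(x, y):
    """Return the straight-line distance between two equal-length profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def distance_matrix(profiles, metric):
    """Build the symmetric matrix of pairwise distances between genes."""
    n = len(profiles)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = metric(profiles[i], profiles[j])
    return matrix


# Made-up log2(Cy5/Cy3) ratios: one row per gene, one column per time-point.
profiles = [
    [0.1, 0.8, 1.9, 1.1],
    [0.2, 1.7, 3.8, 2.1],
    [1.0, 0.4, -0.3, -0.9],
]

correlation_based = distance_matrix(profiles, pearson_distance)
euclidean_based = distance_matrix(profiles, euclidean_distance)
```

Either matrix could then be handed to whatever hierarchical clustering routine is in use. Note that "1 - correlation" ranges from 0 (perfectly parallel) to 2 (perfectly anti-correlated), whereas Euclidean distance has no upper bound.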
Floor Stam ▴ 90
@floor-stam-582
Hi Mick,

I think it depends on the kind of similarity that is important to you.

- If you think it is important that genes that show parallel profiles are clustered together, use Pearson's correlation coefficient. In this case two genes that peak at the same moment in time, but at a (very) different height, will be found in the same cluster.
- If, on the other hand, you think it is important that genes with a similar extent of regulation are clustered together, use Euclidean distance. This clusters together genes whose peaks occur at roughly the same height, but whose profiles are not necessarily parallel.

So it depends on your question. For time-course data, I'd say Pearson's correlation coefficient gives more relevant results. We don't really know how much of a gene product is necessary for a biological effect anyway. Moreover, the amount of active protein in a cell depends on a lot more than just the number of mRNA molecules, and we have no way of looking at that with a microarray. So I think the shape of the curves is more important than the amplitude.

Furthermore, if I were you, I would subtract the log values of the ref-t0 comparison from all other ref-tx comparisons in your first dataset, so that the values in your two datasets are comparable and reflect gene regulation relative to time-point 0. It would make it easier to get your head around what the numbers on your screen actually mean.

This is all from a biologist, so consult a mathematician as well! Hope this is of use to you.
Floor

Floor Stam
Vrije Universiteit Amsterdam
Faculty of Earth and Life Sciences
Department of Molecular and Cellular Neurobiology
De Boelelaan 1085, 1081HV Amsterdam, The Netherlands
Ph: +31-20-4447114, +31-20-5665512
Fax: +31-20-4447112
e-mail: fjstam@bio.vu.nl

(In reply to the message from michael watson (IAH-C) of 2 Sep 2004, 17:38.)
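To make the distinction in the reply above concrete, here is a small Python sketch of the shape-versus-amplitude contrast and of the suggested t0 subtraction. The three profiles, the reference-design values and the helper name `subtract_t0` are invented for illustration, and the thread itself specifies no code or language.

```python
import math


def pearson_distance(x, y):
    """Return 1 - Pearson correlation of two equal-length profiles."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return 1.0 - cov / (sd_x * sd_y)


def euclidean_distance(x, y):
    """Return the straight-line distance between two equal-length profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def subtract_t0(profile):
    """Express a reference-design profile relative to its first time-point."""
    return [value - profile[0] for value in profile]


# Made-up log2 ratios at four time-points.
low_peak = [0.0, 0.5, 1.0, 0.5]
high_peak = [0.0, 2.0, 4.0, 2.0]   # same shape as low_peak, four times the height
late_peak = [0.0, 0.5, 0.5, 1.0]   # similar height to low_peak, different timing

# Correlation-based distance puts low_peak next to high_peak (parallel shapes).
shape_same = pearson_distance(low_peak, high_peak)
shape_diff = pearson_distance(low_peak, late_peak)

# Euclidean distance puts low_peak next to late_peak (similar magnitudes).
size_diff = euclidean_distance(low_peak, high_peak)
size_same = euclidean_distance(low_peak, late_peak)

# Hypothetical common-reference profile, re-expressed relative to t0.
reference_profile = [1.2, 1.7, 3.2, 2.2]
relative_to_t0 = subtract_t0(reference_profile)
```

With these numbers the correlation-based distance treats `low_peak` and `high_peak` as near-identical, while Euclidean distance pairs `low_peak` with `late_peak` instead. One property worth keeping in mind: subtracting a gene's own t0 value shifts its whole profile by a constant, which leaves the Pearson correlation between genes unchanged but does change Euclidean distances.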