significance of "wrong" clustering of differential genes
Benjamin Otto ▴ 830
@benjamin-otto-1519
Last seen 9.6 years ago
Hi,

Please imagine the following situation: for two sample sets (set1, set2), the most differentially expressed genes are identified with limma, using "holm" as the p-value correction. Afterwards a heatmap is drawn for these genes. The procedure would look like this:

> f <- factor(as.character(pheno[,marker]))
> design <- model.matrix(~f)
> fit <- eBayes(lmFit(eSet,design))
> tab <- topTable(fit, coef=2, number=nrow(eSet), adjust.method="holm")
> selected <- tab$adj.P.Val < 0.01 & abs(tab$M) >= 1
> ## print a heatmap for eSet[selected,]

What can lead to a misclassification in the clustering, say one sample of set1 being clustered together with set2? After all, according to the workflow I have explicitly been searching for the genes that should discriminate between the two sets! However, the expression values displayed in the heatmap suggest that this sample IS more similar to the "wrong" set than to the true one (have a look at the jpg).

Is it possible that this sample is always treated as an outlier in the significance calculations?

And if so: is it sensible to take such a misclassification as a kind of significance?

Regards
Benjamin

--
Benjamin Otto
Universitaetsklinikum Eppendorf Hamburg
Institut fuer Klinische Chemie
Martinistrasse 52
20246 Hamburg
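One detail in a workflow like the one above may be worth checking before interpreting the heatmap. topTable() returns its rows sorted by significance by default, so a logical vector computed from `tab` is in the order of `tab`, not in the order of `eSet`. The sketch below uses base R only and made-up values (the gene names, p-values, and the `expr` matrix are all hypothetical stand-ins, not limma output) to show how positional indexing and indexing by row name can pick different genes.

```r
# Hypothetical p-values for six genes, in the original row order of the data.
genes <- paste0("gene", 1:6)
pvals <- c(0.50, 0.001, 0.30, 0.002, 0.90, 0.20)
names(pvals) <- genes

# Stand-in expression matrix in the original gene order (rows = genes).
expr <- matrix(0, nrow = 6, ncol = 4,
               dimnames = list(genes, paste0("s", 1:4)))

# A results table sorted by p-value, mimicking a table sorted by significance.
sorted <- sort(pvals)
tab <- data.frame(p = unname(sorted), row.names = names(sorted))

# Logical selection vector: it follows the row order of tab, not of expr.
selected <- tab$p < 0.01

# Positional indexing applies tab's order to expr's rows (misaligned).
wrong <- rownames(expr[selected, , drop = FALSE])
# Indexing by row name keeps table and matrix aligned.
right <- rownames(tab)[selected]

print(wrong)
print(right)
```

If the real script subsets the expression data by the row names (or IDs) of the selected table rows, as in `expr[right, ]`, this issue does not arise.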
Clustering limma • 1.3k views
Naomi Altman ★ 6.0k
@naomi-altman-380
Last seen 2.9 years ago
United States
The heatmap did not come through (to me). However, clustering is highly dependent on the choice of distance measure.

--Naomi

At 09:57 AM 11/13/2006, Benjamin Otto wrote:
>What can lead to a misclassification in the clustering, say one sample of
>set1 is clustered together with set2?

Naomi S. Altman
Associate Professor
Dept. of Statistics
Penn State University
University Park, PA 16802-2111
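To illustrate the point about distance measures, here is a minimal sketch in base R with invented values (the matrix `x` and its sample names are hypothetical, not the poster's data). Sample `a3` has the same expression pattern as the other "a" samples but at a much lower amplitude. Euclidean distance places it with the "b" samples, while correlation distance places it with the "a" samples.

```r
# Toy centered log-expression values: 6 genes (rows) x 5 samples (columns).
pattern <- c(1, 1, 1, -1, -1, -1)
x <- cbind(a1 =  2.0 * pattern,
           a2 =  2.2 * pattern,
           a3 =  0.2 * pattern,   # group "a" direction, weak amplitude
           b1 = -0.5 * pattern,
           b2 = -0.6 * pattern)

# Complete-linkage clustering of samples on Euclidean distance.
cl_euc <- cutree(hclust(dist(t(x))), k = 2)

# Same clustering on correlation distance (1 - Pearson correlation).
cl_cor <- cutree(hclust(as.dist(1 - cor(x))), k = 2)

# One row per distance measure, one column per sample.
print(rbind(euclidean = cl_euc, correlation = cl_cor))
```

Which of the two answers is "right" depends on whether the magnitude of expression or only its pattern across genes should count, which is why the distance used for the heatmap matters when judging an apparently misplaced sample.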
In addition to Naomi's comments, remember that a desired property of a statistic is that it be "robust" to outliers (ignoring them when appropriate). I think it is probably fine to have some proportion of the samples "misclassified" by your clustering.

However, when this happens, it is a good idea to make sure that a sample mislabeling or some such thing has not occurred. I have discovered an adult sample in what were supposed to be pediatric samples, a mouse cell line among what were supposed to be all canine, and other oddities like that by looking back at data. Most of the time, though, these samples simply represent biological or technical variation that we cannot fully explain.

Sean

On Monday 13 November 2006 16:02, Naomi Altman wrote:
> The heatmap did not come through (to me). However, clustering is
> highly dependent on the choice of distance measure.
>
> --Naomi
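The robustness point can be sketched with simulated data. The example below is only an illustration under stated assumptions: it uses base R with ordinary Welch t-tests (`t.test`) in place of limma's moderated statistics, and every number in it (200 genes, 40 truly different genes, 15 samples per set, a shift of 4) is invented. One sample labelled set1 is given the set2 profile. The group-level tests still select genes after Holm correction, and on those genes the odd sample clusters with set2.

```r
set.seed(42)
n_genes <- 200       # total genes (hypothetical)
n_diff <- 40         # genes that truly differ between the sets
n_per_group <- 15
group <- factor(rep(c("set1", "set2"), each = n_per_group))

# Simulated expression matrix: genes in rows, samples in columns.
x <- matrix(rnorm(n_genes * 2 * n_per_group), nrow = n_genes,
            dimnames = list(paste0("g", 1:n_genes),
                            paste0(group, "_", rep(1:n_per_group, 2))))

# Shift the truly differential genes upward in set2.
x[1:n_diff, group == "set2"] <- x[1:n_diff, group == "set2"] + 4

# One sample labelled set1 that actually carries the set2 profile.
odd <- "set1_1"
x[1:n_diff, odd] <- x[1:n_diff, odd] + 4

# Per-gene Welch t-test on the labels as given, then Holm correction.
p <- apply(x, 1, function(g) t.test(g ~ group)$p.value)
selected <- p.adjust(p, method = "holm") < 0.01

# Cluster the samples on the selected genes only.
cl <- cutree(hclust(dist(t(x[selected, , drop = FALSE]))), k = 2)

print(sum(selected))
print(table(label = group, cluster = cl))
```

The tests compare group means, so a single atypical sample mostly inflates the variance of its group without stopping strongly differential genes from being selected. Clustering, by contrast, looks at each sample individually, so the atypical sample is free to land with the other set.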
Hi Naomi,

sorry, the image (37 kB) together with the rest of the mail probably exceeded the 40 kB limit. Here it comes again with higher compression.

Concerning the distance measure, I agree with you. However (and that is why I initially wanted to provide the cluster plot), judging by the expression values I DO agree with the clustering result! And that is what I would not normally expect when clustering genes that were specifically selected as significant...

Benjamin

-----Original Message-----
From: Naomi Altman [mailto:naomi at stat.psu.edu]
Sent: 13 November 2006 22:03
To: Benjamin Otto; 'BioClist'
Subject: Re: [BioC] significance of "wrong" clustering of differential genes

The heatmap did not come through (to me). However, clustering is highly dependent on the choice of distance measure.

--Naomi