DESeq2 pitfalls of rlog transformation
2
0
Entering edit mode
@kristina-m-fontanez-6323
Last seen 10.2 years ago
Dear Bioconductors- After previous discussions on this list, I had settled on using regularized log transformed counts in order to present relative abundance heatmaps/barcharts of my metagenomic data. As this transformation accounts for both differing library sizes and stabilizes the variance among counts, it seemed an appropriate choice. However, now I am running into the issue that the transformed data does not make biological sense. For example, below is a small table with 12 samples and the OTU counts for one taxa which have been either rlog transformed (using the blind option to ignore experimental design) or represent the raw counts. This particular taxa is one of the most abundant in my dataset. As you can see, the rlog transformed counts are much more stable across samples than the raw counts. This is exactly the expected behavior of the rlog transformation in that the variance of counts among samples has been reduced and the library sizes are now much more similar. The problem is that the results do not make biological sense. This particular taxa is a copiotroph and so grows to high abundance in the presence of short bursts of nutrients. That kind of burst of growth is expected in samples 1,3,5, and 7, however samples 9-12 represent seawater samples and so they are nutrient-poor (relatively). This taxa ought to be in low abundance in those samples, which is the case in the raw counts. However, the rlog transformation inflates their counts so that they are now similar to those found in all other samples. This result obscures what is likely the true abundance of this taxa in the water column by inflating it beyond what is biologically reasonable given the environment in which these samples were collected. sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10 sample11 sample12 OTU1 rlog trans 136,280 93,429 131,635 93,351 120,957 93,334 120,358 93,317 93,304 93,310 93,308 93,322 OTU1 raw 468,459 1,190 168,280 330 237,248 94 232,483 195 249 1,647 331 501 With expression data, it seems appropriate to assume that in a treatment vs control scenario, all genes are likely to be expressed at some level (if even a low level) in all samples. However, I do not think that holds true for the abundance of taxa, especially in a case like mine where the samples are not so much different treatments as different ways of collecting microbes. It very well may be that the abundance of a particular taxa drops to a few hundred cells in any given sample because it is simply outcompeted by other microbes - that is a distinctly plausible biological event. The variance stabilization appears to mute that effect thereby making it appear that all taxa appear in all samples at relatively similar abundances. Given these results, is rlog transformation really an appropriate choice when comparing metagenomic data? The variance stabilization seems to drastically change the biological patterns in the dataset. Thanks, Kristina ------------------------------------------------------------------ Kristina Fontanez, Postdoctoral Fellow fontanez@mit.edu<mailto:fontanez@mit.edu> Massachusetts Institute of Technology Department of Civil and Environmental Engineering 48-120E 15 Vassar Street Cambridge, MA 02139 [[alternative HTML version deleted]]
• 7.2k views
ADD COMMENT
0
Entering edit mode
@mikelove
Last seen 7 days ago
United States

hi Kristina,

In my previous email I suggested using the default, blind=TRUE, but I think I was mistaken given your particular kind of data. What is the point of the difference with this argument? If you set blind=TRUE, we are supposing that, for most rows, there are no differences across conditions and we can estimate a global mean-dispersion trend line using the majority of the rows. This fitted line is then used to inform the rlog tranformation. But in your case, the null is clearly inappropriate for the majority of rows. The fitted line is producing very high estimates of dispersion, essentially the algorithm sees all this variation as all noise. Given that the design of ~ 1 is inappropriate for the majority of rows your dataset, I would recommend specifying your true experimental design and setting blind=FALSE. We should find a way to better inform this choice in the documentation.

Mike

ADD COMMENT
0
Entering edit mode
Mike- Thank you for your suggestion to try the regularized log transformation with blind=FALSE. The variance in the blind=FALSE data appears more similar to the untransformed variance. sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10 sample11 sample12 OTU1206 norm_rld_blindtrue 136,280 93,429 131,636 93,351 120,957 93,334 120,358 93,317 93,304 93,310 93,308 93,322 OTU1206 norm_rld_blindfalse 117,964 2,594 107,115 1,816 84,599 1,623 83,377 1,399 1,212 1,295 1,262 1,465 OTU1206 raw 468,459 1,190 168,280 330 237,248 94 232,483 195 249 1,647 331 501 I also plotted the standard deviation of each transformation (blind=true and blind=false) against the mean as described in the DESeq manual to explore the effects of the different transformations on the variance. > jes class: DESeqDataSet dim: 2151 12 exptData(0): assays(1): counts rownames(2151): OTU1 OTU2 ... OTU2150 OTU2151 rowData metadata column names(0): colnames(12): HD5_150L HD5_150D ... HD9_SW_300_22 HD9_SW_500_22 colData names(5): DEPTH TREATMENT Cmg.m2.day Nmg.m2.day Pmg.m2.day > design(jes) ~TREATMENT > jes<-estimateSizeFactors(jes) > jes<-estimateDispersions(jes,fitType="local") gene-wise dispersion estimates mean-dispersion relationship final dispersion estimates > rld_jes<-rlogTransformation(jes,blind=FALSE) > rld_trueblind_jes<-rlogTransformation(jes,blind=TRUE) > par(mfrow=c(1,3)) > notAllZero<-(rowSums(counts(jes))>0) > meanSdPlot(log2(counts(jes,normalized=TRUE)[notAllZero,]+1),ylim=c(0 ,2.5),main="Per taxa Stdev shifted logarithm") > meanSdPlot(assay(rld_jes[notAllZero,]),ylim=c(0,2.5),main="Per taxa SD blind false") > meanSdPlot(assay(rld_trueblind_jes[notAllZero,]),ylim=c(0,2.5),main= "Per taxa SD blind true?) Thank you for this suggestion. Kristina [cid:FC0E5165-A604-4F0D-B469-046A01034EB0 at mit.edu] ------------------------------------------------------------------ Kristina Fontanez, Postdoctoral Fellow fontanez at mit.edu<mailto:fontanez at="" mit.edu=""> Massachusetts Institute of Technology Department of Civil and Environmental Engineering 48-120E 15 Vassar Street Cambridge, MA 02139 On Jan 24, 2014, at 11:51 AM, Michael Love <michaelisaiahlove at="" gmail.com<mailto:michaelisaiahlove="" at="" gmail.com="">> wrote: hi Kristina, In my previous email I suggested using the default, blind=TRUE, but I think I was mistaken given your particular kind of data. What is the point of the difference with this argument? If you set blind=TRUE, we are supposing that, for most rows, there are no differences across conditions and we can estimate a global mean-dispersion trend line using the majority of the rows. This fitted line is then used to inform the rlog tranformation. But in your case, the null is clearly inappropriate for the majority of rows. The fitted line is producing very high estimates of dispersion, essentially the algorithm sees all this variation as all noise. Given that the design of ~ 1 is inappropriate for the majority of rows your dataset, I would recommend specifying your true experimental design and setting blind=FALSE. We should find a way to better inform this choice in the documentation. Mike On Fri, Jan 24, 2014 at 11:12 AM, Kristina M Fontanez <fontanez at="" mit.edu<mailto:fontanez="" at="" mit.edu="">> wrote: Dear Bioconductors- After previous discussions on this list, I had settled on using regularized log transformed counts in order to present relative abundance heatmaps/barcharts of my metagenomic data. As this transformation accounts for both differing library sizes and stabilizes the variance among counts, it seemed an appropriate choice. However, now I am running into the issue that the transformed data does not make biological sense. For example, below is a small table with 12 samples and the OTU counts for one taxa which have been either rlog transformed (using the blind option to ignore experimental design) or represent the raw counts. This particular taxa is one of the most abundant in my dataset. As you can see, the rlog transformed counts are much more stable across samples than the raw counts. This is exactly the expected behavior of the rlog transformation in that the variance of counts among samples has been reduced and the library sizes are now much more similar. The problem is that the results do not make biological sense. This particular taxa is a copiotroph and so grows to high abundance in the presence of short bursts of nutrients. That kind of burst of growth is expected in samples 1,3,5, and 7, however samples 9-12 represent seawater samples and so they are nutrient-poor (relatively). This taxa ought to be in low abundance in those samples, which is the case in the raw counts. However, the rlog transformation inflates their counts so that they are now similar to those found in all other samples. This result obscures what is likely the true abundance of this taxa in the water column by inflating it beyond what is biologically reasonable given the environment in which these samples were collected. sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10 sample11 sample12 OTU1 rlog trans 136,280 93,429 131,635 93,351 120,957 93,334 120,358 93,317 93,304 93,310 93,308 93,322 OTU1 raw 468,459 1,190 168,280 330 237,248 94 232,483 195 249 1,647 331 501 With expression data, it seems appropriate to assume that in a treatment vs control scenario, all genes are likely to be expressed at some level (if even a low level) in all samples. However, I do not think that holds true for the abundance of taxa, especially in a case like mine where the samples are not so much different treatments as different ways of collecting microbes. It very well may be that the abundance of a particular taxa drops to a few hundred cells in any given sample because it is simply outcompeted by other microbes - that is a distinctly plausible biological event. The variance stabilization appears to mute that effect thereby making it appear that all taxa appear in all samples at relatively similar abundances. Given these results, is rlog transformation really an appropriate choice when comparing metagenomic data? The variance stabilization seems to drastically change the biological patterns in the dataset. Thanks, Kristina ------------------------------------------------------------------ Kristina Fontanez, Postdoctoral Fellow fontanez at mit.edu<mailto:fontanez at="" mit.edu=""><mailto:fontanez at="" mit.edu<mailto:fontanez="" at="" mit.edu="">> Massachusetts Institute of Technology Department of Civil and Environmental Engineering 48-120E 15 Vassar Street Cambridge, MA 02139 [[alternative HTML version deleted]] _______________________________________________ Bioconductor mailing list Bioconductor at r-project.org<mailto:bioconductor at="" r-project.org=""> https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD REPLY
0
Entering edit mode
@kristina-m-fontanez-6323
Last seen 10.2 years ago
Susan- Thank you for weighing in. I actually learned a good deal from your recently submitted paper, Waste Not, Want Not:Why rarefying microbiome data is inadmissible, available on Joey McMurdie’s website (for anyone else interested). Reading that manuscript is actually what turned me on to using DESeq2 for variance stabilization in the first place. I appreciate your group’s work in this area as often the techniques used in RNA-seq are not easily transferred to metagenomic studies. I agree that my particular study design makes statistical analyses much more difficult, which we hope to rectify by sequencing further samples across multiple years. Thank you for your insights. Kristina ------------------------------------------------------------------ Kristina Fontanez, Postdoctoral Fellow fontanez@mit.edu<mailto:fontanez@mit.edu> Massachusetts Institute of Technology Department of Civil and Environmental Engineering 48-120E 15 Vassar Street Cambridge, MA 02139 On Jan 24, 2014, at 11:36 AM, Susan Holmes <susanatstat@gmail.com<mailto:susanatstat@gmail.com>> wrote: Dear Kristina, I don't quite know whether the BioConductor users group is the right forum for this interesting scientific discussion, the problem with your data is that you do not have replicates and the varying library sizes give a heteroscedasticity that reduces the power with which you could try to compare the different abundances. All the BioC people can do is help you fix your code, not your experiment. It's not because you fix the radio that the news will be good. With Joey McMurdie, we are putting alot of work in making this passage to standard normalization procedures more understandable for biologists and your input is valuable, however your experiment seems to be have a design flaw for what you want to do. I still believe that normalization the data will give a much more reasonable scale of comparison than taking the raw data. Data transformations are always arbitrary, but the original measurements scales are also. I think reading more thoroughly a chapter on the Delta method and variance stabilization in a book like John Rice's Introduction to Mathematical Statistics and Data Analysis will enable to get a better handle on the transformation and variance stabilization issues which are often encountered in these settings. Good Luck Susan Susan Holmes Professor, Statistics and BioX John Henry Samter University Fellow in Undergraduate Education Director, Mathematical and Computational Sciences Stanford http://www-stat.stanford.edu/~susan/ On Fri, Jan 24, 2014 at 5:12 PM, Kristina M Fontanez <fontanez@mit.edu<mailto:fontanez@mit.edu>> wrote: Dear Bioconductors- After previous discussions on this list, I had settled on using regularized log transformed counts in order to present relative abundance heatmaps/barcharts of my metagenomic data. As this transformation accounts for both differing library sizes and stabilizes the variance among counts, it seemed an appropriate choice. However, now I am running into the issue that the transformed data does not make biological sense. For example, below is a small table with 12 samples and the OTU counts for one taxa which have been either rlog transformed (using the blind option to ignore experimental design) or represent the raw counts. This particular taxa is one of the most abundant in my dataset. As you can see, the rlog transformed counts are much more stable across samples than the raw counts. This is exactly the expected behavior of the rlog transformation in that the variance of counts among samples has been reduced and the library sizes are now much more similar. The problem is that the results do not make biological sense. This particular taxa is a copiotroph and so grows to high abundance in the presence of short bursts of nutrients. That kind of burst of growth is expected in samples 1,3,5, and 7, however samples 9-12 represent seawater samples and so they are nutrient-poor (relatively). This taxa ought to be in low abundance in those samples, which is the case in the raw counts. However, the rlog transformation inflates their counts so that they are now similar to those found in all other samples. This result obscures what is likely the true abundance of this taxa in the water column by inflating it beyond what is biologically reasonable given the environment in which these samples were collected. sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10 sample11 sample12 OTU1 rlog trans 136,280 93,429 131,635 93,351 120,957 93,334 120,358 93,317 93,304 93,310 93,308 93,322 OTU1 raw 468,459 1,190 168,280 330 237,248 94 232,483 195 249 1,647 331 501 With expression data, it seems appropriate to assume that in a treatment vs control scenario, all genes are likely to be expressed at some level (if even a low level) in all samples. However, I do not think that holds true for the abundance of taxa, especially in a case like mine where the samples are not so much different treatments as different ways of collecting microbes. It very well may be that the abundance of a particular taxa drops to a few hundred cells in any given sample because it is simply outcompeted by other microbes - that is a distinctly plausible biological event. The variance stabilization appears to mute that effect thereby making it appear that all taxa appear in all samples at relatively similar abundances. Given these results, is rlog transformation really an appropriate choice when comparing metagenomic data? The variance stabilization seems to drastically change the biological patterns in the dataset. Thanks, Kristina ------------------------------------------------------------------ Kristina Fontanez, Postdoctoral Fellow fontanez@mit.edu<mailto:fontanez@mit.edu><mailto:fontanez@mit.edu<mail to:fontanez@mit.edu="">> Massachusetts Institute of Technology Department of Civil and Environmental Engineering 48-120E 15 Vassar Street Cambridge, MA 02139 [[alternative HTML version deleted]] _______________________________________________ Bioconductor mailing list Bioconductor@r-project.org<mailto:bioconductor@r-project.org> https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor [[alternative HTML version deleted]]
ADD COMMENT

Login before adding your answer.

Traffic: 1065 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6