Question: DEseq2 metagenomic analysis without replicates
0
gravatar for Guest User
5.7 years ago by
Guest User12k
Guest User12k wrote:
Hello- I am working with a metagenomic dataset in DESeq2 that does not have true biological replicates but has multiple factors and levels. Due to the difficulty in obtaining these samples, replicates were not an option. The study design consists of seawater microbes collected at 4 depths. For each depth, there are 3 treatments (live, dead and none). I expect there to be true biological differences in abundance both among depths and among treatments. So, combining the samples across depths is not a great option because I expect there to be some variation in abundance of taxa among depths. The same holds true for combining samples across treatments. However, if I have to choose one I would expect there to be less variation among depths than among treatments. The goal is to assess the differences in abundance among treatments and among depths. Ultimately, I would like to create relative abundance heat maps and bar charts displaying the taxa which are differentially abundant among treatments and among depths. I would also like to cluster the samples using PCoA or nMDS to see the effects of treatment and depth on sample similarity. Is it possible to normalize these data (using variance-stabilized normalization or regularized log transformation) without choosing a study design? In my perusing of the mailing list, one suggestion was to use a design ~ 1 to estimate the dispersions so that the dispersions treat all samples as replicates. In my case, that is not an ideal choice due to the expected variation in abundance of taxa among depths and treatments. So, I see two options for moving forward that I would like feedback on: 1) Make relative abundance heat maps and bar charts based on simple counts/library size proportions. Later, to do differential abundance tests, pool samples across depths to compare the 3 treatments (4 replicates per treatment), calculating the dispersion using ~Treatment as the study design. 2) Normalize count data using a design ~1 option in DEseq2, export normalized counts. As far I can tell - there isn???t a documented procedure to do this in manual, vignettes or on the mailing list. The main problem here is that I can???t figure out a way to export the regularized log transformed counts. One approach to (2) for the regularized log transformation is below and the variable Deptreat_rlog contains the regularized log transformed count data which is the same size as the original dataset (but the ???counts??? matrix is inaccessible): > design(Deptreat)<- ~1 > Deptreat class: DESeqDataSet dim: 1064 12 exptData(0): assays(1): counts rownames(1064): OTU1 OTU2 ... OTU1063 OTU1064 rowData metadata column names(0): colnames(12): HD5_150L HD5_150D ... HD5_500D HD9_SW_500_22 colData names(5): Treatment Depth Cmg.m2.day Nmg.m2.day Pmg.m2.day > Deptreat<-estimateSizeFactors(Deptreat) > Deptreat<-estimateDispersions(Deptreat,fitType="local") gene-wise dispersion estimates mean-dispersion relationship final dispersion estimates > plotDispEsts(Deptreat) <dispestplot.jpeg> > Deptreat_rlog=rlogData(Deptreat) > dim(Deptreat_rlog) [1] 1064 12 > assay(Deptreat_rlog) Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ???assay??? for signature ???"matrix", "missing"??? > traceback() 3: stop(gettextf("unable to find an inherited method for function %s for signature %s", sQuote(fdef at generic), sQuote(cnames)), domain = NA) 2: (function (classes, fdef, mtable) { methods <- .findInheritedMethods(classes, fdef, mtable) if (length(methods) == 1L) return(methods[[1L]]) else if (length(methods) == 0L) { cnames <- paste0("\"", sapply(classes, as.character), "\"", collapse = ", ") stop(gettextf("unable to find an inherited method for function %s for signature %s", sQuote(fdef at generic), sQuote(cnames)), domain = NA) } else stop("Internal error in finding inherited methods; didn't return a unique method", domain = NA) })(list("matrix", "missing"), function (x, i, ...) standardGeneric("assay"), <environment>) 1: assay(Deptreat_rlog) Thank you for your insights! Kristina -- output of sessionInfo(): > sessionInfo() R version 3.0.2 (2013-09-25) Platform: x86_64-apple-darwin10.8.0 (64-bit) locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] parallel stats graphics grDevices utils datasets methods [8] base other attached packages: [1] DESeq2_1.2.5 RcppArmadillo_0.3.920.1 Rcpp_0.10.6 [4] GenomicRanges_1.14.3 XVector_0.2.0 IRanges_1.20.5 [7] BiocGenerics_0.8.0 ggplot2_0.9.3.1 scales_0.2.3 [10] phyloseq_1.7.9 loaded via a namespace (and not attached): [1] ade4_1.5-2 annotate_1.40.0 AnnotationDbi_1.24.0 [4] ape_3.0-11 Biobase_2.22.0 biom_0.3.10 [7] Biostrings_2.30.1 cluster_1.14.4 codetools_0.2-8 [10] colorspace_1.2-4 DBI_0.2-7 dichromat_2.0-0 [13] digest_0.6.3 foreach_1.4.1 genefilter_1.44.0 [16] grid_3.0.2 gtable_0.1.2 igraph_0.6.6 [19] iterators_1.0.6 labeling_0.2 lattice_0.20-24 [22] locfit_1.5-9.1 MASS_7.3-29 Matrix_1.1-0 [25] multtest_2.18.0 munsell_0.4.2 nlme_3.1-113 [28] permute_0.7-0 plyr_1.8 proto_0.3-10 [31] RColorBrewer_1.0-5 reshape2_1.2.2 RJSONIO_1.0-3 [34] RSQLite_0.11.4 splines_3.0.2 stats4_3.0.2 [37] stringr_0.6.2 survival_2.37-4 tools_3.0.2 [40] vegan_2.0-9 XML_3.95-0.2 xtable_1.7-1 -- Sent via the guest posting facility at bioconductor.org.
deseq2 • 2.6k views
ADD COMMENTlink modified 5.7 years ago by Kristina M Fontanez220 • written 5.7 years ago by Guest User12k
Answer: DEseq2 metagenomic analysis without replicates
0
gravatar for Steve Lianoglou
5.7 years ago by
Denali
Steve Lianoglou12k wrote:
Hi Kristina, Just a quick comment regarding your trouble w/ exporting the regularized data you mention here: On Mon, Jan 13, 2014 at 11:08 AM, Kristina Fontanez [guest] <guest at="" bioconductor.org=""> wrote: [snip] > 2) Normalize count data using a design ~1 option in DEseq2, export normalized counts. As far I can tell - there isn???t a documented procedure to do this in manual, vignettes or on the mailing list. The main problem here is that I can???t figure out a way to export the regularized log transformed counts. [/snip] You've already done it. You are calling `rlogData` on your DESeqDataSet here: >> Deptreat_rlog=rlogData(Deptreat) >> dim(Deptreat_rlog) > [1] 1064 12 >> assay(Deptreat_rlog) > Error in (function (classes, fdef, mtable) : > unable to find an inherited method for function ???assay??? for signature ???"matrix", "missing"??? You are getting an error because Deptreat_rlog is *already* a matrix of regularized counts (in log2) ... job done ... as in: this is the matrix you are looking for. See `?rlogData` for more details, but ... there's not much more to it than that. HTH, -steve -- Steve Lianoglou Computational Biologist Genentech
ADD COMMENTlink written 5.7 years ago by Steve Lianoglou12k
Answer: DEseq2 metagenomic analysis without replicates
0
gravatar for Simon Anders
5.7 years ago by
Simon Anders3.6k
Zentrum für Molekularbiologie, Universität Heidelberg
Simon Anders3.6k wrote:
Hi On 13/01/14 20:08, Kristina Fontanez [guest] wrote: > I am working with a metagenomic dataset in DESeq2 that does not have > true biological replicates but has multiple factors and levels. Due > to the difficulty in obtaining these samples, replicates were not an > option. The study design consists of seawater microbes collected at 4 > depths. Out of genuine curiosity: Why are replicates not an option? Maybe I have a wrong idea about metagenomics, but I now imagine that you went somewhere with a ship and lowered a bucket (or, maybe, a more sophisticated sample-taking apparatus ;-) ) four times, each time to a different depth, to take a sample. Why would it have been prohibitively expensive to let down the bucket twice for each depth? Even if you do all sample takings at the very same spot, you should get a reasonable estimate for variation within a depth layer, because your boat probably drifts a few hundred metres anyway in the time. > For each depth, there are 3 treatments (live, dead and none). Can you explain more what treatments these are? > I expect there to be true biological differences in abundance both > among depths and among treatments. So, combining the samples across > depths is not a great option because I expect there to be some > variation in abundance of taxa among depths. The same holds true for > combining samples across treatments. However, if I have to choose one > I would expect there to be less variation among depths than among > treatments. > > The goal is to assess the differences in abundance among treatments > and among depths. Ultimately, I would like to create relative > abundance heat maps and bar charts displaying the taxa which are > differentially abundant among treatments and among depths. I would > also like to cluster the samples using PCoA or nMDS to see the > effects of treatment and depth on sample similarity. > > Is it possible to normalize these data (using variance-stabilized > normalization or regularized log transformation) without choosing a > study design? In my perusing of the mailing list, one suggestion was > to use a design ~ 1 to estimate the dispersions so that the > dispersions treat all samples as replicates. In my case, that is not > an ideal choice due to the expected variation in abundance of taxa > among depths and treatments. The normalization does not use a study design, nor does the VST or rlog transform. (Note that the term "normalization" here only refers to accounting for sequencing depth, and this has little to do with study design.) The respective DESeq2 functions all ignore the 'design' argument. > 1) Make relative abundance heat maps and bar charts based on simple > counts/library size proportions. You should base them on the rlog-transformed data, not on normalized counts. > Later, to do differential abundance > tests, pool samples across depths to compare the 3 treatments (4 > replicates per treatment), calculating the dispersion using > ~Treatment as the study design. I would be surprised if you find much this way. I don't know what the treatments are but "live" and "dead" sounds like rather dramatic differences, giving rise to very high dispersion estimates. If you are lucky, you get some results nevertheless. In the end, however, the best option for non-replicated data is to _guess_ a good value for the dispersion instead of estimating it. For example, instead of the usual "dds <- estimateDispersions(dds)", you write "dispersions(dds) <- 0.1" to set the dispersion value for all genes to your best guess, in this example 0.1. Where to get the guess from? Well, hopefully you have some idea how much these counts typically vary between two samples taken from the depth and treated the same manner. If you think, they typically vary by 50%, then set the dispersion to 0.5^2=0.25. (If you don't have any idea at all, then you really should have tried to find out experimentally.) Of course, this approach may only be used to analyse preliminary data. After all, if you mention in a paper that you used a mere guess, the reviewers will (hopefully) ask how you arrived at it. As you don't have replicates, you cannot argue from your data that your guess is a reasonable value. It may also be hard to argue from previously published data, because metagenomics is too new a field to have much to compare with, and you cannot know whether your data is of as good quality as the one you based your guess on. It should be clear that you cannot claim that a difference is related to depth or treatment if you don't know whether differences of this magnitude could as well have been found between two samples taken at the same depth and treated the same way. I have explained this in a bit more detail, because we haven't had the topic on this mailing list for a while, and it might make sense from time to time to remind newer readers why replicates (or better: data from several independent samples) are so important. > 2) Normalize count data using a design ~1 option in DEseq2, export > normalized counts. As far I can tell - there isn???t a documented > procedure to do this in manual, vignettes or on the mailing list. The > main problem here is that I can???t figure out a way to export the > regularized log transformed counts. Again, the normalization has nothing to with the study design, as it _only_ normalizes for sequencing depth. You will get always the same normalized counts, which you can access with counts( dds, normalized=TRUE ) no matter what design formula you specify. > One approach to (2) for the regularized log transformation is below > and the variable Deptreat_rlog contains the regularized log > transformed count data which is the same size as the original dataset > (but the ???counts??? matrix is inaccessible): Are you following some code example here? Because it looks that you are mixing up two. The result of 'rlogData' is a matrix, you don't need 'assay' to extract the matrix. Simply save the content of 'rlogData' into a CSV file, or pass the variable as is it to the heatmap function, or do something else with it. You need to use 'assay' if you have used 'rlogTransformation' instead of 'rlogData'. Simon
ADD COMMENTlink written 5.7 years ago by Simon Anders3.6k
Simon- Thank you for your insights. Replicates aren’t an option often in marine metagenomics because many of us aren’t working at depths where dropping a bucket is not a feasible option. Notably, we are working with medium-term deployments of in situ incubations where the number of sampling spots was extremely limited. It’s a first pass at a difficult issue in marine microbiology but until we have the physical infrastructure (sampling mechanisms) to get a large number of samples, we often sacrifice replicates for better coverage across depths. Looking forward, replicates is definitely a goal. The treatments are in situ incubations representing live microbes, poisoned microbes (dead) and surrounding seawater (not an incubation, can be thought of as a control). Your point about guessing the dispersions is a difficult one for me to agree with. First, the differences between the treatments are likely to far exceed the differences within the depths. Preliminary analyses in bayseq where samples are grouped by treatment bear this point out. So, I think it’s reasonable to combine the dispersions across depths so that I can compare live and dead treatments. Second, as you point out, guessing the appropriate dispersion makes the data difficult to publish given I won’t have a good reason to argue for any value. These particular marine metagenomes would be the first of their kind so there really isn’t a good reference point. Are you following some code example here? Because it looks that you are mixing up two. The result of 'rlogData' is a matrix, you don't need 'assay' to extract the matrix. Simply save the content of 'rlogData' into a CSV file, or pass the variable as is it to the heatmap function, or do something else with it. You need to use 'assay' if you have used 'rlogTransformation' instead of 'rlogData’. /snip As you suggest, I would prefer to make relative abundance bar charts/heatmaps using regularized log transformed counts while accounting for varying sequencing depth. This was my Option 2 where I tried to follow a code example I found on the Bioconductor forums<http s:="" stat.ethz.ch="" pipermail="" bioconductor="" 2013-april="" 051878.html=""> relating to use of data without replicates, which used design ~ 1. Based on your insights about exporting directly as .csv (and a previous poster - thank you Steve), I plan to use the below code to get counts normalized by sequencing depth and regularized log transformed. I will estimate the dispersions by pooling the treatments across depths. One remaining question is whether I need the design ~ 1 option. I’m still not clear on that. design(Deptreat)<- ~1 Deptreat class: DESeqDataSet dim: 1064 12 exptData(0): assays(1): counts rownames(1064): OTU1 OTU2 ... OTU1063 OTU1064 rowData metadata column names(0): colnames(12): HD5_150L HD5_150D ... HD5_500D HD9_SW_500_22 colData names(5): Treatment Depth Cmg.m2.day Nmg.m2.day Pmg.m2.day Deptreat<-estimateSizeFactors(Deptreat) Deptreat<-estimateDispersions(Deptreat,fitType="local") gene-wise dispersion estimates mean-dispersion relationship final dispersion estimates >Deptreat_rlog=rlogData(Deptreat) >write.csv(Deptreat_rlog,”Deptreat_rlog_counts.csv”) Thank you again, Kristina ------------------------------------------------------------------ Kristina Fontanez, Postdoctoral Fellow fontanez@mit.edu<mailto:fontanez@mit.edu> Massachusetts Institute of Technology Department of Civil and Environmental Engineering 48-120E 15 Vassar Street Cambridge, MA 02139 On Jan 13, 2014, at 3:25 PM, Simon Anders <anders@embl.de<mailto:anders@embl.de>> wrote: Hi On 13/01/14 20:08, Kristina Fontanez [guest] wrote: I am working with a metagenomic dataset in DESeq2 that does not have true biological replicates but has multiple factors and levels. Due to the difficulty in obtaining these samples, replicates were not an option. The study design consists of seawater microbes collected at 4 depths. Out of genuine curiosity: Why are replicates not an option? Maybe I have a wrong idea about metagenomics, but I now imagine that you went somewhere with a ship and lowered a bucket (or, maybe, a more sophisticated sample-taking apparatus ;-) ) four times, each time to a different depth, to take a sample. Why would it have been prohibitively expensive to let down the bucket twice for each depth? Even if you do all sample takings at the very same spot, you should get a reasonable estimate for variation within a depth layer, because your boat probably drifts a few hundred metres anyway in the time. For each depth, there are 3 treatments (live, dead and none). Can you explain more what treatments these are? I expect there to be true biological differences in abundance both among depths and among treatments. So, combining the samples across depths is not a great option because I expect there to be some variation in abundance of taxa among depths. The same holds true for combining samples across treatments. However, if I have to choose one I would expect there to be less variation among depths than among treatments. The goal is to assess the differences in abundance among treatments and among depths. Ultimately, I would like to create relative abundance heat maps and bar charts displaying the taxa which are differentially abundant among treatments and among depths. I would also like to cluster the samples using PCoA or nMDS to see the effects of treatment and depth on sample similarity. Is it possible to normalize these data (using variance-stabilized normalization or regularized log transformation) without choosing a study design? In my perusing of the mailing list, one suggestion was to use a design ~ 1 to estimate the dispersions so that the dispersions treat all samples as replicates. In my case, that is not an ideal choice due to the expected variation in abundance of taxa among depths and treatments. The normalization does not use a study design, nor does the VST or rlog transform. (Note that the term "normalization" here only refers to accounting for sequencing depth, and this has little to do with study design.) The respective DESeq2 functions all ignore the 'design' argument. 1) Make relative abundance heat maps and bar charts based on simple counts/library size proportions. You should base them on the rlog-transformed data, not on normalized counts. Later, to do differential abundance tests, pool samples across depths to compare the 3 treatments (4 replicates per treatment), calculating the dispersion using ~Treatment as the study design. I would be surprised if you find much this way. I don't know what the treatments are but "live" and "dead" sounds like rather dramatic differences, giving rise to very high dispersion estimates. If you are lucky, you get some results nevertheless. In the end, however, the best option for non-replicated data is to _guess_ a good value for the dispersion instead of estimating it. For example, instead of the usual "dds <- estimateDispersions(dds)", you write "dispersions(dds) <- 0.1" to set the dispersion value for all genes to your best guess, in this example 0.1. Where to get the guess from? Well, hopefully you have some idea how much these counts typically vary between two samples taken from the depth and treated the same manner. If you think, they typically vary by 50%, then set the dispersion to 0.5^2=0.25. (If you don't have any idea at all, then you really should have tried to find out experimentally.) Of course, this approach may only be used to analyse preliminary data. After all, if you mention in a paper that you used a mere guess, the reviewers will (hopefully) ask how you arrived at it. As you don't have replicates, you cannot argue from your data that your guess is a reasonable value. It may also be hard to argue from previously published data, because metagenomics is too new a field to have much to compare with, and you cannot know whether your data is of as good quality as the one you based your guess on. It should be clear that you cannot claim that a difference is related to depth or treatment if you don't know whether differences of this magnitude could as well have been found between two samples taken at the same depth and treated the same way. I have explained this in a bit more detail, because we haven't had the topic on this mailing list for a while, and it might make sense from time to time to remind newer readers why replicates (or better: data from several independent samples) are so important. 2) Normalize count data using a design ~1 option in DEseq2, export normalized counts. As far I can tell - there isn’t a documented procedure to do this in manual, vignettes or on the mailing list. The main problem here is that I can’t figure out a way to export the regularized log transformed counts. Again, the normalization has nothing to with the study design, as it _only_ normalizes for sequencing depth. You will get always the same normalized counts, which you can access with counts( dds, normalized=TRUE ) no matter what design formula you specify. One approach to (2) for the regularized log transformation is below and the variable Deptreat_rlog contains the regularized log transformed count data which is the same size as the original dataset (but the “counts” matrix is inaccessible): Are you following some code example here? Because it looks that you are mixing up two. The result of 'rlogData' is a matrix, you don't need 'assay' to extract the matrix. Simply save the content of 'rlogData' into a CSV file, or pass the variable as is it to the heatmap function, or do something else with it. You need to use 'assay' if you have used 'rlogTransformation' instead of 'rlogData'. Simon _______________________________________________ Bioconductor mailing list Bioconductor@r-project.org<mailto:bioconductor@r-project.org> https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor [[alternative HTML version deleted]]
ADD REPLYlink written 5.7 years ago by Kristina M Fontanez220
Hi Kristina On 13/01/14 22:08, Kristina M Fontanez wrote: > Thank you for your insights. Replicates aren?t an option often in marine > metagenomics because many of us aren?t working at depths where dropping > a bucket is not a feasible option. Notably, we are working with > medium-term deployments of in situ incubations where the number of > sampling spots was extremely limited. It?s a first pass at a difficult > issue in marine microbiology but until we have the physical > infrastructure (sampling mechanisms) to get a large number of samples, > we often sacrifice replicates for better coverage across depths. Looking > forward, replicates is definitely a goal. First, you may (or may not) have a valid substitute for replication, see below, and if so, there might be little to complain about your design. Of course, doing all four depths twice might be overkill for a first try. If I had money for four samples, I might have opted for three depths and taken, say, the middle depth twice, to get some idea how much variation there can be. Precisely because this is a pilot study and a first try for a new field, it seems important to me to establish that the measurements for a given depth are stable and reproducible, _before_ rolling out a bigger experiment, and this is why I'm always so puzzled that people consider replicates as important for follow-ups but not for pilot studies. I would say it is the other way round. > The treatments are in situ incubations representing live microbes, > poisoned microbes (dead) and surrounding seawater (not an incubation, > can be thought of as a control). > > Your point about guessing the dispersions is a difficult one for me to > agree with. First, the differences between the treatments are likely to > far exceed the differences within the depths. Preliminary analyses in > bayseq where samples are grouped by treatment bear this point out. So, I > think it?s reasonable to combine the dispersions across depths so that I > can compare live and dead treatments. Actually, I fully agree. If treatment causes much stronger differences than depth, then using the samples from different depths as replicates to assess the effect of treatment is entirely reasonable and valid. I somehow thought you wanted to to it the other way round (pooling treatments to find differences between depths -- though this is not what you wrote), and this won't work, because nothing will be significant. > Second, as you point out, guessing > the appropriate dispersion makes the data difficult to publish given I > won?t have a good reason to argue for any value. These particular marine > metagenomes would be the first of their kind so there really isn?t a > good reference point. For the treatment effect, you are fine because you can argue that the dispersion estimated across different depths is certainly a conservative estimate for what you would have gotten from replicates -- so you are fine there. Problem is, you cannot reason about the effect of depth this way, but you would not want to throw away your knowledge about depths. However, I neglected in my first post that depth is an _ordered_ factor, and this might make quite a difference. I am not a marine biologist, but I imagine that most taxa will either keep increasing or keep decreasing in abundance when you go deeper and deeper, at least if your four depths are not that extreme. I assume that only few of your taxa will be happiest or unhappiest at the medium depths, i.e. have strongest or weakest abundance not at the most shallow or deepest sample but in one of the middle ones. If so (and only if so), you have a good substitute for replication. Strong differences between depths would then nearly always be consistent: the direction of change for all three steps (from depth 1 to depth 2, from depth 2 to depth 3 and from depth 3 to depth 4) is the same. Taxa without depth preference will fluctuate randomly: the differences between adjacent depth are not only weak but also inconsistent in their directions. Hence, the taxa with the strongest differences between depths will nearly always be consistent while those with the weakest differences will have random directions of change, and between the two, the line for significance can be drawn. This is why an ordered factor can substitute for replication. To do this with DESeq2, specify depth in you sample table not as a factor, but as numerical vector, e.g. by putting the depth in metres. The estimated log2 fold change can then be interpreted as reciprocal halving/doubling distances: A log2 fold change of +0.2, for example, would mean that the taxon's abundance doubles every 5 metres of depth (1/0.2=5), -0.4 would mean that it's abundance halves every 2.5 metres. If this assumption of exponential change is not too far from reality (and at least for light, it is true, and hence maybe also for light-dependent microbes), you will get valid p values. Your design for this should then be: ~ treatment + depth I should caution that we haven't tested regression on continuous variables much, so if it does not work, ask again. It should work but may need some tweaking, especially you may need to switch off coefficient shrinkage. > One remaining question is whether I need the design ~ 1 option. I?m still > not clear on that. When you create a DESeqDataSet, you have to pass a design formula. However, the 'estimateSizeFactor' function's output does not depend on this. The function 'rlogTransformation' takes an option, 'blind'. Its default, TRUE, means that the function should _ignore_ the design formula and dispersion values stored in the object and recompute them using the design formula '~ 1'. With '~ 1', the function does not know anything about the samples' assignment to groups, which is why we called it 'blind'. If you use 'rlogData', you have to specify '~ 1' manually. Simon
ADD REPLYlink written 5.7 years ago by Simon Anders3.6k
Simon- I agree that for the treatment effect, combining the dispersions across depths is the way to proceed with these data. I think your reasoning for depth as an ordered factor is sound and likely applies to the depth profile I am working with. I attempted to follow your suggestion (snipped below) to create a design that allows me to use depth as an ordered factor. However, I found it confusing that the design is ~treatment + depth. With depth as the second variable, doesn’t that mean that the dispersions will be combined across treatments? I thought the goal was the combine dispersions across depths so that the treatments could be compared. To do this with DESeq2, specify depth in you sample table not as a factor, but as numerical vector, e.g. by putting the depth in metres. The estimated log2 fold change can then be interpreted as reciprocal halving/doubling distances: A log2 fold change of +0.2, for example, would mean that the taxon's abundance doubles every 5 metres of depth (1/0.2=5), -0.4 would mean that it's abundance halves every 2.5 metres. If this assumption of exponential change is not too far from reality (and at least for light, it is true, and hence maybe also for light-dependent microbes), you will get valid p values. Your design for this should then be: ~ treatment + depth I should caution that we haven't tested regression on continuous variables much, so if it does not work, ask again. It should work but may need some tweaking, especially you may need to switch off coefficient shrinkage. /snip In the code below I used a ~TREATMENT+DEPTH design with depth as numerical vectors and the “None” (seawater) treatment as the base level of the treatment factor. My main concern is that the output results in hundreds of differentially abundant taxa (344 OTUs for the Dead vs Seawater comparison with p-adjusted values <1e-5). These results include both independent Filtering and the cooksCutoff (by default). Given the really high numbers I am inclined to think that a large number of these are false positives and we have modeled the dispersion incorrectly. What do you think of this assessment? > jam class: DESeqDataSet dim: 2151 12 exptData(0): assays(1): counts rownames(2151): OTU1 OTU2 ... OTU2150 OTU2151 rowData metadata column names(0): colnames(12): HD5_150L HD5_150D ... HD9_SW_300_22 HD9_SW_500_22 colData names(5): DEPTH TREATMENT Cmg.m2.day Nmg.m2.day Pmg.m2.day > design(jam) ~TREATMENT + DEPTH > colData(jam)$TREATMENT<-relevel(colData(jam)$TREATMENT,"None") > colData(jam)$DEPTH<-relevel(colData(jam)$DEPTH,"150") Error in relevel.default(colData(jam)$DEPTH, "150") : 'relevel' only for factors > jam<-estimateSizeFactors(jam) > jam<-estimateDispersions(jam,fitType="local") gene-wise dispersion estimates mean-dispersion relationship final dispersion estimates > plotDispEsts(jam) > jam<-nbinomWaldTest(jam) > resultsNames(jam) [1] "Intercept" "TREATMENT_Dead_vs_None" "TREATMENT_Live_vs_None" [4] "DEPTH" > DvN_clean<-results(jam,"TREATMENT_Dead_vs_None",independentFiltering =TRUE) > DvN_clean<-DvN_clean[order(DvN_clean$padj,na.last=NA),] > write.csv(DvN_clean,file="Dead_vs_Seawater.TreatDepth.iF_true.cook_t rue.csv") Thanks, Kristina > sessionInfo() R version 3.0.2 (2013-09-25) Platform: x86_64-apple-darwin10.8.0 (64-bit) locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] parallel stats graphics grDevices utils datasets methods [8] base other attached packages: [1] DESeq2_1.2.5 RcppArmadillo_0.3.920.1 Rcpp_0.10.6 [4] snow_0.3-13 baySeq_1.16.0 GenomicRanges_1.14.3 [7] XVector_0.2.0 IRanges_1.20.5 BiocGenerics_0.8.0 [10] ggplot2_0.9.3.1 scales_0.2.3 phyloseq_1.7.9 loaded via a namespace (and not attached): [1] ade4_1.5-2 annotate_1.40.0 AnnotationDbi_1.24.0 [4] ape_3.0-11 Biobase_2.22.0 biom_0.3.10 [7] Biostrings_2.30.1 cluster_1.14.4 codetools_0.2-8 [10] colorspace_1.2-4 DBI_0.2-7 dichromat_2.0-0 [13] digest_0.6.3 foreach_1.4.1 genefilter_1.44.0 [16] grid_3.0.2 gtable_0.1.2 igraph_0.6.6 [19] iterators_1.0.6 labeling_0.2 lattice_0.20-24 [22] locfit_1.5-9.1 MASS_7.3-29 Matrix_1.1-0 [25] multtest_2.18.0 munsell_0.4.2 nlme_3.1-113 [28] permute_0.7-0 plyr_1.8 proto_0.3-10 [31] RColorBrewer_1.0-5 reshape2_1.2.2 RJSONIO_1.0-3 [34] RSQLite_0.11.4 splines_3.0.2 stats4_3.0.2 [37] stringr_0.6.2 survival_2.37-4 tools_3.0.2 [40] vegan_2.0-9 XML_3.95-0.2 xtable_1.7-1 ------------------------------------------------------------------ Kristina Fontanez, Postdoctoral Fellow fontanez@mit.edu<mailto:fontanez@mit.edu> Massachusetts Institute of Technology Department of Civil and Environmental Engineering 48-120E 15 Vassar Street Cambridge, MA 02139 On Jan 13, 2014, at 5:54 PM, Simon Anders <anders@embl.de<mailto:anders@embl.de>> wrote: Hi Kristina On 13/01/14 22:08, Kristina M Fontanez wrote: Thank you for your insights. Replicates aren’t an option often in marine metagenomics because many of us aren’t working at depths where dropping a bucket is not a feasible option. Notably, we are working with medium-term deployments of in situ incubations where the number of sampling spots was extremely limited. It’s a first pass at a difficult issue in marine microbiology but until we have the physical infrastructure (sampling mechanisms) to get a large number of samples, we often sacrifice replicates for better coverage across depths. Looking forward, replicates is definitely a goal. First, you may (or may not) have a valid substitute for replication, see below, and if so, there might be little to complain about your design. Of course, doing all four depths twice might be overkill for a first try. If I had money for four samples, I might have opted for three depths and taken, say, the middle depth twice, to get some idea how much variation there can be. Precisely because this is a pilot study and a first try for a new field, it seems important to me to establish that the measurements for a given depth are stable and reproducible, _before_ rolling out a bigger experiment, and this is why I'm always so puzzled that people consider replicates as important for follow-ups but not for pilot studies. I would say it is the other way round. The treatments are in situ incubations representing live microbes, poisoned microbes (dead) and surrounding seawater (not an incubation, can be thought of as a control). Your point about guessing the dispersions is a difficult one for me to agree with. First, the differences between the treatments are likely to far exceed the differences within the depths. Preliminary analyses in bayseq where samples are grouped by treatment bear this point out. So, I think it’s reasonable to combine the dispersions across depths so that I can compare live and dead treatments. Actually, I fully agree. If treatment causes much stronger differences than depth, then using the samples from different depths as replicates to assess the effect of treatment is entirely reasonable and valid. I somehow thought you wanted to to it the other way round (pooling treatments to find differences between depths -- though this is not what you wrote), and this won't work, because nothing will be significant. Second, as you point out, guessing the appropriate dispersion makes the data difficult to publish given I won’t have a good reason to argue for any value. These particular marine metagenomes would be the first of their kind so there really isn’t a good reference point. For the treatment effect, you are fine because you can argue that the dispersion estimated across different depths is certainly a conservative estimate for what you would have gotten from replicates -- so you are fine there. Problem is, you cannot reason about the effect of depth this way, but you would not want to throw away your knowledge about depths. However, I neglected in my first post that depth is an _ordered_ factor, and this might make quite a difference. I am not a marine biologist, but I imagine that most taxa will either keep increasing or keep decreasing in abundance when you go deeper and deeper, at least if your four depths are not that extreme. I assume that only few of your taxa will be happiest or unhappiest at the medium depths, i.e. have strongest or weakest abundance not at the most shallow or deepest sample but in one of the middle ones. If so (and only if so), you have a good substitute for replication. Strong differences between depths would then nearly always be consistent: the direction of change for all three steps (from depth 1 to depth 2, from depth 2 to depth 3 and from depth 3 to depth 4) is the same. Taxa without depth preference will fluctuate randomly: the differences between adjacent depth are not only weak but also inconsistent in their directions. Hence, the taxa with the strongest differences between depths will nearly always be consistent while those with the weakest differences will have random directions of change, and between the two, the line for significance can be drawn. This is why an ordered factor can substitute for replication. To do this with DESeq2, specify depth in you sample table not as a factor, but as numerical vector, e.g. by putting the depth in metres. The estimated log2 fold change can then be interpreted as reciprocal halving/doubling distances: A log2 fold change of +0.2, for example, would mean that the taxon's abundance doubles every 5 metres of depth (1/0.2=5), -0.4 would mean that it's abundance halves every 2.5 metres. If this assumption of exponential change is not too far from reality (and at least for light, it is true, and hence maybe also for light-dependent microbes), you will get valid p values. Your design for this should then be: ~ treatment + depth I should caution that we haven't tested regression on continuous variables much, so if it does not work, ask again. It should work but may need some tweaking, especially you may need to switch off coefficient shrinkage. One remaining question is whether I need the design ~ 1 option. I’m still not clear on that. When you create a DESeqDataSet, you have to pass a design formula. However, the 'estimateSizeFactor' function's output does not depend on this. The function 'rlogTransformation' takes an option, 'blind'. Its default, TRUE, means that the function should _ignore_ the design formula and dispersion values stored in the object and recompute them using the design formula '~ 1'. With '~ 1', the function does not know anything about the samples' assignment to groups, which is why we called it 'blind'. If you use 'rlogData', you have to specify '~ 1' manually. Simon [[alternative HTML version deleted]]
ADD REPLYlink written 5.7 years ago by Kristina M Fontanez220
On 14/01/14 18:03, Kristina M Fontanez wrote: > I agree that for the treatment effect, combining the dispersions across > depths is the way to proceed with these data. I think your reasoning for > depth as an ordered factor is sound and likely applies to the depth > profile I am working with. I attempted to follow your suggestion > (snipped below) to create a design that allows me to use depth as an > ordered factor. However, I found it confusing that the design is > ~treatment + depth. > With depth as the second variable, doesn?t that mean > that the dispersions will be combined across treatments? I thought the > goal was the combine dispersions across depths so that the treatments > could be compared. Oh, I was just one step ahead. To see the effect of treatment, just leave out the depth factor, i.e., use "~ treatment". Only once you want to see the effect of depth, use "~ treatment + depth", with depth as a numeric variable (not an ordered factor, I may have been a bit unclear there). > In the code below I used a ~TREATMENT+DEPTH design with depth as > numerical vectors and the ?None? (seawater) treatment as the base level > of the treatment factor. My main concern is that the output results in > hundreds of differentially abundant taxa (344 OTUs for the Dead vs > Seawater comparison with p-adjusted values <1e-5). These results include > both independent Filtering and the cooksCutoff (by default). Given the > really high numbers I am inclined to think that a large number of these > are false positives and we have modeled the dispersion incorrectly. What > do you think of this assessment? Maybe try again with ignoring "depth" altogether, even though, in principle, including ti should not be wrong. Also have a look at the estimated fold changes. Maybe they are significant but very small. An MA plot would help here, and also allow to check the normalization. I'm still puzzled what kind of treatment "live" and "dead" might be. I find it hard to come up with a guess what kind of treatment might have no effect on most taxa, which nevertheless has such a drastic name as "dead". Can you explain a bit more about the biology? (By private mail, in case you don't want to expose unpublished work on a public mailing list.) Simon
ADD REPLYlink written 5.7 years ago by Simon Anders3.6k
Simon- Thank you again for your help answering all of my questions. I really appreciate the time you?ve taken on this problem. For the dead vs seawater comparison, 111/ 344 with p-adjusted values <1e-5 have log2Fold changes of > 3 or < -3. So, if I use that as my significance threshold - that?s still a lot of taxa. I have attached the MAplot created by: > plotMA(jam,lfcColname="TREATMENT_Dead_vs_None",pvalCutoff=0.00001,ma in="Design ~Treatment+Depth") Next, I tried your suggestion of just doing ~Treatment and I found 123 taxa with p-adjusted values <1e-5 have log2Fold changes of > 3 or < -3. > jes class: DESeqDataSet dim: 2151 12 exptData(0): assays(1): counts rownames(2151): OTU1 OTU2 ... OTU2150 OTU2151 rowData metadata column names(0): colnames(12): HD5_150L HD5_150D ... HD9_SW_300_22 HD9_SW_500_22 colData names(5): DEPTH TREATMENT Cmg.m2.day Nmg.m2.day Pmg.m2.day > design(jes) ~TREATMENT > colData(jes)$TREATMENT<-relevel(colData(jes)$TREATMENT,"None") > jes<-estimateSizeFactors(jes) > jes<-estimateDispersions(jes,fitType="local") gene-wise dispersion estimates mean-dispersion relationship final dispersion estimates > jes<-nbinomWaldTest(jes) > resultsNames(jes) [1] "Intercept" "TREATMENT_Dead_vs_None" "TREATMENT_Live_vs_None" > DvsN_clean<-results(jes,"TREATMENT_Dead_vs_None",independentFilterin g=TRUE) > DvsN_clean<-DvsN_clean[order(DvsN_clean$padj,na.last=NA),] > head(DvsN_clean) DataFrame with 6 rows and 6 columns baseMean log2FoldChange lfcSE stat pvalue padj <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> OTU1608 234.50106 5.150802 0.4392783 11.72560 9.423191e-32 1.897831e-28 OTU1643 592.54806 5.594136 0.4892074 11.43510 2.792133e-30 2.811678e-27 OTU808 908.49633 -4.542030 0.4091021 -11.10244 1.220707e-28 8.195015e-26 OTU1576 216.68941 4.960188 0.4562776 10.87099 1.584772e-27 7.979329e-25 OTU1574 222.80627 5.204206 0.4836326 10.76066 5.279121e-27 2.126430e-24 OTU52 82.98462 -3.508510 0.3291026 -10.66084 1.551899e-26 5.209207e-24 > write.csv(DvsN_clean,file="Dead_vs_Seawater.Treat.iF_true.cook_true. csv") > plotMA(jes,lfcColname="TREATMENT_Dead_vs_None",pvalCutoff=0.00001,ma in="Design ~Treatment") At this point, it looks like including the effect of Depth reduces the total number of taxa under my significance threshold, but not by much. It seems like I need to decide what log2fold changes are biologically meaningful rather than relying on the p-adjusted values as a guide. Thanks, Kristina ------------------------------------------------------------------ Kristina Fontanez, Postdoctoral Fellow fontanez at mit.edu<mailto:fontanez at="" mit.edu=""> Massachusetts Institute of Technology Department of Civil and Environmental Engineering 48-120E 15 Vassar Street Cambridge, MA 02139 On Jan 14, 2014, at 12:18 PM, Simon Anders <anders at="" embl.de<mailto:anders="" at="" embl.de="">> wrote: On 14/01/14 18:03, Kristina M Fontanez wrote: I agree that for the treatment effect, combining the dispersions across depths is the way to proceed with these data. I think your reasoning for depth as an ordered factor is sound and likely applies to the depth profile I am working with. I attempted to follow your suggestion (snipped below) to create a design that allows me to use depth as an ordered factor. However, I found it confusing that the design is ~treatment + depth. With depth as the second variable, doesn?t that mean that the dispersions will be combined across treatments? I thought the goal was the combine dispersions across depths so that the treatments could be compared. Oh, I was just one step ahead. To see the effect of treatment, just leave out the depth factor, i.e., use "~ treatment". Only once you want to see the effect of depth, use "~ treatment + depth", with depth as a numeric variable (not an ordered factor, I may have been a bit unclear there). In the code below I used a ~TREATMENT+DEPTH design with depth as numerical vectors and the ?None? (seawater) treatment as the base level of the treatment factor. My main concern is that the output results in hundreds of differentially abundant taxa (344 OTUs for the Dead vs Seawater comparison with p-adjusted values <1e-5). These results include both independent Filtering and the cooksCutoff (by default). Given the really high numbers I am inclined to think that a large number of these are false positives and we have modeled the dispersion incorrectly. What do you think of this assessment? Maybe try again with ignoring "depth" altogether, even though, in principle, including ti should not be wrong. Also have a look at the estimated fold changes. Maybe they are significant but very small. An MA plot would help here, and also allow to check the normalization. I'm still puzzled what kind of treatment "live" and "dead" might be. I find it hard to come up with a guess what kind of treatment might have no effect on most taxa, which nevertheless has such a drastic name as "dead". Can you explain a bit more about the biology? (By private mail, in case you don't want to expose unpublished work on a public mailing list.) Simon -------------- next part -------------- A non-text attachment was scrubbed... Name: MAplot_des_treatplusdepth.pdf Type: application/pdf Size: 332374 bytes Desc: MAplot_des_treatplusdepth.pdf URL: <https: stat.ethz.ch="" pipermail="" bioconductor="" attachments="" 20140114="" ed754c3d="" attachment-0002.pdf=""> -------------- next part -------------- A non-text attachment was scrubbed... Name: MAplot_des_treatonly.pdf Type: application/pdf Size: 331704 bytes Desc: MAplot_des_treatonly.pdf URL: <https: stat.ethz.ch="" pipermail="" bioconductor="" attachments="" 20140114="" ed754c3d="" attachment-0003.pdf="">
ADD REPLYlink written 5.7 years ago by Kristina M Fontanez220
Hi Kristina I'll write a bit a longer reply, to advertise a kind of paradigm shift that we are now exploring in DESeq2. First a note to other readers of this thread: Kristina has sent me a private mail explaining the experiment in more detail. To sum this up without disclosing anything, the three "treatments" labelled "none", "life" and "dead" are not so much different treatments applied to one kind of sample but rather three different (actually: very different) manners in which the samples were collected, which are hence expected to capture quite different subsets of the species that are present at the sample collection spots. Therefore, it is no surprise that observed OTU abundances differ so greatly between the "treatments". Remember that the purpose of a test for differential expression in its usual sense is to find gene for which you can reject with confidence the null hypothesis that the gene is _not_at_all_ affected by the treatment, i.e., that its abundance stays exactly the same. It is quite common that this specific null hypothesis turns out to be a bit silly. All genes are connected by the cell's complicated regulatory network, and hence, there typically might not be a single gene which is not at least very slightly affected by the treatment. In essence, what a significant p value hence really means is that the effect was big enough that we can say with confidence, in which _direction_ the gene has changed. Quite often, the effect strength required to establish the direction of change is larger than the effect strength required to make the change of biological interest, and therefore, one looks at p values. If it is the other way round, one needs to also have a cut-off on fold change or do a banded test (see below). In your case, it is also clear a priori that no OTU will have the same abundance in two "treatments", given that the different sample collection approaches favour quite different species. Hence, it is no surprise that at a reasonable cut-off on the adjusted p values (say, 0.1), nearly all OTUs are significant. This means that you are in the favourable situation that you can take the estimated differences between treatments pretty much at face value. However, you want to know _how_precise they actually are, and for this, DESeq2 now reports standard errors alongside all estimated log fold changes. This data, and not the p values, is likely the more useful output for you. Simon
ADD REPLYlink written 5.7 years ago by Simon Anders3.6k
Hi Simon- Thank you for your detailed reply. I agree that it is the case often in marine microbiology that the null hypothesis of no changes between treatments, often does not really apply. When you frame the discussion this way, it makes sense that I am seeing large numbers of differentially abundant taxa. Your discussion about how to narrow down the large number of differentially abundant is quite interesting. If I am understanding you correctly you are suggesting that a reasonable cutoff on adjust p-values (0.1 or 10% false discovery rate) is simply too lenient for my data. I agree and that is why I initially dropped to 1e-5 however that number is really quite abritrary. Prior to reading your e-mail I had been considering using the base Mean of normalized counts across samples as a primary cutoff. I believe that this is exactly what independent Filtering does, in that it attempts to find a specific number of normalized mean counts across samples which maximizes the number of taxa with a FDR below alpha (0.1). However, this optimization isn’t very effective for my data because in one comparison (Dead vs None) 655/1596 OTUs are below the 0.1 cutoff and yet have base mean of normalized counts less than 10. It filters the data somewhat but not to a very helpful degree. In order to use it effectively I would need to lower alpha to 1e-5 or something similar, which is again - an arbitrary cutoff. In the end there are really two questions I need to answer: 1) How many taxa are “truly" differentially abundant? 2) Of those truly differentially abundant taxa, what log2fold changes are biologically meaningful? In regards to (1), it seems that based on your email and what I’ve read in the DESeq manual that anything with a FDR of 10% ought to be regarded as “truly” differentially abundant. Is that correct? (Notably, if you turn on independent filtering there is a threshold selected for the base mean of normalized cutoffs that optimizes the total number of taxa below the 0.1 (alpha) - so in that case the mean of normalized counts is indeed taken into account and well as the FDR). Cooks cutoff for outliers doesn’t apply for my samples because hugely differentially abundant taxa in my treatment are likely true biological events (microbes rapidly growing in response to influx of nutrients) rather than technical or experimental artifacts. In regards to (2), I think this is where the log2fold changes but not standard errors become useful. In my data, small standard errors (high precision) (< 0.3) are correlated with small log2fold changes (< 3) and low mean of normalized counts across samples (< 1000). If I choose to only keep those taxa with small standard errors then I will be throwing away some of the most differentially abundant taxa in my dataset (log2fold changes > 3, mean counts > 1000 and standard errors > 0.3). That seems backwards. Alternatively, if I choose to keep those taxa with high log2fold changes (>3), considering those to be biologically meaningful, then I reduce the dataset considerably. That leads me to the final and most difficult question: what level of log2fold change in a microbial community is biologically relevant on the time scales and in the ecological location in which these samples were collected? I can think of one possible ways to assess this: We have metagenomic samples of microbial communities taken at this location, at particular depths, going back several years (albeit sequenced with different genomic technologies). Previous research has shown that the microbial community in the seawater at any given depth doesn’t change very much from year to year. If I then make the assumption that any changes in the seawater community at a particular depth from one year to the next are the “baseline” level of variation that can be expected, I should be able to calculate the maximum log2fold change in abundance for any given taxa from one year to the next. I could then use that log2fold change as my threshold for biologically meaningful change when analyzing my treatments. Any insights you may have to these ideas would be greatly appreciated. Thank you, Kristina ------------------------------------------------------------------ Kristina Fontanez, Postdoctoral Fellow fontanez@mit.edu<mailto:fontanez@mit.edu> Massachusetts Institute of Technology Department of Civil and Environmental Engineering 48-120E 15 Vassar Street Cambridge, MA 02139 On Jan 16, 2014, at 10:02 AM, Simon Anders <anders@embl.de<mailto:anders@embl.de>> wrote: Hi Kristina I'll write a bit a longer reply, to advertise a kind of paradigm shift that we are now exploring in DESeq2. First a note to other readers of this thread: Kristina has sent me a private mail explaining the experiment in more detail. To sum this up without disclosing anything, the three "treatments" labelled "none", "life" and "dead" are not so much different treatments applied to one kind of sample but rather three different (actually: very different) manners in which the samples were collected, which are hence expected to capture quite different subsets of the species that are present at the sample collection spots. Therefore, it is no surprise that observed OTU abundances differ so greatly between the "treatments". Remember that the purpose of a test for differential expression in its usual sense is to find gene for which you can reject with confidence the null hypothesis that the gene is _not_at_all_ affected by the treatment, i.e., that its abundance stays exactly the same. It is quite common that this specific null hypothesis turns out to be a bit silly. All genes are connected by the cell's complicated regulatory network, and hence, there typically might not be a single gene which is not at least very slightly affected by the treatment. In essence, what a significant p value hence really means is that the effect was big enough that we can say with confidence, in which _direction_ the gene has changed. Quite often, the effect strength required to establish the direction of change is larger than the effect strength required to make the change of biological interest, and therefore, one looks at p values. If it is the other way round, one needs to also have a cut-off on fold change or do a banded test (see below). In your case, it is also clear a priori that no OTU will have the same abundance in two "treatments", given that the different sample collection approaches favour quite different species. Hence, it is no surprise that at a reasonable cut-off on the adjusted p values (say, 0.1), nearly all OTUs are significant. This means that you are in the favourable situation that you can take the estimated differences between treatments pretty much at face value. However, you want to know _how_precise they actually are, and for this, DESeq2 now reports standard errors alongside all estimated log fold changes. This data, and not the p values, is likely the more useful output for you. Simon [[alternative HTML version deleted]]
ADD REPLYlink written 5.7 years ago by Kristina M Fontanez220
hi Kristina, On Thu, Jan 16, 2014 at 1:33 PM, Kristina M Fontanez <fontanez@mit.edu>wrote: > Hi Simon- > > Thank you for your detailed reply. I agree that it is the case often in > marine microbiology that the null hypothesis of no changes between > treatments, often does not really apply. When you frame the discussion this > way, it makes sense that I am seeing large numbers of differentially > abundant taxa. > > Your discussion about how to narrow down the large number of > differentially abundant is quite interesting. If I am understanding you > correctly you are suggesting that a reasonable cutoff on adjust p-values > (0.1 or 10% false discovery rate) is simply too lenient for my data. I > agree and that is why I initially dropped to 1e-5 however that number is > really quite abritrary. > > Prior to reading your e-mail I had been considering using the base Mean of > normalized counts across samples as a primary cutoff. I believe that this > is exactly what independent Filtering does, in that it attempts to find a > specific number of normalized mean counts across samples which maximizes > the number of taxa with a FDR below alpha (0.1). However, this optimization > isn’t very effective for my data because in one comparison (Dead vs None) > 655/1596 OTUs are below the 0.1 cutoff and yet have base mean of normalized > counts less than 10. It filters the data somewhat but not to a very helpful > degree. ​I think the point is that this is not even the kind of filter that you would like. It is filtering based on the tests that the log fold changes are zero, when what you would like is ​to filter based on a test that the absolute log fold changes are larger than some threshold greater than zero. We have in fact implemented this in the development version of DESeq2 (v1.3 paired with Bioc 2.14), as arguments to the results() function, which will be released in a few months. I should resist pasting some code in this email thread, which might become out of date and is not under source control. You can try to download the Package Source, a .tar.gz file here, http://www.bioconductor.org/packages/devel/bioc/html/DESeq2.html, and then install this by filling in 'xx' with the appropriate version number you download: install.packages("DESeq2_1.3.xx.tar.gz",repos=NULL) > In order to use it effectively I would need to lower alpha to 1e-5 or > something similar, which is again - an arbitrary cutoff. > > In the end there are really two questions I need to answer: > 1) How many taxa are “truly" differentially abundant? > > 2) Of those truly differentially abundant taxa, what log2fold changes are > biologically meaningful? > > In regards to (1), it seems that based on your email and what I’ve read in > the DESeq manual that anything with a FDR of 10% ought to be regarded as > “truly” differentially abundant. Is that correct? (Notably, if you turn on > independent filtering there is a threshold selected for the base mean of > normalized cutoffs that optimizes the total number of taxa below the 0.1 > (alpha) - so in that case the mean of normalized counts is indeed taken > into account and well as the FDR). Cooks cutoff for outliers doesn’t apply > for my samples because hugely differentially abundant taxa in my treatment > are likely true biological events (microbes rapidly growing in response to > influx of nutrients) rather than technical or experimental artifacts. > > In regards to (2), I think this is where the log2fold changes but not > standard errors become useful. In my data, small standard errors (high > precision) (< 0.3) are correlated with small log2fold changes (< 3) and low > mean of normalized counts across samples (< 1000). If I choose to only keep > those taxa with small standard errors then I will be throwing away some of > the most differentially abundant taxa in my dataset (log2fold changes > 3, > mean counts > 1000 and standard errors > 0.3). That seems backwards. > Alternatively, if I choose to keep those taxa with high log2fold changes > (>3), considering those to be biologically meaningful, then I reduce the > dataset considerably. > > That leads me to the final and most difficult question: what level of > log2fold change in a microbial community is biologically relevant on the > time scales and in the ecological location in which these samples were > collected? > > I can think of one possible ways to assess this: > > We have metagenomic samples of microbial communities taken at this > location, at particular depths, going back several years (albeit sequenced > with different genomic technologies). Previous research has shown that the > microbial community in the seawater at any given depth doesn’t change very > much from year to year. If I then make the assumption that any changes in > the seawater community at a particular depth from one year to the next are > the “baseline” level of variation that can be expected, I should be able to > calculate the maximum log2fold change in abundance for any given taxa from > one year to the next. I could then use that log2fold change as my threshold > for biologically meaningful change when analyzing my treatments. > > Any insights you may have to these ideas would be greatly appreciated. > > Thank you, > Kristina > ------------------------------------------------------------------ > Kristina Fontanez, Postdoctoral Fellow > fontanez@mit.edu<mailto:fontanez@mit.edu> > Massachusetts Institute of Technology > Department of Civil and Environmental Engineering > 48-120E > 15 Vassar Street > Cambridge, MA 02139 > > > > On Jan 16, 2014, at 10:02 AM, Simon Anders <anders@embl.de<mailto:> anders@embl.de>> wrote: > > Hi Kristina > > I'll write a bit a longer reply, to advertise a kind of paradigm shift > that we are now exploring in DESeq2. > > First a note to other readers of this thread: Kristina has sent me a > private mail explaining the experiment in more detail. To sum this up > without disclosing anything, the three "treatments" labelled "none", > "life" and "dead" are not so much different treatments applied to one > kind of sample but rather three different (actually: very different) > manners in which the samples were collected, which are hence expected to > capture quite different subsets of the species that are present at the > sample collection spots. > > Therefore, it is no surprise that observed OTU abundances differ so > greatly between the "treatments". Remember that the purpose of a test > for differential expression in its usual sense is to find gene for which > you can reject with confidence the null hypothesis that the gene is > _not_at_all_ affected by the treatment, i.e., that its abundance stays > exactly the same. > > It is quite common that this specific null hypothesis turns out to be a > bit silly. All genes are connected by the cell's complicated regulatory > network, and hence, there typically might not be a single gene which is > not at least very slightly affected by the treatment. In essence, what a > significant p value hence really means is that the effect was big enough > that we can say with confidence, in which _direction_ the gene has > changed. Quite often, the effect strength required to establish the > direction of change is larger than the effect strength required to make > the change of biological interest, and therefore, one looks at p values. > If it is the other way round, one needs to also have a cut-off on fold > change or do a banded test (see below). > > In your case, it is also clear a priori that no OTU will have the same > abundance in two "treatments", given that the different sample > collection approaches favour quite different species. Hence, it is no > surprise that at a reasonable cut-off on the adjusted p values (say, > 0.1), nearly all OTUs are significant. > > This means that you are in the favourable situation that you can take > the estimated differences between treatments pretty much at face value. > However, you want to know _how_precise they actually are, and for this, > DESeq2 now reports standard errors alongside all estimated log fold > changes. This data, and not the p values, is likely the more useful > output for you. > > Simon > > > [[alternative HTML version deleted]] > > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]
ADD REPLYlink written 5.7 years ago by Michael Love25k
Hi On 16/01/14 19:33, Kristina M Fontanez wrote: > Your discussion about how to narrow down the large number of > differentially abundant is quite interesting. If I am understanding you > correctly you are suggesting that a reasonable cutoff on adjust p-values > (0.1 or 10% false discovery rate) is simply too lenient for my data. No, not at all. Cut-offs below 1% on adjusted p values are very rarely sensible. Your issue is that your _question_ is "too lenient": The test provides an answer to the following question: "Does the OTU's abundance in the sample depend on the treatment (sampling method)?" A cut-off on adjusted p values at 10% provides you with a list of OTUs for which this question can be answered with a clear "yes", such that the "yes" is erroneous for at most 10% of the OTUs. > I agree and that is why I initially dropped to 1e-5 however that number is > really quite abritrary. With such a low FDR, you reduce the expected absolute number of false positives in your list of several hundred OTUs from "less than very few" to "probably none at all". If nearly all your taxons look significant at 10% FDR, this simply means that you can say with confidence that the abundance of most OTUs in the sample depend on the sampling method. Frankly, this does not sound at all surprising, and is hence not only a plausible, but probably an entirely correct result. Only, it is not a very useful or deep insight. If you change to a lower FDR threshold, this does _not_ change the wording of the question and is hence not helpful. What you really want to know is not _which_ OTUs are affected by treatment but maybe _how_strong_ this effect is for each OTUs. This is why I wrote that you should look at the fold changes and try to find some biology in there. So, instead of making a list of OTUs, based on a "yes/no" question, you might consider a quantitive down-stream analysis, which includes _all_ OTUs, regardless of p value. For example, you could group your OTUs into larger clades and ask whether the _average_ fold change in certain clades are much larger than in the others. > In my data, small standard errors (high precision) (< 0.3) are > correlated with small log2fold changes (< 3) > and low mean of > normalized counts across samples (< 1000). If I choose to only keep > those taxa with small standard errors then I will be throwing away > some of the most differentially abundant taxa in my dataset (log2fold > changes > 3, mean counts > 1000 and standard errors > 0.3). No, you should not use the standard errors to choose _which_ genes to take. They are only useful if you do a quantitative analysis and want to know how precise your inference of fold changes actually is. > 2) Of those truly differentially abundant taxa, what log2fold changes > are biologically meaningful? This is one way, but possibly less insightful than a quantitative analysis. However, if you follow this direction then you should use DESeq2's new "thresholded" hypothesis testing that Mike mentioned. > We have metagenomic samples of microbial communities taken at this > location, at particular depths, going back several years (albeit > sequenced with different genomic technologies). Previous research has > shown that the microbial community in the seawater at any given depth > doesn?t change very much from year to year. If I then make the > assumption that any changes in the seawater community at a particular > depth from one year to the next are the ?baseline? level of variation > that can be expected, I should be able to calculate the maximum log2fold > change in abundance for any given taxa from one year to the next. I > could then use that log2fold change as my threshold for biologically > meaningful change when analyzing my treatments. No, this rather sounds like an assessment of _statistical_, not _biological_ significance. You propose to get replicate information, not using samples from different depths (as I suggested last time) but from different years. This is very useful and allows you to assess the precision of your abundance measure, i.e., how close they get to the true "population" values, which you would get from averaging over many samples from different times. This is exactly what is meant by statistical significance. Biological significance means that the effect is strong enough that it influences the system in a manner that supports the specific story you want to tell in your paper -- and hence depends on the hypothesis you want to investigate. Simon
ADD REPLYlink written 5.7 years ago by Simon Anders3.6k
Answer: DEseq2 metagenomic analysis without replicates
0
gravatar for Kristina M Fontanez
5.7 years ago by
Kristina M Fontanez220 wrote:
**I am including an earlier message which I neglected to cc to the Bioconductor forum, pasted below** Mike- Just one more short, related question - why is it that some of the rlogTransformed counts have negative values? I can certainly just add 1 to all of the values in order to make plotting relative abundance easier, but it seems strange that there should be negative values at all. Thanks, Kristina ------------------------------------------------------------------ Kristina Fontanez, Postdoctoral Fellow fontanez@mit.edu<mailto:fontanez@mit.edu> Massachusetts Institute of Technology Department of Civil and Environmental Engineering 48-120E 15 Vassar Street Cambridge, MA 02139 On Jan 16, 2014, at 5:04 PM, Kristina Fontanez <fontanez@mit.edu<mailto:fontanez@mit.edu>> wrote: Mike- I want to make sure I understand your point here correctly. Instead of using rlogTransformation, I am simply using the DESeq function to do the differential expression analysis and then subsequently exporting the normalized counts. An example code would be: > normalizedcounts<-counts(DEseqobject,normalized=TRUE) > write.csv(normalizedcounts,file="Normalizedcounts.csv”) In this case, the normalized counts use the design formula associated with the DEseqobject. In my case, that means the dispersion was estimated by treating depths as biological replications (design ~TREATMENT) as opposed to with design ~ 1. However, Simon mentioned in a previous post that I should be using the rlog-transformed data to make relative abundance heatmaps/barcharts and NOT the normalized counts, which only take into account sequencing depth. So, that gets me back to the place where I am trying to use rlogTransformation to export a matrix of rlog transformed values. I was able to do this successfully but I found that the rlog-transformed values created from 2 different DESeq objects with different designs resulted in slightly different count tables despite using rlogTransformation(DEseqobject, blind=TRUE) which should have ignored the designs. The values are different but only if you go out to several decimal places. For example: Design ~ Treatment Count value for OTU 1 in sample 1 is 3.096214077 Design ~ Treatment + Depth Count value for OTU 1 in sample 1 is 3.093644372 Why are these values different at all if blind was set to TRUE? Thanks, Kristina ------------------------------------------------------------------ Kristina Fontanez, Postdoctoral Fellow fontanez@mit.edu<mailto:fontanez@mit.edu> Massachusetts Institute of Technology Department of Civil and Environmental Engineering 48-120E 15 Vassar Street Cambridge, MA 02139 On Jan 13, 2014, at 5:27 PM, Michael Love <michaelisaiahlove@gmail.com<mailto:michaelisaiahlove@gmail.com>> wrote: hi Kristina, In the vignette we show the use of the rlogTransformation() function which takes care of everything for the user: The two functions return SummarizedExperiment objects, as the data are no longer counts. The assay function is used to extract the matrix of normalized values. rld <- rlogTransformation(dds, blind=TRUE) This by default estimates the dispersion with design of ~ 1 (as the 'blind' argument defaults to TRUE). We recommend in general estimating dispersion with ~ 1, so that the transformation is entirely unaware of the experimental design (though using the experimental design would only inform or alter the transformation in so far as the estimation of the dispersion-mean trend, so it's not a huge concern). rlogData is the lower-level function called by rlogTransformation, after everything has been "taken care of" for the user, i.e., checking if there are size factors, if not estimating them, etc. if you are surprised by the output of a function, it's always good to check the help page for the function under the Value section. For ?rlogTransformation and ?rlogData. Value for rlogTransformation, a SummarizedExperiment with assay data elements equal to log2 (qij ) = xj.βi, see formula at DESeq. for rlogData, a matrix of the same dimension as the count data, containing the transformed values. Mike On Mon, Jan 13, 2014 at 4:08 PM, Kristina M Fontanez <fontanez@mit.edu<mailto:fontanez@mit.edu>> wrote: Simon- Thank you for your insights. Replicates aren’t an option often in marine metagenomics because many of us aren’t working at depths where dropping a bucket is not a feasible option. Notably, we are working with medium-term deployments of in situ incubations where the number of sampling spots was extremely limited. It’s a first pass at a difficult issue in marine microbiology but until we have the physical infrastructure (sampling mechanisms) to get a large number of samples, we often sacrifice replicates for better coverage across depths. Looking forward, replicates is definitely a goal. The treatments are in situ incubations representing live microbes, poisoned microbes (dead) and surrounding seawater (not an incubation, can be thought of as a control). Your point about guessing the dispersions is a difficult one for me to agree with. First, the differences between the treatments are likely to far exceed the differences within the depths. Preliminary analyses in bayseq where samples are grouped by treatment bear this point out. So, I think it’s reasonable to combine the dispersions across depths so that I can compare live and dead treatments. Second, as you point out, guessing the appropriate dispersion makes the data difficult to publish given I won’t have a good reason to argue for any value. These particular marine metagenomes would be the first of their kind so there really isn’t a good reference point. Are you following some code example here? Because it looks that you are mixing up two. The result of 'rlogData' is a matrix, you don't need 'assay' to extract the matrix. Simply save the content of 'rlogData' into a CSV file, or pass the variable as is it to the heatmap function, or do something else with it. You need to use 'assay' if you have used 'rlogTransformation' instead of 'rlogData’. /snip As you suggest, I would prefer to make relative abundance bar charts/heatmaps using regularized log transformed counts while accounting for varying sequencing depth. This was my Option 2 where I tried to follow a code example I found on the Bioconductor forums<http s:="" stat.ethz.ch="" pipermail="" bioconductor="" 2013-april="" 051878.html=""> relating to use of data without replicates, which used design ~ 1. Based on your insights about exporting directly as .csv (and a previous poster - thank you Steve), I plan to use the below code to get counts normalized by sequencing depth and regularized log transformed. I will estimate the dispersions by pooling the treatments across depths. One remaining question is whether I need the design ~ 1 option. I’m still not clear on that. design(Deptreat)<- ~1 Deptreat class: DESeqDataSet dim: 1064 12 exptData(0): assays(1): counts rownames(1064): OTU1 OTU2 ... OTU1063 OTU1064 rowData metadata column names(0): colnames(12): HD5_150L HD5_150D ... HD5_500D HD9_SW_500_22 colData names(5): Treatment Depth Cmg.m2.day Nmg.m2.day Pmg.m2.day Deptreat<-estimateSizeFactors(Deptreat) Deptreat<-estimateDispersions(Deptreat,fitType="local") gene-wise dispersion estimates mean-dispersion relationship final dispersion estimates >Deptreat_rlog=rlogData(Deptreat) >write.csv(Deptreat_rlog,”Deptreat_rlog_counts.csv”) Thank you again, Kristina ------------------------------------------------------------------ Kristina Fontanez, Postdoctoral Fellow fontanez@mit.edu<mailto:fontanez@mit.edu><mailto:fontanez@mit.edu<mail to:fontanez@mit.edu="">> Massachusetts Institute of Technology Department of Civil and Environmental Engineering 48-120E 15 Vassar Street Cambridge, MA 02139 On Jan 13, 2014, at 3:25 PM, Simon Anders <anders@embl.de<mailto:ander s@embl.de=""><mailto:anders@embl.de<mailto:anders@embl.de>>> wrote: Hi On 13/01/14 20:08, Kristina Fontanez [guest] wrote: I am working with a metagenomic dataset in DESeq2 that does not have true biological replicates but has multiple factors and levels. Due to the difficulty in obtaining these samples, replicates were not an option. The study design consists of seawater microbes collected at 4 depths. Out of genuine curiosity: Why are replicates not an option? Maybe I have a wrong idea about metagenomics, but I now imagine that you went somewhere with a ship and lowered a bucket (or, maybe, a more sophisticated sample-taking apparatus ;-) ) four times, each time to a different depth, to take a sample. Why would it have been prohibitively expensive to let down the bucket twice for each depth? Even if you do all sample takings at the very same spot, you should get a reasonable estimate for variation within a depth layer, because your boat probably drifts a few hundred metres anyway in the time. For each depth, there are 3 treatments (live, dead and none). Can you explain more what treatments these are? I expect there to be true biological differences in abundance both among depths and among treatments. So, combining the samples across depths is not a great option because I expect there to be some variation in abundance of taxa among depths. The same holds true for combining samples across treatments. However, if I have to choose one I would expect there to be less variation among depths than among treatments. The goal is to assess the differences in abundance among treatments and among depths. Ultimately, I would like to create relative abundance heat maps and bar charts displaying the taxa which are differentially abundant among treatments and among depths. I would also like to cluster the samples using PCoA or nMDS to see the effects of treatment and depth on sample similarity. Is it possible to normalize these data (using variance-stabilized normalization or regularized log transformation) without choosing a study design? In my perusing of the mailing list, one suggestion was to use a design ~ 1 to estimate the dispersions so that the dispersions treat all samples as replicates. In my case, that is not an ideal choice due to the expected variation in abundance of taxa among depths and treatments. The normalization does not use a study design, nor does the VST or rlog transform. (Note that the term "normalization" here only refers to accounting for sequencing depth, and this has little to do with study design.) The respective DESeq2 functions all ignore the 'design' argument. 1) Make relative abundance heat maps and bar charts based on simple counts/library size proportions. You should base them on the rlog-transformed data, not on normalized counts. Later, to do differential abundance tests, pool samples across depths to compare the 3 treatments (4 replicates per treatment), calculating the dispersion using ~Treatment as the study design. I would be surprised if you find much this way. I don't know what the treatments are but "live" and "dead" sounds like rather dramatic differences, giving rise to very high dispersion estimates. If you are lucky, you get some results nevertheless. In the end, however, the best option for non-replicated data is to _guess_ a good value for the dispersion instead of estimating it. For example, instead of the usual "dds <- estimateDispersions(dds)", you write "dispersions(dds) <- 0.1" to set the dispersion value for all genes to your best guess, in this example 0.1. Where to get the guess from? Well, hopefully you have some idea how much these counts typically vary between two samples taken from the depth and treated the same manner. If you think, they typically vary by 50%, then set the dispersion to 0.5^2=0.25. (If you don't have any idea at all, then you really should have tried to find out experimentally.) Of course, this approach may only be used to analyse preliminary data. After all, if you mention in a paper that you used a mere guess, the reviewers will (hopefully) ask how you arrived at it. As you don't have replicates, you cannot argue from your data that your guess is a reasonable value. It may also be hard to argue from previously published data, because metagenomics is too new a field to have much to compare with, and you cannot know whether your data is of as good quality as the one you based your guess on. It should be clear that you cannot claim that a difference is related to depth or treatment if you don't know whether differences of this magnitude could as well have been found between two samples taken at the same depth and treated the same way. I have explained this in a bit more detail, because we haven't had the topic on this mailing list for a while, and it might make sense from time to time to remind newer readers why replicates (or better: data from several independent samples) are so important. 2) Normalize count data using a design ~1 option in DEseq2, export normalized counts. As far I can tell - there isn’t a documented procedure to do this in manual, vignettes or on the mailing list. The main problem here is that I can’t figure out a way to export the regularized log transformed counts. Again, the normalization has nothing to with the study design, as it _only_ normalizes for sequencing depth. You will get always the same normalized counts, which you can access with counts( dds, normalized=TRUE ) no matter what design formula you specify. One approach to (2) for the regularized log transformation is below and the variable Deptreat_rlog contains the regularized log transformed count data which is the same size as the original dataset (but the “counts†matrix is inaccessible): Are you following some code example here? Because it looks that you are mixing up two. The result of 'rlogData' is a matrix, you don't need 'assay' to extract the matrix. Simply save the content of 'rlogData' into a CSV file, or pass the variable as is it to the heatmap function, or do something else with it. You need to use 'assay' if you have used 'rlogTransformation' instead of 'rlogData'. Simon _______________________________________________ Bioconductor mailing list Bioconductor@r-project.org<mailto:bioconductor@r-project.org><mailto:b ioconductor@r-project.org<mailto:bioconductor@r-project.org="">> https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor [[alternative HTML version deleted]] _______________________________________________ Bioconductor mailing list Bioconductor@r-project.org<mailto:bioconductor@r-project.org> https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor [[alternative HTML version deleted]]
ADD COMMENTlink written 5.7 years ago by Kristina M Fontanez220
hi Kristina, Responses in line below On Thu, Jan 16, 2014 at 5:56 PM, Kristina M Fontanez <fontanez@mit.edu>wrote: > **I am including an earlier message which I neglected to cc to the > Bioconductor forum, pasted below** > > Mike- > > Just one more short, related question - why is it that some of the > rlogTransformed counts have negative values? I can certainly just add 1 to > all of the values in order to make plotting relative abundance easier, but > it seems strange that there should be negative values at all. > > ​rlog is the log2(expected count after shrinking), so if the expected count is between 0 and 1 you have a negative value.​ > Thanks, > Kristina > ------------------------------------------------------------------ > Kristina Fontanez, Postdoctoral Fellow > fontanez@mit.edu > Massachusetts Institute of Technology > Department of Civil and Environmental Engineering > 48-120E > 15 Vassar Street > Cambridge, MA 02139 > > > > On Jan 16, 2014, at 5:04 PM, Kristina Fontanez <fontanez@mit.edu> wrote: > > Mike- > > I want to make sure I understand your point here correctly. > > Instead of using rlogTransformation, I am simply using the DESeq > function to do the differential expression analysis and then subsequently > exporting the normalized counts. An example code would be: > > > normalizedcounts<-counts(DEseqobject,normalized=TRUE) > > write.csv(normalizedcounts,file="Normalizedcounts.csv”) > > In this case, the normalized counts use the design formula associated > with the DEseqobject. > > ​Normalized counts are K_ij / s_j where K_ij is the raw count for gene i and sample j and s_j is the size factor for sample j. estimated using the median ratio method. The estimation of size factors has nothing to do with the dispersion.​ > In my case, that means the dispersion was estimated by treating depths as > biological replications (design ~TREATMENT) as opposed to with design ~ 1. > However, Simon mentioned in a previous post that I should be using the > rlog-transformed data to make relative abundance heatmaps/barcharts and NOT > the normalized counts, which only take into account sequencing depth. > > So, that gets me back to the place where I am trying to use > rlogTransformation to export a matrix of rlog transformed values. I was > able to do this successfully but I found that the rlog-transformed values > created from 2 different DESeq objects with different designs resulted in > slightly different count tables despite using > rlogTransformation(DEseqobject, blind=TRUE) which should have ignored the > designs. The values are different but only if you go out to several decimal > places. > > ​The rlog function is deterministic, and I have never observed it not being deterministic given that the count matrix and the size factors are identical. So this is a mystery to me. > > For example: > > Design ~ Treatment > Count value for OTU 1 in sample 1 is > 3.096214077 > Design ~ Treatment + Depth > Count value for OTU 1 in sample 1 is > 3.093644372 > Why are these values different at all if blind was set to TRUE? > > Thanks, > Kristina > ------------------------------------------------------------------ > Kristina Fontanez, Postdoctoral Fellow > fontanez@mit.edu > Massachusetts Institute of Technology > Department of Civil and Environmental Engineering > 48-120E > 15 Vassar Street > Cambridge, MA 02139 > > > > On Jan 13, 2014, at 5:27 PM, Michael Love <michaelisaiahlove@gmail.com> > wrote: > > hi Kristina, > > In the vignette we show the use of the rlogTransformation() function > which takes care of everything for the user: > > The two functions return SummarizedExperiment objects, as the data are >> no longer counts. The assay function is used to extract the matrix of >> normalized values. > > >> rld <- rlogTransformation(dds, blind=TRUE) > > > This by default estimates the dispersion with design of ~ 1 (as the > 'blind' argument defaults to TRUE). We recommend in general estimating > dispersion with ~ 1, so that the transformation is entirely unaware of the > experimental design (though using the experimental design would only inform > or alter the transformation in so far as the estimation of the > dispersion-mean trend, so it's not a huge concern). rlogData is the > lower-level function called by rlogTransformation, after everything has > been "taken care of" for the user, i.e., checking if there are size > factors, if not estimating them, etc. > > if you are surprised by the output of a function, it's always good to > check the help page for the function under the Value section. For > ?rlogTransformation and ?rlogData. > > Value >> for rlogTransformation, a SummarizedExperiment with assay data elements >> equal to log2 (qij ) = xj.βi, see formula at DESeq. for rlogData, a matrix >> of the same dimension as the count data, containing the transformed values. >> > > > > Mike > > > > On Mon, Jan 13, 2014 at 4:08 PM, Kristina M Fontanez <fontanez@mit.edu>wrote: > >> Simon- >> >> Thank you for your insights. Replicates aren’t an option often in marine >> metagenomics because many of us aren’t working at depths where dropping a >> bucket is not a feasible option. Notably, we are working with medium-term >> deployments of in situ incubations where the number of sampling spots was >> extremely limited. It’s a first pass at a difficult issue in marine >> microbiology but until we have the physical infrastructure (sampling >> mechanisms) to get a large number of samples, we often sacrifice replicates >> for better coverage across depths. Looking forward, replicates is >> definitely a goal. >> >> The treatments are in situ incubations representing live microbes, >> poisoned microbes (dead) and surrounding seawater (not an incubation, can >> be thought of as a control). >> >> Your point about guessing the dispersions is a difficult one for me to >> agree with. First, the differences between the treatments are likely to far >> exceed the differences within the depths. Preliminary analyses in bayseq >> where samples are grouped by treatment bear this point out. So, I think >> it’s reasonable to combine the dispersions across depths so that I can >> compare live and dead treatments. Second, as you point out, guessing the >> appropriate dispersion makes the data difficult to publish given I won’t >> have a good reason to argue for any value. These particular marine >> metagenomes would be the first of their kind so there really isn’t a good >> reference point. >> >> Are you following some code example here? Because it looks that you are >> mixing up two. The result of 'rlogData' is a matrix, you don't need >> 'assay' to extract the matrix. Simply save the content of 'rlogData' >> into a CSV file, or pass the variable as is it to the heatmap function, >> or do something else with it. You need to use 'assay' if you have used >> 'rlogTransformation' instead of 'rlogData’. >> /snip >> >> As you suggest, I would prefer to make relative abundance bar >> charts/heatmaps using regularized log transformed counts while accounting >> for varying sequencing depth. This was my Option 2 where I tried to follow >> a code example I found on the Bioconductor forums< >> https://stat.ethz.ch/pipermail/bioconductor/2013-April/051878.html> >> relating to use of data without replicates, which used design ~ 1. >> >> Based on your insights about exporting directly as .csv (and a previous >> poster - thank you Steve), I plan to use the below code to get counts >> normalized by sequencing depth and regularized log transformed. I will >> estimate the dispersions by pooling the treatments across depths. One >> remaining question is whether I need the design ~ 1 option. I’m still not >> clear on that. >> >> >> design(Deptreat)<- ~1 >> Deptreat >> class: DESeqDataSet >> dim: 1064 12 >> exptData(0): >> assays(1): counts >> rownames(1064): OTU1 OTU2 ... OTU1063 OTU1064 >> rowData metadata column names(0): >> colnames(12): HD5_150L HD5_150D ... HD5_500D HD9_SW_500_22 >> colData names(5): Treatment Depth Cmg.m2.day Nmg.m2.day Pmg.m2.day >> Deptreat<-estimateSizeFactors(Deptreat) >> Deptreat<-estimateDispersions(Deptreat,fitType="local") >> gene-wise dispersion estimates >> mean-dispersion relationship >> final dispersion estimates >> >Deptreat_rlog=rlogData(Deptreat) >> >write.csv(Deptreat_rlog,”Deptreat_rlog_counts.csv”) >> >> Thank you again, >> Kristina >> >> ------------------------------------------------------------------ >> Kristina Fontanez, Postdoctoral Fellow >> fontanez@mit.edu<mailto:fontanez@mit.edu> >> Massachusetts Institute of Technology >> Department of Civil and Environmental Engineering >> 48-120E >> 15 Vassar Street >> Cambridge, MA 02139 >> >> >> >> On Jan 13, 2014, at 3:25 PM, Simon Anders <anders@embl.de<mailto:>> anders@embl.de>> wrote: >> >> Hi >> >> On 13/01/14 20:08, Kristina Fontanez [guest] wrote: >> I am working with a metagenomic dataset in DESeq2 that does not have >> true biological replicates but has multiple factors and levels. Due >> to the difficulty in obtaining these samples, replicates were not an >> option. The study design consists of seawater microbes collected at 4 >> depths. >> >> Out of genuine curiosity: Why are replicates not an option? Maybe I have >> a wrong idea about metagenomics, but I now imagine that you went >> somewhere with a ship and lowered a bucket (or, maybe, a more >> sophisticated sample-taking apparatus ;-) ) four times, each time to a >> different depth, to take a sample. Why would it have been prohibitively >> expensive to let down the bucket twice for each depth? Even if you do >> all sample takings at the very same spot, you should get a reasonable >> estimate for variation within a depth layer, because your boat probably >> drifts a few hundred metres anyway in the time. >> >> For each depth, there are 3 treatments (live, dead and none). >> >> Can you explain more what treatments these are? >> >> I expect there to be true biological differences in abundance both >> among depths and among treatments. So, combining the samples across >> depths is not a great option because I expect there to be some >> variation in abundance of taxa among depths. The same holds true for >> combining samples across treatments. However, if I have to choose one >> I would expect there to be less variation among depths than among >> treatments. >> >> The goal is to assess the differences in abundance among treatments >> and among depths. Ultimately, I would like to create relative >> abundance heat maps and bar charts displaying the taxa which are >> differentially abundant among treatments and among depths. I would >> also like to cluster the samples using PCoA or nMDS to see the >> effects of treatment and depth on sample similarity. >> >> Is it possible to normalize these data (using variance-stabilized >> normalization or regularized log transformation) without choosing a >> study design? In my perusing of the mailing list, one suggestion was >> to use a design ~ 1 to estimate the dispersions so that the >> dispersions treat all samples as replicates. In my case, that is not >> an ideal choice due to the expected variation in abundance of taxa >> among depths and treatments. >> >> The normalization does not use a study design, nor does the VST or rlog >> transform. (Note that the term "normalization" here only refers to >> accounting for sequencing depth, and this has little to do with study >> design.) The respective DESeq2 functions all ignore the 'design' argument. >> >> 1) Make relative abundance heat maps and bar charts based on simple >> counts/library size proportions. >> >> You should base them on the rlog-transformed data, not on normalized >> counts. >> >> Later, to do differential abundance >> tests, pool samples across depths to compare the 3 treatments (4 >> replicates per treatment), calculating the dispersion using >> ~Treatment as the study design. >> >> I would be surprised if you find much this way. I don't know what the >> treatments are but "live" and "dead" sounds like rather dramatic >> differences, giving rise to very high dispersion estimates. If you are >> lucky, you get some results nevertheless. >> >> In the end, however, the best option for non-replicated data is to >> _guess_ a good value for the dispersion instead of estimating it. For >> example, instead of the usual "dds <- estimateDispersions(dds)", you >> write "dispersions(dds) <- 0.1" to set the dispersion value for all >> genes to your best guess, in this example 0.1. >> >> Where to get the guess from? Well, hopefully you have some idea how much >> these counts typically vary between two samples taken from the depth and >> treated the same manner. If you think, they typically vary by 50%, then >> set the dispersion to 0.5^2=0.25. (If you don't have any idea at all, >> then you really should have tried to find out experimentally.) >> >> Of course, this approach may only be used to analyse preliminary data. >> After all, if you mention in a paper that you used a mere guess, the >> reviewers will (hopefully) ask how you arrived at it. As you don't have >> replicates, you cannot argue from your data that your guess is a >> reasonable value. It may also be hard to argue from previously published >> data, because metagenomics is too new a field to have much to compare >> with, and you cannot know whether your data is of as good quality as the >> one you based your guess on. >> >> It should be clear that you cannot claim that a difference is related to >> depth or treatment if you don't know whether differences of this >> magnitude could as well have been found between two samples taken at the >> same depth and treated the same way. >> >> I have explained this in a bit more detail, because we haven't had the >> topic on this mailing list for a while, and it might make sense from >> time to time to remind newer readers why replicates (or better: data >> from several independent samples) are so important. >> >> 2) Normalize count data using a design ~1 option in DEseq2, export >> normalized counts. As far I can tell - there isn’t a documented >> procedure to do this in manual, vignettes or on the mailing list. The >> main problem here is that I can’t figure out a way to export the >> regularized log transformed counts. >> >> Again, the normalization has nothing to with the study design, as it >> _only_ normalizes for sequencing depth. You will get always the same >> normalized counts, which you can access with >> counts( dds, normalized=TRUE ) >> no matter what design formula you specify. >> >> One approach to (2) for the regularized log transformation is below >> and the variable Deptreat_rlog contains the regularized log >> transformed count data which is the same size as the original dataset >> (but the “counts†matrix is inaccessible): >> >> Are you following some code example here? Because it looks that you are >> mixing up two. The result of 'rlogData' is a matrix, you don't need >> 'assay' to extract the matrix. Simply save the content of 'rlogData' >> into a CSV file, or pass the variable as is it to the heatmap function, >> or do something else with it. You need to use 'assay' if you have used >> 'rlogTransformation' instead of 'rlogData'. >> >> Simon >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@r-project.org<mailto:bioconductor@r-project.org> >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> >> [[alternative HTML version deleted]] >> >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > > > [[alternative HTML version deleted]]
ADD REPLYlink written 5.7 years ago by Michael Love25k
Mike and Simon, Thank you for your detailed answers to my questions. I’ve learned a great deal from this discussion thus far. Below I have summarized and added a few additional questions. In order to present relative abundance barcharts/heatmaps I should use regularized log transformed data, which transforms that data to the log2 scale and stabilizes the variance among the counts. The output is a matrix of log2 expected counts after empirical Bayes shrinkage, some of which may be negative. 1) Am I correct in assuming that this transformation DOES NOT account for sequencing depth in any way? 2) I think it is important to account for sequencing depth, as well as to control the variance, so should I be using normalized AND rlog- transformed data to make relative abundance charts? The rlog function is deterministic, and I have never observed it not being deterministic given that the count matrix and the size factors are identical. So this is a mystery to me. In regards to your above statement about the rlog function being deterministic, I will try to recreate the problem and see if it continues. Based on this discussion and Simon’s previous comments, there are 2 directions I can go in evaluating the results of my differential abundance tests. First, I can take Simon’s suggestion and try a quantitative approach. The nbinomWald test is only testing the question, are the log2fold changes among treatments different from zero? Since my treatments are so different the answer for most (thousands of) OTUs is yes. Acknowledging that this is not a useful conclusion, I will take all OTUs whose FDR is 1% or lower so that I have reduced my number of false positives to near zero. Next, I'll take the entire list of OTUs which pass that cutoff and group them into genera - comparing the average fold change for each genus to all others, including across treatments. I could make a heatmap where the columns represent the comparisons (Live vs. Seawater, Live vs. Dead, and Dead vs. Seawater) and the rows represent the average log fold changes for each genus for the specified comparison. That way, I could visualize which taxa have the most extreme log fold changes in which treatments. Second, I could take Mike’s suggestion and try the thresholded hypothesis testing approach (implemented in development version of DESeq2) in order to assess what log2fold changes might be biologically significant. According to my reading of the manual, if I want to test whether the absolute value of the log2fold change is greater than 3 I would use the following function call: results(DEseqobject, “TREATMENT_Live_vs_Seawater”, lfcThreshold=3, altHypothesis=c(“greaterAbs”),independentFiltering=TRUE,alpha=0.01 ) In the end, I think I’ll try both suggestions and see where they lead. Thanks, Kristina ------------------------------------------------------------------ Kristina Fontanez, Postdoctoral Fellow fontanez@mit.edu<mailto:fontanez@mit.edu> Massachusetts Institute of Technology Department of Civil and Environmental Engineering 48-120E 15 Vassar Street Cambridge, MA 02139 On Jan 16, 2014, at 6:37 PM, Michael Love <michaelisaiahlove@gmail.com<mailto:michaelisaiahlove@gmail.com>> wrote: hi Kristina, Responses in line below On Thu, Jan 16, 2014 at 5:56 PM, Kristina M Fontanez <fontanez@mit.edu<mailto:fontanez@mit.edu>> wrote: **I am including an earlier message which I neglected to cc to the Bioconductor forum, pasted below** Mike- Just one more short, related question - why is it that some of the rlogTransformed counts have negative values? I can certainly just add 1 to all of the values in order to make plotting relative abundance easier, but it seems strange that there should be negative values at all. ​rlog is the log2(expected count after shrinking), so if the expected count is between 0 and 1 you have a negative value.​ Thanks, Kristina ------------------------------------------------------------------ Kristina Fontanez, Postdoctoral Fellow fontanez@mit.edu<mailto:fontanez@mit.edu> Massachusetts Institute of Technology Department of Civil and Environmental Engineering 48-120E 15 Vassar Street Cambridge, MA 02139 On Jan 16, 2014, at 5:04 PM, Kristina Fontanez <fontanez@mit.edu<mailto:fontanez@mit.edu>> wrote: Mike- I want to make sure I understand your point here correctly. Instead of using rlogTransformation, I am simply using the DESeq function to do the differential expression analysis and then subsequently exporting the normalized counts. An example code would be: > normalizedcounts<-counts(DEseqobject,normalized=TRUE) > write.csv(normalizedcounts,file="Normalizedcounts.csv”) In this case, the normalized counts use the design formula associated with the DEseqobject. ​Normalized counts are K_ij / s_j where K_ij is the raw count for gene i and sample j and s_j is the size factor for sample j. estimated using the median ratio method. The estimation of size factors has nothing to do with the dispersion.​ In my case, that means the dispersion was estimated by treating depths as biological replications (design ~TREATMENT) as opposed to with design ~ 1. However, Simon mentioned in a previous post that I should be using the rlog-transformed data to make relative abundance heatmaps/barcharts and NOT the normalized counts, which only take into account sequencing depth. So, that gets me back to the place where I am trying to use rlogTransformation to export a matrix of rlog transformed values. I was able to do this successfully but I found that the rlog-transformed values created from 2 different DESeq objects with different designs resulted in slightly different count tables despite using rlogTransformation(DEseqobject, blind=TRUE) which should have ignored the designs. The values are different but only if you go out to several decimal places. ​The rlog function is deterministic, and I have never observed it not being deterministic given that the count matrix and the size factors are identical. So this is a mystery to me. For example: Design ~ Treatment Count value for OTU 1 in sample 1 is 3.096214077<tel:3.096214077> Design ~ Treatment + Depth Count value for OTU 1 in sample 1 is 3.093644372<tel:3.093644372> Why are these values different at all if blind was set to TRUE? Thanks, Kristina ------------------------------------------------------------------ Kristina Fontanez, Postdoctoral Fellow fontanez@mit.edu<mailto:fontanez@mit.edu> Massachusetts Institute of Technology Department of Civil and Environmental Engineering 48-120E 15 Vassar Street Cambridge, MA 02139 On Jan 13, 2014, at 5:27 PM, Michael Love <michaelisaiahlove@gmail.com<mailto:michaelisaiahlove@gmail.com>> wrote: hi Kristina, In the vignette we show the use of the rlogTransformation() function which takes care of everything for the user: The two functions return SummarizedExperiment objects, as the data are no longer counts. The assay function is used to extract the matrix of normalized values. rld <- rlogTransformation(dds, blind=TRUE) This by default estimates the dispersion with design of ~ 1 (as the 'blind' argument defaults to TRUE). We recommend in general estimating dispersion with ~ 1, so that the transformation is entirely unaware of the experimental design (though using the experimental design would only inform or alter the transformation in so far as the estimation of the dispersion-mean trend, so it's not a huge concern). rlogData is the lower-level function called by rlogTransformation, after everything has been "taken care of" for the user, i.e., checking if there are size factors, if not estimating them, etc. if you are surprised by the output of a function, it's always good to check the help page for the function under the Value section. For ?rlogTransformation and ?rlogData. Value for rlogTransformation, a SummarizedExperiment with assay data elements equal to log2 (qij ) = xj.βi, see formula at DESeq. for rlogData, a matrix of the same dimension as the count data, containing the transformed values. Mike On Mon, Jan 13, 2014 at 4:08 PM, Kristina M Fontanez <fontanez@mit.edu<mailto:fontanez@mit.edu>> wrote: Simon- Thank you for your insights. Replicates aren’t an option often in marine metagenomics because many of us aren’t working at depths where dropping a bucket is not a feasible option. Notably, we are working with medium-term deployments of in situ incubations where the number of sampling spots was extremely limited. It’s a first pass at a difficult issue in marine microbiology but until we have the physical infrastructure (sampling mechanisms) to get a large number of samples, we often sacrifice replicates for better coverage across depths. Looking forward, replicates is definitely a goal. The treatments are in situ incubations representing live microbes, poisoned microbes (dead) and surrounding seawater (not an incubation, can be thought of as a control). Your point about guessing the dispersions is a difficult one for me to agree with. First, the differences between the treatments are likely to far exceed the differences within the depths. Preliminary analyses in bayseq where samples are grouped by treatment bear this point out. So, I think it’s reasonable to combine the dispersions across depths so that I can compare live and dead treatments. Second, as you point out, guessing the appropriate dispersion makes the data difficult to publish given I won’t have a good reason to argue for any value. These particular marine metagenomes would be the first of their kind so there really isn’t a good reference point. Are you following some code example here? Because it looks that you are mixing up two. The result of 'rlogData' is a matrix, you don't need 'assay' to extract the matrix. Simply save the content of 'rlogData' into a CSV file, or pass the variable as is it to the heatmap function, or do something else with it. You need to use 'assay' if you have used 'rlogTransformation' instead of 'rlogData’. /snip As you suggest, I would prefer to make relative abundance bar charts/heatmaps using regularized log transformed counts while accounting for varying sequencing depth. This was my Option 2 where I tried to follow a code example I found on the Bioconductor forums<http s:="" stat.ethz.ch="" pipermail="" bioconductor="" 2013-april="" 051878.html=""> relating to use of data without replicates, which used design ~ 1. Based on your insights about exporting directly as .csv (and a previous poster - thank you Steve), I plan to use the below code to get counts normalized by sequencing depth and regularized log transformed. I will estimate the dispersions by pooling the treatments across depths. One remaining question is whether I need the design ~ 1 option. I’m still not clear on that. design(Deptreat)<- ~1 Deptreat class: DESeqDataSet dim: 1064 12 exptData(0): assays(1): counts rownames(1064): OTU1 OTU2 ... OTU1063 OTU1064 rowData metadata column names(0): colnames(12): HD5_150L HD5_150D ... HD5_500D HD9_SW_500_22 colData names(5): Treatment Depth Cmg.m2.day Nmg.m2.day Pmg.m2.day Deptreat<-estimateSizeFactors(Deptreat) Deptreat<-estimateDispersions(Deptreat,fitType="local") gene-wise dispersion estimates mean-dispersion relationship final dispersion estimates >Deptreat_rlog=rlogData(Deptreat) >write.csv(Deptreat_rlog,”Deptreat_rlog_counts.csv”) Thank you again, Kristina ------------------------------------------------------------------ Kristina Fontanez, Postdoctoral Fellow fontanez@mit.edu<mailto:fontanez@mit.edu><mailto:fontanez@mit.edu<mail to:fontanez@mit.edu="">> Massachusetts Institute of Technology Department of Civil and Environmental Engineering 48-120E 15 Vassar Street Cambridge, MA 02139 On Jan 13, 2014, at 3:25 PM, Simon Anders <anders@embl.de<mailto:ander s@embl.de=""><mailto:anders@embl.de<mailto:anders@embl.de>>> wrote: Hi On 13/01/14 20:08, Kristina Fontanez [guest] wrote: I am working with a metagenomic dataset in DESeq2 that does not have true biological replicates but has multiple factors and levels. Due to the difficulty in obtaining these samples, replicates were not an option. The study design consists of seawater microbes collected at 4 depths. Out of genuine curiosity: Why are replicates not an option? Maybe I have a wrong idea about metagenomics, but I now imagine that you went somewhere with a ship and lowered a bucket (or, maybe, a more sophisticated sample-taking apparatus ;-) ) four times, each time to a different depth, to take a sample. Why would it have been prohibitively expensive to let down the bucket twice for each depth? Even if you do all sample takings at the very same spot, you should get a reasonable estimate for variation within a depth layer, because your boat probably drifts a few hundred metres anyway in the time. For each depth, there are 3 treatments (live, dead and none). Can you explain more what treatments these are? I expect there to be true biological differences in abundance both among depths and among treatments. So, combining the samples across depths is not a great option because I expect there to be some variation in abundance of taxa among depths. The same holds true for combining samples across treatments. However, if I have to choose one I would expect there to be less variation among depths than among treatments. The goal is to assess the differences in abundance among treatments and among depths. Ultimately, I would like to create relative abundance heat maps and bar charts displaying the taxa which are differentially abundant among treatments and among depths. I would also like to cluster the samples using PCoA or nMDS to see the effects of treatment and depth on sample similarity. Is it possible to normalize these data (using variance-stabilized normalization or regularized log transformation) without choosing a study design? In my perusing of the mailing list, one suggestion was to use a design ~ 1 to estimate the dispersions so that the dispersions treat all samples as replicates. In my case, that is not an ideal choice due to the expected variation in abundance of taxa among depths and treatments. The normalization does not use a study design, nor does the VST or rlog transform. (Note that the term "normalization" here only refers to accounting for sequencing depth, and this has little to do with study design.) The respective DESeq2 functions all ignore the 'design' argument. 1) Make relative abundance heat maps and bar charts based on simple counts/library size proportions. You should base them on the rlog-transformed data, not on normalized counts. Later, to do differential abundance tests, pool samples across depths to compare the 3 treatments (4 replicates per treatment), calculating the dispersion using ~Treatment as the study design. I would be surprised if you find much this way. I don't know what the treatments are but "live" and "dead" sounds like rather dramatic differences, giving rise to very high dispersion estimates. If you are lucky, you get some results nevertheless. In the end, however, the best option for non-replicated data is to _guess_ a good value for the dispersion instead of estimating it. For example, instead of the usual "dds <- estimateDispersions(dds)", you write "dispersions(dds) <- 0.1" to set the dispersion value for all genes to your best guess, in this example 0.1. Where to get the guess from? Well, hopefully you have some idea how much these counts typically vary between two samples taken from the depth and treated the same manner. If you think, they typically vary by 50%, then set the dispersion to 0.5^2=0.25. (If you don't have any idea at all, then you really should have tried to find out experimentally.) Of course, this approach may only be used to analyse preliminary data. After all, if you mention in a paper that you used a mere guess, the reviewers will (hopefully) ask how you arrived at it. As you don't have replicates, you cannot argue from your data that your guess is a reasonable value. It may also be hard to argue from previously published data, because metagenomics is too new a field to have much to compare with, and you cannot know whether your data is of as good quality as the one you based your guess on. It should be clear that you cannot claim that a difference is related to depth or treatment if you don't know whether differences of this magnitude could as well have been found between two samples taken at the same depth and treated the same way. I have explained this in a bit more detail, because we haven't had the topic on this mailing list for a while, and it might make sense from time to time to remind newer readers why replicates (or better: data from several independent samples) are so important. 2) Normalize count data using a design ~1 option in DEseq2, export normalized counts. As far I can tell - there isn’t a documented procedure to do this in manual, vignettes or on the mailing list. The main problem here is that I can’t figure out a way to export the regularized log transformed counts. Again, the normalization has nothing to with the study design, as it _only_ normalizes for sequencing depth. You will get always the same normalized counts, which you can access with counts( dds, normalized=TRUE ) no matter what design formula you specify. One approach to (2) for the regularized log transformation is below and the variable Deptreat_rlog contains the regularized log transformed count data which is the same size as the original dataset (but the “counts†matrix is inaccessible): Are you following some code example here? Because it looks that you are mixing up two. The result of 'rlogData' is a matrix, you don't need 'assay' to extract the matrix. Simply save the content of 'rlogData' into a CSV file, or pass the variable as is it to the heatmap function, or do something else with it. You need to use 'assay' if you have used 'rlogTransformation' instead of 'rlogData'. Simon _______________________________________________ Bioconductor mailing list Bioconductor@r-project.org<mailto:bioconductor@r-project.org><mailto:b ioconductor@r-project.org<mailto:bioconductor@r-project.org="">> https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor [[alternative HTML version deleted]] _______________________________________________ Bioconductor mailing list Bioconductor@r-project.org<mailto:bioconductor@r-project.org> https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor [[alternative HTML version deleted]]
ADD REPLYlink written 5.7 years ago by Kristina M Fontanez220
hi Kristina, On Fri, Jan 17, 2014 at 10:49 AM, Kristina M Fontanez <fontanez@mit.edu>wrote: > Mike and Simon, > > Thank you for your detailed answers to my questions. I’ve learned a > great deal from this discussion thus far. Below I have summarized and added > a few additional questions. > > In order to present relative abundance barcharts/heatmaps I should use > regularized log transformed data, which transforms that data to the log2 > scale and stabilizes the variance among the counts. The output is a matrix > of log2 expected counts after empirical Bayes shrinkage, some of which may > be negative. > 1) Am I correct in assuming that this transformation DOES NOT account for > sequencing depth in any way? > ​No, it does account for sequencing. We say explicitly in the vignette and man pages that the rlog is better for accounting for size factors differences than VST. Also the formulae provided in the vignette and man pages makes this clear: rlog is defined as log2(q_ij)​ ​ E(K_ij​) = mu_ij = s_j q_ij in words, the expected count for gene i, sample j is represented by mu_ij, which is the produce of a size factor, s_j, which accounts for sequencing depth and q_ij. So q_ij therefore is the part leftover after accounting for sequencing depth. > 2) I think it is important to account for sequencing depth, as well as to > control the variance, so should I be using normalized AND rlog- transformed > data to make relative abundance charts? > > The rlog function is deterministic, and I have never observed it not > being deterministic given that the count matrix and the size factors are > identical. So this is a mystery to me. > > > In regards to your above statement about the rlog function being > deterministic, I will try to recreate the problem and see if it continues. > > > Based on this discussion and Simon’s previous comments, there are 2 > directions I can go in evaluating the results of my differential abundance > tests. > > First, I can take Simon’s suggestion and try a quantitative approach. > The nbinomWald test is only testing the question, are the log2fold changes > among treatments different from zero? Since my treatments are so different > the answer for most (thousands of) OTUs is yes. Acknowledging that this is > not a useful conclusion, I will take all OTUs whose FDR is 1% or lower so > that I have reduced my number of false positives to near zero. Next, I'll > take the entire list of OTUs which pass that cutoff and group them into > genera - comparing the average fold change for each genus to all others, > including across treatments. I could make a heatmap where the columns > represent the comparisons (Live vs. Seawater, Live vs. Dead, and Dead vs. > Seawater) and the rows represent the average log fold changes for each > genus for the specified comparison. That way, I could visualize which taxa > have the most extreme log fold changes in which treatments. > > Second, I could take Mike’s suggestion and try the thresholded > hypothesis testing approach (implemented in development version of DESeq2) > in order to assess what log2fold changes might be biologically significant. > According to my reading of the manual, if I want to test whether the > absolute value of the log2fold change is greater than 3 I would use the > following function call: > > results(DEseqobject, “TREATMENT_Live_vs_Seawater”, lfcThreshold=3, > altHypothesis=c(“greaterAbs”),independentFiltering=TRUE,alpha=0. 01) > > In the end, I think I’ll try both suggestions and see where they lead. > > Thanks, > Kristina > ------------------------------------------------------------------ > Kristina Fontanez, Postdoctoral Fellow > fontanez@mit.edu > Massachusetts Institute of Technology > Department of Civil and Environmental Engineering > 48-120E > 15 Vassar Street > Cambridge, MA 02139 > > > > On Jan 16, 2014, at 6:37 PM, Michael Love <michaelisaiahlove@gmail.com> > wrote: > > hi Kristina, > > Responses in line below > > > On Thu, Jan 16, 2014 at 5:56 PM, Kristina M Fontanez <fontanez@mit.edu>wrote: > >> **I am including an earlier message which I neglected to cc to the >> Bioconductor forum, pasted below** >> >> Mike- >> >> Just one more short, related question - why is it that some of the >> rlogTransformed counts have negative values? I can certainly just add 1 to >> all of the values in order to make plotting relative abundance easier, but >> it seems strange that there should be negative values at all. >> >> > ​rlog is the log2(expected count after shrinking), so if the expected > count is between 0 and 1 you have a negative value.​ > > > >> Thanks, >> Kristina >> ------------------------------------------------------------------ >> Kristina Fontanez, Postdoctoral Fellow >> fontanez@mit.edu >> Massachusetts Institute of Technology >> Department of Civil and Environmental Engineering >> 48-120E >> 15 Vassar Street >> Cambridge, MA 02139 >> >> >> >> On Jan 16, 2014, at 5:04 PM, Kristina Fontanez <fontanez@mit.edu> wrote: >> >> Mike- >> >> I want to make sure I understand your point here correctly. >> >> Instead of using rlogTransformation, I am simply using the DESeq >> function to do the differential expression analysis and then subsequently >> exporting the normalized counts. An example code would be: >> >> > normalizedcounts<-counts(DEseqobject,normalized=TRUE) >> > write.csv(normalizedcounts,file="Normalizedcounts.csv”) >> >> In this case, the normalized counts use the design formula associated >> with the DEseqobject. >> >> > ​Normalized counts are K_ij / s_j where K_ij is the raw count for gene i > and sample j and s_j is the size factor for sample j. estimated using the > median ratio method. The estimation of size factors has nothing to do with > the dispersion.​ > > > >> In my case, that means the dispersion was estimated by treating >> depths as biological replications (design ~TREATMENT) as opposed to with >> design ~ 1. However, Simon mentioned in a previous post that I should be >> using the rlog-transformed data to make relative abundance >> heatmaps/barcharts and NOT the normalized counts, which only take into >> account sequencing depth. >> >> So, that gets me back to the place where I am trying to use >> rlogTransformation to export a matrix of rlog transformed values. I was >> able to do this successfully but I found that the rlog-transformed values >> created from 2 different DESeq objects with different designs resulted in >> slightly different count tables despite using >> rlogTransformation(DEseqobject, blind=TRUE) which should have ignored the >> designs. The values are different but only if you go out to several decimal >> places. >> >> > ​The rlog function is deterministic, and I have never observed it not > being deterministic given that the count matrix and the size factors are > identical. So this is a mystery to me. > > > >> >> For example: >> >> Design ~ Treatment >> Count value for OTU 1 in sample 1 is >> 3.096214077 >> Design ~ Treatment + Depth >> Count value for OTU 1 in sample 1 is >> 3.093644372 >> Why are these values different at all if blind was set to TRUE? >> >> Thanks, >> Kristina >> ------------------------------------------------------------------ >> Kristina Fontanez, Postdoctoral Fellow >> fontanez@mit.edu >> Massachusetts Institute of Technology >> Department of Civil and Environmental Engineering >> 48-120E >> 15 Vassar Street >> Cambridge, MA 02139 >> >> >> >> On Jan 13, 2014, at 5:27 PM, Michael Love <michaelisaiahlove@gmail.com> >> wrote: >> >> hi Kristina, >> >> In the vignette we show the use of the rlogTransformation() function >> which takes care of everything for the user: >> >> The two functions return SummarizedExperiment objects, as the data are >>> no longer counts. The assay function is used to extract the matrix of >>> normalized values. >> >> >>> rld <- rlogTransformation(dds, blind=TRUE) >> >> >> This by default estimates the dispersion with design of ~ 1 (as the >> 'blind' argument defaults to TRUE). We recommend in general estimating >> dispersion with ~ 1, so that the transformation is entirely unaware of the >> experimental design (though using the experimental design would only inform >> or alter the transformation in so far as the estimation of the >> dispersion-mean trend, so it's not a huge concern). rlogData is the >> lower-level function called by rlogTransformation, after everything has >> been "taken care of" for the user, i.e., checking if there are size >> factors, if not estimating them, etc. >> >> if you are surprised by the output of a function, it's always good to >> check the help page for the function under the Value section. For >> ?rlogTransformation and ?rlogData. >> >> Value >>> for rlogTransformation, a SummarizedExperiment with assay data elements >>> equal to log2 (qij ) = xj.βi, see formula at DESeq. for rlogData, a matrix >>> of the same dimension as the count data, containing the transformed values. >>> >> >> >> >> Mike >> >> >> >> On Mon, Jan 13, 2014 at 4:08 PM, Kristina M Fontanez <fontanez@mit.edu>wrote: >> >>> Simon- >>> >>> Thank you for your insights. Replicates aren’t an option often in marine >>> metagenomics because many of us aren’t working at depths where dropping a >>> bucket is not a feasible option. Notably, we are working with medium-term >>> deployments of in situ incubations where the number of sampling spots was >>> extremely limited. It’s a first pass at a difficult issue in marine >>> microbiology but until we have the physical infrastructure (sampling >>> mechanisms) to get a large number of samples, we often sacrifice replicates >>> for better coverage across depths. Looking forward, replicates is >>> definitely a goal. >>> >>> The treatments are in situ incubations representing live microbes, >>> poisoned microbes (dead) and surrounding seawater (not an incubation, can >>> be thought of as a control). >>> >>> Your point about guessing the dispersions is a difficult one for me to >>> agree with. First, the differences between the treatments are likely to far >>> exceed the differences within the depths. Preliminary analyses in bayseq >>> where samples are grouped by treatment bear this point out. So, I think >>> it’s reasonable to combine the dispersions across depths so that I can >>> compare live and dead treatments. Second, as you point out, guessing the >>> appropriate dispersion makes the data difficult to publish given I won’t >>> have a good reason to argue for any value. These particular marine >>> metagenomes would be the first of their kind so there really isn’t a good >>> reference point. >>> >>> Are you following some code example here? Because it looks that you are >>> mixing up two. The result of 'rlogData' is a matrix, you don't need >>> 'assay' to extract the matrix. Simply save the content of 'rlogData' >>> into a CSV file, or pass the variable as is it to the heatmap function, >>> or do something else with it. You need to use 'assay' if you have used >>> 'rlogTransformation' instead of 'rlogData’. >>> /snip >>> >>> As you suggest, I would prefer to make relative abundance bar >>> charts/heatmaps using regularized log transformed counts while accounting >>> for varying sequencing depth. This was my Option 2 where I tried to follow >>> a code example I found on the Bioconductor forums< >>> https://stat.ethz.ch/pipermail/bioconductor/2013-April/051878.html> >>> relating to use of data without replicates, which used design ~ 1. >>> >>> Based on your insights about exporting directly as .csv (and a previous >>> poster - thank you Steve), I plan to use the below code to get counts >>> normalized by sequencing depth and regularized log transformed. I will >>> estimate the dispersions by pooling the treatments across depths. One >>> remaining question is whether I need the design ~ 1 option. I’m still not >>> clear on that. >>> >>> >>> design(Deptreat)<- ~1 >>> Deptreat >>> class: DESeqDataSet >>> dim: 1064 12 >>> exptData(0): >>> assays(1): counts >>> rownames(1064): OTU1 OTU2 ... OTU1063 OTU1064 >>> rowData metadata column names(0): >>> colnames(12): HD5_150L HD5_150D ... HD5_500D HD9_SW_500_22 >>> colData names(5): Treatment Depth Cmg.m2.day Nmg.m2.day Pmg.m2.day >>> Deptreat<-estimateSizeFactors(Deptreat) >>> Deptreat<-estimateDispersions(Deptreat,fitType="local") >>> gene-wise dispersion estimates >>> mean-dispersion relationship >>> final dispersion estimates >>> >Deptreat_rlog=rlogData(Deptreat) >>> >write.csv(Deptreat_rlog,”Deptreat_rlog_counts.csv”) >>> >>> Thank you again, >>> Kristina >>> >>> ------------------------------------------------------------------ >>> Kristina Fontanez, Postdoctoral Fellow >>> fontanez@mit.edu<mailto:fontanez@mit.edu> >>> Massachusetts Institute of Technology >>> Department of Civil and Environmental Engineering >>> 48-120E >>> 15 Vassar Street >>> Cambridge, MA 02139 >>> >>> >>> >>> On Jan 13, 2014, at 3:25 PM, Simon Anders <anders@embl.de<mailto:>>> anders@embl.de>> wrote: >>> >>> Hi >>> >>> On 13/01/14 20:08, Kristina Fontanez [guest] wrote: >>> I am working with a metagenomic dataset in DESeq2 that does not have >>> true biological replicates but has multiple factors and levels. Due >>> to the difficulty in obtaining these samples, replicates were not an >>> option. The study design consists of seawater microbes collected at 4 >>> depths. >>> >>> Out of genuine curiosity: Why are replicates not an option? Maybe I have >>> a wrong idea about metagenomics, but I now imagine that you went >>> somewhere with a ship and lowered a bucket (or, maybe, a more >>> sophisticated sample-taking apparatus ;-) ) four times, each time to a >>> different depth, to take a sample. Why would it have been prohibitively >>> expensive to let down the bucket twice for each depth? Even if you do >>> all sample takings at the very same spot, you should get a reasonable >>> estimate for variation within a depth layer, because your boat probably >>> drifts a few hundred metres anyway in the time. >>> >>> For each depth, there are 3 treatments (live, dead and none). >>> >>> Can you explain more what treatments these are? >>> >>> I expect there to be true biological differences in abundance both >>> among depths and among treatments. So, combining the samples across >>> depths is not a great option because I expect there to be some >>> variation in abundance of taxa among depths. The same holds true for >>> combining samples across treatments. However, if I have to choose one >>> I would expect there to be less variation among depths than among >>> treatments. >>> >>> The goal is to assess the differences in abundance among treatments >>> and among depths. Ultimately, I would like to create relative >>> abundance heat maps and bar charts displaying the taxa which are >>> differentially abundant among treatments and among depths. I would >>> also like to cluster the samples using PCoA or nMDS to see the >>> effects of treatment and depth on sample similarity. >>> >>> Is it possible to normalize these data (using variance-stabilized >>> normalization or regularized log transformation) without choosing a >>> study design? In my perusing of the mailing list, one suggestion was >>> to use a design ~ 1 to estimate the dispersions so that the >>> dispersions treat all samples as replicates. In my case, that is not >>> an ideal choice due to the expected variation in abundance of taxa >>> among depths and treatments. >>> >>> The normalization does not use a study design, nor does the VST or rlog >>> transform. (Note that the term "normalization" here only refers to >>> accounting for sequencing depth, and this has little to do with study >>> design.) The respective DESeq2 functions all ignore the 'design' >>> argument. >>> >>> 1) Make relative abundance heat maps and bar charts based on simple >>> counts/library size proportions. >>> >>> You should base them on the rlog-transformed data, not on normalized >>> counts. >>> >>> Later, to do differential abundance >>> tests, pool samples across depths to compare the 3 treatments (4 >>> replicates per treatment), calculating the dispersion using >>> ~Treatment as the study design. >>> >>> I would be surprised if you find much this way. I don't know what the >>> treatments are but "live" and "dead" sounds like rather dramatic >>> differences, giving rise to very high dispersion estimates. If you are >>> lucky, you get some results nevertheless. >>> >>> In the end, however, the best option for non-replicated data is to >>> _guess_ a good value for the dispersion instead of estimating it. For >>> example, instead of the usual "dds <- estimateDispersions(dds)", you >>> write "dispersions(dds) <- 0.1" to set the dispersion value for all >>> genes to your best guess, in this example 0.1. >>> >>> Where to get the guess from? Well, hopefully you have some idea how much >>> these counts typically vary between two samples taken from the depth and >>> treated the same manner. If you think, they typically vary by 50%, then >>> set the dispersion to 0.5^2=0.25. (If you don't have any idea at all, >>> then you really should have tried to find out experimentally.) >>> >>> Of course, this approach may only be used to analyse preliminary data. >>> After all, if you mention in a paper that you used a mere guess, the >>> reviewers will (hopefully) ask how you arrived at it. As you don't have >>> replicates, you cannot argue from your data that your guess is a >>> reasonable value. It may also be hard to argue from previously published >>> data, because metagenomics is too new a field to have much to compare >>> with, and you cannot know whether your data is of as good quality as the >>> one you based your guess on. >>> >>> It should be clear that you cannot claim that a difference is related to >>> depth or treatment if you don't know whether differences of this >>> magnitude could as well have been found between two samples taken at the >>> same depth and treated the same way. >>> >>> I have explained this in a bit more detail, because we haven't had the >>> topic on this mailing list for a while, and it might make sense from >>> time to time to remind newer readers why replicates (or better: data >>> from several independent samples) are so important. >>> >>> 2) Normalize count data using a design ~1 option in DEseq2, export >>> normalized counts. As far I can tell - there isn’t a documented >>> procedure to do this in manual, vignettes or on the mailing list. The >>> main problem here is that I can’t figure out a way to export the >>> regularized log transformed counts. >>> >>> Again, the normalization has nothing to with the study design, as it >>> _only_ normalizes for sequencing depth. You will get always the same >>> normalized counts, which you can access with >>> counts( dds, normalized=TRUE ) >>> no matter what design formula you specify. >>> >>> One approach to (2) for the regularized log transformation is below >>> and the variable Deptreat_rlog contains the regularized log >>> transformed count data which is the same size as the original dataset >>> (but the “counts†matrix is inaccessible): >>> >>> Are you following some code example here? Because it looks that you are >>> mixing up two. The result of 'rlogData' is a matrix, you don't need >>> 'assay' to extract the matrix. Simply save the content of 'rlogData' >>> into a CSV file, or pass the variable as is it to the heatmap function, >>> or do something else with it. You need to use 'assay' if you have used >>> 'rlogTransformation' instead of 'rlogData'. >>> >>> Simon >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor@r-project.org<mailto:bioconductor@r-project.org> >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >>> >>> [[alternative HTML version deleted]] >>> >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor@r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> >> >> >> > > [[alternative HTML version deleted]]
ADD REPLYlink written 5.7 years ago by Michael Love25k
Hi Mike- Thanks for the clarification below - I don’t know how I missed that. ​No, it does account for sequencing. We say explicitly in the vignette and man pages that the rlog is better for accounting for size factors differences than VST. Also the formulae provided in the vignette and man pages makes this clear: rlog is defined as log2(q_ij)​ ​ E(K_ij​) = mu_ij = s_j q_ij in words, the expected count for gene i, sample j is represented by mu_ij, which is the produce of a size factor, s_j, which accounts for sequencing depth and q_ij. So q_ij therefore is the part leftover after accounting for sequencing depth. Thanks, Kristina ------------------------------------------------------------------ Kristina Fontanez, Postdoctoral Fellow fontanez@mit.edu<mailto:fontanez@mit.edu> Massachusetts Institute of Technology Department of Civil and Environmental Engineering 48-120E 15 Vassar Street Cambridge, MA 02139 [[alternative HTML version deleted]]
ADD REPLYlink written 5.7 years ago by Kristina M Fontanez220
Hi Mike- Is there any reason NOT to convert the log2 counts back into expected counts using 2^(expected count after shrinking)? I want to avoid the negative values and by converting the data back into expected counts I can easily create relative abundance plots by calculating expected count for feature A in sample 1/total counts for all features in sample 1. Thanks, Kristina On Jan 16, 2014, at 6:37 PM, Michael Love <michaelisaiahlove@gmail.com<mailto:michaelisaiahlove@gmail.com>> wrote: On Thu, Jan 16, 2014 at 5:56 PM, Kristina M Fontanez <fontanez@mit.edu<mailto:fontanez@mit.edu>> wrote: **I am including an earlier message which I neglected to cc to the Bioconductor forum, pasted below** Mike- Just one more short, related question - why is it that some of the rlogTransformed counts have negative values? I can certainly just add 1 to all of the values in order to make plotting relative abundance easier, but it seems strange that there should be negative values at all. ​rlog is the log2(expected count after shrinking), so if the expected count is between 0 and 1 you have a negative value.​ ------------------------------------------------------------------ Kristina Fontanez, Postdoctoral Fellow fontanez@mit.edu<mailto:fontanez@mit.edu> Massachusetts Institute of Technology Department of Civil and Environmental Engineering 48-120E 15 Vassar Street Cambridge, MA 02139 [[alternative HTML version deleted]]
ADD REPLYlink written 5.7 years ago by Kristina M Fontanez220
Hi Kristina, That sounds reasonable to me. Mike On Jan 17, 2014 3:51 PM, "Kristina M Fontanez" <fontanez@mit.edu> wrote: > > Hi Mike- > > Is there any reason NOT to convert the log2 counts back into expected > counts using 2^(expected count after shrinking)? I want to avoid the > negative values and by converting the data back into expected counts I can > easily create relative abundance plots by calculating expected count for > feature A in sample 1/total counts for all features in sample 1. > > Thanks, > Kristina > > > On Jan 16, 2014, at 6:37 PM, Michael Love <michaelisaiahlove@gmail.com> > wrote: > > On Thu, Jan 16, 2014 at 5:56 PM, Kristina M Fontanez <fontanez@mit.edu> > wrote: > >> **I am including an earlier message which I neglected to cc to the >> Bioconductor forum, pasted below** >> >> Mike- >> >> Just one more short, related question - why is it that some of the >> rlogTransformed counts have negative values? I can certainly just add 1 to >> all of the values in order to make plotting relative abundance easier, but >> it seems strange that there should be negative values at all. >> >> > ​rlog is the log2(expected count after shrinking), so if the expected > count is between 0 and 1 you have a negative value.​ > > > > > > ------------------------------------------------------------------ > Kristina Fontanez, Postdoctoral Fellow > fontanez@mit.edu > Massachusetts Institute of Technology > Department of Civil and Environmental Engineering > 48-120E > 15 Vassar Street > Cambridge, MA 02139 > [[alternative HTML version deleted]]
ADD REPLYlink written 5.7 years ago by Michael Love25k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 151 users visited in the last hour