Question: DEseq2 metagenomic analysis without replicates
0
5.4 years ago by
Guest User12k
Guest User12k wrote:
Hello- I am working with a metagenomic dataset in DESeq2 that does not have true biological replicates but has multiple factors and levels. Due to the difficulty in obtaining these samples, replicates were not an option. The study design consists of seawater microbes collected at 4 depths. For each depth, there are 3 treatments (live, dead and none). I expect there to be true biological differences in abundance both among depths and among treatments. So, combining the samples across depths is not a great option because I expect there to be some variation in abundance of taxa among depths. The same holds true for combining samples across treatments. However, if I have to choose one I would expect there to be less variation among depths than among treatments. The goal is to assess the differences in abundance among treatments and among depths. Ultimately, I would like to create relative abundance heat maps and bar charts displaying the taxa which are differentially abundant among treatments and among depths. I would also like to cluster the samples using PCoA or nMDS to see the effects of treatment and depth on sample similarity. Is it possible to normalize these data (using variance-stabilized normalization or regularized log transformation) without choosing a study design? In my perusing of the mailing list, one suggestion was to use a design ~ 1 to estimate the dispersions so that the dispersions treat all samples as replicates. In my case, that is not an ideal choice due to the expected variation in abundance of taxa among depths and treatments. So, I see two options for moving forward that I would like feedback on: 1) Make relative abundance heat maps and bar charts based on simple counts/library size proportions. Later, to do differential abundance tests, pool samples across depths to compare the 3 treatments (4 replicates per treatment), calculating the dispersion using ~Treatment as the study design. 2) Normalize count data using a design ~1 option in DEseq2, export normalized counts. As far I can tell - there isn???t a documented procedure to do this in manual, vignettes or on the mailing list. The main problem here is that I can???t figure out a way to export the regularized log transformed counts. One approach to (2) for the regularized log transformation is below and the variable Deptreat_rlog contains the regularized log transformed count data which is the same size as the original dataset (but the ???counts??? matrix is inaccessible): > design(Deptreat)<- ~1 > Deptreat class: DESeqDataSet dim: 1064 12 exptData(0): assays(1): counts rownames(1064): OTU1 OTU2 ... OTU1063 OTU1064 rowData metadata column names(0): colnames(12): HD5_150L HD5_150D ... HD5_500D HD9_SW_500_22 colData names(5): Treatment Depth Cmg.m2.day Nmg.m2.day Pmg.m2.day > Deptreat<-estimateSizeFactors(Deptreat) > Deptreat<-estimateDispersions(Deptreat,fitType="local") gene-wise dispersion estimates mean-dispersion relationship final dispersion estimates > plotDispEsts(Deptreat) <dispestplot.jpeg> > Deptreat_rlog=rlogData(Deptreat) > dim(Deptreat_rlog) [1] 1064 12 > assay(Deptreat_rlog) Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ???assay??? for signature ???"matrix", "missing"??? > traceback() 3: stop(gettextf("unable to find an inherited method for function %s for signature %s", sQuote(fdef at generic), sQuote(cnames)), domain = NA) 2: (function (classes, fdef, mtable) { methods <- .findInheritedMethods(classes, fdef, mtable) if (length(methods) == 1L) return(methods[[1L]]) else if (length(methods) == 0L) { cnames <- paste0("\"", sapply(classes, as.character), "\"", collapse = ", ") stop(gettextf("unable to find an inherited method for function %s for signature %s", sQuote(fdef at generic), sQuote(cnames)), domain = NA) } else stop("Internal error in finding inherited methods; didn't return a unique method", domain = NA) })(list("matrix", "missing"), function (x, i, ...) standardGeneric("assay"), <environment>) 1: assay(Deptreat_rlog) Thank you for your insights! Kristina -- output of sessionInfo(): > sessionInfo() R version 3.0.2 (2013-09-25) Platform: x86_64-apple-darwin10.8.0 (64-bit) locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] parallel stats graphics grDevices utils datasets methods [8] base other attached packages: [1] DESeq2_1.2.5 RcppArmadillo_0.3.920.1 Rcpp_0.10.6 [4] GenomicRanges_1.14.3 XVector_0.2.0 IRanges_1.20.5 [7] BiocGenerics_0.8.0 ggplot2_0.9.3.1 scales_0.2.3 [10] phyloseq_1.7.9 loaded via a namespace (and not attached): [1] ade4_1.5-2 annotate_1.40.0 AnnotationDbi_1.24.0 [4] ape_3.0-11 Biobase_2.22.0 biom_0.3.10 [7] Biostrings_2.30.1 cluster_1.14.4 codetools_0.2-8 [10] colorspace_1.2-4 DBI_0.2-7 dichromat_2.0-0 [13] digest_0.6.3 foreach_1.4.1 genefilter_1.44.0 [16] grid_3.0.2 gtable_0.1.2 igraph_0.6.6 [19] iterators_1.0.6 labeling_0.2 lattice_0.20-24 [22] locfit_1.5-9.1 MASS_7.3-29 Matrix_1.1-0 [25] multtest_2.18.0 munsell_0.4.2 nlme_3.1-113 [28] permute_0.7-0 plyr_1.8 proto_0.3-10 [31] RColorBrewer_1.0-5 reshape2_1.2.2 RJSONIO_1.0-3 [34] RSQLite_0.11.4 splines_3.0.2 stats4_3.0.2 [37] stringr_0.6.2 survival_2.37-4 tools_3.0.2 [40] vegan_2.0-9 XML_3.95-0.2 xtable_1.7-1 -- Sent via the guest posting facility at bioconductor.org.
deseq2 • 2.5k views
modified 5.3 years ago by Kristina M Fontanez220 • written 5.4 years ago by Guest User12k
Answer: DEseq2 metagenomic analysis without replicates
0
5.4 years ago by
Denali
Steve Lianoglou12k wrote:
Hi Kristina, Just a quick comment regarding your trouble w/ exporting the regularized data you mention here: On Mon, Jan 13, 2014 at 11:08 AM, Kristina Fontanez [guest] <guest at="" bioconductor.org=""> wrote: [snip] > 2) Normalize count data using a design ~1 option in DEseq2, export normalized counts. As far I can tell - there isn???t a documented procedure to do this in manual, vignettes or on the mailing list. The main problem here is that I can???t figure out a way to export the regularized log transformed counts. [/snip] You've already done it. You are calling rlogData on your DESeqDataSet here: >> Deptreat_rlog=rlogData(Deptreat) >> dim(Deptreat_rlog) > [1] 1064 12 >> assay(Deptreat_rlog) > Error in (function (classes, fdef, mtable) : > unable to find an inherited method for function ???assay??? for signature ???"matrix", "missing"??? You are getting an error because Deptreat_rlog is *already* a matrix of regularized counts (in log2) ... job done ... as in: this is the matrix you are looking for. See ?rlogData for more details, but ... there's not much more to it than that. HTH, -steve -- Steve Lianoglou Computational Biologist Genentech
Answer: DEseq2 metagenomic analysis without replicates
0
5.4 years ago by
Simon Anders3.5k
Zentrum für Molekularbiologie, Universität Heidelberg
Simon Anders3.5k wrote:
Simon- I agree that for the treatment effect, combining the dispersions across depths is the way to proceed with these data. I think your reasoning for depth as an ordered factor is sound and likely applies to the depth profile I am working with. I attempted to follow your suggestion (snipped below) to create a design that allows me to use depth as an ordered factor. However, I found it confusing that the design is ~treatment + depth. With depth as the second variable, doesnt that mean that the dispersions will be combined across treatments? I thought the goal was the combine dispersions across depths so that the treatments could be compared. To do this with DESeq2, specify depth in you sample table not as a factor, but as numerical vector, e.g. by putting the depth in metres. The estimated log2 fold change can then be interpreted as reciprocal halving/doubling distances: A log2 fold change of +0.2, for example, would mean that the taxon's abundance doubles every 5 metres of depth (1/0.2=5), -0.4 would mean that it's abundance halves every 2.5 metres. If this assumption of exponential change is not too far from reality (and at least for light, it is true, and hence maybe also for light-dependent microbes), you will get valid p values. Your design for this should then be: ~ treatment + depth I should caution that we haven't tested regression on continuous variables much, so if it does not work, ask again. It should work but may need some tweaking, especially you may need to switch off coefficient shrinkage. /snip In the code below I used a ~TREATMENT+DEPTH design with depth as numerical vectors and the None (seawater) treatment as the base level of the treatment factor. My main concern is that the output results in hundreds of differentially abundant taxa (344 OTUs for the Dead vs Seawater comparison with p-adjusted values <1e-5). These results include both independent Filtering and the cooksCutoff (by default). Given the really high numbers I am inclined to think that a large number of these are false positives and we have modeled the dispersion incorrectly. What do you think of this assessment? > jam class: DESeqDataSet dim: 2151 12 exptData(0): assays(1): counts rownames(2151): OTU1 OTU2 ... OTU2150 OTU2151 rowData metadata column names(0): colnames(12): HD5_150L HD5_150D ... HD9_SW_300_22 HD9_SW_500_22 colData names(5): DEPTH TREATMENT Cmg.m2.day Nmg.m2.day Pmg.m2.day > design(jam) ~TREATMENT + DEPTH > colData(jam)$TREATMENT<-relevel(colData(jam)$TREATMENT,"None") > colData(jam)$DEPTH<-relevel(colData(jam)$DEPTH,"150") Error in relevel.default(colData(jam)$DEPTH, "150") : 'relevel' only for factors > jam<-estimateSizeFactors(jam) > jam<-estimateDispersions(jam,fitType="local") gene-wise dispersion estimates mean-dispersion relationship final dispersion estimates > plotDispEsts(jam) > jam<-nbinomWaldTest(jam) > resultsNames(jam) [1] "Intercept" "TREATMENT_Dead_vs_None" "TREATMENT_Live_vs_None" [4] "DEPTH" > DvN_clean<-results(jam,"TREATMENT_Dead_vs_None",independentFiltering =TRUE) > DvN_clean<-DvN_clean[order(DvN_clean$padj,na.last=NA),] > write.csv(DvN_clean,file="Dead_vs_Seawater.TreatDepth.iF_true.cook_t rue.csv") Thanks, Kristina > sessionInfo() R version 3.0.2 (2013-09-25) Platform: x86_64-apple-darwin10.8.0 (64-bit) locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] parallel stats graphics grDevices utils datasets methods [8] base other attached packages: [1] DESeq2_1.2.5 RcppArmadillo_0.3.920.1 Rcpp_0.10.6 [4] snow_0.3-13 baySeq_1.16.0 GenomicRanges_1.14.3 [7] XVector_0.2.0 IRanges_1.20.5 BiocGenerics_0.8.0 [10] ggplot2_0.9.3.1 scales_0.2.3 phyloseq_1.7.9 loaded via a namespace (and not attached): [1] ade4_1.5-2 annotate_1.40.0 AnnotationDbi_1.24.0 [4] ape_3.0-11 Biobase_2.22.0 biom_0.3.10 [7] Biostrings_2.30.1 cluster_1.14.4 codetools_0.2-8 [10] colorspace_1.2-4 DBI_0.2-7 dichromat_2.0-0 [13] digest_0.6.3 foreach_1.4.1 genefilter_1.44.0 [16] grid_3.0.2 gtable_0.1.2 igraph_0.6.6 [19] iterators_1.0.6 labeling_0.2 lattice_0.20-24 [22] locfit_1.5-9.1 MASS_7.3-29 Matrix_1.1-0 [25] multtest_2.18.0 munsell_0.4.2 nlme_3.1-113 [28] permute_0.7-0 plyr_1.8 proto_0.3-10 [31] RColorBrewer_1.0-5 reshape2_1.2.2 RJSONIO_1.0-3 [34] RSQLite_0.11.4 splines_3.0.2 stats4_3.0.2 [37] stringr_0.6.2 survival_2.37-4 tools_3.0.2 [40] vegan_2.0-9 XML_3.95-0.2 xtable_1.7-1 ------------------------------------------------------------------ Kristina Fontanez, Postdoctoral Fellow fontanez@mit.edu<mailto:fontanez@mit.edu> Massachusetts Institute of Technology Department of Civil and Environmental Engineering 48-120E 15 Vassar Street Cambridge, MA 02139 On Jan 13, 2014, at 5:54 PM, Simon Anders <anders@embl.de<mailto:anders@embl.de>> wrote: Hi Kristina On 13/01/14 22:08, Kristina M Fontanez wrote: Thank you for your insights. Replicates arent an option often in marine metagenomics because many of us arent working at depths where dropping a bucket is not a feasible option. Notably, we are working with medium-term deployments of in situ incubations where the number of sampling spots was extremely limited. Its a first pass at a difficult issue in marine microbiology but until we have the physical infrastructure (sampling mechanisms) to get a large number of samples, we often sacrifice replicates for better coverage across depths. Looking forward, replicates is definitely a goal. First, you may (or may not) have a valid substitute for replication, see below, and if so, there might be little to complain about your design. Of course, doing all four depths twice might be overkill for a first try. If I had money for four samples, I might have opted for three depths and taken, say, the middle depth twice, to get some idea how much variation there can be. Precisely because this is a pilot study and a first try for a new field, it seems important to me to establish that the measurements for a given depth are stable and reproducible, _before_ rolling out a bigger experiment, and this is why I'm always so puzzled that people consider replicates as important for follow-ups but not for pilot studies. I would say it is the other way round. The treatments are in situ incubations representing live microbes, poisoned microbes (dead) and surrounding seawater (not an incubation, can be thought of as a control). Your point about guessing the dispersions is a difficult one for me to agree with. First, the differences between the treatments are likely to far exceed the differences within the depths. Preliminary analyses in bayseq where samples are grouped by treatment bear this point out. So, I think its reasonable to combine the dispersions across depths so that I can compare live and dead treatments. Actually, I fully agree. If treatment causes much stronger differences than depth, then using the samples from different depths as replicates to assess the effect of treatment is entirely reasonable and valid. I somehow thought you wanted to to it the other way round (pooling treatments to find differences between depths -- though this is not what you wrote), and this won't work, because nothing will be significant. Second, as you point out, guessing the appropriate dispersion makes the data difficult to publish given I wont have a good reason to argue for any value. These particular marine metagenomes would be the first of their kind so there really isnt a good reference point. For the treatment effect, you are fine because you can argue that the dispersion estimated across different depths is certainly a conservative estimate for what you would have gotten from replicates -- so you are fine there. Problem is, you cannot reason about the effect of depth this way, but you would not want to throw away your knowledge about depths. However, I neglected in my first post that depth is an _ordered_ factor, and this might make quite a difference. I am not a marine biologist, but I imagine that most taxa will either keep increasing or keep decreasing in abundance when you go deeper and deeper, at least if your four depths are not that extreme. I assume that only few of your taxa will be happiest or unhappiest at the medium depths, i.e. have strongest or weakest abundance not at the most shallow or deepest sample but in one of the middle ones. If so (and only if so), you have a good substitute for replication. Strong differences between depths would then nearly always be consistent: the direction of change for all three steps (from depth 1 to depth 2, from depth 2 to depth 3 and from depth 3 to depth 4) is the same. Taxa without depth preference will fluctuate randomly: the differences between adjacent depth are not only weak but also inconsistent in their directions. Hence, the taxa with the strongest differences between depths will nearly always be consistent while those with the weakest differences will have random directions of change, and between the two, the line for significance can be drawn. This is why an ordered factor can substitute for replication. To do this with DESeq2, specify depth in you sample table not as a factor, but as numerical vector, e.g. by putting the depth in metres. The estimated log2 fold change can then be interpreted as reciprocal halving/doubling distances: A log2 fold change of +0.2, for example, would mean that the taxon's abundance doubles every 5 metres of depth (1/0.2=5), -0.4 would mean that it's abundance halves every 2.5 metres. If this assumption of exponential change is not too far from reality (and at least for light, it is true, and hence maybe also for light-dependent microbes), you will get valid p values. Your design for this should then be: ~ treatment + depth I should caution that we haven't tested regression on continuous variables much, so if it does not work, ask again. It should work but may need some tweaking, especially you may need to switch off coefficient shrinkage. One remaining question is whether I need the design ~ 1 option. Im still not clear on that. When you create a DESeqDataSet, you have to pass a design formula. However, the 'estimateSizeFactor' function's output does not depend on this. The function 'rlogTransformation' takes an option, 'blind'. Its default, TRUE, means that the function should _ignore_ the design formula and dispersion values stored in the object and recompute them using the design formula '~ 1'. With '~ 1', the function does not know anything about the samples' assignment to groups, which is why we called it 'blind'. If you use 'rlogData', you have to specify '~ 1' manually. Simon [[alternative HTML version deleted]]
On 14/01/14 18:03, Kristina M Fontanez wrote: > I agree that for the treatment effect, combining the dispersions across > depths is the way to proceed with these data. I think your reasoning for > depth as an ordered factor is sound and likely applies to the depth > profile I am working with. I attempted to follow your suggestion > (snipped below) to create a design that allows me to use depth as an > ordered factor. However, I found it confusing that the design is > ~treatment + depth. > With depth as the second variable, doesn?t that mean > that the dispersions will be combined across treatments? I thought the > goal was the combine dispersions across depths so that the treatments > could be compared. Oh, I was just one step ahead. To see the effect of treatment, just leave out the depth factor, i.e., use "~ treatment". Only once you want to see the effect of depth, use "~ treatment + depth", with depth as a numeric variable (not an ordered factor, I may have been a bit unclear there). > In the code below I used a ~TREATMENT+DEPTH design with depth as > numerical vectors and the ?None? (seawater) treatment as the base level > of the treatment factor. My main concern is that the output results in > hundreds of differentially abundant taxa (344 OTUs for the Dead vs > Seawater comparison with p-adjusted values <1e-5). These results include > both independent Filtering and the cooksCutoff (by default). Given the > really high numbers I am inclined to think that a large number of these > are false positives and we have modeled the dispersion incorrectly. What > do you think of this assessment? Maybe try again with ignoring "depth" altogether, even though, in principle, including ti should not be wrong. Also have a look at the estimated fold changes. Maybe they are significant but very small. An MA plot would help here, and also allow to check the normalization. I'm still puzzled what kind of treatment "live" and "dead" might be. I find it hard to come up with a guess what kind of treatment might have no effect on most taxa, which nevertheless has such a drastic name as "dead". Can you explain a bit more about the biology? (By private mail, in case you don't want to expose unpublished work on a public mailing list.) Simon
Simon- Thank you again for your help answering all of my questions. I really appreciate the time you?ve taken on this problem. For the dead vs seawater comparison, 111/ 344 with p-adjusted values <1e-5 have log2Fold changes of > 3 or < -3. So, if I use that as my significance threshold - that?s still a lot of taxa. I have attached the MAplot created by: > plotMA(jam,lfcColname="TREATMENT_Dead_vs_None",pvalCutoff=0.00001,ma in="Design ~Treatment+Depth") Next, I tried your suggestion of just doing ~Treatment and I found 123 taxa with p-adjusted values <1e-5 have log2Fold changes of > 3 or < -3. > jes class: DESeqDataSet dim: 2151 12 exptData(0): assays(1): counts rownames(2151): OTU1 OTU2 ... OTU2150 OTU2151 rowData metadata column names(0): colnames(12): HD5_150L HD5_150D ... HD9_SW_300_22 HD9_SW_500_22 colData names(5): DEPTH TREATMENT Cmg.m2.day Nmg.m2.day Pmg.m2.day > design(jes) ~TREATMENT > colData(jes)$TREATMENT<-relevel(colData(jes)$TREATMENT,"None") > jes<-estimateSizeFactors(jes) > jes<-estimateDispersions(jes,fitType="local") gene-wise dispersion estimates mean-dispersion relationship final dispersion estimates > jes<-nbinomWaldTest(jes) > resultsNames(jes) [1] "Intercept" "TREATMENT_Dead_vs_None" "TREATMENT_Live_vs_None" > DvsN_clean<-results(jes,"TREATMENT_Dead_vs_None",independentFilterin g=TRUE) > DvsN_clean<-DvsN_clean[order(DvsN_clean\$padj,na.last=NA),] > head(DvsN_clean) DataFrame with 6 rows and 6 columns baseMean log2FoldChange lfcSE stat pvalue padj <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> OTU1608 234.50106 5.150802 0.4392783 11.72560 9.423191e-32 1.897831e-28 OTU1643 592.54806 5.594136 0.4892074 11.43510 2.792133e-30 2.811678e-27 OTU808 908.49633 -4.542030 0.4091021 -11.10244 1.220707e-28 8.195015e-26 OTU1576 216.68941 4.960188 0.4562776 10.87099 1.584772e-27 7.979329e-25 OTU1574 222.80627 5.204206 0.4836326 10.76066 5.279121e-27 2.126430e-24 OTU52 82.98462 -3.508510 0.3291026 -10.66084 1.551899e-26 5.209207e-24 > write.csv(DvsN_clean,file="Dead_vs_Seawater.Treat.iF_true.cook_true. csv") > plotMA(jes,lfcColname="TREATMENT_Dead_vs_None",pvalCutoff=0.00001,ma in="Design ~Treatment") At this point, it looks like including the effect of Depth reduces the total number of taxa under my significance threshold, but not by much. It seems like I need to decide what log2fold changes are biologically meaningful rather than relying on the p-adjusted values as a guide. Thanks, Kristina ------------------------------------------------------------------ Kristina Fontanez, Postdoctoral Fellow fontanez at mit.edu<mailto:fontanez at="" mit.edu=""> Massachusetts Institute of Technology Department of Civil and Environmental Engineering 48-120E 15 Vassar Street Cambridge, MA 02139 On Jan 14, 2014, at 12:18 PM, Simon Anders <anders at="" embl.de<mailto:anders="" at="" embl.de="">> wrote: On 14/01/14 18:03, Kristina M Fontanez wrote: I agree that for the treatment effect, combining the dispersions across depths is the way to proceed with these data. I think your reasoning for depth as an ordered factor is sound and likely applies to the depth profile I am working with. I attempted to follow your suggestion (snipped below) to create a design that allows me to use depth as an ordered factor. However, I found it confusing that the design is ~treatment + depth. With depth as the second variable, doesn?t that mean that the dispersions will be combined across treatments? I thought the goal was the combine dispersions across depths so that the treatments could be compared. Oh, I was just one step ahead. To see the effect of treatment, just leave out the depth factor, i.e., use "~ treatment". Only once you want to see the effect of depth, use "~ treatment + depth", with depth as a numeric variable (not an ordered factor, I may have been a bit unclear there). In the code below I used a ~TREATMENT+DEPTH design with depth as numerical vectors and the ?None? (seawater) treatment as the base level of the treatment factor. My main concern is that the output results in hundreds of differentially abundant taxa (344 OTUs for the Dead vs Seawater comparison with p-adjusted values <1e-5). These results include both independent Filtering and the cooksCutoff (by default). Given the really high numbers I am inclined to think that a large number of these are false positives and we have modeled the dispersion incorrectly. What do you think of this assessment? Maybe try again with ignoring "depth" altogether, even though, in principle, including ti should not be wrong. Also have a look at the estimated fold changes. Maybe they are significant but very small. An MA plot would help here, and also allow to check the normalization. I'm still puzzled what kind of treatment "live" and "dead" might be. I find it hard to come up with a guess what kind of treatment might have no effect on most taxa, which nevertheless has such a drastic name as "dead". Can you explain a bit more about the biology? (By private mail, in case you don't want to expose unpublished work on a public mailing list.) Simon -------------- next part -------------- A non-text attachment was scrubbed... Name: MAplot_des_treatplusdepth.pdf Type: application/pdf Size: 332374 bytes Desc: MAplot_des_treatplusdepth.pdf URL: <https: stat.ethz.ch="" pipermail="" bioconductor="" attachments="" 20140114="" ed754c3d="" attachment-0002.pdf=""> -------------- next part -------------- A non-text attachment was scrubbed... Name: MAplot_des_treatonly.pdf Type: application/pdf Size: 331704 bytes Desc: MAplot_des_treatonly.pdf URL: <https: stat.ethz.ch="" pipermail="" bioconductor="" attachments="" 20140114="" ed754c3d="" attachment-0003.pdf="">
Hi Kristina I'll write a bit a longer reply, to advertise a kind of paradigm shift that we are now exploring in DESeq2. First a note to other readers of this thread: Kristina has sent me a private mail explaining the experiment in more detail. To sum this up without disclosing anything, the three "treatments" labelled "none", "life" and "dead" are not so much different treatments applied to one kind of sample but rather three different (actually: very different) manners in which the samples were collected, which are hence expected to capture quite different subsets of the species that are present at the sample collection spots. Therefore, it is no surprise that observed OTU abundances differ so greatly between the "treatments". Remember that the purpose of a test for differential expression in its usual sense is to find gene for which you can reject with confidence the null hypothesis that the gene is _not_at_all_ affected by the treatment, i.e., that its abundance stays exactly the same. It is quite common that this specific null hypothesis turns out to be a bit silly. All genes are connected by the cell's complicated regulatory network, and hence, there typically might not be a single gene which is not at least very slightly affected by the treatment. In essence, what a significant p value hence really means is that the effect was big enough that we can say with confidence, in which _direction_ the gene has changed. Quite often, the effect strength required to establish the direction of change is larger than the effect strength required to make the change of biological interest, and therefore, one looks at p values. If it is the other way round, one needs to also have a cut-off on fold change or do a banded test (see below). In your case, it is also clear a priori that no OTU will have the same abundance in two "treatments", given that the different sample collection approaches favour quite different species. Hence, it is no surprise that at a reasonable cut-off on the adjusted p values (say, 0.1), nearly all OTUs are significant. This means that you are in the favourable situation that you can take the estimated differences between treatments pretty much at face value. However, you want to know _how_precise they actually are, and for this, DESeq2 now reports standard errors alongside all estimated log fold changes. This data, and not the p values, is likely the more useful output for you. Simon
Answer: DEseq2 metagenomic analysis without replicates
0
5.3 years ago by
Kristina M Fontanez220 wrote:
Hi Mike- Thanks for the clarification below - I donât know how I missed that. âNo, it does account for sequencing. We say explicitly in the vignette and man pages that the rlog is better for accounting for size factors differences than VST. Also the formulae provided in the vignette and man pages makes this clear: rlog is defined as log2(q_ij)â â E(K_ijâ) = mu_ij = s_j q_ij in words, the expected count for gene i, sample j is represented by mu_ij, which is the produce of a size factor, s_j, which accounts for sequencing depth and q_ij. So q_ij therefore is the part leftover after accounting for sequencing depth. Thanks, Kristina ------------------------------------------------------------------ Kristina Fontanez, Postdoctoral Fellow fontanez@mit.edu<mailto:fontanez@mit.edu> Massachusetts Institute of Technology Department of Civil and Environmental Engineering 48-120E 15 Vassar Street Cambridge, MA 02139 [[alternative HTML version deleted]]