Hello,
I'm a PhD student and I'm using DESeq2 to analyze the differential abundance of 16S amplicons across a number of factors. The software is fabulous, I'm just struggling a bit with obtaining estimates of differential abundance. I'm investigating the role of host plant species and soil treatments on the abundance of bacterial taxa, these factors are crossed so that each host plant species is replicated in each soil treatment. I think an estimate of the log-fold changes (LFCE) associated with soil treatments and host plant species would be informative. Basically, which factor causes a greater magnitude of response to bacterial taxa, soil treatment or host plant species? I would compare the distribution of LFCE between these factors and note here I'm looking for the main effect of these factors.
I have 2 soil treatments and 30 host plant species.
(UPDATE - Here is the experimental design details from as.data.frame(colData(dds)). Only showing a portion of the rows)
species replicate run treatment group sizeFactor E-100-10 100 10 1 d 100.d 0.13533106 E-100-7 100 7 1 d 100.d 0.89019688 E-101-5 101 5 1 w 101.w 0.17263663 E-105-10 105 10 1 d 105.d 3.55714737 E-105-1 105 1 1 w 105.w 0.23722231 E-105-4 105 4 1 d 105.d 0.86666457 E-105-6 105 6 1 d 105.d 3.59393071 E-105-8 105 8 1 d 105.d 0.49218063 E-108-10 108 10 1 d 108.d 6.17767785 E-108-4 108 4 1 d 108.d 10.06860206 E-108-6 108 6 1 d 108.d 1.47016224 E-108-8 108 8 1 d 108.d 3.46371775 E-109-10 109 10 1 d 109.d 1.12876590 E-109-6 109 6 1 w 109.w 2.92162215 E-109-8 109 8 1 w 109.w 0.23868261 E-109-9 109 9 1 w 109.w 0.23525578 E-112-9 112 9 1 d 112.d 0.57176299 E-134-6 134 6 1 w 134.w 1.62106074 E-134-7 134 7 1 d 134.d 0.40653113 E-134-9 134 9 1 d 134.d 0.80525342 E-137-10 137 10 1 w 137.w 1.83033804 E-137-7 137 7 1 w 137.w 2.85010152 E-137-8 137 8 1 d 137.d 1.77433507 E-142-10 142 10 1 d 142.d 1.17450263 E-142-1 142 1 1 w 142.w 1.76506157 E-142-4 142 4 1 d 142.d 0.47586765 E-142-5 142 5 1 d 142.d 0.31323542 E-142-6 142 6 1 w 142.w 1.58546335 E-142-7 142 7 1 d 142.d 0.27226916 E-142-8 142 8 1 w 142.w 1.38093562 E-142-9 142 9 1 w 142.w 3.47419762 E-143-10 143 10 1 d 143.d 0.99365447 E-143-3 143 3 1 w 143.w 0.96923299 E-143-6 143 6 1 w 143.w 0.38872894 E-143-7 143 7 1 w 143.w 0.80616558
For soil treatment building the appropriate contrast is straightforward (in the actual data treatments are "w" and "d"):
results(dds.bacterial.taxa, contrast=c("treatment", "a", "b"))
I take the contrast from a DESeqDataSet on which I've called nbinomWaldTest specifying the factor (treatment) and levels of interest (a versus b). The host plant species contrast isn't as straightforward. I have at least two options. The simplest being:
results(dds.bacterial.taxa, name="species1")
Species 1 is one of my host plant species and appears as an element of resultsNames(dds.bacterial.taxa). I think contrasts samples labelled Species 1 to all other samples in my dataset, basically for each feature estimate log fold change between mean of Species 1 and grand mean. I can repeat this contrast for each host plant species separately and compare the distribution of LFCEs obtained from these individual host plant species contrasts to the LFCEs obtained from the soil treatment contrast. A more complicated (and possibly inappropriate), method would be:
results(dds.bacterial.taxa, contrast=c("species", "1", "2"))
This would yield the LFCE for each bacterial taxa between two individual host plant species. I would then repeat this for each of my 406 unique pairs of host plant species to obtain a distribution of all possible LFCEs when comparing every possible pair of host plants.
Given these two options I wanted to make sure that the LFCE are comparable. If I use the contrast argument for the soil treatment LFCE and the name argument for the host plant species LFCE is it appropriate to compare them?
So then I tried using several methods to estimate LFCE for the soil treatment to see if they match:
results(dds.bacterial.taxa, contrast=c("treatment", "a", "b")) results(dds.bacterial.taxa, name="treatmenta") results(dds.bacterial.taxa, contrast=list(c("treatmenta", "treatmentb"))
Each of these methods yields different LFCE. Interestingly the first method yields largest estimates and the third method yields the smallest estimates. Though qualitatively the estimates and significance match across all three methods. I've read the DESeq2 vignette and ?results page but can't quite understand how the calculation of LFCE differ between these methods. Different denominator and numerator values must be occurring. I'm now concerned that my host plant species LFCE taken from:
results(dds.bacterial.taxa, name="species1")
Will be underestimated relative to the LFCE for soil treatment taken from:
results(dds.bacterial.taxa, contrast=c("treatment", "a", "b"))
as a result of the different LFCE calculations performed by these different methods. Any insight into these methods for calculating LFCE across many factor levels would be greatly appreciated.
Thank You!
Connor Fitzpatrick
University of Toronto
(note - I posted this comment first, before spotting the problem in your initial code)
Could you put the first few lines of output from each of the three calls to
results
you do, as I'd expect the first and third to both showand the estimates to be the same. The middle comparison should be different, as you're comparing treatment a to the average of a and b (ignoring species for the moment). Also, may help if you show us how you called DESeq (ie with what design, and other parameters, were used)
Thanks for the reply! OK this makes sense.
I suspected this compared the particular factor level in the name argument to the grand mean. This would explain why the LFCE are reduced compared to:
So now my question becomes more conceptual. For a two level factor the above contrast works fine. But when you have > 5 factors the number of possible contrasts becomes very large. I have 30 levels of my host plant species factor and so every possible contrast = 406. For a dataset with thousands of features this becomes a problem. I'm not particularly interested in the actual LFCE between any given pair of levels for this factor. What I am interested in is the "average" magnitude of LFCE due to my various factors, so for these factor comparing either the distribution of LFCE for all features (both positive and negative), or the mean absolute LFCE for all features.
Since these two methods yield different LFCE:
Do you think it makes sense to compare the LCFEs for a two level factor using the first method and a multi-level factor using the second method? I'm leaning towards using the second method for all factors so that in every case I'm contrasting a factor level to the grand mean.
Thanks for your time!