Question

DESeq2 log fold change estimates for factors with many levels

0

Entering edit mode

crfitzpat • 0

@crfitzpat-12404

Last seen 7.2 years ago

Hello,

I'm a PhD student and I'm using DESeq2 to analyze the differential abundance of 16S amplicons across a number of factors. The software is fabulous, I'm just struggling a bit with obtaining estimates of differential abundance. I'm investigating the role of host plant species and soil treatments on the abundance of bacterial taxa, these factors are crossed so that each host plant species is replicated in each soil treatment. I think an estimate of the log-fold changes (LFCE) associated with soil treatments and host plant species would be informative. Basically, which factor causes a greater magnitude of response to bacterial taxa, soil treatment or host plant species? I would compare the distribution of LFCE between these factors and note here I'm looking for the main effect of these factors.

I have 2 soil treatments and 30 host plant species.

(UPDATE - Here is the experimental design details from as.data.frame(colData(dds)). Only showing a portion of the rows)

         species replicate run treatment group  sizeFactor
E-100-10     100        10   1         d 100.d  0.13533106
E-100-7      100         7   1         d 100.d  0.89019688
E-101-5      101         5   1         w 101.w  0.17263663
E-105-10     105        10   1         d 105.d  3.55714737
E-105-1      105         1   1         w 105.w  0.23722231
E-105-4      105         4   1         d 105.d  0.86666457
E-105-6      105         6   1         d 105.d  3.59393071
E-105-8      105         8   1         d 105.d  0.49218063
E-108-10     108        10   1         d 108.d  6.17767785
E-108-4      108         4   1         d 108.d 10.06860206
E-108-6      108         6   1         d 108.d  1.47016224
E-108-8      108         8   1         d 108.d  3.46371775
E-109-10     109        10   1         d 109.d  1.12876590
E-109-6      109         6   1         w 109.w  2.92162215
E-109-8      109         8   1         w 109.w  0.23868261
E-109-9      109         9   1         w 109.w  0.23525578
E-112-9      112         9   1         d 112.d  0.57176299
E-134-6      134         6   1         w 134.w  1.62106074
E-134-7      134         7   1         d 134.d  0.40653113
E-134-9      134         9   1         d 134.d  0.80525342
E-137-10     137        10   1         w 137.w  1.83033804
E-137-7      137         7   1         w 137.w  2.85010152
E-137-8      137         8   1         d 137.d  1.77433507
E-142-10     142        10   1         d 142.d  1.17450263
E-142-1      142         1   1         w 142.w  1.76506157
E-142-4      142         4   1         d 142.d  0.47586765
E-142-5      142         5   1         d 142.d  0.31323542
E-142-6      142         6   1         w 142.w  1.58546335
E-142-7      142         7   1         d 142.d  0.27226916
E-142-8      142         8   1         w 142.w  1.38093562
E-142-9      142         9   1         w 142.w  3.47419762
E-143-10     143        10   1         d 143.d  0.99365447
E-143-3      143         3   1         w 143.w  0.96923299
E-143-6      143         6   1         w 143.w  0.38872894
E-143-7      143         7   1         w 143.w  0.80616558

For soil treatment building the appropriate contrast is straightforward (in the actual data treatments are "w" and "d"):

results(dds.bacterial.taxa, contrast=c("treatment", "a", "b"))

I take the contrast from a DESeqDataSet on which I've called nbinomWaldTest specifying the factor (treatment) and levels of interest (a versus b). The host plant species contrast isn't as straightforward. I have at least two options. The simplest being:

results(dds.bacterial.taxa, name="species1")

Species 1 is one of my host plant species and appears as an element of resultsNames(dds.bacterial.taxa). I think contrasts samples labelled Species 1 to all other samples in my dataset, basically for each feature estimate log fold change between mean of Species 1 and grand mean. I can repeat this contrast for each host plant species separately and compare the distribution of LFCEs obtained from these individual host plant species contrasts to the LFCEs obtained from the soil treatment contrast. A more complicated (and possibly inappropriate), method would be:

results(dds.bacterial.taxa, contrast=c("species", "1", "2"))

This would yield the LFCE for each bacterial taxa between two individual host plant species. I would then repeat this for each of my 406 unique pairs of host plant species to obtain a distribution of all possible LFCEs when comparing every possible pair of host plants.

Given these two options I wanted to make sure that the LFCE are comparable. If I use the contrast argument for the soil treatment LFCE and the name argument for the host plant species LFCE is it appropriate to compare them?

So then I tried using several methods to estimate LFCE for the soil treatment to see if they match:

results(dds.bacterial.taxa, contrast=c("treatment", "a", "b"))

results(dds.bacterial.taxa, name="treatmenta")

results(dds.bacterial.taxa, contrast=list(c("treatmenta", "treatmentb"))

Each of these methods yields different LFCE. Interestingly the first method yields largest estimates and the third method yields the smallest estimates. Though qualitatively the estimates and significance match across all three methods. I've read the DESeq2 vignette and ?results page but can't quite understand how the calculation of LFCE differ between these methods. Different denominator and numerator values must be occurring. I'm now concerned that my host plant species LFCE taken from:

results(dds.bacterial.taxa, name="species1")

Will be underestimated relative to the LFCE for soil treatment taken from:

results(dds.bacterial.taxa, contrast=c("treatment", "a", "b"))

as a result of the different LFCE calculations performed by these different methods. Any insight into these methods for calculating LFCE across many factor levels would be greatly appreciated.

Thank You!

Connor Fitzpatrick

University of Toronto

deseq2 lfce contrast • 4.8k views

ADD COMMENT • link 7.2 years ago crfitzpat • 0

0

Entering edit mode

(note - I posted this comment first, before spotting the problem in your initial code)

Could you put the first few lines of output from each of the three calls to results you do, as I'd expect the first and third to both show

log2 fold change (MAP): treatment a vs b 
Wald test p-value: treatment a vs b

and the estimates to be the same. The middle comparison should be different, as you're comparing treatment a to the average of a and b (ignoring species for the moment). Also, may help if you show us how you called DESeq (ie with what design, and other parameters, were used)

ADD REPLY • link 7.2 years ago Gavin Kelly ▴ 680

0

Entering edit mode

Thanks for the reply! OK this makes sense.

results(dds.bacterial.taxa, name="treatmenta")

I suspected this compared the particular factor level in the name argument to the grand mean. This would explain why the LFCE are reduced compared to:

results(dds.bacterial.taxa, contrast=c("treatment", "a", "b"))

So now my question becomes more conceptual. For a two level factor the above contrast works fine. But when you have > 5 factors the number of possible contrasts becomes very large. I have 30 levels of my host plant species factor and so every possible contrast = 406. For a dataset with thousands of features this becomes a problem. I'm not particularly interested in the actual LFCE between any given pair of levels for this factor. What I am interested in is the "average" magnitude of LFCE due to my various factors, so for these factor comparing either the distribution of LFCE for all features (both positive and negative), or the mean absolute LFCE for all features.

Since these two methods yield different LFCE:

results(dds.bacterial.taxa, contrast=c("treatment", "a", "b"))

results(dds.bacterial.taxa, name="treatmenta")

Do you think it makes sense to compare the LCFEs for a two level factor using the first method and a multi-level factor using the second method? I'm leaning towards using the second method for all factors so that in every case I'm contrasting a factor level to the grand mean.

Thanks for your time!

ADD REPLY • link 7.2 years ago crfitzpat • 0

score 1 · Answer 1 · 2017-02-20

results(dds.bacterial.taxa, contrast=c("treatment", "a", "b"))

results(dds.bacterial.taxa, name="treatmenta")

results(dds.bacterial.taxa, contrast=list(c("treatmenta", "treatmentb")))

The first one is the one that you want - comparing treatment a with treatment b. The second compares treatment a with the overall average (I think), and the third one has a slight typo in it, which I missed when making my reply - you don't need the 'c(...)' - the first element of the list is the numerator (and the inclusion of the two treatments here means that they'll be added together), and the second element of the list is the denominator (which you don't have, as you've concatenated everything into the first component), so overall you're looking for the additive a+b treatment effect here. If you have the subtly different version of the 3rd

results(dds.bacterial.taxa, contrast=list("treatmenta", "treatmentb"))

then it should agree with the first. A similar approach should apply to the 'species' comparison.

score 0 · Answer 2 · 2017-02-20

0

Entering edit mode

Michael Love 41k

@mikelove

Last seen 11 minutes ago

United States

Can you describe the experimental design in more detail? Usually the easiest way to describe in the detail necessary to answer questions is to show as.data.frame(colData(dds))

ADD COMMENT • link 7.2 years ago Michael Love 41k

0

Entering edit mode

Thanks for the reply, I've updated the original question. Please note that each host plant species is replicated in each treatment >5.

ADD REPLY • link 7.2 years ago crfitzpat • 0

0

Entering edit mode

What is the meaning of the 'replicate' column?

Have you coded the variables as factors? E.g. species should be a factor: class(dds$species).

ADD REPLY • link 7.2 years ago Michael Love 41k

0

Entering edit mode

Replicate is the individual plant from a given species and species is coded as a factor.

ADD REPLY • link 7.2 years ago crfitzpat • 0

0

Entering edit mode

Is replicate information useful here? Or you just have that information in the table but don't plan on using it?

Also, the way you've coded it, 10 in one species is not related to 10 in the other, correct?

ADD REPLY • link 7.2 years ago Michael Love 41k

0

Entering edit mode

Replicate is just in my meta data table and I don't plan on using it and individual 10 from one species is unrelated to individual 10 in another.

ADD REPLY • link 7.2 years ago crfitzpat • 0

0

Entering edit mode

There are lots of modeling options to think about, and you might want to discuss this with a statistical collaborator.

My approach would be to perform two likelihood ratio tests, both using a full design of ~ species + treatment. To test the treatment effect, the reduced design would be ~species. To test the species effect (this tests over all levels of the species factor), the reduced design would be ~treatment. You can compare the LRT statistics, or the adjusted p-values from the two tests.

design(dds) <- ~species + treatment

dds.trt <- DESeq(dds, test="LRT", reduced=~species)
res.trt <- results(dds.trt)

dds.sp <- DESeq(dds, test="LRT", reduced=~treatment)
res.sp <- results(dds.sp)

See the vignette section on LRT for more details.

ADD REPLY • link 7.2 years ago Michael Love 41k

0

Entering edit mode

Absolutely, this is my approach for obtaining significance of treatment and species for the abundance of each feature. Now I would like to compare the LFCE associated with each of these factors i.e. which factor, treatment or host species, produces larger LFCE? For this I need to perform a nbiomWaldtest and extract particular results. The problem arises because using:

results(dds.bacterial.taxa, contrast=c("species", "1", "2"))

would require 406 contrasts - one for each unique pair of 30 host species. Alternatively I could use:

results(dds.bacterial.taxa, name="species1")

And repeat for each host species. If I'm understanding correctly this would yield the LFCE between species1 and the grand mean across all factors. I guess boiled down to a general question, could one reasonably compare the LFCE associated with two different factors if they are calculated these 2 different ways:

results(dds.bacterial.taxa, name="species1")

results(dds.bacterial.taxa, contrast=c("treatment", "a", "b"))

Or is it only reasonable to compare LFCE for two factors if they are calculated in the same way?

ADD REPLY • link 7.2 years ago crfitzpat • 0

1

Entering edit mode

I think it's reasonable to compare the LFC with "name" (when there are many levels) and the B vs A contrast. You are correct that, with "name" you are comparing one level against the middle value (it's somewhere between the arithmetic and geometric mean, depending on the strength of shrinkage). So with just two levels, it would be comparable to 1/2 the B vs A contrast. But with 30 species, if there is a species-specific effect, you will get nearly the full effect (because 1 species out of 30 doesn't move the middle very much).

Note that, in the next release, you will need to explicitly set betaPrior=TRUE to produce the shrunken coefficients. I am reorganizing the code around coefficient shrinkage.

ADD REPLY • link 7.2 years ago Michael Love 41k

0

Entering edit mode

I agree with your reasoning here. I think there may be instances where knowing the LFCE between every pair of levels in a multi-level factor might be informative but where looking for more general effects the "name" method makes more sense. Thank you for taking the time with this.

ADD REPLY • link 7.2 years ago crfitzpat • 0