Question: DESeq2 multifactor design with interaction term
0
10 months ago by
Italy, Florence, University of Florence
Giovanni Bacci0 wrote:

Hello,

I'm running DESeq2 in combination with phyloseq to analyze 16S rRNA data coming from a large survey. I have gut microbiome coming from 6 subjects (called 5001, 5002, 5003, 5004, 5005, and 5006) that experimented 3 types of diets (called A, B, and C). I would like to test differences between:

1. Different diet in each subject
2. Different subject in normal diet (reference level)
3. Interaction terms (differences on how subjects respond to different diets)

My design matrix is pretty huge so I'll not post it here but these are the first steps that I ran to get my DESeq2 object:

dds <- phyloseq_to_deseq2(mars, design = ~  food + subject_id + food:subject_id)
dds <- DESeq(dds, test = "Wald", fitType = "parametric")
matrix(resultsNames(dds))
      [,1]
[1,] "Intercept"
[2,] "food_B_vs_A"
[3,] "food_C_vs_A"
[4,] "subject_id_5002_vs_5001"
[5,] "subject_id_5003_vs_5001"
[6,] "subject_id_5004_vs_5001"
[7,] "subject_id_5005_vs_5001"
[8,] "subject_id_5006_vs_5001"
[9,] "foodB.subject_id5002"
[10,] "foodC.subject_id5002"
[11,] "foodB.subject_id5003"
[12,] "foodC.subject_id5003"
[13,] "foodB.subject_id5004"
[14,] "foodC.subject_id5004"
[15,] "foodB.subject_id5005"
[16,] "foodC.subject_id5005"
[17,] "foodB.subject_id5006"
[18,] "foodC.subject_id5006" 

If I have understood correctly, I can get the difference across diets for subject 1 simply setting the contrast term correctly:

results(dds, contrast = c("food", "B", "A"))
results(dds, contrast = c("food", "C", "A"))
results(dds, contrast = c("food", "C", "B"))

But now I have a doubt, how can I get the same results for subject 2? I have done the following steps but I don't know if they are correct...

results(dds, contrast = list(c( "food_B_vs_A", "foodB.subject_id5002" ))
results(dds, contrast = list(c( "food_C_vs_A", "foodC.subject_id5002" ))

The contrast for B - C in subject 2 should be: ("food_B_vs_A" - "food_C_vs_A") + ("foodB.subject_id5002" - "foodC.subject_id5002"). So I used a vector of combinations in the contrast argument of the results function:

cbind(as.matrix(resultsNames(dds)), c(0,1,-1,0,0,0,0,0,1,-1,0,0,0,0,0,0,0,0))

      [,1]                      [,2]
[1,] "Intercept"               "0"
[2,] "food_B_vs_A"             "1"
[3,] "food_C_vs_A"             "-1"
[4,] "subject_id_5002_vs_5001" "0"
[5,] "subject_id_5003_vs_5001" "0"
[6,] "subject_id_5004_vs_5001" "0"
[7,] "subject_id_5005_vs_5001" "0"
[8,] "subject_id_5006_vs_5001" "0"
[9,] "foodB.subject_id5002"    "1"
[10,] "foodC.subject_id5002"    "-1"
[11,] "foodB.subject_id5003"    "0"
[12,] "foodC.subject_id5003"    "0"
[13,] "foodB.subject_id5004"    "0"
[14,] "foodC.subject_id5004"    "0"
[15,] "foodB.subject_id5005"    "0"
[16,] "foodC.subject_id5005"    "0"
[17,] "foodB.subject_id5006"    "0"
[18,] "foodC.subject_id5006"    "0" 

results(dds, contrast = c(0,1,-1,0,0,0,0,0,1,-1,0,0,0,0,0,0,0,0))

Is this correct or I missed something? If this is the right way to do it I was planning to repeat the same procedure for the other subjects and then put everything together into a single data frame.

For point 2, I extracted every contrast this way:

results(dds, contrast = c("subject_id", "5002", "5001")
results(dds, contrast = c("subject_id", "5003", "5001")
...
results(dds, contrast = c("subject_id", "5004", "5006")
results(dds, contrast = c("subject_id", "5005", "5006")

whether for point 3 I'm stuck again... I think that I should extract all interaction terms and then each possible combination for differences between subjects that are not the reference level but I'm not sure about that.

Thanks in advance for the help!

best,

Giovanni

This is the output of sessionInfo():

R version 3.4.3 (2017-11-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.4 LTS

Matrix products: default
BLAS: /usr/lib/atlas-base/atlas/libblas.so.3.0
LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=it_IT.UTF-8        LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=it_IT.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=it_IT.UTF-8       LC_NAME=C

attached base packages:
[1] grid      parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] ggdendro_0.1-20            vegan_2.4-6                lattice_0.20-35            permute_0.9-4
[5] metagenomeSeq_1.20.1       RColorBrewer_1.1-2         glmnet_2.0-13              foreach_1.4.4
[9] Matrix_1.2-11              limma_3.34.9               gridExtra_2.3              ggplot2_2.2.1
[13] DESeq2_1.18.1              SummarizedExperiment_1.8.1 DelayedArray_0.4.1         matrixStats_0.53.1
[17] Biobase_2.38.0             GenomicRanges_1.30.2       GenomeInfoDb_1.14.0        IRanges_2.12.0
[21] S4Vectors_0.16.0           BiocGenerics_0.24.0        phyloseq_1.23.1            reshape2_1.4.3

loaded via a namespace (and not attached):
[1] nlme_3.1-131           bitops_1.0-6           bit64_0.9-7            tools_3.4.3            backports_1.1.2
[6] rpart_4.1-13           KernSmooth_2.23-15     Hmisc_4.1-1            DBI_0.7                lazyeval_0.2.1
[11] mgcv_1.8-23            colorspace_1.3-2       ade4_1.7-10            nnet_7.3-12            bit_1.1-12
[16] compiler_3.4.3         htmlTable_1.11.2       labeling_0.3           caTools_1.17.1         scales_0.5.0.9000
[21] checkmate_1.8.5        genefilter_1.60.0      stringr_1.3.0          digest_0.6.15          foreign_0.8-69
[26] XVector_0.18.0         base64enc_0.1-3        pkgconfig_2.0.1        htmltools_0.3.6        htmlwidgets_1.0
[31] rlang_0.2.0            rstudioapi_0.7         RSQLite_2.0            jsonlite_1.5           BiocParallel_1.12.0
[36] gtools_3.5.0           acepack_1.4.1          RCurl_1.95-4.10        magrittr_1.5           GenomeInfoDbData_1.0.0
[41] Formula_1.2-2          biomformat_1.6.0       Rcpp_0.12.15           munsell_0.4.3          ape_5.0
[46] stringi_1.1.6          yaml_2.1.16            MASS_7.3-49            zlibbioc_1.24.0        rhdf5_2.22.0
[51] gplots_3.0.1           plyr_1.8.4             blob_1.1.0             gdata_2.18.0           Biostrings_2.46.0
[56] splines_3.4.3          multtest_2.34.0        annotate_1.56.1        locfit_1.5-9.1         knitr_1.20
[61] pillar_1.1.0           igraph_1.1.2           geneplotter_1.56.0     codetools_0.2-15       XML_3.98-1.10
[66] latticeExtra_0.6-28    data.table_1.10.4-3    gtable_0.2.0           xtable_1.8-2           survival_2.41-3
[71] tibble_1.4.2           iterators_1.0.9        AnnotationDbi_1.40.0   memoise_1.1.0          cluster_2.0.6
modified 10 months ago by Michael Love21k • written 10 months ago by Giovanni Bacci0
Answer: DESeq2 multifactor design with interaction term
1
10 months ago by
Michael Love21k
United States
Michael Love21k wrote:

It may be easier for you to use a different design:

~subject + subject:food

This will give you a baseline for each subject and then subject specific comparisons of food levels to the reference level of food. It will help you to avoid the numeric contrast for, e.g. B - C in subject 2. This would be just contrast=list("subject2.foodB", "subject2.foodC").

Sorry to bother you but I'm trying to fully understand how DESeq2 builds result tables. The steps I reported for subject2 can be considered correct? I was planning to use DESeq2 for other studies where I cannot make the simplification that you suggested so I need to know if (in principle) my approach is fine. In addition, when I run the contrast between food A and B for subject1 I get the following results: https://figshare.com/s/d7135b65f2a909388725. As you may notice OTU 24, 34, 35, 42, and 44 have zero counts in both A and B but they were reported in the result table with an adjusted p-value lower than 0.1 (I'm using the default value for simplicity). The result table is the following:

          baseMean log2FoldChange     lfcSE      stat       pvalue         padj
OTU3   10926.73678      -2.211819 0.5214396 -4.241754 2.217794e-05 3.016200e-04
OTU9    2042.76020      -1.876544 0.7362477 -2.548794 1.080961e-02 4.243671e-02
OTU7    1521.87518       8.062935 0.9796311  8.230583 1.863046e-16 1.266871e-14
OTU16   1119.77573      -1.969538 0.6723132 -2.929494 3.395141e-03 1.775920e-02
OTU20    473.07108      -3.352930 0.8958218 -3.742853 1.819425e-04 1.546512e-03
OTU26    535.87510      -2.235220 0.8643032 -2.586152 9.705410e-03 4.124799e-02
OTU24     86.29744     -12.756321 3.2250495 -3.955388 7.641066e-05 7.422750e-04
OTU27    264.03226       2.139153 0.4634868  4.615349 3.924344e-06 6.671384e-05
OTU28    460.72956       1.592456 0.6493838  2.452258 1.419630e-02 5.080780e-02
OTU35    123.33839     -14.555568 3.6448367 -3.993476 6.511164e-05 7.379319e-04
OTU34     47.95764     -14.992204 3.2118168 -4.667826 3.044028e-06 6.671384e-05
OTU42    174.02153     -22.086751 3.2550839 -6.785309 1.158381e-11 3.938496e-10
OTU41     68.01734       2.563037 0.6988319  3.667601 2.448365e-04 1.849876e-03
OTU46     80.55887      -2.677220 0.9802923 -2.731042 6.313434e-03 3.066525e-02
OTU44     12.78494     -17.245844 4.7975691 -3.594705 3.247598e-04 2.208367e-03
OTU54     72.62819       1.982016 0.8596337  2.305652 2.113011e-02 6.842130e-02
OTU59     93.76219       2.464469 1.0475717  2.352554 1.864497e-02 6.339291e-02
OTU83     50.33208      -3.372868 1.0951556 -3.079807 2.071349e-03 1.173764e-02
OTU84     80.36375       3.760170 1.0782534  3.487278 4.879632e-04 3.016500e-03
OTU109  2749.34804       1.663127 0.6559731  2.535358 1.123325e-02 4.243671e-02
OTU194    91.21181       1.870896 0.7049041  2.654114 7.951696e-03 3.604769e-02

How can this be possible?

Thanks again!

best,

Giovanni

Yes, your contrasts seem correct. I've seen this occur when the counts do not follow a negative binomial (typically the data is not RNA-seq), e.g. the data is bimodal, and there are many groups involved. I recommend to use LFC shrinkage with the lfcShrink function and apply a minimal filter, e.g. subset(res, abs(log2FoldChange) > 0.5). In the next version of DESeq2 you will be able to directly apply a LFC threshold within lfcShrink.

Oh and I should say, you will need to use type="apeglm" or type="ashr" to apply LFC shrinkage to a design with interactions.

I tried with lfcShrink (both with apeglm and ashr) and it doesn't seem to work. As you said I actually have 16S rRNA data they are not RNA-seq and they could not follow negative binomial distribution (I tried to model them with glmmADMB and in some cases the Poisson family seems to work better than negative binomial according to the dAIC). In addition, I do have many groups so this can also be the cause.

Thanks again for the help,

best,

Giovanni

Out of curiosity “doesn’t seem to work” meaning that it gives an error or that gives a high LFC.

Let me clarify that:

the function lfcShrink worked fine with both methods but the output still contains zero-count OTUs. The worst thing is that they have a huge log2FoldChange and a very small p-value so it is really difficult to remove them in a "programatic" way without removing valid differences ... I think I have to do it manually or write down something to account for zero-count OTUs in the contrast selected.

I'm surprised that they'd have a large LFC in the output from lfcShrink() after running apeglm, it usually is aggressive about shrinkage. You may want to move to a different method then, if the distribution is not appropriate for this dataset.

That was my fault!

I was giving the unfiltered table to my plotting function, sorry. The function now works well and the zero-counts OTUs have been removed. My last question regarding the interaction part of the original post:

If I understood correctly I can extract contrasts between food B and C for, say, subject 2 against the reference level by using:

results(dds, contrast = list("foodC.subject_id5002", "foodB.subject_id5002"))

and this should be valid for each subject. But what about the effect of B vs C for subject 2 and another subject that is not the reference level? I know that this can be tricky but I think that I am missing something ... this is what I tried:

coefs <- c(rep(0,8), -1, 1, 1, -1, rep(0, 6))
cbind(resultsNames(dds), coefs)

coefs
[1,] "Intercept"               "0"
[2,] "food_B_vs_A"             "0"
[3,] "food_C_vs_A"             "0"
[4,] "subject_id_5002_vs_5001" "0"
[5,] "subject_id_5003_vs_5001" "0"
[6,] "subject_id_5004_vs_5001" "0"
[7,] "subject_id_5005_vs_5001" "0"
[8,] "subject_id_5006_vs_5001" "0"
[9,] "foodB.subject_id5002"    "-1"
[10,] "foodC.subject_id5002"    "1"
[11,] "foodB.subject_id5003"    "1"
[12,] "foodC.subject_id5003"    "-1"
[13,] "foodB.subject_id5004"    "0"
[14,] "foodC.subject_id5004"    "0"
[15,] "foodB.subject_id5005"    "0"
[16,] "foodC.subject_id5005"    "0"
[17,] "foodB.subject_id5006"    "0"
[18,] "foodC.subject_id5006"    "0"

results(dds, contrast = coefs)

Is it correct? Is there a way to express the same contrast using a list of contrasts?

Thanks again for the help!

That looks correct except you need to divide the 1's and -1's by 2 (so 1/2 and -1/2) or else you aren't averaging the effect, but doubling it.

Hi Mark and thank again for your help!

If I understood correctly the interaction between, let's say, food B and subject 5002 should be:

(5002.B - 5002.A) - (5001.B - 5001.A)

if that holds, the interaction between food B and subject 5003 should be:

(5003.B - 5003.A) - (5001.B - 5001.A)

so if I subtract the two I get:

(5003.B - 5003.A) - (5001.B - 5001.A) - [(5002.B - 5002.A) - (5001.B - 5001.A)] =
= (5003.B - 5003.A) - (5001.B - 5001.A) - (5002.B - 5002.A) + (5001.B - 5001.A) =
= (5003.B - 5003.A) - (5002.B - 5002.A)

that is the difference between treatment B and A for subject 5002 and 5003 (the interaction term).

If I do the same thing for food C I obtain:

(5003.C - 5003.A) - (5002.C - 5002.A)

and by subtracting the two:

(5003.C - 5003.A) - (5002.C - 5002.A) - [(5003.B - 5003.A) - (5002.B - 5002.A)] =
= (5003.C - 5003.A) - (5002.C - 5002.A) - (5003.B - 5003.A) + (5002.B - 5002.A) =
= 5003.C - 5003.A - 5002.C + 5002.A - 5003.B + 5003.A + 5002.B - 5002.A =
= 5003.C - 5002.C - 5003.B + 5002.B =
= (5003.C - 5002.C) - (5003.B - 5002.B)

that is the interaction term between food C and B for subjects 5003 and 5002. If I use 1/2 and -1/2 instead of 1 and -1 the result should not change but I don't understand why I need to do that ...

Many thanks again for the help!

Best,

Giovanni