Understanding interactions in DESeq2
1
0
Entering edit mode
@benjaminaarontaylor-22552
Last seen 4.4 years ago

I have a multifactorial design containing an integer variable (‘age’) and a boolean factor (‘qr’) which represents treatment (TRUE) vs control (FALSE).

str(data.phenotypes)

Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   71 obs. of  3 variables:
 $ Novogene_ID: chr  "W01RDYL" "W03BLGR" "W03GR" "W03SL" ...
 $ qr         : Factor w/ 2 levels "FALSE","TRUE": 1 2 2 2 2 2 2 2 1 1 ...
 $ age        : int  20 11 23 25 24 24 24 23 9 23 ...

head(data.phenotypes)

# A tibble: 6 x 3
  Novogene_ID qr      age
  <chr>       <fct> <int>
1 W01RDYL     FALSE    20
2 W03BLGR     TRUE     11
3 W03GR       TRUE     23
4 W03SL       TRUE     25
5 W04BLOR     TRUE     24
6 W04GRSL     TRUE     24

str(data.gene.counts)

int [1:11313, 1:71] 198 185 557 496 118 435 760 4662 1159 68 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:11313] "gene4494" "gene4495" "gene4496" "gene4497" ...
  ..$ : chr [1:71] "W01RDYL" "W03BLGR" "W03GR" "W03SL" ...

>head(data.gene.counts[,1:5])

         W01RDYL W03BLGR W03GR W03SL W04BLOR
gene4494     198     145   188   207     183
gene4495     185      85   114   161     145
gene4496     557     484   478   644     582
gene4497     496     412   402   544     449
gene4490     118     166   119   159      98
gene4491     435     390   424   641     478

Running DESeq2 using a simple additive design (~ age + qr) shows a large number of genes that are differentially expressed with age and a smaller number that are differentially expressed between the two conditions for qr, both of which results are expected:

dds = DESeqDataSetFromMatrix(countData = data.gene.counts,
                             colData = data.phenotypes,
                             design = as.formula("~ age + qr"))

dds.deg = DESeq(dds)

results = resultsNames(dds.deg)

resultstable = matrix(nrow = length(results), ncol = 1, dimnames = list(results,"DEGs"))

#loop through results and extract significant DEGs for each model term
for(i in 1:length(results)){

  res = results(dds.deg, 
                name = results[i],     
                alpha = 0.05)

  DEGs = (length(na.omit(which(res$padj<0.05))))

  resultstable[results[i],"DEGs"] = DEGs

}

resultstable
                 DEGs
Intercept        10281
age               1683
qr_TRUE_vs_FALSE   283

What I'm finding confusing are the results when I add an interactions term: (~ age + gr + age:qr). We have strong a priori reason to think that there should be a large number of genes that respond differently to age in the treatment condition vs the control condition. If I understand correctly, this would be evidenced by a large number of genes responding to the age:qr interaction term. However, this isn't the result we see is very different from these expectations:

dds = DESeqDataSetFromMatrix(countData = data.gene.counts,
                             colData = data.phenotypes,
                             design = as.formula("~ age + qr + age:qr"))

dds.deg = DESeq(dds)

results = resultsNames(dds.deg)

resultstable2 = matrix(nrow = length(results), ncol = 1, dimnames = list(results,"DEGs"))

#loop through results and extract significant DEGs for each model term
for(i in 1:length(results)){

  res = results(dds.deg, 
                name = results[i],     
                alpha = 0.05)

  DEGs = (length(na.omit(which(res$padj<0.05))))

  resultstable2[results[i],"DEGs"] = DEGs

}

resultstable2
                  DEGs
Intercept        10122
age                396
qr_TRUE_vs_FALSE     1
age.qrTRUE           0

Two things confuse me about these results:

  1. No genes are identified as responding significantly to the interaction term. While it's possible that our hypothesis is simply wrong and there's genuinely no interaction between qr and age, this seems really surprising given our other analyses. For example, we observe a strongly age-contingent phenotypic response in the treatment condition that doesn't exist in the control condition, and it would be surprising if this weren't reflected at all in the gene expression data. I'd expect some genes to respond to the interaction, even if it were only a small number, which leads me to think that maybe I've somehow constructed or interpreted the model wrongly.

  2. The number of genes identified as responding to age and to qr individually is greatly reduced after the addition of the interaction term. I appreciate that the interaction term is soaking up some of the DFs available, which is going to cost us some power to detect differential expression. However, it's strange to me that the addition of a single interaction term seems to completely remove our ability to detect genes that are differentially expressed between treatment and control conditions. Again, this makes me wonder whether something is going on with the model that I've failed to grasp.

Can anyone help me here? I can't for the life of me understand how I'm getting such radically different results simply from the addition of a single interaction term.

sessionInfo()

R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8     LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] forcats_0.4.0               stringr_1.4.0               dplyr_0.8.3                 purrr_0.3.3                 readr_1.3.1                 tidyr_1.0.0                
 [7] tibble_2.1.3                ggplot2_3.2.1               tidyverse_1.2.1             DESeq2_1.24.0               SummarizedExperiment_1.14.1 DelayedArray_0.10.0        
[13] BiocParallel_1.18.1         matrixStats_0.55.0          Biobase_2.44.0              GenomicRanges_1.36.1        GenomeInfoDb_1.20.0         IRanges_2.18.3             
[19] S4Vectors_0.22.1            BiocGenerics_0.30.0         edgeR_3.26.8                limma_3.40.6                lmerTest_3.1-0              lme4_1.1-21                
[25] Matrix_1.2-17
deseq2 • 1.8k views
ADD COMMENT
1
Entering edit mode
swbarnes2 ★ 1.4k
@swbarnes2-14086
Last seen 2 hours ago
San Diego

I can't for the life of me understand how I'm getting such radically different results simply from the addition of a single interaction term.

From the tutorial: http://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#interactions

The key point to remember about designs with interaction terms is that, unlike for a design ~genotype + condition, where the condition effect represents the overall effect controlling for differences due to genotype, by adding genotype:condition, the main condition effect only represents the effect of condition for the reference level of genotype (I, or whichever level was defined by the user as the reference level).

ADD COMMENT

Login before adding your answer.

Traffic: 623 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6