Question

Highly differentially expressed genes having zero expression in one of the conditions lose significance if batch is added to the model in DESeq2

0

Entering edit mode

leonfodoulian • 0

@leonfodoulian-14413

Last seen 5.8 years ago

Hi Mike,

I am running DESeq2 version 1.22.2 on a human bulk RNA-sequencing dataset to compare gene expression between 2 tissues, biopsies that were collected from 4 patients. Two of these patients gave one of each tissues, and 2 others gave either of each. The colData table inputted to DESeq2 is the following:

> sample.info
                          Samples Conditions Patients
patient1_tissue1 patient1_tissue1    tissue1 patient1
patient1_tissue2 patient1_tissue2    tissue2 patient1
patient2_tissue1 patient2_tissue1    tissue1 patient2
patient3_tissue2 patient3_tissue2    tissue2 patient3
patient4_tissue1 patient4_tissue1    tissue1 patient4
patient4_tissue2 patient4_tissue2    tissue2 patient4

When I run DESeq2 where gene expression is modeled as a function of Conditions, I get the expected results of log2FoldChange and padj for my genes of interest:

> dds <- DESeqDataSetFromMatrix(countData = counts.wf,
                                colData = sample.info,
                                design = ~ Conditions)
> dds <- estimateSizeFactors(object = dds)
> dds <- estimateDispersions(object = dds)
> dds <- nbinomWaldTest(object = dds,
                        betaPrior = FALSE)
> res <- results(object = dds,
                 cooksCutoff = FALSE)
> dds.res <- as.data.frame(x = res)
> dds.res[c("GENE1", "GENE2"),]
        baseMean log2FoldChange    lfcSE     stat       pvalue         padj
GENE1  154.10504      10.597958 1.336790 7.927916 2.228534e-15 2.987654e-12
GENE2   86.15448       8.794734 1.615313 5.444601 5.192167e-08 9.061407e-06

These genes are [almost] exclusively expressed in tissue1 samples, but not in tissue2 samples:

> counts.wf[c("GENE1", "GENE2"),]
       patient1_tissue1 patient1_tissue2 patient2_tissue1 patient3_tissue2 patient4_tissue1 patient4_tissue2
GENE1               301                0              186                0              511                0
GENE2                22                0              118                1              443                0

Surprisingly, when I add Patients to the model to control for potential batch effects induced by differences in patient of origin, GENE1 loses significance despite its exclusive expression in one of the tissues (tissue1). However, this does not affect GENE2, seemingly because GENE2 has 1 count in one of the tissue2 samples (patient3_tissue2):

> dds.p <- DESeqDataSetFromMatrix(countData = counts.wf,
                                  colData = sample.info,
                                  design = ~ Patients + Conditions)
> dds.p <- estimateSizeFactors(object = dds.p)
> dds.p <- estimateDispersions(object = dds.p)
> dds.p <- nbinomWaldTest(object = dds.p,
                          betaPrior = FALSE)
> res.p <- results(object = dds.p,
                   cooksCutoff = FALSE)
> dds.res.p <- as.data.frame(x = res.p)
> dds.res.p[c("GENE1", "GENE2"),]
        baseMean log2FoldChange    lfcSE     stat       pvalue         padj
GENE1  154.10504      10.864688 4.784961 2.270591 2.317175e-02 3.777198e-01
GENE2   86.15448       8.898013 1.547535 5.749798 8.935019e-09 1.927043e-06

I therefore added a pseudocount of 1 to all the gene counts in counts.wf to test for this, and indeed GENE1 retains significance even if Patients is added as an independent variable to the model:

> dds.p.pc1 <- DESeqDataSetFromMatrix(countData = counts.wf + 1,
                                      colData = sample.info,
                                      design = ~ Patients + Conditions)
> dds.p.pc1 <- estimateSizeFactors(object = dds.p.pc1)
> dds.p.pc1 <- estimateDispersions(object = dds.p.pc1)
> dds.p.pc1 <- nbinomWaldTest(object = dds.p.pc1,
                              betaPrior = FALSE)
> res.p.pc1 <- results(object = dds.p.pc1,
                       cooksCutoff = FALSE)
> dds.res.p.pc1 <- as.data.frame(x = res.p.pc1)
> dds.res.p.pc1[c("GENE1", "GENE2"),]
        baseMean log2FoldChange    lfcSE     stat       pvalue         padj
GENE1  158.76304       8.434732 1.150680 7.330217 2.297802e-13 1.006481e-10
GENE2   89.12941       7.096690 1.117555 6.350193 2.150457e-10 5.546983e-08

Is there any reason for me not to add a pseudocount of 1 to still account for the Patients effect in the model without losing the significance for tissue-specific genes? If yes, is there any other "proper" way of circumventing this issue?

Please find my sessionInfo() below:

> devtools::session_info()
─ Session info ─────────────────────────────────────────────────────────────────────────────────────────────────
 setting  value                       
 version  R version 3.5.0 (2018-04-23)
 os       macOS Sierra 10.12.6        
 system   x86_64, darwin15.6.0        
 ui       RStudio                     
 language (EN)                        
 collate  en_US.UTF-8                 
 ctype    en_US.UTF-8                 
 tz       Europe/Zurich               
 date     2020-04-16                  

─ Packages ─────────────────────────────────────────────────────────────────────────────────────────────────────
 package              * version    date       lib source                        
 acepack                1.4.1      2016-10-29 [1] CRAN (R 3.5.0)                
 annotate               1.59.1     2018-07-19 [1] Bioconductor                  
 AnnotationDbi          1.43.1     2018-06-13 [1] Bioconductor                  
 apeglm               * 1.4.2      2019-01-04 [1] Bioconductor                  
 assertthat             0.2.0      2017-04-11 [1] CRAN (R 3.5.0)                
 backports              1.1.2      2017-12-13 [1] CRAN (R 3.5.0)                
 base64enc              0.1-3      2015-07-28 [1] CRAN (R 3.5.0)                
 bbmle                  1.0.20     2017-10-30 [1] CRAN (R 3.5.0)                
 Biobase              * 2.41.1     2018-06-17 [1] Bioconductor                  
 BiocGenerics         * 0.27.1     2018-06-17 [1] Bioconductor                  
 BiocParallel         * 1.15.7     2018-07-08 [1] Bioconductor                  
 bit                    1.1-14     2018-05-29 [1] CRAN (R 3.5.0)                
 bit64                  0.9-7      2017-05-08 [1] CRAN (R 3.5.0)                
 bitops                 1.0-6      2013-08-17 [1] CRAN (R 3.5.0)                
 blob                   1.1.1      2018-03-25 [1] CRAN (R 3.5.0)                
 callr                  3.3.2      2019-09-22 [1] CRAN (R 3.5.2)                
 checkmate              1.8.5      2017-10-24 [1] CRAN (R 3.5.0)                
 cli                    1.1.0      2019-03-19 [1] CRAN (R 3.5.2)                
 cluster                2.0.7-1    2018-04-13 [1] CRAN (R 3.5.0)                
 coda                   0.19-3     2019-07-05 [1] CRAN (R 3.5.2)                
 colorspace             1.3-2      2016-12-14 [1] CRAN (R 3.5.0)                
 crayon                 1.3.4      2017-09-16 [1] CRAN (R 3.5.0)                
 data.table             1.11.4     2018-05-27 [1] CRAN (R 3.5.0)                
 DBI                    1.0.0      2018-05-02 [1] CRAN (R 3.5.0)                
 DelayedArray         * 0.8.0      2018-10-30 [1] Bioconductor                  
 desc                   1.2.0      2018-05-01 [1] CRAN (R 3.5.0)                
 DESeq2               * 1.22.2     2019-01-04 [1] Bioconductor                  
 devtools               2.2.1      2019-09-24 [1] CRAN (R 3.5.2)                
 digest                 0.6.25     2020-02-23 [1] CRAN (R 3.5.0)                
 dplyr                  0.8.0.1    2019-02-15 [1] CRAN (R 3.5.2)                
 ellipsis               0.3.0      2019-09-20 [1] CRAN (R 3.5.2)                
 emdbook                1.3.12     2020-02-19 [1] CRAN (R 3.5.2)                
 extrafont            * 0.17       2014-12-08 [1] CRAN (R 3.5.0)                
 extrafontdb            1.0        2012-06-11 [1] CRAN (R 3.5.0)                
 foreign                0.8-70     2017-11-28 [1] CRAN (R 3.5.0)                
 Formula                1.2-3      2018-05-03 [1] CRAN (R 3.5.0)                
 fs                     1.3.1      2019-05-06 [1] CRAN (R 3.5.2)                
 genefilter             1.64.0     2018-10-30 [1] Bioconductor                  
 geneplotter            1.60.0     2018-10-30 [1] Bioconductor                  
 GenomeInfoDb         * 1.17.1     2018-06-13 [1] Bioconductor                  
 GenomeInfoDbData       1.1.0      2018-07-10 [1] Bioconductor                  
 GenomicRanges        * 1.33.6     2018-06-13 [1] Bioconductor                  
 ggplot2              * 3.2.0      2019-06-16 [1] CRAN (R 3.5.2)                
 glue                   1.3.2      2020-03-12 [1] CRAN (R 3.5.0)                
 gridExtra              2.3        2017-09-09 [1] CRAN (R 3.5.0)                
 gtable                 0.3.0      2019-03-25 [1] CRAN (R 3.5.0)                
 Hmisc                  4.1-1      2018-01-03 [1] CRAN (R 3.5.0)                
 htmlTable              1.12       2018-05-26 [1] CRAN (R 3.5.0)                
 htmltools              0.3.6      2017-04-28 [1] CRAN (R 3.5.0)                
 htmlwidgets            1.5.1      2019-10-08 [1] CRAN (R 3.5.2)                
 IRanges              * 2.15.14    2018-06-13 [1] Bioconductor                  
 knitr                  1.20       2018-02-20 [1] CRAN (R 3.5.0)                
 lattice                0.20-35    2017-03-25 [1] CRAN (R 3.5.0)                
 latticeExtra           0.6-28     2016-02-09 [1] CRAN (R 3.5.0)                
 lazyeval               0.2.1      2017-10-29 [1] CRAN (R 3.5.0)                
 locfit                 1.5-9.1    2013-04-20 [1] CRAN (R 3.5.0)                
 magrittr               1.5        2014-11-22 [1] CRAN (R 3.5.0)                
 MASS                   7.3-50     2018-04-30 [1] CRAN (R 3.5.0)                
 Matrix                 1.2-14     2018-04-13 [1] CRAN (R 3.5.0)                
 matrixStats          * 0.54.0     2018-07-23 [1] CRAN (R 3.5.0)                
 memoise                1.1.0      2017-04-21 [1] CRAN (R 3.5.0)                
 munsell                0.5.0      2018-06-12 [1] CRAN (R 3.5.0)                
 nnet                   7.3-12     2016-02-02 [1] CRAN (R 3.5.0)                
 numDeriv               2016.8-1   2016-08-27 [1] CRAN (R 3.5.0)                
 pillar                 1.3.1      2018-12-15 [1] CRAN (R 3.5.0)                
 pkgbuild               1.0.6      2019-10-09 [1] CRAN (R 3.5.2)                
 pkgconfig              2.0.2      2018-08-16 [1] CRAN (R 3.5.0)                
 pkgload                1.0.2      2018-10-29 [1] CRAN (R 3.5.0)                
 plyr                   1.8.4      2016-06-08 [1] CRAN (R 3.5.0)                
 prettyunits            1.0.2      2015-07-13 [1] CRAN (R 3.5.0)                
 processx               3.4.1      2019-07-18 [1] CRAN (R 3.5.2)                
 ps                     1.3.0      2018-12-21 [1] CRAN (R 3.5.0)                
 purrr                  0.3.3      2019-10-18 [1] CRAN (R 3.5.2)                
 R6                     2.4.0      2019-02-14 [1] CRAN (R 3.5.2)                
 RColorBrewer           1.1-2      2014-12-07 [1] CRAN (R 3.5.0)                
 Rcpp                   1.0.4      2020-03-17 [1] CRAN (R 3.5.0)                
 RCurl                  1.95-4.10  2018-01-04 [1] CRAN (R 3.5.0)                
 remotes                2.1.0      2019-06-24 [1] CRAN (R 3.5.2)                
 rlang                  0.4.5      2020-03-01 [1] CRAN (R 3.5.0)                
 rpart                  4.1-13     2018-02-23 [1] CRAN (R 3.5.0)                
 rprojroot              1.3-2      2018-01-03 [1] CRAN (R 3.5.0)                
 RSQLite                2.1.1      2018-05-06 [1] CRAN (R 3.5.0)                
 rstudioapi             0.7        2017-09-07 [1] CRAN (R 3.5.0)                
 Rttf2pt1               1.3.7      2018-06-29 [1] CRAN (R 3.5.0)                
 S4Vectors            * 0.20.1     2018-11-09 [1] Bioconductor                  
 scales                 1.0.0      2018-08-09 [1] CRAN (R 3.5.0)                
 sessioninfo            1.1.1      2018-11-05 [1] CRAN (R 3.5.0)                
 stringi                1.4.6      2020-02-17 [1] CRAN (R 3.5.0)                
 stringr                1.4.0      2019-02-10 [1] CRAN (R 3.5.0)                
 SummarizedExperiment * 1.11.5     2018-06-13 [1] Bioconductor                  
 survival               3.1-11     2020-03-07 [1] CRAN (R 3.5.0)                
 testthat               2.3.1      2019-12-01 [1] CRAN (R 3.5.2)                
 tibble                 2.0.1      2019-01-12 [1] CRAN (R 3.5.2)                
 tidyselect             0.2.5      2018-10-11 [1] CRAN (R 3.5.0)                
 usethis                1.5.1      2019-07-04 [1] CRAN (R 3.5.2)                
 withr                  2.1.2      2018-03-15 [1] CRAN (R 3.5.0)                
 XML                    3.98-1.16  2018-08-19 [1] CRAN (R 3.5.0)                
 xtable                 1.8-2      2016-02-05 [1] CRAN (R 3.5.0)                
 XVector                0.21.3     2018-06-23 [1] Bioconductor                  
 yaml                   2.2.0.9999 2020-02-01 [1] Github (viking/r-yaml@b954d7d)
 zlibbioc               1.27.0     2018-06-13 [1] Bioconductor                  

[1] /Library/Frameworks/R.framework/Versions/3.5/Resources/library

Best regards, Leon

deseq2 • 1.1k views

ADD COMMENT • link 5.8 years ago leonfodoulian • 0

score 0 · Answer 1 · 2020-04-16

0

Entering edit mode

Michael Love 43k

@mikelove

Last seen 42 minutes ago

United States

hi Leon,

"GENE1 loses significance"

It looks like gene1 gets a p-value after controlling for batch of 0.023, so it's basically just marginally significant given that you've changed the design in a way that impacts the interpretation of the LFC. You also have only 6 samples to estimate 4 coefficients, so it's underpowered.

Can you try a likelihood ratio test, here, with full=~patient + condition compared to reduced=~patient?

The LRT tends to work better when parameters are at the boundary => E(Y|X)=0 for one tissue.

ADD COMMENT • link 5.8 years ago Michael Love 43k

0

Entering edit mode

Hi Mike,

Thank you for your quick reply.

Obviously, what I mean with 'GENE1 loses significance' is that its padj value is above 0.05 (0.37 in this case). I completely agree that this experiment is underpowered.

I tried running a likelihood ratio test as you suggested, but still this doesn't help:

> dds.p <- DESeqDataSetFromMatrix(countData = counts.wf,
                                  colData = sample.info,
                                  design = ~ Patients + Conditions)
> dds.p <- estimateSizeFactors(object = dds.p)
> dds.p <- estimateDispersions(object = dds.p)
> dds.p.LRT <- nbinomLRT(object = dds.p,
                         full = ~ Patients + Conditions,
                         reduced = ~ Patients)
> res.p.LRT <- results(object = dds.p.LRT,
                       cooksCutoff = FALSE)
> dds.res.p.LRT <- as.data.frame(x = res.p.LRT)
> dds.res.p.LRT[c("GENE1", "GENE2"),]
        baseMean log2FoldChange    lfcSE      stat       pvalue         padj
GENE1  154.10504      10.864688 4.784961  2.829669 9.253744e-02 9.712751e-01
GENE2   86.15448       8.898013 1.547535 40.384718 2.085661e-10 9.356276e-08

Any other suggestions?

Best, Leon

ADD REPLY • link 5.8 years ago leonfodoulian • 0

0

Entering edit mode

I guess it's just too under-powered to control for patient baseline and keep this gene at a lower enough p-value to survive multiple test correction. Kind of makes sense.

ADD REPLY • link 5.8 years ago Michael Love 43k

0

Entering edit mode

Unless I add a pseudocount of 1, as shown above. Do you suggest in this case to test only for Conditions? Or adding a pseudocount of 1 would make sense to you? I don’t see how this latter option can be problematic.

ADD REPLY • link 5.8 years ago leonfodoulian • 0

0

Entering edit mode

If you want to modify the counts, I'd go with a well tested procedure, like limma-voom, which does proper statistical modeling of scaled and shifted counts.

If you stick with DESeq2, I'd just drop the patient blocking I suppose if you wanted to keep this particular gene.

ADD REPLY • link 5.8 years ago Michael Love 43k

0

Entering edit mode

Thank you for your advices, Mike. They are very helpful. I will stick to DESeq2 and drop the Patients variable from the model. But I am still perplexed by why this is only happening when all samples from a given condition show no expression for the gene. Surprisingly, this does not occur if at least one sample from this condition expresses a single count for that gene (as shown above with GENE2). GENE1 was an example, but all genes exclusively expressed in either tissue (mainly previously known markers from the literature) are affected in the same way.

Thank you again for your time!

Best, Leon

ADD REPLY • link 5.8 years ago leonfodoulian • 0

0

Entering edit mode

Its the reason I said above, the parameter is on the boundary (mu = 0) so the coefficient is not finite, and there is tradeoff between other coefficients (patient 2 vs patient 1 baseline is 0/0).

ADD REPLY • link 5.8 years ago Michael Love 43k