Hi Mike,
I am running DESeq2
version 1.22.2 on a human bulk RNA-sequencing dataset to compare gene expression between 2 tissues, biopsies that were collected from 4 patients. Two of these patients gave one of each tissues, and 2 others gave either of each. The colData
table inputted to DESeq2
is the following:
> sample.info
Samples Conditions Patients
patient1_tissue1 patient1_tissue1 tissue1 patient1
patient1_tissue2 patient1_tissue2 tissue2 patient1
patient2_tissue1 patient2_tissue1 tissue1 patient2
patient3_tissue2 patient3_tissue2 tissue2 patient3
patient4_tissue1 patient4_tissue1 tissue1 patient4
patient4_tissue2 patient4_tissue2 tissue2 patient4
When I run DESeq2
where gene expression is modeled as a function of Conditions
, I get the expected results of log2FoldChange
and padj
for my genes of interest:
> dds <- DESeqDataSetFromMatrix(countData = counts.wf,
colData = sample.info,
design = ~ Conditions)
> dds <- estimateSizeFactors(object = dds)
> dds <- estimateDispersions(object = dds)
> dds <- nbinomWaldTest(object = dds,
betaPrior = FALSE)
> res <- results(object = dds,
cooksCutoff = FALSE)
> dds.res <- as.data.frame(x = res)
> dds.res[c("GENE1", "GENE2"),]
baseMean log2FoldChange lfcSE stat pvalue padj
GENE1 154.10504 10.597958 1.336790 7.927916 2.228534e-15 2.987654e-12
GENE2 86.15448 8.794734 1.615313 5.444601 5.192167e-08 9.061407e-06
These genes are [almost] exclusively expressed in tissue1
samples, but not in tissue2
samples:
> counts.wf[c("GENE1", "GENE2"),]
patient1_tissue1 patient1_tissue2 patient2_tissue1 patient3_tissue2 patient4_tissue1 patient4_tissue2
GENE1 301 0 186 0 511 0
GENE2 22 0 118 1 443 0
Surprisingly, when I add Patients
to the model to control for potential batch effects induced by differences in patient of origin, GENE1
loses significance despite its exclusive expression in one of the tissues (tissue1
). However, this does not affect GENE2
, seemingly because GENE2
has 1 count in one of the tissue2
samples (patient3_tissue2
):
> dds.p <- DESeqDataSetFromMatrix(countData = counts.wf,
colData = sample.info,
design = ~ Patients + Conditions)
> dds.p <- estimateSizeFactors(object = dds.p)
> dds.p <- estimateDispersions(object = dds.p)
> dds.p <- nbinomWaldTest(object = dds.p,
betaPrior = FALSE)
> res.p <- results(object = dds.p,
cooksCutoff = FALSE)
> dds.res.p <- as.data.frame(x = res.p)
> dds.res.p[c("GENE1", "GENE2"),]
baseMean log2FoldChange lfcSE stat pvalue padj
GENE1 154.10504 10.864688 4.784961 2.270591 2.317175e-02 3.777198e-01
GENE2 86.15448 8.898013 1.547535 5.749798 8.935019e-09 1.927043e-06
I therefore added a pseudocount of 1 to all the gene counts in counts.wf
to test for this, and indeed GENE1
retains significance even if Patients
is added as an independent variable to the model:
> dds.p.pc1 <- DESeqDataSetFromMatrix(countData = counts.wf + 1,
colData = sample.info,
design = ~ Patients + Conditions)
> dds.p.pc1 <- estimateSizeFactors(object = dds.p.pc1)
> dds.p.pc1 <- estimateDispersions(object = dds.p.pc1)
> dds.p.pc1 <- nbinomWaldTest(object = dds.p.pc1,
betaPrior = FALSE)
> res.p.pc1 <- results(object = dds.p.pc1,
cooksCutoff = FALSE)
> dds.res.p.pc1 <- as.data.frame(x = res.p.pc1)
> dds.res.p.pc1[c("GENE1", "GENE2"),]
baseMean log2FoldChange lfcSE stat pvalue padj
GENE1 158.76304 8.434732 1.150680 7.330217 2.297802e-13 1.006481e-10
GENE2 89.12941 7.096690 1.117555 6.350193 2.150457e-10 5.546983e-08
Is there any reason for me not to add a pseudocount of 1 to still account for the Patients
effect in the model without losing the significance for tissue-specific genes? If yes, is there any other "proper" way of circumventing this issue?
Please find my sessionInfo()
below:
> devtools::session_info()
─ Session info ─────────────────────────────────────────────────────────────────────────────────────────────────
setting value
version R version 3.5.0 (2018-04-23)
os macOS Sierra 10.12.6
system x86_64, darwin15.6.0
ui RStudio
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz Europe/Zurich
date 2020-04-16
─ Packages ─────────────────────────────────────────────────────────────────────────────────────────────────────
package * version date lib source
acepack 1.4.1 2016-10-29 [1] CRAN (R 3.5.0)
annotate 1.59.1 2018-07-19 [1] Bioconductor
AnnotationDbi 1.43.1 2018-06-13 [1] Bioconductor
apeglm * 1.4.2 2019-01-04 [1] Bioconductor
assertthat 0.2.0 2017-04-11 [1] CRAN (R 3.5.0)
backports 1.1.2 2017-12-13 [1] CRAN (R 3.5.0)
base64enc 0.1-3 2015-07-28 [1] CRAN (R 3.5.0)
bbmle 1.0.20 2017-10-30 [1] CRAN (R 3.5.0)
Biobase * 2.41.1 2018-06-17 [1] Bioconductor
BiocGenerics * 0.27.1 2018-06-17 [1] Bioconductor
BiocParallel * 1.15.7 2018-07-08 [1] Bioconductor
bit 1.1-14 2018-05-29 [1] CRAN (R 3.5.0)
bit64 0.9-7 2017-05-08 [1] CRAN (R 3.5.0)
bitops 1.0-6 2013-08-17 [1] CRAN (R 3.5.0)
blob 1.1.1 2018-03-25 [1] CRAN (R 3.5.0)
callr 3.3.2 2019-09-22 [1] CRAN (R 3.5.2)
checkmate 1.8.5 2017-10-24 [1] CRAN (R 3.5.0)
cli 1.1.0 2019-03-19 [1] CRAN (R 3.5.2)
cluster 2.0.7-1 2018-04-13 [1] CRAN (R 3.5.0)
coda 0.19-3 2019-07-05 [1] CRAN (R 3.5.2)
colorspace 1.3-2 2016-12-14 [1] CRAN (R 3.5.0)
crayon 1.3.4 2017-09-16 [1] CRAN (R 3.5.0)
data.table 1.11.4 2018-05-27 [1] CRAN (R 3.5.0)
DBI 1.0.0 2018-05-02 [1] CRAN (R 3.5.0)
DelayedArray * 0.8.0 2018-10-30 [1] Bioconductor
desc 1.2.0 2018-05-01 [1] CRAN (R 3.5.0)
DESeq2 * 1.22.2 2019-01-04 [1] Bioconductor
devtools 2.2.1 2019-09-24 [1] CRAN (R 3.5.2)
digest 0.6.25 2020-02-23 [1] CRAN (R 3.5.0)
dplyr 0.8.0.1 2019-02-15 [1] CRAN (R 3.5.2)
ellipsis 0.3.0 2019-09-20 [1] CRAN (R 3.5.2)
emdbook 1.3.12 2020-02-19 [1] CRAN (R 3.5.2)
extrafont * 0.17 2014-12-08 [1] CRAN (R 3.5.0)
extrafontdb 1.0 2012-06-11 [1] CRAN (R 3.5.0)
foreign 0.8-70 2017-11-28 [1] CRAN (R 3.5.0)
Formula 1.2-3 2018-05-03 [1] CRAN (R 3.5.0)
fs 1.3.1 2019-05-06 [1] CRAN (R 3.5.2)
genefilter 1.64.0 2018-10-30 [1] Bioconductor
geneplotter 1.60.0 2018-10-30 [1] Bioconductor
GenomeInfoDb * 1.17.1 2018-06-13 [1] Bioconductor
GenomeInfoDbData 1.1.0 2018-07-10 [1] Bioconductor
GenomicRanges * 1.33.6 2018-06-13 [1] Bioconductor
ggplot2 * 3.2.0 2019-06-16 [1] CRAN (R 3.5.2)
glue 1.3.2 2020-03-12 [1] CRAN (R 3.5.0)
gridExtra 2.3 2017-09-09 [1] CRAN (R 3.5.0)
gtable 0.3.0 2019-03-25 [1] CRAN (R 3.5.0)
Hmisc 4.1-1 2018-01-03 [1] CRAN (R 3.5.0)
htmlTable 1.12 2018-05-26 [1] CRAN (R 3.5.0)
htmltools 0.3.6 2017-04-28 [1] CRAN (R 3.5.0)
htmlwidgets 1.5.1 2019-10-08 [1] CRAN (R 3.5.2)
IRanges * 2.15.14 2018-06-13 [1] Bioconductor
knitr 1.20 2018-02-20 [1] CRAN (R 3.5.0)
lattice 0.20-35 2017-03-25 [1] CRAN (R 3.5.0)
latticeExtra 0.6-28 2016-02-09 [1] CRAN (R 3.5.0)
lazyeval 0.2.1 2017-10-29 [1] CRAN (R 3.5.0)
locfit 1.5-9.1 2013-04-20 [1] CRAN (R 3.5.0)
magrittr 1.5 2014-11-22 [1] CRAN (R 3.5.0)
MASS 7.3-50 2018-04-30 [1] CRAN (R 3.5.0)
Matrix 1.2-14 2018-04-13 [1] CRAN (R 3.5.0)
matrixStats * 0.54.0 2018-07-23 [1] CRAN (R 3.5.0)
memoise 1.1.0 2017-04-21 [1] CRAN (R 3.5.0)
munsell 0.5.0 2018-06-12 [1] CRAN (R 3.5.0)
nnet 7.3-12 2016-02-02 [1] CRAN (R 3.5.0)
numDeriv 2016.8-1 2016-08-27 [1] CRAN (R 3.5.0)
pillar 1.3.1 2018-12-15 [1] CRAN (R 3.5.0)
pkgbuild 1.0.6 2019-10-09 [1] CRAN (R 3.5.2)
pkgconfig 2.0.2 2018-08-16 [1] CRAN (R 3.5.0)
pkgload 1.0.2 2018-10-29 [1] CRAN (R 3.5.0)
plyr 1.8.4 2016-06-08 [1] CRAN (R 3.5.0)
prettyunits 1.0.2 2015-07-13 [1] CRAN (R 3.5.0)
processx 3.4.1 2019-07-18 [1] CRAN (R 3.5.2)
ps 1.3.0 2018-12-21 [1] CRAN (R 3.5.0)
purrr 0.3.3 2019-10-18 [1] CRAN (R 3.5.2)
R6 2.4.0 2019-02-14 [1] CRAN (R 3.5.2)
RColorBrewer 1.1-2 2014-12-07 [1] CRAN (R 3.5.0)
Rcpp 1.0.4 2020-03-17 [1] CRAN (R 3.5.0)
RCurl 1.95-4.10 2018-01-04 [1] CRAN (R 3.5.0)
remotes 2.1.0 2019-06-24 [1] CRAN (R 3.5.2)
rlang 0.4.5 2020-03-01 [1] CRAN (R 3.5.0)
rpart 4.1-13 2018-02-23 [1] CRAN (R 3.5.0)
rprojroot 1.3-2 2018-01-03 [1] CRAN (R 3.5.0)
RSQLite 2.1.1 2018-05-06 [1] CRAN (R 3.5.0)
rstudioapi 0.7 2017-09-07 [1] CRAN (R 3.5.0)
Rttf2pt1 1.3.7 2018-06-29 [1] CRAN (R 3.5.0)
S4Vectors * 0.20.1 2018-11-09 [1] Bioconductor
scales 1.0.0 2018-08-09 [1] CRAN (R 3.5.0)
sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.5.0)
stringi 1.4.6 2020-02-17 [1] CRAN (R 3.5.0)
stringr 1.4.0 2019-02-10 [1] CRAN (R 3.5.0)
SummarizedExperiment * 1.11.5 2018-06-13 [1] Bioconductor
survival 3.1-11 2020-03-07 [1] CRAN (R 3.5.0)
testthat 2.3.1 2019-12-01 [1] CRAN (R 3.5.2)
tibble 2.0.1 2019-01-12 [1] CRAN (R 3.5.2)
tidyselect 0.2.5 2018-10-11 [1] CRAN (R 3.5.0)
usethis 1.5.1 2019-07-04 [1] CRAN (R 3.5.2)
withr 2.1.2 2018-03-15 [1] CRAN (R 3.5.0)
XML 3.98-1.16 2018-08-19 [1] CRAN (R 3.5.0)
xtable 1.8-2 2016-02-05 [1] CRAN (R 3.5.0)
XVector 0.21.3 2018-06-23 [1] Bioconductor
yaml 2.2.0.9999 2020-02-01 [1] Github (viking/r-yaml@b954d7d)
zlibbioc 1.27.0 2018-06-13 [1] Bioconductor
[1] /Library/Frameworks/R.framework/Versions/3.5/Resources/library
Best regards, Leon
Hi Mike,
Thank you for your quick reply.
Obviously, what I mean with '
GENE1
loses significance' is that itspadj
value is above0.05
(0.37
in this case). I completely agree that this experiment is underpowered.I tried running a likelihood ratio test as you suggested, but still this doesn't help:
Any other suggestions?
Best, Leon
I guess it's just too under-powered to control for patient baseline and keep this gene at a lower enough p-value to survive multiple test correction. Kind of makes sense.
Unless I add a pseudocount of 1, as shown above. Do you suggest in this case to test only for
Conditions
? Or adding a pseudocount of 1 would make sense to you? I don’t see how this latter option can be problematic.If you want to modify the counts, I'd go with a well tested procedure, like limma-voom, which does proper statistical modeling of scaled and shifted counts.
If you stick with DESeq2, I'd just drop the patient blocking I suppose if you wanted to keep this particular gene.
Thank you for your advices, Mike. They are very helpful. I will stick to
DESeq2
and drop thePatients
variable from the model. But I am still perplexed by why this is only happening when all samples from a given condition show no expression for the gene. Surprisingly, this does not occur if at least one sample from this condition expresses a single count for that gene (as shown above withGENE2
).GENE1
was an example, but all genes exclusively expressed in either tissue (mainly previously known markers from the literature) are affected in the same way.Thank you again for your time!
Best, Leon
Its the reason I said above, the parameter is on the boundary (mu = 0) so the coefficient is not finite, and there is tradeoff between other coefficients (patient 2 vs patient 1 baseline is 0/0).