Hi, I’m performing a DEG analysis with RNA-seq data using salmon
+ DESeq2
. The data were obtained from different parts of cerebral tissue and I am expecting to get a few differentially expressed genes in my contrasts, because our case/control samples are quite similar (in a PCA). As described in DESeq2
vignette, I’ve summarized transcripts estimated counts into gene counts, following tximport
best practices. For DEG analysis, I need to compare contrasts and I included some covariates that I found to be important to explain gene expression variability.
Here is the example of my metadata used to build the design matrix. My contrasts are in the ‘group’ column (treated vs control). My covariates are ‘rin’ and ‘ph’ (they're scaled and centered).
> head(ann)
rin ph group
1 0.096719502 -0.1718305 Nac_treated_male
2 0.622282087 1.0948704 Nac_treated_male
3 0.832507121 -0.4698777 OFC_CTRL_male
4 -0.008393015 0.2007286 OFC_CTRL_male
5 -0.428843082 0.4615200 OFC_CTRL_male
6 0.622282087 -0.7306691 OFC_CTRL_male
Here is how I performed the expression analysis with DESeq2:
# Creating dds object
dds <- DESeqDataSetFromTximport(txi, colData = ann, design = ~ 0 + rin + ph + group)
# Filter low expression genes
keep <- rowSums(counts(dds) >= 10) >= 3
dds <- dds[keep,]
# Normalize, get dispersions, and perform Wald's test
dds <- DESeq(dds)
# Get results
res1 <- results(dds, contrast = c("group", treated, ctrl))
I’ve found a strange result for one of my contrasts, reaching ~23k DEG (almost all genes mapped, based on padjusted <= 0.05). Almost all of those genes in this contrast have a pvalue AND adjusted pvalues equal to 0 (exactly 0, not 0.000000e+00, for example). Pvalues distribution for this contrast are virtually all near zero. As I’m expecting to get few differentially expressed genes, this result is intriguing me. This happens only in this contrast, other contrasts seem fine to me.
I tried to filter genes with little variation with varFilter
function from genefilter
package and I still got similar results for this contrast.
Without the covariates 'rin' and 'ph', the model seems to be consistent: I used only the group variable ( ~0 + group) and the “problematic” contrast showed only 14 DEG (padjusted <= 0.05), with “healthy” p-values.
For comparison, I’ve performed the DEG analysis with edgeR. Here is how I did it:
# Buld matrix, filter and normalize
y <- DGEList(counts = txi$counts)
design <- model.matrix(~ 0 + rin + ph + group, data = ann)
keep <- filterByExpr(y)
y <- y[keep, , keep.lib.sizes = FALSE]
y <- calcNormFactors(y)
# Estimate dispersions and test with glm
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design)
res1 <- glmQLFTest(fit, contrast = c) # The contrast here is treated vs control
res1 <- topTags(res1, n = Inf, adjust.method = "BH")$table
I've found that edgeR seem to deal better with the covariates (rin and ph), and gives me only 8 DEG (again, adjusted values <=0.05) for the "problematic" contrast. Pvalues distribution seems ok to me.
I’m trying to understand why DESeq2 is giving me such small pvalues (and also adjusted pvalues) for this contrast. I’m wondering if there is something in my covariates that are causing the pvalues issue for DESeq2, but not for edgeR. I know that DESeq2 and edgeR have different approaches in normalization and statistical tests, but I would expect comparable results. I don’t know if I’m missing something very clear here, but I would appreciate if someone have some thoughts about how to investigate this issue. Thanks in advance!
Yes, you're right, I was testing it wrong. Is it the case to use betaPrior = TRUE in DESeq function for "~0 +" designed matrices? Thank you so much!
We recommend betaPrior=FALSE throughout now. For shrinkage we split out a new function, see vignette.