Hi, I’m performing a DEG analysis with RNA-seq data using
DESeq2. The data were obtained from different parts of cerebral tissue and I am expecting to get a few differentially expressed genes in my contrasts, because our case/control samples are quite similar (in a PCA). As described in
DESeq2 vignette, I’ve summarized transcripts estimated counts into gene counts, following
tximport best practices. For DEG analysis, I need to compare contrasts and I included some covariates that I found to be important to explain gene expression variability.
Here is the example of my metadata used to build the design matrix. My contrasts are in the ‘group’ column (treated vs control). My covariates are ‘rin’ and ‘ph’ (they're scaled and centered).
> head(ann) rin ph group 1 0.096719502 -0.1718305 Nac_treated_male 2 0.622282087 1.0948704 Nac_treated_male 3 0.832507121 -0.4698777 OFC_CTRL_male 4 -0.008393015 0.2007286 OFC_CTRL_male 5 -0.428843082 0.4615200 OFC_CTRL_male 6 0.622282087 -0.7306691 OFC_CTRL_male
Here is how I performed the expression analysis with DESeq2:
# Creating dds object dds <- DESeqDataSetFromTximport(txi, colData = ann, design = ~ 0 + rin + ph + group) # Filter low expression genes keep <- rowSums(counts(dds) >= 10) >= 3 dds <- dds[keep,] # Normalize, get dispersions, and perform Wald's test dds <- DESeq(dds) # Get results res1 <- results(dds, contrast = c("group", treated, ctrl))
I’ve found a strange result for one of my contrasts, reaching ~23k DEG (almost all genes mapped, based on padjusted <= 0.05). Almost all of those genes in this contrast have a pvalue AND adjusted pvalues equal to 0 (exactly 0, not 0.000000e+00, for example). Pvalues distribution for this contrast are virtually all near zero. As I’m expecting to get few differentially expressed genes, this result is intriguing me. This happens only in this contrast, other contrasts seem fine to me.
I tried to filter genes with little variation with
varFilter function from
genefilter package and I still got similar results for this contrast.
Without the covariates 'rin' and 'ph', the model seems to be consistent: I used only the group variable ( ~0 + group) and the “problematic” contrast showed only 14 DEG (padjusted <= 0.05), with “healthy” p-values.
For comparison, I’ve performed the DEG analysis with edgeR. Here is how I did it:
# Buld matrix, filter and normalize y <- DGEList(counts = txi$counts) design <- model.matrix(~ 0 + rin + ph + group, data = ann) keep <- filterByExpr(y) y <- y[keep, , keep.lib.sizes = FALSE] y <- calcNormFactors(y) # Estimate dispersions and test with glm y <- estimateDisp(y, design) fit <- glmQLFit(y, design) res1 <- glmQLFTest(fit, contrast = c) # The contrast here is treated vs control res1 <- topTags(res1, n = Inf, adjust.method = "BH")$table
I've found that edgeR seem to deal better with the covariates (rin and ph), and gives me only 8 DEG (again, adjusted values <=0.05) for the "problematic" contrast. Pvalues distribution seems ok to me.
I’m trying to understand why DESeq2 is giving me such small pvalues (and also adjusted pvalues) for this contrast. I’m wondering if there is something in my covariates that are causing the pvalues issue for DESeq2, but not for edgeR. I know that DESeq2 and edgeR have different approaches in normalization and statistical tests, but I would expect comparable results. I don’t know if I’m missing something very clear here, but I would appreciate if someone have some thoughts about how to investigate this issue. Thanks in advance!