Hello, I'm trying to identify differentially expressed genes between some conditions. I have read that including and not including the intercept in the design should lead to identical results for the equivalent contrast arguments. However, I am getting different results in my analysis and was wondering if anyone know why this is happening? I have looked for other posts similar to my question but have not been able to find much help.
My coldata looks something like this:
And here is my code to test with and without the intercept in the design:
# with intercept (I get about 1266 genes with padj < 0.05)
test1 <- DESeqDataSetFromMatrix(countData = mat,
colData = meta,
design = ~ treatment,
rowData = genes.list)
test1 <- DESeq(test1)
resultsNames(test1)
a <- as.data.frame(results(test1, contrast = list("treatment_A_vs_B"), tidy = TRUE))
# without intercept (I get about 1149 genes with padj < 0.05)
test2 <- DESeqDataSetFromMatrix(countData = mat,
colData = meta,
design = ~ 0 + treatment,
rowData = genes.list)
test2 <- DESeq(test2)
resultsNames(test2)
b <- as.data.frame(results(test2, contrast = list("treatmentA", "treatmentB"), tidy = TRUE))
Similarly, I also get different results when I compare results with the designs ~ donor + treatment and ~ 0 + treatment + donor. This is what I ultimately want (differences in treatment while controlling for donor); I have used only treatment above just as an example.
Would appreciate any insight!
Hi Michael, thank you for your reply. Like you said the correlation between the two is 0.94. The genes were more or less the same when I did more stringent pre-filtering (removing rows with lots of zeroes), but I do have a few questions:
Is there anything else I could do to retain the number of genes? (maybe pre-filtering by treatment groups).
If i did not pre-filter, would the with or without intercept design be more "accurate"? I am wondering as they are quite different.
Would LRT be a better option?
The vignette says that pre-filtering is not required, but how does independent filtering deal with this if it takes the mean of normalised counts? As in shouldn't I be getting the same results for with and without the intercept in the design even if I don't pre-filter due to independent filtering?
Pre-filtering makes sense to me, for example maybe you want to focus on genes with minimal expression in particular groups being compared, or all groups.
There may be a lot of heterogeneity across all of your groups, and you may have better modeling with just subsetting to the two groups and running
DESeq()
just on the two groups.Thank you for your advice! Looking at my PCA, I do have a lot of within group variability so subsetting makes sense.
I was asked to compare each group to an average of other groups (for example group V to the average of D, DL and VL). But I'm not sure about this as subsetting and doing pairwise comparisons makes most sense for this data. Out of curiosity, in cases like these with some within group variability, would this type of comparison (one to an average of others) make sense? If yes, would design = ~ 0 + treatment be acceptable? Or should this avoided?
A few things, first, yes
~0 + treatment
is the right design to compare one group to the average of others.However, I don't recommend doing this type of comparison very often, and you're right on track with why it may be an issue.
Even if there wasn't a lot of heterogeneity across groups in terms of variance, another reason is that the average doesn't necessarily represent anything that exists.
E.g. three people: A is 150 cm tall, B is 160 cm, C is 170 cm. The LFC of B compared to the average of A and C is 0... but is this meaningful?
That all makes sense and is clear now. Thank you very much for your help!