Question: Constant High Padj Values
0
21 months ago by
le.alex.chang0 wrote:

Hi everyone,

Our lab is attempting to use the DESeq2 package to perform DGE analysis with a multifactorial experimental design. Our question of interest is whether a treatment effect differs for two conditions (tissues) C and T at each time point, day 3, 7, 14 and 56; we would like to obtain DEGs for each day. For each tissue group, 5 replicates were treated, and 3 served as controls, for a total of (5 treated + 3 controls)* 2 conditions = 16 replicates for each of the 4 days.

library(DESeq2)
library(RUVSeq)
# a separate dds object was created for each day by repeating these lines
condition <- factor(c(rep("C", 8), rep("T",8)))
treatment <- factor(c(rep("treated", 5), rep("control", 3),
rep("treated", 5), rep("control", 3)))

# RUVg with negative control house-keeping genes
uq3 <- betweenLaneNormalization(cts3, which="upper")
ruv3 <- RUVg(uq3, cIdx=normgenes, k=1, round=TRUE)

coldata3 <- data.frame(row.names=colnames(cts3), treatment, condition, ruv3$W) coldata3$condtreat <- factor(paste0(coldata3$condition, coldata3$treatment))

dds3 <- DESeqDataSetFromMatrix(countData=cts3,
colData=coldata3,
design=~W_1 + condtreat)

boxplot(log2(counts(dds3)), range=0, las=2) # medians line up

dds3$treatment <- relevel( dds3$treatment, "control" )
dds3$condition <- relevel( dds3$condition, "C" )

dds3 <- DESeq(dds3)

res3 <- results(dds3, contrast=list(c("condtreat_Ttreated_vs_Ccontrol"),
c("condtreat_Tcontrol_vs_Ccontrol",
"condtreat_Ctreated_vs_Ccontrol")), alpha=0.05)
# is this contrast correct to find differences in treatment effect between both conditions?

table(res3$padj<0.05) hist(res3$pvalue) # looks like the ones others have for significant data

After running the code above for day 3, a significant number of DEG's (with padj < 0.05) were outputted (~1200). However, by running the exact same calls for the other days, under 15 are found. By looking at the dataframe outputted by results(), The large majority of padj values are identical and greater than 0.99. After plotting the pvalue histogram, we noticed that days 7 and up had a "hill-shape" which could explain the inadequacy for the BH padj method. Why are the pvalue histograms so different from day 3?

Another data analyst has performed DGE on our data using EdgeR and managed to pull > 1000 genes per day. 500 of ours from day 3 overlap with theirs, which leads us to believe that our data is actually statistically significant; our DESeq2 analysis just seems to be missing something. Does anyone know what we could be overlooking?

Thank you,

Alex

modified 21 months ago by Michael Love25k • written 21 months ago by le.alex.chang0
0
21 months ago by
Michael Love25k
United States
Michael Love25k wrote:

Let's jump back to the biological question and appropriate design:

"Our question of interest is whether a treatment effect differs for two conditions "

You can use an interaction term between condition and treatment, and test on this term. This would look like a design:

~condition + treatment + condition:treatment

Then you can easily perform a likelihood ratio test:

dds <- DESeq(dds, test="LRT", reduced=~condition + treatment)
res <- results(dds)

Hi Michael,

Thanks for this response, I'm working with Alex up there on this. We've done the LRT and had no issues with our dataset in that regard. In addition to the time-series question (which is answered with the LRT analyses), we wanted to know the DEGs at each individual time point taking into account each condition's respective control.

e.g. C(treated/control) vs. T(treated/control) at 3 days

We've gotten over 1200 genes for day 3, but 7, 14, 56 we've had lots of issues with the p-adjusted value being very close to 1. We aren't sure why this is happening, because we have biological replicates (others have reported this when there are no biological replicates).

Thanks for all your help,
James

You can look at examples by eye to get a better sense (see plotCounts), but from what you've told me, I would infer that the changes are more pronounced at day 3 and less so at the other days, such that testing on the individual interaction terms for those other days doesn't give you small enough p-values to survive multiple test correction.

Thanks Michael, it's strange as I've done the analysis on EdgeR and cuffdiff and have gotten hundreds of DEGs for day 7, 14, 56. I also visually inspected some of the genes that were found to be significant in one pipeline vs. deseq2, and the counts cluster far apart between the treated of the two conditions, but not for the control of the two conditions--is the interaction term going to tax us for that? I'm thinking something is wrong with the way we are writing the script. Did you see anything wrong with the scripts we wrote above for a single time-point comparison?

Thanks again, James

"Did you see anything wrong with the scripts we wrote above"

It seems correct, but try using the LRT code I recommended above for each day, testing on the interaction term by removing it from the design. (I assume you are still using the contrast code above when you say that you didn't find significant genes with DESeq2.)

No, the interaction won't tax you for that pattern, that is exactly what the interaction model is designed to pick up on.