Our lab is attempting to use the DESeq2 package to perform DGE analysis with a multifactorial experimental design. Our question of interest is whether a treatment effect differs for two conditions (tissues) C and T at each time point, day 3, 7, 14 and 56; we would like to obtain DEGs for each day. For each tissue group, 5 replicates were treated, and 3 served as controls, for a total of (5 treated + 3 controls)* 2 conditions = 16 replicates for each of the 4 days.
library(DESeq2) library(RUVSeq) # a separate dds object was created for each day by repeating these lines condition <- factor(c(rep("C", 8), rep("T",8))) treatment <- factor(c(rep("treated", 5), rep("control", 3), rep("treated", 5), rep("control", 3))) # RUVg with negative control house-keeping genes uq3 <- betweenLaneNormalization(cts3, which="upper") ruv3 <- RUVg(uq3, cIdx=normgenes, k=1, round=TRUE) coldata3 <- data.frame(row.names=colnames(cts3), treatment, condition, ruv3$W) coldata3$condtreat <- factor(paste0(coldata3$condition, coldata3$treatment)) dds3 <- DESeqDataSetFromMatrix(countData=cts3, colData=coldata3, design=~W_1 + condtreat) boxplot(log2(counts(dds3)), range=0, las=2) # medians line up dds3$treatment <- relevel( dds3$treatment, "control" ) dds3$condition <- relevel( dds3$condition, "C" ) dds3 <- DESeq(dds3) res3 <- results(dds3, contrast=list(c("condtreat_Ttreated_vs_Ccontrol"), c("condtreat_Tcontrol_vs_Ccontrol", "condtreat_Ctreated_vs_Ccontrol")), alpha=0.05) # is this contrast correct to find differences in treatment effect between both conditions? table(res3$padj<0.05) hist(res3$pvalue) # looks like the ones others have for significant data
After running the code above for day 3, a significant number of DEG's (with padj < 0.05) were outputted (~1200). However, by running the exact same calls for the other days, under 15 are found. By looking at the dataframe outputted by results(), The large majority of padj values are identical and greater than 0.99. After plotting the pvalue histogram, we noticed that days 7 and up had a "hill-shape" which could explain the inadequacy for the BH padj method. Why are the pvalue histograms so different from day 3?
Another data analyst has performed DGE on our data using EdgeR and managed to pull > 1000 genes per day. 500 of ours from day 3 overlap with theirs, which leads us to believe that our data is actually statistically significant; our DESeq2 analysis just seems to be missing something. Does anyone know what we could be overlooking?