Hi, I've recently started using DESeq2 with complex experimental designs and I've run into some behavior that I can't wrap my head around.
One of the aims of the experiment is to study the transcriptional response following knockdown of several genes, working both under basal conditions and in an "induced state" where we know that several genes are up/downregulated (due to a treatment).
Right now I've got samples obtained in two different batches, with 4 biological replicates for the sh condition and some more for the wt controls.
I've decided to use a simple design with one factor representing the treatment and sh status together, as suggested in the documentation, and another one to control for the batch effect: "~ batch + condition", where condition takes values like shgeneX_untreated, shgeneX_treated, control_untreated and control_treated. After fitting the model I call DE genes for all the desired comparisons with results(dds, contrast=c("condition", "shgeneX_treated", "control_treated")). I am using betaPrior=TRUE, since someone suggested to me that this is on average a good idea.
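For concreteness, the merged setup looks roughly like this (cts, coldata and the genotype/treatment/batch columns are placeholders standing in for my real objects):

library(DESeq2)

# Sketch of the merged analysis: collapse sh/control and treated/untreated
# into a single factor, as the documentation suggests, and add batch to the design
# (cts, coldata and the column names here are placeholders)
coldata$condition <- factor(paste(coldata$genotype, coldata$treatment, sep = "_"))
dds_all <- DESeqDataSetFromMatrix(countData = cts, colData = coldata,
                                  design = ~ batch + condition)
dds_all <- DESeq(dds_all, betaPrior = TRUE)
res_all <- results(dds_all,
                   contrast = c("condition", "shgeneX_treated", "control_treated"))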
With this setup I obtained some strange calls: some genes that are not expressed in the untreated samples are called DE (FDR, corrected across all comparisons, < 0.05 and |lfc| > 1) in comparisons like shgeneX_untreated vs control_untreated... I therefore reasoned that my initial filtering on raw reads would be able to remove these genes if I worked separately on the treated and untreated samples, building two different DESeq2 objects and so on. I also feared that, working with everything together, the variance estimates could be inflated due to the high variability between the treated and untreated conditions (they are widely separated along the first principal component).
To make a long story short, when I split treated and untreated samples I got very similar p-values and baseMeans but smaller log fold changes, and therefore the "strange" calls go away (I am not always able to filter out those genes with a raw-read filter, since in some sh samples, which are not outliers on the PCA, they still have a reasonable number of reads...). The DE genes with this setup are a subset of those from the first setup.
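Schematically, the split setup is (again with placeholder object/column names, dropping the unused factor levels in each subset):

# Sketch of the split setup: one DESeq2 object per treatment state,
# each one filtered and fitted independently
untreated    <- coldata$treatment == "untreated"
coldata_untr <- droplevels(coldata[untreated, ])
dds_untr <- DESeqDataSetFromMatrix(countData = cts[, untreated],
                                   colData = coldata_untr,
                                   design = ~ batch + condition)
dds_untr <- dds_untr[rowSums(counts(dds_untr) > mincounts) >= mingroup, ]
dds_untr <- DESeq(dds_untr, betaPrior = TRUE)
res_untr <- results(dds_untr,
                    contrast = c("condition", "shgeneX_untreated", "control_untreated"))
# ...and the same for the treated samples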
Since I stumbled upon this observation: "The beta prior is estimated from the observed differences between conditions across all genes, so if there are many large differences here, there is less shrinkage." (from the thread "DESeq2 with high dispersions"), I tried to use betaPrior=FALSE with the split setup, and this made the LFCs more similar to those from the first setup (for the curious, I put some plots here: https://www.dropbox.com/sh/5i9r3dsva23xbys/AABXjauLgOdojtYNdSfS_6u3a?dl=0).
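On the split objects that just means refitting without the prior, e.g.:

# Same split object as above, but without the LFC prior
dds_untr_noprior <- DESeq(dds_untr, betaPrior = FALSE)
res_untr_noprior <- results(dds_untr_noprior,
                            contrast = c("condition", "shgeneX_untreated", "control_untreated"))
# (newer DESeq2 releases move the shrinkage out of DESeq() into lfcShrink())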
My questions, sorry for the long post, are:
- in experiments like this one, is betaPrior=TRUE / LFC shrinkage recommended or not? To me it seems that shrinking fold changes in this way would work if the comparisons I am interested in were the ones where the large differences lie, but I am not sure
- considering that I need to perform the low-count filtering on treated and untreated samples separately, is it OK to split the analysis or not? Right now I am tempted to use the results obtained with the split setup and betaPrior=TRUE, but I would like some insights :)
merged_pruned is a data.frame with sample identifiers as colnames, Entrez IDs as rownames and counts as values.
conditions_pruned has columns conditions_sample, rep and comparison, and has sample identifiers as rownames.
mincounts is 10, mingroup is 3.
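For completeness, the thresholds and the expected layout of the inputs:

# Low-count filter thresholds: keep genes with > mincounts reads
# in at least mingroup samples
mincounts <- 10
mingroup  <- 3
# merged_pruned: data.frame of raw counts, Entrez IDs x samples
# conditions_pruned: data.frame with columns conditions_sample, rep, comparison,
#                    rownames matching colnames(merged_pruned)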
library(DESeq2)

# Batch (rep) + combined treatment/knockdown (comparison) design
dds <- DESeqDataSetFromMatrix(countData = merged_pruned, colData = conditions_pruned, design = ~ rep + comparison)
# Flag genes with more than mincounts reads in fewer than mingroup samples...
filterGenes <- rowSums(counts(dds) > mincounts) < mingroup
# ...and drop them before fitting
dds <- dds[!filterGenes, ]
dds <- DESeq(dds, parallel=TRUE, betaPrior=TRUE)
res <- results(dds, alpha=0.05, contrast=c("comparison", "Ino80_lps", "ctrl_lps"), parallel=TRUE)
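One alternative I've been thinking about for the filtering step, instead of splitting into two objects, would be to require the count threshold within at least one comparison group while keeping a single model; roughly (this would replace the filterGenes lines above):

# Keep genes with > mincounts reads in at least mingroup samples
# within at least one level of the comparison factor,
# instead of pooling all samples together
keep_by_group <- sapply(levels(dds$comparison), function(g) {
  rowSums(counts(dds)[, dds$comparison == g, drop = FALSE] > mincounts) >= mingroup
})
dds <- dds[rowSums(keep_by_group) > 0, ]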
Am I making some rookie mistakes? :)