IGNORE this issue. It is likely that I did something wrong at some point since I am having problems reproducing the problem myself.
tldr: it seems that after `collapseReplicates` the levels of newly created factors are shifted affecting the results. **If** this observation is correct and not a due to a mistake i my design, I suggest adding a warning in either in the function itself, or a note in the vignette.
I was doing an analysis with a few variables and in one of my analysis I want to compare a mix of two of those: `DevStage` combined with `Treatment`. Following usual instructions, instead of creating a more complicated model, I simply created a new condition combining these variables:
(my apologies for having to source the sample data but it was the minimal example I could use)
library("DESeq2") source("https://gist.githubusercontent.com/adomingues/65ddd8a658b547952407eb5c01072e77/raw/98206cb399dcfeef6acda6513631eeb3c2be8fff/deseq2_factors.R") samples$NewCondition <- factor(paste(samples$DevStage, samples$Treatment, sep = "_")) ddsTmp <- makeExampleDESeqDataSet(m=28) colnames(ddsTmp) <- rownames(samples) countdata <- assay(ddsTmp) dds <- DESeqDataSetFromMatrix( countData = countdata, colData = samples, design = ~ NewCondition ) colData(dds) DataFrame with 28 rows and 5 columns sample DevStage Type <factor> <factor> <factor> wt_Fish_1dpf_rep1_mRNA.paired wt_Fish_1dpf_rep1_mRNA 1dpf paired wt_Fish_1dpf_rep2_mRNA.paired wt_Fish_1dpf_rep2_mRNA 1dpf paired wt_Fish_2dpf_rep1_mRNA.paired wt_Fish_2dpf_rep1_mRNA 2dpf paired wt_Fish_2dpf_rep2_mRNA.paired wt_Fish_2dpf_rep2_mRNA 2dpf paired wt_Fish_3dpf_rep1_mRNA.paired wt_Fish_3dpf_rep1_mRNA 3dpf paired ... ... ... ... wt_PGCs_6dpf_rep2_mRNA.paired wt_PGCs_6dpf_rep2_mRNA 6dpf paired wt_PGCs_6dpf_rep3_mRNA.paired wt_PGCs_6dpf_rep3_mRNA 6dpf paired wt_PGCs_10dpf_rep1_mRNA.paired wt_PGCs_10dpf_rep1_mRNA 10dpf paired wt_PGCs_10dpf_rep2_mRNA.paired wt_PGCs_10dpf_rep2_mRNA 10dpf paired wt_PGCs_10dpf_rep3_mRNA.paired wt_PGCs_10dpf_rep3_mRNA 10dpf paired Treatment NewCondition <character> <factor> wt_Fish_1dpf_rep1_mRNA.paired control 1dpf_control wt_Fish_1dpf_rep2_mRNA.paired control 1dpf_control wt_Fish_2dpf_rep1_mRNA.paired control 2dpf_control wt_Fish_2dpf_rep2_mRNA.paired control 2dpf_control wt_Fish_3dpf_rep1_mRNA.paired control 3dpf_control ... ... ... wt_PGCs_6dpf_rep2_mRNA.paired treated 6dpf_treated wt_PGCs_6dpf_rep3_mRNA.paired treated 6dpf_treated wt_PGCs_10dpf_rep1_mRNA.paired treated 10dpf_treated wt_PGCs_10dpf_rep2_mRNA.paired treated 10dpf_treated wt_PGCs_10dpf_rep3_mRNA.paired treated 10dpf_treated
Here what I am trying to compare is the effect of `treatment` at each `DevStage`:
- 1dpf_treated vs 1dpf_control
- 2dpf_treated vs 2dpf_control
Since in this set-up we also sequenced some samples with single and paired end reads, after QC I merged the replicated for the DGE analysis:
ddsColl <- collapseReplicates(dds, dds$sample, dds$Type) ddsColl <- DESeq(ddsColl) colData(ddsColl) DataFrame with 25 rows and 7 columns sample DevStage Type Treatment <factor> <factor> <factor> <character> wt_Fish_10dpf_rep1_mRNA wt_Fish_10dpf_rep1_mRNA 10dpf paired control wt_Fish_10dpf_rep2_mRNA wt_Fish_10dpf_rep2_mRNA 10dpf paired control wt_Fish_1dpf_rep1_mRNA wt_Fish_1dpf_rep1_mRNA 1dpf paired control wt_Fish_1dpf_rep2_mRNA wt_Fish_1dpf_rep2_mRNA 1dpf paired control wt_Fish_2dpf_rep1_mRNA wt_Fish_2dpf_rep1_mRNA 2dpf paired control ... ... ... ... ... wt_PGCs_3dpf_rep2_mRNA wt_PGCs_3dpf_rep2_mRNA 3dpf paired treated wt_PGCs_3dpf_rep3_mRNA wt_PGCs_3dpf_rep3_mRNA 3dpf paired treated wt_PGCs_6dpf_rep1_mRNA wt_PGCs_6dpf_rep1_mRNA 6dpf paired treated wt_PGCs_6dpf_rep2_mRNA wt_PGCs_6dpf_rep2_mRNA 6dpf paired treated wt_PGCs_6dpf_rep3_mRNA wt_PGCs_6dpf_rep3_mRNA 6dpf paired treated NewCondition runsCollapsed sizeFactor <factor> <character> <numeric> wt_Fish_10dpf_rep1_mRNA 10dpf_control paired 0.94359672670364 wt_Fish_10dpf_rep2_mRNA 10dpf_control paired 0.944675357175298 wt_Fish_1dpf_rep1_mRNA 1dpf_control paired 0.976524986018752 wt_Fish_1dpf_rep2_mRNA 1dpf_control paired 0.944935039261004 wt_Fish_2dpf_rep1_mRNA 2dpf_control paired 0.988405339505884 ... ... ... ... wt_PGCs_3dpf_rep2_mRNA 3dpf_treated paired 0.971152123564414 wt_PGCs_3dpf_rep3_mRNA 3dpf_treated paired 0.926392628153864 wt_PGCs_6dpf_rep1_mRNA 6dpf_treated paired,single 1.97410478743642 wt_PGCs_6dpf_rep2_mRNA 6dpf_treated paired 0.972588828338043 wt_PGCs_6dpf_rep3_mRNA 6dpf_treated paired 0.952043225967793
However when looking at the results, the FC and p-values did not match what I could see in either the IGV tracks or the read counts (not shown here). I dug deeper and found that the issue were the sifted factor levels after collapsing the replicates, specifically for rows after the libraries where collapsed:
# Before the first row collapsing # levels match what is expected ddsColl$NewCondition str(ddsColl$NewCondition) levels(ddsColl$NewCondition) # After the level name and value do not match ddsColl$NewCondition str(ddsColl$NewCondition) levels(ddsColl$NewCondition)
The way I solve it was to create the `NewCondition` after `collapseReplicates`. It is possible that this is a quite specific issue to my particular experimental design or I made a mistake somewhere. Nonetheless I would like to alert the developers and users for this in case someone else runs into similar issue, or when running many comparisons doesn't visually confirm the results of all of them. I noticed becasue there is a set of "control" genes in my set-up that checked to see the effect in all comparisons. Otherwise it would have gone unnoticed.