I'd like to pose a rather complicated question regarding a DE analysis I'm trying to do using DESeq2.
In brief, I have a dataset consisting of gene counts of d.melanogaster sampled at several time points and ranging all different development stages, from embryo to mature adult. Each stage has been sampled at uneven timepoints according to the organism development scheme (e.g.: every 2hrs for the embryo, at every larval stage for the larvae). For each timepoints I have a different number of replicates, ranging from 4 to 6.
I've been asked to find a number of significant DE genes between each stage (e.g. larva vs embryo, pupae vs larva and so on). I've first modeled my dataset by keeping into account only the stage when creating the DE object like this (an example of my design table is at the end of the post):
Using this approach and following the guidelines described in the tutorial, correcting with BH with an FDR=1%, I obtain a very high number of significant DE genes in each comparison (on average ~3K). This number is quite high, so I tried to rebuild my colData accounting for the peculiar "noise" of the time points, considering that many genes during a stage may turn on and off intermittently, though exhibiting a very high stage-specific variance.
I therefore first tried to model the DESeq object naming each timepoint and considering a model like
design = ~stage + timepoints + stage:timepoints,
then reducing the object for the timepoints and then applying the LRT test, as also described in the Bioconductor vignette. With this approach however the situation gets even worse, with the number of significant DE genes raising consistently (10K on average).
As final effort, I then decided to consider each timepoints separately by creating a "fake" column in my colData by pasting the stage and the timepoints, column so
colData$paste = paste(colData$stage, colData$timepoint, sep = "-") design = ~paste
and then try to model the gene counts in DEseq. At this point though, before computing the results(), I'd like to apply a contrasts matrix externally that considers the stages only, and then test for DE expression only between stages. Problem is, I don't know if this is possible and even more, if this procedure is correct. Another alternate approach I thought was to use sva or other packages to remove the "timepoints batch effect" in each stage, however I still haven't tried this. and I'd rather prefer to ask you if you had performed similar tests and what are the recommendations for this peculiar cases.
Here's a brief example of how my colData table is built,<caption>
I hope that this is clear enough, otherwise I'll try to be more intelligible. Of course if you need more code, or more info regarding the experiment, feel free to ask and I'll share what I can.
Thanks a lot!