Hi Michael
I am somewhat new to DeSeq2 and RNA Seq analysis (still slowly diving into the depths of DeqSeq2 manuscript)
I have an experiment with 4 time points and 5 biological replicates and each biological replicate has 5 technical replicates (100 samples thus)
The first two time points are really early is the organism's life cycle and hence we often expect less reads for some genes:
After summing the reads from the technical replicates, I get this profile (OBVIOUSLY AN ARTEFACT)
0h GeneA 3 reads
0h GeneA 2 reads
0h GeneA 0 reads
0h GeneA 1 reads
0h GeneA 3 reads
24h GeneA 0 reads
24h GeneA 1 reads
24h GeneA 2 reads
24h GeneA 0 reads
24h GeneA 1 reads
But DeSeq2 analysis of DE of these genes gives me a profile like this:
baseMean log2FoldChange lfcSE stat pvalue padj
31.88739159 -4.026822896 0.943450084 -4.268188603 1.97E-05 0.007173227
How do I filter for such artefacts ? Though I give an example here, my results are severely afflicted with these kind of low read count but highly significant p value and padj.
Any suggestion is highly appreciated...
I cannot set a minreadcount filter as these genes will have higher read count as the time line progresses.
Any guidance is very much appreciated - thanks
Thanks Michael
here are the counts before and after normalization: (five biological replicates at each 0h, 24h,48h,96h). I summed the read counts of the technical replicates before starting analysis
counts for geneA
for normalizing I did :
dds <- estimateSizeFactors(dds)
counts(dds, normalized=TRUE)
and I get this :
Wondering if there is a way to interpret this or if I did something wrong (most likely!)
So there was a lot more to this than you initially posted, and it's not "obviously an artifact".
There is a problem with having extreme differences in library size (3-4 orders of magnitude) confounded with condition. You can see this by plotting log10 of size factors over condition. We talk about this being a pathological case of confounding in the DESeq2 publication.
Even so, I think this gene is in fact DE given the evidence you have posted.
You have raw counts in the range of 0-3 for both groups being compared, true. But a raw count is representing 10x more expression in the condition=0 group, due to the small library size.
I do think the very low sequencing libraries almost perfectly confounded with condition could induce problems, not for this gene, but for other genes, in particular genes where the low sequencing samples have all 0's. I think if I had this data I would downsample the highest sequenced samples just to make sure that the results are not biased by the technical confounding.
Thanks a lot Michael for the insights I am going to look deeper at this issue. I meant it as an artefact also because for this organism we don't expect any reads at 0h and minimal at 24h then a burst of activity at 48h and 96h so the raw read counts did seem to affirm that story...