Question: Dealing with low read count in early time point samples
0
7 months ago by
xei60
xei60 wrote:

Hi Michael

I am somewhat new to DeSeq2 and RNA Seq analysis (still slowly diving into the depths of DeqSeq2 manuscript)

I have an experiment with 4 time points and 5 biological replicates and each biological replicate has 5 technical replicates (100 samples thus)

The first two time points are really early is the organism's life cycle and hence we often expect less reads for some genes:

After summing the reads from the technical replicates, I get this profile (OBVIOUSLY AN ARTEFACT)

But DeSeq2 analysis of DE of these genes gives me a profile like this:
baseMean log2FoldChange lfcSE stat pvalue padj
31.88739159 -4.026822896 0.943450084 -4.268188603 1.97E-05 0.007173227

How do I filter for such artefacts ? Though I give an example here, my results are severely afflicted with these kind of low read count but highly significant p value and padj.
Any suggestion is highly appreciated...

I cannot set a minreadcount filter as these genes will have higher read count as the time line progresses.

Any guidance is very much appreciated - thanks

deseq2 • 166 views
modified 7 months ago by Michael Love23k • written 7 months ago by xei60
0
7 months ago by
Michael Love23k
United States
Michael Love23k wrote:

Can you post the plotCounts for this gene? Why is the base mean 32? Can you post your code?

Are the technical replicates of the same library? You should use collapseReplicates() if so, to only focus on the 20 biological replicates.

Thanks Michael

here are the counts before and after normalization: (five biological replicates at each 0h, 24h,48h,96h). I summed the read counts of the technical replicates before starting analysis

dds=DESeqDataSetFromHTSeqCount(sampleTable = sampleTable,
directory = "/", design= ~ condition)

counts for geneA

 0_1 0_2 0_3 0_4 0_5 24_1 24_2 24_3 24_4 24_5 48_1 48_2 48_3 48_4 48_5 96_1 96_2 96_3 96_4 96_5 2 1 5 0 1 8 1 2 0 0 31 59 32 174 50 73 226 351 198 311

for normalizing I did :

dds <- estimateSizeFactors(dds)
counts(dds, normalized=TRUE)

and I get this :

 0_1 0_2 0_3 0_4 0_5 24_1 24_2 24_3 24_4 24_5 48_1 48_2 48_3 48_4 48_5 96_1 96_2 96_3 96_4 96_5 79.51029 36.71899 156.9399 0 36.71899 36.75497 4.746411 7.827447 0 0 12.0785 9.981079 12.32481 9.75938 7.403459 6.779857 6.340876 6.307691 7.265664 7.53292

Wondering if there is a way to interpret this or if I did something wrong (most likely!)

So there was a lot more to this than you initially posted, and it's not "obviously an artifact".

There is a problem with having extreme differences in library size (3-4 orders of magnitude) confounded with condition. You can see this by plotting log10 of size factors over condition. We talk about this being a pathological case of confounding in the DESeq2 publication.

Even so, I think this gene is in fact DE given the evidence you have posted.

You have raw counts in the range of 0-3 for both groups being compared, true. But a raw count is representing 10x more expression in the condition=0 group, due to the small library size.

I do think the very low sequencing libraries almost perfectly confounded with condition could induce problems, not for this gene, but for other genes, in particular genes where the low sequencing samples have all 0's. I think if I had this data I would downsample the highest sequenced samples  just to make sure that the results are not biased by the technical confounding.

Thanks a  lot Michael for the insights I am going to look deeper at this issue.  I meant it as an artefact also because for this organism we don't expect any reads at 0h and minimal at 24h then a burst of activity at 48h and 96h so the raw read counts did seem to affirm that story...

>sizeFactors(dds)
 0_1 0_2 0_3 0_4 0_5 24_1 24_2 24_3 24_4 24_5 48_1 48_2 48_3 48_4 48_5 96_1 96_2 96_3 96_4 96_5 0.025154 0.027234 0.031859 0.052671 0.027234 0.217658 0.210686 0.255511 0.222638 0.284082 2.566543 5.911184 2.596389 17.829 6.7536 10.76719 35.64176 55.64635 27.25147 41.28545