Search
Question: Dealing with low read count in early time point samples
0
gravatar for xei6
10 days ago by
xei60
xei60 wrote:

Hi Michael

I am somewhat new to DeSeq2 and RNA Seq analysis (still slowly diving into the depths of DeqSeq2 manuscript)

I have an experiment with 4 time points and 5 biological replicates and each biological replicate has 5 technical replicates (100 samples thus)

The first two time points are really early is the organism's life cycle and hence we often expect less reads for some genes:

After summing the reads from the technical replicates, I get this profile (OBVIOUSLY AN ARTEFACT)


0h GeneA 3 reads
0h GeneA 2 reads
0h GeneA 0 reads
0h GeneA 1 reads
0h GeneA 3 reads

24h GeneA 0 reads
24h GeneA 1 reads
24h GeneA 2 reads
24h GeneA 0 reads
24h GeneA 1 reads

But DeSeq2 analysis of DE of these genes gives me a profile like this:
baseMean log2FoldChange lfcSE stat pvalue padj
31.88739159 -4.026822896 0.943450084 -4.268188603 1.97E-05 0.007173227

How do I filter for such artefacts ? Though I give an example here, my results are severely afflicted with these kind of low read count but highly significant p value and padj.
Any suggestion is highly appreciated...

I cannot set a minreadcount filter as these genes will have higher read count as the time line progresses.

Any guidance is very much appreciated - thanks

ADD COMMENTlink modified 9 days ago by Michael Love19k • written 10 days ago by xei60
0
gravatar for Michael Love
10 days ago by
Michael Love19k
United States
Michael Love19k wrote:

Can you post the plotCounts for this gene? Why is the base mean 32? Can you post your code?

Are the technical replicates of the same library? You should use collapseReplicates() if so, to only focus on the 20 biological replicates.

ADD COMMENTlink written 10 days ago by Michael Love19k

Thanks Michael

here are the counts before and after normalization: (five biological replicates at each 0h, 24h,48h,96h). I summed the read counts of the technical replicates before starting analysis

dds=DESeqDataSetFromHTSeqCount(sampleTable = sampleTable, 
  directory = "/", design= ~ condition)

counts for geneA

0_1
0_2
0_3
0_4
0_5
24_1
24_2
24_3
24_4
24_5
48_1
48_2
48_3
48_4
48_5
96_1
96_2
96_3
96_4
96_5
2
1
5
0
1
8
1
2
0
0
31
59
32
174
50
73
226
351
198
311

for normalizing I did :

dds <- estimateSizeFactors(dds)
counts(dds, normalized=TRUE)

and I get this :

0_1
0_2
0_3
0_4
0_5
24_1
24_2
24_3
24_4
24_5
48_1
48_2
48_3
48_4
48_5
96_1
96_2
96_3
96_4
96_5
79.51029
36.71899
156.9399
0
36.71899
36.75497
4.746411
7.827447
0
0
12.0785
9.981079
12.32481
9.75938
7.403459
6.779857
6.340876
6.307691
7.265664
7.53292

Wondering if there is a way to interpret this or if I did something wrong (most likely!)

 

 

ADD REPLYlink modified 9 days ago by Michael Love19k • written 10 days ago by xei60

So there was a lot more to this than you initially posted, and it's not "obviously an artifact".

There is a problem with having extreme differences in library size (3-4 orders of magnitude) confounded with condition. You can see this by plotting log10 of size factors over condition. We talk about this being a pathological case of confounding in the DESeq2 publication.

Even so, I think this gene is in fact DE given the evidence you have posted.

You have raw counts in the range of 0-3 for both groups being compared, true. But a raw count is representing 10x more expression in the condition=0 group, due to the small library size. 

I do think the very low sequencing libraries almost perfectly confounded with condition could induce problems, not for this gene, but for other genes, in particular genes where the low sequencing samples have all 0's. I think if I had this data I would downsample the highest sequenced samples  just to make sure that the results are not biased by the technical confounding.

ADD REPLYlink modified 9 days ago • written 9 days ago by Michael Love19k

Thanks a  lot Michael for the insights I am going to look deeper at this issue.  I meant it as an artefact also because for this organism we don't expect any reads at 0h and minimal at 24h then a burst of activity at 48h and 96h so the raw read counts did seem to affirm that story...

ADD REPLYlink written 7 days ago by xei60
>sizeFactors(dds)
0_1
0_2
0_3
0_4
0_5
24_1
24_2
24_3
24_4
24_5
48_1
48_2
48_3
48_4
48_5
96_1
96_2
96_3
96_4
96_5
0.025154
0.027234
0.031859
0.052671
0.027234
0.217658
0.210686
0.255511
0.222638
0.284082
2.566543
5.911184
2.596389
17.829
6.7536
10.76719
35.64176
55.64635
27.25147
41.28545

 

ADD REPLYlink modified 9 days ago by Michael Love19k • written 10 days ago by xei60
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 307 users visited in the last hour