Question

Dealing with low read count in early time point samples

0

Entering edit mode

xei6 • 0

@xei6-17328

Last seen 5.6 years ago

Hi Michael

I am somewhat new to DeSeq2 and RNA Seq analysis (still slowly diving into the depths of DeqSeq2 manuscript)

I have an experiment with 4 time points and 5 biological replicates and each biological replicate has 5 technical replicates (100 samples thus)

The first two time points are really early is the organism's life cycle and hence we often expect less reads for some genes:

After summing the reads from the technical replicates, I get this profile (OBVIOUSLY AN ARTEFACT)

0h GeneA 3 reads
0h GeneA 2 reads
0h GeneA 0 reads
0h GeneA 1 reads
0h GeneA 3 reads

24h GeneA 0 reads
24h GeneA 1 reads
24h GeneA 2 reads
24h GeneA 0 reads
24h GeneA 1 reads

But DeSeq2 analysis of DE of these genes gives me a profile like this:
baseMean log2FoldChange lfcSE stat pvalue padj
31.88739159 -4.026822896 0.943450084 -4.268188603 1.97E-05 0.007173227

How do I filter for such artefacts ? Though I give an example here, my results are severely afflicted with these kind of low read count but highly significant p value and padj.
Any suggestion is highly appreciated...

I cannot set a minreadcount filter as these genes will have higher read count as the time line progresses.

Any guidance is very much appreciated - thanks

deseq2 • 837 views

ADD COMMENT • link updated 5.6 years ago by Michael Love 41k • written 5.6 years ago by xei6 • 0

Michael Love · Answer 1 · 2018-09-13

0

Entering edit mode

Michael Love 41k

@mikelove

Last seen 44 minutes ago

United States

Can you post the plotCounts for this gene? Why is the base mean 32? Can you post your code?

Are the technical replicates of the same library? You should use collapseReplicates() if so, to only focus on the 20 biological replicates.

ADD COMMENT • link 5.6 years ago Michael Love 41k

0

Entering edit mode

Thanks Michael

here are the counts before and after normalization: (five biological replicates at each 0h, 24h,48h,96h). I summed the read counts of the technical replicates before starting analysis

dds=DESeqDataSetFromHTSeqCount(sampleTable = sampleTable, 
  directory = "/", design= ~ condition)

counts for geneA

0_1	0_2	0_3	0_4	0_5	24_1	24_2	24_3	24_4	24_5	48_1	48_2	48_3	48_4	48_5	96_1	96_2	96_3	96_4	96_5
2	1	5	0	1	8	1	2	0	0	31	59	32	174	50	73	226	351	198	311

for normalizing I did :

dds <- estimateSizeFactors(dds)
counts(dds, normalized=TRUE)

and I get this :

0_1

0_2

0_3

0_4

0_5

24_1

24_2

24_3

24_4

24_5

48_1

48_2

48_3

48_4

48_5

96_1

96_2

96_3

96_4

96_5

79.51029

36.71899

156.9399

36.71899

36.75497

4.746411

7.827447

12.0785

9.981079

12.32481

9.75938

7.403459

6.779857

6.340876

6.307691

7.265664

7.53292

Wondering if there is a way to interpret this or if I did something wrong (most likely!)

ADD REPLY • link updated 5.6 years ago by Michael Love 41k • written 5.6 years ago by xei6 • 0

0

Entering edit mode

So there was a lot more to this than you initially posted, and it's not "obviously an artifact".

There is a problem with having extreme differences in library size (3-4 orders of magnitude) confounded with condition. You can see this by plotting log10 of size factors over condition. We talk about this being a pathological case of confounding in the DESeq2 publication.

Even so, I think this gene is in fact DE given the evidence you have posted.

You have raw counts in the range of 0-3 for both groups being compared, true. But a raw count is representing 10x more expression in the condition=0 group, due to the small library size.

I do think the very low sequencing libraries almost perfectly confounded with condition could induce problems, not for this gene, but for other genes, in particular genes where the low sequencing samples have all 0's. I think if I had this data I would downsample the highest sequenced samples just to make sure that the results are not biased by the technical confounding.

ADD REPLY • link 5.6 years ago Michael Love 41k

0

Entering edit mode

Thanks a lot Michael for the insights I am going to look deeper at this issue. I meant it as an artefact also because for this organism we don't expect any reads at 0h and minimal at 24h then a burst of activity at 48h and 96h so the raw read counts did seem to affirm that story...

ADD REPLY • link 5.6 years ago xei6 • 0

0

Entering edit mode

>sizeFactors(dds)

0_1

0_2

0_3

0_4

0_5

24_1

24_2

24_3

24_4

24_5

48_1

48_2

48_3

48_4

48_5

96_1

96_2

96_3

96_4

96_5

0.025154

0.027234

0.031859

0.052671

0.027234

0.217658

0.210686

0.255511

0.222638

0.284082

2.566543

5.911184

2.596389

17.829

6.7536

10.76719

35.64176

55.64635

27.25147

41.28545

ADD REPLY • link updated 5.6 years ago by Michael Love 41k • written 5.6 years ago by xei6 • 0