Justification for collapsing technical replicates in DESeq2
@kieranrcampbell-11133
Last seen 6.7 years ago

Hi all,

Can someone enlighten me as to the justification for summing counts across technical replicates in DESeq2, especially with respect to the collapseReplicates() function?

I would have thought that the statistically correct approach is to keep every sample and add a column to the design matrix that accounts for technical replicates. That multiplies the number of samples by N for N technical replicates (doubling it for N = 2), which obviously "increases" the power and so affects the all-important p-values. On the other hand, it assumes that biologically identical replicates constitute independent samples, but I can't see how else you would adjust for large batch effects.

Thanks,

Kieran

deseq2 rnaseq replicate differential expression • 6.7k views

I moved my comment below


The second argument to collapseReplicates is the factor that you want to collapse on. Here you gave it mock/zika and it collapsed to two samples, one per group. You want to instead collapse based on donor.
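To make the difference concrete, here is a minimal plain-Python sketch of summing count columns by a grouping factor. It is not DESeq2 code: `collapse_columns`, the donor labels, and the counts are all made up for illustration, and the layout (two donors per condition, two sequencing runs per donor) is an assumption.

```python
from collections import defaultdict

def collapse_columns(counts, groupby):
    """Sum the count columns that share a label in `groupby`.

    counts:  dict mapping gene name -> list of counts, one per column
    groupby: list of labels, one per column (the factor to collapse on)
    Returns (sorted group labels, dict gene -> summed counts per label).
    """
    levels = sorted(set(groupby))
    collapsed = {}
    for gene, row in counts.items():
        sums = defaultdict(int)
        for label, value in zip(groupby, row):
            sums[label] += value
        collapsed[gene] = [sums[level] for level in levels]
    return levels, collapsed

# Hypothetical layout: 4 donors, each library sequenced in 2 runs (8 columns).
donor     = ["d1", "d1", "d2", "d2", "d3", "d3", "d4", "d4"]
condition = ["mock", "mock", "mock", "mock", "zika", "zika", "zika", "zika"]
counts = {"geneA": [10, 12, 9, 11, 30, 28, 33, 31]}

# Collapsing on condition merges everything into one column per group.
cond_levels, by_condition = collapse_columns(counts, condition)
# Collapsing on donor keeps one column per biological sample.
donor_levels, by_donor = collapse_columns(counts, donor)

print(cond_levels, by_condition["geneA"])
print(donor_levels, by_donor["geneA"])
```

Grouping on condition leaves two columns and no replicates to estimate variability from, while grouping on donor leaves four columns, one per biological sample.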


comment deleted. Sorry I did not find a way to delete my comment, since it was not a response to the question


This isn't really an "answer" to the first question. On the support forum, the form at the bottom is "Add your answer" which is really supposed to be used by people who are answering the post at the top.

Either way, see above for my reply.


And more generally, what would be the consequences of NOT collapsing technical replicates? 


Not collapsing technical replicates is not appropriate. A simple way to describe it: failing to collapse technical replicates and providing them to a DE method is "pretending" you have more independent samples than you really do. You can think of a technical replicate as just more reads from the same library of cDNA. You could take a library and split it in 2, again and again, and make many technical replicates. None of these would contain any biological variability, because they are from a single, static library of molecules.

So you can think of an idealized experiment where you have, say, 2 vs 2 biological replicates, which is very under-powered to find any significant differences in expression. But if you make many technical replicates from these by splitting the reads, and pretend these are independent samples, the DE methods will think you have very low within-group biological variability and will tend to report many genes as DE. This greatly increases the false positive rate (FPR) for the "truly null" genes.
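The inflation can be seen in a small simulation. This is only a sketch under stated assumptions: it uses a pooled two-sample t-test on log counts as a stand-in for a DE method (it is not DESeq2's model), and the expression level, the amount of biological variability, and the number of runs per library are arbitrary illustrative choices.

```python
import math
import random

N_BIO = 2      # biological replicates per group
N_TECH = 4     # sequencing runs per library (technical replicates)
N_SIMS = 2000  # simulated null genes
# Two-sided 5% critical values of Student's t distribution
CRIT_DF2 = 4.303   # 2 vs 2 collapsed samples: 2 degrees of freedom
CRIT_DF14 = 2.145  # 8 vs 8 uncollapsed runs: 14 degrees of freedom

def poisson(lam, rng):
    """Draw a Poisson count with Knuth's algorithm (fine for modest rates)."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def t_stat(a, b):
    """Pooled-variance two-sample t statistic."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    ss = sum((x - ma) ** 2 for x in a) + sum((x - mb) ** 2 for x in b)
    pooled = ss / (na + nb - 2)
    if pooled == 0:
        return 0.0
    return (ma - mb) / math.sqrt(pooled * (1 / na + 1 / nb))

rng = random.Random(1)
naive_hits = collapsed_hits = 0
for _ in range(N_SIMS):
    naive, collapsed = ([], []), ([], [])
    for group in range(2):  # both groups are drawn identically: a truly null gene
        for _sample in range(N_BIO):
            # Each library has its own expression level (biological variability).
            rate = 200 * math.exp(rng.gauss(0, 0.4))
            # Technical replicates only add Poisson sampling noise.
            runs = [poisson(rate / N_TECH, rng) for _run in range(N_TECH)]
            naive[group].extend(math.log(c + 1) for c in runs)
            collapsed[group].append(math.log(sum(runs) + 1))
    naive_hits += abs(t_stat(*naive)) > CRIT_DF14
    collapsed_hits += abs(t_stat(*collapsed)) > CRIT_DF2

naive_fpr = naive_hits / N_SIMS
collapsed_fpr = collapsed_hits / N_SIMS
print(f"runs treated as independent samples: FPR = {naive_fpr:.3f}")
print(f"runs summed per library:             FPR = {collapsed_fpr:.3f}")
```

Every gene here is null, so any call at the 5% level is a false positive. Treating the runs as samples should push the rate well above the nominal level, while the collapsed analysis should stay close to it.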

@mikelove
Last seen 5 days ago
United States

When you do differential expression across samples, the kind of variability you need to estimate is the variability across biological replicates. Treating technical replicates as extra samples therefore doesn't give you a gain in power, because they don't help you estimate the variability that goes into a test of differential expression across conditions.

Technical replicate variability is small compared to biological replicate variability and the former is well approximated by a Poisson for the large majority of genes (I've looked into SEQC technical replicates and confirmed this to myself recently). Since the technical replicates of a biological replicate aren't helping you to estimate variability across biological replicates at all, it's best to simply add them together, increasing the sequencing depth of the individual biological replicate. Increasing sequencing depth increases power for differential expression, as does increasing the number of biological replicates.
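The Poisson argument can be checked numerically. In this sketch, the rate per run, the number of runs, and the number of draws are hypothetical; it only illustrates two textbook properties: a Poisson count has variance equal to its mean, and a sum of independent Poisson counts is again Poisson with the summed rate, so its relative noise shrinks as depth grows.

```python
import math
import random

def poisson(lam, rng):
    """Draw a Poisson count with Knuth's algorithm (fine for modest rates)."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def mean_var(xs):
    """Sample mean and unbiased sample variance."""
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    return m, v

rng = random.Random(0)
RATE_PER_RUN = 50  # hypothetical expected count for one gene in one run
N_RUNS = 4         # technical replicates of the same library
N_DRAWS = 5000

# One run on its own versus the sum over all runs of the same library.
single = [poisson(RATE_PER_RUN, rng) for _ in range(N_DRAWS)]
summed = [sum(poisson(RATE_PER_RUN, rng) for _ in range(N_RUNS))
          for _ in range(N_DRAWS)]

single_mean, single_var = mean_var(single)
summed_mean, summed_var = mean_var(summed)
single_cv = math.sqrt(single_var) / single_mean
summed_cv = math.sqrt(summed_var) / summed_mean

print(f"single run: var/mean = {single_var / single_mean:.2f}, CV = {single_cv:.3f}")
print(f"summed:     var/mean = {summed_var / summed_mean:.2f}, CV = {summed_cv:.3f}")
```

A variance-to-mean ratio near 1 in both cases is what "no variation above Poisson" looks like, and the smaller coefficient of variation for the summed counts is the gain from the extra depth.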

"On the other hand you are assuming biologically-identical replicates constitute independent samples, but I can't see how you would adjust for large batch effects any other way."

I don't follow this last part, can you add a comment to my post which explains this question more?


Hi Mike,

Thanks for your answer. The last part relates to an RNA-seq dataset I'm currently working on where the batch effect dominates (i.e., technical replicate variability is large and, sadly, explains ~80% of the variance in the data...). So the follow-up questions would be (1) what's the best practice for dealing with dominating technical effects, and (2) what's the point of doing technical replicates if we throw away that information by summing over counts?

Thanks,

Kieran


In my answer above, a technical replicate means producing more sequences from the same library, and I wouldn't expect much variation above Poisson.

The point of summing is that you increase the sequencing depth for that sample, which improves power by allowing more precise measurement of gene expression, and increases the number of genes that reach a minimal read count.
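A toy example of the second point, with made-up counts and a hypothetical filtering threshold (`MIN_COUNT` is an illustrative choice, not a DESeq2 default):

```python
MIN_COUNT = 10  # hypothetical minimum read count for a gene to be usable

# Made-up counts for the same library sequenced in two runs.
run1 = {"geneA": 120, "geneB": 6, "geneC": 3, "geneD": 0}
run2 = {"geneA": 131, "geneB": 7, "geneC": 4, "geneD": 1}

# Summing the runs increases the depth of this one biological sample.
summed = {gene: run1[gene] + run2[gene] for gene in run1}

def passing(counts):
    """Genes whose count reaches the threshold, in sorted order."""
    return sorted(gene for gene, c in counts.items() if c >= MIN_COUNT)

pass_run1 = passing(run1)
pass_summed = passing(summed)
print("run 1 alone:", pass_run1)
print("runs summed:", pass_summed)
```

Here geneB falls short of the threshold in either run alone but clears it once the runs are summed.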

If you prepare a new library, I wouldn't refer to this as a technical replicate.

Regarding what to do about batches, the recommended approach is to add a term which accounts for this sample dependence into the design, e.g. ~ batch + condition. This typically improves power if there are batches.
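Why the batch term helps can be sketched with ordinary least squares on made-up log2 expression values. This is an assumption-laden illustration, not DESeq2's negative binomial GLM: the data are invented, and the shortcut used for the additive fit (batch mean + condition mean - grand mean) is only valid because the design is balanced.

```python
# (batch, condition, log2 expression) for one gene; values are made up.
samples = [
    ("A", "control", 5.0), ("A", "control", 5.2),
    ("A", "treated", 6.0), ("A", "treated", 6.2),
    ("B", "control", 8.0), ("B", "control", 8.2),
    ("B", "treated", 9.0), ("B", "treated", 9.2),
]

def mean(values):
    values = list(values)
    return sum(values) / len(values)

grand = mean(y for _, _, y in samples)
cond_mean = {c: mean(y for _, cc, y in samples if cc == c)
             for c in ("control", "treated")}
batch_mean = {b: mean(y for bb, _, y in samples if bb == b)
              for b in ("A", "B")}

# Model "~ condition": each sample is predicted by its condition mean only.
rss_condition_only = sum((y - cond_mean[c]) ** 2 for _, c, y in samples)

# Model "~ batch + condition": additive fit, valid for this balanced design.
rss_with_batch = sum(
    (y - (batch_mean[b] + cond_mean[c] - grand)) ** 2 for b, c, y in samples
)

effect = cond_mean["treated"] - cond_mean["control"]
print(f"condition effect (log2): {effect:.2f}")
print(f"residual SS without batch term: {rss_condition_only:.2f}")
print(f"residual SS with batch term:    {rss_with_batch:.2f}")
```

The estimated condition effect is the same in both models, but without the batch term the batch shift is counted as unexplained noise, so the same effect is judged against a much larger residual variance.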


Okay, I think I was just confused by the terms here and thought technical replicate and batch (i.e., independent library prep) were equivalent. It makes complete sense that if you do different sequencing runs of the same library, you just sum the counts.
