Question

Further clarification on when not to use duplicateCorrelation with technical replicates (RNA-seq)

1

Entering edit mode

paul.alto ▴ 50

@paulalto-11559

Last seen 5.5 years ago

After reading the limma manual and paper and several posts about using duplicateCorrelation with technical replicates mixed with biological replicates, I am still unsure when to use it (and why not use it).

The 2015 limma paper says about duplicateCorrelation: "More generally, the same idea is also used to model the correlation between related RNA samples, for example repeated measures on the same individual or RNA samples collected at the same time."

The duplicateCorrelation help in the limma R package says: Estimate the correlation between duplicate spots (regularly spaced replicate spots on the same array) or between technical replicates from a series of arrays.

However, several posts here

suggest not using duplicateCorrelation in the designs proposed and instead pooling technical replicates.

In this thread ( https://support.bioconductor.org/p/86867 ), Aaron Lun says that duplicateCorrelation "does better when you have samples across a large number of levels of the blocking factor". So should duplicateCorrelation be used when mixing biological and technical replicates, but only when there is a minimum number of samples/replicates/levels? If so, what are the minimums that should be observed?

Thank you in advance.

limma duplicateCorrelation rna-seq technical replicates biological replicates • 7.0k views

ADD COMMENT • link updated 5.5 years ago by Aaron Lun ★ 28k • written 5.5 years ago by paul.alto ▴ 50

0

Entering edit mode

FYI by definition biological samples from different individuals/subjects are also biological replicates. If you have e.g. multiple biological samples per subject then that is a repeated measures design that you would use duplicateCorrelation on. A repeated measures design is any where you have multiple correlated biological samples per higher-level biological unit.

ADD REPLY • link 5.5 years ago hermidalc ▴ 20

score 6 · Answer 1 · 2019-10-14

As Gordon suggests, the diversity of possible designs makes it difficult to suggest a hard-and-fast rule. Nonetheless, here are some thoughts:

Technical replicates: If these are generated by literally sequencing the same sample multiple times (e.g., on different lanes), just add them together and treat the resulting sum as a single sample.

Not-quite-technical replicates: These are usually things like "we took multiple samples from the same donor", so they're not fully fledged biological replicates but they aren't totally technical either. In most cases, I would just add them together and move on because I don't care about capturing the variability within levels of the blocking factor. For example, if biopsies are variable within a patient but the average expression across multiple biopsies is consistent across patients, then the latter is all I care about. ~~On the other hand, if I did expect the repeated samples to be similar, I would want to penalize genes that exhibit variation between them, so I'd like to capture that variation with duplicateCorrelation.~~ (Update: see comment below.)

Also, when adding, it is better that each repeated sample contributes evenly to the sum for a particular blocking level; this gives you a more stable sum and thus lower across-level variance. It may also be wise to use voomWithQualityWeights to adjust for differences in the number of repeated samples per donor.

Repeated samples with different uninteresting predictors: This refers to situations where repeated samples do not have the same set of predictors in the design matrix, e.g., because some repeated samples were processed in a different batch. If the repeated samples for each blocking level have the same pattern of values for those predictors (e.g., each blocking level has one repeated sample in each of three batches), summation is still possible. However, in general, this is not the case and then duplicateCorrelation must be used.

Repeated samples with different interesting predictors: This refers to situations where repeated samples do not have the same set of predictors in the design matrix, because those predictors are interesting and their effects are to be tested. The archetypical example would be to collect samples before and after treatment for each patient. Here, we can either use duplicateCorrelation or we can block on the uninteresting factors in the design matrix. I prefer the latter as it avoids a few assumptions of the former, namely that all genes have the same consensus correlation. (There's also an assumption about the distribution of the random effect, but I can't remember what it was - maybe normal i.i.d.) However, duplicateCorrelation is more general and is the only solution when you want to compare across blocking levels, e.g., comparing diseased and healthy donors when each donor also contributes before/after treatment samples.

score 1 · Answer 2 · 2019-10-09

So should duplicateCorrelation be used when mixing biological and technical replicates, but only when there is a minimum number of samples/replicates/levels?

Sure, treating factor effects as random often makes more sense when the number of levels is larger, but there is no minimum number. You can apply duplicateCorrelation with only two blocks, and there examples of this in the User's Guide.

Judging from your previous question that Aaron answered, you don't actually have technical replicates at all. If you really did have pure technical replicates (sequencing the same RNA samples twice) then you would normally just sum the counts using edgeR::sumTechReps. There is an infinite variety of designs and an infinite spectrum of "semi" technical replicates that may be strongly or weakly correlated, so it is impossible to give a universal rule that covers all cases. When we advised against duplicateCorrelation in previous posts there was always an alternative, and we gave a reason for choosing the alternative.