Question

What's the best way to use "semi" technical replicates with biological replicates?

paul.alto ▴ 50

I have a question about how to analyze a mix of biological and semi-technical replicates.

The experiment I am analyzing consists of 3 cell lines X 3 replicates of each cell line X 2 conditions. The 3 replicates come from the same cell line but were independently treated, processed and sequenced, so they aren't "hard" technical replicates, but they also aren't biological replicates in the way the 3 cell lines are. They correlate more strongly with each other (cluster more closely in a PCA) than with the other biological replicates (cell lines). The experiment is paired: each sample is split and treated with treatments A and B. The 3 cell lines are sequenced together in 3 groups (repl_group below).

What is the best way to analyze these data? Is a paired analysis (~condition + pair) OK? Or should I average the semi-technical replicates? How else should I account for different correlation between replicates/cell lines?

Analyzing with ~condition + pair or ~condition + cell_line yields DEGs fairly similar to those from analyzing only one replicate group, with consistent GO enrichment (but many more DEGs). However, I wonder whether using the semi-technical replicates in the same way as biological replicates is increasing the type I error rate. It doesn't seem to be, judging by the consistent GO fold-enrichment of some interesting terms.

Thank you!

condition   cell_line   repl_group  pair
A           C1          1           c1-1
A           C2          1           c2-1
A           C3          1           c3-1
A           C1          2           c1-2
A           C2          2           c2-2
A           C3          2           c3-2
A           C1          3           c1-3
A           C2          3           c2-3
A           C3          3           c3-3
B           C1          1           c1-1
B           C2          1           c2-1
B           C3          1           c3-1
B           C1          2           c1-2
B           C2          2           c2-2
B           C3          2           c3-2
B           C1          3           c1-3
B           C2          3           c2-3
B           C3          3           c3-3

limma technical replicates biological replicates paired analysis • 2.3k views

updated 6.7 years ago by Aaron Lun ★ 29k • written 6.7 years ago by paul.alto ▴ 50

score 1 · Answer 1 · 2019-04-12

Using ~condition + pair is not ideal as it conflates the inter-cell line variance (specifically, the variation in the A-B difference across cell lines) with the intra-cell line variance between your not-quite-technical replicates. The A-B differences for levels of pair derived from the same cell line will be correlated, reducing the precision with which the variance can be estimated. This probably manifests as more false positives because you trick limma into thinking the variance is more precise than it really is.

Using ~condition + cell_line is suboptimal for similar reasons, in addition to the fact that it fails to consider systematic differences in expression due to repl_group (which will add even more correlations that won't be properly modelled).
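To make the point about overstated precision concrete, here is a minimal base-R sketch that rebuilds the sample sheet from the question and counts the residual degrees of freedom each design claims. The helper `resid_df` is a hypothetical name introduced for illustration, and residual d.f. is only a rough proxy for how much independent information the model thinks it has.

```r
# Sample sheet from the question: 3 cell lines x 3 replicate groups x 2 conditions.
targets <- expand.grid(cell_line = c("C1", "C2", "C3"),
                       repl_group = 1:3,
                       condition = c("A", "B"),
                       stringsAsFactors = FALSE)
targets$pair <- paste0(tolower(targets$cell_line), "-", targets$repl_group)

# Hypothetical helper: residual d.f. = number of samples minus design rank.
resid_df <- function(formula, data) {
  X <- model.matrix(formula, data)
  nrow(X) - qr(X)$rank
}

df_pair <- resid_df(~condition + pair, targets)       # 18 samples - 10 coefficients
df_line <- resid_df(~condition + cell_line, targets)  # 18 samples - 4 coefficients
print(c(pair = df_pair, cell_line = df_line))
```

Both designs treat all 18 libraries as independent observations, even though there are only three cell lines behind them.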

If I were you, I would just average all the semi-technical replicates (assuming this is microarray data). Then the residual variance of the linear model will only capture the variation between cell lines with the correct precision - nice and simple. A more complex approach could be considered with duplicateCorrelation, but this is (i) probably unnecessary, as your nuisance factors are orthogonal to your condition of interest; and (ii) limited by the low number of unique levels for either cell_line or repl_group. (While pair would provide more levels, they aren't independent of each other, leading to more problems.)
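The averaging step could look roughly like the base-R sketch below. The matrix `y` is random placeholder data standing in for log-expression values, and `id`, `y.avg`, `avg.targets` and `design.avg` are names made up for this illustration. I believe limma's avearrays() does the same averaging given an ID vector, but check its help page before relying on that.

```r
# Sample sheet from the question: 3 cell lines x 3 replicate groups x 2 conditions.
targets <- expand.grid(cell_line = c("C1", "C2", "C3"),
                       repl_group = 1:3,
                       condition = c("A", "B"),
                       stringsAsFactors = FALSE)

# Placeholder log-expression matrix (random noise); substitute real data.
set.seed(1)
y <- matrix(rnorm(5 * nrow(targets)), nrow = 5,
            dimnames = list(paste0("gene", 1:5), NULL))

# Average the three semi-technical replicates within each cell line/condition.
id <- paste0(targets$cell_line, ".", targets$condition)
y.avg <- sapply(split(seq_len(ncol(y)), id),
                function(j) rowMeans(y[, j, drop = FALSE]))

# Six averaged samples remain, analysed as a design paired by cell line.
avg.targets <- data.frame(cell_line = sub("\\..*$", "", colnames(y.avg)),
                          condition = sub("^.*\\.", "", colnames(y.avg)))
design.avg <- model.matrix(~condition + cell_line, avg.targets)
```

Because the layout is balanced, every average contains one sample from each repl_group, so the batch effect cancels in the A-B comparison. The price is that only 2 residual d.f. are left (6 samples, 4 coefficients).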

As an aside: in my opinion, cell lines aren't biological replicates. They're just... different things. I mean, how would one independently "sample" from the "distribution" of cell lines? Randomly pick tubes from the freezer? Certainly it would not be straightforward to account for the correlation structure between a group of "replicate" cell lines in which some cell lines are more related than others. The best analogy would likely be grabbing a mouse, cat and dog and treating them as "replicates" for the "mammalian distribution".

A better approach would be to take full advantage of your experimental design and treat each cell line as a separate entity of interest rather than as a sample from some nebulous distribution of cell lines. Something like:

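G <- factor(paste0(cell_line, ".", condition))  # one group per cell line/condition combination
design <- model.matrix(~G + factor(repl_group))  # factor() so the batch is not fitted as a numeric trend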

... followed by a meta-analysis across cell lines to identify genes that are changing consistently between A and B across all three cell lines. This is actually possible here, as you've got replicates for each cell line.
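One way this could be put together with limma is sketched below. It uses a no-intercept variant of the design so that each cell line/condition group gets its own coefficient. The answer does not say which meta-analysis to use, so the intersection-union style summary (largest p-value across cell lines plus a same-sign check) is my assumption, as are the random placeholder matrix `y`, the 0.05 cutoff and the names `batch`, `con`, `p.max` and `consistent`.

```r
library(limma)

# Sample sheet from the question: 3 cell lines x 3 replicate groups x 2 conditions.
targets <- expand.grid(cell_line = c("C1", "C2", "C3"),
                       repl_group = 1:3,
                       condition = c("A", "B"),
                       stringsAsFactors = FALSE)

# Placeholder log-expression matrix (random noise); substitute real data.
set.seed(1)
y <- matrix(rnorm(200 * nrow(targets)), nrow = 200,
            dimnames = list(paste0("gene", 1:200), NULL))

# One coefficient per cell line/condition group, plus a blocking term for batch.
G <- factor(paste0(targets$cell_line, ".", targets$condition))
batch <- factor(targets$repl_group)
design <- model.matrix(~0 + G + batch)
colnames(design) <- sub("^G", "", colnames(design))

# B-versus-A contrast within each cell line.
fit <- lmFit(y, design)
con <- makeContrasts(C1 = C1.B - C1.A,
                     C2 = C2.B - C2.A,
                     C3 = C3.B - C3.A,
                     levels = design)
fit2 <- eBayes(contrasts.fit(fit, con))

# Assumed summary: a gene is "consistent" only if it changes in the same
# direction in all three cell lines and its largest p-value is still small.
p.max <- apply(fit2$p.value, 1, max)
same.sign <- abs(rowSums(sign(fit2$coefficients))) == ncol(con)
consistent <- same.sign & p.adjust(p.max, method = "BH") <= 0.05
```

Taking the largest p-value is conservative. Other ways of combining the per-line results would answer slightly different questions, such as "changed in at least one cell line".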