What's the best way to use "semi" technical replicates with biological replicates?
paul.alto ▴ 50
@paulalto-11559

I have a question about how to analyze a mix of biological and semi-technical replicates.

The experiment I am analyzing consists of 3 cell lines x 3 replicates of each cell line x 2 conditions. The 3 replicates come from the same cell line but were independently treated, processed and sequenced, so they aren't "hard" technical replicates, but neither are they biological replicates in the way the 3 cell lines are. They correlate more strongly with each other (cluster more closely in a PCA) than with the other biological replicates (cell lines). The experiment is paired: each sample is split and treated with treatments A and B. The 3 cell lines are sequenced together in each of 3 groups (repl_group below).

What is the best way to analyze these data? Is a paired analysis (~condition + pair) OK? Or should I average the semi-technical replicates? How else should I account for different correlation between replicates/cell lines?

Analyzing with ~condition + pair or ~condition + cell_line yields DEGs fairly similar to those from analyzing only one replicate group, with consistent GO enrichment (but many more DEGs). However, I wonder whether using the semi-technical replicates in the same way as the biological replicates is inflating the type I error. It doesn't seem to be, judging by the consistent GO fold-enrichment of some interesting terms.

Thank you!

condition   cell_line   repl_group  pair
A           C1          1           c1-1
A           C2          1           c2-1
A           C3          1           c3-1
A           C1          2           c1-2
A           C2          2           c2-2
A           C3          2           c3-2
A           C1          3           c1-3
A           C2          3           c2-3
A           C3          3           c3-3
B           C1          1           c1-1
B           C2          1           c2-1
B           C3          1           c3-1
B           C1          2           c1-2
B           C2          2           c2-2
B           C3          2           c3-2
B           C1          3           c1-3
B           C2          3           c2-3
B           C3          3           c3-3

Aaron Lun ★ 26k
@alun

Using ~condition + pair is not ideal as it conflates the inter-cell line variance (specifically, the variation in the A-B difference across cell lines) with the intra-cell line variance between your not-quite-technical replicates. The A-B differences for levels of pair derived from the same cell line will be correlated, reducing the precision with which the variance can be estimated. This probably manifests as more false positives because you trick limma into thinking the variance is more precise than it really is.

Using ~condition + cell_line is suboptimal for similar reasons, in addition to the fact that it fails to consider systematic differences in expression due to repl_group (which will add even more correlations that won't be properly modelled).

If I were you, I would just average all the semi-technical replicates (assuming this is microarray data). Then the residual variance of the linear model will only capture the variation between cell lines with the correct precision - nice and simple. A more complex approach could be considered with duplicateCorrelation, but this is (i) probably unnecessary, as your nuisance factors are orthogonal to your condition of interest; and (ii) limited by the low number of unique levels for either cell_line or repl_group. (While pair would provide more levels, they aren't independent of each other, leading to more problems.)
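A minimal sketch of the averaging step, assuming log-expression values in a genes-by-samples matrix. The object names (targets, logexpr, averaged) and the simulated values are hypothetical, and the paired design fitted to the averaged profiles (~cell_line + condition) is an assumption about what would follow, not something stated above.

```r
# Hypothetical sample sheet matching the layout in the question.
targets <- expand.grid(cell_line = c("C1", "C2", "C3"),
                       repl_group = 1:3,
                       condition = c("A", "B"),
                       stringsAsFactors = FALSE)

# Simulated log-expression matrix (genes x samples), for illustration only.
set.seed(1)
logexpr <- matrix(rnorm(100 * nrow(targets)), nrow = 100,
                  dimnames = list(paste0("gene", 1:100), NULL))

# Average the three semi-technical replicates of each cell line/condition.
combo <- paste0(targets$cell_line, ".", targets$condition)
sums <- t(rowsum(t(logexpr), group = combo))
n_per_combo <- table(combo)[colnames(sums)]
averaged <- sweep(sums, 2, as.numeric(n_per_combo), "/")

# Sample sheet for the six averaged profiles, in the same column order.
avg_targets <- unique(targets[, c("cell_line", "condition")])
avg_key <- paste0(avg_targets$cell_line, ".", avg_targets$condition)
avg_targets <- avg_targets[match(colnames(averaged), avg_key), ]

# Paired design (assumed): cell line as the blocking factor, condition of interest.
design <- model.matrix(~cell_line + condition, data = avg_targets)
```

With only six averaged profiles and four coefficients, this leaves two residual degrees of freedom per gene, so the variance estimate reflects cell line-to-cell line variation in the A-B difference only.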

As an aside: in my opinion, cell lines aren't biological replicates. They're just... different things. I mean, how would one independently "sample" from the "distribution" of cell lines? Randomly pick tubes from the freezer? Certainly it would not be straightforward to account for the correlation structure between a group of "replicate" cell lines in which some cell lines are more related than others. The best analogy would likely be grabbing a mouse, cat and dog and treating them as "replicates" for the "mammalian distribution".

A better approach would be to take full advantage of your experimental design and treat each cell line as a separate entity of interest rather than as a sample from some nebulous distribution of cell lines. Something like:

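# One group per cell line/condition combination.
G <- factor(paste0(cell_line, ".", condition))
# repl_group is coded 1-3, so make it a factor to block on it rather than fit a slope.
design <- model.matrix(~G + factor(repl_group))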


... followed by a meta-analysis across cell lines to identify genes that are changing consistently between A and B across all three cell lines. This is actually possible here, as you've got replicates for each cell line.
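To make the per-cell-line comparisons explicit, here is a rough sketch in base R. It uses a no-intercept version of the design (~0 + G + repl_group) so that each contrast is easy to read; that reparametrization, the object names, and the combine_iut helper are my own assumptions. The combining rule shown (largest p-value across cell lines plus a same-sign requirement, i.e. an intersection-union test) is only one possible way to do the meta-analysis, not necessarily the one intended above.

```r
# Hypothetical sample sheet matching the layout in the question.
targets <- expand.grid(cell_line = c("C1", "C2", "C3"),
                       repl_group = factor(1:3),
                       condition = c("A", "B"),
                       stringsAsFactors = FALSE)

G <- factor(paste0(targets$cell_line, ".", targets$condition))
repl_group <- targets$repl_group

# No-intercept form: one coefficient per cell line/condition group,
# plus blocking terms for repl_group 2 and 3.
design <- model.matrix(~0 + G + repl_group)
colnames(design) <- sub("^G", "", colnames(design))

# One B-vs-A contrast per cell line.
cell_lines <- c("C1", "C2", "C3")
contrasts <- sapply(cell_lines, function(cl) {
  v <- setNames(numeric(ncol(design)), colnames(design))
  v[paste0(cl, ".B")] <- 1
  v[paste0(cl, ".A")] <- -1
  v
})

# Assumed meta-analysis rule (intersection-union test): a gene counts as
# consistently changing if its largest per-cell-line p-value is small and
# the log-fold change has the same sign in every cell line.
# pvals and logfc are genes x cell lines matrices from the three contrasts.
combine_iut <- function(pvals, logfc) {
  same_sign <- abs(rowSums(sign(logfc))) == ncol(logfc)
  data.frame(p_max = apply(pvals, 1, max), same_sign = same_sign)
}
```

The contrasts matrix has one column per cell line and could be handed to whatever linear modelling framework is in use; the per-contrast p-values and log-fold changes would then go into combine_iut.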


Thank you very much, Aaron! Extremely helpful, as usual! I should have mentioned that the experiment is RNA-seq. Would averaging samples still be a solution or would it violate the count data assumptions? I understand your point about cell lines not being proper replicates. I have been considering the meta-analysis approach that you suggested, i.e. identifying genes that consistently change across cell lines and I may go that way if averaging replicates is not acceptable. Again, thank you!


You'll want to sum RNA-seq counts; averages wouldn't respect the mean-variance relationship of count data. If the semi-technical replicates for a given cell line/condition combination are of differing depth, summation may introduce hidden correlations between summed counts derived from replicates with the same pattern of differences in depth. In such cases, it would be advisable to downsample the replicates for each combination to the same depth before summation. Of course, if all combinations have the same pattern of differences in depth, then it's not a problem.
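A minimal sketch of downsampling followed by summation, in base R. The counts are simulated and all object names (targets, counts, downsample_to_min, summed) are hypothetical. Downsampling is done here by binomial thinning, which is an assumption on my part: it equalizes the expected depth within each cell line/condition combination rather than the exact library size.

```r
# Hypothetical sample sheet matching the layout in the question.
targets <- expand.grid(cell_line = c("C1", "C2", "C3"),
                       repl_group = 1:3,
                       condition = c("A", "B"),
                       stringsAsFactors = FALSE)

# Simulated count matrix (genes x samples), for illustration only.
set.seed(42)
counts <- matrix(rnbinom(200 * nrow(targets), mu = 50, size = 10), nrow = 200,
                 dimnames = list(paste0("gene", 1:200), NULL))
combo <- paste0(targets$cell_line, ".", targets$condition)

# Thin each library so that its expected depth equals the smallest
# library size within its cell line/condition combination.
downsample_to_min <- function(counts, group) {
  lib_size <- colSums(counts)
  target <- ave(lib_size, group, FUN = min)
  prop <- target / lib_size
  out <- counts
  for (j in seq_len(ncol(counts))) {
    out[, j] <- rbinom(nrow(counts), size = counts[, j], prob = prop[j])
  }
  out
}

down <- downsample_to_min(counts, combo)

# Sum the downsampled replicates within each combination (genes x 6 matrix).
summed <- t(rowsum(t(down), group = combo))
```

The summed matrix has one column per cell line/condition combination and can be analyzed as ordinary counts with a paired design.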


Makes sense, I'll make sure to downsample before summation. Thank you so much!