Question: What's the best way to use "semi" technical replicates with biological replicates?
9 weeks ago by
paul.alto40 wrote:

I have a question about how to analyze a mix of biological and semi-technical replicates.

The experiment I am analyzing consists of 3 cell lines X 3 replicates of each cell line X 2 conditions. The 3 replicates are done with the same cell line but are independently treated, processed and sequenced, so they aren't "hard" technical replicates, but they also aren't biological replicates in the way the 3 cell lines are. They correlate more highly with each other (and cluster more closely in a PCA) than with the other biological replicates (cell lines). The experiment is paired: each sample is split and treated with treatments A and B. The 3 cell lines are sequenced together in 3 groups (repl_group below).

What is the best way to analyze these data? Is a paired analysis (~condition + pair) OK? Or should I average the semi-technical replicates? How else should I account for different correlation between replicates/cell lines?

Analyzing with ~condition + pair or ~condition + cell_line yields DEGs fairly similar to those from analyzing only one replicate group, with consistent GO enrichment (but many more DEGs). However, I wonder whether using the semi-technical replicates in the same way as biological replicates is increasing type I error. It doesn't seem to be, judging by the consistent GO fold-enrichment of some interesting terms.

Thank you!

condition   cell_line   repl_group  pair
A           C1          1           c1-1
A           C2          1           c2-1
A           C3          1           c3-1
A           C1          2           c1-2
A           C2          2           c2-2
A           C3          2           c3-2
A           C1          3           c1-3
A           C2          3           c2-3
A           C3          3           c3-3
B           C1          1           c1-1
B           C2          1           c2-1
B           C3          1           c3-1
B           C1          2           c1-2
B           C2          2           c2-2
B           C3          2           c3-2
B           C1          3           c1-3
B           C2          3           c2-3
B           C3          3           c3-3
modified 9 weeks ago by Aaron Lun24k • written 9 weeks ago by paul.alto40
Answer: What's the best way to use "semi" technical replicates with biological replicates?
9 weeks ago by
Aaron Lun24k (Cambridge, United Kingdom) wrote:

Using ~condition + pair is not ideal as it conflates the inter-cell line variance (specifically, the variation in the A-B difference across cell lines) with the intra-cell line variance between your not-quite-technical replicates. The A-B differences for levels of pair derived from the same cell line will be correlated, reducing the precision with which the variance can be estimated. This probably manifests as more false positives because you trick limma into thinking the variance estimate is more precise than it really is.

Using ~condition + cell_line is suboptimal for similar reasons, in addition to the fact that it fails to consider systematic differences in expression due to repl_group (which will add even more correlations that won't be properly modelled).

If I were you, I would just average all the semi-technical replicates (assuming this is microarray data). Then the residual variance of the linear model will only capture the variation between cell lines with the correct precision - nice and simple. A more complex approach could be considered with duplicateCorrelation, but this is (i) probably unnecessary, as your nuisance factors are orthogonal to your condition of interest; and (ii) limited by the low number of unique levels for either cell_line or repl_group. (While pair would provide more levels, they aren't independent of each other, leading to more problems.)
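
A minimal sketch of the averaging step in base R, assuming log-expression values in a genes-by-samples matrix laid out like the sample table in the question. The names `samples`, `expr`, `combo` and `avg` are hypothetical and the data are simulated; limma's avearrays() should do the same job if you pass the combination labels as the ID argument.

```r
# Hypothetical log-expression matrix: 4 genes x 18 samples, ordered as in the question.
set.seed(1)
samples <- expand.grid(cell_line = c("C1", "C2", "C3"),
                       repl_group = 1:3,
                       condition = c("A", "B"),
                       stringsAsFactors = FALSE)
expr <- matrix(rnorm(4 * nrow(samples)), nrow = 4,
               dimnames = list(paste0("gene", 1:4), NULL))

# One label per cell line/condition combination (6 unique labels).
combo <- paste0(samples$cell_line, ".", samples$condition)

# Average the three semi-technical replicates within each combination.
avg <- sapply(split(seq_len(ncol(expr)), combo),
              function(idx) rowMeans(expr[, idx, drop = FALSE]))

# Paired design on the averaged data: one column per cell line/condition.
avg_info <- data.frame(cell_line = sub("\\..*", "", colnames(avg)),
                       condition = sub(".*\\.", "", colnames(avg)))
design <- model.matrix(~ cell_line + condition, data = avg_info)
```

After averaging there are 6 columns (3 cell lines x 2 conditions), so the residual variance has only 2 degrees of freedom; this is the price of only counting the cell lines as replicates.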

As an aside: in my opinion, cell lines aren't biological replicates. They're just... different things. I mean, how would one independently "sample" from the "distribution" of cell lines? Randomly pick tubes from the freezer? Certainly it would not be straightforward to account for the correlation structure between a group of "replicate" cell lines in which some cell lines are more related than others. The best analogy would likely be grabbing a mouse, cat and dog and treating them as "replicates" for the "mammalian distribution".

A better approach would be to take full advantage of your experimental design and treat each cell line as a separate entity of interest rather than as a sample from some nebulous distribution of cell lines. Something like:

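# One group per cell line/condition combination; repl_group is coded 1-3, so make it a factor.
G <- factor(paste0(cell_line, ".", condition))
design <- model.matrix(~G + factor(repl_group))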

... followed by a meta-analysis across cell lines to identify genes that are changing consistently between A and B across all three cell lines. This is actually possible here, as you've got replicates for each cell line.
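
The answer does not name a specific meta-analysis method, so the following is only one possible sketch, under the assumption that you have already tested A versus B separately within each cell line. It takes the largest p-value across cell lines (an intersection-union style test) and additionally requires the same direction of change everywhere. The matrices `pvals` and `logfc` are made-up illustrative values, not results from this experiment.

```r
# Hypothetical per-cell-line results for the A-vs-B contrast: rows are genes,
# columns are cell lines. In practice these would come from one contrast per
# cell line in the fitted model.
pvals <- cbind(C1 = c(0.001, 0.020, 0.300),
               C2 = c(0.004, 0.010, 0.001),
               C3 = c(0.002, 0.030, 0.002))
logfc <- cbind(C1 = c(1.2, -0.8, 0.1),
               C2 = c(0.9,  0.7, 1.5),
               C3 = c(1.1, -0.6, 1.4))
rownames(pvals) <- rownames(logfc) <- c("geneA", "geneB", "geneC")

# A gene is only as significant as its weakest cell line.
combined_p <- apply(pvals, 1, max)

# Require the same direction of change in every cell line.
same_sign <- abs(rowSums(sign(logfc))) == ncol(logfc)
combined_p[!same_sign] <- 1

# Benjamini-Hochberg adjustment across genes.
fdr <- p.adjust(combined_p, method = "BH")
```

Here only geneA survives: geneB changes direction between cell lines, and geneC is not significant in C1. Taking the maximum is conservative, which seems appropriate when the goal is genes that change in all three cell lines.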

modified 8 weeks ago • written 9 weeks ago by Aaron Lun24k

Thank you very much, Aaron! Extremely helpful, as usual! I should have mentioned that the experiment is RNA-seq. Would averaging samples still be a solution, or would it violate the count data assumptions? I understand your point about cell lines not being proper replicates. I have been considering the meta-analysis approach that you suggested, i.e. identifying genes that consistently change across cell lines, and I may go that way if averaging replicates is not acceptable. Again, thank you!

written 9 weeks ago by paul.alto40

You'll want to sum RNA-seq counts; averages wouldn't respect the mean-variance relationship of count data. If the semi-technical replicates for a given cell line/condition combination are of differing depth, summation may introduce hidden correlations between summed counts derived from replicates with the same pattern of differences in depth. In such cases, it would be advisable to downsample the replicates for each combination to the same depth before summation. Of course, if all combinations have the same pattern of differences in depth, then it's not a problem.
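
A rough sketch of downsampling followed by summation for a single cell line/condition combination, using base R only. It assumes binomial thinning is an acceptable way to downsample, which equalizes depth in expectation rather than exactly; the `counts` matrix is simulated and all names are hypothetical.

```r
set.seed(42)
# Hypothetical counts: 5 genes x 3 semi-technical replicates of one
# cell line/condition combination, sequenced to different depths.
counts <- cbind(rep1 = rpois(5, 200), rep2 = rpois(5, 100), rep3 = rpois(5, 50))
rownames(counts) <- paste0("gene", 1:5)

# Binomial thinning: keep each read with probability (smallest depth / own depth).
depth <- colSums(counts)
prop <- min(depth) / depth
down <- counts
for (j in seq_len(ncol(counts))) {
  down[, j] <- rbinom(nrow(counts), size = counts[, j], prob = prop[j])
}

# Sum the downsampled replicates into a single count per gene.
summed <- rowSums(down)
```

The same steps would be repeated for each of the six combinations, and the six summed columns then go into the count-based analysis in place of the original 18 libraries.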

modified 9 weeks ago • written 9 weeks ago by Aaron Lun24k

Makes sense, I'll make sure to downsample before summation. Thank you so much!

written 9 weeks ago by paul.alto40