I have a question about how to analyze a mix of biological and semi-technical replicates.
The experiment I am analyzing consists of 3 cell lines x 3 replicates of each cell line x 2 conditions. The 3 replicates come from the same cell line but were independently treated, processed and sequenced, so they aren't "hard" technical replicates, but they are also not biological replicates in the way the 3 cell lines are. They correlate more strongly with each other (cluster more closely in a PCA) than with the other biological replicates (cell lines). The experiment is paired: each sample is split and treated with treatments A and B. The 3 cell lines are sequenced together in each of 3 groups (repl_group in the table below).
What is the best way to analyze these data? Is a paired analysis (~condition + pair) OK? Or should I average the semi-technical replicates? How else should I account for different correlation between replicates/cell lines?
Analyzing with ~condition + pair or ~condition + cell_line yields DEGs fairly similar to those from analyzing only one replicate group, with consistent GO enrichment (but many more DEGs). However, I wonder whether treating the semi-technical replicates the same way as the biological replicates is inflating the type I error. It doesn't seem to be, judging by the consistent GO fold-enrichment of some interesting terms.
Thank you!
condition cell_line repl_group pair
A C1 1 c1-1
A C2 1 c2-1
A C3 1 c3-1
A C1 2 c1-2
A C2 2 c2-2
A C3 2 c3-2
A C1 3 c1-3
A C2 3 c2-3
A C3 3 c3-3
B C1 1 c1-1
B C2 1 c2-1
B C3 1 c3-1
B C1 2 c1-2
B C2 2 c2-2
B C3 2 c3-2
B C1 3 c1-3
B C2 3 c2-3
B C3 3 c3-3
Thank you very much, Aaron! Extremely helpful, as usual! I should have mentioned that the experiment is RNA-seq. Would averaging samples still be a solution or would it violate the count data assumptions? I understand your point about cell lines not being proper replicates. I have been considering the meta-analysis approach that you suggested, i.e. identifying genes that consistently change across cell lines and I may go that way if averaging replicates is not acceptable. Again, thank you!
You'll want to sum RNA-seq counts; averages wouldn't respect the mean-variance relationship of count data. If the semi-technical replicates for a given cell line/condition combination are of differing depth, summation may introduce hidden correlations between summed counts derived from replicates with the same pattern of differences in depth. In such cases, it would be advisable to downsample the replicates for each combination to the same depth before summation. Of course, if all combinations have the same pattern of differences in depth, then it's not a problem.
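As a rough illustration of the downsample-then-sum idea, here is a minimal Python sketch using only the standard library. The counts are made up, the function names `downsample` and `downsample_and_sum` are hypothetical, and it assumes that "the same depth" means the depth of the shallowest replicate within a cell line/condition combination. Reads are drawn without replacement, so each thinned library has exactly the target depth.

```python
import random
from collections import Counter

def downsample(counts, target_depth, rng):
    """Subsample one library to exactly target_depth reads, without replacement."""
    # Expand per-gene counts into one entry per read (only sensible for toy data).
    reads = [gene for gene, n in enumerate(counts) for _ in range(n)]
    kept = Counter(rng.sample(reads, target_depth))
    return [kept.get(gene, 0) for gene in range(len(counts))]

def downsample_and_sum(replicates, rng):
    """Downsample every replicate to the shallowest depth, then sum per gene."""
    target = min(sum(rep) for rep in replicates)
    thinned = [downsample(rep, target, rng) for rep in replicates]
    return [sum(gene_counts) for gene_counts in zip(*thinned)]

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical counts for 4 genes in the 3 semi-technical replicates
# of a single cell line/condition combination.
replicates = [
    [100, 50, 30, 20],   # depth 200
    [300, 150, 90, 60],  # depth 600
    [150, 80, 40, 30],   # depth 300
]

summed = downsample_and_sum(replicates, rng)
print(summed)  # one summed count per gene; total depth is 3 * 200 = 600
```

Expanding every read into a list is only practical for small examples. For real libraries with millions of reads you would want a sampler that works directly on the count vector, but the logic (equalize depth within the combination, then sum) stays the same.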
Makes sense, I'll make sure to downsample before summation. Thank you so much!