Aaron Mackey
We have a project with a somewhat spectacular design: two cell types in trans-well culture, each cell population from a human donor. In each experimental setup, we use one set of donors, treated with ~10 different perturbagens performed as a single replicate, plus a vehicle control performed in duplicate; we repeat this setup independently 5 times with 5 different sets of donors (same perturbagens). Then, in another batch, we repeat the entire process with a new set of perturbagens (again with vehicle controls, though same donors as before). At the end of the day (or months, as it may be), we have many hundreds of samples across ~8 batches, 5 sets of cell donors, and ~80 different perturbagens, and we'd of course like to test for the effects of each individual perturbagen, and compare/contrast effects across/between perturbagens.
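
For concreteness, the sample annotation implied by this design looks roughly like the following (a sketch only, in R; all column names, level names, and counts of batches/perturbagens are hypothetical and only illustrative):

one_batch <- function(batch_id, perts) {
  # 5 donor sets x (each perturbagen once) plus duplicate vehicle controls
  tab <- rbind(
    expand.grid(donor_set   = paste0("donorSet", 1:5),
                perturbagen = perts,
                stringsAsFactors = FALSE),
    data.frame(donor_set   = rep(paste0("donorSet", 1:5), each = 2),
               perturbagen = "vehicle",
               stringsAsFactors = FALSE)
  )
  tab$batch <- batch_id
  tab
}

# ~8 batches in reality, each with its own set of perturbagens; two shown here
targets <- rbind(one_batch("batch1", paste0("pert", 1:10)),
                 one_batch("batch2", paste0("pert", 11:20)))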
Using edgeR, voom, and DESeq, we have analyzed these data two ways:

1) using all of the samples, accounting for a variety of intermingled batch and donor effects, and modeling all 80 perturbagens simultaneously;

2) using only the small subset of samples relevant to a given perturbagen (i.e. the 5 treated samples and their 10 matching vehicle controls), and modeling only 4 donor effects.
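
Roughly, the two design matrices look like this (a sketch only, assuming a 'targets' data frame like the hypothetical one above, with vehicle as the reference level):

targets$perturbagen <- relevel(factor(targets$perturbagen), ref = "vehicle")

# approach #1: all samples, modeling batch, donor set, and all perturbagens
design_full <- model.matrix(~ batch + donor_set + perturbagen, data = targets)

# approach #2: one perturbagen plus its matched vehicle controls from one
# batch, modeling only the 4 donor-set coefficients and the treatment effect
keep <- targets$batch == "batch1" &
        targets$perturbagen %in% c("vehicle", "pert1")
targets_sub <- droplevels(targets[keep, ])
design_sub  <- model.matrix(~ donor_set + perturbagen, data = targets_sub)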
When we look for a subset of known/expected biological signals across well-understood perturbagens, we are surprised to find that approach #2 performs much better than #1 (both with respect to sensitivity at any given FDR, and with respect to statistical ranking, regardless of FDR), despite our prior expectation that #2 would lack power compared to #1 (which has many more df than #2, and thus much more accurate tag-level dispersion estimates). Our working hypothesis is that the additional batch corrections required for #1 introduce enough additional uncertainty that, while the same "correct" fold changes are observed by each approach, statistical significance is rarely achieved with #1.
A second concern is that our dataset as a whole violates the constant dispersion assumption inherent to all of the methods, and that by refitting each subset of the data in #2 we are in effect relaxing that assumption; if so, I wonder if it would be useful to consider using the whole-dataset dispersion estimates from #1 as priors when refitting the subsets, to regain some of the expected power.
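
In edgeR terms, what I have in mind is something like the following (a sketch only, reusing the hypothetical 'targets'/design objects from above and assuming a 'counts' matrix whose columns line up with the rows of 'targets'; simply fixing the subset fit at the whole-data tagwise dispersions is admittedly cruder than a true prior):

library(edgeR)

# whole-dataset fit: estimate tagwise dispersions using all samples
y_full <- DGEList(counts = counts)
y_full <- calcNormFactors(y_full)
y_full <- estimateDisp(y_full, design_full)

# subset fit: re-use the whole-dataset dispersions instead of re-estimating
# them from the handful of samples in the subset
y_sub <- DGEList(counts = counts[, keep])
y_sub <- calcNormFactors(y_sub)
fit   <- glmFit(y_sub, design_sub, dispersion = y_full$tagwise.dispersion)
lrt   <- glmLRT(fit, coef = "perturbagenpert1")
topTags(lrt)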
A third issue with both approaches is that the same vehicle controls are reused for multiple perturbagens within a batch, leading to a nested correlation structure not accounted for in either of the approaches; the net effect of this is that the estimated perturbagen-induced fold changes within each batch are correlated with each other, leading to difficulties when comparing responses within and between batches. I presume we could try to address this in the voom pipeline using duplicateCorrelation() to get a mixed model, but this doesn't help us with edgeR or DESeq.
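
For what it's worth, the voom-based version I was imagining looks something like this (a sketch only, reusing the hypothetical objects from above; whether donor set within batch is even the right blocking variable is itself part of the question):

library(limma)
library(edgeR)

dge <- DGEList(counts = counts)
dge <- calcNormFactors(dge)

# block on donor set within batch, so samples sharing vehicle controls
# are allowed to be correlated
block <- interaction(targets$batch, targets$donor_set)

v   <- voom(dge, design_full)
dup <- duplicateCorrelation(v, design_full, block = block)

# second pass using the estimated consensus correlation
v   <- voom(dge, design_full, block = block,
            correlation = dup$consensus.correlation)
fit <- lmFit(v, design_full, block = block,
             correlation = dup$consensus.correlation)
fit <- eBayes(fit)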
In sum, I'm sure we're just rediscovering well-known modeling pitfalls, but we'd appreciate any comments or suggestions. We are currently testing whether modifying approach #1 to include only donor-level batch coefficients will recover some accuracy, but any further ideas would be appreciated.
Thanks,
-Aaron