Hi all,
I am a bioinformatics novice and trying to get a handle on both conceptual and technical issues with batch effects in RNAseq datasets and analysis. The context is this:
I've done differential expression analyses in an experiment comparing responses to a drug treatment (trt, ctrl) in two populations (A, B) using both edgeR and DESeq2. Sequences for all four treatment*population group combinations were generated simultaneously in a single run on the same machine. I now want to generate sequence data from a third population (C) and compare this new population's response in the same fashion to A and B.
I understand that an effect of batch would arise with population C because its sequences were generated after A and B on a different run, and thus that batch effect would need to be included in linear models for edgeR and DESeq2. But I'm having a difficult time figuring out how batch effects could be distinguished from population effects since it (C) is the only population sequenced on the second run. Short of resequencing samples from all three populations and treatment combinations, is there a robust approach to differentiating population effects from batch effects in this context?
Thanks,
Matthew
edited for grammar
Hi, Aaron
Thanks for the reply and input. What I didn't mention is that all samples were processed and RNA extracted simultaneously. Although I wouldn't have the ability to parse possible differences in reagents used during extractions, I can at least say that we would be able to generate sequences in all three populations starting from RNA that was extracted at the same time. But you're right -- this wouldn't preclude batch effects incurred during library prep, but we would have the ability to minimize those effects moving forward.
Thanks again for your help!
If you are going to include A or B samples in your next run, try not to use samples that were already sequenced in the first batch. This makes life difficult - see A: Batch correction when only some subjects have replicates across batches. If you've already sequenced all of the A or B samples, then make sure you remove the re-sequenced samples from the first batch when you pool all of the samples together.
Got it -- thanks, Aaron. I have the option of going all the way back to back to RNA and generating all new libraries from scratch for all three populations as a backup, but sequencing new A or B samples is a viable and more attractive option.
It helps that the RNA was extracted on the same day, though it would be better if the libraries had been prepped on the same day as well. Different run dates on the instrument is not likely to be a large source of variation if everything else happened at the same time. So resequencing the same libraries again would not help; you'd have to make the libraries new again.
I think we would have to go all the way back to RNA, anyway.
From how the OP worded the question, it seemed to me that population C has both treated and control. Is this not correct? If so, and the only comparison of interest is differences in reaction to treatment for each population, isn't this pretty much the same thing as section 3.5 in the edgeR User's Guide?
Hi, James. The distinction between the example in section 3.5 is that our samples within populations are not paired by subject. Thus, we are not comparing responses pre- and post-treatment in a repeated measures design for any of the populations. Nonetheless, the approach to specifying contrasts outlined in the example is how I've thus far made comparisons of responses to treatment. But this leads back to my initial conceptual sticking point -- if we identify genes responding differently in treatment in population C, how do we distinguish "true" effects from batch effects.
My point was to ask about this statement that Aaron made:
You might have some more luck if there were more information - for example, a control group that was present in both batches, or multiple replicate batches for each population - but it sounds like it's too late for that.
Where it seems he is thinking all you have in population C are treated samples, and from your original question I was inferring that you had both treated and control. So do you have both treated and control for population C?
Okay, I follow. Sorry for the confusion and thank you for weighing in. I go into additional detail below in response to Aaron, but we have Treated and Control for all three populations. We have thus far sequenced A and B treated and control, and are considering sequencing C treated and control. But the the inclusion of population B groups in this next sequencing run alongside C is what's under consideration.
I'll admit that I read through your question a bit too quickly the first time around. James raises a salient point, so I'll try to clarify things here and you can match it up to whatever you're trying to do.
We have three populations, two treatment groups crossed with population, and two batches. The experimental design looks like this (I'll assume two replicates in each population/treatment combination for simplicity):
Let's combine population and treatment into a single factor:
In my responses above, I had been assuming that you wanted to compare expression directly across populations, e.g.,
C.T
vsA.T
, orC.N
vsB.N
, and so on. This would not be possible with this design.However, upon re-reading your question, it seems that you want to compare treatment responses between populations, i.e.,
(C.T - C.N)
vs(A.T - A.N)
. This is possible here as any batch effect that affectsC.T
should also affectC.N
. When you subtract one from the other to compute the C-specific response, the batch effect should cancel out, allowing you to compare the responses between populations. (Assuming additivity, of course, but we would do this anyway.)So, yes, the example in Section 3.5 is somewhat analogous. Perhaps the similarity would be clearer if you replaced "patient" in that example with "batch" - their within-patient disease-specific response to hormone treatment is the same as your within-batch population-specific response to treatment.
Ahh, okay. I follow -- thank you, Aaron (and James) for clarifying. I have framed the question up to this point as wanting to compare responses between population (C.T - C.N) versus (A.T - A.N), but we DO also care about comparing C.N to A.N and B.N. I assumed specifying all of the questions I wanted to address would get a little far into the weeds, but I see now that is very relevant, and I apologize for my original post being misleading.
To put it plainly, the contrasts we care about are:
1.) Baseline contrasts between populations:
(A.N) vs. (B.N) vs. (C.N)
2.) Population-specific responses to treatment (using presence/absence of DEGs to identify both shared and population-specific gene responses)
[(A.N) vs (A.T)]; [(B.N) vs (B.T)]; [(C.N) vs (C.T)]
3.) Population*treatment interactions (interactive effects identify different directions or magnitude of responses to treatment within shared gene responses)
(A.N - A.T) vs (B.N - B.T) vs (C.N - C.T)
Population C (both C.N and C.T) is being included after the fact because our initial comparisons (A versus B) were exploratory and returned a (surprisingly) large number of DEGs. Because of our experimental approach, population C represents an opportunity to explore possible mechanisms behind those DEGs experimentally. If we had found very few differences between A and B, we likely wouldn't have considered this in the first place.
And if I understand yours and James' comment correctly, if we sequence groups C.N and C.T only, we will be able to make comparisons (2) and (3), because these will incorporate the N versus T contrast, thus handling possible batch effects. But we will be precluded from making comparison (1) because batch will completely confound the population effect. In contrast, if we sequence both C.N and C.T, as well as B.N and B.T, we can make the same baseline comparisons between C.N and B.N as outlined above, but would be precluded from comparing A.N and C.N because of batch/population confounding. However, if we follow Aaron's earlier suggestion of estimating batch effects from our original B.N sequencing and new B.N sequencing, we might be able to make the comparison between A.N and C.N directly (if we're careful with our sample selection).
Just to be absolutely clear - I say "re-sequence", but I really mean "do everything at the same time for a given batch". Do not do something like "resequence A's existing libraries but for C perform library preparation and sequencing". This will result in hierarchical batch effects that are even more difficult to deal with. Rather, generate new libraries for everything that you plan to re-sequence. (Of course, we're assuming that the original RNA extractions were done at the same time, otherwise this would be an unavoidable batch effect in and of itself.)
Got it! Thanks again, Aaron. Yes -- we would go all the way back to RNA, generating new libraries, and sequencing new libraries de novo for C.N, C.T, B.N, and B.T.
Thank you!