Hi,
I'm currently exploring methods for batch correction of an RNA-seq dataset from a series of clinical samples. The goal is to remove batch effects for PCA and other visualization approaches, and to generate input for downstream analytical pipelines that do not include internal methods for batch correction.
Our data consist of time courses from individual patients. To avoid batch effects when examining within-patient gene expression over time, we did not split any patient's samples across batches. We knew this design would make comparing samples across patients challenging, so we included 3 samples (each from a single time point of a different patient) that we sequenced as part of every batch.
When examining just the 3 samples repeated across our 5 batches, we see patients separating predominantly on PC1 and batch separating on PC2. The "removeBatchEffect()" function in limma appears to correct for batch effect very well, with the 5 replicates of each repeated sample now clustering very close together and no separation by batch detected in the first 10 PCs.
What I would like to do is use limma to batch-correct all samples, but estimate the batch effect using only the 3 repeated samples (since there are certain to be patient-to-patient differences between batches).
- Based on Aaron Lun's answer to a similar question (https://support.bioconductor.org/p/91704/), I believe that samples contained within only a single batch (all but my repeats) will be used only to estimate variance within that batch, not to estimate the batch effect itself. Therefore, removeBatchEffect() should effectively use only the repeated samples when estimating the batch correction. Is this correct?
- If my above assumption is incorrect, how would I execute this approach within limma? Or using another batch correction algorithm (ComBat, sva, other)?
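For what it's worth, one way to make the "estimate from repeats only, apply to all" idea explicit is to do the subtraction manually, mimicking what removeBatchEffect() does internally (it fits a linear model with sum-to-zero batch contrasts and subtracts the fitted batch coefficients). The sketch below is an illustration under assumed inputs, not an endorsed limma workflow; the object names (logCPM, batch, is_repeat, rep_id) are hypothetical placeholders for your own data:

```r
library(limma)

## Assumed (hypothetical) inputs:
##   logCPM    - matrix of log2-CPM values, genes x samples
##   batch     - factor of batch labels for ALL samples
##   is_repeat - logical, TRUE for the 3 x 5 repeated libraries
##   rep_id    - factor identifying which of the 3 shared libraries
##               each repeated column is

## Sum-to-zero contrasts for batch, as removeBatchEffect() uses internally.
B <- contr.sum(levels(batch))

## Estimate batch coefficients from the repeats only, blocking on library
## identity so that differences between the 3 libraries (i.e. between
## patients) do not leak into the batch estimate.
X   <- model.matrix(~ rep_id)
fit <- lmFit(logCPM[, is_repeat],
             cbind(X, B[batch[is_repeat], , drop = FALSE]))
beta <- fit$coefficients[, -(1:ncol(X)), drop = FALSE]

## Subtract the estimated batch effects from ALL samples.
corrected <- logCPM - beta %*% t(B[batch, , drop = FALSE])
```

The corrected matrix would then be used for PCA/visualization only, not for differential expression, per the usual caveat in the removeBatchEffect() documentation.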
Thanks for any input the Bioconductor community can provide,
Mike
If I understand correctly, your design is a bit different from the question you reference, because your samples shared across batches are effectively technical replicates, i.e. the same biological sample sequenced in each batch, as opposed to different biological replicates in the same treatment group. This means that there should be no biological sources of variation between those technical replicates. (You could verify this by fitting an edgeR model of
~ sample + batch
to only the shared samples and checking whether the dispersion values are close to 0.) Assuming your design for the full dataset looks something like
~ patient + time + batch
, I believe the model will assume that the samples shared across batches are biological replicates (e.g. independent samples from the same patient at the same time point). Since it will be looking for variation that isn't there, it will lower the overall variance estimates, which will in turn overstate the significance of your differential expression results. Presumably this effect will be fairly small, since the shared samples are a small fraction of the total.

So, that's my understanding of your situation based on what I know about linear model fitting (which might not be entirely correct, so I welcome corrections). Given this, I'm not sure exactly how to fit this model, which is why this is a comment rather than an answer.
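The dispersion check suggested above could be sketched roughly as follows. The object names (counts_rep, rep_id, batch_rep) are hypothetical stand-ins for the counts and annotations of the 15 shared-library columns (3 libraries x 5 batches):

```r
library(edgeR)

## Assumed (hypothetical) inputs:
##   counts_rep - count matrix, genes x 15 shared-library columns
##   rep_id     - factor: which of the 3 shared libraries each column is
##   batch_rep  - factor: which of the 5 batches each column came from

## The "~ sample + batch" design from the comment above.
design <- model.matrix(~ rep_id + batch_rep)

y <- DGEList(counts = counts_rep)
y <- calcNormFactors(y)
y <- estimateDisp(y, design)

## For pure technical replicates, the biological coefficient of
## variation (BCV) should be close to zero.
sqrt(y$common.dispersion)   # common BCV
plotBCV(y)                  # per-gene dispersion trend
```

A common BCV well above zero would suggest the shared libraries are picking up more than technical variation (e.g. library-prep differences between batches).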
Hi Ryan,
Thanks for your insights. You are correct that the shared samples across batches come from the same cDNA libraries and therefore represent technical replicates rather than biological replicates. Given your comment, it is perhaps worth mentioning that these shared samples make up less than 10% of the total samples in my dataset, so perhaps, as you suggest, the underestimation of variance would be small?
Mike