Hi,
I'm currently exploring methods for batch correction of an RNA-seq dataset from a series of clinical samples. The goal is to remove batch effects for PCA and other visualization approaches, and to generate input for downstream analytical pipelines that do not include internal methods for batch correction.
Our data consist of time courses from individual patients. To avoid batch effects when examining within-patient gene expression over time, we did not split any patient's samples across batches. We knew this design would make comparing samples across patients challenging, so we included 3 samples (each from a single time point of a different patient) that we sequenced as part of every batch.
When examining just the 3 samples repeated across our 5 batches, we see patients separating predominantly on PC1 and batch separating on PC2. The "removeBatchEffect()" function in limma appears to correct for batch effect very well, with the 5 replicates of each repeated sample now clustering very close together and no separation by batch detected in the first 10 PCs.
What I would like to do is use limma to batch-correct all samples, but estimate the batch effect using only the 3 repeated samples (since there are certain to be patient-to-patient differences between batches).
- Based on Aaron Lun's answer to a similar question (https://support.bioconductor.org/p/91704/), I believe that samples contained within only a single batch (all but my repeats) will be used only to estimate variance within that batch, not to estimate the batch effect itself. Therefore, removeBatchEffect() should effectively use only the repeated samples when estimating the batch correction. Is this correct?
- If my above assumption is incorrect, how would I execute this approach within limma? Or using another batch correction algorithm (ComBat, sva, other)?
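For what it's worth, one way to make the "estimate from repeats only, apply to all" idea explicit is to do the subtraction manually, mimicking what removeBatchEffect() does internally (it fits a linear model with sum-to-zero batch contrasts and subtracts the fitted batch coefficients). The sketch below is an illustration under assumed inputs, not an endorsed limma workflow; the object names (logCPM, batch, is_repeat, rep_id) are hypothetical placeholders for your own data:

```r
library(limma)

## Assumed (hypothetical) inputs:
##   logCPM    - matrix of log2-CPM values, genes x samples
##   batch     - factor of batch labels for ALL samples
##   is_repeat - logical, TRUE for the 3 x 5 repeated libraries
##   rep_id    - factor identifying which of the 3 shared libraries
##               each repeated column is

## Sum-to-zero contrasts for batch, as removeBatchEffect() uses internally.
B <- contr.sum(levels(batch))

## Estimate batch coefficients from the repeats only, blocking on library
## identity so that differences between the 3 libraries (i.e. between
## patients) do not leak into the batch estimate.
X   <- model.matrix(~ rep_id)
fit <- lmFit(logCPM[, is_repeat],
             cbind(X, B[batch[is_repeat], , drop = FALSE]))
beta <- fit$coefficients[, -(1:ncol(X)), drop = FALSE]

## Subtract the estimated batch effects from ALL samples.
corrected <- logCPM - beta %*% t(B[batch, , drop = FALSE])
```

The corrected matrix would then be used for PCA/visualization only, not for differential expression, per the usual caveat in the removeBatchEffect() documentation.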
Thanks for any input the Bioconductor community can provide,
Mike
If I understand correctly, your design is a bit different from the question you reference, because your samples shared across batches are effectively technical replicates, i.e. the same biological sample sequenced in each batch, as opposed to different biological replicates in the same treatment group. This means that there should be no biological sources of variation between those technical replicates. (You could verify this by fitting an edgeR model of
~ sample + batch
to only the shared samples and checking whether the dispersion values are close to 0.) Assuming your design for the full dataset looks something like
~ patient + time + batch
, I believe the model will assume that the samples shared across batches are biological replicates (e.g. independent samples from the same patient at the same time point). Since it will be looking for variation that isn't there, it will lower the overall variance estimates, which will in turn overstate the significance of your differential expression results. Presumably this effect will be fairly small, since the shared samples are a small fraction of the total.

So, that's my understanding of your situation based on what I know about linear model fitting (which might not be entirely correct, so I welcome corrections). Given this, I'm not sure exactly how to fit this model, which is why this is a comment rather than an answer.
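The dispersion check suggested above could be sketched roughly as follows. The object names (counts_rep, rep_id, batch_rep) are hypothetical stand-ins for the counts and annotations of the 15 shared-library columns (3 libraries x 5 batches):

```r
library(edgeR)

## Assumed (hypothetical) inputs:
##   counts_rep - count matrix, genes x 15 shared-library columns
##   rep_id     - factor: which of the 3 shared libraries each column is
##   batch_rep  - factor: which of the 5 batches each column came from

## The "~ sample + batch" design from the comment above.
design <- model.matrix(~ rep_id + batch_rep)

y <- DGEList(counts = counts_rep)
y <- calcNormFactors(y)
y <- estimateDisp(y, design)

## For pure technical replicates, the biological coefficient of
## variation (BCV) should be close to zero.
sqrt(y$common.dispersion)   # common BCV
plotBCV(y)                  # per-gene dispersion trend
```

A common BCV well above zero would suggest the shared libraries are picking up more than technical variation (e.g. library-prep differences between batches).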
Hi Ryan,
Thanks for your insights. You are correct that the shared samples across batches come from the same cDNA libraries and therefore represent technical replicates rather than biological replicates. Given your comment, it is perhaps worth mentioning that these shared samples make up less than 10% of the total samples in my dataset, so perhaps, as you suggest, the underestimation of variance would be small?
Mike