Question: remove batch effect using limma
3.7 years ago by
elpsakkas0 wrote:

I have a quick question concerning the removal of the batch effect using removeBatchEffect()…

I have RNA-seq data in three batches. I normalize using spike-in RNA and estimate size factors (following the DESeq methodology). At the end I get a matrix, which I log2-transform. When I do PCA on this data matrix I see a profound batch effect.

I used the removeBatchEffect() function on this log2 matrix, and it corrects my batch effect nicely. When I construct a sample correlation (Pearson) heatmap, though, I get a totally different value scale compared to my log2 data. Is there a simple explanation for that? The correlation structure looks similar in both cases, but the scales are totally different! Can the function really change the values to such a degree?

Ellis

modified 3.7 years ago by Aaron Lun25k • written 3.7 years ago by elpsakkas0

Could you please explain in a bit more detail what you mean when you say the following?

I normalize using spike-in RNA and estimate size factors (following the DESeq methodology)

Can you describe in a bit more detail (perhaps with code) how you are doing that, exactly?

Yes, you are right... I follow the DESeq2 protocol. This is the code:

countsMmus <- cd[ which( geneTypes == "ENSM" ), ]    # endogenous mouse genes
countsERCC <- cd[ which( geneTypes == "ERCC" ), ]    # ERCC spike-ins

sfERCC <- estimateSizeFactorsForMatrix( countsERCC ) # size factors from spike-ins
sfMmus <- sfERCC                                     # applied to the genes as well

nCountsERCC <- t( t(countsERCC) / sfERCC )
nCountsMmus <- t( t(countsMmus) / sfMmus )

Log.counts.Mmus <- log2( nCountsMmus + 1 )

I get the matrix above and then I do a PCA, in which I find a batch effect. I construct a correlation heatmap and look at the clustering of my samples.

Continuing, I run the removeBatchEffect() function on the Log.counts.Mmus matrix. The batch effect is corrected and my PCA looks great. BUT, when I construct a correlation heatmap, the scale is totally different and the correlation values are really high. I believed that after removal of the batch effect, the Pearson correlation heatmap would look similar to the one from my initial matrix.

I hope that makes it clearer. Thanks!
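(For reference, the batch-removal step described here can be sketched as follows; this is a hedged, self-contained toy version, assuming limma is installed — the real inputs would be the spike-in-normalized Log.counts.Mmus matrix above plus one batch label per column:)

```r
library(limma)  # for removeBatchEffect()

# Toy stand-ins for the real matrix and batch labels (both assumed here):
set.seed(1)
Log.counts.Mmus <- matrix(rnorm(200 * 9), nrow = 200, ncol = 9)
batch <- factor(rep(1:3, each = 3))

corrected <- removeBatchEffect(Log.counts.Mmus, batch = batch)

pca  <- prcomp(t(corrected))                 # PCA on the corrected values
cors <- cor(corrected, method = "pearson")   # input for the sample heatmap
```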

Thanks for sharing the code. Next step: including figures would be helpful (as well as code to generate them), such as the PCA and heatmaps pre/post batch effect removal.

Anyway: if the PCA result on your data after you call removeBatchEffect is so striking, why should you expect the correlation heatmap to look similar to the original data (do you mean the original correlation heatmap)? What do you mean by the scales being completely different?

Hey Steve,

This is the correlation heatmap before removal of the batch effect:

https://dl.dropboxusercontent.com/u/14753468/p_correlation_all_data_SE.pdf

And this is the heatmap after removal of the batch effect:

https://dl.dropboxusercontent.com/u/14753468/p_correlation_all_data_SE_corrected.pdf

What do you think? The colors representing the columns are the three batches. As you can see, in the second case the batches mix quite well.


Well, I guess the red batch gets broken up somewhat. The stranger thing is the difference in the color keys; I think you've forced a symmetric color scale in the second plot but not in the first. Otherwise there would be no reason for a correlation of zero to be white in the second plot.

Other than that, I don't think there's any cause for concern. The absolute values of the correlations seem comparable between the plots, so the changes aren't ridiculous.
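(A symmetric color key can be forced explicitly so that a correlation of zero always maps to white; this is a minimal base-R sketch with a toy matrix and an illustrative palette, not the code used for the plots above:)

```r
set.seed(1)
cors <- cor(matrix(rnorm(30 * 10), nrow = 30, ncol = 10))  # toy correlations

# Symmetric breaks around zero, so r = 0 is always the middle (white) color:
brks <- seq(-1, 1, length.out = 101)
cols <- colorRampPalette(c("blue", "white", "red"))(100)
heatmap(cors, symm = TRUE, scale = "none", col = cols, breaks = brks)
```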

Answer: remove batch effect using limma
3.7 years ago by
Aaron Lun25k
Cambridge, United Kingdom
Aaron Lun25k wrote:

Well, if you expect similar results before and after batch removal, then what's the point of doing it at all?

In your case, I would suspect that the batch effect "muddies the water" by introducing additional variability between well-correlated samples in different batches. You haven't shown what your experimental design is, but let's consider a simple case; you have two libraries that are perfectly correlated to each other with log-normal expression values:

lib1 <- lib2 <- rnorm(1000, runif(1000, -2, 2))
cor(lib1, lib2) # gives 1, obviously.

Assuming that the two libraries belong in different batches, we end up introducing a normally-distributed batch effect:

lib1 <- lib1 + rnorm(1000)
lib2 <- lib2 + rnorm(1000)
cor(lib1, lib2) # should give something smaller.

This reduces the correlations as the effect of being in each batch is different. Thus, if you remove the batch effect, you'll recover the larger correlation. Note, however, that if two poorly-correlated libraries are in the same batch, then the correlation between them gets increased because of the shared batch effect:

lib1 <- rnorm(1000, runif(1000, -2, 2))
lib2 <- rnorm(1000, runif(1000, -2, 2))
cor(lib1, lib2) # close to zero.
batch <- rnorm(1000)
lib1 <- lib1 + batch
lib2 <- lib2 + batch
cor(lib1, lib2) # bigger.

So the effect of the batch on the correlations depends on which pairs of libraries you consider. In any case, removing the batch effect would seem to give the more appropriate results, by avoiding deflated correlations due to differences between batches and inflated correlations due to the presence of libraries in the same batch.
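The net effect on the correlations can be checked with a slightly bigger simulation; this sketch mimics what removeBatchEffect() does for a pure batch design, using base R's lm.fit (all numbers illustrative):

```r
set.seed(1)
ngene <- 1000
mu    <- runif(ngene, -2, 2)                        # shared true expression
y     <- matrix(rnorm(ngene * 6, mean = mu, sd = 0.3), ngene, 6)
batch <- factor(rep(1:3, each = 2))                 # 6 samples, 3 batches
eff   <- matrix(rnorm(ngene * 3), ngene, 3)         # one random shift per batch
y     <- y + eff[, as.integer(batch)]

# Regress out the batch term per gene, as removeBatchEffect() does:
X   <- model.matrix(~ batch)
fit <- lm.fit(x = X, y = t(y))
corrected <- y - t(X[, -1, drop = FALSE] %*% fit$coefficients[-1, , drop = FALSE])

mean(cor(y)[1, 3:6])           # cross-batch correlations before removal...
mean(cor(corrected)[1, 3:6])   # ...and after: noticeably higher
```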

I think I got it! Thank you. So, in a few words: in the first case (before batch removal), the correlation values between my samples are approximately 0.2-0.4 (for 90% of the samples). After I remove the batch effect they become more than 0.5 (for 90% of the samples). Does this make sense? Higher correlation because of the removal of the batch effect? The color scale changes completely.

So, is it ok to continue trying to get the most variable genes using this corrected matrix?


I can't tell you whether or not the increase in the correlations makes sense, only that it's within the realms of possibility. You'll have to look at the biology to figure out whether the de-batched results are sensible.

Calculating the variance from the corrected matrix is generally not advised, as you don't account for the loss of residual d.f. when you remove the batch effect. This would only be permissible if the number of samples is much larger than the number of batches. The better approach is to work directly from the linear model and compute the variance from the residual effects:

X <- model.matrix(~factor(batch))                    # design: batch only
fit <- lm.fit(x=X, y=t(y))                           # one fit per gene
s2 <- colMeans(fit$effects[-seq_len(fit$rank),]^2)   # residual variances

... where batch is a vector of batch IDs, and y is the normalized log-expression matrix.
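As a sanity check, the snippet above can be run on toy data and compared against per-gene lm() fits (the dimensions here are made up for illustration):

```r
set.seed(42)
batch <- rep(1:3, each = 4)                  # 12 samples in 3 batches
y     <- matrix(rnorm(50 * 12), 50, 12)      # genes x samples, log-expression

X   <- model.matrix(~ factor(batch))
fit <- lm.fit(x = X, y = t(y))
s2  <- colMeans(fit$effects[-seq_len(fit$rank), ]^2)

# The same residual variances, computed gene by gene with lm():
s2.check <- apply(y, 1, function(g) summary(lm(g ~ factor(batch)))$sigma^2)
all.equal(unname(s2), unname(s2.check))      # should be TRUE
```

This works because the squared residual effects sum to the RSS, and colMeans divides by the residual d.f. (n minus the design rank), giving the usual unbiased variance estimate.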

Great! It makes sense; I will try this way as well.

Could you have a look at the heatmaps? I posted them above.

Appreciate it

Another way to identify highly variable genes would be to fit a NB GLM (via edgeR or DESeq) and then rank the genes by their dispersion estimates. This allows you to model the batch, and it also handles discrete counts more naturally; taking the variance of the logs tends to do funny things at low counts.
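(A minimal sketch of that approach, assuming edgeR is installed; the counts and batch labels here are simulated stand-ins for the real data:)

```r
library(edgeR)

set.seed(1)
counts <- matrix(rnbinom(200 * 9, mu = 20, size = 1), nrow = 200, ncol = 9)
batch  <- factor(rep(1:3, each = 3))

design <- model.matrix(~ batch)      # batch goes straight into the model
dge    <- DGEList(counts = counts)
dge    <- calcNormFactors(dge)
dge    <- estimateDisp(dge, design)

# Rank genes by their tagwise (per-gene) dispersion estimates:
hvg <- order(dge$tagwise.dispersion, decreasing = TRUE)[1:50]
```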