Well, if you expect similar results before and after batch removal, then what's the point of doing it at all?
In your case, I would suspect that the batch effect "muddies the water" by introducing additional variability between well-correlated samples in different batches. You haven't shown what your experimental design is, but let's consider a simple case: you have two libraries that are perfectly correlated with each other, with normally distributed log-expression values:
lib1 <- lib2 <- rnorm(1000, runif(1000, -2, 2))
cor(lib1, lib2) # gives 1, obviously.
Assuming that the two libraries belong to different batches, each one picks up its own normally distributed batch effect:
lib1 <- lib1 + rnorm(1000)
lib2 <- lib2 + rnorm(1000)
cor(lib1, lib2) # should give something smaller.
This reduces the correlations as the effect of being in each batch is different. Thus, if you remove the batch effect, you'll recover the larger correlation. Note, however, that if two poorly-correlated libraries are in the same batch, then the correlation between them gets increased because of the shared batch effect:
lib1 <- rnorm(1000, runif(1000, -2, 2))
lib2 <- rnorm(1000, runif(1000, -2, 2))
cor(lib1, lib2) # close to zero.
batch <- rnorm(1000)
lib1 <- lib1 + batch
lib2 <- lib2 + batch
cor(lib1, lib2) # bigger.
So the effect of the batch on the correlations depends on which pairs of libraries you consider. In any case, removing the batch effect would seem to give the more appropriate results, by avoiding deflated correlations due to differences between batches and inflated correlations due to the presence of libraries in the same batch.
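To put the two cases above together, here is a minimal base-R sketch. Everything in it is an assumption made for illustration: the design (three batches of eight libraries), the variances, and the fact that all libraries are treated as replicates of the same biology. The correction step is a plain "centre each gene within its batch, then add back the gene's average over batches", which is in the spirit of removing a batch term from a linear model but is not a substitute for removeBatchEffect on real data.

```r
set.seed(42)
n_genes <- 1000
batch <- factor(rep(c("A", "B", "C"), each = 8))  # hypothetical design

# Shared biological signal per gene, plus independent noise per library.
mu <- rnorm(n_genes, sd = 2)
noise <- matrix(rnorm(n_genes * length(batch)), nrow = n_genes)

# One gene-wise effect per batch, shared by every library in that batch.
effects <- matrix(rnorm(n_genes * nlevels(batch)), nrow = n_genes)
observed <- mu + noise + effects[, as.integer(batch)]

# Gene-wise mean within each batch (genes x batches).
batch_means <- sapply(levels(batch), function(b) {
  rowMeans(observed[, batch == b, drop = FALSE])
})

# Subtract each library's batch mean, then restore the gene's overall level.
corrected <- observed - batch_means[, as.integer(batch)] + rowMeans(batch_means)

# Mean within-batch correlation minus mean between-batch correlation.
b <- as.integer(batch)
same_batch <- outer(b, b, "==")
pairs <- upper.tri(same_batch)
gap <- function(x) {
  r <- cor(x)
  mean(r[pairs & same_batch]) - mean(r[pairs & !same_batch])
}

gap_before <- gap(observed)   # clearly positive: batches separate
gap_after <- gap(corrected)   # much closer to zero
```

Before correction, libraries from the same batch correlate noticeably better than libraries from different batches; afterwards the two kinds of pairs look much more alike. With small batches the centring can push the within-batch correlations slightly below the between-batch ones, so the gap after correction need not be exactly zero.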
Could you please explain in a bit more detail what you mean when you say the following?
I normalize using spike in RNA and i estimate size factors (following the Deseq methodology)
Can you describe in a bit more detail (perhaps with code) how you are doing that, exactly?
Yes, you are right... I follow the DESeq2 protocol. This is the code:
I get the matrix above and then I do a PCA. In my PCA I find a batch effect. I construct a correlation heatmap and I see the clustering of my samples.
Continuing, I run the removeBatchEffect function on the Log.countsMmus matrix. The batch effect is corrected and my PCA looks great. BUT, when I construct a correlation heatmap, the scale is totally different and the correlation values are really high. I believed that after the removal of the batch effect, the Pearson correlation heatmap would look similar to the one from my initial matrix. I hope that makes it clearer. Thanks!
Thanks for sharing the code. Next step: including figures would be helpful (as well as code to generate them), such as the PCA and heatmaps pre/post batch effect removal.
Anyway: if the PCA result on your data after you call removeBatchEffect is so striking, why should you expect the correlation heatmap to look similar to the original data (do you mean the original correlation heatmap)? What do you mean by the scales being completely different?
Hey Steve,
Thanks for your reply. This is the original heatmap:
https://dl.dropboxusercontent.com/u/14753468/p_correlation_all_data_SE.pdf
And this is the heatmap after removal of the batch effect:
https://dl.dropboxusercontent.com/u/14753468/p_correlation_all_data_SE_corrected.pdf
What do you think? The colors representing the columns are the three batches. As you can see, in the second case the batches mix quite well.
Well, I guess the red batch gets broken up somewhat. The stranger thing is the differences in the color keys; I think you've forced symmetric colors in the second, and you haven't done so in the first. Otherwise there would be no reason that a correlation of zero is white in the second plot.
Other than that, I don't think there's any cause for concern. The absolute values of the correlations seem comparable between the plots, so the changes aren't ridiculous.
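On the color key point, one way to make the two heatmaps directly comparable is to give both the same palette and the same break points, symmetric around zero. The sketch below uses base R's heatmap on placeholder data; the matrix, the library names and the palette are all made up for illustration, and the original plotting code isn't shown. If the plots came from gplots::heatmap.2 (an assumption), its breaks argument could be used in the same way.

```r
set.seed(1)

# Placeholder log-expression matrix: 1000 genes x 6 libraries (hypothetical).
expr <- matrix(rnorm(1000 * 6), nrow = 1000,
               dimnames = list(NULL, paste0("lib", 1:6)))
cors <- cor(expr)

# Fixed palette and breaks spanning [-1, 1], so white always means zero.
n_col <- 100
pal <- colorRampPalette(c("blue", "white", "red"))(n_col)
brks <- seq(-1, 1, length.out = n_col + 1)

# Reuse 'pal' and 'brks' for both the uncorrected and corrected matrices.
heatmap(cors, symm = TRUE, scale = "none", col = pal, breaks = brks)
```

Because the breaks are fixed rather than derived from each matrix's own range, a given color corresponds to the same correlation in both plots.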