Can I meaningfully interpret a PCA or MDS plot of data with surrogate variable effects subtracted?
Scripps Research, La Jolla, CA

I am using sva in an analysis of my ChIP-Seq data, and I would like to look at a PCA or MDS plot of the data with the effects of surrogate variables subtracted out. Briefly, I use limma's removeBatchEffect to subtract the fitted effects of the surrogate variables and then run plotMDS on the result. When I do so, the MDS plot looks much cleaner than when plotMDS is run on the uncorrected data, as expected.

However, I'm not sure if this is a reasonable thing to do. One could make the argument that I told removeBatchEffect to remove all the variation in the data that doesn't match my specified model, so therefore the fact that the resulting MDS plot matches my model is merely circular reasoning rather than indicative of an actual effect. At the very least, the generation of the "corrected" MDS plot is dependent on the specification of the experimental design, whereas an ordinary MDS plot is not. On the other hand, this "corrected" MDS plot corresponds more closely to what the differential testing results show, so one could argue that if the differential tests are valid, so is the MDS plot. And my design only specifies what the groups are, not what their relative arrangement should be in principal coordinate space.
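To make the circularity concern concrete, here is a sketch of the algebra, assuming the correction fits one linear model per feature containing both the design and the surrogate variables and then subtracts only the fitted surrogate variable part. The symbols are illustrative notation, not taken from the limma documentation: y is the vector of values for one feature, X the design matrix, W the matrix of surrogate variables.

```latex
y = X\beta + W\gamma + \varepsilon
\qquad\Longrightarrow\qquad
y^{\ast} = y - W\hat{\gamma} = X\hat{\beta} + \hat{\varepsilon}
```

Written this way, the "corrected" values are the fitted design effects plus the residuals of the full model. The group structure in the plot is therefore the estimated one, and the scatter around it reflects only what neither the design nor the surrogate variables could explain.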

So, can anyone give me some insight as to how much I can read into this "corrected" MDS plot, and which of the above arguments is more correct?

If you want to see an example of such an MDS plot with and without SV subtraction, look here: (Note: There are multiple MDS plots because I'm also testing multiple normalization methods, so make sure to pay attention to the plot titles.)

WEHI, Melbourne, Australia

Yes, it is a reasonable thing to do for the purposes of data display and exploration, which is why I made the removeBatchEffect function. But it is not reasonable to try to judge statistical significance from such a plot. Only the linear model can do that.

Entering edit mode

My experience is that the "after batch correction"-plots may sometimes be useful and sometimes be very misleading. It all depends on the batch/group balance, what information is used in the correction and what you are looking for. The same would apply for any other effect one tries to remove from the data.

These are my experiences for a few scenarios:

For a balanced or fairly balanced design, i.e. one where each batch contains the same proportion of samples from each group, the "after"-plot seems to be reliable.

For a severely unbalanced design where the group information is left out in the correction step, the "after"-plot is useful to inspect whether your groups cluster, i.e. whether there is more biology than batch effect.

For a severely unbalanced design where the group information is used in the correction step, the "after"-plot may show a clear group effect regardless of any true effect. This is a quite unreliable plot if your purpose is to detect group effects.

If you are interested in seeing which samples are more similar within a group, the "after"-plot could probably be reliably used.
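Which of these scenarios applies depends on how strongly group and batch are tied together. One way to put a number on that, as a rough sketch only, is Cramér's V computed from the group-by-batch table of sample counts. The function name and the example layouts below are hypothetical and not taken from the thread.

```python
from collections import Counter
from math import sqrt

def cramers_v(groups, batches):
    """Cramér's V between group and batch labels (0 means perfectly balanced)."""
    n = len(groups)
    g_tot = Counter(groups)
    b_tot = Counter(batches)
    cell = Counter(zip(groups, batches))
    k = min(len(g_tot), len(b_tot)) - 1
    if k == 0:
        raise ValueError("need at least two groups and two batches")
    chi2 = 0.0
    for g in g_tot:
        for b in b_tot:
            expected = g_tot[g] * b_tot[b] / n  # count expected under independence
            chi2 += (cell[(g, b)] - expected) ** 2 / expected
    return sqrt(chi2 / (n * k))

# Balanced: 3 groups x 3 batches with 3 samples in every cell.
balanced_g = [g for g in "ABC" for _ in range(9)]
balanced_b = [1, 2, 3] * 9

# Skewed (assumed layout): 8 samples where group and batch "match", 1 elsewhere.
skewed_g, skewed_b = [], []
for gi, g in enumerate("ABC"):
    for bi in range(3):
        count = 8 if gi == bi else 1
        skewed_g += [g] * count
        skewed_b += [bi + 1] * count

print(cramers_v(balanced_g, balanced_b), cramers_v(skewed_g, skewed_b))
```

The balanced layout gives 0 and the 8/1/1 layout gives 0.7. Values approaching 1 mean that batch nearly determines group, which is the situation where the "after"-plot deserves the most suspicion.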

I have deduced the above through analysis of simulated data. I made a shiny-app to inspect batch effect correction and used PCA plots and other plots to see how data with known properties could look (no MDS). I have set up a simulation with an unbalanced design where group labels are used in the correction step to illustrate a scenario where the "after"-plot is not meaningful. The pretended "aim" of this "after"-plot is to see whether there is any biology going on in the data, i.e. group effects.

Link to simulation. 

The simulated data consists of 1000 "genes" in 30 samples from 3 equally sized groups (A red, B green, C blue) unevenly distributed across 3 batches (1, 2, 3). The data are drawn from a normal distribution with no group effect, but with known batch effects added. 

Focus on the three PCA plots on the right.  

The top PCA-plot is made from the data before the batch effect is added. You cannot make this plot with real data. This is what the plot would have looked like if there had been no batch effect in the first place, or if the batch effect had been miraculously perfectly removed. The samples are spread out with no clear pattern, in agreement with the simulation setup.

The middle PCA-plot is made from data with the batch effect added. This is the "before"-plot often used to deduce that there is a batch effect. The samples clearly cluster by batch, indicating that we have a batch problem. Again, in agreement with the simulation setup.

The third PCA-plot is made from data after batch correction with removeBatchEffect. The group labels were included in the design parameter. This is the "after"-plot. The samples clearly cluster by group. One would easily think there is a strong group effect, i.e. a lot of biology going on. However, there are no group effects in this simulated data, and it is the use of removeBatchEffect that has introduced this clear pattern.
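The same effect can be reproduced without any plotting. The following is a minimal sketch, not the linked simulation: the 8/1/1 layout, the batch effect size, and all function names are assumptions, and a naive one-way F statistic for group stands in for the visual impression of a PCA plot. The correction step fits group and batch together by least squares and subtracts only the fitted batch terms, which is my understanding of what removeBatchEffect does when the groups are passed in the design.

```python
import random

def pinv_rows(X):
    """Return (X'X)^-1 X' for a full-rank design X given as a list of rows."""
    p = len(X[0])
    # Augmented matrix [X'X | X'], reduced by Gauss-Jordan elimination.
    A = [[sum(r[i] * r[j] for r in X) for j in range(p)] + [r[i] for r in X]
         for i in range(p)]
    for c in range(p):
        piv = max(range(c, p), key=lambda k: abs(A[k][c]))
        A[c], A[piv] = A[piv], A[c]
        d = A[c][c]
        A[c] = [v / d for v in A[c]]
        for k in range(p):
            if k != c:
                f = A[k][c]
                A[k] = [a - f * b for a, b in zip(A[k], A[c])]
    return [row[p:] for row in A]

def layout(diag, off):
    """Labels for 3 groups x 3 batches: `diag` samples where they match, else `off`."""
    groups, batches = [], []
    for g in range(3):
        for b in range(3):
            n = diag if g == b else off
            groups += [g] * n
            batches += [b] * n
    return groups, batches

def one_way_f(y, groups):
    """Naive one-way ANOVA F for group, ignoring that batch was already fitted."""
    levels = sorted(set(groups))
    grand = sum(y) / len(y)
    between = within = 0.0
    for g in levels:
        vals = [v for v, lab in zip(y, groups) if lab == g]
        m = sum(vals) / len(vals)
        between += len(vals) * (m - grand) ** 2
        within += sum((v - m) ** 2 for v in vals)
    return (between / (len(levels) - 1)) / (within / (len(y) - len(levels)))

def mean_naive_f(groups, batches, n_genes=500, batch_sd=3.0, correct=True, seed=1):
    """Average naive group F over simulated genes that have NO true group effect."""
    rng = random.Random(seed)
    # Columns: intercept, group 1, group 2, batch 1, batch 2 (treatment coding).
    X = [[1.0, float(g == 1), float(g == 2), float(b == 1), float(b == 2)]
         for g, b in zip(groups, batches)]
    P = pinv_rows(X)
    total = 0.0
    for _ in range(n_genes):
        shift = [rng.gauss(0, batch_sd) for _ in range(3)]  # per-gene batch effects
        y = [rng.gauss(0, 1) + shift[b] for b in batches]   # noise plus batch only
        if correct:
            beta = [sum(p * v for p, v in zip(row, y)) for row in P]
            # Subtract only the fitted batch terms; intercept and group terms stay.
            y = [v - beta[3] * x[3] - beta[4] * x[4] for v, x in zip(y, X)]
        total += one_way_f(y, groups)
    return total / n_genes

groups, batches = layout(diag=8, off=1)  # 30 samples, group mostly tied to batch
truth = mean_naive_f(groups, batches, batch_sd=0.0, correct=False)
after = mean_naive_f(groups, batches, batch_sd=3.0, correct=True)
print(f"no batch effect: {truth:.2f}   after correction: {after:.2f}")
```

With no batch effect the average F should sit close to 1, as expected when nothing is going on. After correction on the unbalanced layout it should come out well above 1 even though no group effect was simulated, which corresponds to the apparent group clustering in the third plot.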

Back to the post.
I recommend that you make your own simulation for your particular design and analysis. I found simulations very useful for convincing myself of these problems. However, my simulations were not exhaustive and I may have missed things.

