Question: Removing batch effects from microarray data based on only a subset of samples
15 months ago by jaro.slamecka (130), Mitchell Cancer Institute, Mobile AL, USA
jaro.slamecka wrote:

I am wondering if there is a way to remove batch effects from microarray data based on only a subset of samples. The thing is that the controls in the data are homogeneous and the experimental samples are much more heterogeneous, as seen from the PCA plot (created after removing the batch effect with ComBat). This is expected, since the experimental samples are the result of a more stochastic process, and therefore I do not necessarily expect replicates to cluster together. The blue circles are control samples and the green circles are experimental samples.

For the differential expression analysis between the groups in limma, I account for the batch in the design matrix. To create a heatmap, however, I would like to remove the batch effect (using either ComBat or removeBatchEffect) as correctly as possible so that the control samples cluster together; otherwise the batch still shows in the control branch. When I subset the data to keep only the controls and then remove the batch effect, the replicates do cluster together.

condition    batch
control      1
control      1
control      1
exp          1
exp          1
exp          1
control      2
control      2
control      2
control      2
control      2
control      2
exp          2
exp          2
exp          2
exp          2
exp          2
exp          2
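
The two uses of batch described above can be sketched in R roughly as follows. This is a minimal sketch on simulated data laid out like the table, assuming the limma package is installed; the object names (`expr`, `design_de`, `expr_all`, `expr_con`) are hypothetical and the simulated matrix is not the data from this thread.

```r
library(limma)

set.seed(1)
# Sample layout copied from the table: batch 1 = 3 control + 3 exp,
# batch 2 = 6 control + 6 exp
condition <- factor(rep(c("control", "exp", "control", "exp"), times = c(3, 3, 6, 6)))
batch     <- factor(rep(c("1", "2"), times = c(6, 12)))

# Simulated log2 expression matrix (genes x samples) with a gene-specific
# shift in batch 2; a stand-in for real normalized data
n_genes <- 500
expr <- matrix(rnorm(n_genes * 18, mean = 8), nrow = n_genes)
expr[, batch == "2"] <- expr[, batch == "2"] + rnorm(n_genes)

# Differential expression: leave the data untouched, put batch in the design
design_de <- model.matrix(~ batch + condition)
fit <- eBayes(lmFit(expr, design_de))
topTable(fit, coef = "conditionexp", number = 5)

# Visualization only: remove batch while protecting the condition effect
design_keep <- model.matrix(~ condition)
expr_all <- removeBatchEffect(expr, batch = batch, design = design_keep)

# Controls only: the batch shift is estimated from the control samples alone
is_con   <- condition == "control"
expr_con <- removeBatchEffect(expr[, is_con], batch = batch[is_con])
```

The corrected matrices are meant for plots such as heatmaps and dendrograms; the differential expression fit uses the uncorrected `expr`.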

 

Is there a way to do this? And should I be trying in the first place?

Thanks for any advice!

 

written 15 months ago by jaro.slamecka (130)

I'm not quite seeing what the problem is. The PCA plot you give shows a beautiful separation between experimental and control samples and no batch effect. What more are you after? Note that heterogeneity of experimental samples is a different issue, not necessarily related to batch correction.

You haven't actually shown us any evidence that there is a batch effect in your data, or that the batch effect is better identified from the controls than from the experimental samples. It would have been helpful to show a PCA plot (or better, a limma MDS plot) without the batch correction, so we can see what is going on.
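
A plot of the kind requested here could be produced along these lines. This is only a sketch on simulated data with the sample layout from the question, assuming limma is installed; `expr` is a hypothetical stand-in for the uncorrected, normalized log-expression matrix.

```r
library(limma)

set.seed(1)
# Sample layout from the question
condition <- factor(rep(c("control", "exp", "control", "exp"), times = c(3, 3, 6, 6)))
batch     <- factor(rep(c("1", "2"), times = c(6, 12)))

# Simulated log2 expression (genes x samples) with a batch 2 shift;
# no batch correction is applied before plotting
n_genes <- 500
expr <- matrix(rnorm(n_genes * 18, mean = 8), nrow = n_genes)
expr[, batch == "2"] <- expr[, batch == "2"] + rnorm(n_genes)
colnames(expr) <- paste0(condition, "_b", batch, "_", seq_along(batch))

# MDS plot of the uncorrected data: labels show condition and batch,
# colour shows batch (1 = black, 2 = red)
plotMDS(expr, labels = colnames(expr), col = as.integer(batch))
```

If the samples separate by colour along one of the plotted dimensions, that is visible evidence of a batch effect.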

modified 15 months ago • written 15 months ago by Gordon Smyth (37k)

Here is a PCA plot and a boxplot without the batch correction and before normalization:

Even though the PCA plot looks fine after the batch correction, my issue is that in hierarchical clustering the replicates of the control samples (LINE1, 2 and 3) do not cluster together, and this shows in the heatmap's sample dendrogram (regardless of how many genes I draw the heatmap from).

I understand there may not be much I can do about it and that it may be more of a "cosmetic" issue, but it is one I thought the heatmap could be called out on. I thought I would ask anyway and learn something new.

modified 15 months ago • written 15 months ago by jaro.slamecka (130)

This is the first time you've mentioned the "line" variable. If this is what you were asking about, you should have mentioned it in the original question. Are you saying that you expect "CON LINE1 r1" and "CON LINE1 r2" to cluster with each other more tightly than they do with other controls? If so, why do you expect this? What does "line" mean in the context of your experiment?

written 15 months ago by Ryan C. Thompson (7.3k)

Yes, exactly. Ideally I'd like to see Control Line1, 2 and 3 each cluster together with their replicates, since their gene expression pattern should be relatively stable, which I have seen in other experiments. They are cell lines derived from 3 different patients with unrelated genetic backgrounds. So I was wondering whether the heterogeneity within the experimental group (green circles) was preventing the hierarchical clustering from looking neater. The batch-corrected PCA plot looks intuitive; it is just the control branch of the heatmap's dendrogram that doesn't show the replicates clustering together.

Also, if I take out all EXP samples and remove the batch effect from the CON samples solely by supplying the batch variable, the replicates cluster together neatly, as shown below. I am not planning to do a differential expression analysis between any of the control samples but I would like to learn if doing this has any informative value at all.

written 15 months ago by jaro.slamecka (130)

A simple explanation is that the batch effect differs between different lines. This is not particularly surprising for patient-derived lines, which may be quite heterogeneous in their response to differences in culture conditions, etc. If we assume that CON LINE X is related to EXP LINE X, you can account for line-specific batch effects by blocking on LINE:Batch in your DE analysis. A similar approach can be used to make your dendrogram pretty, by running removeBatchEffect on the log-CPMs with design set to a one-way layout for paste0(LINE, Condition) and block set to paste0(LINE, Batch).
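
A line-specific batch removal in this spirit might look like the sketch below. Several things here are assumptions rather than part of the answer: the data are simulated, the assignment of samples to lines is guessed (each line contributing one control and one experimental sample to batch 1 and two of each to batch 2), and all object names are hypothetical. `removeBatchEffect` takes the batch term through its `batch`, `batch2` or `covariates` arguments. The line-by-batch term is encoded here as one sum-to-zero covariate per line rather than as a single six-level factor, because the line main effects contained in such a factor are confounded with the design.

```r
library(limma)

set.seed(1)
# Layout from the question, plus an ASSUMED assignment of samples to lines
condition <- factor(rep(c("CON", "EXP", "CON", "EXP"), times = c(3, 3, 6, 6)))
batch     <- factor(rep(c("1", "2"), times = c(6, 12)))
line      <- factor(paste0("LINE", c(1:3, 1:3, rep(1:3, each = 2), rep(1:3, each = 2))))

# Simulated log2 expression with a batch 2 shift that differs between lines
n_genes <- 500
expr <- matrix(rnorm(n_genes * 18, mean = 8), nrow = n_genes)
for (l in levels(line)) {
  idx <- line == l & batch == "2"
  expr[, idx] <- expr[, idx] + rnorm(n_genes)
}
colnames(expr) <- make.unique(paste(condition, line, paste0("b", batch), sep = "_"))

# One-way layout for line x condition: the structure to preserve
group  <- factor(paste0(line, condition))
design <- model.matrix(~ 0 + group)

# Line-specific batch term: one column per line, +1 in batch 1,
# -1 in batch 2, and 0 for samples from other lines
batch_sign <- ifelse(batch == "1", 1, -1)
line_batch <- model.matrix(~ 0 + line) * batch_sign

corrected <- removeBatchEffect(expr, covariates = line_batch, design = design)

# Sample dendrogram on the corrected values
hc <- hclust(dist(t(corrected)))
plot(hc)
```

As with any use of `removeBatchEffect`, the corrected matrix is intended for plotting; the differential expression model would be fitted to the uncorrected data with the same terms in the design.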

modified 15 months ago • written 15 months ago by Aaron Lun (23k)

Thank you! I'll try your approach.

written 15 months ago by jaro.slamecka (130)

Hi,

From a naïve point of view, I would say that all samples except the six samples of the first batch have a high background signal that prevents low expression levels from being measured. You could check that hypothesis by plotting the first 3 controls of batch 1 against the controls of batch 2 or 1. If you get a nice straight line for levels higher than 7 or 8, that confirms it. Accordingly, I would conclude that the first axis of the PCA captures this difference. If all your blue dots represent the controls, then the second axis of the PCA captures the difference between control and experimental samples. You could therefore extract the coordinates of the samples on the second axis and compute your clustering from those. Judging by the projection of the dots onto the vertical axis, it should be very discriminative.
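
Clustering on the second principal component, as suggested here, could be sketched in base R as follows. The data are simulated with a batch shift larger than the condition shift, and the object names are hypothetical. Which component tracks which factor depends on the data, so the PCA plot should be inspected before a component is picked.

```r
set.seed(1)
# Sample layout from the question
condition <- factor(rep(c("control", "exp", "control", "exp"), times = c(3, 3, 6, 6)))
batch     <- factor(rep(c("1", "2"), times = c(6, 12)))

# Simulated log2 expression (genes x samples): large batch shift,
# smaller condition shift
n_genes <- 500
expr <- matrix(rnorm(n_genes * 18, mean = 8), nrow = n_genes)
expr[, batch == "2"] <- expr[, batch == "2"] + rnorm(n_genes, sd = 2)
expr[, condition == "exp"] <- expr[, condition == "exp"] + rnorm(n_genes)

# PCA with samples as observations; colour = condition, symbol = batch
pca <- prcomp(t(expr))
plot(pca$x[, 1:2], col = as.integer(condition), pch = as.integer(batch))

# Cluster the samples using only their coordinates on the second axis
pc2 <- pca$x[, "PC2"]
hc  <- hclust(dist(pc2))
clusters <- cutree(hc, k = 2)
table(clusters, condition)
```

The cross-tabulation at the end shows how well the two clusters line up with the condition labels.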

HTH

written 15 months ago by SamGG (180)

The asker is already aware of this batch effect. It is captured in the batch variable of the sample table given in the original question.

written 15 months ago by Ryan C. Thompson (7.3k)

Oops, I thought the batch ID was encoded by the rx suffix. With the correct batch identifier, I think using PC2 for clustering is an even more obvious choice. The batch variance is clearly captured by PC1, and ComBat sounds like overkill to me. Thanks for your remark.

written 15 months ago by SamGG (180)