Question: Batch Correction with only one sample in a batch
10 months ago by i.sudbery (European Union), who wrote:

We have an experiment where samples were collected and sequenced in several batches (matched ATAC-seq and RNA-seq). We would like to include batch in our analysis formula if possible. The problem is that after quality control and removal of substandard samples, some batches contain only one sample (we have several batches with many samples and a few with only one).

We have used DESeq2 for the differential testing (for both ATAC and RNA) and would also like to do more downstream analysis (such as clustering). We have included batch in our design formula for DESeq2. Is this a terrible thing to do? Because the model can now set the batch effect to be anything for those samples, are they even adding anything to the analysis?

Secondly, for the downstream analysis we are using rlog from DESeq2 and then passing those results to removeBatchEffect from limma. When we do this, we basically get zeros in every row for the samples that are in a batch on their own. (This makes sense: with only one sample in a batch, the linear model can always find a beta value that accounts for all of that sample's variance.)
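The zeros are exactly what the linear algebra predicts. A toy base-R model (made-up numbers, nothing to do with your data) shows that the fitted batch term absorbs a singleton-batch sample completely:

```r
# Sample 7 is the only member of batch C; its residual after fitting the
# batch factor is exactly zero, because the batchC coefficient is free to
# match that one observation perfectly.
set.seed(1)
y     <- c(rnorm(6, mean = 5), 9.3)
batch <- factor(c("A", "A", "A", "B", "B", "B", "C"))
fit   <- lm(y ~ batch)
resid(fit)[7]   # 0 (up to floating point)
```

Anything removeBatchEffect-style correction can subtract for that sample is therefore everything except the model's fitted value, which is why the corrected rows carry no information for it.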

Some solutions we have considered:

* Pool all the samples that are the only sample in their batch into a pooled batch.

* Do the differential analysis using all samples, but use only samples from multi-sample batches in the downstream analysis

* Remove the single-batch samples from the analysis entirely. We'd rather not do this, because it would severely reduce the number of samples in the analysis.

Thanks.

limma deseq2 batch effect
modified 10 months ago by Ryan C. Thompson • written 10 months ago by i.sudbery

I wouldn’t assume that because some of your “batches” include only one sample they’re useless. As exploratory analysis, I’d perform an analysis of variance on the larger batches to see the relative contributions of batch and experimental factors to the variance, to help understand how “big” the batch effects are. If the batch effects are smaller than, or even just comparable to, the experimental effects, I might take an approach like the one you mention: combining the singletons into one batch, expanding the time periods, or otherwise expanding what constitutes a batch.
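That exploratory step can be sketched in base R on toy data (all numbers below are made up; in practice you'd run this per gene, or on principal components, using only the multi-sample batches):

```r
# Fit batch and condition together, then look at the share of the total
# sum of squares each term accounts for.
set.seed(7)
d <- data.frame(
  expr      = rnorm(12),
  condition = factor(rep(c("ctrl", "treat"), 6)),
  batch     = factor(rep(c("A", "B", "C"), each = 4))
)
av <- anova(lm(expr ~ batch + condition, data = d))
av$"Sum Sq" / sum(av$"Sum Sq")   # relative contribution: batch, condition, residual
```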

Since it can be hard to know what really matters or contributes to the strongest batch effects, another solution I would explore is the *sva* approach of inferring surrogate batch variables.
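A minimal sketch of that approach, assuming the Bioconductor sva package is installed; the counts here are simulated stand-ins for a real matrix:

```r
library(sva)

set.seed(5)
counts    <- matrix(rnbinom(100 * 8, mu = 50, size = 10), nrow = 100)  # toy counts
condition <- factor(rep(c("ctrl", "treat"), each = 4))

mod  <- model.matrix(~ condition)  # full model: biology of interest
mod0 <- model.matrix(~ 1)          # null model: intercept only
sv   <- svaseq(counts, mod, mod0, n.sv = 1)  # ask for one surrogate variable

# sv$sv holds the surrogate-variable values for the 8 samples; these can
# then be added to the DESeq2 design alongside condition.
```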

Answer: Batch Correction with only one sample in a batch
10 months ago by Ryan C. Thompson (The Scripps Research Institute, La Jolla, CA), who wrote:

You say that you don't want to remove the single-sample batches from the analysis, but this is exactly what you have already done. When you include a coefficient that only affects a single sample, this is mathematically equivalent to deleting that sample from the analysis. You were correct to suspect that these samples are not adding anything to the analysis.
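This equivalence is easy to check on toy data in base R (made-up numbers): a coefficient that is non-zero for only one sample leaves every shared estimate identical to what you get by deleting that sample.

```r
set.seed(3)
d <- data.frame(
  y     = rnorm(7),
  cond  = factor(c("c", "t", "c", "t", "c", "t", "c")),
  batch = factor(c("A", "A", "A", "B", "B", "B", "C"))  # sample 7 alone in C
)
full <- lm(y ~ cond + batch, data = d)                   # keep the singleton
sub  <- lm(y ~ cond + batch,
           data = droplevels(d[d$batch != "C", ]))       # delete the singleton
coef(full)["condt"] - coef(sub)["condt"]   # 0: identical condition estimate
```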

The bottom line is that if you know there is a significant batch effect and you only have a single sample in a batch, that sample is completely uninformative.


1. There is probably a slight effect of retaining the samples and including a sample-specific term in the design matrix, compared to explicitly removing them. This is because the calculation of the average abundance will be different, which will probably alter (slightly) the trend fit from, say, voom. There may also be slight changes in normalization, depending on which library is chosen as the reference (in TMM) or in the calculation of the average pseudo-reference sample (for DESeq's median-of-ratios normalization strategy).
2. If you want to retain information from the single-sample batches, the standard way to do so is to use duplicateCorrelation in limma. Basically, you remove the batch factor from your design matrix and use it instead in duplicateCorrelation to account for correlations induced by the batch effect.
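A sketch of that workflow, assuming limma (plus its statmod dependency) and using simulated log-expression values in place of real data:

```r
library(limma)

set.seed(42)
logexpr   <- matrix(rnorm(50 * 8), nrow = 50)    # 50 genes x 8 samples (toy)
condition <- factor(rep(c("ctrl", "treat"), 4))
batch     <- factor(c(1, 1, 2, 2, 3, 3, 4, 5))   # batches 4 and 5 are singletons

design <- model.matrix(~ condition)              # note: no batch term here
corfit <- duplicateCorrelation(logexpr, design, block = batch)
fit    <- lmFit(logexpr, design, block = batch,
                correlation = corfit$consensus.correlation)
fit    <- eBayes(fit)
topTable(fit, coef = "conditiontreat")
```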

What is the practical effect of duplicateCorrelation in this case? Does it estimate the average batch-effect size from the multi-sample batches and then discount any effects smaller than that in the single-sample batches?


For each gene, the function estimates the correlation between samples from the same batch. I think of this as the (square root of the) percentage of the variance that is explained by the blocking factor. A larger correlation corresponds to a stronger batch effect, as you might expect. The correlation is stabilized by sharing information across genes to obtain a consensus correlation, which is then used in generalized least squares for model fitting.

The role of the correlation in hypothesis testing depends on whether the comparison is within or between batches. I don't remember exactly how it works, but a large correlation will penalize comparisons between batches while improving comparisons within batches. I think it does so by scaling the standard error for the relevant coefficients. In this case, it's a bit complicated as the comparison occurs both within and between batches, so I don't know what will crawl out at the end.

We don't *KNOW* that there is a significant batch effect. Indeed, we don't see any obvious evidence of one on, for example, MDS plots. However, we do know we have batches, and that batches can have effects. How would we determine whether it is better to remove, say, a third of our samples, to not correct for batch effects at all, or to correct only where we know samples share a batch?


If you can confidently conclude, based on the batches that do have multiple samples, that the batch effect is negligible, then you can justify leaving the batch effect out of the design, allowing you to include the single-sample batches. Obviously in such a case you are extrapolating from the batch effects you can measure to all the other batch effects, and you need to make a case that this extrapolation is justified.
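One way to make that case concrete: restrict to the multi-sample batches and formally test whether adding batch to the model explains anything beyond condition. A toy base-R sketch (made-up data; in practice you would do this per gene or on summary components):

```r
set.seed(9)
d <- data.frame(
  expr      = rnorm(12),
  condition = factor(rep(c("ctrl", "treat"), 6)),
  batch     = factor(rep(c("A", "B", "C"), each = 4))
)
reduced <- lm(expr ~ condition, data = d)
full    <- lm(expr ~ condition + batch, data = d)
anova(reduced, full)   # F-test: does batch explain extra variance?
```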