Question: RNA-Seq Batch correction negative values
gravatar for rmf
8 months ago by
rmf0 wrote:

In my bulk RNA-seq dataset, I noticed a few outliers on MDS plot and on further investigation found that these samples originated from one extraction batch (extbatch) 7. 

Fig 1: MDS plot showing outliers and that they belong to extbatch 7.

Most of the samples from batch 7 and even batch 8 had lower RNA yield than other samples. 

Fig 2: Mean and SD of RNA yield over extraction batches.

So now i am trying to correct this. I tried sva::ComBat() to correct for this extbatch and I ended up with negative values in my data. For comparison, I tried correction using scran::mnnCorrect. This also produced negative values.

Fig 3: Distribution of raw counts, counts after scran correction and combat correction. X-axis are individual samples. Y-axis are counts forced to limits -100K and 100K. Colours denote extraction batches. Red is the offending batch 7 with mostly bad samples. See appearance of negative values after batch correction.

Now, there is the argument that rather than correcting the batch, include it as a factor in my GLM model (~extbatch+family+condition). I did that and found 2 DEGs. If I remove samples from this batch 7 from my data altogether and run my model (~family+condition), I find many DEGs.

So, some of my questions are

1. Why do I get negative values and how do I fix that?
2. Is it better to discard these offending samples (I lose half my data) rather than fixing it using batch correction.
3. Why is the GLM with extbatch failing to produce any DEGs while removing the batch works?
4. If anyone has any better suggestions to handle this?

ADD COMMENTlink modified 8 months ago by Aaron Lun19k • written 8 months ago by rmf0
gravatar for Aaron Lun
8 months ago by
Aaron Lun19k
Cambridge, United Kingdom
Aaron Lun19k wrote:
  1. You get negative values because these methods, by and large, are designed to work on log-expression values that can go from -Inf to Inf. This is what they do, so there's nothing to "fix". If you want to account for the batch effect in a DE analysis, include batch as a factor in the design matrix.
  2. According to your first MDS plot, samples in batch 7 are highly variable within the same batch. Batch correction won't help here, because your variability is occurring within the same batch. You need to figure out why this is the case before discarding those samples. For example, what are the DE genes between the batch 7 samples on the left and their counterparts on the right? Sex-related genes, or something?
  3. See my answer to point 2. Including a batch effect won't solve your particular problem, because the variability is occurring within the batch. This will inflate the dispersions and reduce power. Removing the entire batch will avoid this effect - though obviously at the cost of losing samples from batch 7.
  4. If you still can't figure out what's happening in batch 7, you could try using RUVSeq or sva to guess the factor of unwanted variation and put that in the model. However, there's no substitute for properly understanding what's happening with your data.
ADD COMMENTlink modified 8 months ago • written 8 months ago by Aaron Lun19k

Ahan. So batch effect correction works only if all the samples in that batch are affected and not partially? By fixing the negative values, I meant to say to make them all positive because counts cannot be negative. And negative counts cannot be used by DGE packages. Anywya, I will try sva. Thanks for the suggestions.

ADD REPLYlink written 8 months ago by rmf0

You shouldn't be applying the batch correction methods on the raw counts. If you're getting negative "count" values, you're doing something wrong. There are some complicated ways to get "corrected count" values with quantile-quantile mapping, but the simplest approach is just to include batch in the design matrix.

ADD REPLYlink modified 8 months ago • written 8 months ago by Aaron Lun19k

My thinking was that I would batch correct the raw counts to get corrected raw counts which would be used in DESeq2 along with the model (which would now exclude the corrected batch variable). What kind of counts do you think the correction should be applied on? It cannot be log-cpm, vst, rlog, voom etc, because they cannot be fed into DESeq2. The negative values are worrying. I wonder if I could add a pseudocount to push them all above zero.

ADD REPLYlink written 8 months ago by rmf0

No, your proposed approach is not correct. As I have already said, batch correction methods are not designed to be used with the raw counts. If you try to do so... well, you'll get nonsensical negative counts, and no amount of pseudo-count addition can protect you against that (notwithstanding the other biases introduced by adding pseudo-counts in the first place). In any case, you should be including the batch factor directly in the model for DE analysis. Supplying "corrected" values will fail to account for the uncertainty of the correction parameters.

ADD REPLYlink modified 8 months ago • written 8 months ago by Aaron Lun19k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 390 users visited in the last hour