In my bulk RNA-seq dataset, I noticed a few outliers on MDS plot and on further investigation found that these samples originated from one extraction batch (extbatch) 7.
Fig 1: MDS plot showing outliers and that they belong to extbatch 7.
Most of the samples from batch 7 and even batch 8 had lower RNA yield than other samples.
Fig 2: Mean and SD of RNA yield over extraction batches.
So now i am trying to correct this. I tried
sva::ComBat() to correct for this extbatch and I ended up with negative values in my data. For comparison, I tried correction using
scran::mnnCorrect. This also produced negative values.
Fig 3: Distribution of raw counts, counts after scran correction and combat correction. X-axis are individual samples. Y-axis are counts forced to limits -100K and 100K. Colours denote extraction batches. Red is the offending batch 7 with mostly bad samples. See appearance of negative values after batch correction.
Now, there is the argument that rather than correcting the batch, include it as a factor in my GLM model (
~extbatch+family+condition). I did that and found 2 DEGs. If I remove samples from this batch 7 from my data altogether and run my model (
~family+condition), I find many DEGs.
So, some of my questions are
1. Why do I get negative values and how do I fix that?
2. Is it better to discard these offending samples (I lose half my data) rather than fixing it using batch correction.
3. Why is the GLM with extbatch failing to produce any DEGs while removing the batch works?
4. If anyone has any better suggestions to handle this?