Question: Batch Correction: DESeq2 Linear Model vs Batch-Correction-Specific Algorithms (RUV, ComBat, ect.)
1
4 weeks ago by
wunderl20
wunderl20 wrote:

I am analyzing bulk RNA data from neurons derived from iPSC cells. The experimental design matrix is as follows:

Lineage     Method
A 1
A 2
A 3
B 1
B 2
B 3
C 1
C 2
C 3

Here lineage indicates what iPSC line the neuronal sample was derived from, and method the method used to derive said neurons. All iPSC lines were derived from the same starting material, so they can be considered to be biological replicates of each other.

The main question I am trying to answer is what genes are significantly up/down regulated for a given method compared to the other two methods. I am also interested in determining how 'bad' of a batch effect the source iPSC lineage is (ie. how much of the variation is explained by lineage vs by method, we hope to find it is mostly explained by method)

The way I have been handling this with DESeq2 is using the design ~ lineage + method. One of my colleagues claimed, however, that this was an inappropriate use of DESeq2 since I do not have the right replicate structure. He claimed that I needed biological replicates that were identical from a design perspective in order for DESeq2 to be an appropriate choice (eg multiple samples generated from lineage A using method 1, ect). This would imply design matrix like this (here I am adding a sample column for clairty):

Lineage     Method     Sample
A 1 S1
A 1 S2
A 2 S3
A 2 S4
.... .... ....
C 3 S18

His rationale was that DESeq2 is not able to leverage the fact that A1 B1 C1 are "expected" to be the "same"/similar and that the modeling is making an assumption of additive variance between the lineage and method where as batch-correction methods (such as RUV or ComBat) would not suffer from these problems, making them a more appropriate choice.

My questions are:

1. Is it correct that, given a design like the first matrix, DESeq2 is not an appropriate choice when attempting to control for lineage while comparing between methods, and that other batch-correction-specific methods should be used?

2. If DESeq2 is an appropriate choice for this senario, is the design ~ lineage + method the most appropriate? Also, what is the best way to compare the strength of the overall effect of lineage vs the overall effect of method? I'm guessing this might involve extracting and comparing the model coefficients or perhaps a likelihood ratio test.

modified 4 weeks ago by Michael Love24k • written 4 weeks ago by wunderl20
Answer: Batch Correction: DESeq2 Linear Model vs Batch-Correction-Specific Algorithms (R
2
4 weeks ago by
Michael Love24k
United States
Michael Love24k wrote:

Batch effects that can be captured by LFC between batches, eg additive on the log scale will be “fixed” by just adding a linear term. And it’s similar to the kind of things that SVA or RUV would find because they also compute decompositions on the log scale, and those are designed to be provided in the design formula of a method like DESeq2 or others. But when I know the batch I typically just put it in the design.

I’ve also done a lot of work on per sample batch correction in alpine and Salmon, and if you use Salmon with GC bias correction, it helps a lot to reduce the most common technical variation in RNA-seq. You can do both, use Salmon and tximport AND use the ~batch + condition design to capture what’s left. This is what most groups do.

Thanks Michael, I really appreciate the help!!

I will look into Salmon. So far my pipeline has been STAR --> featureCounts --> DESeq2. Would you say that not correcting for GC bias is a significant oversight?

Also, is there a reason DESeq2 does not have anything analogous to limma's removeBatchEffects? Ideally I would like to do clustering and visualization on the data after removing batch effects using the model fit by DESeq instead of using a different linear model fit by removeBatchEffects.

1

That's a long question. I think GC bias can mostly be removed at the gene-level with batch covariates. But isoform-level really requires direct modeling of fragments (as is shown fairly extensively in alpine and Salmon papers). My point is just that, if one uses Salmon and tximport, we have a sample-specific term which helps to account for some of the batch differences already, and you can use batch in the design to capture the rest.

DESeq2 offers the GLM for testing (e.g. DESeq() and results()), in which case batch is included in the design.

We also offer the transformations for clustering, and then you get log-scale count data which is variance stabilized. Why should we re-implement removeBatchEffect when this is just what one would want to do. It's removing shifts in the log-scale data that are associated with batch covariates. There's nothing that we would add by making a "DESeq2" version of this function.

Thank you for the elaboration. I didn't realize removeBatchEffect operated at the log-transformed level, I can see now why it is distinct from the suite of tools DESeq2 provides. The point about GC bias was very interesting as well. I am very new to this sort of space so I really appreciate the help!