Question: Questions regarding SVA batch effect normalization in the DESeq2 workflow
0
3.5 years ago by
wariobrega0 wrote:

Hello Everyone,

I'm currently analyzing a series of RNA-Seq experiments that are split across 4 runs. In order to account for the batch effect of each run, I'm trying to correct my gene counts using sva, as recommended in the Bioconductor RNA-Seq workflow (http://www.bioconductor.org/help/workflows/rnaseqGene/#removing-hidden-batch-effects)

As I worked through this step, I got very confused about the methods and procedures, so I wonder if you could answer these questions:

1) Do I need to account for the batch effect when I specify the design in the dds creation? E.g., assuming that I'm screening for differences between two cell types, should I say design = ~type + batch or just design = ~type? Why?

2) In the Bioconductor RNA-Seq workflow, the count matrix exported to sva is assumed to be already normalized. Specifically, counts are exported in this fashion:

dat <- counts(dds, normalized=TRUE)

This seems pretty confusing to me, as my understanding is that the batch effect needs to be removed before performing the normalization. Is this true for DESeq2? Do I only need to estimate the size factors before running sva? Do you have code examples that you can share?

3) Do you have examples of batch effect removal other than the one in the tutorial? Are there any other Bioconductor packages that integrate batch effect normalization into the DESeq2 workflow?

Apologies if the questions sound naive, I'm a self-learner and I think I still have a long road to walk!

Daniele

sva deseq2 batch effect • 3.5k views
modified 3.5 years ago by Michael Love25k • written 3.5 years ago by wariobrega0
Answer: Questions regarding SVA batch effect normalization in the DESeq2 workflow
3
3.5 years ago by
Michael Love25k
United States
Michael Love25k wrote:

1) Do I need to account for the batch effect when I specify the design in the dds creation? E.g., assuming that I'm screening for differences between two cell types, should I say design = ~type + batch or just design = ~type? Why?

If you have a known batch variable, you should include it in the colData and the design. If you put ~batch + type, then you can call results(dds) and it will know to grab the results for the last variable, "type".

A design of ~batch + type uses a per-gene fixed effect to account for the differences between batches (similar to the terms that are used to model differences in type).
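As a minimal sketch of that design, assuming a simulated dataset from makeExampleDESeqDataSet() in place of real counts, and hypothetical colData columns named batch (four runs) and type (two cell types):

```r
library(DESeq2)

# Simulated counts stand in for real data (assumption for illustration only)
set.seed(1)
dds <- makeExampleDESeqDataSet(n = 500, m = 16)

# Hypothetical sample annotation: 2 cell types, 4 sequencing runs,
# arranged so that every run contains both cell types
dds$type  <- factor(rep(c("A", "B"), each = 8))
dds$batch <- factor(rep(paste0("run", 1:4), times = 4))

# Known batch goes into the design; the variable of interest goes last
design(dds) <- ~ batch + type

dds <- DESeq(dds)

# With no further arguments, results() reports the last design variable (type)
res <- results(dds)
head(res)
```

Putting type last is only a convenience for the default results() call. If the run and the cell type were completely confounded (each run containing a single type), the two effects could not be separated by this or any other design.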

2) This seems pretty confusing to me, as my understanding is that the batch effect needs to be removed before performing the normalization. Is this true for DESeq2? Do I only need to estimate the size factors before running sva? Do you have code examples that you can share?

The normalized=TRUE argument to counts() only corrects for library size. Batch effects are a separate effect, which should be estimated after removing known library size differences. Yes, you should estimate size factors first, then provide normalized counts to svaseq as in the workflow (I don't have another example, but this is the correct paradigm).
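The order of operations described here can be sketched as follows. This follows the pattern of the rnaseqGene workflow, but runs on simulated data, and the column name type, the low-count filter threshold, and n.sv = 2 are assumptions chosen for illustration:

```r
library(DESeq2)
library(sva)

# Simulated counts stand in for real data (assumption for illustration only)
set.seed(1)
dds <- makeExampleDESeqDataSet(n = 1000, m = 12)
dds$type <- dds$condition  # hypothetical name for the variable of interest

# Step 1: estimate size factors, then export library-size-scaled counts
dds <- estimateSizeFactors(dds)
dat <- counts(dds, normalized = TRUE)

# Drop genes with very low average counts before estimating surrogate variables
dat <- dat[rowMeans(dat) > 1, ]

# Step 2: full model (with the variable of interest) and null model (without it)
mod  <- model.matrix(~ type, colData(dds))
mod0 <- model.matrix(~ 1, colData(dds))

# Step 3: estimate surrogate variables from the scaled counts
svseq <- svaseq(dat, mod, mod0, n.sv = 2)

# Step 4: add the surrogate variables to colData and to the design
ddssva <- dds
ddssva$SV1 <- svseq$sv[, 1]
ddssva$SV2 <- svseq$sv[, 2]
design(ddssva) <- ~ SV1 + SV2 + type
```

From here, DESeq(ddssva) followed by results(ddssva) would test type while adjusting for the estimated surrogate variables. The number of surrogate variables is a judgment call; two is used here only to keep the example short.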

3) Do you have examples of batch effect removal other than the one in the tutorial? Are there any other Bioconductor packages that integrate batch effect normalization into the DESeq2 workflow?

I don't at this time.

And Ryan's main point is important: svaseq() is for estimating unknown batch effects and other hidden structure. If you know the batches, just including the batch term in the design should be sufficient.

Answer: Questions regarding SVA batch effect normalization in the DESeq2 workflow
1
3.5 years ago by
The Scripps Research Institute, La Jolla, CA
Ryan C. Thompson7.4k wrote:

1. Yes, if you have a known batch effect, you should include it in your design formula.
2. Normalization happens before any kind of batch effect adjustment or removal, so there is no problem here.

Also, whenever I use sva with DESeq2, my preference is to run it on the regularized log CPM values calculated by the rlog function. This should reduce the contribution of noise from low-count genes.

I mostly agree but you should use only scaled counts -- for example, counts(dds, normalized=TRUE) -- with svaseq, and not anything on the log scale. Note this line of code:

https://github.com/Bioconductor-mirror/sva/blob/master/R/svaseq.R#L36

The reason to use scaled counts is that DESeq() takes care of size factor estimation itself, so it doesn't make sense for svaseq to use up surrogate variables on estimating the library size correction.

Oh, just to clarify, I meant that my preference was to use rlog and then pass the resulting normalized, regularized logCPM values to sva instead of svaseq. (As far as I can tell, the only difference between the sva and svaseq functions is the log transform in svaseq.)
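A rough sketch of that alternative (rlog followed by sva rather than svaseq), again on simulated data, with a hypothetical type column, an arbitrary pre-filter, and n.sv = 2 all assumed for illustration:

```r
library(DESeq2)
library(sva)

# Simulated counts stand in for real data (assumption for illustration only)
set.seed(1)
dds <- makeExampleDESeqDataSet(n = 1000, m = 12)
dds$type <- dds$condition  # hypothetical name for the variable of interest
design(dds) <- ~ type

# Remove genes with almost no reads so that no row is constant after rlog
dds <- dds[rowSums(counts(dds)) > 10, ]

# rlog returns log2-scale values that are already adjusted for library size
rld <- rlog(dds, blind = FALSE)
dat <- assay(rld)

mod  <- model.matrix(~ type, colData(dds))
mod0 <- model.matrix(~ 1, colData(dds))

# sva() expects data that are already on a log-like scale
svobj <- sva(dat, mod, mod0, n.sv = 2)
```

The estimated surrogate variables in svobj$sv can then be added to colData and to the design in the same way as the svaseq output. Which of the two inputs is preferable is exactly the point under discussion in this thread, so this is not a recommendation of one over the other.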