Questions regarding SVA batch effect normalization in the DESeq2 workflow
wariobrega • 0
@wariobrega-9755
Last seen 6.8 years ago

Hello Everyone,

I'm currently analyzing a series of RNA-Seq experiments that are split into 4 runs. To account for the batch effect of each run, I'm trying to correct my gene counts using sva, as recommended in the Bioconductor RNA-seq workflow (http://www.bioconductor.org/help/workflows/rnaseqGene/#removing-hidden-batch-effects)

As I worked through this step, I got very confused about the methods and procedures, so I wonder if you could answer these questions:

 

1) Do I need to account for the batch effect when I specify the design in the dds creation? E.g., assuming that I'm screening for the differences between two different cell types, should I say design = ~type + batch or just design = ~type? Why?

2) In the Bioconductor RNA-seq workflow, when the count matrix is exported to sva, it is assumed that the data have already been normalized. Specifically, counts are exported in this fashion:

dat <- counts(dds, normalized=TRUE)

This seems pretty confusing to me, as my understanding is that the batch effect needs to be removed before performing the normalization. Is this true for DESeq2? Do I only need to estimate the size factors before performing sva normalization? Do you have code examples that you can share?

3) Do you have examples other than the one reported in the tutorial for the batch effect removal? Are there any other Bioconductor packages that integrate the batch effect normalization into the DESeq2 workflow?

 

Apologies if the questions sound naive, I'm a self-learner and I think I still have a long road to walk!

 

Thanks in advance,

 

Daniele

sva deseq2 batch effect • 5.3k views
ADD COMMENT
@ryan-c-thompson-5618
Last seen 7 months ago
Scripps Research, La Jolla, CA

Answers to your first 2 questions:

  1. Yes, if you have a known batch effect, you should include it in your design formula.
  2. Normalization happens before any kind of batch effect adjustment or removal, so there is no problem here.

Additionally, I don't think you need sva here. The purpose of sva is to discover hidden batch effects in the data, and represent them as "surrogate variables" that you can insert into your design in lieu of the real hidden variables. However, you already know about your batch effect and can include it directly in the design. Hypothetically, if someone had given you this data set and told you "this was done in several batches, but I didn't record which samples went in which batches", then you might want to try sva to partially recover the batch information from the data itself. Similarly, if you suspect that there are additional confounding factors other than the runs, then sva will help you discover those as well.

Also, whenever I use sva with DESeq2, my preference is to run it on the regularized log-transformed counts calculated by the rlog function. This should reduce the contribution of noise in low-count genes.

ADD COMMENT

I mostly agree but you should use only scaled counts -- for example, counts(dds, normalized=TRUE) -- with svaseq, and not anything on the log scale. Note this line of code:

https://github.com/Bioconductor-mirror/sva/blob/master/R/svaseq.R#L36

The reason to use scaled counts is that DESeq() takes care of size factor estimation itself, so it doesn't make sense for svaseq to use up surrogate variables on estimating a library size correction.

ADD REPLY

Oh, just to clarify, I meant that my preference was to use rlog and then pass the resulting normalized, regularized log values to sva instead of svaseq. (As far as I can tell, the only difference between the sva and svaseq functions is the log transform in svaseq.)
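For anyone wanting to try the rlog-then-sva route described above, here is a minimal sketch. It assumes simulated data from DESeq2's makeExampleDESeqDataSet() rather than a real experiment, and the low-count filter threshold and n.sv = 2 are arbitrary choices for illustration, not recommendations.

```r
library(DESeq2)
library(sva)

# Simulated counts (an assumption for illustration); 'condition' plays the role of "type"
set.seed(1)
dds <- makeExampleDESeqDataSet(n = 1000, m = 12)

# Drop very low-count genes so sva is not fed near-constant rows (arbitrary threshold)
dds <- dds[rowSums(counts(dds)) > 10, ]

# rlog returns library-size-normalized values on a log2 scale
rld <- rlog(dds, blind = FALSE)

mod  <- model.matrix(~ condition, colData(dds))  # full model
mod0 <- model.matrix(~ 1, colData(dds))          # null model

# Plain sva() on already log-scale data; no further log transform is applied
svobj <- sva(assay(rld), mod, mod0, n.sv = 2)
dim(svobj$sv)  # one row per sample, one column per surrogate variable
```

The surrogate variables in svobj$sv can then be added to colData and to the design in the same way as with svaseq.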

ADD REPLY

Oh, I get it, I didn't read carefully enough.

ADD REPLY
@mikelove
Last seen 56 minutes ago
United States

Just to repeat Ryan's comments,

1) Do I need to account for the batch effect when I specify the design in the dds creation? E.g., assuming that I'm screening for the differences between two different cell types, should I say design = ~type + batch or just design = ~type? Why?

If you have a known batch variable, you should include it in the colData and the design. If you put ~batch + type, then you can call results(dds) and it will know to grab the results for the last variable, "type".

A design with ~batch + type uses a per-gene fixed effect to account for the differences between batches (similar to the terms that are used to model differences in type).
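As a rough illustration of the known-batch case, the sketch below puts the batch in colData and in the design. The data are simulated with makeExampleDESeqDataSet(), and the column names `type` and `batch` and the run labels are made up for the example.

```r
library(DESeq2)

# Simulated counts (an assumption for illustration): 12 samples, two groups A and B
set.seed(1)
dds <- makeExampleDESeqDataSet(n = 1000, m = 12)

# Hypothetical sample annotation: cell type plus the sequencing run of each sample
dds$type  <- dds$condition
dds$batch <- factor(rep(c("run1", "run2", "run3", "run4"), times = 3))

# Known batch goes first, variable of interest last
design(dds) <- ~ batch + type
dds <- DESeq(dds)

# With no further arguments, results() reports the last design variable (type)
res <- results(dds)
resultsNames(dds)
```

Note that this only works if batch and type are not completely confounded, i.e. each run contains samples of more than one type.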

2) This seems pretty confusing to me, as my understanding is that the batch effect needs to be removed before performing the normalization. Is this true for DESeq2? Do I only need to estimate the size factors before performing sva normalization? Do you have code examples that you can share?

The normalized=TRUE argument to counts() is only correcting for library size. Batch effects are a separate effect, which should be estimated after removing known library size differences. Yes, you should estimate size factors first, then provide normalized counts to svaseq as in the workflow (I don't have another example, but this is the correct paradigm).
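The order of operations described above could look roughly like the following sketch, which mirrors the pattern in the rnaseqGene workflow. The data are simulated with makeExampleDESeqDataSet() (so there is no real hidden batch to find), and the rowMeans filter and n.sv = 2 are illustrative choices only.

```r
library(DESeq2)
library(sva)

# Simulated counts (an assumption for illustration); 'condition' stands in for "type"
set.seed(1)
dds <- makeExampleDESeqDataSet(n = 2000, m = 12)
dds$type <- dds$condition

# 1. Library size correction first
dds <- estimateSizeFactors(dds)
dat <- counts(dds, normalized = TRUE)
dat <- dat[rowMeans(dat) > 1, ]

# 2. Estimate surrogate variables on the scaled (not log) counts
mod  <- model.matrix(~ type, colData(dds))  # full model
mod0 <- model.matrix(~ 1, colData(dds))     # null model
svseq <- svaseq(dat, mod, mod0, n.sv = 2)

# 3. Add the surrogate variables to the design and run DESeq2 on the raw counts
ddssva <- dds
ddssva$SV1 <- svseq$sv[, 1]
ddssva$SV2 <- svseq$sv[, 2]
design(ddssva) <- ~ SV1 + SV2 + type
ddssva <- DESeq(ddssva)
res <- results(ddssva)
```

The counts themselves are never "corrected" here: the surrogate variables only enter the model as covariates, and DESeq() still works from the original counts and size factors.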

3) Do you have examples other than the one reported in the tutorial for the batch effect removal? Are there any other Bioconductor packages that integrate the batch effect normalization into the DESeq2 workflow?

I don't at this time.

ADD COMMENT

And Ryan's main point is important: svaseq() is for estimating unknown batch effects and other hidden structure. If you know the batches, just including the batch term in the design should be sufficient.

ADD REPLY
