Question

Correcting for known and surrogate variables in DESeq2

1

Entering edit mode

Akula, Nirmala NIH/NIMH [C] ▴ 190

@akula-nirmala-nihnimh-c-5007

Last seen 5.4 years ago

NHi All,

We have RNA-seq data on controls and cases along with known co-variates like race, age, sex, RIN and library batch. So, in DESeq2 I could correct for these covariates as follows:

dds=DESeqDataSetFromMatrix(countData=countData,colData=coldata,design=~race+age+sex+RIN+batch+condition)

On the other hand, svaseq with normalized counts (within DESeq2) identified 13 variables for the same data.

My question is regarding the design that corrects for both known and unknown surrogate variables:

1) design=~race+age+sex+RIN+batch+SV1+......SV13+condition

or

2) design=~SV1.......SV13+condition assuming svaseq is accounting for differences due to known covariates as well.

In design1 we are correcting for 18 variables and in design2 we are correcting for 13 variables, for a sample size of ~100 are we over-correcting?

Thanks,

Nirmala

deseq2 svaseq • 6.1k views

ADD COMMENT • link 8.5 years ago Akula, Nirmala NIH/NIMH [C] ▴ 190

score 1 · Answer 1 · 2016-10-05

1

Entering edit mode

Bernd Klaus ▴ 610

@bernd-klaus-6281

Last seen 6.4 years ago

Germany

Hi Nirmala,

SVA should capture the effect of the known variables as well, so I would be in favor of option 2.

To see whether SVA indeed removes the effect of the known batches, you can look at cleaned data (where the surrogate variables have been regressed out) using frozen sva like so:

# regress out surrogate variables
newV = NULL ## neccessary due to bug in the sva pacakge
fsvaobj <- fsva(edata, mod, svobj, newdat = NULL)
# get corrected data
edata_adj <- fsvaobj$db

where edata is the expression data on the log-scale (e.g. after rlog or vst, or simply log(x + pseudocount) and mod would be:

mod = model.matrix( ~  colData(dds)$condition)

(Assuming that this is the same mod that you used in the original SVA, and did not tell it to preserve the batches.)

and then perform a PCA / MDS / clustering on edata_adj to see whether a clustering per batch is still present.

You can also check associations of the surrogate variables with the known batches, as shown in the rna-seq gene workflow.

Best wishes,

Bernd

PS: Do not use the cleaned data for downstream analysis!

ADD COMMENT • link 8.5 years ago Bernd Klaus ▴ 610

1

Entering edit mode

Dear Bernd,

Thanks for an interesting post and a useful answer to it.

This extraction of frozen data has helped me. I also view PCA of data with batches removed. Great, it seems from PCA that SVA is working very well.

Please can you explain this comment?

PS: Do not use the cleaned data for downstream analysis!

If the aim of SVA is to "clean" data from the batch effects, why can't we use it for downstream analysis?

It is fine to use for PCA/MDS/cluster exploratory analysis but not for something else?

I am guessing there is a statistical reason for it, and it will be great to know exactly why.

Thank you.

John.

ADD REPLY • link 8.5 years ago John ▴ 10

2

Entering edit mode

Hi John,

the major problem is that using the cleaned data can lead to overly confident results in the downstream analysis (since you remove variation from the data in general) and/or regress out interesting biological signal. Two key papers are:

Nygaard et. al., 2015, Exaggerated effects after batch correction

See also A: voom with combat- Aaron Lun's simulated data example

(discussing combat, but the warnings apply to an SVA analysis that uses the experimental factors as well)

and Jaffe et. al., 2016, which more specifically discusses SVA.

When it is desired to use the cleaned data for us with downstream analysis, it is a good idea to not include information about experimental groups of interest in the cleaning process. A simple way to do this is to select empirical controls (e.g. genes that have a low variance across samples) and use them to infer the surrogate variables (run sva with method = "supervised" and specify control genes).

This, and more sophisticated techniques are explored in the RUVnormalize package and the associated paper.

So in a nutshell it does not mean that you cannot use the cleaned data downstream, but be careful about it ...

Bernd

ADD REPLY • link 8.5 years ago Bernd Klaus ▴ 610

0

Entering edit mode

Hello Bernd,

Thank you very much for your help and advice, for a comprehensive answer, very helpful. The publications also great.

For random interest, I have also noticed the Limma removeBatchEffect function in addition to SVA and RUV, which seems to work pretty well on my data.

best,

John.

ADD REPLY • link 8.5 years ago John ▴ 10

score 0 · Answer 2 · 2016-10-05

0

Entering edit mode

Akula, Nirmala NIH/NIMH [C] ▴ 190

@akula-nirmala-nihnimh-c-5007

Last seen 5.4 years ago

After correcting for 13 surrogate variables, a QQplot of the observed p-values looks inflated compared to the expected. Here's the code I used:

>design(dds.sva)=~SV1+SV2+SV3+SV4+SV5+SV6+SV7+SV8+SV9+SV10+SV11+SV12+SV13+condition
> dds.sva <- DESeq(dds.sva)

> ctrlBPsva=results(dds.sva,contrast=c("condition","Control","Bipolar"))
> ctrlBPsva=ctrlBPsva[order(ctrlBPsva$padj),]
> summary(ctrlBPsva)

Is there any chance that the correction is causing this inflation?

ADD COMMENT • link 8.5 years ago Akula, Nirmala NIH/NIMH [C] ▴ 190

0

Entering edit mode

This is hard to answer without some context ...

Can you post the qq-plot or the p-value histogram somewhere?

13 latent components are a lot, do you have human samples from diverse populations?

Are the SVs associated with your known batches somehow?

ADD REPLY • link 8.5 years ago Bernd Klaus ▴ 610

0

Entering edit mode

Yes, these are human samples from two different populations. Here's the qq-plot

ADD REPLY • link 8.1 years ago Akula, Nirmala NIH/NIMH [C] ▴ 190