Hi All,
We have RNA-seq data on controls and cases along with known co-variates like race, age, sex, RIN and library batch. So, in DESeq2 I could correct for these covariates as follows:
dds=DESeqDataSetFromMatrix(countData=countData,colData=coldata,design=~race+age+sex+RIN+batch+condition)
On the other hand, svaseq run on the normalized counts (from DESeq2) identified 13 surrogate variables for the same data.
My question is regarding the design that corrects for both known and unknown surrogate variables:
1) design=~race+age+sex+RIN+batch+SV1+......SV13+condition
or
2) design=~SV1.......SV13+condition assuming svaseq is accounting for differences due to known covariates as well.
Design 1 corrects for 18 variables and design 2 for 13. With a sample size of ~100, are we over-correcting?
Thanks,
Nirmala
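One way to put numbers on the over-correction worry is to count the coefficients each design would estimate and the residual degrees of freedom that remain. The sketch below uses base R only. The sample table is simulated, the numbers of race and batch levels are assumptions, and the SV columns are random stand-ins for what svaseq would return, so only the bookkeeping carries over to real data.

```r
set.seed(42)
n <- 100  # roughly the sample size mentioned in the question

# Hypothetical sample table; the numbers of race and batch levels are assumptions
coldata <- data.frame(
  race      = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
  age       = round(runif(n, 20, 80)),
  sex       = factor(sample(c("F", "M"), n, replace = TRUE)),
  RIN       = runif(n, 6, 10),
  batch     = factor(sample(paste0("b", 1:4), n, replace = TRUE)),
  condition = factor(rep(c("control", "case"), each = n / 2),
                     levels = c("control", "case"))
)

# Random stand-ins for the 13 surrogate variables (in practice: svaseq output)
sv <- matrix(rnorm(n * 13), nrow = n,
             dimnames = list(NULL, paste0("SV", 1:13)))
coldata <- cbind(coldata, sv)

known <- c("race", "age", "sex", "RIN", "batch")
svs   <- colnames(sv)

# Build the two candidate designs from the question
design1 <- reformulate(c(known, svs, "condition"))
design2 <- reformulate(c(svs, "condition"))

X1 <- model.matrix(design1, coldata)
X2 <- model.matrix(design2, coldata)

# Coefficients per gene and residual degrees of freedom for each design
df_summary <- data.frame(
  design       = c("design1", "design2"),
  coefficients = c(ncol(X1), ncol(X2)),
  residual_df  = c(n - qr(X1)$rank, n - qr(X2)$rank)
)
print(df_summary)
```

Note that a factor contributes one coefficient per level beyond the first, so "18 variables" can mean more than 18 coefficients. With the assumed three race levels and four batches, design 1 estimates 23 coefficients per gene and design 2 estimates 15.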
Dear Bernd,
Thanks for an interesting post and a useful answer to it.
This extraction of the frozen data has helped me. I also looked at a PCA of the data with the batch effects removed, and from the PCA it seems that SVA is working very well.
Please can you explain this comment?
PS: Do not use the cleaned data for downstream analysis!
If the aim of SVA is to "clean" data from the batch effects, why can't we use it for downstream analysis?
It is fine to use for PCA/MDS/cluster exploratory analysis but not for something else?
I am guessing there is a statistical reason for it, and it will be great to know exactly why.
Thank you.
John.
Hi John,
the major problem is that using the cleaned data can lead to overly confident results in the downstream analysis (since you remove variation from the data in general) and/or regress out interesting biological signal. Two key papers are:
Nygaard et al., 2015, Exaggerated effects after batch correction
See also "A: voom with combat" (Aaron Lun's simulated data example)
(it discusses ComBat, but the warnings apply to an SVA analysis that uses the experimental factors as well)
and Jaffe et al., 2016, which more specifically discusses SVA.
When you want to use the cleaned data in downstream analysis, it is a good idea not to include information about the experimental groups of interest in the cleaning process. A simple way to do this is to select empirical controls (e.g. genes that have a low variance across samples) and use them to infer the surrogate variables (run sva with method = "supervised" and specify the control genes). This and more sophisticated techniques are explored in the RUVnormalize package and the associated paper.
So in a nutshell it does not mean that you cannot use the cleaned data downstream, but be careful about it ...
Bernd
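The overconfidence described above can be seen in a toy, single-gene sketch in base R. This is not SVA or ComBat itself. It assumes one simulated gene with a batch shift and no true group effect, an unbalanced design, and a "cleaning" step that subtracts the fitted batch effect while preserving the group term. All variable names are made up for illustration.

```r
set.seed(1)
n <- 24
batch <- factor(rep(c("b1", "b2"), each = 12))

# Unbalanced design: most cases sit in batch b2
group <- factor(
  c(rep("ctrl", 9), rep("case", 3), rep("ctrl", 3), rep("case", 9)),
  levels = c("ctrl", "case")
)

# One simulated gene: a batch shift of 2, no true group effect
y <- rnorm(n) + ifelse(batch == "b2", 2, 0)

# Recommended route: keep batch in the model when testing group
full <- lm(y ~ batch + group)

# "Cleaning": subtract the fitted batch effect, preserving the group term
y_clean <- y - coef(full)["batchb2"] * (batch == "b2")

# Naive downstream test on the cleaned values, batch no longer in the model
naive <- lm(y_clean ~ group)

t_full  <- summary(full)$coefficients["groupcase", "t value"]
t_naive <- summary(naive)$coefficients["groupcase", "t value"]

print(c(df_full = df.residual(full), df_naive = df.residual(naive)))
print(c(t_full = t_full, t_naive = t_naive))
```

The group estimate is identical in both fits, but the fit on the cleaned values has one more residual degree of freedom and ignores the correlation between group and batch, so its t-statistic is larger in absolute value. Keeping batch (or the surrogate variables) in the design, as in the original question, avoids this.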
Hello Bernd,
Thank you very much for your help and advice, and for such a comprehensive answer. The publications are also great.
As an aside, I have also noticed the limma removeBatchEffect function in addition to SVA and RUV, which seems to work pretty well on my data.
best,
John.