Question

Taking surrogate variable into account before running PCA

wamiqsaifi • 0

@wamiqsaifi-13273

I ran SVA to get surrogate variables. I want to visualise my data taking the surrogate variables into consideration. I have read that modifying the original expression matrix is not a good idea (https://www.biostars.org/p/121489/). How can I go about doing this?

sva svaseq • 2.1k views

ADD COMMENT • link updated 7.8 years ago by Guido Hooiveld ★ 4.1k • written 7.8 years ago by wamiqsaifi • 0

score 3 · Accepted Answer · 2017-06-16

Guido Hooiveld ★ 4.1k

@guido-hooiveld-2020

Wageningen University, Wageningen, the …

In the few cases where I need to visualize the 'cleaned' data, I use the approaches mentioned in the Biostars thread (AFAIK they are the same). However, if you would like to identify differentially expressed genes, you should include the surrogate variables as covariates in your (linear) model rather than using the 'cleaned' data for that. This is actually also what is stated in that thread...
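To make the 'cleaning' step concrete, here is a minimal sketch in Python/NumPy, under stated assumptions rather than the exact code from the Biostars thread. It fits the full model (design plus surrogate variables) per gene, subtracts only the fitted surrogate-variable term, and then computes PCA scores on the result. The function names `remove_covariates` and `pca_scores` are made up for illustration, and the surrogate variable is simulated here as a stand-in for the output of `sva`. In R, limma's `removeBatchEffect()` with its `covariates` and `design` arguments performs this kind of adjustment.

```python
import numpy as np

def remove_covariates(expr, design, covariates):
    """Regress covariates (e.g. surrogate variables) out of a genes x samples matrix.

    The full model [design | covariates] is fit per gene and only the fitted
    covariate contribution is subtracted, so variation explained by `design`
    (the effect of interest) is left in the data.
    """
    full = np.hstack([design, covariates])                  # samples x (p + k)
    coef, *_ = np.linalg.lstsq(full, expr.T, rcond=None)    # (p + k) x genes
    gamma = coef[design.shape[1]:]                          # k x genes
    return expr - (covariates @ gamma).T

def pca_scores(expr, n_components=2):
    """Sample scores on the leading principal components (genes are the features)."""
    centered = expr - expr.mean(axis=1, keepdims=True)      # center each gene
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    return (vt[:n_components] * s[:n_components, None]).T   # samples x components

# Simulated example: 200 genes, 12 samples, two groups, one hidden factor.
rng = np.random.default_rng(0)
n_genes, n_samples = 200, 12
group = np.repeat([0.0, 1.0], n_samples // 2)
design = np.column_stack([np.ones(n_samples), group])       # intercept + group
sv = rng.normal(size=(n_samples, 1))                        # stand-in for an SV from sva

expr = rng.normal(size=(n_genes, n_samples))
expr += np.outer(rng.normal(size=n_genes), group)           # group effect
expr += 3.0 * np.outer(rng.normal(size=n_genes), sv[:, 0])  # unwanted variation

cleaned = remove_covariates(expr, design, sv)
scores = pca_scores(cleaned)                                # use these for plotting only
```

The `cleaned` matrix is meant for plots such as PCA or heatmaps. For differential expression, the original `expr` would be modelled with the surrogate variables appended to the design matrix (here `np.hstack([design, sv])`).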

ADD COMMENT • link 7.8 years ago Guido Hooiveld ★ 4.1k

Thanks, but do you know why we shouldn't use the cleaned data for differential expression, yet can use it for visualisations?

ADD REPLY • link 7.8 years ago wamiqsaifi • 0

As is nicely phrased in e.g. this paper:

"... SVs can be regressed out of the data to obtain “cleaned” data for visualization (as we do in this report), however differential expression statistics should not be performed on this “clean” data, as this too can lead to anti-conservative bias resulting from between-sample correlation being introduced by regressing out the SVs and from inflating variance partitioning related to the effect of interest, as the total variance of the system has been reduced without being taken into account during the linear modeling."

From: Jaffe et al. Practical impacts of genomic data "cleaning" on biological discovery using surrogate variable analysis. BMC Bioinformatics. 2015 Nov 6;16:372. Link
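The anti-conservative bias described in the quote can be seen in a small single-gene sketch (Python/NumPy, simulated data, all names hypothetical). The group coefficient and the residuals are identical whether the SV is included as a covariate or regressed out beforehand. What changes is the standard error: the 'cleaned' analysis divides the same residual sum of squares by n - p instead of n - p - k degrees of freedom, and it ignores any correlation between the SV and the group variable. Both make the t-statistic larger in absolute value.

```python
import numpy as np

def t_stat(y, X, col):
    """Ordinary least-squares t-statistic for coefficient `col` in y ~ X."""
    n, p = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - p)        # residual variance on n - p df
    return beta[col] / np.sqrt(sigma2 * xtx_inv[col, col])

rng = np.random.default_rng(1)
n = 12
group = np.repeat([0.0, 1.0], n // 2)
design = np.column_stack([np.ones(n), group])            # intercept + group
sv = (0.8 * group + rng.normal(size=n)).reshape(-1, 1)   # SV partly correlated with group
y = 0.5 * group + 2.0 * sv[:, 0] + rng.normal(size=n)    # expression of one gene

# Recommended: keep the data as is and include the SV as a covariate.
full = np.hstack([design, sv])
t_full = t_stat(y, full, col=1)

# Not recommended: subtract the fitted SV term, then test with the design alone.
beta_full = np.linalg.lstsq(full, y, rcond=None)[0]
y_clean = y - sv[:, 0] * beta_full[2]
t_clean = t_stat(y_clean, design, col=1)

print(f"t with SV as covariate: {t_full:.3f}")
print(f"t on 'cleaned' data:    {t_clean:.3f}")           # larger in absolute value
```

This uses plain least squares rather than limma's moderated statistics, so it only illustrates the direction of the bias, not its size in a real analysis.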

See also, e.g., these threads: "batch effect : comBat or blocking in limma ?" and "Using of limma moderated t-test with "corrected" expression matrix resulting from ComBat batch effect correction".

Along the same lines:
Nygaard et al. Methods that remove batch effects while retaining group differences may lead to exaggerated confidence in downstream analyses. Biostatistics. 2016 Jan;17(1):29-39. Link

This thread at Biostars is also an interesting read, and links to the Nygaard paper.