I ran SVA to obtain surrogate variables, and I want to visualise my data while taking these surrogate variables into account. I have learned that modifying the original expression matrix is not a good idea (https://www.biostars.org/p/121489/). How should I go about this?
In the few cases where I need to visualize the 'cleaned' data, I use the approaches mentioned in the Biostars thread (AFAIK they are equivalent). However, if you want to identify differentially expressed genes, you should include the surrogate variables as covariates in your (linear) model rather than running the analysis on the 'cleaned' data. This is also what is stated in that thread.
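For visualization, one way to obtain 'cleaned' values is to fit the full model (effect of interest plus surrogate variables) for each gene and subtract only the fitted surrogate-variable part. Below is a minimal NumPy sketch of that idea. It is not the sva or limma API; the function name `regress_out_svs`, the simulated data, and the two-group design are assumptions made purely for illustration.

```python
import numpy as np

def regress_out_svs(expr, design, svs):
    """Remove the fitted surrogate-variable component from an expression matrix.

    expr   : genes x samples matrix (e.g. log-expression values)
    design : samples x p matrix for the effect of interest (incl. intercept)
    svs    : samples x k matrix of surrogate variables
    """
    expr = np.asarray(expr, dtype=float)
    full = np.hstack([design, svs])                        # samples x (p + k)
    coef, *_ = np.linalg.lstsq(full, expr.T, rcond=None)   # (p + k) x genes
    sv_coef = coef[design.shape[1]:, :]                    # k x genes
    # Subtract only the SV part, so the effect of interest stays in the data.
    return expr - (svs @ sv_coef).T

# Simulated example (hypothetical data): 50 genes, 12 samples, 2 groups, 2 SVs.
rng = np.random.default_rng(0)
n_genes, n_samples = 50, 12
group = np.repeat([0.0, 1.0], n_samples // 2)
design = np.column_stack([np.ones(n_samples), group])
svs = rng.normal(size=(n_samples, 2))
sv_loadings = rng.normal(size=(2, n_genes))
group_effect = rng.normal(size=n_genes)
noise = 0.1 * rng.normal(size=(n_samples, n_genes))
expr = (np.outer(group, group_effect) + svs @ sv_loadings + noise).T

cleaned = regress_out_svs(expr, design, svs)  # use for PCA, heatmaps, etc.
```

The matrix `cleaned` is meant for plots such as PCA or heatmaps only. Fitting the effect of interest and the surrogate variables jointly, rather than regressing the data on the surrogate variables alone, keeps the group difference from being partly absorbed when the two are correlated.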
Thanks, but do you know why we shouldn't use the cleaned data for differential expression, yet can use it for visualisation?
As is nicely phrased in e.g. this paper:
"... SVs can be regressed out of the data to obtain “cleaned” data for visualization (as we do in this report), however differential expression statistics should not be performed on this “clean” data, as this too can lead to anti-conservative bias resulting from between-sample correlation being introduced by regressing out the SVs and from inflating variance partitioning related to the effect of interest, as the total variance of the system has been reduced without being taken into account during the linear modeling."
From: Jaffe et al. Practical impacts of genomic data "cleaning" on biological discovery using surrogate variable analysis. BMC Bioinformatics. 2015 Nov 6;16:372.
See also, e.g., the threads 'batch effect : comBat or blocking in limma ?' and 'Using of limma moderated t-test with "corrected" expression matrix resulting from ComBat batch effect correction'.
Along the same line:
Nygaard et al. Methods that remove batch effects while retaining group differences may lead to exaggerated confidence in downstream analyses. Biostatistics. 2016 Jan;17(1):29-39.
This thread at Biostars is also an interesting read, and links to the Nygaard paper.
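The mechanical part of this argument can be illustrated for a single gene with ordinary least squares. The sketch below compares the t-statistic for the group effect when the surrogate variables are included as covariates with the t-statistic obtained from 'cleaned' data and a group-only model. It is a simplified illustration: the helper `ols_t`, the simulated values, and the use of plain OLS (rather than limma's moderated statistics) are all assumptions, and the surrogate variables are treated as known.

```python
import numpy as np

def ols_t(y, X, col):
    """Return (t statistic, residual df) for coefficient `col` of an OLS fit of y on X."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    df = n - p                                   # residual degrees of freedom
    sigma2 = resid @ resid / df                  # residual variance estimate
    se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[col, col])
    return beta[col] / se, df

# Hypothetical single gene: 12 samples, 2 groups, 3 surrogate variables.
rng = np.random.default_rng(1)
n = 12
group = np.repeat([0.0, 1.0], n // 2)
design = np.column_stack([np.ones(n), group])
svs = rng.normal(size=(n, 3))
y = 0.5 * group + svs @ np.array([1.0, -0.5, 0.8]) + rng.normal(scale=0.5, size=n)

# (a) Recommended: surrogate variables as covariates in the model.
full = np.hstack([design, svs])
t_cov, df_cov = ols_t(y, full, col=1)

# (b) Not recommended: regress out the SVs first, then test on the 'cleaned' data.
beta_full, *_ = np.linalg.lstsq(full, y, rcond=None)
y_clean = y - svs @ beta_full[2:]
t_clean, df_clean = ols_t(y_clean, design, col=1)

print(f"SVs as covariates: t = {t_cov:.2f}, residual df = {df_cov}")
print(f"'cleaned' data:    t = {t_clean:.2f}, residual df = {df_clean}")
```

In this setup both fits give the same group coefficient and the same residual sum of squares, but the model fitted to the 'cleaned' data does not know that three degrees of freedom were already spent on the surrogate variables. Its residual variance and standard error are therefore too small, and the t-statistic is larger in absolute value than the one from the model with covariates.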
Thanks! That helps.