Hello all,
I have a question regarding the sva function. To start with I am analysing a RNA seq dataset. My condition consists of 4 groups(CON_GM,CON_WM,MS_GM and MS_WM). CON is the control and MS is patients affected with multiple sclerosis. GM(grey matter) and WM(white matter) are cell types. I first normalize the data by deseq2 and then visualize the PCA. I can see that the data clusters according to cell type. I then use svaseq function to search for hidden batch effects. After that I use a function to remove the surrogate variables and then visualize the PCA . I do not use the cleaned data for downstream analysis. I just take the PCA to decide on the number of surrogates to be added as covariates when I will perform downstream analysis using deseq2. I found the function that removes batch effect from this link https://www.biostars.org/p/121489/ . I also used the function num.sv to get the total number of surrogates. The function returned a value of 18. Initially I removed only 2-3 surrogates . But I was unable to get 4 different clusters.I found that after removing 14 surrogate variables , I am able to see 4 separate groups. I repeated the same steps but this time I use sva function. When I used sva I only had to remove 6 surrogate variables to get 4 separate groups. As per the documentation it says that the sva
function can be used to estimate artifacts from microarray data the svaseq
function can be used to estimate artifacts from count-based RNA-sequencing (and other sequencing) data. I dont understand why I would need to remove more number of surrogate variables while using svaseq to get good clustering. Can we use sva to estimate artifacts for RNA-seq data?
Code:
#compute null and full model
mod =model.matrix(~merged, data=pheno)#merged is condition
mod0 = model.matrix(~1,data=pheno)
#find the number of surrogate variables
n.sv = num.sv(data,mod,method="leek")
#function that computes surrogate variables
#passing deseq2 normalized data(normalized by size factors) as input
svobj = svaseq(data_normalized, mod, mod0, n.sv=14)
#svobj = sva(data_normalized, mod, mod0, n.sv=6)
####clean function: removes unwanted variation from data##
cleanY = function(y, mod, svs) {
X = cbind(mod, svs)
Hat = solve(t(X) %*% X) %*% t(X)
beta = (Hat %*% t(y))
rm(Hat)
gc()
P = ncol(mod)
return(y - t(as.matrix(X[,-c(1:P)]) %*% beta[-c(1:P),]))
}
#calling the above function to remove unwanted variation
cleany = cleanY(data_normalized,mod,svobj$sv[,1:14])
#plotpca
> sessionInfo() R version 3.3.0 (2016-05-03) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 7 x64 (build 7601) Service Pack 1 locale: [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C [5] LC_TIME=English_United States.1252 attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets methods base other attached packages: [1] sva_3.20.0 mgcv_1.8-13 nlme_3.1-128 [4] ggplot2_2.1.0 rgl_0.95.1441 genefilter_1.54.2 [7] DESeq2_1.12.3 SummarizedExperiment_1.2.3 Biobase_2.32.0 [10] GenomicRanges_1.24.2 RColorBrewer_1.1-2 GenomeInfoDb_1.8.3 [13] IRanges_2.6.1 S4Vectors_0.10.2 BiocGenerics_0.18.0 [16] BiocInstaller_1.22.3 loaded via a namespace (and not attached): [1] Rcpp_0.12.6 plyr_1.8.4 XVector_0.12.1 tools_3.3.0 [5] zlibbioc_1.18.0 digest_0.6.10 rpart_4.1-10 RSQLite_1.0.0 [9] annotate_1.50.0 gtable_0.2.0 lattice_0.20-33 Matrix_1.2-6 [13] DBI_0.4-1 gridExtra_2.2.1 cluster_2.0.4 locfit_1.5-9.1 [17] grid_3.3.0 nnet_7.3-12 data.table_1.9.6 AnnotationDbi_1.34.4 [21] XML_3.98-1.4 survival_2.39-5 BiocParallel_1.6.4 foreign_0.8-66 [25] latticeExtra_0.6-28 Formula_1.2-1 geneplotter_1.50.0 Hmisc_3.17-4 [29] scales_0.4.0 splines_3.3.0 colorspace_1.2-6 xtable_1.8-2 [33] labeling_0.3 acepack_1.3-3.3 munsell_0.4.3 chron_2.3-47
Thank you jeff. Yes I now transformed the data using deseq2 and then used sva. Now I am getting the same result as earlier. Just to confirm . sva needs transformed data like log2 or rlog or voom tranformation as input. svaseq takes count data and then log transforms it. So the difference would be in the transformation we are using in these cases. Therefore I get different results when I take the PCA after batch correction. Did I understand correctly?can we give deseq2 normalized data as input to svaseq since deseq2 normalization only corrects for size factors?