Search
Question: can we use sva function to estimate artifacts for rna seq data?
0
2.2 years ago by
szenitha0
szenitha0 wrote:

Hello all,

I have a question regarding the sva function. To start with I am analysing a RNA seq dataset. My condition consists of 4 groups(CON_GM,CON_WM,MS_GM and MS_WM). CON is the control and MS is patients affected with multiple sclerosis. GM(grey matter) and WM(white matter) are cell types. I first normalize the data by deseq2 and then visualize the PCA. I can see that the data clusters according to cell type.  I then use svaseq function to search for hidden batch effects. After that I use a function to remove the surrogate variables and then visualize the PCA . I do not use the cleaned data for downstream analysis. I just take the PCA  to decide on the number of surrogates to be added as covariates when I will perform downstream analysis using deseq2. I found the function that removes batch effect from this link https://www.biostars.org/p/121489/  . I also used the function num.sv to get the total number of surrogates. The function returned a value of 18.  Initially I removed only 2-3 surrogates . But I was unable to get 4 different clusters.I found that after removing 14 surrogate variables , I am able to see 4 separate groups. I repeated the same steps but this time I use sva function. When I used sva I only had to remove 6 surrogate variables to get 4 separate groups. As per the documentation it says that the sva function can be used to estimate artifacts from microarray data the svaseq function can be used to estimate artifacts from count-based RNA-sequencing (and other sequencing) data.  I dont understand why I would need to remove more number of surrogate variables while using svaseq to get good clustering. Can we use sva to estimate artifacts for RNA-seq data?

Code:

#compute null and full model
mod =model.matrix(~merged, data=pheno)#merged is condition
mod0 = model.matrix(~1,data=pheno)

#find the number of surrogate variables
n.sv = num.sv(data,mod,method="leek")

#function that computes surrogate variables
#passing deseq2 normalized data(normalized by size factors) as input

svobj = svaseq(data_normalized, mod, mod0, n.sv=14)

#svobj = sva(data_normalized, mod, mod0, n.sv=6)

####clean function: removes unwanted variation from data##
cleanY = function(y, mod, svs) {
X = cbind(mod, svs)
Hat = solve(t(X) %*% X) %*% t(X)
beta = (Hat %*% t(y))
rm(Hat)
gc()
P = ncol(mod)
return(y - t(as.matrix(X[,-c(1:P)]) %*% beta[-c(1:P),]))
}

#calling the above function to remove unwanted variation
cleany = cleanY(data_normalized,mod,svobj$sv[,1:14]) #plotpca > sessionInfo() R version 3.3.0 (2016-05-03) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 7 x64 (build 7601) Service Pack 1 locale: [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C [5] LC_TIME=English_United States.1252 attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets methods base other attached packages: [1] sva_3.20.0 mgcv_1.8-13 nlme_3.1-128 [4] ggplot2_2.1.0 rgl_0.95.1441 genefilter_1.54.2 [7] DESeq2_1.12.3 SummarizedExperiment_1.2.3 Biobase_2.32.0 [10] GenomicRanges_1.24.2 RColorBrewer_1.1-2 GenomeInfoDb_1.8.3 [13] IRanges_2.6.1 S4Vectors_0.10.2 BiocGenerics_0.18.0 [16] BiocInstaller_1.22.3 loaded via a namespace (and not attached): [1] Rcpp_0.12.6 plyr_1.8.4 XVector_0.12.1 tools_3.3.0 [5] zlibbioc_1.18.0 digest_0.6.10 rpart_4.1-10 RSQLite_1.0.0 [9] annotate_1.50.0 gtable_0.2.0 lattice_0.20-33 Matrix_1.2-6 [13] DBI_0.4-1 gridExtra_2.2.1 cluster_2.0.4 locfit_1.5-9.1 [17] grid_3.3.0 nnet_7.3-12 data.table_1.9.6 AnnotationDbi_1.34.4 [21] XML_3.98-1.4 survival_2.39-5 BiocParallel_1.6.4 foreign_0.8-66 [25] latticeExtra_0.6-28 Formula_1.2-1 geneplotter_1.50.0 Hmisc_3.17-4 [29] scales_0.4.0 splines_3.3.0 colorspace_1.2-6 xtable_1.8-2 [33] labeling_0.3 acepack_1.3-3.3 munsell_0.4.3 chron_2.3-47  ADD COMMENTlink modified 2.2 years ago by Jeff Leek520 • written 2.2 years ago by szenitha0 0 2.2 years ago by Jeff Leek520 United States Jeff Leek520 wrote: Hi sva and svaseq do nearly the same thing, except svaseq performs a log transform before applying sva. If you didn't take the log transform before performing sva you may get different results. If you are using count data we suggest you either perform an appropriate transform of the data (log is only one option here, DESeq2 has more clever transforms you could use) before using the sva function. I hope this helps! Jeff On Mon, Aug 8, 2016 at 11:58 AM szenitha [bioc] <noreply@bioconductor.org> wrote: > Activity on a post you are following on support.bioconductor.org > > User szenitha <https: support.bioconductor.org="" u="" 10863=""/> wrote Question: > can we use sva function to estimate artifacts for rna seq data? > <https: support.bioconductor.org="" p="" 85905=""/>: > > Hello all, > > I have a question regarding the sva function. To start with I am analysing > a RNA seq dataset. My condition consists of 4 groups(CON_GM,CON_WM,MS_GM > and MS_WM). CON is the control and MS is patients affected with multiple > sclerosis. GM(grey matter) and WM(white matter) are cell types. I first > normalize the data by deseq2 and then visualize the PCA. I can see that the > data clusters according to cell type. I then use svaseq function to search > for hidden batch effects. After that I use a function to remove the > surrogate variables and then visualize the PCA . I do not use the cleaned > data for downstream analysis. I just take the PCA to decide on the number > of surrogates to be added as covariates when I will perform downstream > analysis using deseq2. I found the function that removes batch effect from > this link https://www.biostars.org/p/121489/ . I also used the function > num.sv to get the total number of surrogates. The function returned a > value of 18. Initially I removed only 2-3 surrogates . But I was unable to > get 4 different clusters.I found that after removing 14 surrogate variables > , I am able to see 4 separate groups. I repeated the same steps but this > time I use sva function. When I used sva I only had to remove 6 surrogate > variables to get 4 separate groups. As per the documentation it says that > the sva <http: 127.0.0.1:47764="" help="" library="" sva="" help="" sva=""> function can > be used to estimate artifacts from microarray data the svaseq > <http: 127.0.0.1:47764="" help="" library="" sva="" help="" svaseq=""> function can be > used to estimate artifacts from count-based RNA-sequencing (and other > sequencing) data. I dont understand why I would need to remove more number > of surrogate variables while using svaseq to get good clustering. Can we > use sva to estimate artifacts for RNA-seq data? > > Code: > > #compute null and full model > mod =model.matrix(~merged, data=pheno)#merged is condition > mod0 = model.matrix(~1,data=pheno) > > #find the number of surrogate variables > n.sv = num.sv(data,mod,method="leek") > > > #function that computes surrogate variables > #passing deseq2 normalized data(normalized by size factors) as input > > svobj = svaseq(data_normalized, mod, mod0, n.sv=14) > > #svobj = sva(data_normalized, mod, mod0, n.sv=6) > > ####clean function: removes unwanted variation from data## > cleanY = function(y, mod, svs) { > X = cbind(mod, svs) > Hat = solve(t(X) %*% X) %*% t(X) > beta = (Hat %*% t(y)) > rm(Hat) > gc() > P = ncol(mod) > return(y - t(as.matrix(X[,-c(1:P)]) %*% beta[-c(1:P),])) > } > > #calling the above function to remove unwanted variation > cleany = cleanY(data_normalized,mod,svobj$sv[,1:14]) > > #plotpca > > > sessionInfo() > R version 3.3.0 (2016-05-03) > Platform: x86_64-w64-mingw32/x64 (64-bit) > Running under: Windows 7 x64 (build 7601) Service Pack 1 > > locale: > [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 > [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C > [5] LC_TIME=English_United States.1252 > > attached base packages: > [1] parallel stats4 stats graphics grDevices utils datasets methods base > > other attached packages: > [1] sva_3.20.0 mgcv_1.8-13 nlme_3.1-128 > [4] ggplot2_2.1.0 rgl_0.95.1441 genefilter_1.54.2 > [7] DESeq2_1.12.3 SummarizedExperiment_1.2.3 Biobase_2.32.0 > [10] GenomicRanges_1.24.2 RColorBrewer_1.1-2 GenomeInfoDb_1.8.3 > [13] IRanges_2.6.1 S4Vectors_0.10.2 BiocGenerics_0.18.0 > [16] BiocInstaller_1.22.3 > > loaded via a namespace (and not attached): > [1] Rcpp_0.12.6 plyr_1.8.4 XVector_0.12.1 tools_3.3.0 > [5] zlibbioc_1.18.0 digest_0.6.10 rpart_4.1-10 RSQLite_1.0.0 > [9] annotate_1.50.0 gtable_0.2.0 lattice_0.20-33 Matrix_1.2-6 > [13] DBI_0.4-1 gridExtra_2.2.1 cluster_2.0.4 locfit_1.5-9.1 > [17] grid_3.3.0 nnet_7.3-12 data.table_1.9.6 AnnotationDbi_1.34.4 > [21] XML_3.98-1.4 survival_2.39-5 BiocParallel_1.6.4 foreign_0.8-66 > [25] latticeExtra_0.6-28 Formula_1.2-1 geneplotter_1.50.0 Hmisc_3.17-4 > [29] scales_0.4.0 splines_3.3.0 colorspace_1.2-6 xtable_1.8-2 > [33] labeling_0.3 acepack_1.3-3.3 munsell_0.4.3 chron_2.3-47 > > ------------------------------ > > Post tags: sva, svaseq, batcheffectcorrection > > You may reply via email or visit can we use sva function to estimate artifacts for rna seq data? >

Thank you jeff. Yes I now transformed the data using deseq2 and then used sva. Now I am getting the same result as earlier. Just to confirm . sva needs transformed data like log2 or rlog or voom tranformation as input. svaseq takes count data and then log transforms it. So the difference would be in the transformation we are using in these cases. Therefore I get different results when I take the PCA after batch correction. Did I understand correctly?can we give deseq2 normalized data as input to svaseq since deseq2 normalization only corrects for size factors?