Hi All,
I did DESeq2 on my set of samples and I got only two DE genes. When I checked I found that my samples were collected at different time points and in PCA a prominent batch effect is shown. So, I included a batch in my design
ddsHTSeq <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable,
directory = directory,
design= ~ batch + condition)
where batch is batch [1] 1 2 3 1 2 3 Levels: 1 2 3
I get 15 DE genes after including design. So far all good.
But then I read another post where we can adjust the batch effect using 'SVA' and I tried it also. See below:
dat <- counts(dds, normalized = TRUE)
idx <- rowMeans(dat) > 1
dat <- dat[idx, ]
mod <- model.matrix(~ condition, colData(dds))
mod0 <- model.matrix(~ 1, colData(dds))
n.sv <- num.sv(dat,mod, method = 'leek')
n.sv
svseq <- svaseq(dat, mod, mod0, n.sv = 3)
ddssva <- dds
ddssva$SV1 <- svseq$sv[,1]
ddssva$SV2 <- svseq$sv[,2]
ddssva$SV3 <- svseq$sv[,3]
design(ddssva) <- ~ SV1 + SV2 + SV3 + condition
ddssva <- DESeq(ddssva)
resSV <- results(ddssva)
n.sv was 4 but while using 4 it throws the below error so I used 3 and it worked fine
svseq <- svaseq(dat, mod, mod0, n.sv = 4)
Number of significant surrogate variables is: 4
Iteration (out of 5 ):Error in density.default(x, adjust = adj) : 'x' contains missing values
In addition: Warning message:
In pf(fstats, df1 = (df1 - df0), df2 = (n - df1)) : NaNs produced
Using this method (sva) I am getting 51 DE genes and only 5 genes are common between this and if using batch in design. My question is that why there is a difference between the results of the two methods and which method should I use form my downstream analysis.
sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] sva_3.30.1 genefilter_1.64.0 mgcv_1.8-24 nlme_3.1-137
[5] openxlsx_4.1.0 XLConnect_0.2-15 XLConnectJars_0.2-15 xlsx_0.6.1
[9] tibble_2.0.1 biomaRt_2.38.0 gplots_3.0.1 ggplot2_3.1.0
[13] DESeq2_1.22.2 SummarizedExperiment_1.12.0 DelayedArray_0.8.0 BiocParallel_1.16.5
[17] matrixStats_0.54.0 Biobase_2.42.0 GenomicRanges_1.34.0 GenomeInfoDb_1.18.1
[21] IRanges_2.16.0 S4Vectors_0.20.1 BiocGenerics_0.28.0
loaded via a namespace (and not attached):
[1] bitops_1.0-6 bit64_0.9-7 RColorBrewer_1.1-2 progress_1.2.0 httr_1.4.0
[6] tools_3.5.1 backports_1.1.3 R6_2.3.0 rpart_4.1-13 KernSmooth_2.23-15
[11] Hmisc_4.1-1 DBI_1.0.0 lazyeval_0.2.1 colorspace_1.4-0 nnet_7.3-12
[16] withr_2.1.2 tidyselect_0.2.5 gridExtra_2.3 prettyunits_1.0.2 bit_1.1-14
[21] curl_3.3 compiler_3.5.1 htmlTable_1.13.1 labeling_0.3 caTools_1.17.1.1
[26] scales_1.0.0 checkmate_1.9.1 stringr_1.3.1 digest_0.6.18 foreign_0.8-70
[31] XVector_0.22.0 base64enc_0.1-3 pkgconfig_2.0.2 htmltools_0.3.6 limma_3.38.3
[36] htmlwidgets_1.3 rlang_0.3.1 rstudioapi_0.9.0 RSQLite_2.1.1 bindr_0.1.1
[41] gtools_3.8.1 zip_1.0.0 acepack_1.4.1 dplyr_0.7.8 RCurl_1.95-4.11
[46] magrittr_1.5 GenomeInfoDbData_1.2.0 Formula_1.2-3 Matrix_1.2-14 Rcpp_1.0.0
[51] munsell_0.5.0 stringi_1.2.4 yaml_2.2.0 zlibbioc_1.28.0 plyr_1.8.4
[56] grid_3.5.1 blob_1.1.1 gdata_2.18.0 crayon_1.3.4 lattice_0.20-35
[61] splines_3.5.1 xlsxjars_0.6.1 annotate_1.60.0 hms_0.4.2 locfit_1.5-9.1
[66] knitr_1.21 pillar_1.3.1 geneplotter_1.60.0 XML_3.98-1.16 glue_1.3.0
[71] latticeExtra_0.6-28 data.table_1.12.0 BiocManager_1.30.4 gtable_0.2.0 purrr_0.2.5
[76] assertthat_0.2.0 xfun_0.4 xtable_1.8-3 survival_2.42-3 rJava_0.9-10
[81] AnnotationDbi_1.44.0 memoise_1.1.0 bindrcpp_0.2.2 cluster_2.0.7-1
Any help would be appreciated! Thanks!
Thanks, Michael for the answer.
I was not sure why I am getting such different numbers. I would go ahead with ~ batch + condition.
Also when I use n.sv = 2 and use only two SVs I only get 4 DE genes.
Is it recommended to use high n.sv or low n.sv when running svaseq. Any suggestion?
I’ll let the SVA devels handle that one.