Question

Batch-effect: batch in design and corrected batch with sva- DESeq2

0

Entering edit mode

bioinfo • 0

@bioinfo-19580

Last seen 7.0 years ago

Hi All,

I did DESeq2 on my set of samples and I got only two DE genes. When I checked I found that my samples were collected at different time points and in PCA a prominent batch effect is shown. So, I included a batch in my design

ddsHTSeq <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable,
                                   directory = directory,
                                   design= ~ batch + condition)

where batch is batch [1] 1 2 3 1 2 3 Levels: 1 2 3

I get 15 DE genes after including design. So far all good.

But then I read another post where we can adjust the batch effect using 'SVA' and I tried it also. See below:

       dat  <- counts(dds, normalized = TRUE)
        idx  <- rowMeans(dat) > 1
        dat  <- dat[idx, ]
        mod  <- model.matrix(~ condition, colData(dds))
        mod0 <- model.matrix(~   1, colData(dds))

        n.sv <- num.sv(dat,mod, method = 'leek')
        n.sv

     svseq <- svaseq(dat, mod, mod0, n.sv = 3)

    ddssva <- dds
    ddssva$SV1 <- svseq$sv[,1]
    ddssva$SV2 <- svseq$sv[,2]
    ddssva$SV3 <- svseq$sv[,3]
    design(ddssva) <- ~ SV1 + SV2 + SV3 + condition

 ddssva <- DESeq(ddssva) 
 resSV <- results(ddssva)

n.sv was 4 but while using 4 it throws the below error so I used 3 and it worked fine

 svseq <- svaseq(dat, mod, mod0, n.sv = 4)
    Number of significant surrogate variables is:  4 
    Iteration (out of 5 ):Error in density.default(x, adjust = adj) : 'x' contains missing values
    In addition: Warning message:
    In pf(fstats, df1 = (df1 - df0), df2 = (n - df1)) : NaNs produced

Using this method (sva) I am getting 51 DE genes and only 5 genes are common between this and if using batch in design. My question is that why there is a difference between the results of the two methods and which method should I use form my downstream analysis.

 sessionInfo()
    R version 3.5.1 (2018-07-02)
    Platform: x86_64-w64-mingw32/x64 (64-bit)
    Running under: Windows >= 8 x64 (build 9200)

    Matrix products: default

    locale:
    [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
    [4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

    attached base packages:
    [1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

    other attached packages:
     [1] sva_3.30.1                  genefilter_1.64.0           mgcv_1.8-24                 nlme_3.1-137               
     [5] openxlsx_4.1.0              XLConnect_0.2-15            XLConnectJars_0.2-15        xlsx_0.6.1                 
     [9] tibble_2.0.1                biomaRt_2.38.0              gplots_3.0.1                ggplot2_3.1.0              
    [13] DESeq2_1.22.2               SummarizedExperiment_1.12.0 DelayedArray_0.8.0          BiocParallel_1.16.5        
    [17] matrixStats_0.54.0          Biobase_2.42.0              GenomicRanges_1.34.0        GenomeInfoDb_1.18.1        
    [21] IRanges_2.16.0              S4Vectors_0.20.1            BiocGenerics_0.28.0        

    loaded via a namespace (and not attached):
     [1] bitops_1.0-6           bit64_0.9-7            RColorBrewer_1.1-2     progress_1.2.0         httr_1.4.0            
     [6] tools_3.5.1            backports_1.1.3        R6_2.3.0               rpart_4.1-13           KernSmooth_2.23-15    
    [11] Hmisc_4.1-1            DBI_1.0.0              lazyeval_0.2.1         colorspace_1.4-0       nnet_7.3-12           
    [16] withr_2.1.2            tidyselect_0.2.5       gridExtra_2.3          prettyunits_1.0.2      bit_1.1-14            
    [21] curl_3.3               compiler_3.5.1         htmlTable_1.13.1       labeling_0.3           caTools_1.17.1.1      
    [26] scales_1.0.0           checkmate_1.9.1        stringr_1.3.1          digest_0.6.18          foreign_0.8-70        
    [31] XVector_0.22.0         base64enc_0.1-3        pkgconfig_2.0.2        htmltools_0.3.6        limma_3.38.3          
    [36] htmlwidgets_1.3        rlang_0.3.1            rstudioapi_0.9.0       RSQLite_2.1.1          bindr_0.1.1           
    [41] gtools_3.8.1           zip_1.0.0              acepack_1.4.1          dplyr_0.7.8            RCurl_1.95-4.11       
    [46] magrittr_1.5           GenomeInfoDbData_1.2.0 Formula_1.2-3          Matrix_1.2-14          Rcpp_1.0.0            
    [51] munsell_0.5.0          stringi_1.2.4          yaml_2.2.0             zlibbioc_1.28.0        plyr_1.8.4            
    [56] grid_3.5.1             blob_1.1.1             gdata_2.18.0           crayon_1.3.4           lattice_0.20-35       
    [61] splines_3.5.1          xlsxjars_0.6.1         annotate_1.60.0        hms_0.4.2              locfit_1.5-9.1        
    [66] knitr_1.21             pillar_1.3.1           geneplotter_1.60.0     XML_3.98-1.16          glue_1.3.0            
    [71] latticeExtra_0.6-28    data.table_1.12.0      BiocManager_1.30.4     gtable_0.2.0           purrr_0.2.5           
    [76] assertthat_0.2.0       xfun_0.4               xtable_1.8-3           survival_2.42-3        rJava_0.9-10          
    [81] AnnotationDbi_1.44.0   memoise_1.1.0          bindrcpp_0.2.2         cluster_2.0.7-1

Any help would be appreciated! Thanks!

deseq2 svaseq • 2.4k views

ADD COMMENT • link 7.0 years ago bioinfo • 0

score 2 · Answer 1 · 2019-01-23

2

Entering edit mode

Michael Love 43k

@mikelove

Last seen 4 hours ago

United States

With known batches, I tend to favor ~batch + condition. I'm not sure why you would get such a small overlap.

ADD COMMENT • link 7.0 years ago Michael Love 43k

0

Entering edit mode

Thanks, Michael for the answer.

I was not sure why I am getting such different numbers. I would go ahead with ~ batch + condition.

Also when I use n.sv = 2 and use only two SVs I only get 4 DE genes.

Is it recommended to use high n.sv or low n.sv when running svaseq. Any suggestion?