DESeq2 results change dramatically after scaling + centering of numeric variable
Entering edit mode
Last seen 4 weeks ago

To adjust for the data quality, I added the numerical variable percent_mapped to my model:

design = ~ variable_of_interest + percent_mapped
dds = DESeq2::DESeqDataSet(gse, design)

During building of the DESeqDataSet object I got the following warning:

the design formula contains one or more numeric variables that have mean or standard deviation larger than 5 (an arbitrary threshold to trigger this message). it is generally a good idea to center and scale numeric variables in the design to improve GLM convergence.

Since variable_of_interest is categorical, this warning came up because of percent_mapped. The mean of percent_mapped is 91.44303 (standard deviation = 1.369892).

I performed the differential expression analysis (a) without and (b) with scaling + centering and (c) excluding the percent_mapped variable. Code I used for centering and scaling:

colData(gse)$percent_mapped = scale(colData(gse)$percent_mapped, center=TRUE)

Differential expression analysis:

dds = DESeq(dds, parallel = TRUE)
res = results(dds, contrast = c(variable_of_interest , group1, group2), alpha = 0.05, parallel = TRUE)

For my gene of interest, I got results which are greatly differing between (a) without scaling and (b) + (c) with scaling/centering or by exluding percent_mapped:

(a) baseMean: 1058.808 log2FoldChange: 1.550588 lfcSE: 3.630939 stat: 0.4270487 pvalue: 0.6693438 padj: 0.8398124

(b) baseMean: 1058.808 log2FoldChange: 2.579453 lfcSE: 0.4652964 stat: 5.543676 pvalue: 2.961866e-08 padj: 1.332327e-05

(c) baseMean: 1058.808 log2FoldChange: 2.800934 lfcSE: 0.4835646 stat: 5.792265 pvalue: 6.944332e-09 padj: 4.909011e-06

How can this be? I it is unclear to me, how scaling and centering can effect the DESeq2 results. Can anyone explain what happened? I was assuming that scaling + centering will speed up the fitting of the model and don't affect the results.

Is it recommended to scale + center numeric variables in general?

sessionInfo( )

R version 4.0.1 (2020-06-06)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS:   /pstore/apps/R/4.0.1-intel-2018b/lib64/R/lib/
LAPACK: /pstore/apps/R/4.0.1-intel-2018b/lib64/R/modules/

 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            

attached base packages:
 [1] parallel  stats4    grid      stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] rtracklayer_1.50.0          BiocParallel_1.24.1         DESeq2_1.30.0               SummarizedExperiment_1.20.0 Biobase_2.50.0             
 [6] MatrixGenerics_1.2.0        matrixStats_0.57.0          GenomicRanges_1.42.0        GenomeInfoDb_1.26.2         IRanges_2.24.1             
[11] S4Vectors_0.28.1            BiocGenerics_0.36.0         ggvenn_0.1.7                ggplot2_3.3.3               UpSetR_1.4.0               
[16] tidyr_1.1.2                 msigdbr_7.2.1               biomaRt_2.46.1              ArvadosR_0.0.7              dplyr_1.0.2                

loaded via a namespace (and not attached):
 [1] httr_1.4.2               jsonlite_1.7.2           bit64_4.0.5              splines_4.0.1            assertthat_0.2.1         askpass_1.1             
 [7] BiocManager_1.30.10      BiocFileCache_1.14.0     blob_1.2.1               Rsamtools_2.6.0          GenomeInfoDbData_1.2.4   progress_1.2.2          
[13] pillar_1.4.7             RSQLite_2.2.2            lattice_0.20-41          glue_1.4.2               digest_0.6.27            RColorBrewer_1.1-2      
[19] XVector_0.30.0           colorspace_2.0-0         plyr_1.8.6               Matrix_1.2-18            XML_3.99-0.5             pkgconfig_2.0.3         
[25] genefilter_1.72.0        zlibbioc_1.36.0          purrr_0.3.4              xtable_1.8-6             scales_1.1.1             tibble_3.0.4            
[31] openssl_1.4.3            annotate_1.68.0          generics_0.1.0           ellipsis_0.3.1           withr_2.3.0              cli_2.2.0               
[37] survival_3.1-12          magrittr_2.0.1           crayon_1.3.4             memoise_1.1.0            fansi_0.4.1              xml2_1.3.2              
[43] tools_4.0.1              prettyunits_1.1.1        hms_0.5.3                lifecycle_0.2.0          stringr_1.4.0            munsell_0.5.0           
[49] locfit_1.5-9.4           DelayedArray_0.16.0      Biostrings_2.58.0        AnnotationDbi_1.52.0     compiler_4.0.1           rlang_0.4.10            
[55] RCurl_1.98-1.2           rstudioapi_0.13          rappdirs_0.3.1           bitops_1.0-6             gtable_0.3.0             DBI_1.1.0               
[61] curl_4.3                 R6_2.5.0                 GenomicAlignments_1.26.0 gridExtra_2.3            bit_4.0.4                stringi_1.5.3           
[67] Rcpp_1.0.5               vctrs_0.3.6              geneplotter_1.68.0       dbplyr_2.0.0             tidyselect_1.1.0
DESeq2 • 181 views
Entering edit mode

Hi, I will wait for Mike to respond; however, I see no need to include percent_mapped in your design formula. To adjust for quality, some people include, e.g., RIN (RNA Integrity Number). The alignment % (percent_mapped) is not strictly a measure of DNA quality and relates to other things.

Irrespective, by scaling the variable, you are completely altering its distribution. By default, percent_mapped is measured from 0-100% and is assuredly heavily right-skewed when plot via histogram. When you scale it, you are centering it at 0 and essentially transforming it to Z-scores; so, its distribution changes dramatically. Plot it via histogram (before / after) to see this.

Entering edit mode

I wrote data quality. Scaling and centering are not new to me.

Entering edit mode

You don't need to include percent_mapped in order to adjust for quality. By including it, even as scaled, you are likely introducing bias into the statistical inferences.

Entering edit mode
Last seen 2 minutes ago
United States

When you add a covariate with a large mean and small SD, you are creating a near collinearity with the intercept. This is similar to a nearly confounded design: the coefficients can trade off and give the same likelihood, so you are impairing the estimation of the coefficients. Collinearity is part of a typical linear models course.

It is simple to resolve: don't include covariates with a large mean value, just center and scale the continuous-valued covariates before adding to the design. Yes it is recommended.

This is why DESeq2 prints the message to center and scale covariates when the mean is large.


Login before adding your answer.

Similar Posts
Loading Similar Posts
Traffic: 217 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.4