To adjust for the data quality, I added the numerical variable percent_mapped
to my model:
design = ~ variable_of_interest + percent_mapped
dds = DESeq2::DESeqDataSet(gse, design)
During building of the DESeqDataSet
object I got the following warning:
the design formula contains one or more numeric variables that have mean or standard deviation larger than 5 (an arbitrary threshold to trigger this message). it is generally a good idea to center and scale numeric variables in the design to improve GLM convergence.
Since variable_of_interest
is categorical, this warning came up because of percent_mapped
. The mean of percent_mapped
is 91.44303 (standard deviation = 1.369892).
I performed the differential expression analysis (a) without and (b) with scaling + centering and (c) excluding the percent_mapped
variable. Code I used for centering and scaling:
colData(gse)$percent_mapped = scale(colData(gse)$percent_mapped, center=TRUE)
Differential expression analysis:
register(MulticoreParam(5))
dds = DESeq(dds, parallel = TRUE)
res = results(dds, contrast = c(variable_of_interest , group1, group2), alpha = 0.05, parallel = TRUE)
res["gene_of_interest",]
For my gene of interest, I got results which are greatly differing between (a) without scaling and (b) + (c) with scaling/centering or by exluding percent_mapped
:
(a) baseMean: 1058.808 log2FoldChange: 1.550588 lfcSE: 3.630939 stat: 0.4270487 pvalue: 0.6693438 padj: 0.8398124
(b) baseMean: 1058.808 log2FoldChange: 2.579453 lfcSE: 0.4652964 stat: 5.543676 pvalue: 2.961866e-08 padj: 1.332327e-05
(c) baseMean: 1058.808 log2FoldChange: 2.800934 lfcSE: 0.4835646 stat: 5.792265 pvalue: 6.944332e-09 padj: 4.909011e-06
How can this be? I it is unclear to me, how scaling and centering can effect the DESeq2 results. Can anyone explain what happened? I was assuming that scaling + centering will speed up the fitting of the model and don't affect the results.
Is it recommended to scale + center numeric variables in general?
sessionInfo( )
R version 4.0.1 (2020-06-06)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS: /pstore/apps/R/4.0.1-intel-2018b/lib64/R/lib/libR.so
LAPACK: /pstore/apps/R/4.0.1-intel-2018b/lib64/R/modules/lapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8
[6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 grid stats graphics grDevices utils datasets methods base
other attached packages:
[1] rtracklayer_1.50.0 BiocParallel_1.24.1 DESeq2_1.30.0 SummarizedExperiment_1.20.0 Biobase_2.50.0
[6] MatrixGenerics_1.2.0 matrixStats_0.57.0 GenomicRanges_1.42.0 GenomeInfoDb_1.26.2 IRanges_2.24.1
[11] S4Vectors_0.28.1 BiocGenerics_0.36.0 ggvenn_0.1.7 ggplot2_3.3.3 UpSetR_1.4.0
[16] tidyr_1.1.2 msigdbr_7.2.1 biomaRt_2.46.1 ArvadosR_0.0.7 dplyr_1.0.2
loaded via a namespace (and not attached):
[1] httr_1.4.2 jsonlite_1.7.2 bit64_4.0.5 splines_4.0.1 assertthat_0.2.1 askpass_1.1
[7] BiocManager_1.30.10 BiocFileCache_1.14.0 blob_1.2.1 Rsamtools_2.6.0 GenomeInfoDbData_1.2.4 progress_1.2.2
[13] pillar_1.4.7 RSQLite_2.2.2 lattice_0.20-41 glue_1.4.2 digest_0.6.27 RColorBrewer_1.1-2
[19] XVector_0.30.0 colorspace_2.0-0 plyr_1.8.6 Matrix_1.2-18 XML_3.99-0.5 pkgconfig_2.0.3
[25] genefilter_1.72.0 zlibbioc_1.36.0 purrr_0.3.4 xtable_1.8-6 scales_1.1.1 tibble_3.0.4
[31] openssl_1.4.3 annotate_1.68.0 generics_0.1.0 ellipsis_0.3.1 withr_2.3.0 cli_2.2.0
[37] survival_3.1-12 magrittr_2.0.1 crayon_1.3.4 memoise_1.1.0 fansi_0.4.1 xml2_1.3.2
[43] tools_4.0.1 prettyunits_1.1.1 hms_0.5.3 lifecycle_0.2.0 stringr_1.4.0 munsell_0.5.0
[49] locfit_1.5-9.4 DelayedArray_0.16.0 Biostrings_2.58.0 AnnotationDbi_1.52.0 compiler_4.0.1 rlang_0.4.10
[55] RCurl_1.98-1.2 rstudioapi_0.13 rappdirs_0.3.1 bitops_1.0-6 gtable_0.3.0 DBI_1.1.0
[61] curl_4.3 R6_2.5.0 GenomicAlignments_1.26.0 gridExtra_2.3 bit_4.0.4 stringi_1.5.3
[67] Rcpp_1.0.5 vctrs_0.3.6 geneplotter_1.68.0 dbplyr_2.0.0 tidyselect_1.1.0
Hi, I will wait for Mike to respond; however, I see no need to include
percent_mapped
in your design formula. To adjust for quality, some people include, e.g., RIN (RNA Integrity Number). The alignment % (percent_mapped
) is not strictly a measure of DNA quality and relates to other things.Irrespective, by scaling the variable, you are completely altering its distribution. By default,
percent_mapped
is measured from 0-100% and is assuredly heavily right-skewed when plot via histogram. When you scale it, you are centering it at 0 and essentially transforming it to Z-scores; so, its distribution changes dramatically. Plot it via histogram (before / after) to see this.I wrote data quality. Scaling and centering are not new to me.
You don't need to include
percent_mapped
in order to adjust for quality. By including it, even as scaled, you are likely introducing bias into the statistical inferences.