Unexpected rlog values when count spread is large
1
0
Entering edit mode
eric.blanc • 0
@ericblanc-11613
Last seen 13 months ago

Hi,

I am encountering an unexpected behaviour with regularized logarithmic values for one gene, when there is:

• a large number of conditions (12),
• limited replication number (3),
• the read counts in most samples is 0, and
• there is one sample for which the read count for the gene is high.

When these circumstances are met, the regularized logarithm value for this gene in the outlier condition is much higher than any other value in the dataset.

This effect can be seen in the toy example below, using the highest expressed gene in my dataset, the gene with the unexpected behaviour, and 10 genes selected at random to generate the dds object. Please note that the number of reads in sample A_b_1 for gene Question is much higher (2471) than in all other samples, and that its behaviour is just as unexpected in my full dataset (60321 genes in 107 samples).

E <- rbind(
HighExpr=c(14413010, 14729567, 16831319, 9285, 42687, 208961, 7768481, 12181302, 10001122, 1275, 19230, 2846911, 15888015, 15805554, 13694644, 4124374, 43071, 16584140, 17740751, 15121608, 740, 675, 1093, 15783974, 14289011, 13051649, 1432938, 2229190, 2194025, 15630065, 14096331, 14922741, 2925, 4603, 26915),
Question=c(0, 0, 0, 2471, 4, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0),
Random_0=c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
Random_1=c(0, 0, 0, 0, 14, 1, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1),
Random_2=c(64, 67, 71, 35, 15, 216, 87, 166, 220, 100, 114, 321, 105, 167, 173, 155, 158, 186, 235, 166, 56, 177, 119, 50, 102, 43, 311, 373, 333, 74, 106, 123, 289, 267, 374),
Random_3=c(22, 40, 16, 12, 28, 77, 4, 18, 5, 45, 59, 20, 50, 37, 26, 21, 55, 49, 40, 39, 52, 58, 35, 49, 55, 50, 58, 54, 59, 26, 29, 25, 40, 65, 120),
Random_4=c(1, 3, 7, 77, 25, 40, 4, 1, 7, 15, 46, 17, 7, 0, 4, 8, 25, 8, 15, 10, 7, 22, 5, 5, 3, 2, 6, 13, 6, 12, 9, 3, 8, 3, 17),
Random_5=c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
Random_6=c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
Random_7=c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0),
Random_8=c(3255, 2734, 3156, 11996, 14702, 8891, 3011, 3161, 3364, 8394, 9060, 8458, 3991, 4067, 3630, 9788, 11737, 4580, 5183, 4147, 9150, 10252, 8110, 3530, 3557, 2837, 10429, 9828, 8649, 3863, 3709, 4091, 9683, 10046, 15143),
Random_9=c(0, 0, 0, 0, 3, 2, 0, 0, 0, 0, 0, 0, 1, 5, 0, 0, 0, 2, 0, 0, 0, 2, 1, 0, 5, 0, 4, 2, 0, 0, 1, 0, 1, 0, 5)
)
colnames(E) <- c("A_a_1", "A_a_2", "A_a_3", "A_b_1", "A_b_2", "A_b_3", "A_c_1", "A_c_2", "A_c_3", "A_d_1", "A_d_2", "A_d_3", "B_a_1", "B_a_2", "B_a_3", "B_b_1", "B_b_3", "B_c_1", "B_c_2", "B_c_3", "B_d_1", "B_d_2", "B_d_3", "C_a_1", "C_a_2", "C_a_3", "C_b_1", "C_b_2", "C_b_3", "C_c_1", "C_c_2", "C_c_3", "C_d_1", "C_d_2", "C_d_3")
Cov <- data.frame(Sample=colnames(E), Group=sub("_[123]\$", "", colnames(E)), Repl=sub("^[ABC]_[abcd]_", "", colnames(E)))
dds <- DESeqDataSetFromMatrix(countData=E, colData=Cov, design=~Group)
dds <- DESeq(dds)
rld <- rlog(dds)


The regularised logarithm value of the Question gene in sample A_b_1 is more than twice higher than the highest regularized likelihood of the HighExpr gene.

> counts(dds)[1:2, 1:5]
A_a_1    A_a_2    A_a_3 A_b_1 A_b_2
HighExpr 14413010 14729567 16831319  9285 42687
Question        0        0        0  2471     4
> assay(rld)[1:2, 1:5]
A_a_1     A_a_2     A_a_3    A_b_1     A_b_2
HighExpr 24.260943 23.913180 24.598851 15.18478 17.231491
Question  6.480747  6.480516  6.480839 53.75706  6.558154
> max(assay(rld)[1,])
[1] 24.59885


However, when fitNbinomGLMs is invoked with argument forceOptim=TRUE from rlogData, this unexpected behaviour disappears, see below:

> cbind(assay(rld_forceFalse)["Question",1:5], assay(rld_forceTrue)["Question",1:5])
[,1]      [,2]
A_a_1  6.480747 -3.815157
A_a_2  6.480516 -3.935015
A_a_3  6.480839 -3.778702
A_b_1 53.757064 12.202386
A_b_2  6.558154  3.153798


My questions are:

• Is my expectation of regularized logarithm values plain wrong? (i.e. could super high values be sometimes reached for genes with limited counts)
• Am I using DESeq or rlog wrong?
• Is there a way to force optimisation for all genes (or is it too slow?)

Thanks in advance for your help! Best, Eric

> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: Ubuntu 16.04.7 LTS

Matrix products: default
BLAS/LAPACK: /home/eblanc/miniconda3/envs/R_test/lib/libopenblasp-r0.3.10.so

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8       LC_NAME=C
[9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
[1] DESeq2_1.28.0               SummarizedExperiment_1.18.1
[3] DelayedArray_0.14.0         matrixStats_0.56.0
[5] Biobase_2.48.0              GenomicRanges_1.40.0
[7] GenomeInfoDb_1.24.0         IRanges_2.22.1
[9] S4Vectors_0.26.0            BiocGenerics_0.34.0

loaded via a namespace (and not attached):
[1] Rcpp_1.0.4.6           pillar_1.4.6           compiler_4.0.2
[4] RColorBrewer_1.1-2     XVector_0.28.0         bitops_1.0-6
[7] tools_4.0.2            zlibbioc_1.34.0        digest_0.6.25
[10] bit_4.0.4              tibble_3.0.3           lifecycle_0.2.0
[13] gtable_0.3.0           annotate_1.66.0        RSQLite_2.2.0
[16] memoise_1.1.0          lattice_0.20-41        pkgconfig_2.0.3
[19] rlang_0.4.7            Matrix_1.2-18          DBI_1.1.0
[22] GenomeInfoDbData_1.2.3 genefilter_1.70.0      vctrs_0.3.2
[25] locfit_1.5-9.4         bit64_4.0.2            grid_4.0.2
[28] glue_1.4.1             R6_2.4.1               AnnotationDbi_1.50.0
[31] XML_3.99-0.3           survival_3.2-3         BiocParallel_1.22.0
[34] magrittr_1.5           ggplot2_3.3.2          geneplotter_1.66.0
[37] blob_1.2.1             ellipsis_0.3.1         scales_1.1.1
[40] splines_4.0.2          colorspace_1.4-1       xtable_1.8-4
[43] munsell_0.5.0          RCurl_1.98-1.2         crayon_1.3.4

deseq2 normalization • 184 views
ADD COMMENT
1
Entering edit mode
@mikelove
Last seen 8 hours ago
United States

Dear Eric,

Thanks for your post and example.

I have also found that rlog does not perform well when the data do not fit a NB model (e.g. extreme count outliers), and for this reason and for the speed issue we have de-prioritized rlog in the vignette and workflow. Both recommend VST now for moderate to large datasets.

If you are using v1.20 or greater (current release is v1.28), then rlog should even print a message that rlog is not recommended for large datasets due to speed (n >= 30).

ADD COMMENT
0
Entering edit mode

Hi Michael,

OK, I'll stick to vst then.

Thanks for your help (& for DESeq2!) Eric

ADD REPLY

Login before adding your answer.

Traffic: 320 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6