Question

Unexpected rlog values when count spread is large

0

Entering edit mode

eric.blanc • 0

@ericblanc-11613

Last seen 3.7 years ago

Hi,

I am encountering an unexpected behaviour with regularized logarithmic values for one gene, when there is:

a large number of conditions (12),
limited replication number (3),
the read counts in most samples is 0, and
there is one sample for which the read count for the gene is high.

When these circumstances are met, the regularized logarithm value for this gene in the outlier condition is much higher than any other value in the dataset.

This effect can be seen in the toy example below, using the highest expressed gene in my dataset, the gene with the unexpected behaviour, and 10 genes selected at random to generate the dds object. Please note that the number of reads in sample A_b_1 for gene Question is much higher (2471) than in all other samples, and that its behaviour is just as unexpected in my full dataset (60321 genes in 107 samples).

E <- rbind(
    HighExpr=c(14413010, 14729567, 16831319, 9285, 42687, 208961, 7768481, 12181302, 10001122, 1275, 19230, 2846911, 15888015, 15805554, 13694644, 4124374, 43071, 16584140, 17740751, 15121608, 740, 675, 1093, 15783974, 14289011, 13051649, 1432938, 2229190, 2194025, 15630065, 14096331, 14922741, 2925, 4603, 26915),
    Question=c(0, 0, 0, 2471, 4, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0),
    Random_0=c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
    Random_1=c(0, 0, 0, 0, 14, 1, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1),
    Random_2=c(64, 67, 71, 35, 15, 216, 87, 166, 220, 100, 114, 321, 105, 167, 173, 155, 158, 186, 235, 166, 56, 177, 119, 50, 102, 43, 311, 373, 333, 74, 106, 123, 289, 267, 374),
    Random_3=c(22, 40, 16, 12, 28, 77, 4, 18, 5, 45, 59, 20, 50, 37, 26, 21, 55, 49, 40, 39, 52, 58, 35, 49, 55, 50, 58, 54, 59, 26, 29, 25, 40, 65, 120),
    Random_4=c(1, 3, 7, 77, 25, 40, 4, 1, 7, 15, 46, 17, 7, 0, 4, 8, 25, 8, 15, 10, 7, 22, 5, 5, 3, 2, 6, 13, 6, 12, 9, 3, 8, 3, 17),
    Random_5=c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
    Random_6=c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
    Random_7=c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0),
    Random_8=c(3255, 2734, 3156, 11996, 14702, 8891, 3011, 3161, 3364, 8394, 9060, 8458, 3991, 4067, 3630, 9788, 11737, 4580, 5183, 4147, 9150, 10252, 8110, 3530, 3557, 2837, 10429, 9828, 8649, 3863, 3709, 4091, 9683, 10046, 15143),
    Random_9=c(0, 0, 0, 0, 3, 2, 0, 0, 0, 0, 0, 0, 1, 5, 0, 0, 0, 2, 0, 0, 0, 2, 1, 0, 5, 0, 4, 2, 0, 0, 1, 0, 1, 0, 5)
)
colnames(E) <- c("A_a_1", "A_a_2", "A_a_3", "A_b_1", "A_b_2", "A_b_3", "A_c_1", "A_c_2", "A_c_3", "A_d_1", "A_d_2", "A_d_3", "B_a_1", "B_a_2", "B_a_3", "B_b_1", "B_b_3", "B_c_1", "B_c_2", "B_c_3", "B_d_1", "B_d_2", "B_d_3", "C_a_1", "C_a_2", "C_a_3", "C_b_1", "C_b_2", "C_b_3", "C_c_1", "C_c_2", "C_c_3", "C_d_1", "C_d_2", "C_d_3")
Cov <- data.frame(Sample=colnames(E), Group=sub("_[123]$", "", colnames(E)), Repl=sub("^[ABC]_[abcd]_", "", colnames(E)))
dds <- DESeqDataSetFromMatrix(countData=E, colData=Cov, design=~Group)
dds <- DESeq(dds)
rld <- rlog(dds)

The regularised logarithm value of the Question gene in sample A_b_1 is more than twice higher than the highest regularized likelihood of the HighExpr gene.

> counts(dds)[1:2, 1:5]
            A_a_1    A_a_2    A_a_3 A_b_1 A_b_2
HighExpr 14413010 14729567 16831319  9285 42687
Question        0        0        0  2471     4
> assay(rld)[1:2, 1:5]
             A_a_1     A_a_2     A_a_3    A_b_1     A_b_2
HighExpr 24.260943 23.913180 24.598851 15.18478 17.231491
Question  6.480747  6.480516  6.480839 53.75706  6.558154
> max(assay(rld)[1,])
[1] 24.59885

However, when fitNbinomGLMs is invoked with argument forceOptim=TRUE from rlogData, this unexpected behaviour disappears, see below:

> cbind(assay(rld_forceFalse)["Question",1:5], assay(rld_forceTrue)["Question",1:5])
           [,1]      [,2]
A_a_1  6.480747 -3.815157
A_a_2  6.480516 -3.935015
A_a_3  6.480839 -3.778702
A_b_1 53.757064 12.202386
A_b_2  6.558154  3.153798

My questions are:

Is my expectation of regularized logarithm values plain wrong? (i.e. could super high values be sometimes reached for genes with limited counts)
Am I using DESeq or rlog wrong?
Is there a way to force optimisation for all genes (or is it too slow?)

Thanks in advance for your help! Best, Eric

> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: Ubuntu 16.04.7 LTS

Matrix products: default
BLAS/LAPACK: /home/eblanc/miniconda3/envs/R_test/lib/libopenblasp-r0.3.10.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] DESeq2_1.28.0               SummarizedExperiment_1.18.1
 [3] DelayedArray_0.14.0         matrixStats_0.56.0         
 [5] Biobase_2.48.0              GenomicRanges_1.40.0       
 [7] GenomeInfoDb_1.24.0         IRanges_2.22.1             
 [9] S4Vectors_0.26.0            BiocGenerics_0.34.0        

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.4.6           pillar_1.4.6           compiler_4.0.2        
 [4] RColorBrewer_1.1-2     XVector_0.28.0         bitops_1.0-6          
 [7] tools_4.0.2            zlibbioc_1.34.0        digest_0.6.25         
[10] bit_4.0.4              tibble_3.0.3           lifecycle_0.2.0       
[13] gtable_0.3.0           annotate_1.66.0        RSQLite_2.2.0         
[16] memoise_1.1.0          lattice_0.20-41        pkgconfig_2.0.3       
[19] rlang_0.4.7            Matrix_1.2-18          DBI_1.1.0             
[22] GenomeInfoDbData_1.2.3 genefilter_1.70.0      vctrs_0.3.2           
[25] locfit_1.5-9.4         bit64_4.0.2            grid_4.0.2            
[28] glue_1.4.1             R6_2.4.1               AnnotationDbi_1.50.0  
[31] XML_3.99-0.3           survival_3.2-3         BiocParallel_1.22.0   
[34] magrittr_1.5           ggplot2_3.3.2          geneplotter_1.66.0    
[37] blob_1.2.1             ellipsis_0.3.1         scales_1.1.1          
[40] splines_4.0.2          colorspace_1.4-1       xtable_1.8-4          
[43] munsell_0.5.0          RCurl_1.98-1.2         crayon_1.3.4

deseq2 normalization • 519 views

ADD COMMENT • link updated 3.7 years ago by Michael Love 41k • written 3.7 years ago by eric.blanc • 0

score 1 · Answer 1 · 2020-08-17

1

Entering edit mode

Michael Love 41k

@mikelove

Last seen 1 day ago

United States

Dear Eric,

Thanks for your post and example.

I have also found that rlog does not perform well when the data do not fit a NB model (e.g. extreme count outliers), and for this reason and for the speed issue we have de-prioritized rlog in the vignette and workflow. Both recommend VST now for moderate to large datasets.

If you are using v1.20 or greater (current release is v1.28), then rlog should even print a message that rlog is not recommended for large datasets due to speed (n >= 30).

ADD COMMENT • link 3.7 years ago Michael Love 41k

0

Entering edit mode

Hi Michael,

OK, I'll stick to vst then.

Thanks for your help (& for DESeq2!) Eric

ADD REPLY • link 3.7 years ago eric.blanc • 0