Question

Large difference between naive log2 and rlog transformation of count data

0

Entering edit mode

dario.beraldi • 0

@darioberaldi-6991

Last seen 17 months ago

United Kingdom

I'm comparing normalized gene expression using a naive log2 transformation of raw counts and DESeq::rlog method. It seems to me that rlog tends to apply considerable shrinkage in this dataset resulting in a noticeable difference between the two methods even for highly expressed genes. I.e., in the figure below I would expect the dots (samples) to align more closely along the diagonal.

This is real data from Nanostring which I have anonymized and put on dropbox.

I'm inclined to think that in this case rlog is not very appropriate perhaps due to the nature of Nanostring experiments (few genes many of them expected to be different between conditions) compared to RNAseq. Also, after testing for differential expression analysis using lfcShrink, the estimates of fold-change are quite a bit higher relative to the rlog estimates. That is, estimates of log2 fold-change are better recapitulated by the naive log2 normalization than by the rlog (I understand that rlog and similar should not be used for testing DE but still...)

Any comment much appreciated... Is it appropriate to use rlog here? Is this difference expected?

library(data.table)
library(ggplot2)
library(DESeq2)

mat2 <- read.csv('https://www.dropbox.com/s/xdrdgc06392vp3g/nano.tsv?dl=1', sep= '\t')

acc <- mat2$Accession
mat2$Accession <- NULL
mat2 <- as.matrix(mat2)
rownames(mat2) <- acc

design <- data.frame(sample_id= colnames(mat2), tissue= sub('.*_', '', colnames(mat2)))

deseq <- DESeqDataSetFromMatrix(mat2, colData= design, design= ~ tissue)
deseq <- DESeq(deseq)

# RLOG NORMALIZATION
rlog_gex <- rlog(deseq)
rlog_gex <- data.table(assay(rlog_gex), Accession= rownames(rlog_gex))
rlog_gex <- melt(data= rlog_gex, id.vars= 'Accession', variable.name= 'nanostring_id', value.name= 'rlog_gex')

# NAIVE log2 NORMALIZATION
log_gex <- normTransform(deseq)
log_gex <- data.table(assay(log_gex), Accession= rownames(log_gex))
log_gex <- melt(data= log_gex, id.vars= 'Accession', variable.name= 'nanostring_id', value.name= 'log2_gex')

dat <- merge(rlog_gex, log_gex, by= c('Accession', 'nanostring_id'))

gg <- ggplot(data= dat, aes(x= log2_gex, y= rlog_gex)) +
    geom_abline(intercept= 0, slope= 1, colour= 'blue') +
    geom_point(size= 0.2) +
    facet_wrap(~Accession, scale= 'free') +
    theme_light()
ggsave('tmp.png', width= 13, height= 13)

enter image description here

sessionInfo( )
R version 3.6.3 (2020-02-29)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /export/III-data/wcmp_bioinformatics/db291g/miniconda3/envs/20200306_matt_nanostr_bmpb/lib/libopenblasp-r0.3.12.so

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8     LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8    LC_PAPER=en_GB.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] DESeq2_1.26.0               SummarizedExperiment_1.16.0 DelayedArray_0.12.0         BiocParallel_1.20.0         matrixStats_0.58.0          Biobase_2.46.0              GenomicRanges_1.38.0        GenomeInfoDb_1.22.0         IRanges_2.20.0             
[10] S4Vectors_0.24.0            BiocGenerics_0.32.0         ggplot2_3.3.2               data.table_1.12.2          

loaded via a namespace (and not attached):
 [1] bit64_4.0.5            splines_3.6.3          Formula_1.2-4          latticeExtra_0.6-29    blob_1.2.1             GenomeInfoDbData_1.2.2 pillar_1.6.0           RSQLite_2.2.5          backports_1.2.1        lattice_0.20-41        glue_1.4.2            
[12] digest_0.6.27          RColorBrewer_1.1-2     XVector_0.26.0         checkmate_2.0.0        colorspace_2.0-0       htmltools_0.5.1.1      Matrix_1.3-2           XML_3.99-0.3           pkgconfig_2.0.3        genefilter_1.68.0      zlibbioc_1.32.0       
[23] xtable_1.8-4           scales_1.1.1           jpeg_0.1-8.1           htmlTable_2.2.1        tibble_3.1.0           annotate_1.64.0        ellipsis_0.3.1         cachem_1.0.4           withr_2.4.1            nnet_7.3-16            survival_3.2-11       
[34] magrittr_2.0.1         crayon_1.4.1           memoise_2.0.0          fansi_0.4.2            foreign_0.8-76         tools_3.6.3            lifecycle_1.0.0        stringr_1.4.0          locfit_1.5-9.4         munsell_0.5.0          cluster_2.1.0         
[45] AnnotationDbi_1.48.0   compiler_3.6.3         rlang_0.4.10           grid_3.6.3             RCurl_1.98-1.3         rstudioapi_0.13        htmlwidgets_1.5.3      bitops_1.0-6           base64enc_0.1-3        gtable_0.3.0           DBI_1.1.1             
[56] R6_2.5.0               gridExtra_2.3          knitr_1.33             fastmap_1.1.0          bit_4.0.4              utf8_1.2.1             Hmisc_4.4-1            stringi_1.4.3          Rcpp_1.0.6             geneplotter_1.64.0     vctrs_0.3.7           
[67] rpart_4.1-15           png_0.1-7              xfun_0.23

DESeq2 Normalization rlog • 1.5k views

ADD COMMENT • link updated 2.1 years ago by Michael Love 41k • written 2.1 years ago by dario.beraldi • 0

1

Entering edit mode

Mike will give his two cents for sure but based on Which rlog should I use for DESeq2 analysis and other threads and tweets I recall rlog is discouraged these days (as is the normal method in lfcShrink) as it tends to overshrink the effect size. For shrinkage the recommendation is either ashr or apeglm, and for data transformation it is the vst afaik. I also recall threads that DESeq2+Nanostring is fine as long as you make sure that you confidently normalize based on a set of non-changing/housekeeping genes.

ADD REPLY • link 2.1 years ago ATpoint ★ 4.0k

score 2 · Answer 1 · 2022-03-22

2

Entering edit mode

Michael Love 41k

@mikelove

Last seen 5 hours ago

United States

Yes, agree with @ATpoint above, we recommend VST over rlog. Rlog tends to overshrink real large differences. Same with recommending the improved methods apeglm and ashr, over the previous method "normal" which tended to overshrink.

ADD COMMENT • link 2.1 years ago Michael Love 41k