DESeq2 size factors estimation after RSEM quantification
Entering edit mode
Hugo Varet ▴ 80
Last seen 6.3 years ago

Dear Mike Love,

I just upgraded DESeq2 to get the last version (1.12.0) and I have a problem when using the estimateSizeFactors() function on a DESeqDataSet object built from RSEM outputs (with the help of tximport). The problem is that infinite values are generated because of the use of the log() on transcript/gene lengths equal to 0 (in the txi.rsem$length matrix) when computing the geometric mean. Here is a piece of R code from the tximport vignette slightly modified to reproduce the error:

dir <- system.file("extdata", package = "tximportData")
samples <- read.table(file.path(dir, "samples.txt"), header = TRUE)
files <- file.path(dir, "rsem", samples$run, paste0(samples$run, ".genes.results"))
names(files) <- paste0("sample", 1:6)
txi.rsem <- tximport(files, type = "rsem")
### modification: ###
txi.rsem$length[which.min(txi.rsem$length)] <- 0
ddsrsem <- DESeqDataSetFromTximport(txi.rsem, colData=samples, design=~1)
ddsrsem <- estimateSizeFactors(ddsrsem) # Erreur : all(! is not TRUE

Do you think I should filter out rows for which there is at least one transcript/gene length equal to 0?

Below is the result of my sessionInfo():
R version 3.3.0 (2016-05-03)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04 LTS

 [1] LC_CTYPE=fr_FR.UTF-8       LC_NUMERIC=C               LC_TIME=fr_FR.UTF-8        LC_COLLATE=fr_FR.UTF-8     LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=fr_FR.UTF-8    LC_PAPER=fr_FR.UTF-8      
 [8] LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] readr_0.2.2                tximport_1.0.0             tximportData_1.0.0         DESeq2_1.12.0              SummarizedExperiment_1.2.0 Biobase_2.32.0             GenomicRanges_1.24.0      
 [8] GenomeInfoDb_1.8.0         IRanges_2.6.0              S4Vectors_0.10.0           BiocGenerics_0.18.0       

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.4.5        RColorBrewer_1.1-2   plyr_1.8.3           XVector_0.12.0       tools_3.3.0          zlibbioc_1.18.0      rpart_4.1-10         RSQLite_1.0.0        annotate_1.50.0     
[10] gtable_0.2.0         lattice_0.20-33      Matrix_1.2-6         DBI_0.4-1            gridExtra_2.2.1      genefilter_1.54.0    cluster_2.0.4        locfit_1.5-9.1       grid_3.3.0          
[19] nnet_7.3-12          data.table_1.9.6     AnnotationDbi_1.34.0 XML_3.98-1.4         survival_2.39-3      BiocParallel_1.6.0   foreign_0.8-66       latticeExtra_0.6-28  Formula_1.2-1       
[28] geneplotter_1.50.0   ggplot2_2.1.0        Hmisc_3.17-4         scales_0.4.0         splines_3.3.0        colorspace_1.2-6     xtable_1.8-2         acepack_1.3-3.3      munsell_0.4.3       
[37] chron_2.3-47

Please let me know if you need more information.

Many thanks for your help, best regards,



deseq2 tximport rsem estimatesizefactors • 2.0k views
Entering edit mode
Last seen 2 days ago
United States

Dear Hugo, 

Thanks for the bug report.

I should add some warnings and errors at some point because a gene length of 0 doesn't make sense (in the modeling or otherwise).

I guess the easiest thing would be for tximport to warn about this and then DESeqDataSetFromTximport to give an error and not allow length of 0. Does that sound right to you?

Yes, in the meantime, I would add a check in your scripts to somehow deal with rows that have a gene length of 0, either by somehow filling in a reasonable value or removing the gene from analysis. I don't know which makes more sense.

Entering edit mode
Hugo Varet ▴ 80
Last seen 6.3 years ago

Dear Mike,

thanks for you answer and your advices. I added a check in my script to deal with that and continue the analysis.

I am not very familiar with RSEM, but I think the difficulty comes from the "effective_length" which can be set to zero for some samples and positive for others, depending on whether reads from the gene have been sequenced. From a statistical point of view, may it be correct to remove the infinite values coming from the log(0) as it is done to compute the size factors?



Entering edit mode
Well the gene length used as a normalization factor in the tximport workflow, so then it will cause a problem again for dispersion and LFC estimation. It may make sense to plug in the standard calculation of effective gene length.

Login before adding your answer.

Traffic: 599 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6