Question: DESeq2 size factors estimation after RSEM quantification
gravatar for Hugo Varet
3.4 years ago by
Hugo Varet80
Hugo Varet80 wrote:

Dear Mike Love,

I just upgraded DESeq2 to get the last version (1.12.0) and I have a problem when using the estimateSizeFactors() function on a DESeqDataSet object built from RSEM outputs (with the help of tximport). The problem is that infinite values are generated because of the use of the log() on transcript/gene lengths equal to 0 (in the txi.rsem$length matrix) when computing the geometric mean. Here is a piece of R code from the tximport vignette slightly modified to reproduce the error:

dir <- system.file("extdata", package = "tximportData")
samples <- read.table(file.path(dir, "samples.txt"), header = TRUE)
files <- file.path(dir, "rsem", samples$run, paste0(samples$run, ".genes.results"))
names(files) <- paste0("sample", 1:6)
txi.rsem <- tximport(files, type = "rsem")
### modification: ###
txi.rsem$length[which.min(txi.rsem$length)] <- 0
ddsrsem <- DESeqDataSetFromTximport(txi.rsem, colData=samples, design=~1)
ddsrsem <- estimateSizeFactors(ddsrsem) # Erreur : all(! is not TRUE

Do you think I should filter out rows for which there is at least one transcript/gene length equal to 0?

Below is the result of my sessionInfo():
R version 3.3.0 (2016-05-03)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04 LTS

 [1] LC_CTYPE=fr_FR.UTF-8       LC_NUMERIC=C               LC_TIME=fr_FR.UTF-8        LC_COLLATE=fr_FR.UTF-8     LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=fr_FR.UTF-8    LC_PAPER=fr_FR.UTF-8      
 [8] LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] readr_0.2.2                tximport_1.0.0             tximportData_1.0.0         DESeq2_1.12.0              SummarizedExperiment_1.2.0 Biobase_2.32.0             GenomicRanges_1.24.0      
 [8] GenomeInfoDb_1.8.0         IRanges_2.6.0              S4Vectors_0.10.0           BiocGenerics_0.18.0       

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.4.5        RColorBrewer_1.1-2   plyr_1.8.3           XVector_0.12.0       tools_3.3.0          zlibbioc_1.18.0      rpart_4.1-10         RSQLite_1.0.0        annotate_1.50.0     
[10] gtable_0.2.0         lattice_0.20-33      Matrix_1.2-6         DBI_0.4-1            gridExtra_2.2.1      genefilter_1.54.0    cluster_2.0.4        locfit_1.5-9.1       grid_3.3.0          
[19] nnet_7.3-12          data.table_1.9.6     AnnotationDbi_1.34.0 XML_3.98-1.4         survival_2.39-3      BiocParallel_1.6.0   foreign_0.8-66       latticeExtra_0.6-28  Formula_1.2-1       
[28] geneplotter_1.50.0   ggplot2_2.1.0        Hmisc_3.17-4         scales_0.4.0         splines_3.3.0        colorspace_1.2-6     xtable_1.8-2         acepack_1.3-3.3      munsell_0.4.3       
[37] chron_2.3-47

Please let me know if you need more information.

Many thanks for your help, best regards,



ADD COMMENTlink modified 3.4 years ago • written 3.4 years ago by Hugo Varet80
Answer: DESeq2 size factors estimation after RSEM quantification
gravatar for Michael Love
3.4 years ago by
Michael Love25k
United States
Michael Love25k wrote:

Dear Hugo, 

Thanks for the bug report.

I should add some warnings and errors at some point because a gene length of 0 doesn't make sense (in the modeling or otherwise).

I guess the easiest thing would be for tximport to warn about this and then DESeqDataSetFromTximport to give an error and not allow length of 0. Does that sound right to you?

Yes, in the meantime, I would add a check in your scripts to somehow deal with rows that have a gene length of 0, either by somehow filling in a reasonable value or removing the gene from analysis. I don't know which makes more sense.

ADD COMMENTlink written 3.4 years ago by Michael Love25k
Answer: DESeq2 size factors estimation after RSEM quantification
gravatar for Hugo Varet
3.4 years ago by
Hugo Varet80
Hugo Varet80 wrote:

Dear Mike,

thanks for you answer and your advices. I added a check in my script to deal with that and continue the analysis.

I am not very familiar with RSEM, but I think the difficulty comes from the "effective_length" which can be set to zero for some samples and positive for others, depending on whether reads from the gene have been sequenced. From a statistical point of view, may it be correct to remove the infinite values coming from the log(0) as it is done to compute the size factors?



ADD COMMENTlink written 3.4 years ago by Hugo Varet80
Well the gene length used as a normalization factor in the tximport workflow, so then it will cause a problem again for dispersion and LFC estimation. It may make sense to plug in the standard calculation of effective gene length.
ADD REPLYlink written 3.4 years ago by Michael Love25k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 232 users visited in the last hour