Question: DESeq2 size factors estimation after RSEM quantification
1
3.4 years ago by
Hugo Varet80
France
Hugo Varet80 wrote:

Dear Mike Love,

I just upgraded DESeq2 to get the last version (1.12.0) and I have a problem when using the estimateSizeFactors() function on a DESeqDataSet object built from RSEM outputs (with the help of tximport). The problem is that infinite values are generated because of the use of the log() on transcript/gene lengths equal to 0 (in the txi.rsem$length matrix) when computing the geometric mean. Here is a piece of R code from the tximport vignette slightly modified to reproduce the error: library(tximportData) library(tximport) dir <- system.file("extdata", package = "tximportData") samples <- read.table(file.path(dir, "samples.txt"), header = TRUE) files <- file.path(dir, "rsem", samples$run, paste0(samples$run, ".genes.results")) names(files) <- paste0("sample", 1:6) txi.rsem <- tximport(files, type = "rsem") ### modification: ### txi.rsem$length[which.min(txi.rsem\$length)] <- 0
ddsrsem <- DESeqDataSetFromTximport(txi.rsem, colData=samples, design=~1)
ddsrsem <- estimateSizeFactors(ddsrsem) # Erreur : all(!is.na(value)) is not TRUE

Do you think I should filter out rows for which there is at least one transcript/gene length equal to 0?

Below is the result of my sessionInfo():
R version 3.3.0 (2016-05-03)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04 LTS

locale:
[1] LC_CTYPE=fr_FR.UTF-8       LC_NUMERIC=C               LC_TIME=fr_FR.UTF-8        LC_COLLATE=fr_FR.UTF-8     LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=fr_FR.UTF-8    LC_PAPER=fr_FR.UTF-8
[8] LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] readr_0.2.2                tximport_1.0.0             tximportData_1.0.0         DESeq2_1.12.0              SummarizedExperiment_1.2.0 Biobase_2.32.0             GenomicRanges_1.24.0
[8] GenomeInfoDb_1.8.0         IRanges_2.6.0              S4Vectors_0.10.0           BiocGenerics_0.18.0       

loaded via a namespace (and not attached):
[1] Rcpp_0.12.4.5        RColorBrewer_1.1-2   plyr_1.8.3           XVector_0.12.0       tools_3.3.0          zlibbioc_1.18.0      rpart_4.1-10         RSQLite_1.0.0        annotate_1.50.0
[10] gtable_0.2.0         lattice_0.20-33      Matrix_1.2-6         DBI_0.4-1            gridExtra_2.2.1      genefilter_1.54.0    cluster_2.0.4        locfit_1.5-9.1       grid_3.3.0
[19] nnet_7.3-12          data.table_1.9.6     AnnotationDbi_1.34.0 XML_3.98-1.4         survival_2.39-3      BiocParallel_1.6.0   foreign_0.8-66       latticeExtra_0.6-28  Formula_1.2-1
[28] geneplotter_1.50.0   ggplot2_2.1.0        Hmisc_3.17-4         scales_0.4.0         splines_3.3.0        colorspace_1.2-6     xtable_1.8-2         acepack_1.3-3.3      munsell_0.4.3
[37] chron_2.3-47

Many thanks for your help, best regards,

Hugo

modified 3.4 years ago • written 3.4 years ago by Hugo Varet80
Answer: DESeq2 size factors estimation after RSEM quantification
0
3.4 years ago by
Michael Love25k
United States
Michael Love25k wrote:

Dear Hugo,

Thanks for the bug report.

I should add some warnings and errors at some point because a gene length of 0 doesn't make sense (in the modeling or otherwise).

I guess the easiest thing would be for tximport to warn about this and then DESeqDataSetFromTximport to give an error and not allow length of 0. Does that sound right to you?

Yes, in the meantime, I would add a check in your scripts to somehow deal with rows that have a gene length of 0, either by somehow filling in a reasonable value or removing the gene from analysis. I don't know which makes more sense.

Answer: DESeq2 size factors estimation after RSEM quantification
0
3.4 years ago by
Hugo Varet80
France
Hugo Varet80 wrote:

Dear Mike,

I am not very familiar with RSEM, but I think the difficulty comes from the "effective_length" which can be set to zero for some samples and positive for others, depending on whether reads from the gene have been sequenced. From a statistical point of view, may it be correct to remove the infinite values coming from the log(0) as it is done to compute the size factors?

Best,

Hugo