DESeq2 size factors estimation after RSEM quantification
2
1
Entering edit mode
Hugo Varet ▴ 80
@hugo-varet-6301
Last seen 6.5 years ago
France

Dear Mike Love,

I just upgraded DESeq2 to get the last version (1.12.0) and I have a problem when using the estimateSizeFactors() function on a DESeqDataSet object built from RSEM outputs (with the help of tximport). The problem is that infinite values are generated because of the use of the log() on transcript/gene lengths equal to 0 (in the txi.rsem$length matrix) when computing the geometric mean. Here is a piece of R code from the tximport vignette slightly modified to reproduce the error:

library(tximportData)
library(tximport)
dir <- system.file("extdata", package = "tximportData")
samples <- read.table(file.path(dir, "samples.txt"), header = TRUE)
files <- file.path(dir, "rsem", samples$run, paste0(samples$run, ".genes.results"))
names(files) <- paste0("sample", 1:6)
txi.rsem <- tximport(files, type = "rsem")
### modification: ###
txi.rsem$length[which.min(txi.rsem$length)] <- 0
ddsrsem <- DESeqDataSetFromTximport(txi.rsem, colData=samples, design=~1)
ddsrsem <- estimateSizeFactors(ddsrsem) # Erreur : all(!is.na(value)) is not TRUE

Do you think I should filter out rows for which there is at least one transcript/gene length equal to 0?

Below is the result of my sessionInfo():
R version 3.3.0 (2016-05-03)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04 LTS

locale:
 [1] LC_CTYPE=fr_FR.UTF-8       LC_NUMERIC=C               LC_TIME=fr_FR.UTF-8        LC_COLLATE=fr_FR.UTF-8     LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=fr_FR.UTF-8    LC_PAPER=fr_FR.UTF-8      
 [8] LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] readr_0.2.2                tximport_1.0.0             tximportData_1.0.0         DESeq2_1.12.0              SummarizedExperiment_1.2.0 Biobase_2.32.0             GenomicRanges_1.24.0      
 [8] GenomeInfoDb_1.8.0         IRanges_2.6.0              S4Vectors_0.10.0           BiocGenerics_0.18.0       

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.4.5        RColorBrewer_1.1-2   plyr_1.8.3           XVector_0.12.0       tools_3.3.0          zlibbioc_1.18.0      rpart_4.1-10         RSQLite_1.0.0        annotate_1.50.0     
[10] gtable_0.2.0         lattice_0.20-33      Matrix_1.2-6         DBI_0.4-1            gridExtra_2.2.1      genefilter_1.54.0    cluster_2.0.4        locfit_1.5-9.1       grid_3.3.0          
[19] nnet_7.3-12          data.table_1.9.6     AnnotationDbi_1.34.0 XML_3.98-1.4         survival_2.39-3      BiocParallel_1.6.0   foreign_0.8-66       latticeExtra_0.6-28  Formula_1.2-1       
[28] geneplotter_1.50.0   ggplot2_2.1.0        Hmisc_3.17-4         scales_0.4.0         splines_3.3.0        colorspace_1.2-6     xtable_1.8-2         acepack_1.3-3.3      munsell_0.4.3       
[37] chron_2.3-47

Please let me know if you need more information.

Many thanks for your help, best regards,

Hugo

 

deseq2 tximport rsem estimatesizefactors • 2.1k views
ADD COMMENT
0
Entering edit mode
@mikelove
Last seen 10 hours ago
United States

Dear Hugo, 

Thanks for the bug report.

I should add some warnings and errors at some point because a gene length of 0 doesn't make sense (in the modeling or otherwise).

I guess the easiest thing would be for tximport to warn about this and then DESeqDataSetFromTximport to give an error and not allow length of 0. Does that sound right to you?

Yes, in the meantime, I would add a check in your scripts to somehow deal with rows that have a gene length of 0, either by somehow filling in a reasonable value or removing the gene from analysis. I don't know which makes more sense.

ADD COMMENT
0
Entering edit mode
Hugo Varet ▴ 80
@hugo-varet-6301
Last seen 6.5 years ago
France

Dear Mike,

thanks for you answer and your advices. I added a check in my script to deal with that and continue the analysis.

I am not very familiar with RSEM, but I think the difficulty comes from the "effective_length" which can be set to zero for some samples and positive for others, depending on whether reads from the gene have been sequenced. From a statistical point of view, may it be correct to remove the infinite values coming from the log(0) as it is done to compute the size factors?

Best,

Hugo

ADD COMMENT
0
Entering edit mode
Well the gene length used as a normalization factor in the tximport workflow, so then it will cause a problem again for dispersion and LFC estimation. It may make sense to plug in the standard calculation of effective gene length.
ADD REPLY

Login before adding your answer.

Traffic: 845 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6