tximport rowsum error

Ezgi ▴ 60

I'm trying to import some salmon quant.sf files and summarize them to gene-level TPMs with tximport. I have confirmed that every file exists. The transcript names in these files are long, pipe-delimited strings, but my tx2gene data frame uses those same names to map them to gene IDs. Even so, I get a rowsum error saying that something that is supposed to be numeric isn't. I'm not sure how to find the problem in my data, and I would appreciate any suggestions.

Here's what I'm running:

txi <- tximport(salmon_paths, 
                type = "salmon",
                tx2gene = tx2gene)

Then I get the error:

reading in files with read_tsv
1 2 3 4 5 6 7 8 9 10 
removing duplicated transcript rows from tx2gene
transcripts missing from tx2gene: 92
summarizing abundance
summarizing counts
summarizing length
Error in rowsum.default(x[sub.idx, , drop = FALSE], geneId) : 
  'x' must be numeric
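
For context, the failing call in the traceback is base R's rowsum(), which sums the rows of a matrix within groups (here, transcripts within genes). The sketch below uses made-up toy values, not data from this thread, to show both the intended summarization and how the same error message appears when the matrix is not numeric. It only illustrates the error itself and does not claim to reproduce what tximport does internally.

```r
# Toy transcript-level matrix: 3 transcripts x 2 samples (values are made up).
counts <- matrix(
  c(10, 5, 0,   # sample s1
    3, 7, 2),   # sample s2
  nrow = 3,
  dimnames = list(c("tx1", "tx2", "tx3"), c("s1", "s2"))
)

# Gene assignment for each transcript row (hypothetical IDs).
gene_id <- c("geneA", "geneA", "geneB")

# Sum transcript rows within each gene.
gene_counts <- rowsum(counts, gene_id)

# The same call on a non-numeric matrix raises the error from the question.
bad <- counts
storage.mode(bad) <- "character"
msg <- tryCatch(
  rowsum(bad, gene_id),
  error = function(e) conditionMessage(e)
)
print(gene_counts)
print(msg)
```

So the message means that one of the matrices reaching the summarization step is not numeric, which is a narrower thing to look for than a problem with the files in general.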

I tried reading the same files with purrr and read_tsv, binding them either by rows (map_dfr) or by columns (map_dfc), and I don't get any errors:

tpm_dfc <- purrr::map_dfc(salmon_paths, read_tsv)
tpm_dfr <- purrr::map_dfr(salmon_paths, read_tsv)

I also get the expected column specification for each file, so I don't think the columns are being read with the wrong types. What else might be going on here?

── Column specification ──────────────────────────────────────────────────────────────────────────────
cols(
  Name = col_character(),
  Length = col_double(),
  EffectiveLength = col_double(),
  TPM = col_double(),
  NumReads = col_double()
)
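
A per-file check can make this kind of inspection more explicit than binding everything together. The following is a minimal sketch in base R, assuming the standard five quant.sf columns shown above; check_quant_file is a hypothetical helper, and the two temporary files are fabricated purely for the demonstration.

```r
# Hypothetical helper: return the names of columns that should be numeric
# but are either absent or were read as a non-numeric type.
check_quant_file <- function(path,
                             numeric_cols = c("Length", "EffectiveLength",
                                              "TPM", "NumReads")) {
  df <- read.delim(path, stringsAsFactors = FALSE)
  absent <- setdiff(numeric_cols, names(df))
  present <- intersect(numeric_cols, names(df))
  non_numeric <- present[!vapply(df[present], is.numeric, logical(1))]
  c(absent, non_numeric)
}

# Demonstration on two made-up files: one clean, one with a stray string.
header <- "Name\tLength\tEffectiveLength\tTPM\tNumReads"
good <- tempfile(fileext = ".sf")
bad <- tempfile(fileext = ".sf")
writeLines(c(header, "tx1\t1657\t1490.931\t0\t0"), good)
writeLines(c(header, "tx1\t1657\t1490.931\tnot_a_number\t0"), bad)

problems <- lapply(c(good = good, bad = bad), check_quant_file)
print(problems)
```

With real data, the same helper could be applied over salmon_paths; an empty result for every file would suggest the quant.sf tables themselves are not the source of the error.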

Here are the first rows of one of my quant.sf files:

tibble::tribble(
  ~Name, ~Length, ~EffectiveLength,     ~TPM, ~NumReads,
  "ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|RP11-34P13.1-002|DDX11L1|1657|processed_transcript|",    1657,         1490.931,        0,         0,
  "ENST00000450305.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000002844.2|RP11-34P13.1-001|DDX11L1|632|transcribed_unprocessed_pseudogene|",     632,          465.981,        0,         0,
  "ENST00000488147.1|ENSG00000227232.5|OTTHUMG00000000958.1|OTTHUMT00000002839.1|RP11-34P13.2-001|WASH7P|1351|unprocessed_pseudogene|",    1351,         1184.931, 0.135787, 14.127527,
  "ENST00000619216.1|ENSG00000278267.1|-|-|MIR6859-1-201|MIR6859-1|68|miRNA|",      68,            22.69,        0,         0,
  "ENST00000473358.1|ENSG00000243485.5|OTTHUMG00000000959.2|OTTHUMT00000002840.1|RP11-34P13.3-001|RP11-34P13.3|712|lincRNA|",     712,          545.971,        0,         0,
  "ENST00000469289.1|ENSG00000243485.5|OTTHUMG00000000959.2|OTTHUMT00000002841.2|RP11-34P13.3-002|RP11-34P13.3|535|lincRNA|",     535,          369.025,        0,         0
)

And the first rows of my tx2gene data frame:

tibble::tribble(
  ~Name,     ~ensembl_gene,
  "ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|RP11-34P13.1-002|DDX11L1|1657|processed_transcript|", "ENSG00000223972",
  "ENST00000450305.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000002844.2|RP11-34P13.1-001|DDX11L1|632|transcribed_unprocessed_pseudogene|", "ENSG00000223972",
  "ENST00000488147.1|ENSG00000227232.5|OTTHUMG00000000958.1|OTTHUMT00000002839.1|RP11-34P13.2-001|WASH7P|1351|unprocessed_pseudogene|", "ENSG00000227232",
  "ENST00000619216.1|ENSG00000278267.1|-|-|MIR6859-1-201|MIR6859-1|68|miRNA|", "ENSG00000278267",
  "ENST00000473358.1|ENSG00000243485.5|OTTHUMG00000000959.2|OTTHUMT00000002840.1|RP11-34P13.3-001|RP11-34P13.3|712|lincRNA|", "ENSG00000243485",
  "ENST00000469289.1|ENSG00000243485.5|OTTHUMG00000000959.2|OTTHUMT00000002841.2|RP11-34P13.3-002|RP11-34P13.3|535|lincRNA|", "ENSG00000243485"
)
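
For readers with similar pipe-delimited transcript names, a mapping of this shape can be derived from the names themselves. This is a sketch under the assumption that the second pipe-separated field is always the versioned gene ID, as in the rows above; it is not necessarily how the tx2gene here was built. The setdiff() at the end shows one way to list the transcripts that tximport reports as missing from tx2gene; the "ENST_hypothetical" entry is invented for illustration.

```r
# Two transcript names copied from the example rows above.
tx_names <- c(
  "ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|RP11-34P13.1-002|DDX11L1|1657|processed_transcript|",
  "ENST00000619216.1|ENSG00000278267.1|-|-|MIR6859-1-201|MIR6859-1|68|miRNA|"
)

# Split on the literal pipe and take the second field (versioned gene ID).
fields <- strsplit(tx_names, "|", fixed = TRUE)
gene_versioned <- vapply(fields, function(f) f[2], character(1))

# Strip the trailing version suffix, e.g. ".5", to match the tx2gene above.
tx2gene_sketch <- data.frame(
  Name = tx_names,
  ensembl_gene = sub("\\.[0-9]+$", "", gene_versioned),
  stringsAsFactors = FALSE
)

# List names present in a quant file but absent from the mapping.
# "ENST_hypothetical" is an invented name standing in for an unmapped transcript.
quant_names <- c(tx_names, "ENST_hypothetical")
missing_tx <- setdiff(quant_names, tx2gene_sketch$Name)
print(tx2gene_sketch$ensembl_gene)
print(missing_tx)
```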

And here's the session info:

sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS:   /igm/apps/R/R-4.0.2_install/lib64/R/lib/libRblas.so
LAPACK: /igm/apps/R/R-4.0.2_install/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] datapasta_3.1.0     tximportData_1.16.0 tximport_1.16.1     forcats_0.5.0      
 [5] stringr_1.4.0       dplyr_1.0.2         purrr_0.3.4         readr_1.4.0        
 [9] tidyr_1.1.2         tibble_3.0.4        ggplot2_3.3.2       tidyverse_1.3.0    
[13] knitr_1.30         

loaded via a namespace (and not attached):
 [1] tidyselect_1.1.0    xfun_0.19           haven_2.3.1         colorspace_2.0-0   
 [5] vctrs_0.3.5         generics_0.1.0      utf8_1.1.4          blob_1.2.1         
 [9] rlang_0.4.8         pillar_1.4.7        glue_1.4.2          withr_2.3.0        
[13] DBI_1.1.0           bit64_4.0.5         dbplyr_2.0.0        modelr_0.1.8       
[17] readxl_1.3.1        lifecycle_0.2.0     munsell_0.5.0       gtable_0.3.0       
[21] cellranger_1.1.0    rvest_0.3.6         memoise_1.1.0       parallel_4.0.2     
[25] fansi_0.4.1         highr_0.8           broom_0.7.2         Rcpp_1.0.5         
[29] clipr_0.7.1         BiocManager_1.30.10 scales_1.1.1        backports_1.2.0    
[33] vroom_1.3.2         jsonlite_1.7.1      fs_1.5.0            bit_4.0.4          
[37] hms_0.5.3           digest_0.6.27       stringi_1.5.3       grid_4.0.2         
[41] cli_2.2.0           tools_4.0.2         magrittr_2.0.1      RSQLite_2.2.1      
[45] crayon_1.3.4        pkgconfig_2.0.3     ellipsis_0.3.1      xml2_1.3.2         
[49] reprex_0.3.0        lubridate_1.7.9.2   assertthat_0.2.1    httr_1.4.2         
[53] rstudioapi_0.13     R6_2.5.0            compiler_4.0.2

score 2 · Answer 1 · 2021-03-04

Setting the dropInfReps = TRUE argument in tximport resolved the problem. 🤦‍♀️

I also needed to import over 400 files, and setting the importer to vroom::vroom() with explicit column specifications improved the speed quite a bit. Here's the example code:

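txi <- tximport(
  salmon_paths,
  type = "salmon",
  tx2gene = tx2gene,
  # Skip reading inferential replicates; this is what resolved the rowsum error
  dropInfReps = TRUE,
  # Custom importer: read each quant.sf with vroom and explicit column types.
  # The col_* helpers are namespaced so this works without readr attached.
  importer = function(x)
    vroom::vroom(
      x,
      col_types = vroom::cols(
        Name = vroom::col_character(),
        Length = vroom::col_double(),
        EffectiveLength = vroom::col_double(),
        TPM = vroom::col_double(),
        NumReads = vroom::col_double()
      )
    )
)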