tximport error with kallisto .h5 files
@stevestandage-19371
Last seen 5.3 years ago
USA / Cincinnati / Cincinnati Children'…

This is my first time analyzing RNA sequencing data for gene expression. I am trying to import count data from kallisto into DESeq2 using the tximport package, following the instructions here. After running this code:

filenames <- list.files("./Data", full.names = TRUE, pattern = "*abundance.h5")
files <- filenames %>% `names<-`(str_extract(filenames, "SWS[:digit:]*"))

txi.kallisto <- tximport(files, type = "kallisto", txOut = TRUE)

I get the following error:

Note: importing `abundance.h5` is typically faster than `abundance.tsv`
reading in files with read_tsv
1 Warning: 4894 parsing failures.
row         col  expected        actual                        file
  2 <U+0089>HDF           embedded null './Data/SWS01_abundance.h5'
  2 NA          1 columns 2 columns     './Data/SWS01_abundance.h5'
  5 <U+0089>HDF           embedded null './Data/SWS01_abundance.h5'
  9 <U+0089>HDF           embedded null './Data/SWS01_abundance.h5'
 10 <U+0089>HDF           embedded null './Data/SWS01_abundance.h5'
... ........... ......... ............. ...........................
See problems(...) for more details.

Error in tximport(files, type = "kallisto", tx2gene = tx2gene, txOut = TRUE) : 
  all(c(lengthCol, abundanceCol) %in% names(raw)) is not TRUE
In addition: Warning message:
Unnamed `col_types` should have the same length as `col_names`. Using smaller of the two.

I'm trying to import the .h5 files, but when I peek into the .tsv files, they are formatted like this:

# A tibble: 105,129 x 5
   target_id                    length eff_length est_counts   tpm
   <chr>                         <dbl>      <dbl>      <dbl> <dbl>
 1 ENSMUST00000177564.1-Trdd2       16         17          0     0
 2 ENSMUST00000196221.1-Trdd1        9         10          0     0
 3 ENSMUST00000179664.1-Trdd1       11         12          0     0
 4 ENSMUST00000178537.1-Trbd1       12         13          0     0
 5 ENSMUST00000178862.1-Trbd2       14         15          0     0
 6 ENSMUST00000179520.1-Ighd4-1     11         12          0     0
 7 ENSMUST00000179883.1-Ighd3-2     16         17          0     0
 8 ENSMUST00000195858.1-Ighd5-6     10         11          0     0
 9 ENSMUST00000179932.1-Ighd5-6     12         13          0     0
10 ENSMUST00000180001.1-Ighd2-8     17         18          0     0
# ... with 105,119 more rows

Here's my session info:

R version 3.5.2 (2018-12-20)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] DESeq2_1.22.2               SummarizedExperiment_1.12.0 DelayedArray_0.8.0          BiocParallel_1.16.5         matrixStats_0.54.0          tximport_1.10.1            
 [7] rhdf5_2.26.2                GenomicFeatures_1.34.1      AnnotationDbi_1.44.0        Biobase_2.42.0              GenomicRanges_1.34.0        GenomeInfoDb_1.18.1        
[13] IRanges_2.16.0              S4Vectors_0.20.1            BiocGenerics_0.28.0         forcats_0.3.0               stringr_1.3.1               dplyr_0.7.8                
[19] purrr_0.2.5                 readr_1.3.1                 tidyr_0.8.2                 tibble_1.4.2                ggplot2_3.1.0               tidyverse_1.2.1            

loaded via a namespace (and not attached):
 [1] colorspace_1.3-2         htmlTable_1.13.1         XVector_0.22.0           base64enc_0.1-3          rstudioapi_0.9.0         bit64_0.9-7              fansi_0.4.0             
 [8] lubridate_1.7.4          xml2_1.2.0               splines_3.5.2            geneplotter_1.60.0       knitr_1.21               Formula_1.2-3            jsonlite_1.6            
[15] Rsamtools_1.34.0         broom_0.5.1              annotate_1.60.0          cluster_2.0.7-1          compiler_3.5.2           httr_1.4.0               backports_1.1.3         
[22] assertthat_0.2.0         Matrix_1.2-15            lazyeval_0.2.1           cli_1.0.1                acepack_1.4.1            htmltools_0.3.6          prettyunits_1.0.2       
[29] tools_3.5.2              bindrcpp_0.2.2           gtable_0.2.0             glue_1.3.0               GenomeInfoDbData_1.2.0   Rcpp_1.0.0               cellranger_1.1.0        
[36] Biostrings_2.50.2        nlme_3.1-137             rtracklayer_1.42.1       xfun_0.4                 rvest_0.3.2              XML_3.98-1.16            zlibbioc_1.28.0         
[43] scales_1.0.0             hms_0.4.2                RColorBrewer_1.1-2       yaml_2.2.0               memoise_1.1.0            gridExtra_2.3            biomaRt_2.38.0          
[50] rpart_4.1-13             latticeExtra_0.6-28      stringi_1.2.4            RSQLite_2.1.1            genefilter_1.64.0        checkmate_1.8.5          rlang_0.3.1             
[57] pkgconfig_2.0.2          bitops_1.0-6             lattice_0.20-38          Rhdf5lib_1.4.2           bindr_0.1.1              GenomicAlignments_1.18.1 htmlwidgets_1.3         
[64] bit_1.1-14               tidyselect_0.2.5         plyr_1.8.4               magrittr_1.5             R6_2.3.0                 generics_0.0.2           Hmisc_4.1-1             
[71] DBI_1.0.0                pillar_1.3.1             haven_2.0.0              foreign_0.8-71           withr_2.1.2              survival_2.43-3          RCurl_1.95-4.11         
[78] nnet_7.3-12              modelr_0.1.2             crayon_1.3.4             utf8_1.1.4               progress_1.2.0           locfit_1.5-9.1           grid_3.5.2              
[85] readxl_1.2.0             data.table_1.11.8        blob_1.1.1               digest_0.6.18            xtable_1.8-3             munsell_0.5.0

Any ideas to help solve my import problem?

Thanks for your help!

deseq2 kallisto tximport • 2.9k views

When I run the commands using the abundance.tsv files:

txdb <- makeTxDbFromGFF("gencode.vM20.annotation.gff3.gz") # Pulled this file from: https://www.gencodegenes.org/mouse/release_M20.html
k <- keys(txdb, keytype = "TXNAME")
tx2gene <- select(txdb, k, "GENEID", "TXNAME")
txi.kallisto.tsv <- tximport(files, type = "kallisto", tx2gene = tx2gene, ignoreAfterBar = TRUE)

It actually reads in all of my files, but then fails with the following error:

Note: importing `abundance.h5` is typically faster than `abundance.tsv`
reading in files with read_tsv
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 
Error in summarizeToGene(txi, tx2gene, varReduce, ignoreTxVersion, ignoreAfterBar,  : 

  None of the transcripts in the quantification files are present
  in the first column of tx2gene. Check to see that you are using
  the same annotation for both.

Example IDs (file): [ENSMUST00000177564.1-Trdd2, ENSMUST00000196221.1-Trdd1, ENSMUST00000179664.1-Trdd1, ...]

Example IDs (tx2gene): [ENSMUST00000193812.1, ENSMUST00000082908.1, ENSMUST00000192857.1, ...]

  This can sometimes (not always) be fixed using 'ignoreTxVersion' or 'ignoreAfterBar'.

Thanks again!


Consider taking the advice that is printed in the error message.
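
For the IDs shown in the error message, it may help to see what each option would do. The sketch below uses base R only and assumes, per my reading of the tximport documentation, that `ignoreAfterBar` splits the ID at the first `|` and `ignoreTxVersion` splits it at the first `.`. The `sub()` calls only imitate that behaviour; they are not tximport's own code.

```r
# Example IDs copied from the error message above.
file_ids    <- c("ENSMUST00000177564.1-Trdd2",
                 "ENSMUST00000196221.1-Trdd1",
                 "ENSMUST00000179664.1-Trdd1")
tx2gene_ids <- c("ENSMUST00000193812.1",
                 "ENSMUST00000082908.1",
                 "ENSMUST00000192857.1")

# Assumed imitation of ignoreAfterBar: drop everything from the first "|".
after_bar <- sub("\\|.*$", "", file_ids)

# Assumed imitation of ignoreTxVersion: drop everything from the first ".".
file_no_version    <- sub("\\..*$", "", file_ids)
tx2gene_no_version <- sub("\\..*$", "", tx2gene_ids)

# These IDs contain no "|", so the first substitution leaves them unchanged.
print(identical(after_bar, file_ids))

# Splitting at "." removes the version and the "-Trdd2" style suffix together.
print(file_no_version)
print(tx2gene_no_version)
```

If that assumption holds, both sets of IDs reduce to bare `ENSMUST...` accessions, so `ignoreTxVersion = TRUE` looks like the option to try in place of `ignoreAfterBar = TRUE`.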

@mikelove
Last seen 39 minutes ago
United States

As currently implemented, tximport assumes that you have not renamed the output files produced by the quantification methods. I think leaving those filenames unmodified is generally a good idea, so I probably won't change this. A very recent post here showed some code to get around it by specifying your own importer.
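
Besides a custom importer, another way around it might be to put the files back under the name kallisto wrote. This is a minimal sketch in base R, assuming the renamed files follow the `<sample>_abundance.h5` pattern seen in the question. `restore_kallisto_layout` is a hypothetical helper, not part of tximport; it copies each file to `<sample>/abundance.h5` and leaves the originals in place.

```r
# Hypothetical helper (not part of tximport): copy "<sample>_abundance.h5"
# files into per-sample directories as "<sample>/abundance.h5".
restore_kallisto_layout <- function(data_dir) {
  renamed <- list.files(data_dir, pattern = "_abundance\\.h5$", full.names = TRUE)
  samples <- sub("_abundance\\.h5$", "", basename(renamed))
  targets <- file.path(data_dir, samples, "abundance.h5")

  for (i in seq_along(renamed)) {
    dir.create(dirname(targets[i]), showWarnings = FALSE, recursive = TRUE)
    # overwrite = FALSE leaves any existing abundance.h5 untouched.
    file.copy(renamed[i], targets[i], overwrite = FALSE)
  }

  # Named vector of paths, one per sample, usable as the `files` argument.
  setNames(targets, samples)
}
```

The returned vector could then be passed on, e.g. `tximport(restore_kallisto_layout("./Data"), type = "kallisto", txOut = TRUE)`.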


Thank you for your help!
