Question

discrepancy in IDs from file and tx2gene (IDs don't match)

IUH • 0

Hi,

This question has been asked a couple of times, and I have gone through the answers, but I still cannot get around the problem of IDs not matching between the file generated by kallisto (abundance.tsv) and those in tx2gene.csv. Is there an easy fix to make this work? Any help is appreciated.

I am using the makeTxDbFromGFF function in txdbmaker to create a TxDb (traverDb below). I then follow the tutorial to make the tx2gene file as follows:

k <- keys(traverDb, keytype = "TXNAME")
tx2gene <- select(traverDb, k, "GENEID", "TXNAME")

However, the IDs don't match when I run the following:

samples <- read.csv(file = "kallisto/Samples_all.csv", header = TRUE)
files <- file.path("/home/kallisto_abundances", samples$Sample_ID, "abundance.tsv")
names(files) <- c("g1", "g2", "g3", "lm1", "lm2", "lm3", "hm1", "hm2", "hm3")
all(file.exists(files))
tx2gene <- read_csv(file = "kallisto/tx2gene.csv")

txi <- tximport(files, type="kallisto", tx2gene=tx2gene)

Note: importing `abundance.h5` is typically faster than `abundance.tsv`
reading in files with read_tsv
1 2 3 4 5 6 7 8 9 
Error in .local(object, ...) : 
  None of the transcripts in the quantification files are present
  in the first column of tx2gene. Check to see that you are using
  the same annotation for both.

Example IDs (file): [jgi|Traver1|10000|rna-Trave2p4_13050, jgi|Traver1|10001|rna-Trave2p4_13051, jgi|Traver1|10002|rna-Trave2p4_13052, ...]

Example IDs (tx2gene): [jgi.p|Traver1|4, jgi.p|Traver1|5, jgi.p|Traver1|6, ...]

  This can sometimes (not always) be fixed using 'ignoreTxVersion' or 'ignoreAfterBar'

> head(tx2gene)
# A tibble: 6 × 2
  TXNAME           GENEID          
  <chr>            <chr>           
1 jgi.p|Traver1|4  jgi.p|Traver1|4 
2 jgi.p|Traver1|5  jgi.p|Traver1|5 
3 jgi.p|Traver1|6  jgi.p|Traver1|6 
4 jgi.p|Traver1|8  jgi.p|Traver1|8 
5 jgi.p|Traver1|11 jgi.p|Traver1|11
6 jgi.p|Traver1|12 jgi.p|Traver1|12
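
The mismatch can be reproduced in base R without running tximport. The sketch below uses only the example IDs printed in the error message, and it assumes that 'ignoreAfterBar' amounts to dropping everything from the first "|" onward (an assumption based on the tximport documentation, not something confirmed in this thread). The helper name strip_after_bar is hypothetical.

```r
# Example IDs copied from the tximport error message above.
quant_ids <- c("jgi|Traver1|10000|rna-Trave2p4_13050",
               "jgi|Traver1|10001|rna-Trave2p4_13051",
               "jgi|Traver1|10002|rna-Trave2p4_13052")
tx2gene_ids <- c("jgi.p|Traver1|4", "jgi.p|Traver1|5", "jgi.p|Traver1|6")

# Direct comparison: count quantification IDs that appear in tx2gene.
n_shared <- sum(quant_ids %in% tx2gene_ids)

# Hypothetical helper mimicking the ASSUMED effect of ignoreAfterBar:
# keep only the text before the first "|".
strip_after_bar <- function(ids) sub("\\|.*", "", ids)

quant_stripped <- strip_after_bar(quant_ids)
tx2gene_stripped <- strip_after_bar(tx2gene_ids)

# Count matches again after stripping.
n_shared_stripped <- sum(quant_stripped %in% tx2gene_stripped)

print(unique(quant_stripped))
print(unique(tx2gene_stripped))
```

Under that assumption, stripping reduces every ID to its prefix, and the two prefixes still differ, so 'ignoreAfterBar' alone would not be expected to resolve this case.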

> sessionInfo()

R version 4.5.1 (2025-06-13)
Platform: x86_64-pc-linux-gnu
Running under: Arch Linux

Matrix products: default
BLAS:   /usr/lib/libblas.so.3.12.0 
LAPACK: /usr/lib/liblapack.so.3.12.0  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/Chicago
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] txdbmaker_1.3.1             readr_2.1.5                 ggplot2_3.5.2               DESeq2_1.47.5              
 [5] SummarizedExperiment_1.37.0 MatrixGenerics_1.19.1       matrixStats_1.5.0           tximport_1.35.0            
 [9] GenomicFeatures_1.59.1      AnnotationDbi_1.69.1        Biobase_2.67.0              GenomicRanges_1.59.1       
[13] GenomeInfoDb_1.43.4         IRanges_2.41.3              S4Vectors_0.45.4            BiocGenerics_0.53.6        
[17] generics_0.1.4             

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.1         dplyr_1.1.4              farver_2.1.2             blob_1.2.4              
 [5] filelock_1.0.3           Biostrings_2.75.4        bitops_1.0-9             fastmap_1.2.0           
 [9] RCurl_1.98-1.17          BiocFileCache_2.15.1     GenomicAlignments_1.43.0 XML_3.99-0.18           
[13] digest_0.6.37            lifecycle_1.0.4          KEGGREST_1.47.1          RSQLite_2.4.1           
[17] magrittr_2.0.3           compiler_4.5.1           rlang_1.1.6              progress_1.2.3          
[21] tools_4.5.1              yaml_2.3.10              rtracklayer_1.67.1       knitr_1.50              
[25] prettyunits_1.2.0        S4Arrays_1.7.3           bit_4.6.0                curl_6.4.0              
[29] DelayedArray_0.33.6      xml2_1.3.8               RColorBrewer_1.1-3       abind_1.4-8             
[33] BiocParallel_1.41.5      withr_3.0.2              grid_4.5.1               colorspace_2.1-1        
[37] scales_1.4.0             dichromat_2.0-0.1        biomaRt_2.63.3           cli_3.6.5               
[41] rmarkdown_2.29           crayon_1.5.3             rstudioapi_0.17.1        httr_1.4.7              
[45] tzdb_0.5.0               rjson_0.2.23             DBI_1.2.3                cachem_1.1.0            
[49] stringr_1.5.1            parallel_4.5.1           XVector_0.47.2           restfulr_0.0.16         
[53] vctrs_0.6.5              Matrix_1.7-3             jsonlite_2.0.0           hms_1.1.3               
[57] bit64_4.6.0-1            locfit_1.5-9.12          glue_1.8.0               codetools_0.2-20        
[61] stringi_1.8.7            gtable_0.3.6             BiocIO_1.17.2            UCSC.utils_1.3.1        
[65] tibble_3.3.0             pillar_1.11.0            rappdirs_0.3.3           htmltools_0.5.8.1       
[69] GenomeInfoDbData_1.2.14  dbplyr_2.5.0             httr2_1.1.2              R6_2.6.1                
[73] vroom_1.6.5              evaluate_1.0.4           lattice_0.22-7           png_0.1-8               
[77] Rsamtools_2.23.1         memoise_2.0.1            Rcpp_1.1.0               SparseArray_1.7.7       
[81] xfun_0.52                pkgconfig_2.0.3

tximport DESeq2

updated 5 months ago by James W. MacDonald 68k • written 5 months ago by IUH • 0

score 0 · Answer 1 · 2025-07-10

I don't think there's an easy fix (depending on your definition of easy, which may vary from mine). Ideally, the transcript IDs in the transcriptome you gave kallisto would be the same as the IDs in the GTF file you are using to make the TxDb. It's odd that they are not consistent: you appear to have a transcriptome with different IDs than the GTF that is supposed to map transcripts to genes, which is suboptimal.
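
One further detail is visible in the head(tx2gene) output from the question: TXNAME and GENEID are identical in every row shown. That may indicate that the annotation does not carry separate transcript-level and gene-level IDs, although six rows are not enough to be sure. A minimal check, sketched here on just those six printed rows, could look like this:

```r
# The six rows printed by head(tx2gene) in the question.
ids <- c("jgi.p|Traver1|4", "jgi.p|Traver1|5", "jgi.p|Traver1|6",
         "jgi.p|Traver1|8", "jgi.p|Traver1|11", "jgi.p|Traver1|12")
tx2gene <- data.frame(TXNAME = ids, GENEID = ids, stringsAsFactors = FALSE)

# TRUE when every transcript ID is the same string as its gene ID.
all_identical <- all(tx2gene$TXNAME == tx2gene$GENEID)

# Number of distinct transcripts per gene; all ones means a 1:1 mapping.
tx_per_gene <- table(tx2gene$GENEID)

print(all_identical)
print(max(tx_per_gene))
```

Running the same two checks on the full tx2gene table would show whether the pattern holds beyond the first six rows.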

But anyway, if that's really the case, then you will likely need to use sed or awk with a suitable regular expression to make the IDs consistent. If you grok regular expressions, that could be easy; if not, it could be hard.
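
The same kind of regular-expression rewrite can also be done inside R with sub() instead of sed or awk. The sketch below is only an illustration of the mechanics: the helper third_field, the toy IDs, and the gene names are all invented, and it assumes the third "|"-delimited field identifies the same transcript in both ID styles. That assumption is not confirmed for the poster's data (the example IDs show 10000 in one file and 4 in the other), so it would need to be checked against the annotation first.

```r
# Hypothetical helper: return the third "|"-delimited field of each ID.
third_field <- function(ids) sub("^[^|]*\\|[^|]*\\|([^|]*).*$", "\\1", ids)

# Toy inputs invented for illustration; they are NOT from the poster's files.
quant_ids <- c("jgi|Demo1|4|rna-a", "jgi|Demo1|5|rna-b", "jgi|Demo1|7|rna-c")
tx2gene <- data.frame(
  TXNAME = c("jgi.p|Demo1|4", "jgi.p|Demo1|5", "jgi.p|Demo1|6"),
  GENEID = c("gene4", "gene5", "gene6"),
  stringsAsFactors = FALSE
)

# Look up each quantification ID's key among the tx2gene keys.
idx <- match(third_field(quant_ids), third_field(tx2gene$TXNAME))
found <- !is.na(idx)

# Rebuild tx2gene so its first column holds the quantification IDs verbatim.
tx2gene_fixed <- data.frame(
  TXNAME = quant_ids[found],
  GENEID = tx2gene$GENEID[idx[found]],
  stringsAsFactors = FALSE
)

# IDs with no counterpart should be inspected rather than silently dropped.
unmatched <- quant_ids[!found]

print(tx2gene_fixed)
print(unmatched)
```

If the key really does line up in the real files, the rebuilt table would then be passed as tx2gene to the tximport call shown in the question. A large number of unmatched IDs would be a sign that the key assumption is wrong.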