tximport error: TXNAME from Txdb.mm10 and kallisto target_id mismatch
Kaustav • 0
@kaustav-13212

Hello

I'm relatively new to Bioconductor (and R). My overall goal is to find differentially expressed genes from RNA-Seq data using DESeq2 (I switched from Cuffdiff). I have 6 samples: 2 conditions with 3 biological replicates each. I used kallisto to get counts from each fastq file, with the mm10 version of the mouse genome as the reference. I have used DESeq before, for which I made my own count matrix from 'htseq-count' output. This time, however, I am trying to use tximport to import the RNA-Seq count files generated by kallisto. I get the following error:

> txdb <- TxDb.Mmusculus.UCSC.mm10.knownGene
> k <- keys(txdb, keytype = "GENEID")
> df <- select(txdb, keys = k, keytype = "GENEID", columns = "TXNAME")
'select()' returned 1:many mapping between keys and columns
> tx2gene <- df[, 2:1]
> txi.kallisto.tsv <- tximport(fileEB, type = "kallisto", tx2gene = tx2gene)
reading in files
1 2 3 4 5 6
Error in summarizeToGene(txi, tx2gene, ignoreTxVersion, countsFromAbundance) :
 
  None of the transcripts in the quantification files are present
  in the first column of tx2gene. Check to see that you are using
  the same annotation for both.

I checked to see that the files are intact:

> all(file.exists(fileEB))

[1] TRUE

After browsing the forums, I found a suggestion to use 'ignoreTxVersion', but that didn't work either:

> txi.kallisto.tsv <- tximport(fileEB, type = "kallisto", tx2gene = tx2gene, ignoreTxVersion = TRUE)
reading in files
1 2 3 4 5 6
Error in summarizeToGene(txi, tx2gene, ignoreTxVersion, countsFromAbundance) :

  None of the transcripts in the quantification files are present
  in the first column of tx2gene. Check to see that you are using
  the same annotation for both.

It seems that there is a mismatch between the transcript names. The tx2gene data frame looks like this:

> head(tx2gene)
      TXNAME    GENEID
1 uc009veu.1 100009600
2 uc033jjg.1 100009600
3 uc012fog.1 100009609
4 uc011xhj.2 100009614
5 uc007inp.2 100009664
6 uc008vqx.2    100012

The kallisto output tsv looks like this:

target_id    length    eff_length    est_counts    tpm
AF240164    597    398    0.0494548    0.00760895
AF240165    285    86.0009    0    0
AF240166    463    264    0    0
AF240167    540    341    0    0
AF240168    671    472    0    0
AF240169    461    262    0    0
AF240170    535    336    0    0
AF240171    624    425    0    0
AF240172    683    484    0    0
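
The two ID styles shown above suggest why 'ignoreTxVersion' made no difference. As far as I understand it, that option only drops the version suffix after the "." before matching. Below is a minimal base-R sketch using IDs copied from this post; the `strip_version` helper is hypothetical and only approximates what tximport does internally.

```r
# IDs copied from the tx2gene table and the kallisto output in this thread
txdb_ids     <- c("uc009veu.1", "uc033jjg.1", "uc012fog.1")
kallisto_ids <- c("AF240164", "AF240165", "AF240166")

# Hypothetical helper (not part of tximport): drop everything from the
# first "." onward, approximating the effect of ignoreTxVersion = TRUE
strip_version <- function(ids) sub("\\..*$", "", ids)

stripped <- strip_version(txdb_ids)
print(stripped)

# Count how many kallisto IDs are found once versions are stripped
n_matched <- sum(kallisto_ids %in% stripped)
print(n_matched)
```

Stripping the version leaves UCSC-style names such as uc009veu, which still share nothing with the accession-style IDs in the kallisto file, so the two sets cannot match either way.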

When I search for the terms in the first column of the kallisto output I get:

LOCUS       AF240169                 461 bp    mRNA    linear   HTC 30-APR-2001
DEFINITION  Mus musculus MRP6 mRNA.
ACCESSION   AF240169
VERSION     AF240169.1
KEYWORDS    HTC.
SOURCE      Mus musculus (house mouse)

I can't figure out the cause of the mismatch. I definitely used the mm10 build of the mouse genome downloaded from the UCSC server. I know it may partly be an issue with kallisto, and I will post this to other forums as well, but I wanted to ask whether anyone has faced this before and knows of a solution. I checked some of the IDs for matches manually (after changing the case), but didn't find any.

I will be happy to provide other details if you ask.

> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS release 6.2 (Final)

Matrix products: default
BLAS/LAPACK: /hpc/packages/minerva-common/intel/parallel_studio_xe_2015/composer_xe_2015.0.090/mkl/lib/intel64/libmkl_gf_lp64.so

locale:
[1] C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base     

other attached packages:
[1] rhdf5_2.16.0                           
[2] readr_0.2.2                            
[3] tximport_1.0.2                         
[4] TxDb.Mmusculus.UCSC.mm10.knownGene_3.4.0
[5] GenomicFeatures_1.26.0                 
[6] AnnotationDbi_1.36.0                   
[7] Biobase_2.34.0                         
[8] GenomicRanges_1.26.1                   
[9] GenomeInfoDb_1.10.0                    
[10] IRanges_2.8.0                          
[11] S4Vectors_0.12.0                       
[12] BiocGenerics_0.20.0                     

loaded via a namespace (and not attached):
[1] Rcpp_0.12.9.4              XVector_0.12.1           
[3] GenomicAlignments_1.8.4    splines_3.4.0            
[5] zlibbioc_1.20.0            BiocParallel_1.11.2      
[7] xtable_1.8-2               lattice_0.20-35          
[9] DESeq_1.24.0               tools_3.4.0              
[11] SummarizedExperiment_1.2.3 grid_3.4.0               
[13] DBI_0.5-1                  genefilter_1.54.2        
[15] survival_2.40-1            Matrix_1.2-9             
[17] rtracklayer_1.34.2         geneplotter_1.50.0       
[19] RColorBrewer_1.1-2         bitops_1.0-6             
[21] biomaRt_2.30.0             RCurl_1.95-4.8           
[23] RSQLite_1.0.0              compiler_3.4.0           
[25] Rsamtools_1.26.1           Biostrings_2.40.2        
[27] XML_3.98-1.4               annotate_1.50.0

tximport kallisto txdb.mmusculus.ucsc.mm10.knowngene • 1.2k views
@james-w-macdonald-5106

The transcript IDs in your tx2gene are UCSC transcript IDs, but the IDs in your kallisto file are GenBank accessions. You will have better luck doing something like

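# Mouse data, so use the mouse annotation package (org.Mm.eg.db)
library(org.Mm.eg.db)
tx2gene <- select(org.Mm.eg.db, keys(org.Mm.eg.db), "ACCNUM")
# Reorder so the transcript (accession) column comes first
tx2gene <- tx2gene[,2:1]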

You could also check that the IDs in your kallisto files are present in the first column of your tx2gene:

all(<first column of kallisto file> %in% tx2gene[,1])
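
The check above can be sketched in base R as follows. This is only an illustration under stated assumptions: the toy `tx2gene` and `target_id` values are copied from this thread rather than read from real files, and `id_overlap` is a hypothetical helper, not a tximport function.

```r
# Toy stand-ins for the real objects; values copied from this thread
tx2gene <- data.frame(
  TXNAME = c("uc009veu.1", "uc033jjg.1", "uc012fog.1"),
  GENEID = c("100009600", "100009600", "100009609"),
  stringsAsFactors = FALSE
)
target_id <- c("AF240164", "AF240165", "AF240166")

# Hypothetical helper: fraction of quantified transcript IDs that appear
# in the first column of the tx2gene mapping
id_overlap <- function(ids, tx2gene) mean(ids %in% tx2gene[[1]])

overlap <- id_overlap(target_id, tx2gene)
print(overlap)

# IDs absent from the mapping; inspecting them shows the naming scheme
missing_ids <- setdiff(target_id, tx2gene[[1]])
print(head(missing_ids))
```

With real data, `target_id` would come from the first column of one kallisto abundance file. A fraction rather than a strict all() makes partial mismatches visible as well, for example when only some transcripts are missing from the annotation.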

Thank you James, it worked like a charm. Spent the last 48h on this, and finally victory!
