Hello
I'm relatively new to Bioconductor (and R). My overall goal is to identify differentially expressed genes from RNA-Seq data using DESeq2 (I switched from Cuffdiff). I have 6 samples: 2 conditions with 3 biological replicates each. I used kallisto to get counts from each fastq file with the mm10 version of the mouse genome. I have used DESeq before, for which I made my own count matrix from 'htseq-count' output. This time, however, I am trying to use tximport to import the RNA-Seq count files generated by kallisto. I get the following error:
> txdb <- TxDb.Mmusculus.UCSC.mm10.knownGene
> k <- keys(txdb, keytype = "GENEID")
> df <- select(txdb, keys = k, keytype = "GENEID", columns = "TXNAME")
'select()' returned 1:many mapping between keys and columns
> tx2gene <- df[, 2:1]
> txi.kallisto.tsv <- tximport(fileEB, type = "kallisto", tx2gene = tx2gene)
reading in files
1 2 3 4 5 6
Error in summarizeToGene(txi, tx2gene, ignoreTxVersion, countsFromAbundance) :
  None of the transcripts in the quantification files are present
  in the first column of tx2gene. Check to see that you are using
  the same annotation for both.
I checked to see that the files are intact:
> all(file.exists(fileEB)) 
[1] TRUE
After browsing through the forums, I also found a suggestion to use 'ignoreTxVersion', but that didn't work either:
> txi.kallisto.tsv <- tximport(fileEB, type = "kallisto", tx2gene = tx2gene, ignoreTxVersion = TRUE)
reading in files
1 2 3 4 5 6
Error in summarizeToGene(txi, tx2gene, ignoreTxVersion, countsFromAbundance) :
  None of the transcripts in the quantification files are present
  in the first column of tx2gene. Check to see that you are using
  the same annotation for both.
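As far as I understand it, 'ignoreTxVersion' only drops the version suffix (everything from the first "." onward) before matching, so it cannot help when the base identifiers themselves differ. Here is a minimal base-R sketch of that idea using IDs copied from this post. The sub() call is my approximation of the behaviour, not tximport's actual code, and strip_version is a hypothetical helper name:

```r
# IDs copied from the tx2gene table and the kallisto output in this post
txdb_ids     <- c("uc009veu.1", "uc033jjg.1", "uc012fog.1")
kallisto_ids <- c("AF240164", "AF240165", "AF240166")

# Hypothetical helper approximating version stripping:
# drop everything from the first "." onward
strip_version <- function(ids) sub("\\..*$", "", ids)

txdb_stripped     <- strip_version(txdb_ids)
kallisto_stripped <- strip_version(kallisto_ids)

# Count identifiers shared by the two sets after stripping
n_shared <- length(intersect(txdb_stripped, kallisto_stripped))

print(txdb_stripped)
print(n_shared)
```

If the number of shared identifiers is still zero after stripping, the two ID sets probably come from different naming schemes rather than different versions of the same one.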
It seems that there is a mismatch between the transcript names. The tx2gene dataframe looks like this:
> head(tx2gene)
      TXNAME    GENEID
1 uc009veu.1 100009600
2 uc033jjg.1 100009600
3 uc012fog.1 100009609
4 uc011xhj.2 100009614
5 uc007inp.2 100009664
6 uc008vqx.2    100012
The kallisto output tsv looks like this:
target_id    length    eff_length    est_counts    tpm
AF240164    597    398    0.0494548    0.00760895
AF240165    285    86.0009    0    0
AF240166    463    264    0    0
AF240167    540    341    0    0
AF240168    671    472    0    0
AF240169    461    262    0    0
AF240170    535    336    0    0
AF240171    624    425    0    0
AF240172    683    484    0   
When I search for the terms in the first column of the kallisto output I get:
LOCUS AF240169 461 bp mRNA linear HTC 30-APR-2001
DEFINITION Mus musculus MRP6 mRNA.
ACCESSION AF240169
VERSION AF240169.1
KEYWORDS HTC.
SOURCE Mus musculus (house mouse)
I can't figure out the cause of the mismatch. I definitely used the mm10 build of the mouse genome downloaded from the UCSC server. I know it may partly be an issue with kallisto, and I will post this to other forums as well, but I wanted to ask whether anyone has faced this before and knows of a solution. I checked some of the identifiers for matches manually (after changing the case), but didn't find any.
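The manual check described above can also be done programmatically. This is a rough sketch on toy data copied from this post; count_matches is a hypothetical helper name, and on the real data the IDs would come from the 'target_id' column of one of the kallisto tsv files:

```r
# Hypothetical helper: count how many quantified IDs appear in the first
# column of tx2gene, both exactly and ignoring case
count_matches <- function(target_ids, tx2gene) {
  ref <- as.character(tx2gene[[1]])
  ids <- as.character(target_ids)
  c(exact       = sum(ids %in% ref),
    ignore_case = sum(toupper(ids) %in% toupper(ref)))
}

# Toy data copied from this post; on the real data something like
#   target_ids <- read.delim(fileEB[1])$target_id
# would supply the IDs instead
tx2gene_toy <- data.frame(
  TXNAME = c("uc009veu.1", "uc033jjg.1", "uc012fog.1"),
  GENEID = c("100009600", "100009600", "100009609"),
  stringsAsFactors = FALSE
)
target_ids <- c("AF240164", "AF240165", "AF240166")

matches <- count_matches(target_ids, tx2gene_toy)
print(matches)
```

Zero matches in both counts would suggest that the quantification index and the TxDb use different transcript identifiers altogether, not just different case or formatting.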
I will be happy to provide other details if you ask.
> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS release 6.2 (Final)
Matrix products: default
BLAS/LAPACK: /hpc/packages/minerva-common/intel/parallel_studio_xe_2015/composer_xe_2015.0.090/mkl/lib/intel64/libmkl_gf_lp64.so
locale:
[1] C
attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base     
other attached packages:
[1] rhdf5_2.16.0                           
[2] readr_0.2.2                            
[3] tximport_1.0.2                         
[4] TxDb.Mmusculus.UCSC.mm10.knownGene_3.4.0
[5] GenomicFeatures_1.26.0                 
[6] AnnotationDbi_1.36.0                   
[7] Biobase_2.34.0                         
[8] GenomicRanges_1.26.1                   
[9] GenomeInfoDb_1.10.0                    
[10] IRanges_2.8.0                          
[11] S4Vectors_0.12.0                       
[12] BiocGenerics_0.20.0                     
loaded via a namespace (and not attached):
[1] Rcpp_0.12.9.4              XVector_0.12.1           
[3] GenomicAlignments_1.8.4    splines_3.4.0            
[5] zlibbioc_1.20.0            BiocParallel_1.11.2      
[7] xtable_1.8-2               lattice_0.20-35          
[9] DESeq_1.24.0               tools_3.4.0              
[11] SummarizedExperiment_1.2.3 grid_3.4.0               
[13] DBI_0.5-1                  genefilter_1.54.2        
[15] survival_2.40-1            Matrix_1.2-9             
[17] rtracklayer_1.34.2         geneplotter_1.50.0       
[19] RColorBrewer_1.1-2         bitops_1.0-6             
[21] biomaRt_2.30.0             RCurl_1.95-4.8           
[23] RSQLite_1.0.0              compiler_3.4.0           
[25] Rsamtools_1.26.1           Biostrings_2.40.2        
[27] XML_3.98-1.4               annotate_1.50.0

Thank you James, it worked like a charm. Spent the last 48h on this, and finally victory!