I am currently in the process importing Kallisto data for a mouse RNAseq experiment using the tximport
package and am creating tx2gene
objects to use for mapping. I decided to try three different annotation packages: (1) org.Mm.eg.db
, (2) EnsDb.Mmusculus.v79
, and (3) biomaRt
. However, when I ran the code (shown below) to extract the transcript to gene mappings, I only returned ~27k transcripts when using org.Mm.eg.db
, but returned ~104k when using EnsDb.Mmusculus.v79
and ~280k when using biomaRt
. Why are there such drastic differences?
I understand that biomaRt
will always be the most up to date since it is pulled directly from Ensembl, but I was curious why org.Mm.eg.db
had so few transcripts returned. Although the code below pertains to creating tx2gene
objects, my question is more broadly related to annotation packages in general.
Code for getting transcripts with org.Mm.eg.db
:
# Using org.Mm.eg.db
keys <- keys(org.Mm.eg.db, keytype = 'ENSEMBLTRANS')
tx2gene.org <- select(org.Mm.eg.db, keys, columns = c('ENSEMBLTRANS', 'SYMBOL'), keytype = 'ENSEMBLTRANS') %>%
dplyr::rename(target_id = ENSEMBLTRANS, gene_name = SYMBOL)
length(keys)
dim(tx2gene.org)
> length(keys)
[1] 26481
> dim(tx2gene.org)
[1] 26909 2
Code for getting transcripts with EnsDb.Mmusculus.v79
:
# Using EnsDb.Mmusculus.v79
tx2gene.ensdb <- transcripts(EnsDb.Mmusculus.v79, columns=c("tx_id", "gene_name")) %>%
dplyr::as_tibble() %>%
dplyr::rename(target_id = tx_id) %>%
dplyr::select(target_id, gene_name)
dim(tx2gene.ensdb)
> dim(tx2gene.ensdb)
[1] 104129 2
`
Code for getting transcripts with biomaRt
:
# Using biomaRt
myMart <- useEnsembl(biomart = 'genes', dataset = 'mmusculus_gene_ensembl', mirror = 'useast')
tx2gene.bm <- getBM(mart = myMart, attributes = c('ensembl_transcript_id_version', 'external_gene_name')) %>%
as_tibble() %>%
dplyr::rename(target_id = ensembl_transcript_id_version,
gene_name = external_gene_name)
dim(tx2gene.bm)
> dim(tx2gene.bm)
[1] 278375 2
> sessionInfo()
R version 4.4.2 (2024-10-31)
Platform: aarch64-apple-darwin20
Running under: macOS Sequoia 15.2
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/New_York
tzcode source: internal
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] biomaRt_2.62.0 EnsDb.Mmusculus.v79_2.99.0 org.Mm.eg.db_3.20.0 enrichplot_1.26.5
[5] msigdbr_7.5.1 clusterProfiler_4.14.4 gprofiler2_0.2.3 GSVA_2.0.4
[9] GSEABase_1.68.0 graph_1.84.0 annotate_1.84.0 XML_3.99-0.18
[13] gplots_3.2.0 RColorBrewer_1.1-3 d3heatmap_0.9.0 DT_0.33
[17] gt_0.11.1 plotly_4.10.4 cowplot_1.1.3 edgeR_4.4.1
[21] limma_3.62.1 datapasta_3.1.0 beepr_2.0 EnsDb.Hsapiens.v86_2.99.0
[25] ensembldb_2.30.0 AnnotationFilter_1.30.0 GenomicFeatures_1.58.0 AnnotationDbi_1.68.0
[29] Biobase_2.66.0 GenomicRanges_1.58.0 GenomeInfoDb_1.42.1 IRanges_2.40.1
[33] S4Vectors_0.44.0 BiocGenerics_0.52.0 tximport_1.34.0 lubridate_1.9.4
[37] forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4 purrr_1.0.2
[41] readr_2.1.5 tidyr_1.3.1 tibble_3.2.1 ggplot2_3.5.1
[45] tidyverse_2.0.0 rhdf5_2.50.1
loaded via a namespace (and not attached):
[1] splines_4.4.2 later_1.4.1 BiocIO_1.16.0 filelock_1.0.3
[5] ggplotify_0.1.2 bitops_1.0-9 R.oo_1.27.0 httr2_1.0.7
[9] lifecycle_1.0.4 lattice_0.22-6 magrittr_2.0.3 yaml_2.3.10
[13] ggtangle_0.0.6 httpuv_1.6.15 DBI_1.2.3 abind_1.4-8
[17] pkgload_1.4.0 zlibbioc_1.52.0 audio_0.1-11 R.utils_2.12.3
[21] RCurl_1.98-1.16 yulab.utils_0.1.9 rappdirs_0.3.3 GenomeInfoDbData_1.2.13
[25] ggrepel_0.9.6 irlba_2.3.5.1 tidytree_0.4.6 codetools_0.2-20
[29] DelayedArray_0.32.0 DOSE_4.0.0 xml2_1.3.6 tidyselect_1.2.1
[33] aplot_0.2.4 farver_2.1.2 UCSC.utils_1.2.0 ScaledMatrix_1.14.0
[37] BiocFileCache_2.14.0 matrixStats_1.5.0 base64enc_0.1-3 GenomicAlignments_1.42.0
[41] jsonlite_1.8.9 progress_1.2.3 tools_4.4.2 treeio_1.30.0
[45] Rcpp_1.0.13-1 glue_1.8.0 SparseArray_1.6.0 qvalue_2.38.0
[49] MatrixGenerics_1.18.0 HDF5Array_1.34.0 withr_3.0.2 BiocManager_1.30.25
[53] fastmap_1.2.0 rhdf5filters_1.18.0 caTools_1.18.3 digest_0.6.37
[57] rsvd_1.0.5 gridGraphics_0.5-1 timechange_0.3.0 R6_2.5.1
[61] mime_0.12 colorspace_2.1-2 GO.db_3.20.0 gtools_3.9.5
[65] RSQLite_2.3.9 R.methodsS3_1.8.2 utf8_1.2.4 generics_0.1.3
[69] data.table_1.16.4 rtracklayer_1.66.0 prettyunits_1.2.0 httr_1.4.7
[73] htmlwidgets_1.6.4 S4Arrays_1.6.0 pkgconfig_2.0.3 gtable_0.3.6
[77] blob_1.2.4 SingleCellExperiment_1.28.1 XVector_0.46.0 htmltools_0.5.8.1
[81] fgsea_1.32.0 ProtGenerics_1.38.0 scales_1.3.0 png_0.1-8
[85] SpatialExperiment_1.16.0 ggfun_0.1.8 rstudioapi_0.17.1 tzdb_0.4.0
[89] reshape2_1.4.4 rjson_0.2.23 nlme_3.1-166 curl_6.1.0
[93] cachem_1.1.0 KernSmooth_2.23-26 parallel_4.4.2 restfulr_0.0.15
[97] pillar_1.10.1 grid_4.4.2 vctrs_0.6.5 promises_1.3.2
[101] BiocSingular_1.22.0 dbplyr_2.5.0 beachmat_2.22.0 xtable_1.8-6
[105] magick_2.8.5 cli_3.6.3 locfit_1.5-9.10 compiler_4.4.2
[109] Rsamtools_2.22.0 rlang_1.1.4 crayon_1.5.3 plyr_1.8.9
[113] fs_1.6.5 stringi_1.8.4 viridisLite_0.4.2 BiocParallel_1.40.0
[117] babelgene_22.9 munsell_0.5.1 Biostrings_2.74.1 lazyeval_0.2.2
[121] GOSemSim_2.32.0 Matrix_1.7-1 patchwork_1.3.0 hms_1.1.3
[125] sparseMatrixStats_1.18.0 bit64_4.5.2 Rhdf5lib_1.28.0 KEGGREST_1.46.0
[129] statmod_1.5.0 shiny_1.10.0 SummarizedExperiment_1.36.0 igraph_2.1.2
[133] memoise_2.0.1 ggtree_3.14.0 fastmatch_1.1-6 bit_4.5.0.1
[137] gson_0.1.0 ape_5.8-1
Just an update on this: I tried
tximport
withtx2gene
objects from the three annotation packages and from the GTF file itself. The best performance was using the GTF file itself, thank you ATpoint for the suggestion. This begs the question: why use annotation packages at all, when the GTF file gives the best performance?It makes sense that the GTF file would yield the best results as you create the kallisto index from the fasta file that is linked to that GTF. Should this be the standard for annotating kallisto inputs with
tximport
?Code creating
tx2gene
object from GTF file: