I am currently in the process importing Kallisto data for a mouse RNAseq experiment using the tximport
package and am creating tx2gene
objects to use for mapping. I decided to try three different annotation packages: (1) org.Mm.eg.db
, (2) EnsDb.Mmusculus.v79
, and (3) biomaRt
. However, when I ran the code (shown below) to extract the transcript to gene mappings, I only returned ~27k transcripts when using org.Mm.eg.db
, but returned ~104k when using EnsDb.Mmusculus.v79
and ~280k when using biomaRt
. Why are there such drastic differences?
I understand that biomaRt
will always be the most up to date since it is pulled directly from Ensembl, but I was curious why org.Mm.eg.db
had so few transcripts returned. Although the code below pertains to creating tx2gene
objects, my question is more broadly related to annotation packages in general.
Code for getting transcripts with org.Mm.eg.db
:
# Using org.Mm.eg.db
keys <- keys(org.Mm.eg.db, keytype = 'ENSEMBLTRANS')
tx2gene.org <- select(org.Mm.eg.db, keys, columns = c('ENSEMBLTRANS', 'SYMBOL'), keytype = 'ENSEMBLTRANS') %>%
dplyr::rename(target_id = ENSEMBLTRANS, gene_name = SYMBOL)
length(keys)
dim(tx2gene.org)
> length(keys)
[1] 26481
> dim(tx2gene.org)
[1] 26909 2
Code for getting transcripts with EnsDb.Mmusculus.v79
:
# Using EnsDb.Mmusculus.v79
tx2gene.ensdb <- transcripts(EnsDb.Mmusculus.v79, columns=c("tx_id", "gene_name")) %>%
dplyr::as_tibble() %>%
dplyr::rename(target_id = tx_id) %>%
dplyr::select(target_id, gene_name)
dim(tx2gene.ensdb)
> dim(tx2gene.ensdb)
[1] 104129 2
`
Code for getting transcripts with biomaRt
:
# Using biomaRt
myMart <- useEnsembl(biomart = 'genes', dataset = 'mmusculus_gene_ensembl', mirror = 'useast')
tx2gene.bm <- getBM(mart = myMart, attributes = c('ensembl_transcript_id_version', 'external_gene_name')) %>%
as_tibble() %>%
dplyr::rename(target_id = ensembl_transcript_id_version,
gene_name = external_gene_name)
dim(tx2gene.bm)
> dim(tx2gene.bm)
[1] 278375 2
> sessionInfo()
R version 4.4.2 (2024-10-31)
Platform: aarch64-apple-darwin20
Running under: macOS Sequoia 15.2
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/New_York
tzcode source: internal
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] biomaRt_2.62.0 EnsDb.Mmusculus.v79_2.99.0 org.Mm.eg.db_3.20.0 enrichplot_1.26.5
[5] msigdbr_7.5.1 clusterProfiler_4.14.4 gprofiler2_0.2.3 GSVA_2.0.4
[9] GSEABase_1.68.0 graph_1.84.0 annotate_1.84.0 XML_3.99-0.18
[13] gplots_3.2.0 RColorBrewer_1.1-3 d3heatmap_0.9.0 DT_0.33
[17] gt_0.11.1 plotly_4.10.4 cowplot_1.1.3 edgeR_4.4.1
[21] limma_3.62.1 datapasta_3.1.0 beepr_2.0 EnsDb.Hsapiens.v86_2.99.0
[25] ensembldb_2.30.0 AnnotationFilter_1.30.0 GenomicFeatures_1.58.0 AnnotationDbi_1.68.0
[29] Biobase_2.66.0 GenomicRanges_1.58.0 GenomeInfoDb_1.42.1 IRanges_2.40.1
[33] S4Vectors_0.44.0 BiocGenerics_0.52.0 tximport_1.34.0 lubridate_1.9.4
[37] forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4 purrr_1.0.2
[41] readr_2.1.5 tidyr_1.3.1 tibble_3.2.1 ggplot2_3.5.1
[45] tidyverse_2.0.0 rhdf5_2.50.1
loaded via a namespace (and not attached):
[1] splines_4.4.2 later_1.4.1 BiocIO_1.16.0 filelock_1.0.3
[5] ggplotify_0.1.2 bitops_1.0-9 R.oo_1.27.0 httr2_1.0.7
[9] lifecycle_1.0.4 lattice_0.22-6 magrittr_2.0.3 yaml_2.3.10
[13] ggtangle_0.0.6 httpuv_1.6.15 DBI_1.2.3 abind_1.4-8
[17] pkgload_1.4.0 zlibbioc_1.52.0 audio_0.1-11 R.utils_2.12.3
[21] RCurl_1.98-1.16 yulab.utils_0.1.9 rappdirs_0.3.3 GenomeInfoDbData_1.2.13
[25] ggrepel_0.9.6 irlba_2.3.5.1 tidytree_0.4.6 codetools_0.2-20
[29] DelayedArray_0.32.0 DOSE_4.0.0 xml2_1.3.6 tidyselect_1.2.1
[33] aplot_0.2.4 farver_2.1.2 UCSC.utils_1.2.0 ScaledMatrix_1.14.0
[37] BiocFileCache_2.14.0 matrixStats_1.5.0 base64enc_0.1-3 GenomicAlignments_1.42.0
[41] jsonlite_1.8.9 progress_1.2.3 tools_4.4.2 treeio_1.30.0
[45] Rcpp_1.0.13-1 glue_1.8.0 SparseArray_1.6.0 qvalue_2.38.0
[49] MatrixGenerics_1.18.0 HDF5Array_1.34.0 withr_3.0.2 BiocManager_1.30.25
[53] fastmap_1.2.0 rhdf5filters_1.18.0 caTools_1.18.3 digest_0.6.37
[57] rsvd_1.0.5 gridGraphics_0.5-1 timechange_0.3.0 R6_2.5.1
[61] mime_0.12 colorspace_2.1-2 GO.db_3.20.0 gtools_3.9.5
[65] RSQLite_2.3.9 R.methodsS3_1.8.2 utf8_1.2.4 generics_0.1.3
[69] data.table_1.16.4 rtracklayer_1.66.0 prettyunits_1.2.0 httr_1.4.7
[73] htmlwidgets_1.6.4 S4Arrays_1.6.0 pkgconfig_2.0.3 gtable_0.3.6
[77] blob_1.2.4 SingleCellExperiment_1.28.1 XVector_0.46.0 htmltools_0.5.8.1
[81] fgsea_1.32.0 ProtGenerics_1.38.0 scales_1.3.0 png_0.1-8
[85] SpatialExperiment_1.16.0 ggfun_0.1.8 rstudioapi_0.17.1 tzdb_0.4.0
[89] reshape2_1.4.4 rjson_0.2.23 nlme_3.1-166 curl_6.1.0
[93] cachem_1.1.0 KernSmooth_2.23-26 parallel_4.4.2 restfulr_0.0.15
[97] pillar_1.10.1 grid_4.4.2 vctrs_0.6.5 promises_1.3.2
[101] BiocSingular_1.22.0 dbplyr_2.5.0 beachmat_2.22.0 xtable_1.8-6
[105] magick_2.8.5 cli_3.6.3 locfit_1.5-9.10 compiler_4.4.2
[109] Rsamtools_2.22.0 rlang_1.1.4 crayon_1.5.3 plyr_1.8.9
[113] fs_1.6.5 stringi_1.8.4 viridisLite_0.4.2 BiocParallel_1.40.0
[117] babelgene_22.9 munsell_0.5.1 Biostrings_2.74.1 lazyeval_0.2.2
[121] GOSemSim_2.32.0 Matrix_1.7-1 patchwork_1.3.0 hms_1.1.3
[125] sparseMatrixStats_1.18.0 bit64_4.5.2 Rhdf5lib_1.28.0 KEGGREST_1.46.0
[129] statmod_1.5.0 shiny_1.10.0 SummarizedExperiment_1.36.0 igraph_2.1.2
[133] memoise_2.0.1 ggtree_3.14.0 fastmatch_1.1-6 bit_4.5.0.1
[137] gson_0.1.0 ape_5.8-1
Just an update on this: I tried
tximport
withtx2gene
objects from the three annotation packages and from the GTF file itself. The best performance was using the GTF file itself, thank you ATpoint for the suggestion. This begs the question: why use annotation packages at all, when the GTF file gives the best performance?It makes sense that the GTF file would yield the best results as you create the kallisto index from the fasta file that is linked to that GTF. Should this be the standard for annotating kallisto inputs with
tximport
?Code creating
tx2gene
object from GTF file:Annotation packages have their value. They are a convenient and stable way to make annotations accessible inside R, they are version-controlled etc. YOu have a corner case now in which, together with how you preprocessed your data, got better performance with another annotation source, but that's not the annotation package fault. If you had built your kallisto index with transcripts retrieved in some way via an annotation package then this should give same performance as you experience now with the gtf.
It's not the version control in R/Bioconductor though. It's the version of the genome. OP used Ensembl build 113 to align and then noted that using an
EnsDb
based on Ensembl build 79, which is ten years old and based on GRCm38 instead of GRCm39 didn't match well. I can't get the Biomart server to load, but I'd bet a dollar it's still on version 112, which will result in some mismatching as well.There are more recent vintage
EnsDb
packages on theAnnotationHub
that would be more useful (up to Ensembl build 112).Given that the genome is a moving target, it's critical to ensure that whatever is used for alignment and whatever is used for downstream annotation are from the same release.