Dear all,
I'm using TxDb packages to annotate certain genomic loci to genes. While inspecting my results, I found that a lot of my loci mapped to the same genes. This was suspicious, as their coordinates were very separated, so I inspected the TxDb from which I got the gene information. To get the coordinates of genes in mm10 I'm using the TxDb like so:
> library(TxDb.Mmusculus.UCSC.mm10.knownGene)
> mm10 <- TxDb.Mmusculus.UCSC.mm10.knownGene
> mm10_genes <- genes(mm10, columns = c("gene_id"), single.strand.genes.only = TRUE)
Example of output:
> mm10_genes
GRanges object with 24528 ranges and 1 metadata column:
seqnames ranges strand | gene_id
<Rle> <IRanges> <Rle> | <character>
100009600 chr9 21062393-21073096 - | 100009600
100009609 chr7 84935565-84964115 - | 100009609
100009614 chr10 77711457-77712009 + | 100009614
100009664 chr11 45808087-45841171 + | 100009664
100012 chr4 144157557-144162663 - | 100012
... ... ... ... . ...
99889 chr3 84496093-85887516 - | 99889
99890 chr3 110246109-110250998 - | 99890
99899 chr3 151730922-151749960 - | 99899
99929 chr3 65528410-65555518 + | 99929
99982 chr4 136550540-136602723 - | 99982
-------
seqinfo: 66 sequences (1 circular) from mm10 genome
I found that there are many genes with giant sizes:
> head(sort(width(mm10_genes), decreasing = T), 10)
[1] 35832227 27094601 20008339 13417859 5893261 4928863 4839115 4715213 4714606 4686084
Now, see for example these genes:
> mm10_genes[c("102465114","108148")]
GRanges object with 2 ranges and 1 metadata column:
seqnames ranges strand | gene_id
<Rle> <IRanges> <Rle> | <character>
102465114 chr19 24942235-60774461 - | 102465114
108148 chr8 119910841-124345724 + | 108148
-------
seqinfo: 66 sequences (1 circular) from mm10 genome
> width(mm10_genes[c("102465114","108148")])
[1] 35832227 4434884
If I look them up the coordinates for 102465114 are 19:60774329-60774397. So it is quite short (it is a miRNA, ~60 bp). For 108148 they are 8:124231394-124345722 (100 kb long, a big gene. It is Galnt2).
I feel like I'm missing something here. Why are the coordinates for these genes so big in the TxDb database?
> sessionInfo()
R version 3.6.2 (2019-12-12)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.6 LTS
Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.0
LAPACK: /usr/lib/lapack/liblapack.so.3.0
locale:
[1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C LC_TIME=es_ES.UTF-8 LC_COLLATE=en_GB.UTF-8 LC_MONETARY=es_ES.UTF-8
[6] LC_MESSAGES=en_GB.UTF-8 LC_PAPER=es_ES.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] TxDb.Hsapiens.UCSC.hg38.knownGene_3.10.0 TxDb.Mmusculus.UCSC.mm10.knownGene_3.10.0 GenomicFeatures_1.38.2
[4] AnnotationDbi_1.48.0 Biobase_2.46.0 GenomicRanges_1.38.0
[7] GenomeInfoDb_1.22.0 IRanges_2.20.2 S4Vectors_0.24.3
[10] BiocGenerics_0.32.0
loaded via a namespace (and not attached):
[1] SummarizedExperiment_1.16.1 progress_1.2.2 tidyselect_1.0.0 purrr_0.3.3
[5] lattice_0.20-40 vctrs_0.2.3 BiocFileCache_1.10.2 rtracklayer_1.46.0
[9] yaml_2.2.1 blob_1.2.1 XML_3.99-0.3 rlang_0.4.4
[13] pillar_1.4.3 glue_1.3.1 DBI_1.1.0 BiocParallel_1.20.1
[17] rappdirs_0.3.1 bit64_0.9-7 dbplyr_1.4.2 matrixStats_0.55.0
[21] GenomeInfoDbData_1.2.2 stringr_1.4.0 zlibbioc_1.32.0 Biostrings_2.54.0
[25] memoise_1.1.0 biomaRt_2.42.0 curl_4.3 Rcpp_1.0.3
[29] openssl_1.4.1 DelayedArray_0.12.2 XVector_0.26.0 bit_1.1-15.2
[33] Rsamtools_2.2.3 hms_0.5.3 askpass_1.1 digest_0.6.25
[37] stringi_1.4.6 dplyr_0.8.4 grid_3.6.2 tools_3.6.2
[41] bitops_1.0-6 magrittr_1.5 RCurl_1.98-1.1 RSQLite_2.2.0
[45] tibble_2.1.3 crayon_1.3.4 pkgconfig_2.0.3 Matrix_1.2-18
[49] prettyunits_1.1.1 assertthat_0.2.1 httr_1.4.1 rstudioapi_0.11
[53] R6_2.4.1 GenomicAlignments_1.22.1 compiler_3.6.2
Or sometimes they are unprocessed pseudogenes and stuff
Thank you for the answer, it was very clarifying. I have 2 doubts then:
A question: I want to annotate loci which fall inside genes. Thus, is it better to get the transcripts starts and ends and use those (collapsed with
GenomicRanges::reduce()
or a similar method) to map my loci to genes? Because as seen in your examples thestart
andend
of the gene coordinates span a large regions which include sequence where there is no gene.An observation: I've been looking at some of the genes and, although this is hand-picking, I seem to find that 1) when looking at the gene in the UCSC Genome Browser, one or more of the coordinates does not map to transcripts of the queried gene. 2) The ensembl annotation package seems to agree more with the UCSC browser annotation.
For example, the 668963 gene:
I find that in the UCSC browser the first coordinates map to the 668963 gene (Btbd35f20), while the second coordinates map to a "contig" named CT867961.1, which I found contains at those coordinates the 100040448 gene (Btbd35f19), although this one does not appear in the TxDb:
While this does not happen when looking for the gene in the Ensembl database:
(Although the second gene at the second coordinate is also not present in that database:)
So to sum up, because I don't want to be overly confusing with this:
I apologize for the vagueness in the observations. Basically I'm asking if you had any "general" remarks about the usual practices when retrieving these types of coordinates.
So the newest version of mm10 data on UCSC are now based on Gencode rather than what they used to do in the past, where they did the mapping using mostly NCBI data. Put another way, the knownGene table used to have these weird transcript IDs that UCSC generated, and those mapped to NCBI Gene IDs. But now they use Genecode, which is based on Ensembl mappings, so the mapping is Ensembl transcript -> NCBI Gene ID.
And that is where the problem arises, because NCBI and EBI don't always agree as to what is a gene, and where the genes live. So you can get these issues. All things equal we should probably stop using the NCBI Gene IDs as the central ID for these things because it's a problem, but that's way above my (non-existent) pay grade.
Anyway, looking at the UCSC genome browser won't tell you much, because it's all +/- Gencode based now. You have to query the underlying database directly. So as an example, you mention NCBI Gene ID 668963. Let's look that up.
So you can see what we provide is consistent with what UCSC provides as well, and the infelicities arise because of the mapping between the two annotation services.
And as a bit of a followup, do note:
Thank you very much for going into detail and the comprehensive answer!