Hi,
I noticed some unexpected behaviour with the annotation package TxDb.Hsapiens.UCSC.hg38.knownGene: several genes on chr21 are unexpectedly long, very different from their corresponding NCBI entries. These genes are CBS (ENTREZID=875), CRYAA (1409), U2AF1 (7307), CBSL (102724560), U2AF1L5 (102724594), CRYAA2 (102724652), SMIM11B (102723553), SMIM34A (388820), FAM243A (101928147), MIR8069-1 (102466252) & LOC101928269 (101928269).
For example, the package sets the CBS gene locus from 6,444,869 to 43,076,378 (length of 36,631,510, see below), while NCBI shows a locus between 43,053,190 and 43,076,861 (length of 23,672).
Could anyone let me know what I am doing wrong?
Thanks, Eric
> library(TxDb.Hsapiens.UCSC.hg38.knownGene)
> g <- genes(TxDb.Hsapiens.UCSC.hg38.knownGene)
> g["875"]
GRanges object with 1 range and 1 metadata column:
seqnames ranges strand | gene_id
<Rle> <IRanges> <Rle> | <character>
875 chr21 6444869-43076378 - | 875
-------
seqinfo: 595 sequences (1 circular) from hg38 genome
> sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.6 LTS
Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] TxDb.Hsapiens.UCSC.hg38.knownGene_3.10.0
[2] GenomicFeatures_1.38.1
[3] AnnotationDbi_1.48.0
[4] Biobase_2.46.0
[5] GenomicRanges_1.38.0
[6] GenomeInfoDb_1.22.0
[7] IRanges_2.20.2
[8] S4Vectors_0.24.3
[9] BiocGenerics_0.32.0
loaded via a namespace (and not attached):
[1] SummarizedExperiment_1.16.1 progress_1.2.2
[3] tidyselect_1.0.0 purrr_0.3.3
[5] lattice_0.20-38 vctrs_0.2.2
[7] BiocFileCache_1.10.2 rtracklayer_1.46.0
[9] blob_1.2.1 XML_3.99-0.3
[11] rlang_0.4.4 pillar_1.4.3
[13] glue_1.3.1 DBI_1.1.0
[15] BiocParallel_1.20.1 rappdirs_0.3.1
[17] bit64_0.9-7 dbplyr_1.4.2
[19] matrixStats_0.55.0 GenomeInfoDbData_1.2.2
[21] stringr_1.4.0 zlibbioc_1.32.0
[23] Biostrings_2.54.0 memoise_1.1.0
[25] biomaRt_2.42.0 curl_4.3
[27] Rcpp_1.0.3 openssl_1.4.1
[29] DelayedArray_0.12.2 XVector_0.26.0
[31] bit_1.1-15.2 Rsamtools_2.2.1
[33] hms_0.5.3 askpass_1.1
[35] digest_0.6.24 stringi_1.4.5
[37] dplyr_0.8.4 grid_3.6.3
[39] tools_3.6.3 bitops_1.0-6
[41] magrittr_1.5 RCurl_1.98-1.1
[43] RSQLite_2.2.0 tibble_2.1.3
[45] crayon_1.3.4 pkgconfig_2.0.3
[47] Matrix_1.2-18 prettyunits_1.1.1
[49] assertthat_0.2.1 httr_1.4.1
[51] R6_2.4.1 GenomicAlignments_1.22.1
[53] compiler_3.6.3
Thanks a lot for your answer! There is still something unclear to me:
The transcript ids for CBS (ENTREZID=875) and for its paralog CBSL (102724560) in TxDb.Hsapiens.UCSC.hg38.knownGene appear to be mixed.
The situation is similar with the pair of paralog genes U2AF1 (7307) & U2AF1L5 (102724594), and for the paralog pair CRYAA (1409) & CRYAA2 (102724652).
How would you recommend to resolve these kind of situations?
Thanks!
I didn't check your code because I'm not from the planet tidyverse. However, it seems there might be an error?
Thanks for your reply, (& sorry I used tidyverse).
I see that when the transcripts are mapped to gene symbols throught
transcriptsBy
, all transcripts of gene875
indeed map toCBS
, and transcripts of102724560
map toCBSL
.However, this is not the case when the mapping between
tx_name
and gene symbol is done through ENSEMBL viabiomaRt
. The results of the mapping is below:It looks as if there is a contradiction between the transcripts definitions from ENSEMBL & from the package. I am not sure how to deal with that, any advice is welcome.
Thanks again, Eric
My advice is that you should stick with either NCBI mappings or EBI/EMBL mappings. When you have two different groups defining something as complex as the human genome, you are bound to come up with differences of opinion as to where genes are, which transcripts belong with which gene, etc. You can come up with any number of instances of disagreement, and as a consumer of their product it is probably not that interesting to know why they disagree, but it is worthwhile to know that they do, and cross mappings are fraught.