At the GitHub of clusterProfiler
a question was posted on the use of symbols of mitochondrial genes.
It basically boils down to the question whether a hyphen in a symbol may cause a problem with the org.Hs.eg.db
, preventing it from being present in the database.
To illustrate the issue:
A set of 4 'official' mitochondrial symbols:
mito.hgnc <- c("MT-ATP6", "MT-ATP8", "MT-CO1", "MT-CYB")
When the OrgDb
is queried to retrieve the corresponding ENTREZID
nothing is found....:
> library(org.Hs.eg.db)
> AnnotationDbi::select(org.Hs.eg.db, keys = mito.hgnc, keytype = "SYMBOL",
+ columns = c("ENTREZID", "SYMBOL", "GENENAME") )
Error in .testForValidKeys(x, keys, keytype, fks) :
None of the keys entered are valid keys for 'SYMBOL'. Please use the keys method to see a listing of valid arguments.
>
The same when ALIAS
is used...
> AnnotationDbi::select(org.Hs.eg.db, keys = mito.hgnc, keytype = "ALIAS",
+ columns = c("ENTREZID", "SYMBOL", "GENENAME") )
Error in .testForValidKeys(x, keys, keytype, fks) :
None of the keys entered are valid keys for 'ALIAS'. Please use the keys method to see a listing of valid arguments.
>
Using a code snippet to retrieve all mitochondrial genes (previously posted on this forum by James)
## check all genes on chromosome M
> z <- unlist(as.list(org.Hs.egCHR))
> mito.egids <- names(z)[z %in% "MT"]
> mito.egids
[1] "4508" "4509" "4511" "4512" "4513" "4514" "4519" "4535" "4536" "4537"
[11] "4538" "4539" "4540" "4541" "4549" "4550" "4553" "4555" "4556" "4558"
[21] "4563" "4564" "4565" "4566" "4567" "4568" "4569" "4570" "4571" "4572"
[31] "4573" "4574" "4575" "4576" "4577" "4578" "4579"
>
> head( AnnotationDbi::select(org.Hs.eg.db, keys = mito.egids, keytype = "ENTREZID",
+ columns = c("ENTREZID", "SYMBOL", "GENENAME") ) )
'select()' returned 1:1 mapping between keys and columns
ENTREZID SYMBOL GENENAME
1 4508 ATP6 ATP synthase F0 subunit 6
2 4509 ATP8 ATP synthase F0 subunit 8
3 4511 TRNC tRNA-Cys
4 4512 COX1 cytochrome c oxidase subunit I
5 4513 COX2 cytochrome c oxidase subunit II
6 4514 COX3 cytochrome c oxidase subunit III
>
Mmm, SYMBOL
are ATP6
, ATP8
, etc. No prefix with MT...
Yet, at both NCBI and HGNC official symbols for all 4 input symbols are with MT prefix...?
(... and official symbol is MT-CO1
rather than COX1
...)
MT-ATP6
https://www.ncbi.nlm.nih.gov/gene/4508
https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/HGNC:7414
MT-ATP8
https://www.ncbi.nlm.nih.gov/gene/4509
https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/HGNC:7415
MT-CO1
https://www.ncbi.nlm.nih.gov/gene/4512
https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/HGNC:7419
MT-CYB
https://www.ncbi.nlm.nih.gov/gene/4519
https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/HGNC:7427
Thus, is this expected behavior, or maybe not? Any insights would be appreciated.
G
BTW, I understand that the OrgDb
are generated a while before a new Bioconductor release, and at NCBI it is stated that the last info update was on October 28, but ~2 weeks ago he prefix MT was already used (and maybe even long before then, but I don't know that...).
> sessionInfo()
R version 4.4.2 (2024-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 10 x64 (build 19042)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.utf8
[2] LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8
time zone: Europe/Amsterdam
tzcode source: internal
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] org.Hs.eg.db_3.20.0 AnnotationDbi_1.68.0 IRanges_2.40.0
[4] S4Vectors_0.44.0 Biobase_2.66.0 BiocGenerics_0.52.0
loaded via a namespace (and not attached):
[1] crayon_1.5.3 vctrs_0.6.5 httr_1.4.7
[4] cli_3.6.3 rlang_1.1.4 DBI_1.2.3
[7] png_0.1-8 UCSC.utils_1.2.0 jsonlite_1.8.9
[10] bit_4.5.0 Biostrings_2.74.0 KEGGREST_1.46.0
[13] fastmap_1.2.0 GenomeInfoDb_1.42.0 memoise_2.0.1
[16] compiler_4.4.2 RSQLite_2.3.7 blob_1.2.4
[19] pkgconfig_2.0.3 XVector_0.46.0 R6_2.5.1
[22] GenomeInfoDbData_1.2.13 tools_4.4.2 bit64_4.5.2
[25] zlibbioc_1.52.0 cachem_1.1.0
>
Thanks; I did not know which file exactly is being used. Good catch regarding the discrepancy between the 2 web pages. I will get in touch with NCBI regarding this issue.
If you care to know more about the process, there is a GitHub with all the code. It's pretty complicated though - the readme alone is like War and Peace length.
I have contacted NCBI, and meanwhile got an an answer. See below.
It turns out that the official, HGNC-approved symbols are present in the before-mentioned file
gene_info.gz
, namely in column 11 (Symbol_from_nomenclature_authority
). I manually looked up the above-mentioned mitochondrial genes (lines 12218018 - 12218026), and indeed the HGNC symbols are present in that column.Screenshot:
@James: yes, that is indeed a lot of code. Do you happen to know where in the code it is defined which columns are extracted? I noticed that after downloading the
gene_info
file first is filtered for the organisms for which Bioconductor provides anOrgb
(getsrc.sh
, https://github.com/Bioconductor/BioconductorAnnotationPipeline/blob/master/annosrc/gene/script/getsrc.sh).In the
srcdb.sql
script the relevant annotation information seems to be extracted (https://github.com/Bioconductor/BioconductorAnnotationPipeline/blob/2cb9f84f10c836008eb52aea2ae939749f818201/annosrc/gene/script/srcdb.sql#L32-L49), but I wasn't able to find out what exactly is used fordefault_gene_symbol
.Anyway, do you think it would be possible to include the
Symbol_from_nomenclature_authority
(column 11) andFull_name_from_nomenclature_authority information
(column 12) in theOrgDb
as well?Reply NLM Support:
Thanks for following up with NCBI. I can change the code to query the correct column. It's up to somebody with a higher pay scale than mine to decide if this is worth a reissue of the
OrgDb
packages, so it might be next release before the changes propagate.Yes, I understand the implications of replacing the
OrgDb
packages in release. Maybe an upload to the annotationHub or GitHub could be another option? For the human and maybe mouseOrgDb
only? (To accommodate the users that are aware of this issue?)Also: do you plan to supersede the current default column with the official one, or do you rather want to add it?
Anyway, thanks for changing this!
OK, there are 157 MT genes in maybe 6 species for which the symbols are not correct, and I added code to swap the 'Symbol from nomenclature authority' for the 'Symbol' if A) they are not equal, and B) the 'Symbol from nomenclature authority' is not '-'.
Given how few symbols are affected, and the fact that this has only come up at this late date, I am not sure I want to re-run things.
Thanks for changing this! Just to confirm I understood it correctly: the
Symbol
will (still) be used, and will be replaced bySymbol from nomenclature authority
if these are not identical (but not-
). Is this only for the MT genes? Or all genes?You raise a fair point regarding the impact. I would agree to do this from next release onwards.
It's a simple two-liner in getsrc.sh
Which by default checks all the genes, but only affects the MT genes.
Thanks for looking into this. This issue results in the incorrect usage of multiple ORA software including EnrichR.