incomplete official symbol mitochondrial genes in org.Hs.eg.db?
1
0
Entering edit mode
Guido Hooiveld ★ 4.1k
@guido-hooiveld-2020
Last seen 2 days ago
Wageningen University, Wageningen, the …

At the GitHub of clusterProfiler a question was posted on the use of symbols of mitochondrial genes.

It basically boils down to the question whether a hyphen in a symbol may cause a problem with the org.Hs.eg.db, preventing it from being present in the database.

To illustrate the issue:

A set of 4 'official' mitochondrial symbols:

mito.hgnc <- c("MT-ATP6", "MT-ATP8", "MT-CO1", "MT-CYB")

When the OrgDb is queried to retrieve the corresponding ENTREZID nothing is found....:

> library(org.Hs.eg.db)
> AnnotationDbi::select(org.Hs.eg.db, keys = mito.hgnc, keytype = "SYMBOL",
+                columns = c("ENTREZID", "SYMBOL", "GENENAME") )
Error in .testForValidKeys(x, keys, keytype, fks) : 
  None of the keys entered are valid keys for 'SYMBOL'. Please use the keys method to see a listing of valid arguments.
>

The same when ALIAS is used...

> AnnotationDbi::select(org.Hs.eg.db, keys = mito.hgnc, keytype = "ALIAS",
+                columns = c("ENTREZID", "SYMBOL", "GENENAME") )
Error in .testForValidKeys(x, keys, keytype, fks) : 
  None of the keys entered are valid keys for 'ALIAS'. Please use the keys method to see a listing of valid arguments.
> 

Using a code snippet to retrieve all mitochondrial genes (previously posted on this forum by James)

## check all genes on chromosome M
> z <- unlist(as.list(org.Hs.egCHR))
> mito.egids <- names(z)[z %in% "MT"]
> mito.egids
 [1] "4508" "4509" "4511" "4512" "4513" "4514" "4519" "4535" "4536" "4537"
[11] "4538" "4539" "4540" "4541" "4549" "4550" "4553" "4555" "4556" "4558"
[21] "4563" "4564" "4565" "4566" "4567" "4568" "4569" "4570" "4571" "4572"
[31] "4573" "4574" "4575" "4576" "4577" "4578" "4579"
> 
> head( AnnotationDbi::select(org.Hs.eg.db, keys = mito.egids, keytype = "ENTREZID",
+                columns = c("ENTREZID", "SYMBOL", "GENENAME") ) )
'select()' returned 1:1 mapping between keys and columns
  ENTREZID SYMBOL                         GENENAME
1     4508   ATP6        ATP synthase F0 subunit 6
2     4509   ATP8        ATP synthase F0 subunit 8
3     4511   TRNC                         tRNA-Cys
4     4512   COX1   cytochrome c oxidase subunit I
5     4513   COX2  cytochrome c oxidase subunit II
6     4514   COX3 cytochrome c oxidase subunit III
> 

Mmm, SYMBOL are ATP6, ATP8, etc. No prefix with MT...

Yet, at both NCBI and HGNC official symbols for all 4 input symbols are with MT prefix...? (... and official symbol is MT-CO1 rather than COX1...)

MT-ATP6

https://www.ncbi.nlm.nih.gov/gene/4508

https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/HGNC:7414

MT-ATP8

https://www.ncbi.nlm.nih.gov/gene/4509

https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/HGNC:7415

MT-CO1

https://www.ncbi.nlm.nih.gov/gene/4512

https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/HGNC:7419

MT-CYB

https://www.ncbi.nlm.nih.gov/gene/4519

https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/HGNC:7427

Thus, is this expected behavior, or maybe not? Any insights would be appreciated.

G

BTW, I understand that the OrgDb are generated a while before a new Bioconductor release, and at NCBI it is stated that the last info update was on October 28, but ~2 weeks ago he prefix MT was already used (and maybe even long before then, but I don't know that...).

> sessionInfo()
R version 4.4.2 (2024-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 10 x64 (build 19042)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: Europe/Amsterdam
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] org.Hs.eg.db_3.20.0  AnnotationDbi_1.68.0 IRanges_2.40.0      
[4] S4Vectors_0.44.0     Biobase_2.66.0       BiocGenerics_0.52.0 

loaded via a namespace (and not attached):
 [1] crayon_1.5.3            vctrs_0.6.5             httr_1.4.7             
 [4] cli_3.6.3               rlang_1.1.4             DBI_1.2.3              
 [7] png_0.1-8               UCSC.utils_1.2.0        jsonlite_1.8.9         
[10] bit_4.5.0               Biostrings_2.74.0       KEGGREST_1.46.0        
[13] fastmap_1.2.0           GenomeInfoDb_1.42.0     memoise_2.0.1          
[16] compiler_4.4.2          RSQLite_2.3.7           blob_1.2.4             
[19] pkgconfig_2.0.3         XVector_0.46.0          R6_2.5.1               
[22] GenomeInfoDbData_1.2.13 tools_4.4.2             bit64_4.5.2            
[25] zlibbioc_1.52.0         cachem_1.1.0           
> 
org.Hs.eg.db • 1.5k views
ADD COMMENT
0
Entering edit mode
@james-w-macdonald-5106
Last seen 10 hours ago
United States

The data for the OrgDb packages comes directly from NCBI, and in this case the symbols come from the gene_info.gz file you can get from their FTP site. It's the same data that are parsed when we use makeOrgDbFromNCBI, so I can use the NCBI.sqlite db that I have already generated.

> library(RSQLite)
> library(org.Hs.eg.db)
> con <- dbConnect(SQLite(), "NCBI.sqlite")
> egids <- select(org.Hs.eg.db, "MT", "ENTREZID","CHR")$ENTREZID
'select()' returned 1:many mapping between keys and columns
Warning message:
In .deprecatedColsMessage() :
  Accessing gene location information via 'CHR','CHRLOC','CHRLOCEND' is deprecated. Please use a range
  based accessor like genes(), or select() with columns values like TXCHROM and TXSTART on a TxDb or
  OrganismDb object instead.
> dbGetQuery(con, paste0("select gene_id, symbol from gene_info where gene_id in ('", paste(egids, collapse = "','"), "');"))
   gene_id symbol
1     4508   ATP6
2     4509   ATP8
3     4511   TRNC
4     4512   COX1
5     4513   COX2
6     4514   COX3
7     4519   CYTB
8     4535    ND1
9     4536    ND2
10    4537    ND3
11    4538    ND4
12    4539   ND4L
13    4540    ND5
14    4541    ND6
15    4549   RNR1
16    4550   RNR2
17    4553   TRNA
18    4555   TRND
19    4556   TRNE
20    4558   TRNF
21    4563   TRNG
22    4564   TRNH
23    4565   TRNI
24    4566   TRNK
25    4567  TRNL1
26    4568  TRNL2
27    4569   TRNM
28    4570   TRNN
29    4571   TRNP
30    4572   TRNQ
31    4573   TRNR
32    4574  TRNS1
33    4575  TRNS2
34    4576   TRNT
35    4577   TRNV
36    4578   TRNW
37    4579   TRNY

I am not sure why the gene_info.gz file has a different gene symbol for these genes than what you see on their web page, but it's what we get from them, so it's what goes in the OrgDb. It might have something to do with this though. The search results call it 'ATP8', but when you go to the page, they list the HGNC symbol.

0
Entering edit mode

Thanks; I did not know which file exactly is being used. Good catch regarding the discrepancy between the 2 web pages. I will get in touch with NCBI regarding this issue.

ADD REPLY
1
Entering edit mode

If you care to know more about the process, there is a GitHub with all the code. It's pretty complicated though - the readme alone is like War and Peace length.

ADD REPLY
0
Entering edit mode

I have contacted NCBI, and meanwhile got an an answer. See below.

It turns out that the official, HGNC-approved symbols are present in the before-mentioned file gene_info.gz, namely in column 11 (Symbol_from_nomenclature_authority). I manually looked up the above-mentioned mitochondrial genes (lines 12218018 - 12218026), and indeed the HGNC symbols are present in that column.

Screenshot:

Screenshot

@James: yes, that is indeed a lot of code. Do you happen to know where in the code it is defined which columns are extracted? I noticed that after downloading the gene_info file first is filtered for the organisms for which Bioconductor provides an Orgb (getsrc.sh, https://github.com/Bioconductor/BioconductorAnnotationPipeline/blob/master/annosrc/gene/script/getsrc.sh).

In the srcdb.sql script the relevant annotation information seems to be extracted (https://github.com/Bioconductor/BioconductorAnnotationPipeline/blob/2cb9f84f10c836008eb52aea2ae939749f818201/annosrc/gene/script/srcdb.sql#L32-L49), but I wasn't able to find out what exactly is used for default_gene_symbol.

Anyway, do you think it would be possible to include the Symbol_from_nomenclature_authority (column 11) and Full_name_from_nomenclature_authority information (column 12) in the OrgDb as well?

Reply NLM Support:

Hello,

Thank you for contacting NCBI and for your inquiry. If you are using the "Symbol" (third) column in the gene_info file, you will get the default symbol for the gene.
I have checked with the Gene database staff and they tell me that for mitochondrion genes, we use a standardized set of symbols across vertebrates (and also >100k vertebrate MT genomes in GenBank), and that is our primary nomenclature as shown in columns 3 & 9. 
You need to use columns 11 and 12 that correspond to official nomenclature. Column 11 is for "Symbol from nomenclature authority"

Here is also the link to the readme file:
https://ftp.ncbi.nih.gov/gene/DATA/README

Best regards,

NCBI Sequence Resources Help Desk

NLM | NIH | HHS | USA.gov
ADD REPLY
0
Entering edit mode

Thanks for following up with NCBI. I can change the code to query the correct column. It's up to somebody with a higher pay scale than mine to decide if this is worth a reissue of the OrgDb packages, so it might be next release before the changes propagate.

ADD REPLY
0
Entering edit mode

Yes, I understand the implications of replacing the OrgDb packages in release. Maybe an upload to the annotationHub or GitHub could be another option? For the human and maybe mouse OrgDb only? (To accommodate the users that are aware of this issue?)

Also: do you plan to supersede the current default column with the official one, or do you rather want to add it?

Anyway, thanks for changing this!

ADD REPLY
1
Entering edit mode

OK, there are 157 MT genes in maybe 6 species for which the symbols are not correct, and I added code to swap the 'Symbol from nomenclature authority' for the 'Symbol' if A) they are not equal, and B) the 'Symbol from nomenclature authority' is not '-'.

Given how few symbols are affected, and the fact that this has only come up at this late date, I am not sure I want to re-run things.

0
Entering edit mode

Thanks for changing this! Just to confirm I understood it correctly: the Symbol will (still) be used, and will be replaced by Symbol from nomenclature authority if these are not identical (but not -). Is this only for the MT genes? Or all genes?

You raise a fair point regarding the impact. I would agree to do this from next release onwards.

ADD REPLY
0
Entering edit mode

It's a simple two-liner in getsrc.sh

awk -F'\t' 'BEGIN {OFS=FS} {if($3 != $11 && $11 != "-") $3 = $11; print $0}' gene_info > tmp
mv tmp gene_info

Which by default checks all the genes, but only affects the MT genes.

0
Entering edit mode

Thanks for looking into this. This issue results in the incorrect usage of multiple ORA software including EnrichR.

ADD REPLY

Login before adding your answer.

Traffic: 653 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6