Question: Error selecting Gene Symbol using Entrez ID via Homo.sapiens
Moiz Bootwalla (United States) wrote 3.3 years ago:

I am getting a very weird error while trying to select a particular SYMBOL by GENEID using the Homo.sapiens package. The Entrez ID '6218' maps to the gene symbol 'RPS17'. When I use the select interface to retrieve the gene symbol for only that Entrez ID, I get an error, but if I combine the ID with a couple of other Entrez IDs, it works.

Here is a reproducible example:

> select(Homo.sapiens, keys="6218", columns="SYMBOL", keytype="GENEID")
Error in .testForValidKeys(x, keys, keytype) :
  None of the keys entered are valid keys for 'GENEID'. Please use the keys method to see a listing of valid arguments.

Enter a frame number, or 0 to exit   

1: select(Homo.sapiens, keys = "6218", columns = "SYMBOL", keytype = "GENEID")
2: select(Homo.sapiens, keys = "6218", columns = "SYMBOL", keytype = "GENEID")
3: .select(x, keys, columns, keytype, ...)
4: AnnotationDbi:::.testSelectArgs(x, keys = keys, cols = cols, keytype = keytype)
5: .testForValidKeys(x, keys, keytype)

Selection: 0
> select(Homo.sapiens, keys=c("6218", "2184", "29929", "6218"), columns="SYMBOL", keytype="GENEID")
1   6218  RPS17
2   2184    FAH
3  29929   ALG6
4   6218  RPS17

Here is my sessionInfo() :

> sessionInfo()
R version 3.1.3 (2015-03-09)
Platform: x86_64-suse-linux-gnu (64-bit)
Running under: openSUSE 13.1 (Bottle) (x86_64)

 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] BSgenome.Hsapiens.UCSC.hg19_1.4.0       BSgenome_1.34.1                        
 [3] Biostrings_2.34.1                       XVector_0.6.0                          
 [5] Homo.sapiens_1.1.2                                 
 [7] GO.db_3.0.0                             RSQLite_1.0.0                          
 [9] DBI_0.3.1                               OrganismDbi_1.8.0                      
[11] TxDb.Hsapiens.UCSC.hg19.knownGene_3.0.0 rtracklayer_1.26.2                     
[13] GenomicFeatures_1.18.3                  AnnotationDbi_1.28.1                   
[15] Biobase_2.26.0                          GenomicRanges_1.18.4                   
[17] GenomeInfoDb_1.2.4                      IRanges_2.0.1                          
[19] S4Vectors_0.4.0                         BiocGenerics_0.12.1                    
[21] BiocInstaller_1.16.1                   

loaded via a namespace (and not attached):
 [1] base64enc_0.1-2         BatchJobs_1.5           BBmisc_1.9              BiocParallel_1.0.3     
 [5] biomaRt_2.22.0          bitops_1.0-6            brew_1.0-6              checkmate_1.5.1        
 [9] codetools_0.2-10        digest_0.6.8            fail_1.2                foreach_1.4.2          
[13] GenomicAlignments_1.2.1 graph_1.44.1            iterators_1.0.7         RBGL_1.42.0            
[17] RCurl_1.95-4.5          Rsamtools_1.18.2        sendmailR_1.2-1         stringr_0.6.2          
[21] tools_3.1.3             XML_3.98-1.1            zlibbioc_1.12.0

Any insights into this issue would be appreciated.



James W. MacDonald (United States) wrote 3.3 years ago:

This is sort of a tricky question. Note that the Homo.sapiens package is intended to allow people to interrogate three different packages seamlessly: the org.Hs.eg.db, GO.db, and TxDb.Hsapiens.UCSC.hg19.knownGene packages. In addition, note that the GENEID key comes from the TxDb.Hsapiens.UCSC.hg19.knownGene package:

> keytypes(TxDb.Hsapiens.UCSC.hg19.knownGene)

And if we do

> "6218" %in% keys(TxDb.Hsapiens.UCSC.hg19.knownGene, "GENEID")

We see it isn't in there. However, this is actually an ENTREZID, and the mapping of ENTREZID to SYMBOL will be performed by the org.Hs.eg.db package, not the TxDb.Hsapiens.UCSC.hg19.knownGene package.

> "6218" %in% keys(org.Hs.eg.db, "ENTREZID")
[1] TRUE

So what you want to do is to use the correct keytype, in order to get what you want:

> select(Homo.sapiens, "6218", "SYMBOL","ENTREZID")
1     6218  RPS17
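
If it is unclear which keytype an ID belongs to, one way to check is to test it against each key set in turn. The following is only a minimal sketch: `which_keytypes` is a hypothetical helper, not a function from AnnotationDbi or OrganismDbi, and the toy key sets are illustrative stand-ins so that the snippet runs without any annotation packages installed.

```r
# Hypothetical helper (not part of AnnotationDbi or OrganismDbi):
# report which of several named key sets contain a given ID.
which_keytypes <- function(id, key_sets) {
  hits <- vapply(key_sets, function(k) id %in% k, logical(1))
  names(key_sets)[hits]
}

# Toy key sets standing in for the real ones (illustrative values only).
toy_sets <- list(
  GENEID   = c("2184", "29929"),
  ENTREZID = c("2184", "29929", "6218")
)
which_keytypes("6218", toy_sets)

# With the real package loaded, the same helper could be fed like this:
# library(Homo.sapiens)
# real_sets <- list(
#   GENEID   = keys(Homo.sapiens, keytype = "GENEID"),
#   ENTREZID = keys(Homo.sapiens, keytype = "ENTREZID")
# )
# which_keytypes("6218", real_sets)
```

An ID that shows up under ENTREZID but not under GENEID is one that should be queried with keytype="ENTREZID", as in the select call above.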

What might cause a gene to appear only in org.Hs.eg.db but not in TxDb.Hsapiens.UCSC.hg19.knownGene?

I have found a similar gene. GP1BB (Entrez ID: 2812) cannot be found in TxDb.Hsapiens.UCSC.hg19.knownGene. Plotting a gene track with the Gviz package failed because nothing is returned from the TxDb object.

What might be a possible solution to this?

Reply written 9 months ago by izzy.yichao.cai

Please don't hijack old threads. If you have a new question, ask a new question.

Note that the package you are using has UCSC and knownGene in the name. This indicates that the data you are using comes from the UCSC genome browser's knownGene table. And if we go to the genome browser and search for GP1BB, what comes up is SEPT5-GP1BB, because UCSC thinks that those genes are the same thing (this is what NCBI calls the SEPT5-GP1BB readthrough).

Since the TxDb package is based on UCSC's knownGene table, you get what they have, so when you find what you think are inconsistencies, the first thing to do is go to the source and see what they have there.

Anyway, this gene is in the TxDb package:

> select(TxDb.Hsapiens.UCSC.hg19.knownGene, "uc002zpv.2", c("GENEID","TXNAME"), "TXNAME")
'select()' returned 1:1 mapping between keys and columns
1 uc002zpv.2 100526833
> select(org.Hs.eg.db, "100526833", "SYMBOL")
'select()' returned 1:1 mapping between keys and columns
1 100526833 SEPT5-GP1BB
> select(TxDb.Hsapiens.UCSC.hg19.knownGene, "100526833", c("CDSCHROM","CDSSTART","CDSEND"), "GENEID")
'select()' returned 1:many mapping between keys and columns
1  100526833    chr22 19702112 19702154
2  100526833    chr22 19702304 19702314
3  100526833    chr22 19707125 19707221
4  100526833    chr22 19707329 19707415
5  100526833    chr22 19707638 19707761
6  100526833    chr22 19707843 19707977
7  100526833    chr22 19708072 19708189
8  100526833    chr22 19708291 19708392
9  100526833    chr22 19709163 19709259
10 100526833    chr22 19709345 19709480
11 100526833    chr22 19709760 19709862
12 100526833    chr22 19709951 19710007
13 100526833    chr22 19706261 19706341
14 100526833    chr22 19709356 19709480
15 100526833    chr22 19709760 19709834
16 100526833    chr22 19711093 19711102
17 100526833    chr22 19711377 19711987


Reply written 9 months ago by James W. MacDonald

Marc Carlson (United States) wrote 3.3 years ago:

OK so I empathize with how this might be confusing.  Basically Jim has shown the answer and I just thought I should add a little prose to explain it a little more.  The short explanation is that GENEID and ENTREZID are two very different key sets.  You can see this by comparing the full output of these two key sets like this:

length(keys(Homo.sapiens, 'ENTREZID'))
## vs
length(keys(Homo.sapiens, 'GENEID'))

It's true that both of these key sets contain the same *kind* of keys (Entrez gene IDs).  But they are not used for the same thing.  The difference is that GENEID contains gene IDs for which there is some kind of genomic (TxDb) information, while ENTREZID contains Entrez gene IDs for which there is other gene information (like gene symbols or names etc.).

So the ultimate reason for the difference is that the resource that was used to make the TxDb just didn't have any information for gene ID '6218', and so there is no GENEID key that matches '6218'.
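
Beyond comparing lengths, it can help to see how the two key sets overlap. The following is a rough sketch only: `compare_key_sets` is a hypothetical helper (not from any Bioconductor package), and the toy vectors are illustrative stand-ins so that it runs without the annotation packages.

```r
# Hypothetical helper (not part of any Bioconductor package):
# summarise how two vectors of keys overlap.
compare_key_sets <- function(a, b) {
  c(shared = length(intersect(a, b)),
    only_a = length(setdiff(a, b)),
    only_b = length(setdiff(b, a)))
}

# Toy vectors standing in for the real key sets (illustrative values only).
toy_entrezid <- c("2184", "29929", "6218")
toy_geneid   <- c("2184", "29929")
overlap <- compare_key_sets(toy_entrezid, toy_geneid)
overlap

# With the real package loaded, the same comparison could be run as:
# library(Homo.sapiens)
# compare_key_sets(keys(Homo.sapiens, keytype = "ENTREZID"),
#                  keys(Homo.sapiens, keytype = "GENEID"))
```

Keys counted under only_a in the real comparison would be Entrez IDs, like '6218', that have gene-level annotation but no entry in the TxDb.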


Thanks Jim and Marc. I kind of got a handle of it last night when I started looking at the columns being returned by Homo.sapiens and TxDb and saw that TxDb had just GENEID and not EntrezID in it. I thought they were both the same and did not know about the difference between the two. Thank you Marc for pointing it out and for the great explanation. This was really helpful.



Reply written 3.3 years ago by Moiz Bootwalla