Question

Error selecting Gene Symbol using Entrez ID via Homo.sapiens

Moiz Bootwalla ▴ 50

@moiz-bootwalla-5215

Last seen 9.7 years ago

United States

I am getting a strange error when trying to retrieve a SYMBOL for a GENEID with the Homo.sapiens package. The Entrez ID '6218' maps to the gene symbol 'RPS17'. When I use the select interface to retrieve the gene symbol for that Entrez ID alone, I get an error, but if I combine it with a couple of other Entrez IDs, it works.

Here is a reproducible example:

> select(Homo.sapiens, keys="6218", columns="SYMBOL", keytype="GENEID")
Error in .testForValidKeys(x, keys, keytype) :
  None of the keys entered are valid keys for 'GENEID'. Please use the keys method to see a listing of valid arguments.

Enter a frame number, or 0 to exit   

1: select(Homo.sapiens, keys = "6218", columns = "SYMBOL", keytype = "GENEID")
2: select(Homo.sapiens, keys = "6218", columns = "SYMBOL", keytype = "GENEID")
3: .select(x, keys, columns, keytype, ...)
4: AnnotationDbi:::.testSelectArgs(x, keys = keys, cols = cols, keytype = keytype)
5: .testForValidKeys(x, keys, keytype)

Selection: 0
> select(Homo.sapiens, keys=c("6218", "2184", "29929", "6218"), columns="SYMBOL", keytype="GENEID")
  GENEID SYMBOL
1   6218  RPS17
2   2184    FAH
3  29929   ALG6
4   6218  RPS17

Here is my sessionInfo() :

> sessionInfo()
R version 3.1.3 (2015-03-09)
Platform: x86_64-suse-linux-gnu (64-bit)
Running under: openSUSE 13.1 (Bottle) (x86_64)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] BSgenome.Hsapiens.UCSC.hg19_1.4.0       BSgenome_1.34.1                        
 [3] Biostrings_2.34.1                       XVector_0.6.0                          
 [5] Homo.sapiens_1.1.2                      org.Hs.eg.db_3.0.0                     
 [7] GO.db_3.0.0                             RSQLite_1.0.0                          
 [9] DBI_0.3.1                               OrganismDbi_1.8.0                      
[11] TxDb.Hsapiens.UCSC.hg19.knownGene_3.0.0 rtracklayer_1.26.2                     
[13] GenomicFeatures_1.18.3                  AnnotationDbi_1.28.1                   
[15] Biobase_2.26.0                          GenomicRanges_1.18.4                   
[17] GenomeInfoDb_1.2.4                      IRanges_2.0.1                          
[19] S4Vectors_0.4.0                         BiocGenerics_0.12.1                    
[21] BiocInstaller_1.16.1                   

loaded via a namespace (and not attached):
 [1] base64enc_0.1-2         BatchJobs_1.5           BBmisc_1.9              BiocParallel_1.0.3     
 [5] biomaRt_2.22.0          bitops_1.0-6            brew_1.0-6              checkmate_1.5.1        
 [9] codetools_0.2-10        digest_0.6.8            fail_1.2                foreach_1.4.2          
[13] GenomicAlignments_1.2.1 graph_1.44.1            iterators_1.0.7         RBGL_1.42.0            
[17] RCurl_1.95-4.5          Rsamtools_1.18.2        sendmailR_1.2-1         stringr_0.6.2          
[21] tools_3.1.3             XML_3.98-1.1            zlibbioc_1.12.0

Any insights into this issue would be appreciated.

Thanks,

Moiz

homo sapiens • 9.9k views

ADD COMMENT • link updated 9.7 years ago by Marc Carlson ★ 7.2k • written 9.7 years ago by Moiz Bootwalla ▴ 50

Marc Carlson ★ 7.2k

@marc-carlson-2264

Last seen 8.3 years ago

United States

OK, so I empathize with how this might be confusing. Basically, Jim has shown the answer, and I just thought I should add a little prose to explain it further. The short explanation is that GENEID and ENTREZID are two very different key sets. You can see this by comparing the sizes of the two key sets like this:

length(keys(Homo.sapiens, 'ENTREZID'))
## vs
length(keys(Homo.sapiens, 'GENEID'))

It's true that both of these key sets contain the same *kind* of keys (Entrez gene IDs), but they are not used for the same thing. GENEID contains gene IDs for which there is some kind of genomic (TxDb) information, while ENTREZID contains Entrez gene IDs for which there is other gene information (gene symbols, names, etc.).

So the ultimate reason for the difference is that the resource used to make the TxDb just didn't have any information for gene ID '6218', and so there is no GENEID key that matches '6218'.
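
To make the difference between the two key sets concrete, here is a minimal sketch that lists the Entrez IDs known to the gene-level annotation but missing from the TxDb side. It only uses `keys()` as shown above plus base R `setdiff()`; the variable names are just for illustration, and the counts you get will depend on the package versions you have installed.

```r
library(Homo.sapiens)

# Entrez IDs with gene-level annotation (symbols, names, ...)
entrez_keys <- keys(Homo.sapiens, keytype = "ENTREZID")

# Entrez IDs for which the TxDb has genomic information
geneid_keys <- keys(Homo.sapiens, keytype = "GENEID")

# IDs that are valid ENTREZID keys but not valid GENEID keys
only_in_orgdb <- setdiff(entrez_keys, geneid_keys)

length(only_in_orgdb)
"6218" %in% only_in_orgdb
```

Any ID in `only_in_orgdb` can still be used with keytype "ENTREZID", but it has no GENEID key to match.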

ADD COMMENT • link 9.7 years ago Marc Carlson ★ 7.2k

Thanks Jim and Marc. I got a handle on it last night when I started looking at the columns returned by Homo.sapiens and the TxDb and saw that the TxDb had only GENEID and not ENTREZID. I thought the two were the same and did not know about the difference between them. Thank you, Marc, for pointing it out and for the great explanation. This was really helpful.

Best,

Moiz

ADD REPLY • link 9.7 years ago Moiz Bootwalla ▴ 50

score 2 · Accepted Answer · 2015-04-01

James W. MacDonald 67k

@james-w-macdonald-5106

Last seen 1 day ago

United States

This is sort of a tricky question. Note that the Homo.sapiens package is intended to let people interrogate three different packages seamlessly: org.Hs.eg.db, GO.db, and TxDb.Hsapiens.UCSC.hg19.knownGene. In addition, note that the GENEID key comes from the TxDb.Hsapiens.UCSC.hg19.knownGene package:

> keytypes(TxDb.Hsapiens.UCSC.hg19.knownGene)
[1] "GENEID"   "TXID"     "TXNAME"   "EXONID"   "EXONNAME" "CDSID"    "CDSNAME"

And if we do

> "6218" %in% keys(TxDb.Hsapiens.UCSC.hg19.knownGene, "GENEID")
[1] FALSE

We see it isn't in there. However, this is actually an ENTREZID, and the mapping of ENTREZID to SYMBOL will be performed by the org.Hs.eg.db package, not the TxDb.Hsapiens.UCSC.hg19.knownGene package.

> "6218" %in% keys(org.Hs.eg.db, "ENTREZID")
[1] TRUE

So what you want to do is use the correct keytype:

> select(Homo.sapiens, "6218", "SYMBOL","ENTREZID")
  ENTREZID SYMBOL
1     6218  RPS17
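
As for why the combined query in the question went through: the error text says "None of the keys entered are valid", which suggests the check only fails when every key is invalid, so mixing '6218' with IDs that are valid GENEID keys would not trigger it. If you want to catch that situation up front, one option is to check the keys yourself before calling select. The sketch below is only an illustration; `symbols_for_entrez` is a hypothetical helper, not a function from any of these packages.

```r
library(Homo.sapiens)

# Hypothetical helper (not part of any package): look up gene symbols
# for Entrez IDs, warning about IDs that are not valid ENTREZID keys.
symbols_for_entrez <- function(ids, db = Homo.sapiens) {
  ids <- as.character(ids)
  valid <- ids %in% keys(db, keytype = "ENTREZID")
  if (!all(valid)) {
    warning("Not valid ENTREZID keys: ",
            paste(ids[!valid], collapse = ", "))
  }
  if (!any(valid)) {
    # Nothing to look up; return an empty result instead of an error
    return(data.frame(ENTREZID = character(0), SYMBOL = character(0),
                      stringsAsFactors = FALSE))
  }
  select(db, keys = ids[valid], columns = "SYMBOL", keytype = "ENTREZID")
}

symbols_for_entrez(c("6218", "2184", "29929"))
```

The same check with keytype "GENEID" tells you which of your IDs the TxDb side knows about.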

ADD COMMENT • link 9.7 years ago James W. MacDonald 67k

Why might a gene appear only in org.Hs.eg.db and not in TxDb.Hsapiens.UCSC.hg19.knownGene?

I have found a similar gene. GP1BB (Entrez ID: 2812) cannot be found in TxDb.Hsapiens.UCSC.hg19.knownGene. Plotting a gene track with the Gviz package failed because nothing is returned from the TxDb object.

What might be a possible solution to this?

ADD REPLY • link 7.1 years ago izzy.yichao.cai ▴ 10

Please don't hijack old threads. If you have a new question, ask a new question.

Note that the package you are using has UCSC and knownGene in the name. This indicates that the data you are using comes from the UCSC genome browser's knownGene table. And if we go to the genome browser and search for GP1BB, what comes up is SEPT5-GP1BB, because UCSC thinks that those genes are the same thing (this is what NCBI calls the SEPT5-GP1BB readthrough).

Since the TxDb package is based on UCSC's knownGene table, you get what they have, so when you find what you think are inconsistencies, the first thing to do is go to the source and see what they have there.

Anyway, this gene is in the TxDb package:

> select(TxDb.Hsapiens.UCSC.hg19.knownGene, "uc002zpv.2", c("GENEID","TXNAME"), "TXNAME")
'select()' returned 1:1 mapping between keys and columns
      TXNAME    GENEID
1 uc002zpv.2 100526833
> select(org.Hs.eg.db, "100526833", "SYMBOL")
'select()' returned 1:1 mapping between keys and columns
   ENTREZID      SYMBOL
1 100526833 SEPT5-GP1BB
> select(TxDb.Hsapiens.UCSC.hg19.knownGene, "100526833", c("CDSCHROM","CDSSTART","CDSEND"), "GENEID")
'select()' returned 1:many mapping between keys and columns
      GENEID CDSCHROM CDSSTART   CDSEND
1  100526833    chr22 19702112 19702154
2  100526833    chr22 19702304 19702314
3  100526833    chr22 19707125 19707221
4  100526833    chr22 19707329 19707415
5  100526833    chr22 19707638 19707761
6  100526833    chr22 19707843 19707977
7  100526833    chr22 19708072 19708189
8  100526833    chr22 19708291 19708392
9  100526833    chr22 19709163 19709259
10 100526833    chr22 19709345 19709480
11 100526833    chr22 19709760 19709862
12 100526833    chr22 19709951 19710007
13 100526833    chr22 19706261 19706341
14 100526833    chr22 19709356 19709480
15 100526833    chr22 19709760 19709834
16 100526833    chr22 19711093 19711102
17 100526833    chr22 19711377 19711987

ADD REPLY • link 7.1 years ago James W. MacDonald 67k