I am getting a very weird error while trying to select a particular SYMBOL by GENEID with the Homo.sapiens package. The Entrez ID '6218' maps to the gene symbol 'RPS17'. When I use the select interface to retrieve the gene symbol for that Entrez ID alone, I get an error, but if I combine it with a couple of other Entrez IDs, it works.
Here is a reproducible example:
> select(Homo.sapiens, keys="6218", columns="SYMBOL", keytype="GENEID")
Error in .testForValidKeys(x, keys, keytype) :
  None of the keys entered are valid keys for 'GENEID'. Please use the keys method to see a listing of valid arguments.

Enter a frame number, or 0 to exit

1: select(Homo.sapiens, keys = "6218", columns = "SYMBOL", keytype = "GENEID")
2: select(Homo.sapiens, keys = "6218", columns = "SYMBOL", keytype = "GENEID")
3: .select(x, keys, columns, keytype, ...)
4: AnnotationDbi:::.testSelectArgs(x, keys = keys, cols = cols, keytype = keytype)
5: .testForValidKeys(x, keys, keytype)

Selection: 0
> select(Homo.sapiens, keys=c("6218", "2184", "29929", "6218"), columns="SYMBOL", keytype="GENEID")
  GENEID SYMBOL
1   6218  RPS17
2   2184    FAH
3  29929   ALG6
4   6218  RPS17
Here is my sessionInfo() :
> sessionInfo()
R version 3.1.3 (2015-03-09)
Platform: x86_64-suse-linux-gnu (64-bit)
Running under: openSUSE 13.1 (Bottle) (x86_64)

locale:
 [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
 [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base

other attached packages:
 [1] BSgenome.Hsapiens.UCSC.hg19_1.4.0 BSgenome_1.34.1
 [3] Biostrings_2.34.1 XVector_0.6.0
 [5] Homo.sapiens_1.1.2 org.Hs.eg.db_3.0.0
 [7] GO.db_3.0.0 RSQLite_1.0.0
 [9] DBI_0.3.1 OrganismDbi_1.8.0
[11] TxDb.Hsapiens.UCSC.hg19.knownGene_3.0.0 rtracklayer_1.26.2
[13] GenomicFeatures_1.18.3 AnnotationDbi_1.28.1
[15] Biobase_2.26.0 GenomicRanges_1.18.4
[17] GenomeInfoDb_1.2.4 IRanges_2.0.1
[19] S4Vectors_0.4.0 BiocGenerics_0.12.1
[21] BiocInstaller_1.16.1

loaded via a namespace (and not attached):
 [1] base64enc_0.1-2 BatchJobs_1.5 BBmisc_1.9 BiocParallel_1.0.3
 [5] biomaRt_2.22.0 bitops_1.0-6 brew_1.0-6 checkmate_1.5.1
 [9] codetools_0.2-10 digest_0.6.8 fail_1.2 foreach_1.4.2
[13] GenomicAlignments_1.2.1 graph_1.44.1 iterators_1.0.7 RBGL_1.42.0
[17] RCurl_1.95-4.5 Rsamtools_1.18.2 sendmailR_1.2-1 stringr_0.6.2
[21] tools_3.1.3 XML_3.98-1.1 zlibbioc_1.12.0
Any insights into this issue would be appreciated.
Thanks,
Moiz
What might cause a gene to appear in org.Hs.eg.db but not in TxDb.Hsapiens.UCSC.hg19.knownGene?
I have found a similar gene. GP1BB (Entrez ID: 2812) cannot be found in TxDb.Hsapiens.UCSC.hg19.knownGene. Plotting a gene track with the Gviz package fails because nothing is returned from the TxDb object.
What might be a possible solution to this?
Please don't hijack old threads. If you have a new question, ask a new question.
Note that the package you are using has UCSC and knownGene in the name. This indicates that the data you are using come from the UCSC genome browser's knownGene table. And if we go to the genome browser and search for GP1BB, what comes up is SEPT5-GP1BB, because UCSC thinks that those genes are the same thing (this is what NCBI calls the SEPT5-GP1BB readthrough).
Since the TxDb package is based on UCSC's knownGene table, you get what they have, so when you find what you think are inconsistencies, the first thing to do is go to the source and see what they have there.
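Before going to the browser, a quick check on the R side can tell you which package is missing the ID. The following is only a sketch, assuming the same packages as in the sessionInfo() above are installed; the variable names txdb and ids are just for illustration, and I am not showing output because it depends on the package versions:

```r
library(AnnotationDbi)
library(org.Hs.eg.db)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)

txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene  # illustrative variable name
ids <- c("6218", "2812")                   # Entrez IDs mentioned in this thread

# Which of these IDs are valid GENEID keys in the TxDb package?
ids %in% keys(txdb, keytype = "GENEID")

# The symbol lookup itself only needs the org package, keyed on ENTREZID
select(org.Hs.eg.db, keys = ids, columns = "SYMBOL", keytype = "ENTREZID")
```

If an ID comes back FALSE in the first check, the TxDb has no GENEID entry for it, and anything that depends on the TxDb for that gene will come back empty.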
Anyway, this gene is in the TxDb package: