problem with retrieving UniProt annotation data
Guido Hooiveld ★ 3.3k
Wageningen University, Wageningen, the …

I would like to annotate a set of UniProt IDs within R/Bioconductor. To this end I am using the package. However, I am not able to retrieve the full information that is apparently available in the UniProt database.

To be specific: among other things, I would like to retrieve the preferred gene name, which is labeled "Gene names (primary)" in the resulting table when using UniProt's web-based ID mapping interface.

As an example, when 'manually' retrieving annotation info for the (rat) UniProt ID "Q6MGA6" (through the UniProt website) I got these results:,id%2Centry%20name%2Creviewed%2Cgenes%28PREFERRED%29%2Cprotein%20names%2Cgenes%2Corganism%2Clength%2Ccomment%28PATHWAY%29%2Cec%2Ccomment%28FUNCTION%29%2Corganism-id%2Cdatabase%28Ensembl%29%2Cdatabase%28GeneID%29%2Cyourlist%28M2015022613L2TBUNHS%29

As can be seen, this ID links to multiple (13) gene names/synonyms (7th column), but the primary gene name is Psmb9 (5th column).


Question: which columns to select for retrieving the primary gene name when using the package?

From the URL I deduced that the column to select for this query is labelled "genes(PREFERRED)" (encoded as "genes%28PREFERRED%29" in the URL). However, this column is not present/accessible when using the library.
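The column labels can be read off the URL by percent-decoding its `columns` part. A minimal sketch in base R, using only the first few fields of the fragment quoted above (the variable names `encoded` and `cols` are illustrative, not part of any package):

```r
# First six fields of the percent-encoded column list from the UniProt URL.
encoded <- "id%2Centry%20name%2Creviewed%2Cgenes%28PREFERRED%29%2Cprotein%20names%2Cgenes"

# URLdecode() (utils) turns %2C into ",", %20 into " ", %28/%29 into "(" and ")".
decoded <- URLdecode(encoded)

# Split the decoded string into individual column labels.
cols <- strsplit(decoded, ",", fixed = TRUE)[[1]]
print(cols)
```

These are the labels used by the website's query interface; they need not coincide with the values that `columns()` returns in R.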

Any hints would be greatly appreciated.



> library(

> #set taxonomy ID for Rn
> taxId( <- 10116
> # check
> species(
[1] "Rattus norvegicus"
> # check which columns (annotation info) can be retrieved
> # In total there are 125 annotation columns available, but "genes(PREFERRED)" isn't one of these...
> head(columns(
[1] "UNIPROTKB"         "UNIPARC"           "UNIREF50"         
[4] "UNIREF90"          "UNIREF100"         "EMBL/GENBANK/DDBJ"

> IDkeys <- c("Q6MGA6", "A0A023IMI6")
> annotation <- select(, keys=IDkeys, columns=c("RGD"), keytype="UNIPROTKB") # This works!
Getting mapping data for Q6MGA6 ... and RGD_ID
> annotation
1     Q6MGA6 3427
2 A0A023IMI6 <NA>

> # but this not, although "GENEID" is listed as column type...
> annotation <- select(, keys=IDkeys, columns=c("GENEID"), keytype="UNIPROTKB")
Getting mapping data for Q6MGA6 ... and P_ENTREZGENEID
Error in `[.data.frame`(tab, , oriTabCols) : undefined columns selected

>  columns([37]
[1] "GENEID"

> ## this works
> annotation <- select(, keys=IDkeys, columns=c("ENTREZ_GENE"), keytype="UNIPROTKB")
Getting mapping data for Q6MGA6 ... and P_ENTREZGENEID
> annotation
1     Q6MGA6       24967
2 A0A023IMI6       24968
> # but this not....
> annotation <- select(, keys=IDkeys, columns=c("genes(PREFERRED)"), keytype="UNIPROTKB")
Error in .select(x, keys, columns, keytype) :
  columns argument MUST match a value returned by columns method

> sessionInfo()
R version 3.1.2 Patched (2015-02-03 r67717)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] UniProt.ws_2.6.0 RCurl_1.95-4.5   bitops_1.0-6     RSQLite_1.0.0   
[5] DBI_0.3.1       

loaded via a namespace (and not attached):
[1] AnnotationDbi_1.28.1 Biobase_2.26.0       BiocGenerics_0.12.1
[4] GenomeInfoDb_1.2.4   IRanges_2.0.1        parallel_3.1.2      
[7] S4Vectors_0.4.0      stats4_3.1.2        
United States

You might be better served using the package instead:

> select(, IDkeys, "SYMBOL", "UNIPROT")
1     Q6MGA6  Psmb9
2 A0A023IMI6  Psmb8

Unless you need other specific things that are only available from UniProt. An alternative is to get what you want from UniProt, and then fill in later. Do note that the 'gene names' (actually HGNC symbols) listed on that UniProt webpage have a little drop-down that shows where the data are imported from. It may well be that this is done on the fly and is not actually part of the UniProt database, which would explain why you are having problems retrieving from



Thanks, I didn't think of using the package, but this indeed works fine for my current case.

As a side note, I had the impression that keys and columns would only show rat-specific annotation info, rather than all available info, including info specific to certain species.

Marc Carlson ★ 7.2k
United States

Hi Guido,

This was a bug caused by the fact that internally we had two labels matched to the same thing (from UniProt). I am now pushing a patch for this online. However, "ENTREZ_GENE" is what you were after when you asked for "GENEID" (they mean the same thing to UniProt), so the two are equivalent. I will remove 'GENEID' from the list of values returned from columns in order to fix this. Also, I don't know where you got the idea to try 'genes(PREFERRED)' from?

Also, Jim's answer is a good one to follow for those things that you don't *need* to get from UniProt, as it will be more performant (since it can run off a local DB instead of a web service).
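A minimal sketch of the local-database route. The package name is not given in this thread, so `org.Rn.eg.db` (the rat organism annotation package) is an assumption here, and `symbols_for_uniprot` is a hypothetical helper name. The call itself mirrors the `select()` usage shown in the answer above.

```r
# Hypothetical helper: map UniProt accessions to gene symbols using a local
# organism annotation database (no web service involved).
symbols_for_uniprot <- function(ids, orgdb) {
  AnnotationDbi::select(orgdb, keys = ids,
                        columns = "SYMBOL", keytype = "UNIPROT")
}

IDkeys <- c("Q6MGA6", "A0A023IMI6")

# Assumption: org.Rn.eg.db is the relevant package; only run if it is installed.
if (requireNamespace("org.Rn.eg.db", quietly = TRUE)) {
  print(symbols_for_uniprot(IDkeys, org.Rn.eg.db::org.Rn.eg.db))
}
```

Passing the database object as an argument keeps the helper usable for other organisms' annotation packages as well.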



Thanks, Marc.

Regarding my attempt with 'genes(PREFERRED)': I naively deduced this from the URL that is returned when performing a manual query using the UniProt website; see my first post. It is indeed not one of the database identifiers.
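One way to catch such a mismatch before querying is to compare the requested names against the values reported by `columns()`. A minimal sketch in base R: `split_columns` is a hypothetical helper, and `available` is filled with the six values printed by `head(columns(...))` in the first post, purely for illustration; in real use it would hold the full `columns()` output.

```r
# Illustrative subset of available columns, copied from the first post.
available <- c("UNIPROTKB", "UNIPARC", "UNIREF50",
               "UNIREF90", "UNIREF100", "EMBL/GENBANK/DDBJ")

# Hypothetical helper: separate requested column names into those that are
# listed as available and those that are not.
split_columns <- function(requested, available) {
  list(valid   = intersect(requested, available),
       unknown = setdiff(requested, available))
}

res <- split_columns(c("UNIPROTKB", "genes(PREFERRED)"), available)
print(res$unknown)
```

As the "GENEID" case in this thread shows, a name being listed is no guarantee that the query succeeds, so this only guards against names that are not listed at all.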

