Question: problem with retrieving UniProt annotation data
Guido Hooiveld (Wageningen University, Wageningen, the Netherlands) wrote, 4.7 years ago:

I would like to annotate a set of UniProt IDs within BioC/R. To this end I am using the UniProt.ws package. However, when using R/BioC I am not able to retrieve the full information that apparently is available in the UniProt database.

To be specific: among other things, I would like to retrieve the preferred gene name, which is labeled "Gene names (primary)" in the resulting table when using UniProt's web-based ID mapping interface.

As an example, when 'manually' retrieving annotation info for the (rat) UniProt ID "Q6MGA6" (through the UniProt website) I got these results:

http://www.uniprot.org/uniprot/?query=yourlist:M2015022663MSS5M698&sort=yourlist:M2015022663MSS5M698&columns=yourlist%28M2015022663MSS5M698%29,id%2Centry%20name%2Creviewed%2Cgenes%28PREFERRED%29%2Cprotein%20names%2Cgenes%2Corganism%2Clength%2Ccomment%28PATHWAY%29%2Cec%2Ccomment%28FUNCTION%29%2Corganism-id%2Cdatabase%28Ensembl%29%2Cdatabase%28GeneID%29%2Cyourlist%28M2015022613L2TBUNHS%29

As can be seen, this ID links to multiple (13) gene names/synonyms (7th column), but the primary gene name is Psmb9 (5th column).

 

Question: which columns to select for retrieving the primary gene name when using the UniProt.ws package?

--> From the URL I deduced that the name of the 'column' that should be selected for this query is labelled "genes(PREFERRED)" (corresponds to "genes%28PREFERRED%29") in the URL. However, this column is not present/accessible when using the UniProt.ws library.
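
The deduction above can be reproduced offline by URL-decoding the column list. This is a minimal sketch in base R; it uses only a shortened excerpt of the `columns=` parameter from the URL in this post, and the variable names are illustrative. It shows the web interface's column labels, which are not necessarily names that UniProt.ws accepts.

```r
# Shortened excerpt of the `columns=` parameter from the URL above
columns_param <- paste0(
  "id%2Centry%20name%2Creviewed%2Cgenes%28PREFERRED%29",
  "%2Cprotein%20names%2Cgenes%2Corganism"
)

# Percent-decoding turns %2C into ",", %20 into " ", %28/%29 into "(" and ")"
decoded <- utils::URLdecode(columns_param)

# Split the decoded string into the individual web-interface column labels
web_columns <- strsplit(decoded, ",", fixed = TRUE)[[1]]
print(web_columns)
```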

Any hints would be greatly appreciated.

Thanks,

Guido

> library(UniProt.ws)

> #set taxonomy ID for Rn
> taxId(UniProt.ws) <- 10116
>
> # check
> species(UniProt.ws)
[1] "Rattus norvegicus"
> # check which columns (annotation info) can be retrieved
> # In total there are 125 annotation columns available, but "genes(PREFERRED)" isn't one of these...
>
> head(columns(UniProt.ws))
[1] "UNIPROTKB"         "UNIPARC"           "UNIREF50"         
[4] "UNIREF90"          "UNIREF100"         "EMBL/GENBANK/DDBJ"
>

> IDkeys <- c("Q6MGA6", "A0A023IMI6")
> annotation <- select(x=UniProt.ws, keys=IDkeys, columns=c("RGD"), keytype="UNIPROTKB") # This works!
Getting mapping data for Q6MGA6 ... and RGD_ID
> annotation
   UNIPROTKB  RGD
1     Q6MGA6 3427
2 A0A023IMI6 <NA>
>

> # but this does not, although "GENEID" is listed as a column type...
> annotation <- select(x=UniProt.ws, keys=IDkeys, columns=c("GENEID"), keytype="UNIPROTKB")
Getting mapping data for Q6MGA6 ... and P_ENTREZGENEID
Error in `[.data.frame`(tab, , oriTabCols) : undefined columns selected
>

>  columns(UniProt.ws)[37]
[1] "GENEID"
>

> ## this works
> annotation <- select(x=UniProt.ws, keys=IDkeys, columns=c("ENTREZ_GENE"), keytype="UNIPROTKB")
Getting mapping data for Q6MGA6 ... and P_ENTREZGENEID
> annotation
   UNIPROTKB ENTREZ_GENE
1     Q6MGA6       24967
2 A0A023IMI6       24968
>
> # but this does not...
> annotation <- select(x=UniProt.ws, keys=IDkeys, columns=c("genes(PREFERRED)"), keytype="UNIPROTKB")
Error in .select(x, keys, columns, keytype) :
  columns argument MUST match a value returned by columns method
>

> sessionInfo()
R version 3.1.2 Patched (2015-02-03 r67717)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] UniProt.ws_2.6.0 RCurl_1.95-4.5   bitops_1.0-6     RSQLite_1.0.0   
[5] DBI_0.3.1       

loaded via a namespace (and not attached):
[1] AnnotationDbi_1.28.1 Biobase_2.26.0       BiocGenerics_0.12.1
[4] GenomeInfoDb_1.2.4   IRanges_2.0.1        parallel_3.1.2      
[7] S4Vectors_0.4.0      stats4_3.1.2        
>
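
The last error in the transcript arises because select() only accepts names returned by columns(). One way to catch this before querying the web service is to compare the requested names against that list first. The sketch below is an illustration only: `split_columns` is a hypothetical helper, and `available` is a small stand-in vector containing names this post shows as listed or accepted. In a live session it would presumably be replaced by `columns(UniProt.ws)`.

```r
# Stand-in for columns(UniProt.ws); only names that appear in this post
available <- c("UNIPROTKB", "RGD", "GENEID", "ENTREZ_GENE")

# Hypothetical helper: separate requested names into accepted and unknown ones
split_columns <- function(requested, available) {
  list(
    valid   = intersect(requested, available),
    invalid = setdiff(requested, available)
  )
}

res <- split_columns(c("RGD", "genes(PREFERRED)"), available)
print(res)

# Searching the available names for a keyword can suggest alternatives
gene_like <- grep("GENE", available, value = TRUE)
print(gene_like)
```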

 

Answer: problem with retrieving UniProt annotation data
James W. MacDonald (United States) wrote, 4.7 years ago:

You might be better served using the org.Rn.eg.db package instead:

> library(org.Rn.eg.db)
> select(org.Rn.eg.db, IDkeys, "SYMBOL", "UNIPROT")
     UNIPROT SYMBOL
1     Q6MGA6  Psmb9
2 A0A023IMI6  Psmb8

Unless you need other specific things that are only available from UniProt. An alternative is to get what you can from UniProt and fill in the rest later. Do note that the 'gene names' (actually HGNC symbols) listed on that UniProt webpage have a little drop-down that shows where the data are imported from. It may well be that this is done on the fly and is not actually part of the UniProt database, which would explain why you are having problems retrieving them with UniProt.ws.
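
The "get some from UniProt, fill in the rest later" approach amounts to joining two result tables on the UniProt accession. This is a minimal sketch in base R; the two data frames are typed in by hand from the output shown in this thread rather than fetched, and the object names are illustrative.

```r
# Result shaped like the UniProt.ws select() output in the question
uniprot_ws <- data.frame(
  UNIPROTKB   = c("Q6MGA6", "A0A023IMI6"),
  ENTREZ_GENE = c("24967", "24968"),
  stringsAsFactors = FALSE
)

# Result shaped like the org.Rn.eg.db select() output above
org_db <- data.frame(
  UNIPROT = c("Q6MGA6", "A0A023IMI6"),
  SYMBOL  = c("Psmb9", "Psmb8"),
  stringsAsFactors = FALSE
)

# Left join on the accession; all.x = TRUE keeps IDs without a symbol as NA
combined <- merge(uniprot_ws, org_db,
                  by.x = "UNIPROTKB", by.y = "UNIPROT", all.x = TRUE)
print(combined)
```

Note that merge() sorts the result by the key column, so the row order may differ from the order of the input keys.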

 

Guido Hooiveld replied, 4.7 years ago:

Thanks, I didn't think of using the org.Rn.eg.db package, but this indeed works fine for my current case.

As a side note, I had the impression that keys and columns would only show rat-specific annotation info, and not all available info, including info that is specific to other species.
Answer: problem with retrieving UniProt annotation data
Marc Carlson (United States) wrote, 4.7 years ago:

Hi Guido,

This was a bug caused by the fact that internally we had two labels matched to the same thing (from UniProt).  I am now pushing a patch for this online.  However "ENTREZ_GENE" is what you were after when you asked for "GENEID" (They mean the same thing to UniProt).  So that is equivalent.  I will remove 'GENEID' from the list of values returned from columns in order to fix this.  Also I don't know where you got the idea to try 'genes(PREFERRED)' from??

Also, Jim's answer is a good one to follow for those things that you don't *need* to get from UniProt, as it will be more performant (since it can run off a local DB instead of a web service).

 Marc
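
Until the patched version is installed, scripts that still use "GENEID" could map it to "ENTREZ_GENE" before calling select(). This is only a sketch of that workaround: the alias table and `normalize_columns` are hypothetical and not part of UniProt.ws.

```r
# Hypothetical alias table: old label -> label that select() handles correctly
column_aliases <- c(GENEID = "ENTREZ_GENE")

# Replace any aliased names and drop duplicates that result from the mapping
normalize_columns <- function(cols, aliases = column_aliases) {
  hit <- cols %in% names(aliases)
  cols[hit] <- unname(aliases[cols[hit]])
  unique(cols)
}

normalized <- normalize_columns(c("GENEID", "RGD", "ENTREZ_GENE"))
print(normalized)
```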

Guido Hooiveld replied, 4.7 years ago:

Thanks, Marc.

Regarding my attempt with 'genes(PREFERRED)': I naively deduced this from the URL that is returned when performing a manual query on the UniProt website; see my first post. It is indeed not one of the database identifiers.