problem with retrieving UniProt annotation data
Guido Hooiveld ★ 4.1k
@guido-hooiveld-2020
Wageningen University, Wageningen, the …

I would like to annotate a set of UniProt IDs within BioC/R. To this end I am using the UniProt.ws package. However, when using R/BioC I am not able to retrieve the full information that apparently is available in the UniProt database.

Specifically, I would like to retrieve, among other things, the preferred gene name, which is labeled "Gene names (primary)" in the results table of UniProt's web-based ID mapping interface.

As an example, when 'manually' retrieving annotation info for the (rat) UniProt ID "Q6MGA6" through the UniProt website, I got these results:

http://www.uniprot.org/uniprot/?query=yourlist:M2015022663MSS5M698&sort=yourlist:M2015022663MSS5M698&columns=yourlist%28M2015022663MSS5M698%29,id%2Centry%20name%2Creviewed%2Cgenes%28PREFERRED%29%2Cprotein%20names%2Cgenes%2Corganism%2Clength%2Ccomment%28PATHWAY%29%2Cec%2Ccomment%28FUNCTION%29%2Corganism-id%2Cdatabase%28Ensembl%29%2Cdatabase%28GeneID%29%2Cyourlist%28M2015022613L2TBUNHS%29

As can be seen, this ID links to multiple (13) gene names/synonyms (7th column), but the primary gene name is Psmb9 (5th column).

 

Question: which column should I select to retrieve the primary gene name when using the UniProt.ws package?

--> From the URL I deduced that the 'column' to select for this query is named "genes(PREFERRED)" (URL-encoded as "genes%28PREFERRED%29"). However, this column is not present/accessible when using the UniProt.ws library.
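
The decoding step described above can be checked in base R. The following is a minimal sketch, assuming only that the result URL is available as a string; it uses `utils::URLdecode()` to list the column names used by the web interface. These are names used by the website, not values returned by `columns(UniProt.ws)`, and the variable names are illustrative.

```r
# Result URL copied from the UniProt website (split across lines for readability).
url <- paste0(
  "http://www.uniprot.org/uniprot/?query=yourlist:M2015022663MSS5M698",
  "&sort=yourlist:M2015022663MSS5M698",
  "&columns=yourlist%28M2015022663MSS5M698%29,id%2Centry%20name%2Creviewed",
  "%2Cgenes%28PREFERRED%29%2Cprotein%20names%2Cgenes%2Corganism%2Clength",
  "%2Ccomment%28PATHWAY%29%2Cec%2Ccomment%28FUNCTION%29%2Corganism-id",
  "%2Cdatabase%28Ensembl%29%2Cdatabase%28GeneID%29",
  "%2Cyourlist%28M2015022613L2TBUNHS%29"
)

# Keep only the query string, then split it into its "name=value" parameters.
query  <- sub("^[^?]*\\?", "", url)
params <- strsplit(query, "&", fixed = TRUE)[[1]]

# Extract the value of the 'columns' parameter and undo the percent-encoding.
columns_param <- sub("^columns=", "", grep("^columns=", params, value = TRUE))
web_columns   <- strsplit(utils::URLdecode(columns_param), ",", fixed = TRUE)[[1]]

print(web_columns)
```

The fifth decoded entry is "genes(PREFERRED)" and the seventh is "genes", which matches the column positions mentioned above.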

Any hints would be greatly appreciated.

Thanks,

Guido

> library(UniProt.ws)

> #set taxonomy ID for Rn
> taxId(UniProt.ws) <- 10116
>
> # check
> species(UniProt.ws)
[1] "Rattus norvegicus"
> # check which columns (annotation info) can be retrieved
> # In total there are 125 annotation columns available, but "genes(PREFERRED)" isn't one of these...
>
> head(columns(UniProt.ws))
[1] "UNIPROTKB"         "UNIPARC"           "UNIREF50"         
[4] "UNIREF90"          "UNIREF100"         "EMBL/GENBANK/DDBJ"
>

> IDkeys <- c("Q6MGA6", "A0A023IMI6")
> annotation <- select(x=UniProt.ws, keys=IDkeys, columns=c("RGD"), keytype="UNIPROTKB") # This works!
Getting mapping data for Q6MGA6 ... and RGD_ID
> annotation
   UNIPROTKB  RGD
1     Q6MGA6 3427
2 A0A023IMI6 <NA>
>

> # but this does not, although "GENEID" is listed as column type...
> annotation <- select(x=UniProt.ws, keys=IDkeys, columns=c("GENEID"), keytype="UNIPROTKB")
Getting mapping data for Q6MGA6 ... and P_ENTREZGENEID
Error in `[.data.frame`(tab, , oriTabCols) : undefined columns selected
>

>  columns(UniProt.ws)[37]
[1] "GENEID"
>

> ## this works
> annotation <- select(x=UniProt.ws, keys=IDkeys, columns=c("ENTREZ_GENE"), keytype="UNIPROTKB")
Getting mapping data for Q6MGA6 ... and P_ENTREZGENEID
> annotation
   UNIPROTKB ENTREZ_GENE
1     Q6MGA6       24967
2 A0A023IMI6       24968
>
> # but this does not....
> annotation <- select(x=UniProt.ws, keys=IDkeys, columns=c("genes(PREFERRED)"), keytype="UNIPROTKB")
Error in .select(x, keys, columns, keytype) :
  columns argument MUST match a value returned by columns method
>

> sessionInfo()
R version 3.1.2 Patched (2015-02-03 r67717)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] UniProt.ws_2.6.0 RCurl_1.95-4.5   bitops_1.0-6     RSQLite_1.0.0   
[5] DBI_0.3.1       

loaded via a namespace (and not attached):
[1] AnnotationDbi_1.28.1 Biobase_2.26.0       BiocGenerics_0.12.1
[4] GenomeInfoDb_1.2.4   IRanges_2.0.1        parallel_3.1.2      
[7] S4Vectors_0.4.0      stats4_3.1.2        
>
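
The last error in the session above arises because the requested name is not among the values returned by `columns()`. As a minimal sketch (not part of UniProt.ws), requested names can be compared with the available ones before calling `select()`. The helper `split_columns()` is hypothetical, and `available` holds only a few column names taken from the session above; in a real session one would presumably use `columns(UniProt.ws)` instead.

```r
# Hypothetical helper: separate requested column names into those that are
# available and those that are not.
split_columns <- function(requested, available) {
  list(
    valid   = intersect(requested, available),
    unknown = setdiff(requested, available)
  )
}

# Illustrative subset of column names seen in the session above.
# Assumption: in practice this would be columns(UniProt.ws).
available <- c("UNIPROTKB", "UNIPARC", "UNIREF50", "UNIREF90", "UNIREF100",
               "EMBL/GENBANK/DDBJ", "RGD", "GENEID", "ENTREZ_GENE")

requested <- c("RGD", "ENTREZ_GENE", "genes(PREFERRED)")
checked   <- split_columns(requested, available)
print(checked)

# Search the available names for gene-related columns.
gene_like <- grep("GENE", available, value = TRUE)
print(gene_like)
```

Checking up front makes it clear which names would be rejected, and the `grep()` call is a quick way to scan a long column list for likely candidates.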

 

@james-w-macdonald-5106
United States

You might be better served using the org.Rn.eg.db package instead:

> library(org.Rn.eg.db)
> select(org.Rn.eg.db, IDkeys, "SYMBOL", "UNIPROT")
     UNIPROT SYMBOL
1     Q6MGA6  Psmb9
2 A0A023IMI6  Psmb8

That is, unless you need other specific things that are only available from UniProt. An alternative is to get what you need from UniProt and then fill in the rest afterwards. Do note that what UniProt calls 'gene names' (actually official gene symbols, e.g. HGNC symbols for human) on that UniProt webpage come with a little drop-down that shows where the data are imported from. It may well be that this is done on the fly and is not actually part of the UniProt database, which would explain why you are having problems retrieving it with UniProt.ws.
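
The "fill in later" approach could look like the following sketch. To keep it runnable without a web connection or annotation packages, the two data frames are typed in by hand from the output shown in this thread; in practice they would come from the `select()` calls on `UniProt.ws` and `org.Rn.eg.db`. The object names are illustrative.

```r
# Result from UniProt.ws (values copied from the question).
uniprot_anno <- data.frame(
  UNIPROTKB = c("Q6MGA6", "A0A023IMI6"),
  RGD       = c("3427", NA),
  stringsAsFactors = FALSE
)

# Result from org.Rn.eg.db (values copied from the output above).
symbols <- data.frame(
  UNIPROT = c("Q6MGA6", "A0A023IMI6"),
  SYMBOL  = c("Psmb9", "Psmb8"),
  stringsAsFactors = FALSE
)

# Left join: keep every UniProt ID and add the gene symbol where one exists.
combined <- merge(uniprot_anno, symbols,
                  by.x = "UNIPROTKB", by.y = "UNIPROT", all.x = TRUE)
print(combined)
```

Using `all.x = TRUE` keeps IDs that have no symbol in the second table, so nothing retrieved from UniProt is silently dropped.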

 


Thanks, I didn't think of using the org.Rn.eg.db package, but this indeed works fine for my current case.

As a side note, I had the impression that keys and columns would show only rat-specific annotation info, rather than all available info, including info that is specific to certain species.

Marc Carlson ★ 7.2k
@marc-carlson-2264
United States

Hi Guido,

This was a bug caused by the fact that internally we had two labels matched to the same thing (from UniProt). I am now pushing a patch for this. However, "ENTREZ_GENE" is what you were after when you asked for "GENEID" (they mean the same thing to UniProt), so the two are equivalent. I will remove 'GENEID' from the list of values returned from columns in order to fix this. Also, I don't know where you got the idea to try 'genes(PREFERRED)' from?

Also, Jim's answer is a good one to follow for those things that you don't *need* to get from UniProt, as it will be more performant (since it can run off a local DB instead of a web service).
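
To illustrate the kind of clash described here, the following sketch uses a hypothetical lookup table; it is not the actual UniProt.ws internals. The internal codes are the ones printed in the "Getting mapping data" messages in the question, and the assumption is simply that each user-facing column name maps to one such code.

```r
# Hypothetical map from user-facing column names to internal codes.
# Assumption: this only mimics the situation; it is not the package's own table.
label_map <- c(
  RGD         = "RGD_ID",
  GENEID      = "P_ENTREZGENEID",
  ENTREZ_GENE = "P_ENTREZGENEID"
)

# Group the column names by the internal code they map to.
by_code <- split(names(label_map), unname(label_map))

# Codes reached by more than one column name are candidates for this clash.
clashes <- Filter(function(labels) length(labels) > 1, by_code)
print(clashes)
```

In this toy table, "GENEID" and "ENTREZ_GENE" end up under the same code, which is the situation that removing one of the two labels resolves.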

 Marc


Thanks, Marc.

Regarding my attempt with 'genes(PREFERRED)': I naively deduced this from the URL that is returned when performing a manual query on the UniProt website; see my first post. It is indeed not one of the database identifiers.
