UniprotKB AC/ID to Entrez ID Conversion
tom.kloter • 0

I used the TTD database to identify a list of proteins that I want to highlight in a KEGG pathway map.

From this database I can extract the UniProtKB AC/ID for each protein. For HDAC1, for example, this is HDAC1_HUMAN. On the UniProt website (https://www.uniprot.org/uploadlists/) I can enter that ID and select "From UniProtKB AC/ID to GeneID (Entrez Gene)", which gives me 3065 for the HDAC1_HUMAN protein. I am trying to do exactly this in RStudio with a vector of UniProt IDs that I want to convert to Entrez IDs. However, I couldn't get it to work, so I wanted to ask whether someone can help me with this. I installed the UniProt.ws package but cannot figure out which function to use.

So what I want to achieve is the following:

Input vector:

c(HDAC1_HUMAN,RIR2_HUMAN,PK3CG_HUMAN,TOP1_HUMAN,TOP2A_HUMAN, TOP2B_HUMAN,S19A1_HUMAN, PCFT_HUMAN)

Some R function to convert this

Output vector:

c(3065, 6241, 5294, 7150, 7153, 7155, 6573, 113235)

This output vector would then be used in the following function to highlight the proteins in two KEGG pathway maps:

pv.out <- pathview(gene.data = 'Output vector', pathway.id = c('hsa05200','hsa04110'), species = "hsa", out.suffix = "gse16873", kegg.native = T)

Which function can I use to achieve this?

Thanks a lot for your help!


R version 4.1.1 (2021-08-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] stringr_1.4.0       readr_2.1.2         UniProt.ws_2.34.0   BiocGenerics_0.40.0
[5] RCurl_1.98-1.6      RSQLite_2.2.10      pathview_1.34.0
Guido Hooiveld ★ 4.1k

The key thing to realize is that your input is a vector of UniProt IDs (also known as 'entry names'). The central identifiers in the UniProt database are accession numbers, though. It would therefore be best to retrieve and use the accession numbers, also because these remain stable between database versions (see here for more info). As a bonus, you would then also be able to use the more efficient (i.e. faster) annotation library org.Hs.eg.db (see e.g. here; the entry-name IDs cannot be queried in the OrgDb).

So, regarding your first entry: HDAC1_HUMAN is a UniProt ID, and its accession number (the central identifier) is Q13547.
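As a rough sketch of the OrgDb route mentioned above: once you have accession numbers, they can be looked up locally with mapIds() from AnnotationDbi, using the UNIPROT keytype of org.Hs.eg.db. This assumes both packages are installed; the helper name acc_to_entrez is made up for illustration and is not part of any package.

```r
# Sketch only: map UniProt accession numbers to Entrez Gene IDs locally.
# Assumes org.Hs.eg.db (and with it AnnotationDbi) is installed.
# acc_to_entrez is a hypothetical helper name, not part of any package.
acc_to_entrez <- function(acc) {
  if (!requireNamespace("org.Hs.eg.db", quietly = TRUE)) {
    stop("org.Hs.eg.db is not installed; try BiocManager::install('org.Hs.eg.db')")
  }
  AnnotationDbi::mapIds(org.Hs.eg.db::org.Hs.eg.db,
                        keys = acc,
                        column = "ENTREZID",   # Entrez Gene ID column
                        keytype = "UNIPROT",   # keys are accession numbers
                        multiVals = "first")   # keep the first hit per key
}

# Example call (needs the annotation package to be installed):
# acc_to_entrez("Q13547")
```

Note that this only works for accession numbers such as Q13547, not for entry names such as HDAC1_HUMAN.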

Now using some code based on that posted before:

> library(UniProt.ws)
> up <- UniProt.ws(taxId=9606)
> up
"UniProt.ws" object:
An interface object for UniProt web services
Current Taxonomy ID:
9606
Current Species name:
Downloading: 710 kB     
Homo sapiens
To change Species see: help('availableUniprotSpecies')
>
> # using the Uniprot accession number of your first protein
> select(up, 
+               keys = c("Q13547"), 
+               columns = c("ENTREZ_GENE"),
+               keytype = "UNIPROTKB")
Getting mapping data for Q13547 ... and P_ENTREZGENEID
'select()' returned 1:1 mapping between keys and columns
  UNIPROTKB ENTREZ_GENE
1    Q13547        3065
> 
> # using the Uniprot identifier (entry); the type of keytype you have
> # note that querying goes much slower when compared to using an accession number
> # because of the cross-mappings to the central identifier that have to be performed.
>  select(up, 
+               keys = c("HDAC1_HUMAN"), 
+               columns = c("ENTREZ_GENE"),
+               keytype = "UNIPROTKB_ID")
Getting mapping data for HDAC1_HUMAN ... and ACC
Getting mapping data for Q13547 ... and P_ENTREZGENEID
'select()' returned 1:1 mapping between keys and columns
  UNIPROTKB_ID ENTREZ_GENE
1  HDAC1_HUMAN        3065
> 
> # use all your inputs (note that you have to add quotation marks):
> input <- c("HDAC1_HUMAN","RIR2_HUMAN","PK3CG_HUMAN","TOP1_HUMAN","TOP2A_HUMAN",
+ "TOP2B_HUMAN","S19A1_HUMAN", "PCFT_HUMAN")
>
> 
> select(up, 
+               keys = input, 
+               columns = c("ENTREZ_GENE"),
+               keytype = "UNIPROTKB_ID")
Getting mapping data for HDAC1_HUMAN ... and ACC
Getting mapping data for Q13547 ... and P_ENTREZGENEID
'select()' returned 1:1 mapping between keys and columns
  UNIPROTKB_ID ENTREZ_GENE
1  HDAC1_HUMAN        3065
2   RIR2_HUMAN        6241
3  PK3CG_HUMAN        5294
4   TOP1_HUMAN        7150
5  TOP2A_HUMAN        7153
6  TOP2B_HUMAN        7155
7  S19A1_HUMAN        6573
8   PCFT_HUMAN      113235
> 
> sessionInfo()
R version 4.2.0 Patched (2022-05-12 r82348 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] UniProt.ws_2.36.0   BiocGenerics_0.42.0 RSQLite_2.2.14     
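To get from this result to the input for pathview, something like the following may work. It is only a sketch: the data frame is typed in by hand from the output above so that the snippet runs without web access, and the pathview call is left commented out because it needs the pathview package and an internet connection.

```r
# Sketch: turn the data frame returned by select() into the ID vector
# for pathview(). 'res' is typed in by hand from the output shown above;
# in a real session it would be the return value of select().
res <- data.frame(
  UNIPROTKB_ID = c("HDAC1_HUMAN", "RIR2_HUMAN", "PK3CG_HUMAN", "TOP1_HUMAN",
                   "TOP2A_HUMAN", "TOP2B_HUMAN", "S19A1_HUMAN", "PCFT_HUMAN"),
  ENTREZ_GENE  = c("3065", "6241", "5294", "7150",
                   "7153", "7155", "6573", "113235"),
  stringsAsFactors = FALSE
)

# Drop unmapped entries (NA) and duplicates; keep the IDs as character.
entrez <- unique(res$ENTREZ_GENE[!is.na(res$ENTREZ_GENE)])

# A character vector of gene IDs can be passed as gene.data to highlight
# those genes (assumes pathview is installed; needs web access):
# library(pathview)
# pv.out <- pathview(gene.data = entrez,
#                    pathway.id = c("hsa05200", "hsa04110"),
#                    species = "hsa", out.suffix = "gse16873",
#                    kegg.native = TRUE)
```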

Thanks a lot for your response!

I tried the code in my script and got to the following result:

> up <- UniProt.ws(taxId=9606)
> up
"UniProt.ws" object:
An interface object for UniProt web services
Current Taxonomy ID:
9606
Current Species name:
Homo sapiens
To change Species see: help('availableUniprotSpecies')
>   # using the Uniprot accession number of your first protein
> select(up, keys = c("Q13547"), columns = c("ENTREZ_GENE"),keytype = "UNIPROTKB")
Getting mapping data for Q13547 ... and P_ENTREZGENEID
error while trying to retrieve data in chunk 1:
    no results after 5 attempts; please try again later
continuing to try
Fehler in `colnames<-`(`*tmp*`, value = rosetta[idx, 1]) : 
  Versuch die 'colnames' für ein Objekt mit weniger als zwei Dimensionen zu setzen

The error message translates to: "attempt to set 'colnames' on an object with less than two dimensions". The only object in the script so far is 'up'. Do I have to modify 'up' manually?


No, no need to modify up.

It apparently has to do with the R/BioC version: it works with the latest versions (R-4.2.0/BioC-3.15), but not with the previous ones (R-4.1.x/BioC-3.14). You will thus need to update!

# R-4.1.x/BioC-3.14    
> library(UniProt.ws)
> up <- UniProt.ws(taxId=9606)
> select(up, keys = c("Q13547"), columns = c("ENTREZ_GENE"),keytype = "UNIPROTKB")
Getting mapping data for Q13547 ... and P_ENTREZGENEID
error while trying to retrieve data in chunk 1:
    no results after 5 attempts; please try again later
continuing to try
Error in `colnames<-`(`*tmp*`, value = rosetta[idx, 1]) : 
  attempt to set 'colnames' on an object with less than two dimensions
> sessionInfo()
R version 4.1.1 Patched (2021-09-28 r80981)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

other attached packages:
[1] UniProt.ws_2.34.0   BiocGenerics_0.40.0 RCurl_1.98-1.6     
[4] RSQLite_2.2.12     


# R-4.2.0/BioC-3.15    
> library(UniProt.ws)
> up <- UniProt.ws(taxId=9606)
> select(up, keys = c("Q13547"), columns = c("ENTREZ_GENE"),keytype = "UNIPROTKB")
Getting mapping data for Q13547 ... and P_ENTREZGENEID
'select()' returned 1:1 mapping between keys and columns
  UNIPROTKB ENTREZ_GENE
1    Q13547        3065
> sessionInfo()
R version 4.2.0 Patched (2022-05-12 r82348 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

other attached packages:
[1] UniProt.ws_2.36.0   BiocGenerics_0.42.0 RSQLite_2.2.14     
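A quick way to see whether an installation is behind the working session above is to compare package versions. The snippet below is a minimal sketch; meets_minimum is a hypothetical helper (not part of any package), and the commented lines assume UniProt.ws and BiocManager are installed.

```r
# Sketch: check whether an installed version is at least the one that
# worked in the session above (UniProt.ws 2.36.0 under BioC 3.15).
# meets_minimum is a hypothetical helper, not part of any package.
meets_minimum <- function(installed, minimum) {
  package_version(installed) >= package_version(minimum)
}

meets_minimum("2.34.0", "2.36.0")  # FALSE
meets_minimum("2.36.0", "2.36.0")  # TRUE

# On a real installation (assumes UniProt.ws and BiocManager are installed):
# meets_minimum(as.character(packageVersion("UniProt.ws")), "2.36.0")
# BiocManager::version()                  # Bioconductor release in use
# BiocManager::install(version = "3.15")  # upgrade; BioC 3.15 needs R 4.2
```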

I fixed this when Bioc 3.13 was the release version, so it should work in 3.14 and the current release.

> up <- UniProt.ws()
> input <- c("HDAC1_HUMAN","RIR2_HUMAN","PK3CG_HUMAN","TOP1_HUMAN","TOP2A_HUMAN","TOP2B_HUMAN","S19A1_HUMAN", "PCFT_HUMAN")

> select(up, input, "ENTREZ_GENE","UNIPROTKB_ID")
Getting mapping data for HDAC1_HUMAN ... and ACC
Getting mapping data for Q13547 ... and P_ENTREZGENEID
'select()' returned 1:1 mapping between keys and columns
  UNIPROTKB_ID ENTREZ_GENE
1  HDAC1_HUMAN        3065
2   RIR2_HUMAN        6241
3  PK3CG_HUMAN        5294
4   TOP1_HUMAN        7150
5  TOP2A_HUMAN        7153
6  TOP2B_HUMAN        7155
7  S19A1_HUMAN        6573
8   PCFT_HUMAN      113235

## AND

> select(up, keys = c("Q13547"), columns = c("ENTREZ_GENE"),keytype = "UNIPROTKB")
Getting mapping data for Q13547 ... and P_ENTREZGENEID
'select()' returned 1:1 mapping between keys and columns
  UNIPROTKB ENTREZ_GENE
1    Q13547        3065

> sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /share/apps/MKL/mkl-2019.3/compilers_and_libraries_2019.3.199/linux/mkl/lib/intel64_lin/libmkl_gf_lp64.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] UniProt.ws_2.34.0   BiocGenerics_0.40.0 RCurl_1.98-1.6     
[4] RSQLite_2.2.10     

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.8             compiler_4.1.2         pillar_1.7.0          
 [4] dbplyr_2.1.1           GenomeInfoDb_1.30.1    XVector_0.34.0        
 [7] bitops_1.0-7           tools_4.1.2            zlibbioc_1.40.0       
[10] bit_4.0.4              tibble_3.1.6           memoise_2.0.1         
[13] BiocFileCache_2.2.1    lifecycle_1.0.1        pkgconfig_2.0.3       
[16] png_0.1-7              rlang_1.0.1            DBI_1.1.2             
[19] cli_3.2.0              filelock_1.0.2         curl_4.3.2            
[22] fastmap_1.1.0          GenomeInfoDbData_1.2.7 httr_1.4.2            
[25] dplyr_1.0.8            rappdirs_0.3.3         Biostrings_2.62.0     
[28] generics_0.1.2         S4Vectors_0.32.3       vctrs_0.3.8           
[31] IRanges_2.28.0         tidyselect_1.1.2       stats4_4.1.2          
[34] bit64_4.0.5            glue_1.6.2             Biobase_2.54.0        
[37] R6_2.5.1               fansi_1.0.2            AnnotationDbi_1.56.2  
[40] purrr_0.3.4            blob_1.2.2             magrittr_2.0.2        
[43] ellipsis_0.3.2         KEGGREST_1.34.0        assertthat_0.2.1      
[46] utf8_1.2.2             cachem_1.0.6           crayon_1.5.0

But we only support the current release, so the OP should upgrade if they are not already using it.


Hi,

I updated R, RStudio as well as the packages and now everything is working neatly! Thank you so much for your contributions to my problem and competent answers, this helped me a lot!

Have a nice day!
