biomaRt's uniprot_swissprot and uniprot_sptrembl attributes
rubi ▴ 110
@rubi-6462
Last seen 6.3 years ago

I'm trying to match a list of uniprot mouse IDs with ensembl protein, transcript, and gene IDs.

So I downloaded the annotation through biomaRt:

library(biomaRt)

mart <- useMart(biomart = "ensembl", dataset = "mmusculus_gene_ensembl")

mart.df <- getBM(attributes = c("uniprot_swissprot", "uniprot_sptrembl", "ensembl_gene_id", "external_gene_name", "description"), mart = mart)

However, many in my list of uniprot mouse IDs do not have a match in the mart.df I downloaded. 

For example:

Q922S4 is in my data and has a UniProt page (http://www.uniprot.org/uniprot/Q922S4), but it matches neither mart.df$uniprot_swissprot nor mart.df$uniprot_sptrembl. However, if I follow the cross-reference link to UCSC (http://genome.ucsc.edu/cgi-bin/hgGene?hgg_gene=uc009ioq.3&org=mouse) and from there the cross-reference link to Ensembl (http://uswest.ensembl.org/Mus_musculus/Gene/Summary?g=ENSMUSG00000030653;r=7:101410886-101512819;t=ENSMUST00000084894), the Ensembl page links to a different UniProt accession: F7D3W5 (http://www.uniprot.org/uniprot/F7D3W5). In contrast to Q922S4, F7D3W5 is an unreviewed entry.
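
For reference, the unmatched IDs can be listed with a check along these lines. This is only a minimal sketch: `my_ids` and `toy_mart` are hypothetical stand-ins for my real ID list and for the mart.df returned by getBM(), and the toy rows are illustrative rather than actual query results.

```r
# Hypothetical stand-in for the list of UniProt accessions to look up
my_ids <- c("Q922S4", "F7D3W5")

# Toy stand-in for the getBM() result (illustrative rows, not real output)
toy_mart <- data.frame(
  uniprot_swissprot = c("", ""),
  uniprot_sptrembl  = c("F7D3W5", ""),
  ensembl_gene_id   = c("ENSMUSG00000030653", "ENSMUSG00000055471"),
  stringsAsFactors  = FALSE
)

# Pool both UniProt columns and drop the empty strings used for "no accession"
known_ids <- unique(c(toy_mart$uniprot_swissprot, toy_mart$uniprot_sptrembl))
known_ids <- known_ids[known_ids != ""]

# Accessions from my list that appear in neither column
unmatched <- setdiff(my_ids, known_ids)
unmatched
```

With the toy data, only Q922S4 is reported as unmatched, which mirrors what I see with the real download.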

 

So my questions are:

1. Is there any way to download both reviewed and unreviewed UniProt accessions from biomaRt, so that I can be sure to match all of my data?

2. Why does biomaRt hold the unreviewed accession rather than the reviewed one?

 

> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

locale:
 [1] LC_CTYPE=en_US.UTF-8          LC_NUMERIC=C                  LC_TIME=en_US.UTF-8           LC_COLLATE=en_US.UTF-8       
 [5] LC_MONETARY=en_US.UTF-8       LC_MESSAGES=en_US.UTF-8       LC_PAPER=en_US.UTF-8          LC_NAME=en_US.UTF-8          
 [9] LC_ADDRESS=en_US.UTF-8        LC_TELEPHONE=en_US.UTF-8      LC_MEASUREMENT=en_US.UTF-8    LC_IDENTIFICATION=en_US.UTF-8

attached base packages:
 [1] grid      stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] biomaRt_2.30.0       doBy_4.5-15          yaml_2.1.14          doParallel_1.0.10    iterators_1.0.8      foreach_1.4.3       
 [7] snpEnrichment_1.7.0  fgsea_1.0.2          Rcpp_0.12.8          data.tree_0.6.2      zoo_1.7-13           gplots_3.0.1        
[13] ggdendro_0.1-20      RColorBrewer_1.1-2   venneuler_1.1-0      rJava_0.9-8          scales_0.4.1         reshape2_1.4.2      
[19] plotrix_3.6-3        outliers_0.14        Hmisc_3.17-4         Formula_1.2-1        survival_2.40-1      lattice_0.20-34     
[25] data.table_1.9.6     edgeR_3.16.1         limma_3.30.2         ggpmisc_0.2.12       dplyr_0.5.0          plyr_1.8.4          
[31] magrittr_1.5         gridExtra_2.2.1      ggplot2_2.2.1        rtracklayer_1.34.1   Rsamtools_1.26.1     Biostrings_2.42.1   
[37] XVector_0.14.0       GenomicRanges_1.26.2 GenomeInfoDb_1.10.0  IRanges_2.8.1        S4Vectors_0.12.1     BiocGenerics_0.20.0 

loaded via a namespace (and not attached):
 [1] Biobase_2.34.0             viridis_0.3.4              jsonlite_1.1               splines_3.3.2              gtools_3.5.0              
 [6] assertthat_0.1             latticeExtra_0.6-28        RSQLite_1.0.0              chron_2.3-47               digest_0.6.11             
[11] colorspace_1.2-7           htmltools_0.3.5            Matrix_1.2-7.1             XML_3.98-1.4               DiagrammeR_0.9.0          
[16] zlibbioc_1.20.0            brew_1.0-6                 gdata_2.17.0               BiocParallel_1.8.1         tibble_1.2                
[21] influenceR_0.1.0           SummarizedExperiment_1.2.3 nnet_7.3-12                lazyeval_0.2.0             rgexf_0.15.3              
[26] MASS_7.3-45                foreign_0.8-67             Rook_1.1-1                 tools_3.3.2                stringr_1.1.0             
[31] munsell_0.4.3              locfit_1.5-9.1             cluster_2.0.5              AnnotationDbi_1.36.0       snpStats_1.24.0           
[36] caTools_1.17.1             RCurl_1.95-4.8             rstudioapi_0.6             htmlwidgets_0.8            visNetwork_1.0.3          
[41] igraph_1.0.1               bitops_1.0-6               codetools_0.2-15           gtable_0.2.0               DBI_0.5-1                 
[46] R6_2.2.0                   GenomicAlignments_1.8.4    fastmatch_1.0-4            KernSmooth_2.23-15         stringi_1.1.2             
[51] rpart_4.1-10               acepack_1.4.1             
 
 

 

biomart getDB • 1.5k views
Mike Smith ★ 6.6k
@mike-smith
Last seen 13 hours ago
EMBL Heidelberg

Keeping multiple IDs in sync across multiple databases has proved challenging for as long as I've been in this field. Rather than discuss that, for now I'll offer a possible solution based on gene names, which I find to be slightly more reliable between sources. Suffice it to say, this isn't something that biomaRt (the R package) has any control over: it just retrieves information from the database you point it at. You may get a more comprehensive response from someone directly involved with Ensembl or UniProt.

First we can use the httr package to query UniProt directly and convert a list of UniProt IDs into gene names.

library(httr)
my_protein_ids <- c('Q922S4', 'Q9UM73')
# Submit the IDs to the UniProt mapping service
# (this endpoint may have moved since the answer was written)
results <- POST(url = "http://www.uniprot.org/mapping/",
                body = list(from = 'ID',
                            to = 'GENENAME',
                            format = 'tab',
                            query = paste(my_protein_ids, collapse = ' ')))

# Parse the tab-separated response into a table; the gene names end up in the 'To' column
uniprot_results <- content(results, type = 'text/tab-separated-values', 
                           col_names = TRUE, 
                           col_types = NULL, 
                           encoding = "ISO-8859-1")

We can then query Ensembl Biomart using those gene names to get the Ensembl gene, transcript and protein identifiers.

library(biomaRt)
mart <- useMart(biomart="ensembl", dataset = "mmusculus_gene_ensembl")

mart.df <- getBM(attributes = c("external_gene_name", "ensembl_gene_id",
                                "ensembl_transcript_id", "ensembl_peptide_id"),
                 filters = "external_gene_name",
                 values = uniprot_results$To,
                 mart = mart)

And my results look like this:

> mart.df
   external_gene_name    ensembl_gene_id ensembl_transcript_id ensembl_peptide_id
1                 Alk ENSMUSG00000055471    ENSMUST00000086639 ENSMUSP00000083840
2               Pde2a ENSMUSG00000110195    ENSMUST00000163751 ENSMUSP00000131553
3               Pde2a ENSMUSG00000110195    ENSMUST00000211368 ENSMUSP00000147847
4               Pde2a ENSMUSG00000110195    ENSMUST00000210364 ENSMUSP00000148212
5               Pde2a ENSMUSG00000110195    ENSMUST00000209537 ENSMUSP00000147553
6               Pde2a ENSMUSG00000110195    ENSMUST00000210092 ENSMUSP00000147884
7               Pde2a ENSMUSG00000110195    ENSMUST00000209695                   
8               Pde2a ENSMUSG00000110195    ENSMUST00000211051                   
9               Pde2a ENSMUSG00000110195    ENSMUST00000210100                   
10              Pde2a ENSMUSG00000110195    ENSMUST00000210535                   
11              Pde2a ENSMUSG00000110195    ENSMUST00000210348                   
12              Pde2a ENSMUSG00000110195    ENSMUST00000209315                   
13              Pde2a ENSMUSG00000110195    ENSMUST00000166652 ENSMUSP00000127521
14              Pde2a ENSMUSG00000030653    ENSMUST00000084894 ENSMUSP00000081956
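
To tie each original UniProt ID back to its Ensembl identifiers, the two tables can be joined on the gene name. The following is a minimal sketch rather than a definitive recipe. It uses toy stand-ins (`toy_uniprot`, `toy_mart`) instead of the real query results, it assumes the UniProt table has a `From` column alongside the `To` column used above, and it matches names case-insensitively on the assumption that capitalisation may differ between the two sources.

```r
# Toy stand-in for the UniProt mapping table (the 'From' column is an assumption)
toy_uniprot <- data.frame(
  From = c("Q922S4", "Q9UM73"),
  To   = c("Pde2a", "ALK"),
  stringsAsFactors = FALSE
)

# Toy stand-in for the getBM() result, reduced to two columns
toy_mart <- data.frame(
  external_gene_name = c("Alk", "Pde2a"),
  ensembl_gene_id    = c("ENSMUSG00000055471", "ENSMUSG00000030653"),
  stringsAsFactors   = FALSE
)

# Build a lower-case key in both tables so "ALK" and "Alk" are treated as equal
toy_uniprot$key <- tolower(toy_uniprot$To)
toy_mart$key    <- tolower(toy_mart$external_gene_name)

# Inner join on the key: one row per UniProt ID / Ensembl record pair
id_map <- merge(toy_uniprot, toy_mart, by = "key")
id_map[, c("From", "external_gene_name", "ensembl_gene_id")]
```

Bear in mind that a name-based join can return more than one gene per UniProt ID, as Pde2a does in the results above, so the output may still need manual inspection.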

 

 


Note that the url argument of the POST call above has since changed to https://www.uniprot.org/uploadlists/.
