Question: biomaRt's uniprot_swissprot and uniprot_sptrembl attributes
2.8 years ago by
rubi90 wrote:

I'm trying to match a list of uniprot mouse IDs with ensembl protein, transcript, and gene IDs.

So I downloaded the following data with biomaRt:


library(biomaRt)

mart <- useMart(biomart="ensembl",dataset = "mmusculus_gene_ensembl")

mart.df <- getBM(attributes = c("uniprot_swissprot","uniprot_sptrembl","ensembl_gene_id","external_gene_name","description"),mart=mart)

However, many of the UniProt mouse IDs in my list have no match in the mart.df I downloaded.

For example:

Q922S4 is in my data and has a UniProt page, but it matches nothing in either mart.df$uniprot_swissprot or mart.df$uniprot_sptrembl. However, if I follow the cross-reference link from that page to UCSC, and from there the cross-reference link to Ensembl (;r=7:101410886-101512819;t=ENSMUST00000084894), the Ensembl page links to a different UniProt accession: F7D3W5. In contrast to Q922S4, F7D3W5 is an unreviewed entry.
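
The mismatch described above can be checked for a whole list at once. The following is only a minimal sketch: the small mart.df built here is a toy stand-in for the table returned by getBM(), its second row is entirely made up, and my_ids is a hypothetical name for the list of accessions.

```r
# Toy stand-in for the getBM() result; a real table would come from Ensembl.
mart.df <- data.frame(
  uniprot_swissprot = c("", "X00001"),        # "X00001" is a made-up accession
  uniprot_sptrembl  = c("F7D3W5", ""),
  ensembl_gene_id   = c("ENSMUSG00000030653", "ENSMUSG_EXAMPLE"),  # second ID is a placeholder
  stringsAsFactors  = FALSE
)

# Hypothetical list of accessions to look up
my_ids <- c("Q922S4", "F7D3W5")

# Pool the accessions from both UniProt columns, dropping empty and missing values
known_ids <- unique(c(mart.df$uniprot_swissprot, mart.df$uniprot_sptrembl))
known_ids <- known_ids[!is.na(known_ids) & known_ids != ""]

# Accessions that appear in neither column
unmatched <- setdiff(my_ids, known_ids)
print(unmatched)
```

With the real table, the length of unmatched gives a quick count of how many accessions fall into the same situation as Q922S4.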


So my questions are:

1. Is there any way to download both reviewed and unreviewed UniProt proteins from biomaRt, so that I can be sure to match all of my data?

2. Why does biomaRt hold the unreviewed accession rather than the reviewed one?


> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

 [1] LC_CTYPE=en_US.UTF-8          LC_NUMERIC=C                  LC_TIME=en_US.UTF-8           LC_COLLATE=en_US.UTF-8       
 [5] LC_MONETARY=en_US.UTF-8       LC_MESSAGES=en_US.UTF-8       LC_PAPER=en_US.UTF-8          LC_NAME=en_US.UTF-8          

attached base packages:
 [1] grid      stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] biomaRt_2.30.0       doBy_4.5-15          yaml_2.1.14          doParallel_1.0.10    iterators_1.0.8      foreach_1.4.3       
 [7] snpEnrichment_1.7.0  fgsea_1.0.2          Rcpp_0.12.8          data.tree_0.6.2      zoo_1.7-13           gplots_3.0.1        
[13] ggdendro_0.1-20      RColorBrewer_1.1-2   venneuler_1.1-0      rJava_0.9-8          scales_0.4.1         reshape2_1.4.2      
[19] plotrix_3.6-3        outliers_0.14        Hmisc_3.17-4         Formula_1.2-1        survival_2.40-1      lattice_0.20-34     
[25] data.table_1.9.6     edgeR_3.16.1         limma_3.30.2         ggpmisc_0.2.12       dplyr_0.5.0          plyr_1.8.4          
[31] magrittr_1.5         gridExtra_2.2.1      ggplot2_2.2.1        rtracklayer_1.34.1   Rsamtools_1.26.1     Biostrings_2.42.1   
[37] XVector_0.14.0       GenomicRanges_1.26.2 GenomeInfoDb_1.10.0  IRanges_2.8.1        S4Vectors_0.12.1     BiocGenerics_0.20.0 

loaded via a namespace (and not attached):
 [1] Biobase_2.34.0             viridis_0.3.4              jsonlite_1.1               splines_3.3.2              gtools_3.5.0              
 [6] assertthat_0.1             latticeExtra_0.6-28        RSQLite_1.0.0              chron_2.3-47               digest_0.6.11             
[11] colorspace_1.2-7           htmltools_0.3.5            Matrix_1.2-7.1             XML_3.98-1.4               DiagrammeR_0.9.0          
[16] zlibbioc_1.20.0            brew_1.0-6                 gdata_2.17.0               BiocParallel_1.8.1         tibble_1.2                
[21] influenceR_0.1.0           SummarizedExperiment_1.2.3 nnet_7.3-12                lazyeval_0.2.0             rgexf_0.15.3              
[26] MASS_7.3-45                foreign_0.8-67             Rook_1.1-1                 tools_3.3.2                stringr_1.1.0             
[31] munsell_0.4.3              locfit_1.5-9.1             cluster_2.0.5              AnnotationDbi_1.36.0       snpStats_1.24.0           
[36] caTools_1.17.1             RCurl_1.95-4.8             rstudioapi_0.6             htmlwidgets_0.8            visNetwork_1.0.3          
[41] igraph_1.0.1               bitops_1.0-6               codetools_0.2-15           gtable_0.2.0               DBI_0.5-1                 
[46] R6_2.2.0                   GenomicAlignments_1.8.4    fastmatch_1.0-4            KernSmooth_2.23-15         stringi_1.1.2             
[51] rpart_4.1-10               acepack_1.4.1             


Answer: biomaRt's uniprot_swissprot and uniprot_sptrembl attributes
2.8 years ago by
Mike Smith4.0k
EMBL Heidelberg / de.NBI
Mike Smith4.0k wrote:

Keeping multiple IDs in sync across multiple databases has proved challenging for as long as I've been in this field. Rather than discuss that, for now I'll offer a possible solution based on gene names, which I find to be slightly more reliable between sources. This isn't something that biomaRt (the R package) has any control over: it just retrieves information from the database you point it at, so you may get a more comprehensive response from someone directly involved with Ensembl or UniProt.

First we can use the httr package to query UniProt directly and convert a list of UniProt IDs into gene names.

library(httr)

my_protein_ids <- c('Q922S4', 'Q9UM73')

# POST the IDs to the UniProt ID mapping service, asking for gene names in return
results <- POST(url = "",
                body = list(from = 'ID',
                            to = 'GENENAME',
                            format = 'tab',
                            query = paste(my_protein_ids, collapse = ' ')))

# Parse the tab-separated response into a table
uniprot_results <- content(results, type = 'text/tab-separated-values', 
                           col_names = TRUE, 
                           col_types = NULL, 
                           encoding = "ISO-8859-1")
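
Before moving on, it may be worth checking that every ID actually came back with a gene name. The sketch below does not contact UniProt: it parses a hypothetical response body, assuming the service returns two tab-separated columns named From and To (only To is used in this answer, so From is an assumption), and the accession X00000 is made up to represent an ID that fails to map.

```r
# Hypothetical response body in the assumed "From / To" tab-separated layout
response_text <- "From\tTo\nQ922S4\tPde2a\nQ9UM73\tALK\n"

# "X00000" is a made-up accession standing in for an ID with no mapping
my_protein_ids <- c("Q922S4", "Q9UM73", "X00000")

# Parse the text the same way a saved response could be read from disk
uniprot_results <- read.delim(text = response_text, stringsAsFactors = FALSE)

# IDs that were submitted but are absent from the response
unmapped <- setdiff(my_protein_ids, uniprot_results$From)
print(unmapped)
```

Any accession listed in unmapped would silently drop out of the later Ensembl query, so it is easier to catch at this stage.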

We can then query Ensembl Biomart using those gene names to get the Ensembl gene, transcript and protein identifiers.

mart <- useMart(biomart="ensembl", dataset = "mmusculus_gene_ensembl")

mart.df <- getBM(attributes = c("external_gene_name", "ensembl_gene_id",
                                "ensembl_transcript_id", "ensembl_peptide_id"),
                 filters = "external_gene_name",
                 values = uniprot_results$To,
                 mart = mart)

And my results look like this:

> mart.df
   external_gene_name    ensembl_gene_id ensembl_transcript_id ensembl_peptide_id
1                 Alk ENSMUSG00000055471    ENSMUST00000086639 ENSMUSP00000083840
2               Pde2a ENSMUSG00000110195    ENSMUST00000163751 ENSMUSP00000131553
3               Pde2a ENSMUSG00000110195    ENSMUST00000211368 ENSMUSP00000147847
4               Pde2a ENSMUSG00000110195    ENSMUST00000210364 ENSMUSP00000148212
5               Pde2a ENSMUSG00000110195    ENSMUST00000209537 ENSMUSP00000147553
6               Pde2a ENSMUSG00000110195    ENSMUST00000210092 ENSMUSP00000147884
7               Pde2a ENSMUSG00000110195    ENSMUST00000209695                   
8               Pde2a ENSMUSG00000110195    ENSMUST00000211051                   
9               Pde2a ENSMUSG00000110195    ENSMUST00000210100                   
10              Pde2a ENSMUSG00000110195    ENSMUST00000210535                   
11              Pde2a ENSMUSG00000110195    ENSMUST00000210348                   
12              Pde2a ENSMUSG00000110195    ENSMUST00000209315                   
13              Pde2a ENSMUSG00000110195    ENSMUST00000166652 ENSMUSP00000127521
14              Pde2a ENSMUSG00000030653    ENSMUST00000084894 ENSMUSP00000081956
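
To get back to a table keyed on the original UniProt IDs, the two results can be joined on gene name. This is a sketch only: both data frames are small hand-built stand-ins (two rows copied from the output above), the From column and the To values are assumptions about what the UniProt query returned, and the join is done on lower-cased names because capitalisation can differ between sources.

```r
# Stand-in for the UniProt mapping; the From column and the To values are assumed
uniprot_results <- data.frame(
  From = c("Q922S4", "Q9UM73"),
  To   = c("Pde2a", "ALK"),
  stringsAsFactors = FALSE
)

# Two rows taken from the Ensembl output shown above
mart.df <- data.frame(
  external_gene_name    = c("Alk", "Pde2a"),
  ensembl_gene_id       = c("ENSMUSG00000055471", "ENSMUSG00000030653"),
  ensembl_transcript_id = c("ENSMUST00000086639", "ENSMUST00000084894"),
  ensembl_peptide_id    = c("ENSMUSP00000083840", "ENSMUSP00000081956"),
  stringsAsFactors = FALSE
)

# Build a case-insensitive join key on each side
uniprot_results$key <- tolower(uniprot_results$To)
mart.df$key         <- tolower(mart.df$external_gene_name)

# Join, then keep one column per identifier type
id_map <- merge(uniprot_results, mart.df, by = "key")
id_map <- id_map[, c("From", "external_gene_name", "ensembl_gene_id",
                     "ensembl_transcript_id", "ensembl_peptide_id")]
names(id_map)[1] <- "uniprot_id"
print(id_map)
```

Because the join is by gene name, one UniProt ID picks up every transcript and protein of every Ensembl gene carrying that name, as the two Pde2a gene IDs in the output above illustrate.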




The url parameter of the POST function changed to

written 4 weeks ago by matthias.stahl0