Question

biomaRt's uniprot_swissprot and uniprot_sptrembl attributes

rubi ▴ 110

I'm trying to match a list of mouse UniProt IDs to Ensembl protein, transcript, and gene IDs.

So I downloaded the relevant attributes with biomaRt:

library(biomaRt, quietly = TRUE)

mart <- useMart(biomart = "ensembl", dataset = "mmusculus_gene_ensembl")

mart.df <- getBM(attributes = c("uniprot_swissprot", "uniprot_sptrembl", "ensembl_gene_id", "external_gene_name", "description"), mart = mart)

However, many of the mouse UniProt IDs in my list have no match in the mart.df I downloaded.

For example:

Q922S4 is in my data and has a UniProt page (http://www.uniprot.org/uniprot/Q922S4), but it matches neither mart.df$uniprot_swissprot nor mart.df$uniprot_sptrembl. However, if I follow the cross-reference link to UCSC (http://genome.ucsc.edu/cgi-bin/hgGene?hgg_gene=uc009ioq.3&org=mouse) and from there follow the cross-reference link to Ensembl (http://uswest.ensembl.org/Mus_musculus/Gene/Summary?g=ENSMUSG00000030653;r=7:101410886-101512819;t=ENSMUST00000084894), the Ensembl page links to a different UniProt accession: F7D3W5 (http://www.uniprot.org/uniprot/F7D3W5). In contrast to Q922S4, F7D3W5 is an unreviewed entry.

So my questions are:

1. Is there any way to download both reviewed and unreviewed UniProt proteins from biomaRt, so that I can be sure to match my data?

2. Why does biomaRt hold the unreviewed accession rather than the reviewed one?

> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

locale:
 [1] LC_CTYPE=en_US.UTF-8          LC_NUMERIC=C                  LC_TIME=en_US.UTF-8           LC_COLLATE=en_US.UTF-8       
 [5] LC_MONETARY=en_US.UTF-8       LC_MESSAGES=en_US.UTF-8       LC_PAPER=en_US.UTF-8          LC_NAME=en_US.UTF-8          
 [9] LC_ADDRESS=en_US.UTF-8        LC_TELEPHONE=en_US.UTF-8      LC_MEASUREMENT=en_US.UTF-8    LC_IDENTIFICATION=en_US.UTF-8

attached base packages:
 [1] grid      stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] biomaRt_2.30.0       doBy_4.5-15          yaml_2.1.14          doParallel_1.0.10    iterators_1.0.8      foreach_1.4.3       
 [7] snpEnrichment_1.7.0  fgsea_1.0.2          Rcpp_0.12.8          data.tree_0.6.2      zoo_1.7-13           gplots_3.0.1        
[13] ggdendro_0.1-20      RColorBrewer_1.1-2   venneuler_1.1-0      rJava_0.9-8          scales_0.4.1         reshape2_1.4.2      
[19] plotrix_3.6-3        outliers_0.14        Hmisc_3.17-4         Formula_1.2-1        survival_2.40-1      lattice_0.20-34     
[25] data.table_1.9.6     edgeR_3.16.1         limma_3.30.2         ggpmisc_0.2.12       dplyr_0.5.0          plyr_1.8.4          
[31] magrittr_1.5         gridExtra_2.2.1      ggplot2_2.2.1        rtracklayer_1.34.1   Rsamtools_1.26.1     Biostrings_2.42.1   
[37] XVector_0.14.0       GenomicRanges_1.26.2 GenomeInfoDb_1.10.0  IRanges_2.8.1        S4Vectors_0.12.1     BiocGenerics_0.20.0 

loaded via a namespace (and not attached):
 [1] Biobase_2.34.0             viridis_0.3.4              jsonlite_1.1               splines_3.3.2              gtools_3.5.0              
 [6] assertthat_0.1             latticeExtra_0.6-28        RSQLite_1.0.0              chron_2.3-47               digest_0.6.11             
[11] colorspace_1.2-7           htmltools_0.3.5            Matrix_1.2-7.1             XML_3.98-1.4               DiagrammeR_0.9.0          
[16] zlibbioc_1.20.0            brew_1.0-6                 gdata_2.17.0               BiocParallel_1.8.1         tibble_1.2                
[21] influenceR_0.1.0           SummarizedExperiment_1.2.3 nnet_7.3-12                lazyeval_0.2.0             rgexf_0.15.3              
[26] MASS_7.3-45                foreign_0.8-67             Rook_1.1-1                 tools_3.3.2                stringr_1.1.0             
[31] munsell_0.4.3              locfit_1.5-9.1             cluster_2.0.5              AnnotationDbi_1.36.0       snpStats_1.24.0           
[36] caTools_1.17.1             RCurl_1.95-4.8             rstudioapi_0.6             htmlwidgets_0.8            visNetwork_1.0.3          
[41] igraph_1.0.1               bitops_1.0-6               codetools_0.2-15           gtable_0.2.0               DBI_0.5-1                 
[46] R6_2.2.0                   GenomicAlignments_1.8.4    fastmatch_1.0-4            KernSmooth_2.23-15         stringi_1.1.2             
[51] rpart_4.1-10               acepack_1.4.1

biomart getDB • 1.3k views

updated 7.2 years ago by Mike Smith ★ 6.5k • written 7.2 years ago by rubi ▴ 110

score 2 · Answer 1 · 2017-02-07

Keeping multiple IDs in sync across multiple databases has proved challenging for as long as I've been in this field, so rather than discuss that, for now I'll offer a possible solution based on gene names, which I find to be slightly more reliable between sources. Note that this isn't something biomaRt (the R package) has any control over: it just retrieves information from the database you point it at. You may get a more comprehensive response from someone directly involved with Ensembl or UniProt.

First we can use the httr package to query UniProt directly and convert a list of UniProt IDs into gene names.

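library(httr)

# UniProt accessions to convert
my_protein_ids <- c('Q922S4', 'Q9UM73')

# Submit the accessions to UniProt's mapping service, asking for gene names.
# Note: this is UniProt's legacy mapping endpoint; newer UniProt releases
# expose ID mapping under rest.uniprot.org/idmapping instead.
results <- POST(url = "http://www.uniprot.org/mapping/",
                body = list(from = 'ID',
                            to = 'GENENAME',
                            format = 'tab',
                            query = paste(my_protein_ids, collapse = ' ')))

# Parse the tab-separated response into a table; the gene names are in
# the 'To' column, which is used in the biomaRt query that follows.
uniprot_results <- content(results, type = 'text/tab-separated-values', 
                           col_names = TRUE, 
                           col_types = NULL, 
                           encoding = "ISO-8859-1")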
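
Before moving on, it can be worth checking whether every accession actually came back with a gene name. The following is a minimal sketch of that check. It assumes the mapping table has a `From` column holding the submitted accessions (only `To` is used in this answer, so `From` is an assumption), and it uses a hypothetical stand-in for `uniprot_results` rather than real output from the service.

```r
# Accessions submitted to the mapping service (from the example above).
my_protein_ids <- c("Q922S4", "Q9UM73")

# Hypothetical stand-in for `uniprot_results`: here only one accession
# came back with a gene name. Values are illustrative, not real output.
uniprot_results <- data.frame(From = "Q922S4", To = "Pde2a",
                              stringsAsFactors = FALSE)

# Accessions with no gene name returned; these need a manual look-up.
unmapped_ids <- setdiff(my_protein_ids, uniprot_results$From)
print(unmapped_ids)
```

Anything left in `unmapped_ids` would be silently dropped by a gene-name based query, so it is better to find out at this stage.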

We can then query Ensembl BioMart using those gene names to get the Ensembl gene, transcript and protein identifiers.

library(biomaRt)

# Connect to the mouse gene dataset in Ensembl
mart <- useMart(biomart = "ensembl", dataset = "mmusculus_gene_ensembl")

# Look up gene, transcript and protein IDs for the gene names from UniProt
mart.df <- getBM(attributes = c("external_gene_name", "ensembl_gene_id",
                                "ensembl_transcript_id", "ensembl_peptide_id"),
                 filters = "external_gene_name",
                 values = uniprot_results$To,
                 mart = mart)

And my results look like this:

> mart.df
   external_gene_name    ensembl_gene_id ensembl_transcript_id ensembl_peptide_id
1                 Alk ENSMUSG00000055471    ENSMUST00000086639 ENSMUSP00000083840
2               Pde2a ENSMUSG00000110195    ENSMUST00000163751 ENSMUSP00000131553
3               Pde2a ENSMUSG00000110195    ENSMUST00000211368 ENSMUSP00000147847
4               Pde2a ENSMUSG00000110195    ENSMUST00000210364 ENSMUSP00000148212
5               Pde2a ENSMUSG00000110195    ENSMUST00000209537 ENSMUSP00000147553
6               Pde2a ENSMUSG00000110195    ENSMUST00000210092 ENSMUSP00000147884
7               Pde2a ENSMUSG00000110195    ENSMUST00000209695                   
8               Pde2a ENSMUSG00000110195    ENSMUST00000211051                   
9               Pde2a ENSMUSG00000110195    ENSMUST00000210100                   
10              Pde2a ENSMUSG00000110195    ENSMUST00000210535                   
11              Pde2a ENSMUSG00000110195    ENSMUST00000210348                   
12              Pde2a ENSMUSG00000110195    ENSMUST00000209315                   
13              Pde2a ENSMUSG00000110195    ENSMUST00000166652 ENSMUSP00000127521
14              Pde2a ENSMUSG00000030653    ENSMUST00000084894 ENSMUSP00000081956
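
The table above no longer carries the original UniProt accessions, so a last step is to join it back to the mapping table on gene name. Below is a rough sketch of one way to do that, not the only one. The `From`/`To` values are illustrative stand-ins (the `From` column name is an assumption), the `mart.df` rows are a subset of the output shown above, and `key`, `mapped` and `uniprot_id` are names invented for this example. The join key is lower-cased in case gene symbols differ in capitalisation between sources.

```r
# Hypothetical stand-in for the UniProt mapping table (illustrative values).
uniprot_results <- data.frame(From = c("Q922S4", "Q9UM73"),
                              To   = c("Pde2a", "ALK"),
                              stringsAsFactors = FALSE)

# A few rows of the mart.df output shown above.
mart.df <- data.frame(
  external_gene_name    = c("Alk", "Pde2a", "Pde2a"),
  ensembl_gene_id       = c("ENSMUSG00000055471", "ENSMUSG00000110195",
                            "ENSMUSG00000030653"),
  ensembl_transcript_id = c("ENSMUST00000086639", "ENSMUST00000163751",
                            "ENSMUST00000084894"),
  ensembl_peptide_id    = c("ENSMUSP00000083840", "ENSMUSP00000131553",
                            "ENSMUSP00000081956"),
  stringsAsFactors = FALSE)

# Build a case-insensitive join key on the gene name in both tables.
uniprot_results$key <- tolower(uniprot_results$To)
mart.df$key <- tolower(mart.df$external_gene_name)

# Join, then tidy up: drop the helper key and rename the accession column.
mapped <- merge(uniprot_results[, c("From", "key")], mart.df, by = "key")
mapped$key <- NULL
names(mapped)[names(mapped) == "From"] <- "uniprot_id"
print(mapped)
```

Note that one gene name can map to more than one Ensembl gene ID, as Pde2a does in the output above, so the joined table may need manual review before use.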