Unimart not available in 2015?
john ▴ 10
@john-9266
United States

I've used biomaRt to map Ensembl IDs to Entrez IDs and UniProt accession IDs. Recently, BioMart made some changes, and it seems that Unimart is no longer available. If I try this code (which was working before), I get:

mart = useMart(biomart = 'unimart',dataset='uniprot',verbose = T)

Space required after the Public Identifier
SystemLiteral " or ' expected
SYSTEM or PUBLIC, the URI is missing
Opening and ending tag mismatch: hr line 7 and body
Opening and ending tag mismatch: body line 4 and html
Premature end of data in tag html line 2
Error: 1: Space required after the Public Identifier
2: SystemLiteral " or ' expected
3: SYSTEM or PUBLIC, the URI is missing
4: Opening and ending tag mismatch: hr line 7 and body
5: Opening and ending tag mismatch: body line 4 and html
6: Premature end of data in tag html line 2

 

I tried changing the host option, as I did for Ensembl:

listMarts(host="www.ensembl.org")   # works

listMarts(host="www.uniprot.org")   # does NOT work

I read somewhere that Unimart support broke once before and was later updated to work again, and I wonder whether the same applies now. Is there a fix for this? Am I missing something?

 

EDIT:

One thing I forgot to mention is that I could also get the protein and gene names and symbols from Unimart, which is something I'm also looking for.

Thanks, 

j

biomart uniprot unimart • 4.2k views

Dear John,

Unimart is now hosted on the EBI website, so I would expect the following to work:

listMarts(host="www.ebi.ac.uk/uniprot")

Sadly it doesn't; there seems to be something wrong with the registry, so I will email them.

Please note that the UniProt team is retiring the UniProt mart. Their announcement reads:

We at UniProt are always committed to improving our level of service and openly communicating changes with our users. 

Based on recent user surveys and service evaluations, we have decided that our UniProt Biomart service will be retired later this year. The October 2015 data release will be the final update for the UniProt Biomart; however, the service will remain available until December 2015.

For those of you who rely on the UniProt Biomart for tasks such as ID mapping, bulk retrieval of entries, or programmatic access to entry annotations, we have alternative services that we hope will satisfy your needs. Please visit our YouTube channels and help pages for tutorials and more information about these services.

UniProt ID Mapping Service

YouTube ID Mapping Tutorial

UniProt Programmatic Access Help Pages   

Regards,

Uniprot Team

 


Dear John,

I was a bit hasty; the following works:

> listMarts(host="www.ebi.ac.uk", path="/uniprot/biomart/martservice")
               biomart                  version
1              unimart         UNIPROT (EBI UK)
2 ENSEMBL_MART_ENSEMBL ENSEMBL GENES 80(EBI UK)
3                pride           PRIDE (EBI UK)

 

Hope this helps,

Thomas


Thanks Thomas, this is perfect! How did you know where to find the "host" site?


Dear John,

You can find a list of the marts and their hosts on the following page: http://www.biomart.org/notice.html

If you want more information about a mart, replace "martview" in its URL with "martservice?type=registry". For example, "http://www.ebi.ac.uk/uniprot/biomart/martview" becomes "http://www.ebi.ac.uk/uniprot/biomart/martservice?type=registry". The registry listing will help you find the path and mart name.
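
As a rough illustration of the substitution described above, the following base-R sketch derives the registry URL and the host/path pair from the martview URL. The variable names are made up for this example, and nothing here contacts the server.

```r
# Start from the martview URL quoted in this post.
martview_url <- "http://www.ebi.ac.uk/uniprot/biomart/martview"

# Swap the trailing "martview" for the registry query.
registry_url <- sub("martview$", "martservice?type=registry", martview_url)
print(registry_url)

# Split the service URL into the host and path pieces.
service_url <- sub("martview$", "martservice", martview_url)
no_scheme   <- sub("^https?://", "", service_url)   # drop "http://"
host        <- sub("/.*$", "", no_scheme)           # text before the first "/"
path        <- sub("^[^/]+", "", no_scheme)         # text from the first "/" onward

print(host)
print(path)
```

The resulting host and path are the values one would pass to listMarts(host=, path=).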

Regards,

Thomas

@james-w-macdonald-5106
United States

I don't know about the UniProt mart, but you could alternatively use the UniProt.ws package.

> library(UniProt.ws)
> up <- UniProt.ws(taxId=9606)
> keytypes(up)
 [1] "AARHUS/GHENT-2DPAGE"        "AGD"                       
 [3] "ALLERGOME"                  "ARACHNOSERVER"             
 [5] "BIOCYC"                     "CGD"                       
 [7] "CLEANEX"                    "CONOSERVER"                
 [9] "CYGD"                       "DICTYBASE"                 
[11] "DIP"                        "DISPROT"                   
[13] "DMDM"                       "DNASU"                     
[15] "DRUGBANK"                   "ECHOBASE"                  
[17] "ECO2DBASE"                  "ECOGENE"                   
[19] "EGGNOG"                     "EMBL/GENBANK/DDBJ"         
[21] "EMBL/GENBANK/DDBJ_CDS"      "ENSEMBL"                   
[23] "ENSEMBL_GENOMES"            "ENSEMBL_GENOMES PROTEIN"   
[25] "ENSEMBL_GENOMES TRANSCRIPT" "ENSEMBL_PROTEIN"           
[27] "ENSEMBL_TRANSCRIPT"         "ENTREZ_GENE"               
[29] "EUHCVDB"                    "EUPATHDB"                  
[31] "FLYBASE"                    "GENECARDS"                 
[33] "GENEFARM"                   "GENETREE"                  
[35] "GENOLIST"                   "GENOMERNAI"                
[37] "GERMONLINE"                 "GI_NUMBER*"                
[39] "HGNC"                       "H-INVDB"                   
[41] "HOGENOM"                    "HOVERGEN"                  
[43] "HPA"                        "HSSP"                      
[45] "KEGG"                       "KO"                        
[47] "LEGIOLIST"                  "LEPROMA"                   
[49] "MAIZEGDB"                   "MEROPS"                    
[51] "MGI"                        "MIM"                       
[53] "MINT"                       "NEXTBIO"                   
[55] "NEXTPROT"                   "OMA"                       
[57] "ORPHANET"                   "ORTHODB"                   
[59] "PATRIC"                     "PDB"                       
[61] "PEROXIBASE"                 "PHARMGKB"                  
[63] "PHOSSITE"                   "PIR"                       
[65] "POMBASE"                    "PPTASEDB"                  
[67] "PROTCLUSTDB"                "PSEUDOCAP"                 
[69] "REACTOME"                   "REBASE"                    
[71] "REFSEQ_NUCLEOTIDE"          "REFSEQ_PROTEIN"            
[73] "RGD"                        "SGD"                       
[75] "TAIR"                       "TCDB"                      
[77] "TIGR"                       "TUBERCULIST"               
[79] "UCSC"                       "UNIGENE"                   
[81] "UNIPARC"                    "UNIPATHWAY"                
[83] "UNIPROTKB"                  "UNIREF100"                 
[85] "UNIREF50"                   "UNIREF90"                  
[87] "VECTORBASE"                 "WORLD-2DPAGE"              
[89] "WORMBASE"                   "WORMBASE_PROTEIN"          
[91] "WORMBASE_TRANSCRIPT"        "XENBASE"                   
[93] "ZFIN"                      
> columns(up)
  [1] "3D"                         "AARHUS/GHENT-2DPAGE"       
  [3] "AGD"                        "ALLERGOME"                 
  [5] "ARACHNOSERVER"              "BIOCYC"                    
  [7] "CGD"                        "CITATION"                  
  [9] "CLEANEX"                    "CLUSTERS"                  
 [11] "COMMENTS"                   "CONOSERVER"                
 [13] "CYGD"                       "DATABASE(PDB)"             
 [15] "DATABASE(PFAM)"             "DICTYBASE"                 
 [17] "DIP"                        "DISPROT"                   
 [19] "DMDM"                       "DNASU"                     
 [21] "DOMAIN"                     "DOMAINS"                   
 [23] "DRUGBANK"                   "EC"                        
 [25] "ECHOBASE"                   "ECO2DBASE"                 
 [27] "ECOGENE"                    "EGGNOG"                    
 [29] "EMBL/GENBANK/DDBJ"          "EMBL/GENBANK/DDBJ_CDS"     
 [31] "ENSEMBL"                    "ENSEMBL_GENOMES"           
 [33] "ENSEMBL_GENOMES PROTEIN"    "ENSEMBL_GENOMES TRANSCRIPT"
 [35] "ENSEMBL_PROTEIN"            "ENSEMBL_TRANSCRIPT"        
 [37] "ENTREZ_GENE"                "ENTRY-NAME"                
 [39] "EUHCVDB"                    "EUPATHDB"                  
 [41] "EXISTENCE"                  "FAMILIES"                  
 [43] "FEATURES"                   "FLYBASE"                   
 [45] "GENECARDS"                  "GENEFARM"                  
 [47] "GENES"                      "GENETREE"                  
 [49] "GENOLIST"                   "GENOMERNAI"                
 [51] "GERMONLINE"                 "GI_NUMBER*"                
 [53] "GO"                         "GO-ID"                     
 [55] "HGNC"                       "H-INVDB"                   
 [57] "HOGENOM"                    "HOVERGEN"                  
 [59] "HPA"                        "HSSP"                      
 [61] "ID"                         "INTERACTOR"                
 [63] "INTERPRO"                   "KEGG"                      
 [65] "KEYWORD-ID"                 "KEYWORDS"                  
 [67] "KO"                         "LAST-MODIFIED"             
 [69] "LEGIOLIST"                  "LENGTH"                    
 [71] "LEPROMA"                    "MAIZEGDB"                  
 [73] "MEROPS"                     "MGI"                       
 [75] "MIM"                        "MINT"                      
 [77] "NEXTBIO"                    "NEXTPROT"                  
 [79] "OMA"                        "ORGANISM"                  
 [81] "ORGANISM-ID"                "ORPHANET"                  
 [83] "ORTHODB"                    "PATHWAY"                   
 [85] "PATRIC"                     "PDB"                       
 [87] "PEROXIBASE"                 "PHARMGKB"                  
 [89] "PHOSSITE"                   "PIR"                       
 [91] "POMBASE"                    "PPTASEDB"                  
 [93] "PROTCLUSTDB"                "PROTEIN-NAMES"             
 [95] "PSEUDOCAP"                  "REACTOME"                  
 [97] "REBASE"                     "REFSEQ_NUCLEOTIDE"         
 [99] "REFSEQ_PROTEIN"             "REVIEWED"                  
[101] "RGD"                        "SCORE"                     
[103] "SEQUENCE"                   "SGD"                       
[105] "SUBCELLULAR-LOCATIONS"      "TAIR"                      
[107] "TAXON"                      "TCDB"                      
[109] "TIGR"                       "TOOLS"                     
[111] "TUBERCULIST"                "UCSC"                      
[113] "UNIGENE"                    "UNIPARC"                   
[115] "UNIPATHWAY"                 "UNIPROTKB"                 
[117] "UNIREF100"                  "UNIREF50"                  
[119] "UNIREF90"                   "VECTORBASE"                
[121] "VERSION"                    "VIRUS-HOSTS"               
[123] "WORLD-2DPAGE"               "WORMBASE"                  
[125] "WORMBASE_PROTEIN"           "WORMBASE_TRANSCRIPT"       
[127] "XENBASE"                    "ZFIN"  
                    
> select(up, c("1","2","5"), "UNIPROTKB","ENTREZ_GENE")
Getting mapping data for 1 ... and ACC
'select()' returned 1:many mapping between keys and columns
  ENTREZ_GENE UNIPROTKB
1           1    P04217
2           1    V9HWD8
3           2    P01023
4           5      <NA>

> select(up, c("ENSG00000139618"), "UNIPROTKB","ENSEMBL")
Getting mapping data for ENSG00000139618 ... and ACC
'select()' returned 1:many mapping between keys and columns
          ENSEMBL UNIPROTKB
1 ENSG00000139618    H0YD86
2 ENSG00000139618    H0YE37
3 ENSG00000139618    P51587

Thanks James. I like the look of this, especially since the labels are pulled directly from the source. One thing I forgot to mention is that I would also like to pull the protein and gene names, as well as the symbols, for annotation purposes. As far as I can see, UniProt.ws doesn't have a gene name option.

> select(up, "ENSG00000139618", c("GENES","PROTEIN-NAMES"),"ENSEMBL")
Getting mapping data for ENSG00000139618 ... and ACC
Getting extra data for H0YD86 H0YE37 P51587 etc
'select()' returned 1:many mapping between keys and columns
          ENSEMBL             GENES
1 ENSG00000139618             BRCA2
2 ENSG00000139618 BRCA2 FACD FANCD1
                                                                  PROTEIN-NAMES
1                        Breast cancer type 2 susceptibility protein (Fragment)
2 Breast cancer type 2 susceptibility protein (Fanconi anemia group D1 protein)

This looks like exactly what I want. However, I'm getting an error when I run your code. I am using the most recent version of UniProt.ws from Bioconductor. Do you have a more recent dev version? Here's my output:

> taxId(UniProt.ws)
[1] 9606
> select(UniProt.ws, "ENSG00000139618", c("GENES","PROTEIN-NAMES"),"ENSEMBL")
Getting mapping data for ENSG00000139618 ... and ACC
Getting extra data for H0YD86 H0YE37 P51587 etc
Error in `[.data.frame`(tab, , oriTabCols) : undefined columns selected

It looks to me as though the "PROTEIN-NAMES" column isn't recognized.

Also, how did you find this option of protein names? I don't see any options for "PROTEIN-NAMES" when I look at the columns or keytypes.


Nope, just using the release version. It's usually not a good idea to mask a function with an object name: calling your UniProt.ws object 'UniProt.ws' is risky, because that is also a function name, a class name, and a package name. R is usually good about figuring out what you want, but why take chances? Anyway, even if I do that, it still works for me:

> select(UniProt.ws, "ENSG00000139618", c("GENES","PROTEIN-NAMES"),"ENSEMBL")
Getting mapping data for ENSG00000139618 ... and ACC
Getting extra data for H0YD86 H0YE37 P51587 etc
'select()' returned 1:many mapping between keys and columns
          ENSEMBL             GENES
1 ENSG00000139618             BRCA2
2 ENSG00000139618 BRCA2 FACD FANCD1
                                                                  PROTEIN-NAMES
1                        Breast cancer type 2 susceptibility protein (Fragment)
2 Breast cancer type 2 susceptibility protein (Fanconi anemia group D1 protein)
> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 8 (jessie)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] UniProt.ws_2.10.0   BiocGenerics_0.16.1 RCurl_1.95-4.7     
[4] bitops_1.0-6        RSQLite_1.0.0       DBI_0.3.1          

loaded via a namespace (and not attached):
[1] compiler_3.2.2       IRanges_2.4.4        tools_3.2.2         
[4] Biobase_2.30.0       AnnotationDbi_1.32.1 S4Vectors_0.8.3     
[7] stats4_3.2.2        

Also:

> grep("PROTEIN", columns(up), value = TRUE)
[1] "ENSEMBL_GENOMES PROTEIN" "ENSEMBL_PROTEIN"        
[3] "PROTEIN-NAMES"           "REFSEQ_PROTEIN"         
[5] "WORMBASE_PROTEIN"
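
On the name-masking point earlier in this reply, here is a minimal base-R sketch of why a call can still work when an object shares a function's name. It uses mean() as a stand-in rather than UniProt.ws, which is an assumption made so the example runs without any Bioconductor package.

```r
# Bind a non-function object to a name that base R already uses for a function.
mean <- 5

# In a call, R looks for a *function* with that name and skips the numeric,
# so base::mean is still found.
result <- mean(c(1, 2, 3))

# Used as a plain symbol, the same name resolves to the global object instead.
shadowed <- mean

print(result)
print(shadowed)

# Remove the global object so the name refers only to the function again.
rm(mean)
```

The call keeps working, but the same name now means two different things depending on how it is used, which is the kind of ambiguity that is easy to avoid by picking a different object name such as 'up'.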

I also checked on Windows:

> up <- UniProt.ws(taxId=9606)
> select(up, "ENSG00000139618", c("GENES","PROTEIN-NAMES"),"ENSEMBL")
Getting mapping data for ENSG00000139618 ... and ACC
Getting extra data for H0YD86 H0YE37 P51587 etc
'select()' returned 1:many mapping between keys and columns
          ENSEMBL             GENES
1 ENSG00000139618             BRCA2
2 ENSG00000139618 BRCA2 FACD FANCD1
                                                                  PROTEIN-NAMES
1                        Breast cancer type 2 susceptibility protein (Fragment)
2 Breast cancer type 2 susceptibility protein (Fanconi anemia group D1 protein)
> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 8 x64 (build 9200)

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] UniProt.ws_2.10.0    BiocGenerics_0.16.1  RCurl_1.95-4.7      
[4] bitops_1.0-6         RSQLite_1.0.0        DBI_0.3.1           
[7] BiocInstaller_1.20.1

loaded via a namespace (and not attached):
[1] compiler_3.2.2       IRanges_2.4.3        tools_3.2.2         
[4] Biobase_2.30.0       AnnotationDbi_1.32.0 S4Vectors_0.8.3     
[7] stats4_3.2.2       

This is really strange. I've re-installed Bioconductor and the UniProt.ws package and I still get the same error. As for your comment about overwriting the UniProt.ws function, it's not a function for me, even after clearing R's memory and restarting:

> library(UniProt.ws)
Loading required package: RSQLite
Loading required package: DBI
Loading required package: RCurl
Loading required package: bitops
> up <- UniProt.ws(taxId=9606)
Error: could not find function "UniProt.ws"

That is why I was using UniProt.ws the way I did in the select statement. I would think this is a UniProt.ws version difference rather than an R version difference:

> sessionInfo()
R version 3.1.3 (2015-03-09)
Platform: x86_64-apple-darwin10.8.0 (64-bit)
Running under: OS X 10.8.5 (Mountain Lion)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] UniProt.ws_2.6.2 RCurl_1.95-4.6   bitops_1.0-6     RSQLite_1.0.0   
[5] DBI_0.3.1       

loaded via a namespace (and not attached):
[1] AnnotationDbi_1.28.2 Biobase_2.26.0       BiocGenerics_0.12.1 
[4] GenomeInfoDb_1.2.5   IRanges_2.0.1        parallel_3.1.3      
[7] S4Vectors_0.4.0      stats4_3.1.3         tools_3.1.3   

 


That explains it. You are living in the past: your sessionInfo() shows R 3.1.3 with UniProt.ws 2.6.2, whereas we are now on R-3.2.2 and Bioconductor 3.2, which has UniProt.ws 2.10.0. You have to first upgrade R to the current version, then update Bioconductor.
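
A quick way to see which side of the version gap a session is on is sketched below in base R. The threshold versions are the ones quoted in this thread; the variable names are illustrative only.

```r
# Versions quoted in this thread.
old_pkg <- package_version("2.6.2")    # UniProt.ws in the failing session
new_pkg <- package_version("2.10.0")   # UniProt.ws in the working sessions

# package_version objects compare component by component, so 2.6.2 < 2.10.0
# (a plain string comparison would get this wrong).
pkg_is_older <- old_pkg < new_pkg

# Compare the running R against the release mentioned above.
r_version  <- getRversion()
r_is_older <- r_version < "3.2.2"

cat("R version:", as.character(r_version), "\n")
cat("R older than 3.2.2:", r_is_older, "\n")

# Report the installed UniProt.ws version only if the package is present.
if (requireNamespace("UniProt.ws", quietly = TRUE)) {
  cat("UniProt.ws:", as.character(packageVersion("UniProt.ws")), "\n")
} else {
  cat("UniProt.ws is not installed\n")
}
```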


Hi all!

Do you know of a way to use the UniProt.ws R package to query with UniProt accession IDs regardless of species? I have a list of UniProt accession IDs that I got from Pfam, and would like to get their corresponding nucleotide sequences for an R package...
I was hoping to use BioMart, but it seems this is no longer possible...

Many thanks for your help!


If you have a question, please open a new thread rather than tacking it on the end of a months-old thread that isn't even relevant to your question.
