I've used biomaRt to map Ensembl IDs to Entrez IDs and UniProt accession IDs. Recently, BioMart made some changes and it seems that Unimart is no longer available. If I try this code (which was working before), I get:
mart = useMart(biomart = 'unimart',dataset='uniprot',verbose = T)
Space required after the Public Identifier
SystemLiteral " or ' expected
SYSTEM or PUBLIC, the URI is missing
Opening and ending tag mismatch: hr line 7 and body
Opening and ending tag mismatch: body line 4 and html
Premature end of data in tag html line 2
Error: 1: Space required after the Public Identifier
2: SystemLiteral " or ' expected
3: SYSTEM or PUBLIC, the URI is missing
4: Opening and ending tag mismatch: hr line 7 and body
5: Opening and ending tag mismatch: body line 4 and html
6: Premature end of data in tag html line 2
I tried changing the host option as I did for Ensembl:
listMarts(host="www.ensembl.org")  # works
listMarts(host="www.uniprot.org")  # does NOT work
I read somewhere that Unimart support broke once before and was later updated to work again, and I wonder whether the same applies now. Is there a fix for this? Am I missing something?
EDIT:
One thing I forgot to mention is that I could also get the protein and gene names and symbols from Unimart, which is something I'm also looking for.
Unimart is now hosted on the EBI website, so I would expect the following to work:
listMarts(host="www.ebi.ac.uk/uniprot")
Sadly it doesn't, as there seems to be something wrong with the registry. I will email them.
Please note that the UniProt team are retiring the UniProt mart:
We at UniProt are always committed to improving our level of service and openly communicating changes with our users.
Based on recent user surveys and service evaluations, we have decided that our UniProt Biomart service will be retired later this year. The October 2015 data release will be the final update for the Uniprot Biomart however, the service will remain available until December 2015.
For those of you who rely on the UniProt Biomart for tasks such as: ID mapping, bulk retrieval of entries, or programmatic access to entry annotations; we have alternative services that we hope satisfy your needs. Please visit our YouTube channels and help pages for tutorials and more information about these services.
UniProt ID Mapping Service
YouTube ID Mapping Tutorial
UniProt Programmatic Access Help Pages
Regards,
Uniprot Team
Thanks James. I like the look of this, especially since the labels are pulled directly from the source. One thing I forgot to mention is that I would also like to pull the protein and gene names, as well as the symbols, for annotation purposes. I don't see a gene name option in UniProt.ws.
> select(up, "ENSG00000139618", c("GENES","PROTEIN-NAMES"),"ENSEMBL")
Getting mapping data for ENSG00000139618 ... and ACC
Getting extra data for H0YD86 H0YE37 P51587 etc
'select()' returned 1:many mapping between keys and columns
ENSEMBL GENES
1 ENSG00000139618 BRCA2
2 ENSG00000139618 BRCA2 FACD FANCD1
PROTEIN-NAMES
1 Breast cancer type 2 susceptibility protein (Fragment)
2 Breast cancer type 2 susceptibility protein (Fanconi anemia group D1 protein)
This looks like exactly what I want. However, I'm getting an error when I run your code. I am using the most recent version of UniProt.ws from the Bioconductor packages. Do you have a more recent dev version? Here's my output:
> taxId(UniProt.ws)
[1] 9606
> select(UniProt.ws, "ENSG00000139618", c("GENES","PROTEIN-NAMES"),"ENSEMBL")
Getting mapping data for ENSG00000139618 ... and ACC
Getting extra data for H0YD86 H0YE37 P51587 etc
Error in `[.data.frame`(tab, , oriTabCols) : undefined columns selected
It looks to me as though the "PROTEIN-NAMES" column isn't recognized.
Also, how did you find the protein names option? I don't see "PROTEIN-NAMES" listed when I look at the columns or keytypes.
Nope, just using the release version. It's usually not the best idea to mask a function with an object name (e.g., calling your UniProt.ws object 'UniProt.ws' isn't the best idea, because that is also a function name, as well as a class name, and a package name. R is usually good about figuring out what you want, but why take chances?). Anyway, even if I do that it still works for me:
> select(UniProt.ws, "ENSG00000139618", c("GENES","PROTEIN-NAMES"),"ENSEMBL")
Getting mapping data for ENSG00000139618 ... and ACC
Getting extra data for H0YD86 H0YE37 P51587 etc
'select()' returned 1:many mapping between keys and columns
ENSEMBL GENES
1 ENSG00000139618 BRCA2
2 ENSG00000139618 BRCA2 FACD FANCD1
PROTEIN-NAMES
1 Breast cancer type 2 susceptibility protein (Fragment)
2 Breast cancer type 2 susceptibility protein (Fanconi anemia group D1 protein)
> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 8 (jessie)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] UniProt.ws_2.10.0 BiocGenerics_0.16.1 RCurl_1.95-4.7
[4] bitops_1.0-6 RSQLite_1.0.0 DBI_0.3.1
loaded via a namespace (and not attached):
[1] compiler_3.2.2 IRanges_2.4.4 tools_3.2.2
[4] Biobase_2.30.0 AnnotationDbi_1.32.1 S4Vectors_0.8.3
[7] stats4_3.2.2
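On the point about masking a function name with an object, here is a minimal base R sketch of the behaviour being described. It uses the base function mean purely as a stand-in (it is not part of the original example): when a name is used in call position R searches for a function of that name, but any plain value lookup picks up the masking object instead.

```r
# Bind a non-function object to a name that is also a base function.
# 'mean' is only a stand-in here for a name shared by a function and an object.
mean <- c(2, 4, 6)

# In call position R looks for a *function* called 'mean', so base::mean
# is still found; the argument is looked up as a value and is our vector.
avg <- mean(mean)

# A plain value lookup now sees the numeric vector, not the function.
masked_is_function <- is.function(mean)

# Remove the masking object so later code sees base::mean again.
rm(mean)
restored_is_function <- is.function(mean)
```

So the call still works, but anything that treats the name as a value gets the object rather than the function, which is the chance you are taking.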
> up <- UniProt.ws(taxId=9606)
> select(up, "ENSG00000139618", c("GENES","PROTEIN-NAMES"),"ENSEMBL")
Getting mapping data for ENSG00000139618 ... and ACC
Getting extra data for H0YD86 H0YE37 P51587 etc
'select()' returned 1:many mapping between keys and columns
ENSEMBL GENES
1 ENSG00000139618 BRCA2
2 ENSG00000139618 BRCA2 FACD FANCD1
PROTEIN-NAMES
1 Breast cancer type 2 susceptibility protein (Fragment)
2 Breast cancer type 2 susceptibility protein (Fanconi anemia group D1 protein)
> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 8 x64 (build 9200)
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] UniProt.ws_2.10.0 BiocGenerics_0.16.1 RCurl_1.95-4.7
[4] bitops_1.0-6 RSQLite_1.0.0 DBI_0.3.1
[7] BiocInstaller_1.20.1
loaded via a namespace (and not attached):
[1] compiler_3.2.2 IRanges_2.4.3 tools_3.2.2
[4] Biobase_2.30.0 AnnotationDbi_1.32.0 S4Vectors_0.8.3
[7] stats4_3.2.2
This is really strange. I've re-installed Bioconductor and the UniProt.ws package and I still get the same error. As for your comment about overwriting the UniProt.ws function, it's not a function for me, even after clearing R's memory and restarting:
> library(UniProt.ws)
Loading required package: RSQLite
Loading required package: DBI
Loading required package: RCurl
Loading required package: bitops
> up <- UniProt.ws(taxId=9606)
Error: could not find function "UniProt.ws"
Which is why I was using UniProt.ws the way I was in the select statement. I would think this is a UniProt.ws version difference rather than an R version difference:
> sessionInfo()
R version 3.1.3 (2015-03-09)
Platform: x86_64-apple-darwin10.8.0 (64-bit)
Running under: OS X 10.8.5 (Mountain Lion)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] UniProt.ws_2.6.2 RCurl_1.95-4.6 bitops_1.0-6 RSQLite_1.0.0
[5] DBI_0.3.1
loaded via a namespace (and not attached):
[1] AnnotationDbi_1.28.2 Biobase_2.26.0 BiocGenerics_0.12.1
[4] GenomeInfoDb_1.2.5 IRanges_2.0.1 parallel_3.1.3
[7] S4Vectors_0.4.0 stats4_3.1.3 tools_3.1.3
That explains it. You are living in the past. We are now on R-3.2.2 and Bioconductor 3.2. You have to first upgrade R to the current version, then update Bioconductor.
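For anyone comparing their own sessionInfo() against the ones in this thread, here is a small sketch of the version check using the numbers posted above. The helper name is_older is made up for illustration; package_version() is base R and compares versions component by component, which plain string comparison does not do reliably for something like 2.6.2 versus 2.10.0.

```r
# Hypothetical helper: TRUE when 'installed' is an older version than 'required'.
is_older <- function(installed, required) {
  package_version(installed) < package_version(required)
}

# Version numbers taken from the sessionInfo() outputs in this thread.
r_outdated   <- is_older("3.1.3", "3.2.2")    # R itself
pkg_outdated <- is_older("2.6.2", "2.10.0")   # UniProt.ws

# On a live session the installed versions could be read with, e.g.:
# getRversion()
# packageVersion("UniProt.ws")   # assumes the package is installed
```

Both comparisons come out TRUE for the session that produced the error, which is consistent with the advice to upgrade R first and then Bioconductor.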
I was wondering if you know of a way to use the UniProt.ws R package to query with UniProt accession IDs regardless of species? I have a list of UniProt accession IDs that I got from Pfam, and I would like to get their corresponding nucleotide sequences for an R package...
I was hoping to use BioMart for this, but it seems that this is no longer possible...
Dear John,
I was a bit hasty. The following works:
Hope this helps,
Thomas
Thanks Thomas, this is perfect! How did you know where to find the "host" site?
Dear John,
You can find a list of the marts and their hosts on the following page: http://www.biomart.org/notice.html
If you want more information about a mart, replace "martview" in its URL with "martservice?type=registry". For example, "http://www.ebi.ac.uk/uniprot/biomart/martview" becomes "http://www.ebi.ac.uk/uniprot/biomart/martservice?type=registry". The registry will help you find the path and mart name.
Regards,
Thomas
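As a rough sketch of the URL recipe above, the base R snippet below derives the registry URL and splits the address into a host and a path. The variable names are made up for illustration. Passing the two pieces separately to listMarts() through its host and path arguments is an assumption about how they would be used, since the call that worked is not shown in this thread, and the final call is left commented out because the UniProt mart has been retired.

```r
# URL of the mart's web interface, as quoted in the thread.
martview_url <- "http://www.ebi.ac.uk/uniprot/biomart/martview"

# Swap "martview" for "martservice?type=registry" to get the registry URL.
registry_url <- sub("martview$", "martservice?type=registry", martview_url)

# Drop the scheme, then split what is left into host and path.
no_scheme <- sub("^https?://", "", martview_url)
mart_host <- sub("/.*$", "", no_scheme)
mart_path <- sub("martview$", "martservice", sub("^[^/]+", "", no_scheme))

# Assumed usage with biomaRt (not run: the service no longer exists):
# library(biomaRt)
# listMarts(host = mart_host, path = mart_path)
```

The same steps should apply to any other mart listed with a "martview" address, though the host and path will of course differ.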