Question: Secondary accession lookup with UniProt.ws
4.5 years ago by
bengelmann10
Chicago
bengelmann10 wrote:

Hello

I am having trouble retrieving FASTA sequences for some UniProt identifiers. In most cases this seems to be because the accession number is now a 'secondary accession number'. Is there a way to retrieve sequences using these secondary accession numbers with UniProt.ws?

Thanks

-Brett

uniprot.ws • 1.3k views
modified 4.4 years ago by Marc Carlson7.2k • written 4.5 years ago by bengelmann10
Answer: Secondary accession lookup with UniProt.ws
4.5 years ago by
Marc Carlson7.2k
United States
Marc Carlson7.2k wrote:

I would like to help you with this.  But could you please give us a specific example (as described in our posting guidelines here:  http://bioconductor.org/help/support/posting-guide/)

Thanks!

Marc

Hi Marc

Sorry for the brevity there. Hopefully this is a little better. Here is some example code:

library(UniProt.ws)
proteins <- c("Q15366", "B4DXP5", "B4DLC0", "F8W0G4")
sequences <- select(UniProt.ws, keys = proteins, columns = "SEQUENCE", keytype = "UNIPROTKB")

The "B4DXP5" accession returns NA for sequence:

sequences$UNIPROTKB[which(is.na(sequences$SEQUENCE))]
[1] "B4DXP5"

It seems that this is the case because of this annotation being 'rolled up' into the new primary accession number, "Q15366".

http://www.uniprot.org/uniprot/Q15366#entry_information

Is there a way for these secondary accession numbers to return the sequence information for the primary accession number? Perhaps with a warning message or a similar flag passed?

Many Thanks

-Brett

> sessionInfo()
R version 3.1.3 (2015-03-09)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] Homo.sapiens_1.1.2                      TxDb.Hsapiens.UCSC.hg19.knownGene_3.0.0 org.Hs.eg.db_3.0.0
[4] GO.db_3.0.0                             OrganismDbi_1.8.1                       GenomicFeatures_1.18.6
[7] GenomicRanges_1.18.4                    AnnotationDbi_1.28.2                    GenomeInfoDb_1.2.4
[10] IRanges_2.0.1                           S4Vectors_0.4.0                         Biobase_2.26.0
[13] BiocGenerics_0.12.1                     UniProt.ws_2.6.2                        RCurl_1.95-4.5
[16] bitops_1.0-6                            RSQLite_1.0.0                           DBI_0.3.1
[19] BiocInstaller_1.16.2

loaded via a namespace (and not attached):
[1] base64enc_0.1-2         BatchJobs_1.6           BBmisc_1.9              BiocParallel_1.0.3      biomaRt_2.22.0          Biostrings_2.34.1
[7] brew_1.0-6              checkmate_1.5.2         codetools_0.2-11        digest_0.6.8            fail_1.2                foreach_1.4.2
[13] GenomicAlignments_1.2.2 graph_1.44.1            iterators_1.0.7         RBGL_1.42.0             Rsamtools_1.18.3        rtracklayer_1.26.3
[19] sendmailR_1.2-1         stringr_0.6.2           tools_3.1.3             XML_3.98-1.1            XVector_0.6.0           zlibbioc_1.12.0        
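
The "remap with a warning" behaviour asked about above could be sketched roughly as follows. This is only an illustration, not something UniProt.ws provides: `secondary_to_primary` and `resolve_accessions()` are hypothetical names, and the lookup table is filled in by hand with the single B4DXP5 -> Q15366 pair mentioned in this thread.

```r
# Hypothetical lookup table: names are secondary accessions, values are the
# primary accession they were merged into (this one pair comes from the thread).
secondary_to_primary <- c(B4DXP5 = "Q15366")

# Hypothetical helper: swap known secondary accessions for their primary
# accession and emit one warning listing every substitution made.
resolve_accessions <- function(keys, map) {
  hit <- keys %in% names(map)
  if (any(hit)) {
    warning("secondary accessions replaced: ",
            paste0(keys[hit], " -> ", map[keys[hit]], collapse = ", "))
  }
  resolved <- keys
  resolved[hit] <- unname(map[keys[hit]])
  resolved
}

proteins <- c("Q15366", "B4DXP5", "B4DLC0", "F8W0G4")
primary <- resolve_accessions(proteins, secondary_to_primary)
```

The resolved vector (after `unique()`, since two keys now point at the same entry) could then be passed as `keys` to `select()` in place of the original identifiers.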

Answer: Secondary accession lookup with UniProt.ws
4.4 years ago by
Marc Carlson7.2k
United States
Marc Carlson7.2k wrote:

Hi Brett,

Thanks for your patience with me (we are doing a release and so other things keep jumping the queue on you).

But as of right now, I can't find any evidence that B4DXP5 is currently a UniProt accession.  It seems that it probably was at one point in time, and maybe it even was when you gave me that link above, but it doesn't seem to be anywhere on that page now.

Also, the keys method does not currently return "B4DXP5" as a valid key of type "UNIPROTKB":

k <- keys(UniProt.ws, "UNIPROTKB")
c("Q15366","B4DXP5","B4DLC0","F8W0G4") %in% k

Ultimately, Uniprot.ws talks to the Uniprot web service, so if they have deprecated this ID, then it's possible that it was inactive (but still on their web site for a couple days).  Do you have another example of an ID that is currently valid and that you feel should work?

Marc

This is bizarre - allow me to jump in.

'B4DXP5' is still a valid accession (see http://www.uniprot.org/uniprot/Q15366.txt) :

ID   PCBP2_HUMAN             Reviewed;         365 AA.
AC   Q15366; A8K7X6; B4DXP5; F8VYL7; G3V0E8; I6L8F9; Q32Q82; Q59HD4;
AC   Q68Y55; Q6IPF4; Q6PKG5;
DT   29-MAY-2000, integrated into UniProtKB/Swiss-Prot.
DT   31-OCT-1996, sequence version 1.
DT   31-MAR-2015, entry version 151.
DE   RecName: Full=Poly(rC)-binding protein 2;

Hans-Rudolf

Actually it's not. It was replaced by Q15366. The page you show just lists historical accession numbers that got scooped into Q15366.
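
For readers following along, the relationship visible in the AC lines of the excerpt above can be made explicit with a small sketch. In the UniProtKB flat-file format the first accession on the AC lines is the primary one and the rest are secondary. `accession_map()` is a hypothetical helper written for this illustration, and `record` is simply the header text copied from the excerpt.

```r
# Flat-file header lines copied from the excerpt quoted above.
record <- c(
  "ID   PCBP2_HUMAN             Reviewed;         365 AA.",
  "AC   Q15366; A8K7X6; B4DXP5; F8VYL7; G3V0E8; I6L8F9; Q32Q82; Q59HD4;",
  "AC   Q68Y55; Q6IPF4; Q6PKG5;",
  "DT   29-MAY-2000, integrated into UniProtKB/Swiss-Prot."
)

# Hypothetical helper: return a named character vector whose names are all
# accessions found on the AC lines and whose values are the primary accession
# (the first one listed).
accession_map <- function(lines) {
  ac <- grep("^AC   ", lines, value = TRUE)   # keep only the AC lines
  ac <- sub("^AC   ", "", ac)                 # drop the line code
  acc <- trimws(unlist(strsplit(ac, ";", fixed = TRUE)))
  acc <- acc[nzchar(acc)]                     # guard against empty pieces
  stopifnot(length(acc) > 0)
  setNames(rep(acc[1], length(acc)), acc)
}

m <- accession_map(record)
m[["B4DXP5"]]
```

A table like `m` only covers the one record it was built from, so it would have to be assembled per entry before it could be used to translate a list of identifiers.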

...and that is the reason why it is called a 'secondary' accession. It is still a valid accession for UniProtKB, and you can use it to search UniProtKB, e.g.:

http://www.uniprot.org/uniprot/?query=accession:B4DXP5&format=fasta

It will give you, of course, "Q15366" (this is how UniProtKB deals with UniProtKB-TrEMBL entries that have been merged into a UniProtKB-Swiss-Prot entry).
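
As a rough sketch, the query URL above can be generated for several accessions at once. `accession_query_url()` is a hypothetical helper, not part of UniProt.ws, and the URL pattern is copied from this post as-is, so it assumes the UniProt site still accepts queries in this form.

```r
# Hypothetical helper: build one accession-query URL per identifier, using the
# URL pattern quoted in this thread (assumed, not checked against the live site).
accession_query_url <- function(accessions, format = "fasta") {
  paste0("http://www.uniprot.org/uniprot/?query=accession:",
         accessions, "&format=", format)
}

urls <- accession_query_url(c("B4DXP5", "Q15366"))
urls
```

Each URL could then be fetched with whatever HTTP client is at hand; the sketch stops at building the strings so that it runs without network access.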

I am sorry, my intention was not to start a debate, and I don't know much about the inner workings of 'UniProt.ws' at all. I just wanted to support Brett's original question about "Secondary accession lookup with UniProt.ws".

Hans-Rudolf

I am not sure anybody calls it a secondary accession, least of all UniProt:

http://www.uniprot.org/uniprot/?query=accession%3AB4DXP5&sort=score

The term obsolete seems pretty unambiguous, doesn't it?

I am sorry, but I can't resist:

"while the others are referred to as ‘Secondary accession numbers’"

Fair enough. I stand corrected.

That said, I don't believe Uniprot.ws is going to be able to come up with any secondary accession numbers because it scrapes this page:

http://www.uniprot.org/uniprot/?query=organism:9606&format=tab&columns=id

for IDs, and those are only the primary IDs.

Yes.  This is unfortunate, but unless UniProt sees fit to actually allow me to use the IDs to look up actual information, then it doesn't really matter what they are called.  :(

Also problematic is the fact that the UniProt site is a very large resource and is not extremely fast as a result.  I suspect that adding in all the older IDs might represent a very significant slowdown for the service (depending on how many there are).