Question: UniProt.ws report 1:many mappings in a single row
l.nilse wrote (3.0 years ago):

I am using UNIPROTKB keys to retrieve KEGG entries. Often multiple KEGG entries are found for the same key, resulting in a select() result with more rows than keys.

'select()' returned 1:many mapping between keys and columns

Is it possible to get multiple KEGG entries in the same row, so that the number of rows in the result always equals the number of keys?

 

Answer: UniProt.ws report 1:many mappings in a single row
l.nilse wrote (3.0 years ago):

Here is a possible solution, complete with an example. Note that the keys contain duplicates and that some keys map to multiple KEGG entries.

library(UniProt.ws)

species <- 9606    # homo sapiens
up <- UniProt.ws(taxId=species)

UniqueSpaceSeparated <- function(x) {
  u <- unique(x)
  r <- paste(u, collapse=" ")
  return(r)
}

columns <- c("SEQUENCE","GO", "SUBCELLULAR-LOCATIONS", "PROTEIN-NAMES", "GENES", "KEGG")
keytype <- "UNIPROTKB"
keys <- c("O75083","O75083","O75084","O75131","O75144","O75264","O75309","O75339","O75340","O75348","O75351","O75144","O75144")
# Note that O75144 has two KEGG entries.
a <- select(up, keys, columns, keytype)
b <- aggregate(a, by=list(a[,keytype]), FUN=UniqueSpaceSeparated)
b[,1] <- NULL
idx <- match(keys,b[,keytype])
annotations <- b[idx,]
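
The same aggregate-then-match pattern can be tried offline. The sketch below replaces the select() call with a small made-up data.frame, so it only illustrates the reshaping step. The accessions and KEGG IDs are invented, and `fake_select_result` is a hypothetical stand-in for what select() returns.

```r
# Fake stand-in for a select() result: one row per (key, KEGG) pair.
# All IDs here are invented for illustration.
fake_select_result <- data.frame(
  UNIPROTKB = c("P00001", "P00002", "P00002", "P00003"),
  KEGG      = c("hsa:1", "hsa:2", "hsa:3", "hsa:4"),
  stringsAsFactors = FALSE
)

# Keys as requested by the user, including a duplicate.
keys <- c("P00002", "P00001", "P00002", "P00003")

unique_space_separated <- function(x) paste(unique(x), collapse = " ")

# Collapse to one row per unique key.
collapsed <- aggregate(
  fake_select_result["KEGG"],
  by  = list(UNIPROTKB = fake_select_result$UNIPROTKB),
  FUN = unique_space_separated
)

# Re-expand to one row per requested key, in the requested order.
annotations <- collapsed[match(keys, collapsed$UNIPROTKB), ]
rownames(annotations) <- NULL
```

After this, `nrow(annotations)` equals `length(keys)`, and the duplicated key appears twice with the same collapsed KEGG string.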
Answer: UniProt.ws report 1:many mappings in a single row
James W. MacDonald wrote (3.0 years ago):

It's possible, but it depends on how you want the data to be formatted. You can use split to convert the returned data.frame into a list, and then you could collapse the list into a row-wise structure.

As an example I'll just use some fake data.

> thedf <- data.frame(UNIPROTKB = rep(c("Q12345", "R4534", "T96874"), c(3,1,5)), KEGG = paste0("hsa:", c(1, 12, 123, 1234, 12345, 2345, 234, 432, 44323)))
> thedf
  UNIPROTKB      KEGG
1    Q12345     hsa:1
2    Q12345    hsa:12
3    Q12345   hsa:123
4     R4534  hsa:1234
5    T96874 hsa:12345
6    T96874  hsa:2345
7    T96874   hsa:234
8    T96874   hsa:432
9    T96874 hsa:44323

That seems suitably fake, no? Now split into a list.

> thelst <- split(thedf[,2], thedf[,1])
> thelst
$Q12345
[1] "hsa:1"   "hsa:12"  "hsa:123"

$R4534
[1] "hsa:1234"

$T96874
[1] "hsa:12345" "hsa:2345"  "hsa:234"   "hsa:432"   "hsa:44323"

And we can do something like

> data.frame(UNIPROTKB = names(thelst), KEGG = do.call(c, lapply(thelst, paste, collapse = ", ")))
       UNIPROTKB                                             KEGG
Q12345    Q12345                           hsa:1, hsa:12, hsa:123
R4534      R4534                                         hsa:1234
T96874    T96874 hsa:12345, hsa:2345, hsa:234, hsa:432, hsa:44323

That is a bit of a hack. A more modern approach is a DataFrame with a CharacterList column:

> library(S4Vectors)
> library(IRanges)   # CharacterList is defined in IRanges
> DataFrame(UNIPROTKB = names(thelst), KEGG = as(thelst, "CharacterList"))
DataFrame with 3 rows and 2 columns
         UNIPROTKB                           KEGG
       <character>                <CharacterList>
Q12345      Q12345           hsa:1,hsa:12,hsa:123
R4534        R4534                       hsa:1234
T96874      T96874 hsa:12345,hsa:2345,hsa:234,...

But it all depends on what you are planning on doing with these data...

Answer: UniProt.ws report 1:many mappings in a single row
l.nilse wrote (3.0 years ago):

My UNIPROT keys are not necessarily unique (in your example they are: Q12345, R4534, T96874), so split/collapse will not work.

I had hoped for a multiVals parameter (first, CharacterList, ...) in select(), as is available in the mapIds() interface: https://bioconductor.org/packages/release/bioc/html/AnnotationDbi.html. That would solve my problem.


I am not sure what you mean when you say my UNIPROT keys are unique. There are replicates for two of them (e.g., there are three Q12345 and five T96874)! That's the scenario you are talking about, no?

In addition, the strategy I have laid out here is exactly what mapIds does - it calls select first, then uses split to convert to a list and then either returns the list or converts to a CharacterList - so I have provided you with code to emulate exactly what you say you want.
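
As a rough base-R sketch of that select-then-split idea (not the actual mapIds source): `collapse_mapping` below is a hypothetical helper, not part of AnnotationDbi or UniProt.ws, and its "collapse" option is an invention for this example rather than a mapIds setting. It assumes a two-column key/value table like the fake data above.

```r
# Hypothetical helper: reduce a key/value table to one entry per requested key.
collapse_mapping <- function(df, keys, keycol, valcol,
                             multiVals = c("list", "first", "collapse")) {
  multiVals <- match.arg(multiVals)
  # One list element per unique key, holding all of its unique values.
  lst <- lapply(split(as.character(df[[valcol]]), df[[keycol]]), unique)
  # Look up each requested key so duplicates and order are preserved;
  # keys absent from df become NA.
  out <- lapply(keys, function(k) {
    if (k %in% names(lst)) lst[[k]] else NA_character_
  })
  names(out) <- keys
  switch(multiVals,
    list     = out,
    first    = vapply(out, function(v) v[1], character(1)),
    collapse = vapply(out, function(v) {
      if (all(is.na(v))) NA_character_ else paste(v, collapse = ", ")
    }, character(1))
  )
}

# Same fake data as in the answer above.
thedf <- data.frame(
  UNIPROTKB = rep(c("Q12345", "R4534", "T96874"), c(3, 1, 5)),
  KEGG = paste0("hsa:", c(1, 12, 123, 1234, 12345, 2345, 234, 432, 44323)),
  stringsAsFactors = FALSE
)

# Requested keys with a duplicate and one key that has no mapping (made up).
my_keys <- c("T96874", "Q12345", "Q12345", "X00000")
first_hits <- collapse_mapping(thedf, my_keys, "UNIPROTKB", "KEGG", "first")
all_hits   <- collapse_mapping(thedf, my_keys, "UNIPROTKB", "KEGG", "collapse")
```

Both results have one element per entry in `my_keys`, in the same order, with NA for the unmapped key.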

Also, please don't use the 'Add your answer' box below, unless you are actually answering a question. Do note that it says clearly, right above that box, that you should use the ADD COMMENT button to discuss an answer, etc.

(comment by James W. MacDonald, 3.0 years ago)

I guess replicated IDs have the same mapping, so you could index the (non-replicated) results by the original vector: if df is the 'more modern' result from Jim, then df$KEGG[match(your_ids, df$UNIPROTKB)].

(comment by Martin Morgan, 3.0 years ago)

Together with match() this is a nice solution. It saves me from calling select() multiple times.

Thanks a lot for this, Jim and Martin. 

(comment by l.nilse, 3.0 years ago)