Question

UniProt.ws report 1:many mappings in a single row

l.nilse • 0

@lnilse-11875

Last seen 6.1 years ago

I am using UNIPROTKB keys to retrieve KEGG entries. Often multiple KEGG entries are found for the same key, resulting in a select() result with more rows than keys.

'select()' returned 1:many mapping between keys and columns

Is it possible to get multiple KEGG entries in the same row, so that the number of rows in the result always equals the number of keys?

uniprot.ws UniProt annotation • 1.9k views

ADD COMMENT • link 7.5 years ago • updated 7.4 years ago l.nilse • 0

James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 8 hours ago

United States

It's possible, but it depends on how you want the data to be formatted. You can use split to convert the returned data.frame into a list, and then you could collapse the list into a row-wise structure.

As an example I'll just use some fake data.

> thedf <- data.frame(UNIPROTKB = rep(c("Q12345", "R4534", "T96874"), c(3,1,5)), KEGG = paste0("hsa:", c(1, 12, 123, 1234, 12345, 2345, 234, 432, 44323)))
> thedf
  UNIPROTKB      KEGG
1    Q12345     hsa:1
2    Q12345    hsa:12
3    Q12345   hsa:123
4     R4534  hsa:1234
5    T96874 hsa:12345
6    T96874  hsa:2345
7    T96874   hsa:234
8    T96874   hsa:432
9    T96874 hsa:44323

That seems suitably fake, no? Now split into a list.

> thelst <- split(thedf[,2], thedf[,1])
> thelst
$Q12345
[1] "hsa:1"   "hsa:12"  "hsa:123"

$R4534
[1] "hsa:1234"

$T96874
[1] "hsa:12345" "hsa:2345"  "hsa:234"   "hsa:432"   "hsa:44323"

And we can do something like

> data.frame(UNIPROTKB = names(thelst), KEGG = do.call(c, lapply(thelst, paste, collapse = ", ")))
       UNIPROTKB                                             KEGG
Q12345    Q12345                           hsa:1, hsa:12, hsa:123
R4534      R4534                                         hsa:1234
T96874    T96874 hsa:12345, hsa:2345, hsa:234, hsa:432, hsa:44323

Which is sort of bootleg. We can be more modern by

> library(IRanges)  # provides CharacterList; also attaches S4Vectors (DataFrame)
> DataFrame(UNIPROTKB = names(thelst), KEGG = as(thelst, "CharacterList"))
DataFrame with 3 rows and 2 columns
         UNIPROTKB                           KEGG
       <character>                <CharacterList>
Q12345      Q12345           hsa:1,hsa:12,hsa:123
R4534        R4534                       hsa:1234
T96874      T96874 hsa:12345,hsa:2345,hsa:234,...

But it all depends on what you are planning on doing with these data...

ADD COMMENT • link 7.5 years ago James W. MacDonald 65k

l.nilse • 0

@lnilse-11875

Last seen 6.1 years ago

My UNIPROT keys are not necessarily unique (in your example they are: Q12345, R4534, T96874), i.e. split/collapse will not work.

I had hoped for a multiVals parameter (first, CharacterList, ...) in select(), as is available in the mapIds() interface of AnnotationDbi (https://bioconductor.org/packages/release/bioc/html/AnnotationDbi.html). That would solve my problem.

ADD COMMENT • link 7.5 years ago l.nilse • 0

I am not sure what you mean when you say my UNIPROT keys are unique. There are replicates for two of them (e.g., there are three Q12345 and five T96874)! That's the scenario you are talking about, no?

In addition, the strategy I have laid out here is exactly what mapIds does - it calls select first, then uses split to convert to a list and then either returns the list or converts to a CharacterList - so I have provided you with code to emulate exactly what you say you want.
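To make the split-then-reduce idea concrete, here is a minimal base-R sketch. It assumes a select()-style data.frame of made-up IDs; `map_ids_sketch` and its "collapse" option are hypothetical names for illustration and are not part of AnnotationDbi or UniProt.ws.

```r
# Fake select()-style result: one row per (key, KEGG) pair.
# All identifiers are invented for illustration.
sel <- data.frame(
  UNIPROTKB = rep(c("Q12345", "R4534", "T96874"), c(3, 1, 5)),
  KEGG = paste0("hsa:", c(1, 12, 123, 1234, 12345, 2345, 234, 432, 44323)),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split the value column by key, then reduce each
# group according to a multiVals-like option.
map_ids_sketch <- function(df, keycol, valcol,
                           multiVals = c("first", "list", "collapse")) {
  multiVals <- match.arg(multiVals)
  # One list element per distinct key, holding all of its values.
  lst <- split(df[[valcol]], df[[keycol]])
  switch(multiVals,
    first = vapply(lst, function(x) x[1], character(1)),
    list = lst,
    collapse = vapply(lst, paste, character(1), collapse = ", ")
  )
}

first_hit <- map_ids_sketch(sel, "UNIPROTKB", "KEGG", "first")
all_hits <- map_ids_sketch(sel, "UNIPROTKB", "KEGG", "collapse")
```

Each result has one element per distinct key, named by that key, so it can be indexed by name afterwards.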

Also, please don't use the 'Add your answer' box below, unless you are actually answering a question. Do note that it says clearly, right above that box, that you should use the ADD COMMENT button to discuss an answer, etc.

ADD REPLY • link 7.5 years ago James W. MacDonald 65k

I guess replicate IDs have the same mapping, so you could subset the (non-replicated) results by the original vector: if df is the 'more modern' result from Jim, then use df$KEGG[match(your_ids, df$UNIPROTKB)].
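A minimal base-R sketch of that match() step, under some assumptions: the IDs are made up, a collapsed character column stands in for the CharacterList column, and `expanded` is just an illustrative name.

```r
# Made-up identifiers, for illustration only.
sel <- data.frame(
  UNIPROTKB = c("Q12345", "Q12345", "R4534"),
  KEGG = c("hsa:1", "hsa:12", "hsa:1234"),
  stringsAsFactors = FALSE
)
your_ids <- c("R4534", "Q12345", "R4534", "Q12345")  # note the repeats

# Collapse to one row per distinct key ...
lst <- split(sel$KEGG, sel$UNIPROTKB)
df <- data.frame(
  UNIPROTKB = names(lst),
  KEGG = vapply(lst, paste, character(1), collapse = ", "),
  stringsAsFactors = FALSE
)

# ... then expand back to the original key vector, repeats included.
expanded <- df[match(your_ids, df$UNIPROTKB), ]
rownames(expanded) <- NULL
```

The result has exactly one row per element of your_ids, in the same order. A key with no match would come back as a row of NAs.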

ADD REPLY • link 7.5 years ago Martin Morgan 25k

Together with match() this is a nice solution. It saves me from calling select() multiple times.

Thanks a lot for this, Jim and Martin.

ADD REPLY • link 7.5 years ago l.nilse • 0

score 1 · Accepted Answer · 2016-11-26

Here is a possible solution, complete with an example. Note that there are duplicate entries in the keys and multiple KEGG results.

library(UniProt.ws)

species <- 9606    # Homo sapiens
up <- UniProt.ws(taxId=species)

# Collapse a vector to its unique values, separated by single spaces.
UniqueSpaceSeparated <- function(x) {
  paste(unique(x), collapse=" ")
}

columns <- c("SEQUENCE","GO", "SUBCELLULAR-LOCATIONS", "PROTEIN-NAMES", "GENES", "KEGG")

keytype <- "UNIPROTKB"

# Note the duplicated keys (O75083 twice, O75144 three times).
keys <- c("O75083","O75083","O75084","O75131","O75144","O75264","O75309","O75339","O75340","O75348","O75351","O75144","O75144")

# Note that O75144 has two KEGG entries.
a <- select(up, keys, columns, keytype)

# One row per distinct key; multiple values are collapsed into one string.
b <- aggregate(a, by=list(a[,keytype]), FUN=UniqueSpaceSeparated)

# Drop the grouping column (Group.1) that aggregate() prepends.
b[,1] <- NULL

# Expand back to the original key vector, duplicates included.
idx <- match(keys,b[,keytype])

annotations <- b[idx,]
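To try the aggregate()/match() steps without a network connection, the following sketch swaps the select() call for a hand-made data.frame. Every accession, gene name and KEGG ID in it is invented, and only the column layout is assumed to resemble a select() result.

```r
# Hand-made stand-in for the select() result; all values are invented.
a <- data.frame(
  UNIPROTKB = c("P00001", "P00002", "P00002", "P00003"),
  GENES = c("GENEA", "GENEB", "GENEB", "GENEC"),
  KEGG = c("hsa:10", "hsa:20", "hsa:21", "hsa:30"),
  stringsAsFactors = FALSE
)
keys <- c("P00001", "P00001", "P00002", "P00003", "P00002")
keytype <- "UNIPROTKB"

UniqueSpaceSeparated <- function(x) paste(unique(x), collapse = " ")

# Collapse every column within each key, then drop the grouping column.
b <- aggregate(a, by = list(a[, keytype]), FUN = UniqueSpaceSeparated)
b$Group.1 <- NULL

# One row per requested key, duplicates included.
annotations <- b[match(keys, b[, keytype]), ]

# The property asked for in the question: rows and keys line up one to one.
stopifnot(nrow(annotations) == length(keys))
```

One caveat with this approach: paste() turns a missing value into the string "NA", so keys without a KEGG entry may need separate handling.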