Search
Question: Writing a loop to clean up protein names
0
gravatar for laural710
8 weeks ago by
laural7100
laural7100 wrote:

Hi All

I'm not sure if this is even the correct forum to ask this. I wonder if anyone can offer some advice about where i am going wrong with a loop i'm trying to write for cleaning up UNIPROT data names. 

Basically, the name i have from proteomics analysis is something like tr|A0A02DLI66|A0A02DLI66_MYTGA but i would like to strip it to just A0A02DLI66.

I have gotten to the point where i can manually clean it up per sample using this code: 

 fData(x)$UNPROTKB=fData(x)$DatabaseAccess

  fData(x)$UNPROTKB=gsub(pattern="^[^ab][^ab].", replacement="",x=fData(x)$UNPROTKB)

  fData(x)$UNPROTKB=gsub(pattern="\|..", replacement="",x=fData(x)$UNPROTKB)

but my samples are running quite high and this is time cosuming. So i thought a loop would help and while it is processing, it is not changing anything. I'm unsure if i am missing something in the code which is as follows:

tmp <- sapply(nms, function(.bap) {

  cat("Processing", .bap, "... ")

  x <- get(.bap, envir = .GlobalEnv)

  fData(x)$UNPROTKB=fData(x)$DatabaseAccess

  fData(x)$UNPROTKB=gsub(pattern="^[^ab][^ab].", replacement="",x=fData(x)$UNPROTKB)

  fData(x)$UNPROTKB=gsub(pattern="\|..", replacement="",x=fData(x)$UNPROTKB)

  varnm <- sub("bap", "bap", .bap)

  assign(varnm, x, envir = .GlobalEnv)

  cat("done\n")})

Any help would be much appreciated. 

L

ADD COMMENTlink modified 8 weeks ago by Sebastian Gibb80 • written 8 weeks ago by laural7100
2
gravatar for James W. MacDonald
8 weeks ago by
United States
James W. MacDonald48k wrote:

It's not really a Bioconductor question, but maybe tangential enough?

I tend to like the *apply family of functions because I'm old and so are they. So I would normally just do something like

fData(x)$UNIPROTKB <- sapply(strsplit(fData(x)$DatabaseAccess, "\\|"), "[", 2)

Which isn't going to be super fast if you have lots of proteins, but it's one line and relatively straightforward. To decompose, the middle part of that is

strsplit(fData(x)$DatabaseAccess, "\\|")

which generates a list for each value, where we split on the pipe (|) symbol, which has to be escaped with \\ so you split on the literal '|' character. So that does something like

> strstrsplit("tr|A0A02DLI66|A0A02DLI66_MYTGA", "\\|")
[[1]]
[1] "tr"               "A0A02DLI66"       "A0A02DLI66_MYTGA"

and the sapply(<stuff goes here>, "[", 2) part just means 'return the second thing from each list item, and if it can be reduced to a vector, do so, please'. So as a (super boring) example

> lotsastuff <- rep("tr|A0A02DLI66|A0A02DLI66_MYTGA", 30)
> sapply(strsplit(lotsastuff, "\\|"), "[", 2)
 [1] "A0A02DLI66" "A0A02DLI66" "A0A02DLI66" "A0A02DLI66" "A0A02DLI66"
 [6] "A0A02DLI66" "A0A02DLI66" "A0A02DLI66" "A0A02DLI66" "A0A02DLI66"
[11] "A0A02DLI66" "A0A02DLI66" "A0A02DLI66" "A0A02DLI66" "A0A02DLI66"
[16] "A0A02DLI66" "A0A02DLI66" "A0A02DLI66" "A0A02DLI66" "A0A02DLI66"
[21] "A0A02DLI66" "A0A02DLI66" "A0A02DLI66" "A0A02DLI66" "A0A02DLI66"
[26] "A0A02DLI66" "A0A02DLI66" "A0A02DLI66" "A0A02DLI66" "A0A02DLI66"

 

ADD COMMENTlink written 8 weeks ago by James W. MacDonald48k
0
gravatar for Sebastian Gibb
8 weeks ago by
Germany
Sebastian Gibb80 wrote:

This is a typical fasta header (see the following link for a full explanation: https://www.uniprot.org/help/fasta-headers).

While using strsplit is easy to understand it requires a loop (or *apply) to access the AccessionNumber/UnqiueIdentifier. You could use gsub with a different pattern to use a vectorized approach:

fasta <- c(
    "tr|A0A02DLI66|A0A02DLI66_MYTGA",
    "tr|B1XC03|B1XC03_ECODH HU, DNA-binding transcriptional regulator",
    "tr|A9UF05|A9UF05_HUMAN BCR/ABL fusion protein isoform Y3"
)

gsub("^tr\\|([A-Z_0-9-]+)\\|.*$", "\\1", fasta)
# [1] "A0A02DLI66" "B1XC03"     "A9UF05"

(A-Z_0-9-]+) defines a group that contains A-Z, 0-9, - or _ (that are the allowed characters for an AccessionNumber). The pattern matches the whole string but the replacement \\1 is just the predefined group. BTW: You could have a look at Pbase fasta parser if you interested in a full featured parser for fasta headers: https://github.com/ComputationalProteomicsUnit/Pbase/blob/master/R/fasta-functions.R

ADD COMMENTlink written 8 weeks ago by Sebastian Gibb80
1

These days the tradeoff for a loop and a vectorized operation really only start to matter with really large N:

> z <- rep(c(
    "tr|A0A02DLI66|A0A02DLI66_MYTGA",
    "tr|B1XC03|B1XC03_ECODH HU, DNA-binding transcriptional regulator",
    "tr|A9UF05|A9UF05_HUMAN BCR/ABL fusion protein isoform Y3"
), 5e5)

> system.time(sapply(strsplit(z, "\\|"), "[", 2))
   user  system elapsed
   2.44    0.15    2.62
> system.time(gsub("^tr\\|([A-Z_0-9]+)\\|.*$", "\\1", z))
   user  system elapsed
   1.89    0.00    1.91

> z <- rep(c(
    "tr|A0A02DLI66|A0A02DLI66_MYTGA",
    "tr|B1XC03|B1XC03_ECODH HU, DNA-binding transcriptional regulator",
    "tr|A9UF05|A9UF05_HUMAN BCR/ABL fusion protein isoform Y3"
), 5e6)

> system.time(sapply(strsplit(z, "\\|"), "[", 2))
   user  system elapsed
  30.25    0.41   30.97
> system.time(gsub("^tr\\|([A-Z_0-9]+)\\|.*$", "\\1", z))
   user  system elapsed
  19.08    0.00   19.30

So at 5M, the vectorized approach is ~40% faster, but it's ten seconds.

ADD REPLYlink written 8 weeks ago by James W. MacDonald48k

You are right. Your solution will be even twice as fast as the gsub if you use fixed=TRUE and vapply:

z <- rep(c(
    "tr|A1A02DLI66|A0A02DLI66_MYTGA",
    "tr|B2XC03|B1XC03_ECODH HU, DNA-binding transcriptional regulator",
    "tr|A10UF05|A9UF05_HUMAN BCR/ABL fusion protein isoform Y3"
), 5e3)

gs <- function(x)gsub("^tr\\|([A-Z_0-9-]+)\\|.*", "\\1", x)
ssp1 <- function(x)sapply(strsplit(x, "\\|"), "[", 2)
ssp2 <- function(x)sapply(strsplit(x, "|", fixed=TRUE), "[", 2)
ssp3 <- function(x)vapply(strsplit(x, "|", fixed=TRUE), "[", character(1), 2)

all.equal(gs(z), ssp1(z))
# [1] TRUE
all.equal(gs(z), ssp2(z))
# [1] TRUE
all.equal(gs(z), ssp3(z))
# [1] TRUE

library("rbenchmark")
benchmark(gs(z), ssp1(z), ssp2(z), ssp3(z), order="relative", replications=10)
#      test replications elapsed relative user.self sys.self user.child sys.child
# 4 ssp3(z)           10   0.126    1.000     0.125    0.000          0         0
# 3 ssp2(z)           10   0.139    1.103     0.139    0.000          0         0
# 1   gs(z)           10   0.240    1.905     0.239    0.001          0         0
# 2 ssp1(z)           10   0.531    4.214     0.531    0.000          0         0
ADD REPLYlink modified 8 weeks ago • written 8 weeks ago by Sebastian Gibb80
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 251 users visited in the last hour