Search
Question: Writing a loop to clean up protein names
0
8 weeks ago by
laural7100 wrote:

Hi All

I'm not sure if this is even the correct forum to ask this. I wonder if anyone can offer some advice about where i am going wrong with a loop i'm trying to write for cleaning up UNIPROT data names.

Basically, the name i have from proteomics analysis is something like tr|A0A02DLI66|A0A02DLI66_MYTGA but i would like to strip it to just A0A02DLI66.

I have gotten to the point where i can manually clean it up per sample using this code:

fData(x)$UNPROTKB=fData(x)$DatabaseAccess

fData(x)$UNPROTKB=gsub(pattern="^[^ab][^ab].", replacement="",x=fData(x)$UNPROTKB)

fData(x)$UNPROTKB=gsub(pattern="\|..", replacement="",x=fData(x)$UNPROTKB)

but my samples are running quite high and this is time cosuming. So i thought a loop would help and while it is processing, it is not changing anything. I'm unsure if i am missing something in the code which is as follows:

tmp <- sapply(nms, function(.bap) {

cat("Processing", .bap, "... ")

x <- get(.bap, envir = .GlobalEnv)

fData(x)$UNPROTKB=fData(x)$DatabaseAccess

fData(x)$UNPROTKB=gsub(pattern="^[^ab][^ab].", replacement="",x=fData(x)$UNPROTKB)

fData(x)$UNPROTKB=gsub(pattern="\|..", replacement="",x=fData(x)$UNPROTKB)

varnm <- sub("bap", "bap", .bap)

assign(varnm, x, envir = .GlobalEnv)

cat("done\n")})

Any help would be much appreciated.

L

modified 8 weeks ago by Sebastian Gibb80 • written 8 weeks ago by laural7100
2
8 weeks ago by
United States
James W. MacDonald48k wrote:

It's not really a Bioconductor question, but maybe tangential enough?

I tend to like the *apply family of functions because I'm old and so are they. So I would normally just do something like

fData(x)$UNIPROTKB <- sapply(strsplit(fData(x)$DatabaseAccess, "\\|"), "[", 2)

Which isn't going to be super fast if you have lots of proteins, but it's one line and relatively straightforward. To decompose, the middle part of that is

strsplit(fData(x)$DatabaseAccess, "\\|") which generates a list for each value, where we split on the pipe (|) symbol, which has to be escaped with \\ so you split on the literal '|' character. So that does something like > strstrsplit("tr|A0A02DLI66|A0A02DLI66_MYTGA", "\\|") [[1]] [1] "tr" "A0A02DLI66" "A0A02DLI66_MYTGA" and the sapply(<stuff goes here>, "[", 2) part just means 'return the second thing from each list item, and if it can be reduced to a vector, do so, please'. So as a (super boring) example > lotsastuff <- rep("tr|A0A02DLI66|A0A02DLI66_MYTGA", 30) > sapply(strsplit(lotsastuff, "\\|"), "[", 2) [1] "A0A02DLI66" "A0A02DLI66" "A0A02DLI66" "A0A02DLI66" "A0A02DLI66" [6] "A0A02DLI66" "A0A02DLI66" "A0A02DLI66" "A0A02DLI66" "A0A02DLI66" [11] "A0A02DLI66" "A0A02DLI66" "A0A02DLI66" "A0A02DLI66" "A0A02DLI66" [16] "A0A02DLI66" "A0A02DLI66" "A0A02DLI66" "A0A02DLI66" "A0A02DLI66" [21] "A0A02DLI66" "A0A02DLI66" "A0A02DLI66" "A0A02DLI66" "A0A02DLI66" [26] "A0A02DLI66" "A0A02DLI66" "A0A02DLI66" "A0A02DLI66" "A0A02DLI66" ADD COMMENTlink written 8 weeks ago by James W. MacDonald48k 0 8 weeks ago by Germany Sebastian Gibb80 wrote: This is a typical fasta header (see the following link for a full explanation: https://www.uniprot.org/help/fasta-headers). While using strsplit is easy to understand it requires a loop (or *apply) to access the AccessionNumber/UnqiueIdentifier. You could use gsub with a different pattern to use a vectorized approach: fasta <- c( "tr|A0A02DLI66|A0A02DLI66_MYTGA", "tr|B1XC03|B1XC03_ECODH HU, DNA-binding transcriptional regulator", "tr|A9UF05|A9UF05_HUMAN BCR/ABL fusion protein isoform Y3" ) gsub("^tr\\|([A-Z_0-9-]+)\\|.*$", "\\1", fasta)
# [1] "A0A02DLI66" "B1XC03"     "A9UF05"


(A-Z_0-9-]+) defines a group that contains A-Z, 0-9, - or _ (that are the allowed characters for an AccessionNumber). The pattern matches the whole string but the replacement \\1 is just the predefined group. BTW: You could have a look at Pbase fasta parser if you interested in a full featured parser for fasta headers: https://github.com/ComputationalProteomicsUnit/Pbase/blob/master/R/fasta-functions.R

1

These days the tradeoff for a loop and a vectorized operation really only start to matter with really large N:

> z <- rep(c(
"tr|A0A02DLI66|A0A02DLI66_MYTGA",
"tr|B1XC03|B1XC03_ECODH HU, DNA-binding transcriptional regulator",
"tr|A9UF05|A9UF05_HUMAN BCR/ABL fusion protein isoform Y3"
), 5e5)

> system.time(sapply(strsplit(z, "\\|"), "[", 2))
user  system elapsed
2.44    0.15    2.62
> system.time(gsub("^tr\\|([A-Z_0-9]+)\\|.*$", "\\1", z)) user system elapsed 1.89 0.00 1.91 > z <- rep(c( "tr|A0A02DLI66|A0A02DLI66_MYTGA", "tr|B1XC03|B1XC03_ECODH HU, DNA-binding transcriptional regulator", "tr|A9UF05|A9UF05_HUMAN BCR/ABL fusion protein isoform Y3" ), 5e6) > system.time(sapply(strsplit(z, "\\|"), "[", 2)) user system elapsed 30.25 0.41 30.97 > system.time(gsub("^tr\\|([A-Z_0-9]+)\\|.*$", "\\1", z))
user  system elapsed
19.08    0.00   19.30

So at 5M, the vectorized approach is ~40% faster, but it's ten seconds.

You are right. Your solution will be even twice as fast as the gsub if you use fixed=TRUE and vapply:

z <- rep(c(
"tr|A1A02DLI66|A0A02DLI66_MYTGA",
"tr|B2XC03|B1XC03_ECODH HU, DNA-binding transcriptional regulator",
"tr|A10UF05|A9UF05_HUMAN BCR/ABL fusion protein isoform Y3"
), 5e3)

gs <- function(x)gsub("^tr\\|([A-Z_0-9-]+)\\|.*", "\\1", x)
ssp1 <- function(x)sapply(strsplit(x, "\\|"), "[", 2)
ssp2 <- function(x)sapply(strsplit(x, "|", fixed=TRUE), "[", 2)
ssp3 <- function(x)vapply(strsplit(x, "|", fixed=TRUE), "[", character(1), 2)

all.equal(gs(z), ssp1(z))
# [1] TRUE
all.equal(gs(z), ssp2(z))
# [1] TRUE
all.equal(gs(z), ssp3(z))
# [1] TRUE

library("rbenchmark")
benchmark(gs(z), ssp1(z), ssp2(z), ssp3(z), order="relative", replications=10)
#      test replications elapsed relative user.self sys.self user.child sys.child
# 4 ssp3(z)           10   0.126    1.000     0.125    0.000          0         0
# 3 ssp2(z)           10   0.139    1.103     0.139    0.000          0         0
# 1   gs(z)           10   0.240    1.905     0.239    0.001          0         0
# 2 ssp1(z)           10   0.531    4.214     0.531    0.000          0         0