Writing a loop to clean up protein names
2
0
Entering edit mode
laural710 • 0
@laural710-14567
Last seen 4.4 years ago

Hi All

I'm not sure if this is even the correct forum to ask this. I wonder if anyone can offer some advice about where i am going wrong with a loop i'm trying to write for cleaning up UNIPROT data names. 

Basically, the name i have from proteomics analysis is something like tr|A0A02DLI66|A0A02DLI66_MYTGA but i would like to strip it to just A0A02DLI66.

I have gotten to the point where i can manually clean it up per sample using this code: 

 fData(x)$UNPROTKB=fData(x)$DatabaseAccess

  fData(x)$UNPROTKB=gsub(pattern="^[^ab][^ab].", replacement="",x=fData(x)$UNPROTKB)

  fData(x)$UNPROTKB=gsub(pattern="\|..", replacement="",x=fData(x)$UNPROTKB)

but my samples are running quite high and this is time cosuming. So i thought a loop would help and while it is processing, it is not changing anything. I'm unsure if i am missing something in the code which is as follows:

tmp <- sapply(nms, function(.bap) {

  cat("Processing", .bap, "... ")

  x <- get(.bap, envir = .GlobalEnv)

  fData(x)$UNPROTKB=fData(x)$DatabaseAccess

  fData(x)$UNPROTKB=gsub(pattern="^[^ab][^ab].", replacement="",x=fData(x)$UNPROTKB)

  fData(x)$UNPROTKB=gsub(pattern="\|..", replacement="",x=fData(x)$UNPROTKB)

  varnm <- sub("bap", "bap", .bap)

  assign(varnm, x, envir = .GlobalEnv)

  cat("done\n")})

Any help would be much appreciated. 

L

Proteomics Transcriptomics Loops batch-processing R • 1.3k views
ADD COMMENT
2
Entering edit mode
@james-w-macdonald-5106
Last seen 5 hours ago
United States

It's not really a Bioconductor question, but maybe tangential enough?

I tend to like the *apply family of functions because I'm old and so are they. So I would normally just do something like

fData(x)$UNIPROTKB <- sapply(strsplit(fData(x)$DatabaseAccess, "\\|"), "[", 2)

Which isn't going to be super fast if you have lots of proteins, but it's one line and relatively straightforward. To decompose, the middle part of that is

strsplit(fData(x)$DatabaseAccess, "\\|")

which generates a list for each value, where we split on the pipe (|) symbol, which has to be escaped with \\ so you split on the literal '|' character. So that does something like

> strstrsplit("tr|A0A02DLI66|A0A02DLI66_MYTGA", "\\|")
[[1]]
[1] "tr"               "A0A02DLI66"       "A0A02DLI66_MYTGA"

and the sapply(<stuff goes here>, "[", 2) part just means 'return the second thing from each list item, and if it can be reduced to a vector, do so, please'. So as a (super boring) example

> lotsastuff <- rep("tr|A0A02DLI66|A0A02DLI66_MYTGA", 30)
> sapply(strsplit(lotsastuff, "\\|"), "[", 2)
 [1] "A0A02DLI66" "A0A02DLI66" "A0A02DLI66" "A0A02DLI66" "A0A02DLI66"
 [6] "A0A02DLI66" "A0A02DLI66" "A0A02DLI66" "A0A02DLI66" "A0A02DLI66"
[11] "A0A02DLI66" "A0A02DLI66" "A0A02DLI66" "A0A02DLI66" "A0A02DLI66"
[16] "A0A02DLI66" "A0A02DLI66" "A0A02DLI66" "A0A02DLI66" "A0A02DLI66"
[21] "A0A02DLI66" "A0A02DLI66" "A0A02DLI66" "A0A02DLI66" "A0A02DLI66"
[26] "A0A02DLI66" "A0A02DLI66" "A0A02DLI66" "A0A02DLI66" "A0A02DLI66"

 

ADD COMMENT
0
Entering edit mode
@sebastian-gibb-6858
Last seen 5.5 years ago
Germany

This is a typical fasta header (see the following link for a full explanation: https://www.uniprot.org/help/fasta-headers).

While using strsplit is easy to understand it requires a loop (or *apply) to access the AccessionNumber/UnqiueIdentifier. You could use gsub with a different pattern to use a vectorized approach:

fasta <- c(
    "tr|A0A02DLI66|A0A02DLI66_MYTGA",
    "tr|B1XC03|B1XC03_ECODH HU, DNA-binding transcriptional regulator",
    "tr|A9UF05|A9UF05_HUMAN BCR/ABL fusion protein isoform Y3"
)

gsub("^tr\\|([A-Z_0-9-]+)\\|.*$", "\\1", fasta)
# [1] "A0A02DLI66" "B1XC03"     "A9UF05"

(A-Z_0-9-]+) defines a group that contains A-Z, 0-9, - or _ (that are the allowed characters for an AccessionNumber). The pattern matches the whole string but the replacement \\1 is just the predefined group. BTW: You could have a look at Pbase fasta parser if you interested in a full featured parser for fasta headers: https://github.com/ComputationalProteomicsUnit/Pbase/blob/master/R/fasta-functions.R

ADD COMMENT
1
Entering edit mode

These days the tradeoff for a loop and a vectorized operation really only start to matter with really large N:

> z <- rep(c(
    "tr|A0A02DLI66|A0A02DLI66_MYTGA",
    "tr|B1XC03|B1XC03_ECODH HU, DNA-binding transcriptional regulator",
    "tr|A9UF05|A9UF05_HUMAN BCR/ABL fusion protein isoform Y3"
), 5e5)

> system.time(sapply(strsplit(z, "\\|"), "[", 2))
   user  system elapsed
   2.44    0.15    2.62
> system.time(gsub("^tr\\|([A-Z_0-9]+)\\|.*$", "\\1", z))
   user  system elapsed
   1.89    0.00    1.91

> z <- rep(c(
    "tr|A0A02DLI66|A0A02DLI66_MYTGA",
    "tr|B1XC03|B1XC03_ECODH HU, DNA-binding transcriptional regulator",
    "tr|A9UF05|A9UF05_HUMAN BCR/ABL fusion protein isoform Y3"
), 5e6)

> system.time(sapply(strsplit(z, "\\|"), "[", 2))
   user  system elapsed
  30.25    0.41   30.97
> system.time(gsub("^tr\\|([A-Z_0-9]+)\\|.*$", "\\1", z))
   user  system elapsed
  19.08    0.00   19.30

So at 5M, the vectorized approach is ~40% faster, but it's ten seconds.

ADD REPLY
0
Entering edit mode

You are right. Your solution will be even twice as fast as the gsub if you use fixed=TRUE and vapply:

z <- rep(c(
    "tr|A1A02DLI66|A0A02DLI66_MYTGA",
    "tr|B2XC03|B1XC03_ECODH HU, DNA-binding transcriptional regulator",
    "tr|A10UF05|A9UF05_HUMAN BCR/ABL fusion protein isoform Y3"
), 5e3)

gs <- function(x)gsub("^tr\\|([A-Z_0-9-]+)\\|.*", "\\1", x)
ssp1 <- function(x)sapply(strsplit(x, "\\|"), "[", 2)
ssp2 <- function(x)sapply(strsplit(x, "|", fixed=TRUE), "[", 2)
ssp3 <- function(x)vapply(strsplit(x, "|", fixed=TRUE), "[", character(1), 2)

all.equal(gs(z), ssp1(z))
# [1] TRUE
all.equal(gs(z), ssp2(z))
# [1] TRUE
all.equal(gs(z), ssp3(z))
# [1] TRUE

library("rbenchmark")
benchmark(gs(z), ssp1(z), ssp2(z), ssp3(z), order="relative", replications=10)
#      test replications elapsed relative user.self sys.self user.child sys.child
# 4 ssp3(z)           10   0.126    1.000     0.125    0.000          0         0
# 3 ssp2(z)           10   0.139    1.103     0.139    0.000          0         0
# 1   gs(z)           10   0.240    1.905     0.239    0.001          0         0
# 2 ssp1(z)           10   0.531    4.214     0.531    0.000          0         0
ADD REPLY

Login before adding your answer.

Traffic: 751 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6