Hi All
I'm not sure if this is even the correct forum to ask this. I wonder if anyone can offer some advice about where I am going wrong with a loop I'm trying to write for cleaning up UNIPROT data names.
Basically, the name I have from proteomics analysis is something like tr|A0A02DLI66|A0A02DLI66_MYTGA, but I would like to strip it to just A0A02DLI66.
I have gotten to the point where I can manually clean it up per sample using this code:
fData(x)$UNPROTKB=fData(x)$DatabaseAccess
fData(x)$UNPROTKB=gsub(pattern="^[^ab][^ab].", replacement="",x=fData(x)$UNPROTKB)
fData(x)$UNPROTKB=gsub(pattern="\\|..", replacement="",x=fData(x)$UNPROTKB)
but the number of samples is getting quite high and this is time consuming. So I thought a loop would help, and while it runs, it is not changing anything. I'm unsure if I am missing something in the code, which is as follows:
tmp <- sapply(nms, function(.bap) {
    cat("Processing", .bap, "... ")
    x <- get(.bap, envir = .GlobalEnv)
    fData(x)$UNPROTKB=fData(x)$DatabaseAccess
    fData(x)$UNPROTKB=gsub(pattern="^[^ab][^ab].", replacement="",x=fData(x)$UNPROTKB)
    fData(x)$UNPROTKB=gsub(pattern="\\|..", replacement="",x=fData(x)$UNPROTKB)
    varnm <- sub("bap", "bap", .bap)
    assign(varnm, x, envir = .GlobalEnv)
    cat("done\n")})
Any help would be much appreciated.
L

These days the tradeoff between a loop and a vectorized operation really only starts to matter at very large N:
> z <- rep(c(
    "tr|A0A02DLI66|A0A02DLI66_MYTGA",
    "tr|B1XC03|B1XC03_ECODH HU, DNA-binding transcriptional regulator",
    "tr|A9UF05|A9UF05_HUMAN BCR/ABL fusion protein isoform Y3"
  ), 5e5)
> system.time(sapply(strsplit(z, "\\|"), "[", 2))
   user  system elapsed
   2.44    0.15    2.62
> system.time(gsub("^tr\\|([A-Z_0-9]+)\\|.*$", "\\1", z))
   user  system elapsed
   1.89    0.00    1.91
> z <- rep(c(
    "tr|A0A02DLI66|A0A02DLI66_MYTGA",
    "tr|B1XC03|B1XC03_ECODH HU, DNA-binding transcriptional regulator",
    "tr|A9UF05|A9UF05_HUMAN BCR/ABL fusion protein isoform Y3"
  ), 5e6)
> system.time(sapply(strsplit(z, "\\|"), "[", 2))
   user  system elapsed
  30.25    0.41   30.97
> system.time(gsub("^tr\\|([A-Z_0-9]+)\\|.*$", "\\1", z))
   user  system elapsed
  19.08    0.00   19.30

So at 5M, the vectorized approach is ~40% faster, but it's ten seconds.

You are right. Your solution will be even twice as fast as the gsub if you use fixed=TRUE and vapply:
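
A minimal sketch of that idea, assuming the accession is always the second "|"-delimited field. The variable names `parts`, `acc` and `acc_gsub` are illustrative and not from the thread, and no timings are claimed here since they will vary by machine and R version:

```r
# Three sample identifiers taken from the timing example in this thread
z <- c(
  "tr|A0A02DLI66|A0A02DLI66_MYTGA",
  "tr|B1XC03|B1XC03_ECODH HU, DNA-binding transcriptional regulator",
  "tr|A9UF05|A9UF05_HUMAN BCR/ABL fusion protein isoform Y3"
)

# fixed = TRUE treats "|" as a literal character, so the split
# needs no escaping and skips the regular-expression engine
parts <- strsplit(z, "|", fixed = TRUE)

# vapply states the expected result type (one string per element),
# so it fails loudly if an element does not yield exactly one value
acc <- vapply(parts, "[", character(1), 2)

# Cross-check against the gsub() pattern used earlier in the thread
acc_gsub <- gsub("^tr\\|([A-Z_0-9]+)\\|.*$", "\\1", z)

print(acc)
print(identical(acc, acc_gsub))
```

To compare speed on your own data, wrap each call in system.time() on a large vector, as in the earlier reply.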