LLL is the delimiter
From looking at the entries of your GG vector, it is clear that someone has assembled gene symbols and Entrez Gene IDs, but the symbols and IDs have become concatenated together, separated by "LLL". My guess is that you started with delimited entries such as "UNKL;64718" but you or someone else has mistakenly substituted "LLL" for all the delimiters ";". Needless to say, this is not standard and should never have happened.
Using strsplit()
Anyway, knowing that "LLL" is the delimiter, you could unconcatenate the symbols and ids like this:
> GG <- c("UNKLLLL64718", "FLJ45340LLL402483", "CROCCL2LLL114819",
+ "NCRNA00115LLL79854", "RG9MTD3LLL158234", "ZNF345LLL25850",
+ "ZMAT1LLL84460", "SUGT1L1LLL283507",
+ "LOC100128842LLL100128842", "SFRS18LLL25957")
> GG_ <- sub("LLLLL","__LLL", GG)
> GG_ <- sub("LLLL","_LLL", GG_)
> Genes <- unlist(strsplit(GG_, split="LLL"))
> Genes <- gsub("_", "L", Genes)
> Genes <- matrix(Genes, ncol=2, byrow=TRUE)
> colnames(Genes) <- c("Symbol", "EntrezID")
> Genes
      Symbol         EntrezID
 [1,] "UNKL"         "64718"
 [2,] "FLJ45340"     "402483"
 [3,] "CROCCL2"      "114819"
 [4,] "NCRNA00115"   "79854"
 [5,] "RG9MTD3"      "158234"
 [6,] "ZNF345"       "25850"
 [7,] "ZMAT1"        "84460"
 [8,] "SUGT1L1"      "283507"
 [9,] "LOC100128842" "100128842"
[10,] "SFRS18"       "25957"
The code has to allow for symbols that end in "L" or "LL". My code should handle all cases because there are no human gene symbols ending in more than two Ls.
Using substring()
Base R is very powerful and there are always lots of ways to do the same thing. Here is a somewhat shorter and faster way, starting with Martin's code:
> EntrezID <- regmatches(GG, regexpr("\\d+$", GG))
> n <- nchar(GG) - nchar(EntrezID) - 3
> Symbol <- substring(GG, 1, n)
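As a quick sanity check on the substring() approach, the sketch below rebuilds the two columns for a few of the entries from the question and confirms that the three characters dropped between symbol and ID really are "LLL". The names delim and Genes are only illustrative, and the code assumes every entry ends in digits (regmatches() silently drops entries with no match, which would misalign the vectors).

```r
GG <- c("UNKLLLL64718", "FLJ45340LLL402483", "CROCCL2LLL114819",
        "SUGT1L1LLL283507", "LOC100128842LLL100128842")

# Trailing digits are the Entrez Gene ID
EntrezID <- regmatches(GG, regexpr("\\d+$", GG))

# Everything before the 3-character delimiter is the symbol
n <- nchar(GG) - nchar(EntrezID) - 3
Symbol <- substring(GG, 1, n)

# The three characters between symbol and ID should always be "LLL"
delim <- substring(GG, n + 1, n + 3)
stopifnot(all(delim == "LLL"))

Genes <- data.frame(Symbol, EntrezID, stringsAsFactors = FALSE)
Genes
```

Because the split point is counted from the end of the string, symbols that themselves end in "L" (such as UNKL) need no special handling.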
On the other hand ...
On the other hand, you could consider going back to the data repository from which you got the vector GG, and trying to obtain the original data before all the LLLs got inserted in the first place.
Thanks and yes. I figured that out just now. Do you have any idea how to extract the Gene Symbols? E.g. for "UNKLLLL64718", "UNK" is a Gene Symbol, but I can't find a pattern.
Seems very tricky. Using regular expressions you could try something like
where the pattern tries to identify symbols as at least 3 letters / numbers long, and then all non-L symbols after the third. This would fail for SPARCL1, for instance...
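The snippet that belonged to this reply is not shown here. Purely as a guess at what a pattern matching that description could look like (three letters or numbers, then everything up to the next "L"), one might try the sketch below. The pattern and the name guess are assumptions, not the original code.

```r
GG <- c("UNKLLLL64718", "FLJ45340LLL402483", "CROCCL2LLL114819",
        "ZNF345LLL25850")

# Hypothetical pattern: three alphanumerics, then any run of non-"L" characters;
# the rest of the string is discarded
guess <- sub("^([[:alnum:]]{3}[^L]*).*$", "\\1", GG)
guess
```

On these entries the guess recovers FLJ45340 and ZNF345, but it returns "UNK" for UNKL and "CROCC" for CROCCL2, which shows why a pattern working from the front of the string is fragile whenever the symbol itself contains an "L" after the third character.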
I got it. The last four numbers in each entry are the ENTREZIDs... Took me time as I have no expertise in this area.
I guess it's actually the trailing digits?
Yea, it is the trailing digits. Thanks for the code. I wrote my own code too:
library(stringr)

f1 <- function(x) {
    ENTIDS <- c()
    for (fa in x) {
        si <- str_length(fa)
        ent1 <- c()
        # walk backwards from the last character until a non-digit is found
        for (k in seq_len(si)) {
            ss <- str_sub(fa, -k, -k)
            nl <- grepl("\\d", ss)
            if (!nl) {
                # the ID is the k - 1 characters after that non-digit
                ent <- str_sub(fa, -(k - 1), -1)
                ent1 <- append(ent1, ent)
                break
            }
        }
        ENTIDS <- append(ENTIDS, ent1)
    }
    ENTIDS
}
Thanks for sharing your code, and sorry for now commenting on it!
In R it's usually better to 'vectorize', which is to say call a function once for each vector, rather than once for each element as you do above.
It's also very inefficient to follow the 'copy and append' pattern that you use above, where you create an empty result and append to it (ent1 <- c(); ... ent1 <- append(ent1, ent)). You can demonstrate this to yourself by timing the pattern for increasing values of n: the execution time scales as the square of n, and the process of concatenating 100000 values should really not take 20s on a modern computer! The alternatives are to use lapply() / vapply() to manage memory, or to pre-allocate and fill (ent1 <- integer(n) ... ent1[[i]] <- i), perhaps with a filter step to remove unassigned values.
And finally, the stringr package is indeed useful, but does one gain anything over base R functions in this case? Each package dependency makes your code more fragile, because you are now exposed to changes in the package's code (and that of its dependent packages).
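To see the cost of copy-and-append for yourself, a rough timing sketch along these lines may help. The function names grow, fill and apply_it are made up for illustration, n is kept small so it runs quickly, and the actual timings will depend on your machine and R version.

```r
# Copy-and-append: every append() copies the whole vector built so far
grow <- function(n) {
    x <- integer()
    for (i in seq_len(n)) x <- append(x, i)
    x
}

# Pre-allocate and fill: the result vector is allocated once
fill <- function(n) {
    x <- integer(n)
    for (i in seq_len(n)) x[[i]] <- i
    x
}

# Let vapply() manage the result vector
apply_it <- function(n) vapply(seq_len(n), function(i) i, integer(1))

n <- 20000
stopifnot(identical(grow(n), fill(n)), identical(fill(n), apply_it(n)))

system.time(grow(n))
system.time(fill(n))
system.time(apply_it(n))
```

Doubling n should roughly quadruple the time for grow(), while fill() and apply_it() should grow roughly linearly.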
By way of comparison, I put your code and my code into functions, made sure the results were the same, and then timed their execution. Obviously the time difference the user experiences between 19 and 320 microseconds is not meaningful here, but the value of vectorization is apparent. Likewise, I suppose your more complicated code addresses 'edge cases' that are not present in your sample data; it would be interesting to present those and to arrive at a vectorized solution...
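The functions used for that comparison are not shown here, so the sketch below is only a stand-in: f_loop is a base R version of the element-by-element idea, and f_vec uses the regmatches() call quoted earlier in this thread. Both names are hypothetical, and both assume every entry ends in digits preceded by at least one non-digit.

```r
GG <- c("UNKLLLL64718", "FLJ45340LLL402483", "CROCCL2LLL114819",
        "SFRS18LLL25957")

# Element-by-element: scan each string backwards to the first non-digit
f_loop <- function(x) {
    ids <- character(length(x))          # pre-allocated result
    for (i in seq_along(x)) {
        chars <- strsplit(x[[i]], "")[[1]]
        k <- length(chars)
        while (k > 0 && grepl("[0-9]", chars[[k]])) k <- k - 1
        ids[[i]] <- substring(x[[i]], k + 1)
    }
    ids
}

# Vectorized: one regular-expression call for the whole vector
f_vec <- function(x) regmatches(x, regexpr("[0-9]+$", x))

# Same answers, then a crude timing on a longer vector
stopifnot(identical(f_loop(GG), f_vec(GG)))
big <- rep(GG, 2500)
system.time(f_loop(big))
system.time(f_vec(big))
```

system.time() is coarse compared with a dedicated benchmarking package, but on a vector of this length the gap between the two approaches should still be visible.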
Thanks for the elaborate answer.