Question: Mapping of GENE symbols into ENTREZ
10 days ago by
omarrafiqued50
omarrafiqued50 wrote:

Below, GG is the list of gene symbols to be mapped to Entrez IDs. The code is:

geneMapping2= select(org.Hs.eg.db, GG, c("ENTREZID","GENENAME"),"ALIAS")


The error is:

Error in .testForValidKeys(x, keys, keytype, fks) :
None of the keys entered are valid keys for 'ALIAS'. Please use the keys method to see a listing of valid arguments.


GG is supposed to contain GENE IDs for human Kidney cancer. The first 10 entries in GG are :

 [1] "UNKLLLL64718"           [2]  "FLJ45340LLL402483"
[3] "CROCCL2LLL114819"       [4]  "NCRNA00115LLL79854"
[5] "RG9MTD3LLL158234"       [6]  "ZNF345LLL25850"
[7] "ZMAT1LLL84460"          [8]  "SUGT1L1LLL283507"
[9] "LOC100128842LLL100128842" [10] "SFRS18LLL25957"


Any help would be greatly appreciated. Thanks!

annotation limma gene • 113 views
modified 9 days ago by Gordon Smyth38k • written 10 days ago by omarrafiqued50
Answer: Mapping of GENE symbols into ENTREZ
9 days ago by
Martin Morgan ♦♦ 23k
United States
Martin Morgan ♦♦ 23k wrote:

Those are not gene symbols, but something else, maybe gene symbol (CROCC, ZNF345) plus additional information (L2LLL114819, LLL25850). You'll need to figure out how to extract the gene symbols from your entries in GG; the rest of your code looks ok.


Thanks, and yes, I figured that out just now. Do you have any idea how to extract the gene symbols? E.g. for "UNKLLLL64718", "UNK" is a gene symbol, but I can't find a pattern.

Seems very tricky. Using regular expressions you could try something like

pattern = "^.{3}[^L]*"
regmatches(GG, regexpr(pattern, GG))


where the pattern tries to identify symbols as at least 3 letters / numbers long, and then all non-L symbols after the third. This would fail for SPARCL1, for instance...
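As a rough illustration of how this pattern behaves, here is a minimal sketch run on three of the entries posted in the question. The comparison in the comments uses the symbols that the "LLL"-delimiter reading in Gordon Smyth's answer gives; the variable name guess is only for illustration.

```r
# Three of the sample entries from the question.
GG <- c("UNKLLLL64718", "CROCCL2LLL114819", "ZNF345LLL25850")

# Any three characters, then everything up to (not including) the next "L".
pattern <- "^.{3}[^L]*"
guess <- regmatches(GG, regexpr(pattern, GG))
guess
# "UNK" "CROCC" "ZNF345"

# Only ZNF345 matches the delimiter-based reading (UNKL, CROCCL2, ZNF345):
# any symbol with an "L" after its third character gets cut short,
# which is the same failure mode as SPARCL1.
```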


I got it. The last four numbers in each entry are the ENTREZIDs... It took me some time as I have no expertise in this area.


I guess it's actually the trailing digits?

regmatches(GG, regexpr("\\d+$", GG))

written 9 days ago by Martin Morgan ♦♦ 23k

Yea, it is the trailing digits. Thanks for the code. I wrote my own code too:

f1 <- function(x) {
    ENTIDS <- c()
    for (fa in x) {
        si <- str_length(fa)
        # str_sub(fa, -4, -4)
        ent1 <- c()
        for (k in seq(1:si)) {
            ss = str_sub(fa, -k, -k)
            nl <- grepl("\\d", ss)
            if (nl == FALSE) {
                # print(k)
                ent <- str_sub(fa, -(k-1), -1)
                ent1 <- append(ent1, ent)
                break
            }
            # print(ent)
        }
        ENTIDS <- append(ENTIDS, ent1)
    }
    ENTIDS
}

modified 8 days ago • written 9 days ago by omarrafiqued50

Thanks for sharing your code, and sorry for now commenting on it! In R it's usually better to 'vectorize', which is to say call a function once for each vector, rather than once for each element as you do above. It's also very inefficient to follow the 'copy and append' pattern that you use above, where you create an empty result and append to it (ent1 <- c(); ... ent1 <- append(ent1, ent)), which you can demonstrate to yourself by trying

n <- 1000; system.time({ x <- c(); for (i in 1:n) x <- c(x, i) })

I have

> n <- 1000; system.time({ x <- c(); for (i in 1:n) x <- c(x, i) })
   user  system elapsed
  0.004   0.000   0.004
> n <- 10000; system.time({ x <- c(); for (i in 1:n) x <- c(x, i) })
   user  system elapsed
  0.144   0.023   0.166
> n <- 100000; system.time({ x <- c(); for (i in 1:n) x <- c(x, i) })
   user  system elapsed
 15.391   4.553  19.959

where you can see that the execution time scales roughly as the square of n, and the process of concatenating 100000 values should really not take 20s on a modern computer! The alternatives are to use lapply() / vapply() to manage memory, or to pre-allocate and fill (ent1 <- integer(n) ... ent1[[i]] <- i), perhaps with a filter step to remove unassigned values.

And finally, the stringr package is indeed useful, but does one gain anything over base R functions in this case? Each package dependency makes your code more fragile, because you are now exposed to changes in the package's code (and in that of its dependent packages).

> db = available.packages(repos=BiocManager::repositories())
> tools::package_dependencies("stringr", db, recursive=TRUE)[[1]]
[1] "glue"     "magrittr" "stringi"  "methods"  "tools"    "utils"    "stats"

By way of comparison, I put your code and my code into functions

library(stringr)  # provides str_length() and str_sub() used by f1()

f1 <- function(x) {
    ENTIDS <- c()
    for (fa in x) {
        si <- str_length(fa)
        # str_sub(fa, -4, -4)
        ent1 <- c()
        for (k in seq(1:si)) {
            ss = str_sub(fa, -k, -k)
            nl <- grepl("\\d", ss)
            if (nl == FALSE) {
                # print(k)
                ent <- str_sub(fa, -(k-1), -1)
                ent1 <- append(ent1, ent)
                break
            }
            # print(ent)
        }
        ENTIDS <- append(ENTIDS, ent1)
    }
    ENTIDS
}

f2 <- function(x) regmatches(x, regexpr("\\d+$", x))


made sure the results were the same

> identical(f1(GG1), f2(GG1))
[1] TRUE


and then timed their execution

> library(microbenchmark)
> microbenchmark(f1(GG1), f2(GG1))
Unit: microseconds
expr     min       lq      mean  median      uq     max neval cld
f1(GG1) 285.166 306.8635 318.22949 320.423 328.166 426.162   100   b
f2(GG1)  16.024  18.2115  20.41742  19.360  21.579  60.132   100  a


Obviously the user-experienced time difference between 19 and 320 microseconds is not meaningful here, but the value of vectorization is apparent. Likewise, I suppose your more complicated code addresses 'edge cases' that are not present in your sample data; it would be interesting to present those and to arrive at a vectorized solution...
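The alternatives to 'copy and append' mentioned above (pre-allocate and fill, or vapply()) could look roughly like the following minimal sketch. The function names grow, prealloc and applied are made up for illustration, and the timings are left for you to run since they depend on the machine.

```r
# Copy-and-append: each c() call copies the whole vector built so far.
grow <- function(n) {
    x <- c()
    for (i in seq_len(n)) x <- c(x, i)
    x
}

# Pre-allocate the full result once, then fill it in place.
prealloc <- function(n) {
    x <- integer(n)
    for (i in seq_len(n)) x[[i]] <- i
    x
}

# Let vapply() manage the result vector; integer(1) declares that each
# call returns exactly one integer.
applied <- function(n) vapply(seq_len(n), function(i) i, integer(1))

n <- 10000
stopifnot(identical(grow(n), prealloc(n)),
          identical(prealloc(n), applied(n)))

# Compare the elapsed times on your own machine.
system.time(grow(n))
system.time(prealloc(n))
system.time(applied(n))
```

All three return the same integer vector; only the way the result is built differs.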

Answer: Mapping of GENE symbols into ENTREZ
9 days ago by
Gordon Smyth38k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth38k wrote:

LLL is the delimiter

From looking at the entries of your GG vector, it is clear that someone has assembled gene symbols and Entrez Gene IDs, but the symbols and IDs have become concatenated together, separated by "LLL". My guess is that you started with delimited entries such as "UNKL;64718" but you or someone else has mistakenly substituted "LLL" for all the delimiters ";". Needless to say, this is not standard and should never have happened.

Using strsplit()

Anyway, knowing that "LLL" is the delimiter, you could unconcatenate the symbols and ids like this:

> GG <- c("UNKLLLL64718", "FLJ45340LLL402483", "CROCCL2LLL114819",
+         "NCRNA00115LLL79854", "RG9MTD3LLL158234", "ZNF345LLL25850",
+         "ZMAT1LLL84460", "SUGT1L1LLL283507",
+         "LOC100128842LLL100128842", "SFRS18LLL25957")
> GG_ <- sub("LLLLL","__LLL", GG)
> GG_ <- sub("LLLL","_LLL", GG_)
> Genes <- unlist(strsplit(GG_, split="LLL"))
> Genes <- gsub("_", "L", Genes)
> Genes <- matrix(Genes, ncol=2, byrow=TRUE)
> colnames(Genes) <- c("Symbol", "EntrezID")
> Genes
      Symbol         EntrezID
 [1,] "UNKL"         "64718"
 [2,] "FLJ45340"     "402483"
 [3,] "CROCCL2"      "114819"
 [4,] "NCRNA00115"   "79854"
 [5,] "RG9MTD3"      "158234"
 [6,] "ZNF345"       "25850"
 [7,] "ZMAT1"        "84460"
 [8,] "SUGT1L1"      "283507"
 [9,] "LOC100128842" "100128842"
[10,] "SFRS18"       "25957"


The code has to allow for symbols that end in "L" or "LL". My code should handle all cases because there are no human gene symbols ending in more than two Ls.
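As a possible alternative to the two sub() calls, here is a minimal sketch (not part of the code above) that does the split with one anchored regular expression. It assumes every entry has the form symbol + "LLL" + digits. Because the ID is anchored as the trailing run of digits, the three Ls immediately before it must be the delimiter, however many Ls the symbol itself ends with. The name re is only for illustration.

```r
# A few of the sample entries, including symbols ending in "L" or containing "L".
GG <- c("UNKLLLL64718", "CROCCL2LLL114819",
        "SUGT1L1LLL283507", "LOC100128842LLL100128842")

# Group 1: the symbol; then the "LLL" delimiter; group 2: the trailing digits.
re <- "^(.*)LLL([0-9]+)$"
Symbol   <- sub(re, "\\1", GG)
EntrezID <- sub(re, "\\2", GG)

cbind(Symbol, EntrezID)
```

An entry that does not fit the assumed form is returned unchanged by sub(), so it is worth checking grepl(re, GG) before trusting the result.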

Using substring()

Base R is very powerful and there are always lots of ways to do the same thing. Here is a somewhat shorter and faster way, starting with Martin's code:

> EntrezID <- regmatches(GG, regexpr("\\d+$", GG))
> n <- nchar(GG) - nchar(EntrezID) - 3
> Symbol <- substring(GG, 1, n)
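Once the IDs are separated out, the lookup from the original question could be keyed on the Entrez IDs rather than on aliases. The following is only a sketch under that assumption: it uses three of the sample entries, and the annotation step runs only if the org.Hs.eg.db and AnnotationDbi packages happen to be installed. The name geneMapping is made up for illustration.

```r
# Three of the sample entries from the question.
GG <- c("UNKLLLL64718", "ZNF345LLL25850", "ZMAT1LLL84460")

# Trailing digits are the Entrez ID; drop them and the 3-character
# "LLL" delimiter to get the symbol.
EntrezID <- regmatches(GG, regexpr("\\d+$", GG))
Symbol   <- substring(GG, 1, nchar(GG) - nchar(EntrezID) - 3)

# Optional annotation step, skipped when the packages are not available.
if (requireNamespace("org.Hs.eg.db", quietly = TRUE) &&
    requireNamespace("AnnotationDbi", quietly = TRUE)) {
    geneMapping <- AnnotationDbi::select(org.Hs.eg.db::org.Hs.eg.db,
                                         keys = EntrezID,
                                         columns = c("SYMBOL", "GENENAME"),
                                         keytype = "ENTREZID")
    print(geneMapping)
}
```

Comparing the returned SYMBOL column with Symbol is one way to spot symbols that have since been renamed.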


On the other hand ...

On the other hand, you could consider going back to the data repository from which you got the vector GG, and trying to obtain the original data before all the LLLs got inserted in the first place.


This is very useful. Thanks.