Question

Mapping of GENE symbols into ENTREZ

omarrafiqued ▴ 50

Below, GG is the list of gene symbols to be mapped to Entrez IDs. The code is:

geneMapping2= select(org.Hs.eg.db, GG, c("ENTREZID","GENENAME"),"ALIAS")

The error is:

Error in .testForValidKeys(x, keys, keytype, fks) : 
  None of the keys entered are valid keys for 'ALIAS'. Please use the keys method to see a listing of valid arguments.

GG is supposed to contain gene IDs for human kidney cancer. The first 10 entries in GG are:

 [1] "UNKLLLL64718"           [2]  "FLJ45340LLL402483"       
 [3] "CROCCL2LLL114819"       [4]  "NCRNA00115LLL79854"      
 [5] "RG9MTD3LLL158234"       [6]  "ZNF345LLL25850"          
 [7] "ZMAT1LLL84460"          [8]  "SUGT1L1LLL283507"        
 [9] "LOC100128842LLL100128842" [10] "SFRS18LLL25957"

Any help would be greatly appreciated. Thanks!

ADD COMMENT • link 4.6 years ago • updated 4.2 years ago omarrafiqued ▴ 50

score 1 · Answer 1 · 2019-09-07

Martin Morgan 25k

Those are not gene symbols, but something else, maybe gene symbol (CROCC, ZNF345) plus additional information (L2LLL114819, LLL25850). You'll need to figure out how to extract the gene symbols from your entries in GG; the rest of your code looks ok.

ADD COMMENT • link 4.6 years ago Martin Morgan 25k

Thanks, and yes, I figured that out just now. Do you have any idea how to extract the gene symbols? For example, in "UNKLLLL64718", "UNK" is a gene symbol, but I can't find a pattern.

ADD REPLY • link 4.6 years ago omarrafiqued ▴ 50

Seems very tricky. Using regular expressions you could try something like

pattern = "^.{3}[^L]*"
regmatches(GG, regexpr(pattern, GG))

where the pattern tries to identify symbols as at least 3 letters / numbers long, and then all non-L symbols after the third. This would fail for SPARCL1, for instance...
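
As a rough illustration of that limitation, the sketch below applies the same pattern to four of the entries from the question. The `expected` vector is an assumption taken from the symbols listed in the second answer of this thread, and the names `guess` and `expected` are only for illustration.

```r
# Four entries from the question
GG <- c("UNKLLLL64718", "FLJ45340LLL402483",
        "CROCCL2LLL114819", "ZNF345LLL25850")

# First three characters, then everything up to (not including) the next "L"
pattern <- "^.{3}[^L]*"
guess <- regmatches(GG, regexpr(pattern, GG))

# Symbols as reported in the second answer of this thread (assumed correct)
expected <- c("UNKL", "FLJ45340", "CROCCL2", "ZNF345")

# Side-by-side comparison; symbols containing an "L" after position 3 are cut short
data.frame(entry = GG, guess = guess, expected = expected,
           ok = guess == expected, stringsAsFactors = FALSE)
```

The heuristic works for symbols with no "L" after the third character (FLJ45340, ZNF345) but truncates UNKL to UNK and CROCCL2 to CROCC, the same failure mode as SPARCL1.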

ADD REPLY • link 4.6 years ago Martin Morgan 25k

I got it. The last four numbers in each entry are the ENTREZIDs. It took me some time as I have no expertise in this area.

ADD REPLY • link 4.6 years ago omarrafiqued ▴ 50

I guess it's actually the trailing digits?

regmatches(GG, regexpr("\\d+$", GG))
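
If the trailing digits really are Entrez IDs, one possible way to finish the original task is to use them as keys with keytype "ENTREZID" instead of "ALIAS". This is only a sketch: it assumes org.Hs.eg.db and AnnotationDbi are installed, and the names `entrez_ids` and `gene_mapping` are made up for the example. IDs that have since been retired may come back with NA in the other columns.

```r
# Three entries from the question
GG <- c("UNKLLLL64718", "FLJ45340LLL402483", "CROCCL2LLL114819")

# The run of digits at the end of each entry is taken to be the Entrez ID
entrez_ids <- regmatches(GG, regexpr("\\d+$", GG))

# Only attempt the lookup when the annotation packages are available
if (requireNamespace("org.Hs.eg.db", quietly = TRUE) &&
    requireNamespace("AnnotationDbi", quietly = TRUE)) {
    gene_mapping <- AnnotationDbi::select(
        org.Hs.eg.db::org.Hs.eg.db,
        keys    = entrez_ids,
        columns = c("SYMBOL", "GENENAME"),
        keytype = "ENTREZID"
    )
    print(gene_mapping)
}
```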

ADD REPLY • link 4.6 years ago Martin Morgan 25k

Yea, it is the trailing digits. Thanks for the code. I wrote my own code too:

library(stringr)

f1 <- function(x) {
    ENTIDS <- c()
    for (fa in x) {
        si <- str_length(fa)
        ent1 <- c()
        # walk backwards from the last character until a non-digit is found
        for (k in seq(1:si)) {
            ss <- str_sub(fa, -k, -k)
            nl <- grepl("\\d", ss)
            if (nl == FALSE) {
                # the trailing k - 1 characters are the Entrez ID
                ent <- str_sub(fa, -(k - 1), -1)
                ent1 <- append(ent1, ent)
                break
            }
        }
        ENTIDS <- append(ENTIDS, ent1)
    }
    ENTIDS
}

ADD REPLY • link 4.6 years ago omarrafiqued ▴ 50

Thanks for sharing your code, and sorry for now commenting on it!

In R it's usually better to 'vectorize', which is to say call a function once for each vector, rather than once for each element as you do above.

It's also very inefficient to follow the 'copy and append' pattern that you use above, where you create an empty result and append to it (ent1 <- c(); ... ent1 <- append(ent1, ent)). You can demonstrate this to yourself by trying

n <- 1000; system.time({ x <- c(); for (i in 1:n) x <- c(x, i) })

I have

>     n <- 1000; system.time({ x <- c(); for (i in 1:n) x <- c(x, i) })
   user  system elapsed
  0.004   0.000   0.004
>     n <- 10000; system.time({ x <- c(); for (i in 1:n) x <- c(x, i) })
   user  system elapsed
  0.144   0.023   0.166
>     n <- 100000; system.time({ x <- c(); for (i in 1:n) x <- c(x, i) })
   user  system elapsed
 15.391   4.553  19.959

where you can see that the execution time scales as the square of n, and the process of concatenating 100000 values should really not take 20s on a modern computer! The alternatives are to use lapply() / vapply() to manage memory or to pre-allocate and fill (ent1 <- integer(n)... ent1[[i]] <- i) perhaps with a filter step to remove unassigned values.
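
The alternatives mentioned above might look something like the following sketch. The function names `grow`, `prealloc` and `via_vapply` are made up for illustration, and timings will differ between machines, so none are quoted here.

```r
# Copy-and-append: the vector is reallocated and copied on every iteration
grow <- function(n) {
    x <- c()
    for (i in seq_len(n)) x <- c(x, i)
    x
}

# Pre-allocate and fill: the result is allocated once, then assigned in place
prealloc <- function(n) {
    x <- integer(n)
    for (i in seq_len(n)) x[[i]] <- i
    x
}

# vapply: R manages the result vector; integer(1) declares the per-element type
via_vapply <- function(n) {
    vapply(seq_len(n), function(i) i, integer(1))
}

n <- 2000
# All three strategies should produce the same vector
stopifnot(identical(grow(n), prealloc(n)),
          identical(prealloc(n), via_vapply(n)))

# Compare timings on your own machine
system.time(grow(n))
system.time(prealloc(n))
system.time(via_vapply(n))
```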

And finally, the stringr package is indeed useful, but does one gain anything over base R functions in this case? Each package dependency makes your code more fragile, because you are now exposed to changes in the code of that package (and of the packages it depends on).

> db = available.packages(repos=BiocManager::repositories())
> tools::package_dependencies("stringr", db, recursive=TRUE)[[1]]
[1] "glue"     "magrittr" "stringi"  "methods"  "tools"    "utils"    "stats"

By way of comparison, I put your code and my code into functions

f1 <- function(x) {
    ENTIDS <- c()
    for(fa in x){
        si<-str_length(fa)
        #str_sub(fa,-4,-4)
        ent1<-c()
        for (k in seq(1:si)){
            ss=str_sub(fa,-k,-k)
            nl<-grepl("\\d", ss)
            if(nl==FALSE){
                 #print(k)
                ent<-str_sub(fa,-(k-1),-1)
                ent1<-append(ent1,ent)
                break
            }
                                        #print(ent)
        }
        ENTIDS<-append(ENTIDS,ent1)
    }
    ENTIDS
}

f2 <- function(x)
    regmatches(x, regexpr("\\d+$", x))

made sure the results were the same

> identical(f1(GG1), f2(GG1))
[1] TRUE

and then timed their execution

> library(microbenchmark)
> microbenchmark(f1(GG1), f2(GG1))
Unit: microseconds
    expr     min       lq      mean  median      uq     max neval cld
 f1(GG1) 285.166 306.8635 318.22949 320.423 328.166 426.162   100   b
 f2(GG1)  16.024  18.2115  20.41742  19.360  21.579  60.132   100  a

Obviously the difference between 19 and 320 microseconds is not something a user would notice here, but the value of vectorization is apparent. Likewise, I suppose your more complicated code addresses 'edge cases' that are not present in your sample data; it would be interesting to present those and to arrive at a vectorized solution...

ADD REPLY • link 4.6 years ago Martin Morgan 25k

Thanks for the elaborate answer.

ADD REPLY • link 4.6 years ago omarrafiqued ▴ 50

score 1 · Answer 2 · 2019-09-07

LLL is the delimiter

From looking at the entries of your GG vector, it is clear that someone has assembled gene symbols and Entrez Gene IDs, but the symbols and IDs have become concatenated together, separated by "LLL". My guess is that you started with delimited entries such as "UNKL;64718" but you or someone else has mistakenly substituted "LLL" for all the delimiters ";". Needless to say, this is not standard and should never have happened.

Using strsplit()

Anyway, knowing that "LLL" is the delimiter, you could unconcatenate the symbols and ids like this:

> GG <- c("UNKLLLL64718", "FLJ45340LLL402483", "CROCCL2LLL114819",
+         "NCRNA00115LLL79854", "RG9MTD3LLL158234", "ZNF345LLL25850",
+         "ZMAT1LLL84460", "SUGT1L1LLL283507",
+         "LOC100128842LLL100128842", "SFRS18LLL25957")
> GG_ <- sub("LLLLL","__LLL", GG)
> GG_ <- sub("LLLL","_LLL", GG_)
> Genes <- unlist(strsplit(GG_, split="LLL"))
> Genes <- gsub("_", "L", Genes)
> Genes <- matrix(Genes, ncol=2, byrow=TRUE)
> colnames(Genes) <- c("Symbol", "EntrezID")
> Genes
      Symbol         EntrezID   
 [1,] "UNKL"         "64718"    
 [2,] "FLJ45340"     "402483"   
 [3,] "CROCCL2"      "114819"   
 [4,] "NCRNA00115"   "79854"    
 [5,] "RG9MTD3"      "158234"   
 [6,] "ZNF345"       "25850"    
 [7,] "ZMAT1"        "84460"    
 [8,] "SUGT1L1"      "283507"   
 [9,] "LOC100128842" "100128842"
[10,] "SFRS18"       "25957"

The code has to allow for symbols that end in "L" or "LL". My code should handle all cases because there are no human gene symbols ending in more than two Ls.
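
For comparison, here is a sketch of a variant that does not depend on how many Ls a symbol ends in. It assumes every entry has the form symbol, then "LLL", then digits only, and relies on the greedy `.*` capturing as much as possible before the final "LLL" (`perl = TRUE` is used so that the greedy behaviour is unambiguous). The name `pattern` is just for illustration.

```r
GG <- c("UNKLLLL64718", "FLJ45340LLL402483", "CROCCL2LLL114819",
        "NCRNA00115LLL79854", "RG9MTD3LLL158234", "ZNF345LLL25850",
        "ZMAT1LLL84460", "SUGT1L1LLL283507",
        "LOC100128842LLL100128842", "SFRS18LLL25957")

# Group 1: everything before the last "LLL"; group 2: the trailing digits
pattern <- "^(.*)LLL([0-9]+)$"

Symbol   <- sub(pattern, "\\1", GG, perl = TRUE)
EntrezID <- sub(pattern, "\\2", GG, perl = TRUE)

Genes <- cbind(Symbol, EntrezID)
Genes
```

Entries that do not match the pattern are returned unchanged by sub(), so it is worth checking `grepl(pattern, GG, perl = TRUE)` on real data before trusting the result.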

Using substring()

Base R is very powerful and there are always lots of ways to do the same thing. Here is a somewhat shorter and faster way, starting with Martin's code:

> EntrezID <- regmatches(GG, regexpr("\\d+$", GG))
> n <- nchar(GG) - nchar(EntrezID) - 3
> Symbol <- substring(GG, 1, n)
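
A self-contained version of this approach, with a simple consistency check, could look like the sketch below. The `delimiter` variable and the round-trip check are additions for illustration; the check assumes every entry ends in digits, since regmatches() silently drops entries that do not match.

```r
GG <- c("UNKLLLL64718", "FLJ45340LLL402483", "CROCCL2LLL114819",
        "NCRNA00115LLL79854", "RG9MTD3LLL158234", "ZNF345LLL25850",
        "ZMAT1LLL84460", "SUGT1L1LLL283507",
        "LOC100128842LLL100128842", "SFRS18LLL25957")

delimiter <- "LLL"

# Trailing digits are the Entrez IDs
EntrezID <- regmatches(GG, regexpr("\\d+$", GG))

# Every entry must have produced an ID, otherwise the vectors would not line up
stopifnot(length(EntrezID) == length(GG))

# The symbol is whatever precedes the delimiter and the ID
n <- nchar(GG) - nchar(EntrezID) - nchar(delimiter)
Symbol <- substring(GG, 1, n)

# Round trip: gluing the pieces back together should reproduce the input
stopifnot(identical(paste0(Symbol, delimiter, EntrezID), GG))

Genes <- data.frame(Symbol, EntrezID, stringsAsFactors = FALSE)
Genes
```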

On the other hand ...

On the other hand, you could consider going back to the data repository from which you got the vector GG, and trying to obtain the original data before all the LLLs got inserted in the first place.