error in getting entrez ids for gene names
1
0
Entering edit mode
gk13102603 • 0
@45d07326
Last seen 17 months ago
United States

Hi,

I am trying to convert gene names to enterz ids for further analysis. I am using:

BiocManager::install("AnnotationHub")
library(AnnotationHub)
hub <- AnnotationHub()
SL <- query(hub, "Solanum lycopersicum") 
SL["AH94247"] 
SL_OrgDb <- SL[["AH94247"]] 
columns(SL_OrgDb) 

BiocManager::install("clusterProfiler")
library(clusterProfiler)
genes <- read.table("genes.txt", quote="\"", comment.char="")
geneList<-genes
geneList1<-(geneList$V1)

gene.df <- bitr(geneList1, fromType = c("GENENAME"),
                toType = "ENTREZID",
                OrgDb = sl_OrgDb)

However, I got an error: Error in .testForValidKeys(x, keys, keytype, fks) : None of the keys entered are valid keys for 'GENENAME'. Please use the keys method to see a listing of valid arguments.

Thank you!

Bioconductor • 1.7k views
ADD COMMENT
0
Entering edit mode
@james-w-macdonald-5106
Last seen 2 days ago
United States

First off, you are using a years-old version of R and Bioconductor (that object you are using is specific to Bioc 3.13, and we are on Bioc 3.16). You should update.

If you mean to convert gene symbols to NCBI Gene IDs, you want SYMBOL, not GENENAME.

> library(AnnotationHub)
> hub <- AnnotationHub()
snapshotDate(): 2022-10-26

> query(hub, c("orgdb","solanum"))
AnnotationHub with 8 records
# snapshotDate(): 2022-10-26
# $dataprovider: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
# $species: Solanum verrucosum, Solanum tuberosum, Solanum stenotomum, Solan...
# $rdataclass: OrgDb
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype 
# retrieve records with, e.g., 'object[["AH107616"]]' 

             title                                             
  AH107616 | org.Solanum_stenotomum.eg.sqlite                  
  AH107725 | org.Solanum_verrucosum.eg.sqlite                  
  AH107737 | org.Solanum_pennellii_Correll,_1958.eg.sqlite     
  AH107738 | org.Solanum_pennellii.eg.sqlite                   
  AH107800 | org.Solanum_tuberosum.eg.sqlite                   
  AH107911 | org.Solanum_esculentum.eg.sqlite                  
  AH107912 | org.Solanum_lycopersicum.eg.sqlite                
  AH107913 | org.Solanum_lycopersicum_var._humboldtii.eg.sqlite
> orgdb <- hub[["AH107912"]]

> head(keys(orgdb, "GENENAME"))
[1] "(+)-menthofuran synthase-like"                       
[2] "(+)-neomenthol dehydrogenase"                        
[3] "(+)-neomenthol dehydrogenase-like"                   
[4] "(-)-camphene/tricyclene synthase, chloroplastic"     
[5] "(-)-camphene/tricyclene synthase, chloroplastic-like"
[6] "(-)-germacrene D synthase"

GENENAME, is as the name suggests, the name of the gene. SYMBOL is the official gene symbol.

ADD COMMENT
0
Entering edit mode

Thank you! I have updated the versions. I used "GENENAME" because the gene list that I am using has genes such as Solyc05g012020. Could you help on what will be the right key here?

ADD REPLY
0
Entering edit mode

That's almost an Ensembl Gene ID

> library(biomaRt)
> mart <- useMart("plants_mart", "slycopersicum_eg_gene","https://plants.ensembl.org")
> getBM(c("ensembl_gene_id","entrezgene_id"), "ensembl_gene_id", "Solyc05g012020", mart)
[1] ensembl_gene_id entrezgene_id  
<0 rows> (or 0-length row.names)
#Urf. Look up using Google.
## Oh, add a .3?
> getBM(c("ensembl_gene_id","entrezgene_id"), "ensembl_gene_id", "Solyc05g012020.3", mart)
   ensembl_gene_id entrezgene_id
1 Solyc05g012020.3        778278

So if you have the version numbers as well, you should be able to map. Usually without the .3 it's the stable ID, but maybe it's different for plants.

Also, you already have the Ensembl IDs, so maybe you can go forward with that?

ADD REPLY
0
Entering edit mode

Thank you! So I can extract entrez ids using this and then run downstream analysis. I also tried using an object for a list of genes to use in getBM but it does not give any output without an error.

Can I use a gene list of Ensembl ids for getBM to get entrez ids? Thank you!

ADD REPLY
1
Entering edit mode

Well, you are probably going to have to do something more brute force if you don't have the version numbers.

> library(biomaRt)
> mart <- useMart("plants_mart", "slycopersicum_eg_gene","https://plants.ensembl.org")

## Just get everything

> z <- getBM(c("ensembl_gene_id","entrezgene_id"), mart = mart)
> head(z)
   ensembl_gene_id entrezgene_id
1 Solyc12g009745.1            NA
2 Solyc10g008210.3     101260876
3 Solyc06g072480.1     100191115
4 Solyc09g057715.1            NA
5 Solyc05g015710.3     101249549
6 Solyc12g097070.2            NA

## strip off the version numbers

> z[,1] <- gsub("\\.[0-9]+", "", z[,1])
> head(z)
  ensembl_gene_id entrezgene_id
1  Solyc12g009745            NA
2  Solyc10g008210     101260876
3  Solyc06g072480     100191115
4  Solyc09g057715            NA
5  Solyc05g015710     101249549
6  Solyc12g097070            NA

And then you can use match to line up the rows correctly. It can be tricky, so here's an example.

## first some random IDs 
## I do this because I don't have your input data - it's just an example
> existingIDs <- z[sample(1:nrow(z), 20),1,drop = FALSE]
> existingIDs
      ensembl_gene_id
17384  Solyc03g034260
29746  Solyc06g053900
4623   Solyc06g074770
27281  Solyc01g107650
20266  Solyc03g044910
6263   Solyc09g066110
21173  Solyc03g082800
8365   Solyc01g087860
32536  Solyc05g008670
22200  Solyc11g008140
16100  Solyc05g051515
297    Solyc09g008940
23722  Solyc05g024020
8632   Solyc11g028010
6869   Solyc09g074117
29196  Solyc04g050090
10096  Solyc02g093850
3617   Solyc09g014820
32554  Solyc02g078460
17723  Solyc10g055080

## Now match
> z[match(existingIDs[,1], z[,1]),] 
      ensembl_gene_id entrezgene_id
17384  Solyc03g034260     101252664
29746  Solyc06g053900     101250807
4623   Solyc06g074770            NA
27281  Solyc01g107650            NA
20266  Solyc03g044910     101253978
6263   Solyc09g066110            NA
21173  Solyc03g082800            NA
8365   Solyc01g087860     104648672
32536  Solyc05g008670            NA
22200  Solyc11g008140            NA
16098  Solyc05g051515     101263309
297    Solyc09g008940            NA
23722  Solyc05g024020            NA
8632   Solyc11g028010            NA
6869   Solyc09g074117            NA
29196  Solyc04g050090            NA
10096  Solyc02g093850     101248928
3617   Solyc09g014820     101245578
32554  Solyc02g078460     101244445
17723  Solyc10g055080            NA

## Note that the row order is the same as my example input data.
ADD REPLY
1
Entering edit mode

This also illustrates my longstanding contention that you shouldn't map between annotation services unless absolutely necessary. Those NA values represent things that EBI/EMBL say are genes, but for whatever technical reason are not mappable to the corresponding NCBI Gene ID. There may be NCBI Gene IDs that should map, but it's a rules based process, and if the corresponding NCBI Gene doesn't meet all the criteria, it's a no match.

ADD REPLY
0
Entering edit mode

Thank you! Now I have an object for entrez ids. I have excluded NAs (the genes with no entrez ids). I also have an object for universe which is basically the list of expressed genes to use for enrichGO.

ego <- enrichGO(gene          = entrez,
                universe      = Genes,
                OrgDb         = orgdb,
                ont           = "CC",
                pAdjustMethod = "BH",
                pvalueCutoff  = 0.01,
                qvalueCutoff  = 0.05,
                readable      = TRUE)

But I get this error:

No gene sets have size between 10 and 500 ...
--> return NULL..
ADD REPLY
0
Entering edit mode

It's telling you that none of the GO gene sets are between 10 and 500. Which is plausible, as CC is a relatively barren GO DAG. There are only 8654 genes that are appended to GO CC, and of those, only 268 that are appended to GO terms between 10 and 500. In comparison there are 8347 genes appended to GO BP terms, and of those there are 1201 with sets that are between 10 and 500 genes.

And, the GO annotation for tomato is pretty bad. There are 31328 genes in that orgdb, and less than a third have GO terms appended. So it's going to be problematic all the way around. You could lower the minimum gene set size, but that won't fix the overarching issue.

Blast2GO might have better annotations, but that's locked behind a paywall, which is why we no longer use their data. But maybe you could get a trial subscription.

ADD REPLY

Login before adding your answer.

Traffic: 843 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6