Question

How to add my own Entrez Gene IDs rather than using the ones from a default package?

Raito92 ▴ 60


Hello! I'm using the RnaSeqGeneEdgeRQL workflow to analyse some RNA-seq data. I have now reached the end of my analysis and am missing only the pathway analysis, which should put the differentially expressed genes into context.

The workflow itself is designed for mouse genes and suggests, at one point, importing Entrez Gene IDs from the org.Mm.eg.db package (for mouse) as follows.

library(org.Mm.eg.db)
y$genes$Symbol <- mapIds(org.Mm.eg.db, rownames(y),
                         keytype="ENTREZID", column="SYMBOL")
head(y$genes)

But I'm working on a not-so-common organism, for which no default package is available. So I skipped this step and was still able to perform the statistical analysis without adding annotation data. I only needed a .gff file to count read abundances per gene. Now, however, the identified genes need Entrez IDs, rather than the names they had in the .gff file, before I can continue with the GO and KEGG analysis. Is there any way I can add my own IDs, and specifically retrieve Gene IDs for the species I'm working on rather than using the default mouse package?

This is what I get if I look them up on Entrez, but I can't retrieve the codes, nor do I have any idea how to turn this list into an importable file...

[Image: Entrez search results]

The Entrez IDs aren't included in my gff.

The goana function, which I'm going to use for the GO analysis, works with genomes for which a package is available (like Mm, which refers to the mouse genome), but it gives no results because of the missing IDs in the tr object.

go <- goana(tr, species="Mm")

topGO(go, n=15)

And so does kegga, for KEGG pathway analysis.

keg <- kegga(tr, species="Mm")
topKEGG(keg, n=15, truncate=34)

This is what I get. As you can see, my tr (at the top of the screenshot) doesn't have gene IDs but gene numbers from my gff.

[Image: the tr object and the goana/kegga output]

This is what a tr object is supposed to look like in the workflow, with the Gene ID being the first number of each row.

[Image: the tr object from the workflow]

Thanks in advance!


Written 6.0 years ago by Raito92 • updated 6.0 years ago by Gordon Smyth

score 4 · Accepted Answer · 2019-04-16

You can get data for that organism from the AnnotationHub:

> library(AnnotationHub)
> hub <- AnnotationHub()
> query(hub, c("olea europaea", "orgdb"))
AnnotationHub with 3 records
# snapshotDate(): 2018-10-24 
# $dataprovider: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
# $species: Olea europaea subsp. europaea var. sylvestris, Olea europaea var...
# $rdataclass: OrgDb
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype 
# retrieve records with, e.g., 'object[["AH66232"]]' 

            title                                                      
  AH66232 | org.Olea_europaea_subsp._europaea_var._sylvestris.eg.sqlite
  AH66233 | org.Olea_europaea_var._oleaster.eg.sqlite                  
  AH66234 | org.Olea_europaea_var._sylvestris.eg.sqlite                
> orgdb <- hub[["AH66232"]]
downloading 1 resources
retrieving 1 resource

> orgdb
OrgDb object:
| DBSCHEMAVERSION: 2.1
| DBSCHEMA: NOSCHEMA_DB
| ORGANISM: Olea europaea_subsp._europaea_var._sylvestris
| SPECIES: Olea europaea_subsp._europaea_var._sylvestris
| CENTRALID: GID
| Taxonomy ID: 158386
| Db type: OrgDb
| Supporting package: AnnotationDbi

Please see: help('select') for usage information

Note that this package is based on NCBI GIDs, which may or may not be applicable to what you have (you say that you don't have Gene IDs, but you don't say what you do have). The identifier types you can use for mapping are listed by the columns() function:

> columns(orgdb)
[1] "ACCNUM"   "ALIAS"    "CHR"      "ENTREZID" "GENENAME" "GID"      "PMID"    
[8] "REFSEQ"   "SYMBOL"

So if you have any of those, you can map things. If your gff is based on EBI/EMBL IDs (like Ensembl IDs), then you should really be using data from biomaRt, but it appears that they don't have Olive data. But maybe there is a Biomart hosted by some plant-specific group?
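The mapping step itself can be sketched in base R. This is only an illustration under stated assumptions: the table `map` is a made-up stand-in for what the OrgDb holds, and `gff_ids` is a hypothetical vector of the identifiers found in the gff. The sketch checks which column the gff identifiers match best and then looks up the Entrez Gene ID for each one.

```r
# Toy stand-in for a table pulled from the OrgDb; every value here is made up.
map <- data.frame(
  ENTREZID = c("900001", "900002", "900003"),
  SYMBOL   = c("geneA", "geneB", "geneC"),
  REFSEQ   = c("XM_000001", "XM_000002", "XM_000003"),
  stringsAsFactors = FALSE
)

# Hypothetical identifiers taken from the gff (the row names of the counts).
gff_ids <- c("geneB", "geneC", "geneX")

# Fraction of gff identifiers found in each candidate column.
hit_rate <- sapply(map[c("SYMBOL", "REFSEQ")],
                   function(col) mean(gff_ids %in% col))
best_key <- names(which.max(hit_rate))

# Entrez Gene ID for every gff identifier; unmatched identifiers become NA.
entrez <- map$ENTREZID[match(gff_ids, map[[best_key]])]
names(entrez) <- gff_ids
```

With the real OrgDb, the same lookup would presumably be done with something like mapIds(orgdb, keys=gff_ids, keytype="SYMBOL", column="ENTREZID"), mirroring the mapIds call from the workflow; the right keytype depends on what the gff actually contains.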

score 2 · Accepted Answer · 2019-04-16

You can't use goana() to do a GO analysis of Olea europaea because GO annotation doesn't exist for that species. If you type

help("goana")

then it will tell you to type

help("alias2Symbol")

for a complete list of species for which goana() will work. You will see that Olea is not on the list.

James has shown you how to get Entrez Gene Ids for Olea, but the orgdb doesn't include GO annotation so it won't help you do a GO analysis.

On the other hand, you can do a kegga() analysis for Olea by setting species.KEGG="oeu". Again, you can find that out by following the limma documentation.
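As a rough sketch of what that call might look like: the wrapper name `run_kegg_olive` and its arguments are hypothetical, `fit` stands for the tr object from the question, and `entrez_ids` is assumed to hold one Entrez Gene ID per row of `fit`. It also assumes edgeR is loaded, so that kegga() can handle the glmTreat result, and that an internet connection is available, because kegga() fetches the pathway annotation from KEGG.

```r
# Sketch only: 'fit' is the glmTreat() result (tr in the question) and
# 'entrez_ids' is an assumed vector of Entrez Gene IDs, one per row of 'fit'.
run_kegg_olive <- function(fit, entrez_ids, n = 15) {
  stopifnot(length(entrez_ids) == nrow(fit))
  # species.KEGG = "oeu" selects the KEGG organism code for Olea europaea.
  keg <- limma::kegga(fit, geneid = entrez_ids, species.KEGG = "oeu")
  # Same table call as in the workflow, applied to the olive results.
  limma::topKEGG(keg, n = n, truncate = 34)
}
```

It would be called as run_kegg_olive(tr, entrez_ids) once the Entrez Gene IDs have been matched to the rows of tr.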