Question

"adding gene annotations" in edgeR doesn't work with select()

0

Entering edit mode

mjavad2012 • 0

@mjavad2012-22148

Last seen 3.5 years ago

I'm trying to add gene annotation to my DGEList object in edgeR package but it giving me error as following:

Error in .testForValidKeys(x, keys, keytype, fks) : 
  None of the keys entered are valid keys for 'ENTREZID'. Please use the keys method to see a listing of valid arguments.

Here is the code I'm using:

genes <- select(org.Hs.eg.db, keys=row.names(x), columns=c("SYMBOL", "TXCHROM"), keytype="GENEID")

I think the problem is that I don't know which Human db can cover my geneIDs, I've tried several different dbs but none of the could give me gene symbols. Here is the row names (geneIDs) which I honestly do not even know what keytype they are:

> head(geneid)
[1] "ENSCGRG00001004858" "ENSCGRG00001007305" "ENSCGRG00001010983" "ENSCGRG00001012374" "ENSCGRG00001013169" "ENSCGRG00001013944"

Any help would be appreciated, MJ

edgeR DEG • 1.4k views

ADD COMMENT • link updated 4.5 years ago by James W. MacDonald 65k • written 4.5 years ago by mjavad2012 • 0

score 2 · Answer 1 · 2019-10-16

It's not clear if you know this or not, but those are Ensembl gene IDs for Chinese hamster. The annotation package you are trying to use is for Humans, which you are aware of, and the fact that there is an 'eg' in the name is intended to indicate that the central key for the underlying database is Entrez Gene, which is what NCBI used to call their Gene IDs.

So you are trying to use an NCBI-based annotation of human genes to annotate Chinese hamster data for which you have Ensembl IDs, which obviously isn't going to work. If you have Ensembl IDs, you should do all your annotations within the EBI/EBML world, rather than trying to do NCBI -> Ensembl mappings. That means you use biomaRt. The trick here is that you don't know a priori what arguments you need to use for various functions, so you need to figure out what you need.

## load biomaRt and get a 'base' mart. I use the useast mirror because I'm in the Eastern part of the US
> library(biomaRt)
> mart <- useEnsembl("ensembl", mirror = "useast")
## get a listing of the available data sets
> z <- listDatasets(mart)

## what are the mart names for Chinese hamster?
> z[grep("Chinese hamster", z[,2]),]
                    dataset                                  description
32 cgchok1gshd_gene_ensembl Chinese hamster CHOK1GS genes (CHOK1GS_HDv1)
33    cgcrigri_gene_ensembl    Chinese hamster CriGri genes (CriGri_1.0)
35      cgpicr_gene_ensembl     Chinese hamster PICR genes (CriGri-PICR)
        version
32 CHOK1GS_HDv1
33   CriGri_1.0
35  CriGri-PICR

## I know from doing a search at ensembl.org that these are CHOK1GS gene IDs, so we use that
> mart <- useEnsembl("ensembl", "cgchok1gshd_gene_ensembl", mirror = "useast")
> ids <- c("ENSCGRG00001004858", "ENSCGRG00001007305", "ENSCGRG00001010983", "ENSCGRG00001012374", "ENSCGRG00001013169", "ENSCGRG00001013944")

## now we need to know what to look for.

> z <- listAttributes(mart)
> head(z)
                           name                  description         page
1               ensembl_gene_id               Gene stable ID feature_page
2       ensembl_gene_id_version       Gene stable ID version feature_page
3         ensembl_transcript_id         Transcript stable ID feature_page
4 ensembl_transcript_id_version Transcript stable ID version feature_page
5            ensembl_peptide_id            Protein stable ID feature_page
6    ensembl_peptide_id_version    Protein stable ID version feature_page

## I know that ensembl_gene_id is what those IDs are, because I've done this before. 
## We want to get the gene symbols though
> z[grep("symbol", z$description, ignore.case = TRUE),]
                name                description         page
45        mgi_symbol                 MGI symbol feature_page
65 uniprot_gn_symbol UniProtKB Gene Name symbol feature_page

## when you query biomaRt you always want to ask for the thing you are querying on, as the results
## are returned in random order, and you need to be able to re-sort. Plus if there isn't a return value you need
## to have a blank returned.

## the arguments for getBM are, in essence, 'what I want', 'what I have', 'the IDs themselves', 'the mart object'
## so for the first argument (the attributes argument, if you care to know), we ask for both what you want,
## and what you already have.

> getBM(c("mgi_symbol","ensembl_gene_id"), "ensembl_gene_id", ids, mart)
  mgi_symbol    ensembl_gene_id
1            ENSCGRG00001004858
2      Gpm6a ENSCGRG00001007305
3      Wdr17 ENSCGRG00001010983
4     Spata4 ENSCGRG00001012374
5       Asb5 ENSCGRG00001013169
6      Spcs3 ENSCGRG00001013944

I'll leave it up to you to figure out how to re-order those data to match your DGEList rownames, and note that you should read the biomaRt vignette and help pages if you have questions.