Question: How to produce a dataframe with unique ENTREZID to use in gage?
20 months ago by
colaneri30
United States
colaneri30 wrote:

Hi, I have a dataframe of read counts (CNTS) with 47,540 unique ENSEMBL IDs. Now I want to use gage to test for differences in gene expression over gene sets (e.g. KEGG pathways). For example:

SIG.keeg.p <- gage(CNTS, gsets=kegg.sig, ref= ref.idx, samp = samp.idx, compare = "as.group")

To use gage I have to set the row names of my dataframe to ENTREZ IDs. For that purpose I used AnnotationDbi with multiVals="asNA".

Entrez = select(org.Mm.eg.db, keys=row.names(cnts.norm), column="ENTREZID", keytype="ENSEMBL", multiVals="asNA")

According to ?select

"asNA": This will return an NA value whenever there are multiple matches.

Given that, I was expecting an NA in the ENTREZID column of the Entrez dataframe each time one of my keys matched multiple values. In other words, I was expecting that after removing all the rows with NA values I would have a dataframe with one-to-one pairs of ENSEMBL and ENTREZ IDs.

However, this is not what I got. More than 400 ENSEMBL IDs map to more than one ENTREZ ID. See the table below.

ENSEMBL             ENTREZID
ENSMUSG00000000486  54204
ENSMUSG00000000486  100043580
ENSMUSG00000000562  11542
ENSMUSG00000000562  69296
ENSMUSG00000002250  19015
ENSMUSG00000002250  69050
ENSMUSG00000002345  72368
ENSMUSG00000002345  105980076
ENSMUSG00000002379  69875
ENSMUSG00000002379  239760
ENSMUSG00000003680  67706
ENSMUSG00000003680  225895
ENSMUSG00000003812  13423
ENSMUSG00000003812  100503676
ENSMUSG00000004455  19047
ENSMUSG00000004455  434233
ENSMUSG00000006050  24068
ENSMUSG00000006050  225372
ENSMUSG00000008450  68051
ENSMUSG00000008450  621832
ENSMUSG00000008682  110954
ENSMUSG00000008682  434434
ENSMUSG00000010097  53319
ENSMUSG00000010097  66836
ENSMUSG00000015290  27643
ENSMUSG00000015290  100169864
ENSMUSG00000015882  209707
ENSMUSG00000015882  100041576
ENSMUSG00000016559  15081
ENSMUSG00000016559  625328
ENSMUSG00000016559  667250
ENSMUSG00000018378  70393
ENSMUSG00000018378  103841
ENSMUSG00000019857  66403

The same is true for ENTREZ IDs: more than 200 ENTREZ IDs map to more than one ENSEMBL ID.
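A quick way to see both kinds of ambiguity, and to keep only strictly one-to-one pairs, is to flag every ID that occurs more than once on either side of the mapping table. The sketch below is only an illustration in base R: the data frame `map` and the names `dup_ensembl`, `dup_entrez` and `one_to_one` are made up for this example, and the rows are a handful of pairs copied from this post rather than a real `select()` result.

```r
# Toy mapping table built from pairs quoted in this post.
# `map` is a hypothetical name standing in for the data frame returned by select().
map <- data.frame(
  ENSEMBL  = c("ENSMUSG00000000486", "ENSMUSG00000000486",
               "ENSMUSG00000000562", "ENSMUSG00000000562",
               "ENSMUSG00000060208", "ENSMUSG00000074440",
               "ENSMUSG00000019857"),
  ENTREZID = c("54204", "100043580",
               "11542", "69296",
               "13216", "13216",
               "66403"),
  stringsAsFactors = FALSE
)

# Drop keys that found no Entrez ID at all.
map <- map[!is.na(map$ENTREZID), ]

# Flag every row whose ENSEMBL ID occurs more than once (one ENSEMBL, many ENTREZ).
dup_ensembl <- map$ENSEMBL %in% map$ENSEMBL[duplicated(map$ENSEMBL)]

# Flag every row whose ENTREZ ID occurs more than once (many ENSEMBL, one ENTREZ).
dup_entrez <- map$ENTREZID %in% map$ENTREZID[duplicated(map$ENTREZID)]

# Keep only rows that are unambiguous in both directions.
one_to_one <- map[!dup_ensembl & !dup_entrez, ]
print(one_to_one)
```

In this toy table only the ENSMUSG00000019857 row survives, because every other ID is duplicated on one side or the other. Whether discarding ambiguous genes is acceptable depends on the analysis; it is one option, not a recommendation.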

I have a couple of questions.

1. Why did multiVals="asNA" not prevent this ambiguity in the results?

2. Is there any way to prevent this behavior of AnnotationDbi?

3. To produce a dataframe with unique Entrez IDs as row names I will have to choose one, e.g. between

ENSEMBL             ENTREZ
ENSMUSG00000060208  13216
ENSMUSG00000074440  13216

Which one do I choose, and based on what? Each of these ENSEMBL IDs has its own set of count values in the original CNTS dataframe, meaning that the fold change for ENTREZ 13216 in the gage analysis will depend on which ENSMUSG ID I assign to ENTREZ 13216.

How are you expert people dealing with this? Or maybe I am missing an important piece of information. In any case I would really appreciate your help.

ALe

Answer: How to produce a dataframe with unique ENTREZID to use in gage?
20 months ago by
United States
James W. MacDonald51k wrote:

You misunderstand the help page for select. There are two parts. First the Usage section:

Usage:

columns(x)
keytypes(x)
keys(x, keytype, ...)
select(x, keys, columns, keytype, ...)
mapIds(x, keys, column, keytype, ..., multiVals)
saveDb(x, file)
loadDb(file, packageName=NA)

Note that the only function that has a multiVals argument is mapIds. Since select has an ellipsis (...) argument, you can pass ANY argument to that function and it will try to match it to arguments of any functions that it calls. So you won't get an error by passing in random arguments, but if select doesn't call any function that has a multiVals argument, the argument will just be ignored (which is what happens here). However:

> mapIds(org.Mm.eg.db, "ENSMUSG00000000486", "ENTREZID","ENSEMBL", multiVals="asNA")
'select()' returned 1:many mapping between keys and columns
ENSMUSG00000000486
NA
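The silent-ignore behavior of the ellipsis can be reproduced in a few lines of base R. This is only a simplified mimic, not AnnotationDbi code: `outer_select` and `inner_lookup` are hypothetical stand-ins, and the real select method may handle `...` differently internally. The point is just that an argument captured by `...` and never used produces no error and no effect.

```r
# Hypothetical stand-ins for illustration only; not part of AnnotationDbi.
inner_lookup <- function(keys) {
  # Pretend lookup: has no multiVals argument at all.
  toupper(keys)
}

outer_select <- function(keys, ...) {
  # Anything captured by `...` is never inspected or forwarded here,
  # so unknown arguments are dropped without a warning.
  inner_lookup(keys)
}

res_plain <- outer_select(c("a", "b"))
res_extra <- outer_select(c("a", "b"), multiVals = "asNA")

# Both calls give the same result; multiVals had no effect.
print(identical(res_plain, res_extra))
```

So a call that runs without complaint is no proof that the argument was honored; checking the Usage section for the argument name, as above, is the reliable test.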

Answer: How to produce a dataframe with unique ENTREZID to use in gage?
20 months ago by
colaneri30
United States
colaneri30 wrote:

Right, and thanks for the clarification!

I am still trying to produce a dataframe with uniquely mapped pairs of IDs (e.g. ENSEMBL -> ENTREZID or SYMBOL -> ENTREZID).

Can you tell me why I get such different results using two different databases? And which one would you use to go on to gage and pathview?

edb = EnsDb.Mmusculus.v79
entrezIds_Org = as.data.frame(mapIds(org.Mm.eg.db,keys = rownames(cnts.norm), keytype = "SYMBOL",column = "ENTREZID", multiVals = "filter"))
entrezIds_edb = as.data.frame(mapIds(edb, keys = rownames(cnts.norm), keytype = "SYMBOL",column = "ENTREZID", multiVals = "filter"))

RESULTS

> length(entrezIds_edb[!is.na(entrezIds_edb)])
[1] 14435
> length(unique(entrezIds_edb[!is.na(entrezIds_edb)]))
[1] 14428
> length(entrezIds_edb[is.na(entrezIds_edb)])
[1] 9764
>
> length(entrezIds_Org[!is.na(entrezIds_Org)])
[1] 21565
> length(unique(entrezIds_Org[!is.na(entrezIds_Org)]))
[1] 21565
> length(entrezIds_Org[is.na(entrezIds_Org)])
[1] 3758

As you can see, I retrieved many more ENTREZIDs using org.Mm.eg.db, and multiVals="filter" seems to have done its work (21565 ENTREZIDs retrieved in total with 24199 SYMBOL keys, and all 21565 were unique ENTREZIDs).

However, working with the Ensembl database "EnsDb.Mmusculus.v79" I got only 14435 ENTREZIDs, and some of them are not unique (meaning that multiVals="filter" did not work).

As to why you get different mappings, that is beyond the scope of this support site. We simply re-package data that is publicly available from NCBI and EMBL-EBI. You should note however that Entrez Gene IDs are something that NCBI uses, and that EMBL-EBI have different IDs. So any mappings between SYMBOL and ENTREZID using the ensembldb package will necessarily be SYMBOL->ENSEMBLID->ENTREZID, and any mappings between Ensembl IDs and Entrez Gene IDs will tend to be fraught.

My general rule is to stay with whomever brung ya to the dance. So either stick with NCBI IDs or EMBL-EBI IDs.
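The third question in the original post (two ENSEMBL rows sharing one Entrez ID) is not answered in this thread. For readers who do need Entrez row names, two options that are often used are sketched below in base R: summing the counts of all rows that share an Entrez ID, or keeping only the most highly expressed row. Both are assumptions for illustration, not something recommended here; the count values are invented, and `cnts`, `ens2entrez`, `cnts_summed` and `cnts_top` are hypothetical names.

```r
# Toy count matrix with made-up numbers; row names are ENSEMBL IDs from the thread.
cnts <- matrix(
  c(10, 12,
     3,  5,
    20, 18),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    c("ENSMUSG00000060208", "ENSMUSG00000074440", "ENSMUSG00000019857"),
    c("ctrl", "treated")
  )
)

# ENSEMBL -> ENTREZ lookup, e.g. derived from a mapping result with NAs removed.
ens2entrez <- c(ENSMUSG00000060208 = "13216",
                ENSMUSG00000074440 = "13216",
                ENSMUSG00000019857 = "66403")

# Entrez ID for each row of the count matrix, in row order.
entrez <- unname(ens2entrez[rownames(cnts)])

# Option A: sum the counts of all ENSEMBL rows that share an Entrez ID.
cnts_summed <- rowsum(cnts, group = entrez)

# Option B: per Entrez ID, keep only the row with the highest mean count.
ord  <- order(rowMeans(cnts), decreasing = TRUE)
keep <- ord[!duplicated(entrez[ord])]
cnts_top <- cnts[keep, , drop = FALSE]
rownames(cnts_top) <- entrez[keep]

print(cnts_summed)
print(cnts_top)
```

Either result has unique Entrez IDs as row names, which is the input shape gage expects. Summing should be done on raw counts before normalization, and the two options can give different fold changes for the same gene, so the choice should be stated alongside the results.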