Question

How to produce a dataframe with unique ENTREZID to use in gage?

colaneri ▴ 30

Hi, I have a data frame of read counts (CNTS) with 47,540 unique ENSEMBL IDs. I want to use gage to test for differences in gene expression over gene sets (e.g. KEGG pathways). For example:

SIG.keeg.p <- gage(CNTS, gsets=kegg.sig, ref= ref.idx, samp = samp.idx, compare = "as.group")

To use gage I have to set the row names of my data frame to ENTREZ IDs. For that purpose I used select() from AnnotationDbi with multiVals="asNA".

Entrez = select(org.Mm.eg.db, keys=row.names(cnts.norm), column="ENTREZID", keytype="ENSEMBL", multiVals="asNA")

According to ?select

"asNA": This will return an NA value whenever there are multiple matches

Given that, I expected that whenever a key has multiple matches I would find an NA in the ENTREZID column of the Entrez data frame. In other words, I expected that removing all rows with NA values would leave a data frame of one-to-one ENSEMBL-ENTREZ pairs.

However, this is not what I got. More than 400 ENSEMBL IDs map to more than one ENTREZ ID. See the table below.

ENSEMBL	ENTREZID
ENSMUSG00000000486	54204
ENSMUSG00000000486	100043580
ENSMUSG00000000562	11542
ENSMUSG00000000562	69296
ENSMUSG00000002250	19015
ENSMUSG00000002250	69050
ENSMUSG00000002345	72368
ENSMUSG00000002345	105980076
ENSMUSG00000002379	69875
ENSMUSG00000002379	239760
ENSMUSG00000003680	67706
ENSMUSG00000003680	225895
ENSMUSG00000003812	13423
ENSMUSG00000003812	100503676
ENSMUSG00000004455	19047
ENSMUSG00000004455	434233
ENSMUSG00000006050	24068
ENSMUSG00000006050	225372
ENSMUSG00000008450	68051
ENSMUSG00000008450	621832
ENSMUSG00000008682	110954
ENSMUSG00000008682	434434
ENSMUSG00000010097	53319
ENSMUSG00000010097	66836
ENSMUSG00000015290	27643
ENSMUSG00000015290	100169864
ENSMUSG00000015882	209707
ENSMUSG00000015882	100041576
ENSMUSG00000016559	15081
ENSMUSG00000016559	625328
ENSMUSG00000016559	667250
ENSMUSG00000018378	70393
ENSMUSG00000018378	103841
ENSMUSG00000019857	66403

The same is true for ENTREZ IDs. More than 200 ENTREZ IDs map to more than one ENSEMBL ID.
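A quick way to count these ambiguous cases in both directions could look like the sketch below. It uses a small toy data frame named `map` (a hypothetical stand-in for the `select()` result, with IDs copied from this post), so it runs without any annotation package.

```r
# Toy stand-in for the data frame returned by select(); IDs are copied
# from this post, the object name `map` is hypothetical.
map <- data.frame(
  ENSEMBL  = c("ENSMUSG00000000486", "ENSMUSG00000000486",
               "ENSMUSG00000019857", "ENSMUSG00000060208",
               "ENSMUSG00000074440"),
  ENTREZID = c("54204", "100043580", "66403", "13216", "13216"),
  stringsAsFactors = FALSE
)

# Number of distinct partners for each ID, in each direction.
n_entrez_per_ensembl <- tapply(map$ENTREZID, map$ENSEMBL,
                               function(x) length(unique(x)))
n_ensembl_per_entrez <- tapply(map$ENSEMBL, map$ENTREZID,
                               function(x) length(unique(x)))

multi_ensembl <- names(n_entrez_per_ensembl)[n_entrez_per_ensembl > 1]
multi_entrez  <- names(n_ensembl_per_entrez)[n_ensembl_per_entrez > 1]

# Keep only pairs that are unambiguous in both directions.
one_to_one <- map[!(map$ENSEMBL %in% multi_ensembl) &
                  !(map$ENTREZID %in% multi_entrez) &
                  !is.na(map$ENTREZID), ]
print(one_to_one)
```

On real data, `map` would be replaced by the Entrez data frame from the `select()` call above.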

I have a couple of questions.

1-Why did multiVals="asNA" not prevent this ambiguity in the results?

2-Is there any way to prevent this behavior of AnnotationDbi?

3-To produce a data frame with unique Entrez IDs as row names I will have to choose one, e.g. between

ENSEMBL ENTREZ

ENSMUSG00000060208	13216
ENSMUSG00000074440	13216

Which one do I choose, and based on what? Each of these ENSEMBL IDs has its own set of count values in the original CNTS data frame, meaning that the fold change for ENTREZ 13216 in the gage analysis will depend on which ENSMUSG ID I assign to ENTREZ 13216.

How are you experts dealing with this? Or maybe I am missing an important piece of information. In any case, I would really appreciate your help.
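For illustration only, here are two ways the duplicated rows could be collapsed: summing the counts of all ENSEMBL rows that share an Entrez ID, or keeping the row with the highest mean count. This is a sketch on made-up counts (the matrix `cnts` and its values are hypothetical); nothing in this thread says which option, if either, is appropriate for gage.

```r
# Hypothetical toy counts: rows are ENSEMBL IDs, columns are samples.
cnts <- matrix(c(10, 12,
                  3,  5,
                100, 90),
               nrow = 3, byrow = TRUE,
               dimnames = list(c("ENSMUSG00000060208",
                                 "ENSMUSG00000074440",
                                 "ENSMUSG00000019857"),
                               c("ctrl", "treated")))
# Entrez ID for each row of cnts, in row order.
entrez <- c("13216", "13216", "66403")

# Option 1: add up the counts of all ENSEMBL rows sharing an Entrez ID.
summed <- rowsum(cnts, group = entrez)

# Option 2: per Entrez ID, keep the ENSEMBL row with the highest mean count.
ord  <- order(rowMeans(cnts), decreasing = TRUE)
keep <- ord[!duplicated(entrez[ord])]
top  <- cnts[keep, , drop = FALSE]
rownames(top) <- entrez[keep]

print(summed)
print(top)
```

Either result has unique Entrez IDs as row names. The two options give different values for 13216, which is exactly the dependence described above.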

ALe

gage annotationdbi entrez ensembl • 2.6k views

score 0 · Answer 1 · 2018-02-20

You misunderstand the help page for select. There are two parts. First the Usage section:

Usage:

       columns(x)
       keytypes(x)
       keys(x, keytype, ...)
       select(x, keys, columns, keytype, ...)
       mapIds(x, keys, column, keytype, ..., multiVals)
       saveDb(x, file)
       loadDb(file, packageName=NA)

Note that the only function that has a multiVals argument is mapIds. Since select has an ellipsis (...) argument, you can pass ANY argument to that function and it will try to match it to arguments of any functions that it calls. So you won't get an error by passing in random arguments, but if select doesn't call any functions that have a multiVals argument, it will just be ignored (which is what happens here). However:

> mapIds(org.Mm.eg.db, "ENSMUSG00000000486", "ENTREZID","ENSEMBL", multiVals="asNA")
'select()' returned 1:many mapping between keys and columns
ENSMUSG00000000486
                NA
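Building on that, a minimal sketch of how the named vector returned by mapIds could be reduced to one-to-one pairs. The vector `ens2entrez` below is a hand-written stand-in for the mapIds output (so the example runs without org.Mm.eg.db), and the object names are hypothetical. Note that multiVals only deals with one key matching several values; several keys sharing one value still have to be handled separately.

```r
# Stand-in for the named vector that
#   mapIds(org.Mm.eg.db, keys, "ENTREZID", "ENSEMBL", multiVals = "asNA")
# would return: names are the keys, values are Entrez IDs or NA.
ens2entrez <- c(ENSMUSG00000000486 = NA,
                ENSMUSG00000019857 = "66403",
                ENSMUSG00000060208 = "13216",
                ENSMUSG00000074440 = "13216")

# Drop keys without a single unambiguous Entrez ID.
mapped <- ens2entrez[!is.na(ens2entrez)]

# Entrez IDs that are shared by more than one ENSEMBL key.
shared <- unique(mapped[duplicated(mapped)])

# Keep only keys whose Entrez ID is not shared with another key.
unique_map <- mapped[!(mapped %in% shared)]
print(unique_map)
```

Under these assumptions, a count matrix could then be subset with `names(unique_map)` and its row names replaced by the values of `unique_map`.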

score 0 · Answer 2 · 2018-02-20

Right, and thanks for the clarification!

I am still trying to produce a data frame with uniquely mapped pairs of IDs (e.g. ENSEMBL -> ENTREZID or SYMBOL -> ENTREZID).

Can you tell me why I get such different results using two different databases? And which one would you use to go on to gage and pathview?

library(org.Mm.eg.db)
library(EnsDb.Mmusculus.v79)
edb = EnsDb.Mmusculus.v79
# Map gene symbols to Entrez IDs with each database, dropping symbols that match several IDs
entrezIds_Org = as.data.frame(mapIds(org.Mm.eg.db, keys = rownames(cnts.norm), keytype = "SYMBOL", column = "ENTREZID", multiVals = "filter"))
entrezIds_edb = as.data.frame(mapIds(edb, keys = rownames(cnts.norm), keytype = "SYMBOL", column = "ENTREZID", multiVals = "filter"))

RESULTS

> length(entrezIds_edb[!is.na(entrezIds_edb)])
[1] 14435
> length(unique(entrezIds_edb[!is.na(entrezIds_edb)]))
[1] 14428
> length(entrezIds_edb[is.na(entrezIds_edb)])
[1] 9764
> 
> length(entrezIds_Org[!is.na(entrezIds_Org)])
[1] 21565
> length(unique(entrezIds_Org[!is.na(entrezIds_Org)]))
[1] 21565
> length(entrezIds_Org[is.na(entrezIds_Org)])
[1] 3758

As you can see, I retrieved many more ENTREZIDs by using org.Mm.eg.db, and multiVals="filter" seems to have done its work (21565 ENTREZIDs retrieved in total with 24199 SYMBOL keys, and all 21565 were unique).

However, working with the Ensembl database "EnsDb.Mmusculus.v79" I got only 14435 ENTREZIDs, and some of them are not unique (meaning that multiVals="filter" did not work).
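To see where the two results differ, a side-by-side summary might help. The sketch below uses two made-up named vectors (`from_org`, `from_edb`, with invented gene names and IDs) in place of the real mapIds output. It also shows that values can repeat when two different keys point to the same ENTREZID, a case that multiVals does not address because it only handles one key matching several values.

```r
# Hypothetical named vectors standing in for the two mapIds() results;
# gene names and IDs are invented.
from_org <- c(GeneA = "1", GeneB = "2", GeneC = "3", GeneD = NA)
from_edb <- c(GeneA = "1", GeneB = NA,  GeneC = "1", GeneD = NA)

# Mapped, unique and unmapped counts for one mapping vector.
summarise_map <- function(x) {
  mapped <- x[!is.na(x)]
  c(mapped   = length(mapped),
    unique   = length(unique(mapped)),
    unmapped = sum(is.na(x)))
}
comparison <- rbind(org = summarise_map(from_org),
                    edb = summarise_map(from_edb))
print(comparison)

# Keys mapped by the first source but not by the second.
only_org <- names(from_org)[!is.na(from_org) &
                            is.na(from_edb[names(from_org)])]

# Keys in the second source that share an ID with another key.
dup_ids <- from_edb[duplicated(from_edb) & !is.na(from_edb)]
dup_edb <- from_edb[from_edb %in% dup_ids]
print(only_org)
print(dup_edb)
```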