Question

Get corresponding gene id for probe id from Affymetrix array

0

Entering edit mode

nhaus ▴ 70

@789c70a6

Last seen 9 months ago

Switzerland

Hello,

I am currently analyzing a microarray study which was deposited on GEO. I am using the oligo package to obtain gene level summaries after background correction. Now I want to map all the remaining probesets to to entrez gene ids, but I wasn't sure how to do that. I found the getProbeInfo function which seems to be exactly what I am looking for, but was not sure which field corresponds to entrez gene id (or ensembl gene id).

This is my code so far:


file_names <- list.files("/DDAS/MA", pattern = "CEL", full.names = T)
rawData <- read.celfiles(file_names)
geneSummaries <- rma(rawData, target="core")
probeInfo <- getProbeInfo(geneSummaries ,field= c('fid', 'fsetid', 'atom',
                                     'transcript_cluster_id','exon_id'), target='core')

The command rownames(geneSummaries) returns a list which looks something like this: "2371346" "2371474" "2371547" "2371694" "2371738" "2371873" "2372103", ...

I would very much appreciated if somebody could tell me what the correct field name is to get the gene identifiers, or any other way how I can get gene ids from geneSummaries.

AnnotationDbi Microarray • 4.2k views

ADD COMMENT • link 3.8 years ago nhaus ▴ 70

score 2 · Accepted Answer · 2022-05-09

2

Entering edit mode

James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 3 days ago

United States

It's easier to use my affycoretools package for that.

library(affycoretools)
geneSummaries <- annotateEset(geneSummaries, annotation(geneSummaries))

ADD COMMENT • link 3.8 years ago James W. MacDonald 68k

0

Entering edit mode

Thank you very much! This is extremly useful. Just a quick question. When I ran your command and specify type="core" I still get this warning:

 There are 337842 probesets in the annotation data and 22011 probesets in the ExpressionSet;
the overlap between the annotation and ExpressionSet is 6% 
and the overlap between the ExpressionSet and annotation is 100%. 
Proceed with caution.

Any Idea why this might happen? The command annotation(geneSummaries) returns "pd.huex.1.0.st.v2".

Also, just for my own understanding, would the following command return a similar annotation dataframe?

AnnotationDbi::select(huex10sttranscriptcluster.db,
                                       keys = featureNames(geneSummaries),
                                       columns = c("SYMBOL", "ENTREZID"),
                                       keytype = "PROBEID")

Thanks again!

ADD REPLY • link 3.8 years ago nhaus ▴ 70

0

Entering edit mode

You could instead do

geneSummaries <- annotateEset(geneSummaries, huex10sttranscriptcluster.db)

You shouldn't use type = 'core' if you have summarized at the transcript level. Using type = 'core' implies that you summarized at the PSR (roughly exon) level, which is why there are so many more annotations than probesets.

It can be a problem to use select with more than one column, because under the hood it's doing a left join on those two columns, so if there are any 1 to many matches you end up with way more rows than you would expect. What annotateEset does is repeated calls to mapIds, which returns only one result per probeset ID, for each column, and then collapses to a data.frame. That way you are ensured that the annotations match perfectly with the underlying data.

Here's an extreme example:

> library(org.Hs.eg.db)
> z <-  head(keys(org.Hs.eg.db))
> z
[1] "1"  "2"  "3"  "9"  "10" "11"
> dim(select(org.Hs.eg.db, z, c("SYMBOL","ACCNUM")))
'select()' returned 1:many mapping between keys and columns
[1] 1862    3

## iterative usage of mapIds does what you expect though
> do.call(cbind, lapply(c("SYMBOL","ACCNUM"), function(x) mapIds(org.Hs.eg.db, z, x, "ENTREZID")))
'select()' returned 1:1 mapping between keys and columns
'select()' returned 1:many mapping between keys and columns
   [,1]    [,2]      
1  "A1BG"  "AA484435"
2  "A2M"   "AAA51551"
3  "A2MP1" "DB301195"
9  "NAT1"  "AAB62398"
10 "NAT2"  "AAA59905"
11 "NATP"  NA

ADD REPLY • link 3.8 years ago James W. MacDonald 68k

0

Entering edit mode

I should note that mapIds might not do what you exactly want. By default it takes the first of any 1:many mappings, which may or may not be what you want. As an example, there are quite a few Genbank and RefSeq (the ACCNUM column) IDs that map to each of the Gene IDs I supplied in the above example. The first one might not be the 'right one' in some sense, and maybe you want more than that. There are ways to get all the 1:many mappings (a DataFrame being the canonical structure for Bioconductor), but in general usage having one is enough.

ADD REPLY • link 3.8 years ago James W. MacDonald 68k

0

Entering edit mode

Thank you very much for your detailed answer and your example! It is much appreciated.

You shouldn't use type = 'core' if you have summarized at the transcript level. Using type = 'core' implies that you summarized at the PSR (roughly exon) level, which is why there are so many more annotations than probesets.

Might I ask why this is the case? I read in the oligo userguide that specifiying target ="core" in rma summarized to gene level. Does this mean that "core" is something different between annotateEset and oligo. And if so, what is the "correct" target if I have used target ="core" in rma?

Thanks again!

ADD REPLY • link 3.8 years ago nhaus ▴ 70