Question

Using hugene10sttranscriptcluster.db for annotation

0

Entering edit mode

Khawaja • 0

@khawaja-7772

Last seen 7.3 years ago

India

I am trying to annotate the data (with gene name and gene symbol) from [HuGene-1_0-st] Affymetrix Human Gene 1.0 ST Array [transcript (gene) version] platform using hugene10sttranscriptcluster.db. I use the following script to get the gene symbol.

x = hugene10sttranscriptclusterALIAS2PROBE
mapped_probes <- mappedkeys(x)
xx <- as.list(hugene10sttranscriptclusterALIAS2PROBE)

but it gives me many names for a probe identifier, I am not sure which one to keep. I want to get a unique gene symbol for probe identifier. Help appreciated.

bioconductor hugene10sttranscriptcluster.db annotation • 2.1k views

ADD COMMENT • link updated 8.3 years ago by James W. MacDonald 68k • written 8.3 years ago by Khawaja • 0

1

Entering edit mode

James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 7 hours ago

United States

While you can use as.list to get the values, it's really not the way to go. You end up relying on old code that doesn't really do what you might think it does. Plus, why would you want a list in the first place?

As an example:

> z <- as.list(hugene10sttranscriptclusterSYMBOL[keys(hugene10sttranscriptcluster.db)])
> table(sapply(z, is.na))

FALSE  TRUE
19985 13312

So using as.list returns 13312 probesets that appear not to have a HUGO symbol appended. But that's not quite right:

> zz <- mapIds(hugene10sttranscriptcluster.db, keys(hugene10sttranscriptcluster.db), "SYMBOL", "PROBEID")
'select()' returned 1:many mapping between keys and columns

> sum(is.na(zz))
[1] 10992

There are actually a bit over 2000 probeids that do have a HUGO symbol, for which as.list is returning NA. This is because those probeids have multiple symbols!

> ind <- sapply(z, is.na) & !is.na(zz)
> sum(ind)
[1] 2320
> head(select(hugene10sttranscriptcluster.db, names(zz)[ind], "SYMBOL"))
'select()' returned 1:many mapping between keys and columns
  PROBEID      SYMBOL
1 7896740       OR4F4
2 7896740      OR4F17
3 7896740       OR4F5
4 7896742 LINC00266-1
5 7896742      PCMTD2
6 7896742   LOC728323

The old style functions return NA for any probeset ID that maps to more than one HUGO symbol (or whatever else you are mapping to), whereas mapIds and select do not. So it's a better idea to use the more current query methods.

In addition, assuming your data are in an ExpressionSet, you can use annotateEset from my affycoretools package to do this in one line. Say your ExpressionSet is called 'eset':

library(affycoretools)

eset <- annotateEset(eset, hugene10sttranscriptcluster.db)

And then model fitting tools like limma will return fully annotated topTable results.

ADD COMMENT • link 8.3 years ago James W. MacDonald 68k

score 2 · Accepted Answer · 2017-03-26

"I want to get a unique gene symbol for probe identifier"

You are not doing it right. For the above script, the package vignette says like this:

"Each gene symbol is mapped to a named vector of manufacturer identifiers. The name represents the gene symbol and the vector contains all manufacturer identifiers that are found for that symbol.An NA is reported for any gene symbol that cannot be mapped to any manufacturer identifiers. This mapping includes ALL gene symbols including those which are already listed in the SYMBOL map. The SYMBOL map is meant to only list official gene symbols, while the ALIAS maps are meant to store all used symbols".

And you are interested in something else, the script of which is shown below:

x <- hugene10sttranscriptclusterSYMBOL
# Get the probe identifiers that are mapped to a gene symbol
mapped_probes <- mappedkeys(x)
# Convert to a list
xx <- as.list(x[mapped_probes])

and the saying in the package vignette goes like this:

"Each manufacturer identifier is mapped to an abbreviation for the corresponding gene. An NA is reported if there is no known abbreviation for a given gene. Symbols typically consist of 3 letters that define either a single gene (ABC) or multiple genes (ABC1, ABC2, ABC3). Gene symbols can be used as key words to query public databases such as Entrez Gene".