What appears to be a simple mapping from probesets to gene symbols is
actually slightly more complex. Behind the scenes, the annotation
package has data for the relationship from probesets to gene IDs,
and also for the relationship from gene IDs to gene symbols. This is
because there can be many probesets that map to a single gene, there
can be many genes that map to a single probeset, and there can be many
gene symbols that map to a single gene. Therefore there are two
relationships here: the first is potentially many (probesets) to many
(genes), and the second is many (symbols) to one (gene).
Why then does it look simpler than that?
In the annotation packages, by default, we hide probesets that map
to more than one gene. This is because most of the time, you probably
don't want anything to do with probes that are not specific. But on
the off chance that you really want to see those, you can expose them
with the toggleProbes() method. So usually the first relationship is
many (probesets) to one (gene).
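
As a rough sketch of what that looks like in practice (assuming the
AnnotationDbi and hgu133plus2.db packages are installed; the variable
names all_probes and hit_counts are made up for illustration, and the
block is skipped quietly when the package is missing):

```r
# Sketch only: needs the Bioconductor packages AnnotationDbi and
# hgu133plus2.db. Nothing is computed if they are not installed.
hit_counts <- NULL
if (requireNamespace("hgu133plus2.db", quietly = TRUE)) {
  library("hgu133plus2.db")
  x <- hgu133plus2SYMBOL                 # default view: multi-gene probesets hidden
  all_probes <- toggleProbes(x, "all")   # also expose probesets that hit several genes
  hit_counts <- table(nhit(all_probes))  # hit counts above 1 can now show up
  print(hit_counts)
  print(all_probes[["203074_at"]])       # the probeset asked about in the question
}
```

Passing "single" instead of "all" to toggleProbes() should bring back
the default, filtered view.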
And in the SYMBOL mappings, the only gene symbol we expose is the most
standard one. If you want the other gene symbols that are associated
with a particular Entrez Gene ID, then you would have to use the
ALIAS2PROBE mapping. So this second relationship is also normally
simplified somewhat for you, from a "many to one" down to a "one to
one".
Because gene symbols are not guaranteed to be unique (sometimes the
same symbol is used as an alias for multiple different genes), I would
strongly urge you to NEVER use them as actual IDs. Instead, if you have
to use them, they should always be the last piece of data attached to
your workflow. So whether you decide to use the annotation packages or
biomaRt, you will require a different strategy for matching up IDs
than using gene symbols.
In short, any sort of "joining" operation that uses gene symbols as
IDs is unsafe and should never be done.
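
To make the danger concrete, here is a small base R sketch. The data
are entirely invented (the gene IDs "101" and "202", the symbol "SYM1"
and all of the measurements are hypothetical); the point is only to
show how a shared symbol pairs up rows that belong to different genes,
while a join on gene IDs does not:

```r
# Toy data, invented for illustration: two different genes that both
# carry the label "SYM1" (official symbol for one, alias for the other).
expr <- data.frame(
  gene_id    = c("101", "202"),
  symbol     = c("SYM1", "SYM1"),
  expression = c(5.2, 9.7),
  stringsAsFactors = FALSE
)
protein <- data.frame(
  gene_id   = c("101", "202"),
  symbol    = c("SYM1", "SYM1"),
  abundance = c(0.4, 3.1),
  stringsAsFactors = FALSE
)

# Unsafe: joining on the symbol pairs every expression value with every
# protein value that shares the label, including cross-gene pairs.
by_symbol <- merge(expr[, c("symbol", "expression")],
                   protein[, c("symbol", "abundance")],
                   by = "symbol")

# Safer: join on the gene ID, and carry the symbol along as a label only.
by_id <- merge(expr, protein[, c("gene_id", "abundance")], by = "gene_id")

print(nrow(by_symbol))  # 4 rows, two of them mixing different genes
print(nrow(by_id))      # 2 rows, one per gene
```

The same idea applies whether the IDs come from the annotation
packages or from biomaRt: match on the IDs first, then attach symbols
at the very end.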
On 10/01/2010 03:10 AM, Christian Ruckert wrote:
> I am doing some mapping of affymetrix probeset IDs to gene symbols
> using package hgu133plus2.db.
> As the following example illustrates, each of the 40686 mapped
> probesets maps to exactly one gene symbol.
> > library("hgu133plus2.db")
> > x <- hgu133plus2SYMBOL
> > length(x)
> [1] 54675
> > count.mappedkeys(x)
> [1] 40686
> > head(nhit(x))
> 1007_s_at   1053_at    117_at    121_at 1255_g_at   1294_at
>         1         1         1         1         1         1
> > table(nhit(x))
>     0     1
> 13989 40686
> Am I correct that annotation with a gene symbol is only included in
> the package if it is unambiguous?
> For example
> > x[["203074_at"]]
> [1] NA
> But netaffx and biomart return:
> ANXA8, ANXA8L1, ANXA8L2
> If doing a mapping between protein and gene expression arrays based
> on gene symbols, can results be improved using biomart instead of the
> annotation packages?
> > sessionInfo()
> R version 2.11.0 (2010-04-22)
> [1] C
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
> other attached packages:
> [1] hgu133plus2.db_2.4.1 org.Hs.eg.db_2.4.1 RSQLite_0.9-1
> [4] DBI_0.2-5 AnnotationDbi_1.10.1 Biobase_2.8.0
> loaded via a namespace (and not attached):
> [1] tools_2.11.0