Question

hugene10sttranscriptcluster.db missing annotation compared to hgu133plus2.db

Stane ▴ 40

@stane-10974

Hello,

I have recently been working with annotation for the Affymetrix microarray platform GPL6244, using the hugene10sttranscriptcluster.db package to retrieve gene symbols.

I noticed that some symbols are missing compared to the manufacturer's annotation file on NCBI (the one with the crappy // separators in its fields). But if I take the symbols that are in the NCBI file and not in hugene10sttranscriptcluster.db, and intersect that difference with hgu133plus2.db, I find 397 symbols. That means some symbols missing from hugene10sttranscriptcluster.db, but present in the NCBI annotation file, are also present in hgu133plus2.db.

What confuses me is that I thought the Bioconductor annotation packages were all built from the same base, then frozen for a few months, and were therefore in sync. Apparently they are not.

annotation affy • 1.1k views

updated 7.5 years ago by James W. MacDonald 65k • written 7.5 years ago by Stane ▴ 40

score 1 · Answer 1 · 2016-10-31

The annotation packages are generated every release, and are then 'frozen' until the next release. But other than the fact that they are both intended to measure human samples (and are made by the same company), there is no reason to expect the HuGene 1.0 ST and HG-U133Plus2 arrays to have the same set of genes (or symbols).

Those arrays were designed five years apart (2001 for the U133Plus2 and 2006 for the HuGene), and are based on different gene databases (UniGene vs a combination of RefSeq and GenBank). I would expect a large degree of overlap, but not complete consistency.

In addition, all of the Bioconductor annotation packages are based on the idea that the Entrez Gene ID is the 'central gene ID'. We simply take the accession numbers (usually a combination of RefSeq and GenBank IDs) that Affy provides in the annotation file you reference, map those to Entrez Gene IDs, and then all other mappings (e.g. to HUGO symbols) are based on the mapping of Entrez Gene IDs onward. In fact, the chip-specific annotation packages like the hugene10sttranscriptcluster.db package don't have anything in them except for a probeset -> Entrez Gene ID mapping, and rely on the org.Hs.eg.db package to do all other mappings.

So there may well be instances where Affy says a given probeset maps to a particular gene symbol, but if NCBI doesn't agree, our annotation packages won't provide that mapping.
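To make the two-step lookup concrete, here is a minimal toy sketch of the idea, written in plain Python rather than against the real packages. Every probeset ID, Entrez Gene ID, and symbol in it is a made-up placeholder, and the dictionaries are only stand-ins for a chip package, org.Hs.eg.db, and the vendor file. It shows how a symbol listed by the vendor can drop out of one chip's annotation yet still turn up through another chip, which is roughly the comparison described in the question.

```python
# Toy illustration only: every ID and symbol below is a made-up placeholder,
# not real annotation data.

# Stand-ins for two chip-specific packages: probeset -> Entrez Gene ID.
# None means the vendor's accessions could not be mapped to an Entrez Gene ID.
chip_a = {"ps_a1": "1001", "ps_a2": "1002", "ps_a3": None}
chip_b = {"ps_b1": "1001", "ps_b2": "1003"}

# Stand-in for the organism-level package: Entrez Gene ID -> symbol.
org_symbols = {"1001": "GENE1", "1002": "GENE2", "1003": "GENE3"}

# Stand-in for the vendor annotation file of chip A: probeset -> symbol.
vendor_a = {"ps_a1": "GENE1", "ps_a2": "GENE2", "ps_a3": "GENE3"}


def chip_symbols(chip, org):
    """Two-step lookup: probeset -> Entrez Gene ID -> symbol."""
    out = {}
    for probeset, entrez in chip.items():
        # Without an Entrez Gene ID there is nothing to look up downstream.
        out[probeset] = org.get(entrez) if entrez is not None else None
    return out


symbols_a = chip_symbols(chip_a, org_symbols)
symbols_b = chip_symbols(chip_b, org_symbols)

set_a = {s for s in symbols_a.values() if s is not None}
set_b = {s for s in symbols_b.values() if s is not None}

# Symbols the vendor lists for chip A that the two-step lookup does not return.
missing_from_a = set(vendor_a.values()) - set_a

# Of those, the ones that are still reachable through chip B.
also_on_b = missing_from_a & set_b

print(symbols_a)
print(missing_from_a)
print(also_on_b)
```

In this toy case the probeset ps_a3 has no Entrez Gene ID, so its vendor symbol never appears for chip A, while chip B reaches the same symbol through its own probeset. Both chips share the same downstream table, so the difference comes entirely from the probeset -> Entrez Gene ID step.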