Question

Common workflow to build a microarray annotation package, like hgu133a.db

zhilongjia • 0

@zhilongjia-7339

Last seen 2.9 years ago

United Kingdom

What is the common workflow for building a microarray annotation package like hgu133a.db?

For some arrays, probe sequences are available, so perhaps the probes can be mapped to genes directly? How should the other cases be handled? If the code used by the Bioconductor team is available, that would be great. Thank you.

The specific goal is to build annotation packages for new platforms that are not currently available from Bioconductor (all I need is a probe-to-gene-symbol mapping).

It seems Bioconductor updates the annotation packages with each new release to keep up with changes in gene symbols.

BTW, why is the package named hgu133a.db in Bioconductor rather than GPL96.db (after the GEO platform ID)? Users have to work out the correspondence between the two themselves, although some mappings exist, such as https://gist.github.com/seandavi/bc6b1b82dc65c47510c7#file-platformmap-txt.

annotation • 3.8k views

ADD COMMENT • link updated 10.0 years ago by James W. MacDonald 68k • written 10.0 years ago by zhilongjia • 0

score 2 · Answer 1 · 2016-01-06

Diego Diez ▴ 760

@diego-diez-4520

Last seen 5.2 years ago

Japan

Regarding the generation of annotation packages, take a look at the AnnotationForge package (https://bioconductor.org/packages/release/bioc/html/AnnotationForge.html). This is the package used to generate most (all?) annotation packages.
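As a rough illustration of what AnnotationForge expects, the sketch below writes a two-column, tab-delimited probe-to-gene file and passes it to makeDBPackage(). The probe names, the "myarray" prefix, the chip name, and the manufacturer details are hypothetical placeholders, and the build step only runs if AnnotationForge, human.db0, and org.Hs.eg.db are installed. Treat it as a sketch modeled on the AnnotationForge SQLForge vignette, not necessarily the exact procedure used by the Bioconductor team.

```r
# Hypothetical probe-to-Entrez-Gene table; in practice these IDs would come
# from the manufacturer's annotation or from your own probe mapping.
probe_map <- data.frame(
  probe  = c("probe_1", "probe_2", "probe_3"),
  entrez = c("7157", "672", "1956"),
  stringsAsFactors = FALSE
)

# makeDBPackage() reads a tab-delimited file with two columns and no header:
# probe ID, then the gene-level ID named by baseMapType.
id_file <- file.path(tempdir(), "myarray_IDs.txt")
write.table(probe_map, id_file, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)

# Build the package only when the required (large) dependencies are present.
have_deps <- all(vapply(
  c("AnnotationForge", "human.db0", "org.Hs.eg.db"),
  requireNamespace, logical(1), quietly = TRUE
))
if (have_deps) {
  library(AnnotationForge)
  makeDBPackage(
    "HUMANCHIP_DB",
    affy            = FALSE,            # not an Affymetrix annotation CSV
    prefix          = "myarray",        # hypothetical; yields myarray.db
    fileName        = id_file,
    baseMapType     = "eg",             # second column holds Entrez Gene IDs
    outputDir       = tempdir(),
    version         = "0.0.1",
    manufacturer    = "Agilent",        # placeholder
    chipName        = "My custom array",         # placeholder
    manufacturerUrl = "https://www.agilent.com"  # placeholder
  )
}
```

If the build succeeds, the resulting myarray.db source directory in outputDir could then be installed with install.packages(..., repos = NULL, type = "source").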

As for your question on the naming, others more knowledgeable may give you a more specific answer, but I think hgu133a reflects the original name given by Affymetrix to this platform. Note that GPL96 is a GEO-specific ID, and somewhat less informative.

ADD COMMENT • link 10.0 years ago Diego Diez ▴ 760

Thank you. I will check the AnnotationForge package.

About the naming, another example is hgug4112a, which corresponds to Agilent-012391 Whole Human Genome Oligo Microarray G4112A (Feature Number version). It is awkward to find the correspondence between them without the gist file supplied by seandavi, which is itself incomplete. Usually, this kind of annotation package is used for annotating GPLs. Are there other uses for the annotation packages?

ADD REPLY • link 10.0 years ago zhilongjia • 0

Not really. A typical use case is reading Affymetrix CEL files with ReadAffy() (affy package), followed by rma(), which returns an ExpressionSet object. The correct annotation package is detected automatically. GEO is a project independent of Bioconductor, so there is no guarantee that Bioconductor has an annotation package matching every platform available at GEO. Of course, you may use the Bioconductor annotation packages to annotate arrays in GEO, but I think that is an extra benefit, not the original motivation. BTW, there is a way to obtain the existing correspondence between GPLs and Bioconductor annotation packages that may (or may not) be more up to date (I got this from Sean Davis: dplyr and the GEOmetadb package for mining NCBI GEO metadata):

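library(GEOmetadb)
# Download the GEOmetadb SQLite database once (big file!)
if (!file.exists('GEOmetadb.sqlite')) getSQLiteFile()

library(dplyr)
# Open the database and point at the 'gpl' (platform) table
db <- src_sqlite('GEOmetadb.sqlite')
gpl <- tbl(db, 'gpl')
# Keep only platforms that have an associated Bioconductor annotation package
gpl %>% filter(!is.na(bioc_package)) %>% select(gpl, bioc_package, title)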
Source: sqlite 3.8.6 [GEOmetadb.sqlite]
From: gpl [78 x 3]
Filter: !is.na(bioc_package) 

     gpl bioc_package                                            title
   (chr)        (chr)                                            (chr)
1  GPL32       mgu74a    [MG_U74A] Affymetrix Murine Genome U74A Array
2  GPL33       mgu74b    [MG_U74B] Affymetrix Murine Genome U74B Array
3  GPL34       mgu74c    [MG_U74C] Affymetrix Murine Genome U74C Array
4  GPL71           ag         [AG] Affymetrix Arabidopsis Genome Array
5  GPL72  drosgenome1 [DrosGenome1] Affymetrix Drosophila Genome Array
6  GPL74       hcg110          [HC_G110] Affymetrix Human Cancer Array
7  GPL75    mu11ksuba     [Mu11KsubA] Affymetrix Murine 11K SubA Array
8  GPL76    mu11ksubb     [Mu11KsubB] Affymetrix Murine 11K SubB Array
9  GPL77    mu19ksuba     [Mu19KsubA] Affymetrix Murine 19K SubA Array
10 GPL78    mu19ksubb     [Mu19KsubB] Affymetrix Murine 19K SubB Array
..   ...          ...                                              ...

ADD REPLY • link 10.0 years ago Diego Diez ▴ 760

ReadAffy(), which works with Affymetrix arrays only, returns an AffyBatch object whose annotation name (such as hgu133plus2) can be retrieved with annotation(). Its featureData is empty, though. Why is that? Does it mean the annotation package information is not picked up automatically? By contrast, if getGEO() is used, annotation() returns GPLxx and featureData() is populated from the GPLxx record in GEO. Thank you for showing the GPL-to-annotation-package mapping.

~~update: getGEO will get the annotation package information if parameter AnnotGPL is TRUE and the package exists.~~

update2: getGEO() will fetch the updated annotation information if the AnnotGPL parameter is TRUE and the annotation file exists, e.g. GPL570.annot.gz.

ADD REPLY • link 10.0 years ago zhilongjia • 0

I was missing the rma() step; corrected. featureData is NULL because by default the object does not contain feature information (you can add it from the annotation package if you wish). getGEO() was developed later and gets its information from GEO, hence it contains the associated platform and whatever featureData is available at GEO. But, as you said, there is no direct translation between the two. Using the info I gave you (or the link to the gist) you can find the correspondence for existing packages. If the package is not there, you may want to build one yourself with the AnnotationForge package.
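For the "add it from the annotation package" step, a minimal sketch of a probe-to-symbol lookup with AnnotationDbi::select() might look like the following. It assumes hgu133a.db is installed (the lookup is skipped otherwise), and the two probe IDs are just examples picked for illustration.

```r
# Example hgu133a probe IDs, chosen only for illustration.
probe_ids <- c("1007_s_at", "1053_at")

# Run the lookup only if the annotation package is installed.
if (requireNamespace("hgu133a.db", quietly = TRUE)) {
  library(hgu133a.db)
  # Returns a data.frame with one row per probe/symbol pair
  # (a probe may map to zero, one, or several symbols).
  probe2symbol <- AnnotationDbi::select(
    hgu133a.db,
    keys    = probe_ids,
    columns = "SYMBOL",
    keytype = "PROBEID"
  )
  print(probe2symbol)
}
```

The same call should work with any chip-level .db package, including one built with AnnotationForge, by swapping in that package's object.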

ADD REPLY • link 10.0 years ago Diego Diez ▴ 760

About AnnotationForge: I have the probe names and sequences (for example, for GPL6480). I do not want to use the annotation from GEO; I want to update the annotation myself. However, AnnotationForge requires some kind of ID (such as a GenBank ID) alongside each probe name, so it seems that mapping is the first step. Which R functions are usually used to map probes to genes/miRNAs?

ADD REPLY • link 10.0 years ago zhilongjia • 0

You could map the probes to the genome using the Biostrings package and then annotate them by their overlap with transcripts. I did that for some arrays a few months ago. I will post a script later if I have some time... Not sure whether the recipe is also available in some vignette or in a post on the support site.
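Until such a script is posted, here is a toy sketch of the idea, not the script mentioned above. The "chromosome", probe sequences, and gene coordinates are all made up so the example is self-contained; a real run would presumably match against a BSgenome object and take gene or transcript ranges from a TxDb package instead.

```r
library(Biostrings)
library(GenomicRanges)

# Toy reference sequence and probes (made-up data for illustration).
chr    <- DNAString("ACGTACGTTTGACCATGGCATTAGCCGATAGGCTTAACGGATCCA")
probes <- DNAStringSet(c(p1 = "TTGACCATGG", p2 = "GGCTTAACGG"))

# Step 1: exact-match each probe against the reference.
# Only the given strand is searched here; a real workflow would also
# search with reverseComplement(probes).
probe_gr <- do.call(c, lapply(seq_along(probes), function(i) {
  m <- matchPattern(probes[[i]], chr)
  GRanges("chrToy", IRanges(start(m), end(m)),
          probe_id = rep(names(probes)[i], length(m)))
}))

# Step 2: toy gene models (hypothetical coordinates and symbols).
genes <- GRanges("chrToy", IRanges(start = c(5, 28), end = c(20, 45)),
                 symbol = c("GENE_A", "GENE_B"))

# Step 3: assign each probe hit to the gene(s) it overlaps.
ov <- findOverlaps(probe_gr, genes)
probe2gene <- data.frame(
  probe_id = probe_gr$probe_id[queryHits(ov)],
  symbol   = genes$symbol[subjectHits(ov)],
  stringsAsFactors = FALSE
)
print(probe2gene)
```

A table like probe2gene, once converted to the ID type AnnotationForge expects, is the kind of two-column input the package-building step needs. Probes that match several locations or no location at all would need to be filtered first.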

ADD REPLY • link 10.0 years ago Diego Diez ▴ 760

score 2 · Answer 2 · 2016-01-06

James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 5 hours ago

United States

We should point out that you don't need to build an annotation package for the hgu133a array, as there is already one on Bioconductor. Just install the hgu133a.db package using biocLite() (BiocManager::install() in current Bioconductor releases). For most common arrays you won't need to make a package.

ADD COMMENT • link 10.0 years ago James W. MacDonald 68k

Thank you for the reminder. What I want to know is the common workflow for building this kind of annotation package; hgu133a is just an example.

ADD REPLY • link 10.0 years ago zhilongjia • 0