What is the common workflow for building a microarray annotation package such as hgu133a.db?
For some arrays the probe sequences are available, so perhaps the probes are mapped by sequence? How are the other cases handled? If the code used by the team is available, that would be great. Thank you.
My specific goal is to build annotation packages for platforms that are not currently available from Bioconductor (all I need is the probe-to-gene-symbol mapping).
It seems Bioconductor updates the annotation packages with each new release because gene symbols change.
BTW, why is the package named hgu133a.db in Bioconductor rather than GPL96.db (after the GEO platform)? Users have to work out the correspondence between the two themselves, although some mappings exist, such as https://gist.github.com/seandavi/bc6b1b82dc65c47510c7#file-platformmap-txt.

Thank you. I will check the AnnotationForge package.
About the naming, another example is `hgug4112a` to Agilent-012391 Whole Human Genome Oligo Microarray G4112A (Feature Number version). It is awkward to find the correspondence between them without the gist file supplied by seandavi, which is also incomplete. Usually, this kind of annotation package is used for annotating the GPLs. Are there other uses for the annotation package?

Not really. A typical use case is reading Affymetrix CEL files (`affy` package) with `ReadAffy()` followed by `rma()`, which returns an `ExpressionSet` object. This automatically detects the correct annotation package. GEO is a project independent of Bioconductor (so there is no guarantee that Bioconductor has annotation packages matching all platforms available at GEO). Of course, you may use the Bioconductor annotation packages to annotate the arrays in GEO, but I think that is an extra benefit, not the original motivation. BTW, there is a way to obtain the existing correspondence between GPLs and Bioconductor annotation packages that may (or may not) be more up to date (I got this from Sean Davis: dplyr and the GEOmetadb package for mining NCBI GEO metadata):

    library(GEOmetadb)
    # Download the GEOmetadb SQLite database once
    if (!file.exists('GEOmetadb.sqlite')) getSQLiteFile() # big file!
    library(dplyr)
    db <- src_sqlite('GEOmetadb.sqlite')
    gpl <- tbl(db, 'gpl')
    # Keep only platforms that have a Bioconductor annotation package
    gpl %>%
      filter(!is.na(bioc_package)) %>%
      select(gpl, bioc_package, title)

    Source: sqlite 3.8.6 [GEOmetadb.sqlite]
    From: gpl [78 x 3]
    Filter: !is.na(bioc_package)

         gpl bioc_package title
       (chr) (chr)        (chr)
    1  GPL32 mgu74a       [MG_U74A] Affymetrix Murine Genome U74A Array
    2  GPL33 mgu74b       [MG_U74B] Affymetrix Murine Genome U74B Array
    3  GPL34 mgu74c       [MG_U74C] Affymetrix Murine Genome U74C Array
    4  GPL71 ag           [AG] Affymetrix Arabidopsis Genome Array
    5  GPL72 drosgenome1  [DrosGenome1] Affymetrix Drosophila Genome Array
    6  GPL74 hcg110       [HC_G110] Affymetrix Human Cancer Array
    7  GPL75 mu11ksuba    [Mu11KsubA] Affymetrix Murine 11K SubA Array
    8  GPL76 mu11ksubb    [Mu11KsubB] Affymetrix Murine 11K SubB Array
    9  GPL77 mu19ksuba    [Mu19KsubA] Affymetrix Murine 19K SubA Array
    10 GPL78 mu19ksubb    [Mu19KsubB] Affymetrix Murine 19K SubB Array
    ..   ...          ... ...

`ReadAffy()`, which works with Affymetrix only, returns an `AffyBatch` object whose annotation name, such as `hgu133plus2`, is reported by `annotation()`. The `featureData` will be null, though. Why is it null? Does that mean it does not pick up the annotation package information automatically?
If `getGEO()` is used instead, `annotation()` returns GPLxx and `featureData()` comes from GPLxx in GEO and is not empty. Thank you for showing the GPL-to-annotation-package correspondence.
update: `getGEO` will get the annotation package information if the parameter `AnnotGPL` is `TRUE` and the package exists.
update2: `getGEO` will get the updated annotation information if the parameter `AnnotGPL` is `TRUE` and the annotation file exists, like GPL570.annot.gz.

I was missing the `rma()` step; corrected. It is NULL because by default the object does not contain feature information (you can add it from the annotation package if you wish). `getGEO()` was developed later and gets its information from GEO, hence it contains the associated platform and whatever featureData is available at GEO. But, as you said, there is no direct translation. Using the info I gave you (or the link to the gist) you can find the correspondence for existing packages. If the package is not there, you may want to build one yourself with the AnnotationForge package.

About AnnotationForge: I have the probe names and sequences (for example, for GPL6480). I do not want to use the annotation from GEO; I want to update the annotation myself. However, AnnotationForge requires some kind of ID (such as a GenBank ID) alongside each probe name, so it seems that mapping is the first step. Which functions in R are usually used to map probes to genes/miRNAs?

You could map the probes to the genome using the `Biostrings` package and then annotate them by their overlap with transcripts. I did that for some arrays a few months ago. I will post a script later if I have some time... I am not sure whether the recipe is also available in some vignette or in a post on the support site.
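
In the meantime, here is a minimal sketch of the idea on toy data. It is not the script mentioned above. The "genome", the probe sequences, the gene coordinates and the names `probe_A`, `GENE1`, etc. are all made up for illustration, and only exact matches are considered. For a real array you would presumably match against a BSgenome object and overlap with gene or transcript ranges from a TxDb package instead.

```r
library(Biostrings)
library(GenomicRanges)

# Toy "genome" with a single 60-bp chromosome (hypothetical sequence).
genome <- DNAStringSet(c(
  chr1 = "ACGTTGCAAGGCTTAACCGGATCGATCGGGCTATATTTCCGGAAGTCCATGACGTTAGCA"
))

# Toy probe sequences (hypothetical); probe_C has no match in the toy genome.
probes <- DNAStringSet(c(
  probe_A = "GGCTTAACCGG",
  probe_B = "CCGGAAGTCCA",
  probe_C = "TTTTTTTTTTT"
))

# Toy gene annotation (hypothetical coordinates and symbols).
genes <- GRanges("chr1",
                 IRanges(start = c(5, 35), end = c(25, 55)),
                 symbol = c("GENE1", "GENE2"))

# Step 1: exact matching of every probe against both strands of every chromosome.
hits <- list()
for (probe_id in names(probes)) {
  for (chr in names(genome)) {
    for (strand in c("+", "-")) {
      pattern <- probes[[probe_id]]
      if (strand == "-") pattern <- reverseComplement(pattern)
      m <- matchPattern(pattern, genome[[chr]])
      if (length(m) > 0) {
        hits[[length(hits) + 1]] <- data.frame(
          seqnames = chr, start = start(m), end = end(m),
          strand = strand, probe = probe_id
        )
      }
    }
  }
}
alignments <- makeGRangesFromDataFrame(do.call(rbind, hits),
                                       keep.extra.columns = TRUE)

# Step 2: annotate each aligned probe with the gene(s) it overlaps.
ov <- findOverlaps(alignments, genes, ignore.strand = TRUE)
probe2gene <- unique(data.frame(
  probe  = alignments$probe[queryHits(ov)],
  symbol = genes$symbol[subjectHits(ov)]
))
print(probe2gene)
```

Probes that match nowhere, or that hit several genes, need a decision of your own (drop them, keep all hits, etc.). A two-column probe/ID table of this kind, with a supported ID type in place of the symbol, should be close to what AnnotationForge's `makeDBPackage()` takes as input; check its documentation for the accepted ID types.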