Bornman, Daniel M
Dear BioC,
I am finding inconsistencies between my annotation results from
Bioconductor and from NetAffx. This post is long, and I apologize in
advance, but I did a lot of work and need to describe my process fully.
I am analyzing the Affymetrix mouse430_2 chip and have 246 probes
called as differentially expressed. I have annotated this list with
Gene Ontology data using the various Bioconductor packages (mouse4302,
GO, annotate, GOstats, etc.). My first step was to generate all
biological process GO ids that are mapped to these probes using
mouse4302PROBE2GO{mouse4302} (the documentation states this is done
via Entrez ids: probe id -> Entrez id -> GO id(s)). Next I generated
a list of all the ancestors of the returned GO ids using
GOBPANCESTOR{GO}. I can now use these results to build a list of all
GO ids that are either directly mapped to my probes or are ancestors,
in the Gene Ontology tree, of these directly mapped GO ids. In general,
this is the basic scheme for building a list of nodes to test for
significant association with a probe list using the hypergeometric
calculation, phyper{stats}.
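For readers less familiar with the test, here is a minimal sketch of the hypergeometric upper tail for one GO node, written in Python with only the standard library. The function and parameter names are assumptions made for illustration. It should correspond to phyper(hits - 1, node_size, universe_size - node_size, list_size, lower.tail = FALSE) in R.

```python
from math import comb

def hypergeom_upper_tail(hits, node_size, universe_size, list_size):
    """P(X >= hits) for one GO node.

    universe_size: probes on the chip that have any annotation
    node_size:     probes annotated to this node (directly or via offspring)
    list_size:     probes in the differentially expressed list
    hits:          probes in the list that are annotated to this node
    """
    max_hits = min(node_size, list_size)
    total = comb(universe_size, list_size)
    # Sum the exact probabilities of seeing hits, hits + 1, ... max_hits.
    tail = sum(
        comb(node_size, k) * comb(universe_size - node_size, list_size - k)
        for k in range(hits, max_hits + 1)
    )
    return tail / total

# Toy numbers only: 3 of 10 listed probes hit a node of size 5 in a universe of 100.
print(hypergeom_upper_tail(3, 5, 100, 10))
```

A small p-value here suggests the node is over-represented in the list; correcting for the many nodes tested is a separate step that this sketch does not cover.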
In parallel, I also uploaded this list of 246 probe ids to the NetAffx
web app to generate a listing of all biological process GO ids
associated with it. The "all_values" list of GO ids from NetAffx
should be the same list generated above: all directly mapped GO ids
plus all their ancestors. However, they are not the same. Comparing
the two lists of biological process GO ids revealed a great deal of
disparity.
To get to the bottom of this, I decided to build the list myself
from two main reference files: 1) the Affymetrix library file for the
mouse430_2 chip and 2) the latest Gene Ontology reference file
("gene_ontology_obo.txt") from http://www.geneontology.org/.
The Affy library file is a csv table of probe ids with tons of
annotation, including GO ids and terms. For many probe ids, several GO
ids from each GO category are associated with a single probe id. I
parsed this file and restructured it so that, for each GO category, I
created a separate tab-delimited file matching each GO id to its probe
id. Since a single GO id can have (and usually does have) many probe
ids associated with it, each row in the parsed file contains a unique
GO-to-probe pairing.
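The restructuring step could look roughly like the Python sketch below. It rests on several assumptions that are not stated in the post: that the csv has columns named "Probe Set ID" and "Gene Ontology Biological Process", that entries within a cell are separated by "///" with fields separated by "//", and that "---" marks an empty cell. The sample rows are made up.

```python
import csv
import io

def go_probe_pairs(csv_text, column):
    """Return sorted, unique (GO id, probe id) pairs for one GO category."""
    pairs = set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        probe = row["Probe Set ID"]            # assumed column name
        cell = row.get(column) or ""
        if cell.strip() in ("", "---"):        # assumed "no annotation" marker
            continue
        for entry in cell.split("///"):        # assumed entry separator
            raw_id = entry.split("//")[0].strip()
            if raw_id.isdigit():
                # Normalize to the GO:0000000 form used in the ontology file.
                pairs.add(("GO:" + raw_id.zfill(7), probe))
    return sorted(pairs)

# Made-up sample in the assumed layout.
SAMPLE = '''"Probe Set ID","Gene Ontology Biological Process"
"probe_A","6355 // term one // evidence /// 8152 // term two // evidence"
"probe_B","8152 // term two // evidence"
"probe_C","---"
'''

# One tab-delimited GO-to-probe pairing per output row.
for go_id, probe in go_probe_pairs(SAMPLE, "Gene Ontology Biological Process"):
    print(go_id, probe, sep="\t")
```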
Next, I parsed the Gene Ontology master file of all known GO ids into
a lookup table. The gene_ontology_obo file lists each GO id as a record
and gives its "is_a:" or "part_of:" information. These relationships
were used to build the lookup table.
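A child-to-parents lookup table of this kind could be built along the lines of the Python sketch below. It assumes the usual OBO flat-file layout, where each record starts with "[Term]", "is_a:" lines name a parent, and part_of appears as "relationship: part_of <id>". The function name and the toy ids (T:...) are made up for illustration.

```python
def parse_obo_parents(obo_text):
    """Map each term id to the set of its direct is_a / part_of parents."""
    parents = {}
    current = None
    in_term = False
    part_of = "relationship: part_of "
    for line in obo_text.splitlines():
        line = line.strip()
        if line.startswith("["):
            # A new stanza begins; only [Term] stanzas are of interest.
            in_term = (line == "[Term]")
            current = None
        elif not in_term:
            continue
        elif line.startswith("id: "):
            current = line[4:].strip()
            parents.setdefault(current, set())
        elif current and line.startswith("is_a: "):
            # Drop the trailing "! term name" comment.
            parents[current].add(line[6:].split("!")[0].strip())
        elif current and line.startswith(part_of):
            parents[current].add(line[len(part_of):].split("!")[0].strip())
    return parents

# Toy ontology with made-up ids.
TOY_OBO = """format-version: 1.0

[Term]
id: T:0000001
name: root

[Term]
id: T:0000002
name: middle
is_a: T:0000001 ! root

[Term]
id: T:0000003
name: leaf
is_a: T:0000002 ! middle
relationship: part_of T:0000001 ! root

[Typedef]
id: part_of
name: part of
"""

print(parse_obo_parents(TOY_OBO))
```

Every term gets a key, even when it has no parents, so membership in the table can double as a check that an id is known to the ontology file.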
Now, with a few simple steps, I can take my probe id list, find all the
directly mapped GO ids, and then use these GO ids to find all their
ancestor ids. This returns a list of GO ids that I can use in
hypergeometric calculations to find significantly represented GO ids
associated with my differentially expressed probe list. But first,
how does my list compare with the lists from the BioC packages and
NetAffx?
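Those few steps can be sketched in Python as follows. The inputs are assumed to be a list of (GO id, probe id) pairs and a child-to-parents table; all names and the toy data are made up for illustration.

```python
def ancestors(go_id, parents):
    """All ids reachable by following parent links upward from go_id."""
    seen = set()
    stack = [go_id]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def nodes_to_test(probe_list, pairs, parents):
    """Directly mapped GO ids for the probes, plus all of their ancestors."""
    wanted = set(probe_list)
    direct = {go_id for go_id, probe in pairs if probe in wanted}
    result = set(direct)
    for go_id in direct:
        result |= ancestors(go_id, parents)
    return result

# Toy data with made-up ids.
TOY_PARENTS = {
    "T:1": set(),
    "T:2": {"T:1"},
    "T:3": {"T:2"},
    "T:4": {"T:1"},
}
TOY_PAIRS = [("T:3", "probe_A"), ("T:4", "probe_B"), ("T:4", "probe_C")]

print(sorted(nodes_to_test(["probe_A"], TOY_PAIRS, TOY_PARENTS)))
```

The walk keeps a set of visited ids, so a term reached through several paths (GO is a graph, not a strict tree) is only counted once.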
My list closely matches the NetAffx list, but there are some small
differences. My list returned 674 biological process GO ids; NetAffx
returned 671. Of these, 665 GO ids were identical between the two, 9
were unique to my list, and 6 were unique to the NetAffx list. When I
investigated why NetAffx found something I did not, I discovered that
the 6 GO ids were incorrectly called. These six NetAffx-specific GO ids
are either terminal nodes that do not map to any of my probes according
to the Affymetrix library file, or ancestor nodes that do not have any
offspring mapped to my probes; one GO id is not recognized by the
Gene Ontology website.
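A comparison and check of this kind could be sketched in Python as below. The function names, category labels, and toy ids are assumptions for illustration; the parent table is assumed to hold every known term as a key.

```python
def compare_id_sets(mine, other):
    """Split two id lists into shared ids and ids unique to each side."""
    mine, other = set(mine), set(other)
    return {
        "shared": mine & other,
        "only_mine": mine - other,
        "only_other": other - mine,
    }

def closure(ids, parents):
    """The given ids plus everything reachable through parent links."""
    seen = set(ids)
    stack = list(ids)
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def diagnose(go_id, direct_ids, parents):
    """Say whether an id from the other list is backed by the mapped probes."""
    if go_id not in parents:
        return "unknown to ontology file"
    if go_id in closure(direct_ids, parents):
        return "supported"          # directly mapped, or has mapped offspring
    return "unsupported"            # neither mapped nor an ancestor of a mapped id

# Toy data with made-up ids.
TOY_PARENTS = {"T:1": set(), "T:2": {"T:1"}, "T:3": {"T:2"}, "T:4": {"T:1"}}
TOY_DIRECT = {"T:3"}

diff = compare_id_sets(closure(TOY_DIRECT, TOY_PARENTS), ["T:1", "T:2", "T:4", "T:9"])
for go_id in sorted(diff["only_other"]):
    print(go_id, diagnose(go_id, TOY_DIRECT, TOY_PARENTS))
```

The counts reported above are at least internally consistent: 665 + 9 = 674 and 665 + 6 = 671.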
So, the NetAffx application gives some incorrect results, and I don't
know what is going on with the BioC method (very different results).
If anyone has actually read my post and was able to follow along, do
you think this could be done correctly within a BioC package? The chip
annotation package {mouse4302} and the {GO} package should be able to
do this, but maybe they are outdated. Could I use {AnnBuilder} to
update my mouse4302 package? What about the GO package?
Thanks all,
Daniel Bornman
Research Scientist
Battelle Memorial Institute
Department of Statistical and Information Analysis
Columbus, OH 43201