Wolfgang Huber
EMBL European Molecular Biology Laborat…
Hi,
I am sending this through the mailing list since it may be of general interest. On the Bioconductor page, there are a number of data packages, such as hgu95a and hgu133a. Among other goodies, they contain XDR files with environments full of mappings from Affymetrix probe set identifiers to Unigene cluster IDs. For example, these may be used to figure out which probe sets, from the same or from different chips, represent a given gene.
Now here's a quote from the NCBI web page:
"... Since the sequences which make up a cluster may change from week to week, and since the cluster identifier may disappear (typically when two clusters merge) using the cluster identifier as a reference is ill-advised. Using the GB accession numbers of the sequences which comprise the cluster is a safe alternative."
(from http://www.ncbi.nlm.nih.gov/UniGene/build.shtml)
From this it sounds as if clusters only ever merge, but from what I understand there are also situations in which they may split. Hence, Unigene cluster IDs are very useful intermediate IDs for processing data at a given time, but they cannot be used as persistent identifiers. And whenever one does such processing, one has to make sure that all Unigene IDs involved refer to the same Unigene build. Now, to come to the point: I haven't seen that information (the Unigene build) in the data packages on Bioconductor. I would like to suggest including it in the package data, and documenting in a prominent place how to find it, so that software using these packages can make sure the Unigene IDs are consistent.
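To illustrate the kind of consistency check this would enable, here is a minimal sketch in Python (chosen only for illustration; it is not how the Bioconductor data packages are implemented). It assumes each mapping carries a build label; the class `ProbeMapping`, its field `unigene_build`, and the function `match_across_chips` are hypothetical names made up for this example.

```python
from dataclasses import dataclass
from typing import Dict, Set, Tuple


@dataclass
class ProbeMapping:
    """Probe set ID -> Unigene cluster ID mapping for one chip.

    `unigene_build` is a hypothetical metadata field recording which
    Unigene build the cluster IDs were taken from.
    """
    chip: str
    unigene_build: str
    probe_to_cluster: Dict[str, str]


def probes_by_cluster(mapping: ProbeMapping) -> Dict[str, Set[str]]:
    """Invert the mapping: cluster ID -> set of probe set IDs."""
    inverted: Dict[str, Set[str]] = {}
    for probe, cluster in mapping.probe_to_cluster.items():
        inverted.setdefault(cluster, set()).add(probe)
    return inverted


def match_across_chips(
    a: ProbeMapping, b: ProbeMapping
) -> Dict[str, Tuple[Set[str], Set[str]]]:
    """Return, for each cluster present on both chips, the probe sets
    from chip `a` and chip `b` that map to it.

    Refuses to compare mappings from different Unigene builds, since
    cluster IDs are only meaningful within a single build.
    """
    if a.unigene_build != b.unigene_build:
        raise ValueError(
            f"Unigene build mismatch: {a.chip} uses {a.unigene_build!r}, "
            f"{b.chip} uses {b.unigene_build!r}"
        )
    by_a = probes_by_cluster(a)
    by_b = probes_by_cluster(b)
    shared = by_a.keys() & by_b.keys()
    return {cluster: (by_a[cluster], by_b[cluster]) for cluster in shared}
```

With the build recorded next to the mapping, a caller that tries to combine two chips annotated against different builds gets an error instead of silently matching cluster IDs that may have merged or split in between.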
Best regards
Wolfgang.
Dr. Wolfgang Huber
http://www.dkfz.de/abt0840/whuber
Tel +49-6221-424709
Fax +49-6221-42524709
DKFZ
Division of Molecular Genome Analysis
69120 Heidelberg
Germany