Bioconductor data sets - Unigene IDs
@wolfgang-huber-3550
Last seen 18 days ago
EMBL European Molecular Biology Laborat…
Hi,

I send this through the mailing list since it may be of general interest. On the Bioconductor page, there are a number of data packages, such as hgu95a and hgu133a. Among other goodies, they contain XDR files with environments full of mappings from Affymetrix probe set identifiers to Unigene cluster IDs. For example, these may be used to figure out which probe sets, from the same chip or from different chips, represent a given gene.

Now here's a quote from the NCBI web page:

"... Since the sequences which make up a cluster may change from week to week, and since the cluster identifier may disappear (typically when two clusters merge) using the cluster identifier as a reference is ill-advised. Using the GB accession numbers of the sequences which comprise the cluster is a safe alternative."
(from http://www.ncbi.nlm.nih.gov/UniGene/build.shtml)

From this it sounds as if clusters only ever merge, but from what I understand there are also situations in which they may split. Hence, Unigene cluster IDs are very useful intermediate IDs for processing data at a given time, but they cannot be used as persistent identifiers. And whenever one does such processing, one has to make sure that all Unigene IDs involved refer to the same Unigene build.

Now, to come to the point: I haven't seen that information (about the Unigene build) in the data packages on Bioconductor. I would like to suggest including that information in the package data, and documenting in a prominent place how to find it, so that software using these packages can make sure the Unigene IDs are consistent.

Best regards
Wolfgang.

Dr. Wolfgang Huber
http://www.dkfz.de/abt0840/whuber
Tel +49-6221-424709
Fax +49-6221-42524709
DKFZ
Division of Molecular Genome Analysis
69120 Heidelberg
Germany
rgentleman ★ 5.5k
@rgentleman-7725
Last seen 9.0 years ago
United States
Funny that you should mention that: in the soon-to-be-released, new and improved builds you will get just that (and much more). For example, in the hgu95a package (up later this week), I get:

UniGene Date built: Build #155. <URL: ftp://ftp.ncbi.nih.gov/repository/unigene/hs.data.gz>

I agree completely that this is what we want, and we have done some work to ensure that all data sources are adequately documented. We (Jianhua and myself) would like to hear from others with similar concerns if the new format does not address the issues appropriately.

Robert

On Wed, Oct 02, 2002 at 11:33:00PM +0200, Wolfgang Huber wrote:
> [...]
> Now, to come to the point: I haven't seen that information (about the
> Unigene build) in the data packages on Bioconductor. I would like to
> suggest including that information in the package data, and documenting
> in a prominent place how to find it, so that software using these
> packages can make sure the Unigene IDs are consistent.
> [...]

--
Robert Gentleman
Associate Professor
Department of Biostatistics
Harvard School of Public Health
phone: (617) 632-5250
fax: (617) 632-2444
office: M1B20
email: rgentlem@jimmy.dfci.harvard.edu
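Editor's note: the consistency check that this exchange asks for can be sketched in a few lines. The example below is a language-neutral illustration written in Python, not code from the Bioconductor packages, which are R environments. The function name `check_same_build` and the dictionary layout are hypothetical. The build number 155 for hgu95a is taken from Robert's message; the value shown for hgu133a is assumed purely for illustration.

```python
# Hypothetical record of which UniGene build each data package was made from.
# Only the hgu95a value (155) appears in the thread; the hgu133a value is
# an assumption for the sake of the example.
package_builds = {"hgu95a": 155, "hgu133a": 155}


def check_same_build(builds):
    """Return the shared UniGene build number, or raise if packages disagree."""
    distinct = set(builds.values())
    if len(distinct) != 1:
        # Cluster IDs from different builds must not be compared directly,
        # because clusters may merge or split between builds.
        raise ValueError(f"UniGene builds differ across packages: {builds}")
    return distinct.pop()


print(check_same_build(package_builds))  # prints 155
```

Raising an error rather than returning a flag means that code joining probe sets across chips stops as soon as the builds diverge.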
Laurent Gautier ★ 2.3k
@laurent-gautier-29
Last seen 9.6 years ago
On Wed, Oct 02, 2002 at 11:33:00PM +0200, Wolfgang Huber wrote:
> From this it sounds as if clusters only ever merge, but from what I
> understand there are also situations in which they may split. Hence, Unigene
> cluster IDs are very useful intermediate IDs for processing data at a given
> time, but they cannot be used as persistent identifiers. And whenever one
> does such processing, one has to make sure that all Unigene IDs involved
> refer to the same Unigene build. Now, to come to the point: I haven't seen
> that information (about the Unigene build) in the data packages on
> Bioconductor. I would like to suggest including that information in the
> package data, and documenting in a prominent place how to find it, so that
> software using these packages can make sure the Unigene IDs are consistent.

I agree with Wolfgang. I am currently working on environments that are not completely unrelated to the ones provided by the data packages, and I would like to attach extra information to the environment. I would be happy to follow a convention for attaching that information to the environment.

Here are a few thoughts about that (just in case):
- set a class for 'Data' environments (chipData)
- set an attribute 'cdfName' (to be used with the affy package; indicates the chip it corresponds to)
- set an attribute 'releaseDate' or 'bestBefore'
- set an attribute 'size' (memory usage)
- set an attribute 'comments'

Cheers,

L.

--
Laurent Gautier
PhD. Student
CBS, Building 208, DTU
DK-2800 Lyngby, Denmark
tel: +45 45 25 24 89
http://www.cbs.dtu.dk/laurent
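Editor's note: the metadata convention listed above could look roughly like the sketch below. It is written in Python as a language-neutral illustration; the proposal itself concerns a class and attributes on R environments, and nothing here is an existing Bioconductor API. The class name `ChipData` mirrors the suggested 'chipData' class, and the fields mirror the suggested attributes. The probe set names, cluster IDs, and date are made up, and `size()` counts entries as a simple stand-in for the proposed memory-usage attribute.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ChipData:
    """Probe-set annotation plus the metadata fields proposed in the thread."""

    cdf_name: str        # chip the annotation corresponds to ('cdfName')
    release_date: date   # when the annotation was built ('releaseDate')
    comments: str = ""   # free-text notes ('comments')
    # probe set ID -> UniGene cluster ID
    mapping: dict = field(default_factory=dict)

    def size(self) -> int:
        """Number of mapped probe sets (stand-in for the 'size' attribute)."""
        return len(self.mapping)


# All values below are invented for illustration.
hgu95a = ChipData(
    cdf_name="hgu95a",
    release_date=date(2002, 10, 2),
    comments="example entry; identifiers are not real",
    mapping={"probe_A": "Hs.0001", "probe_B": "Hs.0002"},
)

print(hgu95a.cdf_name, hgu95a.size())  # prints: hgu95a 2
```

Keeping the metadata on the same object as the mapping means the two cannot drift apart when the object is saved or passed around, which is the point of attaching the information to the environment itself.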