
Bioconductor data sets - Unigene IDs


Wolfgang Huber ★ 13k

@wolfgang-huber-3550

Last seen 4 months ago

EMBL European Molecular Biology Laborat…

Hi,

I send this through the mailing list since it may be of general interest. On the bioconductor page, there are a number of data packages, such as hgu95a, hgu133a. Among other goodies, they contain XDR files with environments full of mappings from Affymetrix probe set identifiers to Unigene cluster IDs. For example, these may be used to figure out which probe sets from the same, or from different chips, represent a given gene.

Now here's a quote from the NCBI web page:

"... Since the sequences which make up a cluster may change from week to week, and since the cluster identifier may disappear (typically when two clusters merge) using the cluster identifier as a reference is ill-advised. Using the GB accession numbers of the sequences which comprise the cluster is a safe alternative."
(from http://www.ncbi.nlm.nih.gov/UniGene/build.shtml)

From this it sounds that clusters only ever merge, but from what I understand there are also situations in which they may split. Hence, Unigene cluster IDs are very useful intermediate IDs for processing data at a given time, but they cannot be used as persistent identifiers. And whenever one does such processing, one has to make sure that all Unigene-IDs involved refer to the same Unigene-Build.

Now, to come to the point, I haven't seen that information (about the Unigene build) in the data packages on Bioconductor, and I would like to suggest to include that information in the package data, and to document how to find the information in a prominent place, so software using these packages can make sure the Unigene IDs are consistent.

Best regards
Wolfgang.

Dr. Wolfgang Huber
http://www.dkfz.de/abt0840/whuber
Tel +49-6221-424709
Fax +49-6221-42524709
DKFZ
Division of Molecular Genome Analysis
69120 Heidelberg
Germany
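The consistency check asked for here can be sketched as follows. This is only an illustration in Python, not the R environments shipped in the data packages: the `ProbeMap` structure, the `unigene_build` field, and all probe set and cluster IDs are hypothetical, and it assumes each mapping records the build it was derived from.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeMap:
    """Probe set ID -> Unigene cluster ID mapping for one chip (hypothetical)."""
    chip: str
    unigene_build: int  # assumption: the build number is stored with the mapping
    probe_to_cluster: dict[str, str]


def probes_by_cluster(maps: list[ProbeMap]) -> dict[str, list[tuple[str, str]]]:
    """Group (chip, probe set) pairs by cluster ID, refusing to mix builds."""
    builds = {m.unigene_build for m in maps}
    if len(builds) > 1:
        # Cluster IDs from different builds may have merged or split,
        # so comparing them would silently give wrong groupings.
        raise ValueError(f"Unigene builds differ across mappings: {sorted(builds)}")
    grouped: dict[str, list[tuple[str, str]]] = {}
    for m in maps:
        for probe, cluster in m.probe_to_cluster.items():
            grouped.setdefault(cluster, []).append((m.chip, probe))
    return grouped


if __name__ == "__main__":
    # Made-up IDs, for illustration only.
    chip_a = ProbeMap("chipA", 155, {"probe_a1": "Hs.0001", "probe_a2": "Hs.0002"})
    chip_b = ProbeMap("chipB", 155, {"probe_b1": "Hs.0001"})
    print(probes_by_cluster([chip_a, chip_b]))
```

Failing loudly on a build mismatch is a design choice for the sketch; a real tool might instead remap IDs through GenBank accession numbers, which the NCBI quote names as the stable alternative.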


updated 23.4 years ago by Laurent Gautier ★ 2.3k • written 23.4 years ago by Wolfgang Huber ★ 13k


rgentleman ★ 5.5k

@rgentleman-7725

Last seen 10.8 years ago

United States

Funny that you should mention that: in the soon-to-be-released, new and improved builds, you will get just that (and much more). For example, in the hgu95a package (up later this week), I get:

UniGene Date built: Build #155. <URL: ftp://ftp.ncbi.nih.gov/repository/unigene/hs.data.gz>

I agree completely that this is what we want. And we have done some work to ensure that all data sources are adequately documented. We (Jianhua and myself) would like to hear from others with similar concerns, if the new format does not address the issues appropriately.

Robert

On Wed, Oct 02, 2002 at 11:33:00PM +0200, Wolfgang Huber wrote:
> [...] I haven't seen that information (about the Unigene build) in the data packages on Bioconductor, and I would like to suggest to include that information in the package data, and to document how to find the information in a prominent place, so software using these packages can make sure the Unigene IDs are consistent.

--
Robert Gentleman                  phone: (617) 632-5250
Associate Professor               fax: (617) 632-2444
Department of Biostatistics       office: M1B20
Harvard School of Public Health   email: rgentlem@jimmy.dfci.harvard.edu

23.4 years ago rgentleman ★ 5.5k


Laurent Gautier ★ 2.3k

@laurent-gautier-29

Last seen 11.4 years ago

On Wed, Oct 02, 2002 at 11:33:00PM +0200, Wolfgang Huber wrote:
> From this it sounds that clusters only ever merge, but from what I understand there are also situations in which they may split. Hence, Unigene cluster IDs are very useful intermediate IDs for processing data at a given time, but they cannot be used as persistent identifiers. And whenever one does such processing, one has to make sure that all Unigene-IDs involved refer to the same Unigene-Build. Now, to come to the point, I haven't seen that information (about the Unigene build) in the data packages on Bioconductor, and I would like to suggest to include that information in the package data, and to document how to find the information in a prominent place, so software using these packages can make sure the Unigene IDs are consistent.

I agree with Wolfgang. I am currently working on environments that are not completely unrelated to the ones provided by the Data packages, and I would like to attach extra information to the environment. I would be happy to follow a convention for attaching that information to the environment.

Here are a few thoughts about that (just in case):

- set a class for 'Data' environment (chipData)
- set an attribute 'cdfName' (to be used with the affy package; indicates the chip it corresponds to)
- set an attribute 'releaseDate' or 'bestBefore'
- set an attribute 'size' (memory usage)
- set an attribute 'comments'

Cheers,

L.

--
Laurent Gautier
PhD. Student
CBS, Building 208, DTU
DK-2800 Lyngby, Denmark
tel: +45 45 25 24 89
http://www.cbs.dtu.dk/laurent
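The convention suggested above could look roughly like the sketch below. It is written in Python rather than R purely for illustration; `ChipData`, its field names, and the `is_stale` helper are hypothetical stand-ins for the proposed class and its 'cdfName', 'releaseDate', 'size' and 'comments' attributes, not anything provided by the affy or data packages.

```python
from __future__ import annotations

import sys
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ChipData:
    """A mapping plus descriptive attributes (all names are illustrative)."""
    cdf_name: str        # which chip the mapping corresponds to
    release_date: date   # when the underlying annotation was built
    comments: str = ""
    mapping: dict[str, str] = field(default_factory=dict)

    @property
    def size(self) -> int:
        """Shallow size of the mapping container in bytes (not a deep size)."""
        return sys.getsizeof(self.mapping)

    def is_stale(self, today: date, max_age_days: int) -> bool:
        """A 'bestBefore'-style check: True once the data is older than the limit."""
        return (today - self.release_date).days > max_age_days


if __name__ == "__main__":
    # Made-up values, for illustration only.
    data = ChipData("exampleChip", date(2002, 10, 1), "demo", {"probe_a1": "Hs.0001"})
    print(data.cdf_name, data.size, data.is_stale(date(2003, 10, 1), 180))
```

Keeping the metadata on the same object as the mapping means any code that receives the mapping can inspect its provenance without a separate lookup.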

23.4 years ago Laurent Gautier ★ 2.3k
