what's really in hgu133plus2.db?

David Iles ▴ 130

@david-iles-4487

Last seen 11.0 years ago

Dear All,

Can anyone point me to a URL where I can obtain an overview of the sources of the data incorporated in the current version of hgu133plus2.db? I saw to my horror that the actual probesets are based on a really obsolete human genome assembly (2003), which has changed significantly over the years, as have genes, gene locations, genomic intervals, RefSeq/UniGene entries, etc.

Thanks,
Dave

Dr David Iles
Institute for Integrative and Comparative Biology
University of Leeds
Leeds LS2 9JT
d.e.iles at leeds.ac.uk

ADD COMMENT • link updated 15.0 years ago by Tim Yates ▴ 250 • written 15.0 years ago by David Iles ▴ 130

Tim Triche ★ 4.2k

@tim-triche-3561

Last seen 5.4 years ago

United States

Use the _dbInfo() function to find out what your annotation package is built against. Here's the result for version 2.4.5:

    > library(hgu133plus2.db)
    > hgu133plus2_dbInfo()
       name             value
    1  DBSCHEMAVERSION  2.1
    2  DBSCHEMA         HUMANCHIP_DB
    3  ORGANISM         Homo sapiens
    4  SPECIES          Human
    5  MANUFACTURER     Affymetrix
    6  CHIPNAME         Human Genome U133 Plus 2.0 Array
    7  MANUFACTURERURL  http://www.affymetrix.com/support/technical/byproduct.affx?product=hg-u133-plus
    8  EGSOURCEDATE     2010-Sep7
    9  EGSOURCENAME     Entrez Gene
    10 EGSOURCEURL      ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
    11 CENTRALID        ENTREZID
    12 TAXID            9606
    13 GOSOURCENAME     Gene Ontology
    14 GOSOURCEURL      ftp://ftp.geneontology.org/pub/go/godatabase/archive/latest-lite/
    15 GOSOURCEDATE     20100904
    16 GOEGSOURCEDATE   2010-Sep7
    17 GOEGSOURCENAME   Entrez Gene
    18 GOEGSOURCEURL    ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
    19 KEGGSOURCENAME   KEGG GENOME
    20 KEGGSOURCEURL    ftp://ftp.genome.jp/pub/kegg/genomes
    21 KEGGSOURCEDATE   2010-Sep7
    22 GPSOURCENAME     UCSC Genome Bioinformatics (Homo sapiens)
    23 GPSOURCEURL      ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19
    24 GPSOURCEDATE     2010-Mar22
    25 IPISOURCENAME    The International Protein Index
    26 IPISOURCEURL     ftp://ftp.ebi.ac.uk/pub/databases/IPI/current
    27 IPISOURCEDATE    2010-Aug19
    28 ENSOURCEDATE     2010-Aug5
    29 ENSOURCENAME     Ensembl
    30 ENSOURCEURL      ftp://ftp.ensembl.org/pub/current_fasta
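As a follow-up to the `_dbInfo()` output, here is a minimal sketch of pulling out just the source dates and the genome build. To stay runnable without the package, it uses a hand-copied subset of the rows quoted in this thread (the `info` data frame is that copy, not a live query). With hgu133plus2.db installed, `info <- hgu133plus2_dbInfo()` should give the full table with the same `name` and `value` columns.

```r
# Hand-copied subset of the hgu133plus2_dbInfo() rows quoted in this thread.
# With the package installed, this could be replaced by:
#   library(hgu133plus2.db)
#   info <- hgu133plus2_dbInfo()
info <- data.frame(
  name  = c("DBSCHEMA", "EGSOURCEDATE", "GOSOURCEDATE",
            "GPSOURCEURL", "GPSOURCEDATE", "ENSOURCEDATE"),
  value = c("HUMANCHIP_DB", "2010-Sep7", "20100904",
            "ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19",
            "2010-Mar22", "2010-Aug5"),
  stringsAsFactors = FALSE
)

# Keep only the rows whose name ends in SOURCEDATE
source_dates <- info[grepl("SOURCEDATE$", info$name), ]
print(source_dates, row.names = FALSE)

# The last path component of the UCSC source URL names the genome build
genome_build <- basename(info$value[info$name == "GPSOURCEURL"])
print(genome_build)
```

For the 2.4.5 build shown here, the UCSC source URL ends in hg19, which suggests that the genome-position data in the package were taken from that build rather than from a 2003 assembly.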

ADD COMMENT • link 15.0 years ago Tim Triche ★ 4.2k

Marc Carlson ★ 7.2k

@marc-carlson-2264

Last seen 9.5 years ago

United States

Hi David,

All annotation data changes continuously with time as we learn more. This is why the entire annotation repository gets rebuilt twice a year (for each new release of Bioconductor).

To answer your question more directly: the majority of the data in these packages is supplied by NCBI, although other data comes from UCSC and other sources as appropriate. You can see where each individual mapping gets its data from by looking at the help page associated with it.

But the thing that you are probably most worried about is the mapping that connects the individual probesets with the genes that are annotated in these packages. Those mappings too are updated twice a year to the latest version available from Affymetrix at that time. And yes, they do change these mappings from time to time.

However, if you don't trust them, then you might be interested to know that the MBNI also has a series of annotation packages that are based on re-mapped gene-to-probeset associations. You can learn about those here:

http://brainarray.mbni.med.umich.edu/Brainarray/Service/Service.asp

Finally, if you feel really enterprising, you can also find some way to remake these mappings yourself and then use the SQLForge code in AnnotationDbi to generate a new package based on those mappings. You can find instructions for that here:

http://www.bioconductor.org/help/bioc-views/release/bioc/html/AnnotationDbi.html

Hope this helps,

Marc

ADD COMMENT • link 15.0 years ago Marc Carlson ★ 7.2k

Vincent J. Carey, Jr. 6.7k

@vincent-j-carey-jr-4

Last seen 4 days ago

United States

On Fri, Feb 18, 2011 at 11:41 AM, David Iles wrote:
> I saw to my horror that the actual probesets are based on a really obsolete human genome assembly (2003),

What do you mean by "actual probesets"? The hgu133plus2.db documentation is in the package, for example:

    help("hgu133plus2ENTREZID")

    Map between Manufacturer Identifiers and Entrez Gene

    Description:

      hgu133plus2ENTREZID is an R object that provides mappings between
      manufacturer identifiers and Entrez Gene identifiers.

    Details:

      Each manufacturer identifier is mapped to a vector of Entrez Gene
      identifiers. An 'NA' is assigned to those manufacturer identifiers
      that can not be mapped to an Entrez Gene identifier at this time.

      If a given manufacturer identifier can be mapped to different Entrez
      Gene identifiers from various sources, we attempt to select the
      common identifiers. If a consensus cannot be determined, we select
      the smallest identifier.

      Mappings were based on data provided by: Entrez Gene
      ftp://ftp.ncbi.nlm.nih.gov/gene/DATA With a date stamp from the
      source of: 2010-Sep7

    References:

      <URL: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene>

Let us know what additional details are needed, and do note your sessionInfo():

    > sessionInfo()
    R version 2.13.0 Under development (unstable) (2011-02-11 r54332)
    Platform: x86_64-unknown-linux-gnu (64-bit)

    locale:
    [1] C

    attached base packages:
    [1] stats graphics grDevices datasets utils methods base

    other attached packages:
    [1] hgu133plus2.db_2.4.5 org.Hs.eg.db_2.4.6 RSQLite_0.9-4
    [4] DBI_0.2-5 AnnotationDbi_1.13.13 Biobase_2.11.8

    loaded via a namespace (and not attached):
    [1] tools_2.13.0
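The selection rule quoted from the help page can be read as a small algorithm. The sketch below is one interpretation of that wording, not the code the annotation pipeline actually runs: `pick_entrez()` and its input format are hypothetical, the IDs are made up, and treating "smallest" as numerically smallest is an assumption.

```r
# Hypothetical helper illustrating the rule described in ?hgu133plus2ENTREZID.
# `candidates` is a list holding one character vector of Entrez Gene IDs
# per annotation source for a single probeset.
pick_entrez <- function(candidates) {
  candidates <- Filter(length, candidates)   # drop sources with no mapping
  if (length(candidates) == 0) return(NA_character_)

  common <- Reduce(intersect, candidates)    # IDs that every source agrees on
  if (length(common) > 0) return(common)

  # No consensus: fall back to the smallest ID (numeric ordering assumed)
  all_ids <- unique(unlist(candidates))
  all_ids[which.min(as.numeric(all_ids))]
}

# Made-up IDs: the two sources share "10", so that is the consensus
print(pick_entrez(list(source_a = c("10", "25"), source_b = "10")))

# Made-up IDs: the sources disagree, so the smaller ID is chosen
print(pick_entrez(list(source_a = "300", source_b = "42")))

# No source has a mapping, so the probeset is left unmapped
print(pick_entrez(list()))
```

Under this reading, a probeset can end up assigned to a gene that only one source supports, which is one more reason to inspect individual probesets behind any result that matters.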

ADD COMMENT • link 15.0 years ago Vincent J. Carey, Jr. 6.7k

James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 3 days ago

United States

Hi David,

So what exactly is the question? As you note, the chip was designed in the early 2000s, so it was necessarily based on a (now) old version of the UniGene database. That is the downfall of the expression arrays; they are stale almost from the instant they hit the market.

Since the probesets are based on things that may now be different, it is to a certain extent irrelevant how current the hgu133plus2.db data are, because the probeset --> gene mappings may be suspect. You can update the gene info all you want, but if the probeset doesn't actually measure a given transcript, then what is the point?

We base the annotation on the probeset --> Entrez Gene mappings supplied by Affymetrix, which are supposed to be updated regularly. We have not checked that, and we take no stance on the veracity of these mappings, so they are what they are. Any significant results will require close inspection of the probesets to determine if you believe that they measure what they purport to measure.

As an alternative, you can try the MBNI re-mapped probesets, which both update the mappings and remove replicate probesets (by creating single probesets per gene/transcript/etc). They can be obtained via biocLite, or individually here:

http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/CDF_download.asp

Best,

Jim

ADD COMMENT • link 15.0 years ago James W. MacDonald 68k

Jim,

Thanks for your response. Understanding exactly where a probeset is located is of fundamental importance, because it is now clear from the ENCODE project that around 90% of genome sequence is actively transcribed in a regulated way. John Mattick presented an excellent talk introducing this topic at the HGM2007 meeting in Montreal. The question then is: 'Is it mRNA or another (regulatory?) RNA species that we are measuring?' The fact that 'orphaned' probesets detect significantly up- or down-regulated transcription is extremely interesting and should not be ignored just because they now map outside 'genes' (whatever they may be; the human GNAS locus generates 59 different transcripts, some of which do not overlap).

Dave

Dr David Iles
Institute for Integrative and Comparative Biology
University of Leeds
Leeds LS2 9JT
d.e.iles at leeds.ac.uk

ADD REPLY • link 15.0 years ago David Iles ▴ 130

Hi David,

Which is the gist of my original question to you. The annotation packages we provide take the original manufacturer at their word and simply map the intended target to other annotation sources. Therefore, if you are interested in 'non-traditional' (for lack of a better term) transcripts, then how up to date the annotation databases are isn't relevant.

However, the packages that have been developed for next-gen sequencing may be of interest. The Biostrings and BSgenome.Hsapiens.UCSC.hgXX packages will allow you to very quickly align all the probesets to the genome of your choice. Then, depending on how you want to proceed, things like rtracklayer, GenomicFeatures, GenomicRanges, etc. can help discern known transcripts from possible 'other' RNA species.

Best,

Jim
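A minimal sketch of the probe-to-genome matching Jim describes, assuming the Biostrings package is installed (the code does nothing otherwise). The sequences are made-up 10-mers standing in for the 25-mer probes and for a chromosome. For real data, the probe sequences would presumably come from a probe package such as hgu133plus2probe and the subject from a BSgenome package such as BSgenome.Hsapiens.UCSC.hg19; both are assumptions and neither is loaded here. `PDict()` as used below needs all patterns to have the same width, which holds for fixed-length probes.

```r
if (requireNamespace("Biostrings", quietly = TRUE)) {
  library(Biostrings)

  # Made-up stand-in for one chromosome of a BSgenome object
  genome_chunk <- DNAString("TTGACCATGGCTAGCTAGGATCCGATTACAGGCATGCAAGT")

  # Made-up stand-ins for probe sequences (all the same width)
  probes <- DNAStringSet(c(
    probe_plus  = "GCTAGCTAGG",  # present on the forward strand
    probe_minus = "CTGTAATCGG",  # reverse complement of CCGATTACAG
    probe_none  = "AAAAAAAAAA"   # not present on either strand
  ))

  # Count exact matches on the forward strand
  plus_hits <- countPDict(PDict(probes), genome_chunk)

  # Count exact matches on the reverse strand by searching for the
  # reverse complement of each probe
  minus_hits <- countPDict(PDict(reverseComplement(probes)), genome_chunk)

  print(data.frame(probe = names(probes), plus = plus_hits, minus = minus_hits))
}
```

This only finds exact matches. Probes with no hit, or with hits outside annotated genes, are the 'orphaned' cases discussed above, and their coordinates could then be compared against known transcripts with GenomicRanges or GenomicFeatures.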

ADD REPLY • link 15.0 years ago James W. MacDonald 68k

Tim Yates ▴ 250

@tim-yates-4040

Last seen 11.5 years ago

We map the hgu133plus2 array to Ensembl as part of the xmapcore package. The mappings can be seen on the xmap browser:

http://xmap.picr.man.ac.uk/?a=HG-U133Plus2&ch=17&lay=gene&q=Tp53

Or you can install the human xmapcore database (from the downloads page of that site) into a local copy of MySQL, install the xmapcore package from Bioconductor, and map from probesets to exons, transcripts or genes.

Just another option,

Tim

ADD COMMENT • link 15.0 years ago Tim Yates ▴ 250

Hey Tim,

I discovered the old chips in the XMAP browser yesterday, when I was looking up genes with people using the old arrays... cool "legacy" feature :) Are the hits of HGU133plus2 etc. in the xmapcore database too, by any chance?

Cheers,
Michal

ADD REPLY • link 15.0 years ago Michal Okoniewski ▴ 190
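
A rough sketch of the alignment route Jim suggests in the quoted message above, using exact matching from Biostrings. The reference and probe sequences below are invented toy data so that the example is self-contained. With real data the probes would presumably come from a probe-sequence package such as hgu133plus2probe and the subject from a chromosome in a BSgenome.Hsapiens.UCSC.hgXX package; that substitution is an assumption here, not something shown in the thread.

```r
library(Biostrings)

# Toy stand-in for one chromosome; with real data this would be a
# chromosome taken from a BSgenome.Hsapiens.UCSC.hgXX package (assumption).
ref <- DNAString("ACGTTGCAAGGCTTACGGATCCATGCAAGTCCGATAGCTAGGCTTACGGA")

# Toy 10-mer "probes" (invented; real Affymetrix probes are 25-mers).
probes <- DNAStringSet(c(p1 = "AGGCTTACGG",
                         p2 = "GGATCCATGC",
                         p3 = "TTTTTTTTTT"))

# Preprocess the probes into a dictionary, then find every exact match.
# PDict() needs all probes to have the same width.
pd    <- PDict(probes)
fwd   <- matchPDict(pd, ref)
n_fwd <- elementNROWS(fwd)   # number of forward-strand hits per probe

# The minus strand can be searched by matching the reverse complements.
n_rev <- countPDict(PDict(reverseComplement(probes)), ref)

# Probes with no hit, or with more than one, deserve a closer look.
hit_summary <- data.frame(probe    = names(probes),
                          fwd_hits = as.integer(n_fwd),
                          rev_hits = as.integer(n_rev))
print(hit_summary)
```

The resulting match positions could then be turned into genomic ranges and compared against known transcript models with the GenomicRanges and GenomicFeatures packages Jim mentions, to separate probes that sit in annotated transcripts from those that do not.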


Hi Michal :)

Yeah, the HGU133plus2 array went into the xmapcore database as of version 58, and the HuGene-1_0-st-v1 (Human Gene 1.0 ST Array plus 2) went in at version 60 (the current latest version).

In the xmapcore package, the method to set the array you are using is xmap.array.type(). This will present you with a list of available arrays (or you can specify one by name if you wish).

Then, for every call, there is an internal method which works out whether the array type is required as a parameter and, if so, appends it to the calls to the stored procedures (it was done this way to try to present a simpler API to the end user).

In the database, the probeset table has an array_id field which maps probesets to arrays, and the hits can be found by joining the probeset table to the hit table via the probemap table where array_id is the array of interest.

Obviously, there is quite a bit of overlap in the probe design of these arrays, so some probes appear in multiple arrays (which is why the probemap table is larger than the probe table).

Hope this helps :-)

Tim

On 19/02/2011 10:15, "Michal Okoniewski" <michal.okoniewski at fgcz.ethz.ch> wrote:

> Hey Tim,
>
> I discovered the old chips in XMAP browser yesterday, when I was looking for genes with people using the old arrays... cool "legacy" feature :)
> Are the hits of HGU133plus2 etc in the xmapcore database by chance too?
>
> Cheers,
> Michal
> ________________________________________
> From: bioconductor-bounces at r-project.org [bioconductor-bounces at r-project.org] on behalf of Tim Yates [TYates at picr.man.ac.uk]
> Sent: Saturday, February 19, 2011 12:03 AM
> To: d.e.iles at leeds.ac.uk
> Cc: mailman, bioconductor
> Subject: Re: [BioC] what's really in hgu133plus2.db?
>
> We map the hgu133plus2 array to ensembl as part of the xmapcore package.
>
> The mappings can be seen on the xmap browser
>
> http://xmap.picr.man.ac.uk/?a=HG-U133Plus2&ch=17&lay=gene&q=Tp53
>
> Or you can install the human xmapcore database (from the downloads page of that site) into a local copy of mysql, install the xmapcore package from bioconductor, and map from probesets to exons, transcripts or genes.
>
> Just another option,
>
> Tim
>
> ----- Reply message -----
> From: "James W. MacDonald" <jmacdon at med.umich.edu>
> Date: Fri, Feb 18, 2011 20:42
> Subject: [BioC] what's really in hgu133plus2.db?
> To: "David Iles" <d.e.iles at leeds.ac.uk>
> Cc: "bioconductor at r-project.org" <bioconductor at r-project.org>
>
> Hi David,
>
> On 2/18/2011 2:58 PM, David Iles wrote:
>> Jim,
>>
>> Thanks for your response. The point of understanding exactly where a probeset is located is of fundamental importance because it is now clear from the ENCODE project that around 90% of genome sequence is actively transcribed in a regulated way - John Mattick presented an excellent talk introducing this topic at the HGM2007 meeting in Montreal. The question then is: 'is it mRNA or another (regulatory?) RNA species that we are measuring?'. The fact that 'orphaned' probesets detect significantly up- or down-regulated transcription is extremely interesting and should not be ignored just because they now map outside 'genes' (whatever they may be - the human GNAS locus generates 59 different transcripts, some of which do not overlap).
>
> Which is the gist of my original question to you. The annotation packages we provide take the original manufacturer at their word and simply map the intended target to other annotation sources. Therefore, if you are interested in 'non-traditional' (for lack of a better term) transcripts, then the updated status of the annotation databases isn't relevant.
>
> However, the packages that have been developed for next-gen sequencing may be of interest. The Biostrings and BSgenome.Hsapiens.UCSC.hgXX packages will allow you to very quickly align all the probesets to the genome of your choice. Then, depending on how you want to proceed, things like rtracklayer, GenomicFeatures, GenomicRanges, etc. can help discern known transcripts from possible 'other' RNA species.
>
> Best,
>
> Jim

ADD REPLY • link 15.0 years ago Tim Yates ▴ 250
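
A minimal sketch of the join Tim describes above (probeset to hit, via probemap, restricted to one array_id), written in base R on toy data frames so it runs without the xmapcore MySQL database. Only the table names and the array_id field come from Tim's post; every other column name, every ID and every coordinate below is hypothetical, and the real schema may well differ.

```r
# Toy stand-ins for the three tables. Column names other than array_id,
# and all values, are invented for illustration.
probeset <- data.frame(
  probeset_id = c(1L, 2L, 3L),
  name        = c("ps_A", "ps_B", "ps_C"),
  array_id    = c(10L, 10L, 20L),      # two hypothetical arrays
  stringsAsFactors = FALSE
)

# Links probesets to probes. Probe 101 belongs to several probesets and
# to both arrays, which is why this table has more rows than there are probes.
probemap <- data.frame(
  probeset_id = c(1L, 1L, 2L, 3L, 3L),
  probe_id    = c(100L, 101L, 101L, 101L, 102L)
)

# One genomic hit per probe in this toy example.
hit <- data.frame(
  probe_id = c(100L, 101L, 102L),
  chr      = c("17", "17", "17"),
  start    = c(1000L, 2000L, 3000L),
  stringsAsFactors = FALSE
)

# probeset -> probemap -> hit, keeping only the array of interest.
hits_for_array <- function(array_of_interest) {
  ps      <- probeset[probeset$array_id == array_of_interest, ]
  via_map <- merge(ps, probemap, by = "probeset_id")
  merge(via_map, hit, by = "probe_id")
}

print(hits_for_array(10L))
```

Filtering on array_id before joining keeps the shared probe (101 here) from pulling in probesets that belong to a different array.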
