Normalization of array data from GEO repository
Aleš Maver
@ales-maver-3556
Hi all,

I have obtained several GEO Series (GSE) entries from the GEO repository using the getGEO function (GEOquery package). Data obtained in this manner are stored as ExpressionSet objects. The problem is that I don't know how to perform quality control analyses and normalization on ExpressionSet data, because functions like expresso (affy package) work only on AffyBatch objects. Is there anything I am missing?

Also, does anyone know whether the data in the GEO repository are already normalised?

Thank you for any replies!

Ales Maver
Ales.Maver@gmail.com
@steve-lianoglou-2771
Hi,

On Jul 7, 2009, at 5:38 AM, Aleš Maver wrote:

> I have obtained several GEO Series (GSE) entries from GEO repository using
> getGEO function (GEOquery package). Data obtained in this manner is stored
> in ExpressionSet class. The problem is I don't know how to perform quality
> control analyses and normalization procedures on ExpressionSet data, because
> functions like expresso (affy package) work only on AffyBatch classes. Is
> there anything I am missing?

Sorry, I've never used the GEOquery package before, so I can't speak much to that, but I'd be surprised if there isn't an option to return your results as an AffyBatch object, because I'd dare say that you can get most of the data from GEO in its raw format (e.g., CEL files or whatever).

> And- does anyone know whether data in GEO repository is already normalised
> or not?

It depends. Sometimes you aren't given the raw files: the data may come from a custom array, or the dataset may be provided only in post-processed form (already MAS5 normalized, for example). In my experience, though, you can get the raw data for most of the experiments you find there.

Also, for array quality assessment, look into the arrayQualityMetrics package:

http://www.bioconductor.org/packages/release/bioc/html/arrayQualityMetrics.html

Hope that helps,
-steve

--
Steve Lianoglou
Graduate Student: Physiology, Biophysics and Systems Biology
Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
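Since the answer to "is it already normalised?" differs from series to series, a quick look at the value range can serve as a first sanity check before reading the data-processing notes of the submission. The following is a minimal sketch of such a heuristic, not something proposed in this thread: the function name, the 0.99 quantile, and the threshold of 100 are all assumptions, based on the observation that raw or MAS5-style intensities typically run into the thousands while log2-scale values stay far below 100. It only suggests whether values are on a log scale; it cannot tell which normalization, if any, was applied.

```python
def looks_log_transformed(values, upper_quantile=0.99, threshold=100.0):
    """Heuristic guess: True if the values appear to be on a log scale.

    The quantile and threshold are illustrative assumptions, not a standard.
    """
    # Drop missing entries (None) and NaN (NaN is the only value != itself).
    finite = sorted(v for v in values if v is not None and v == v)
    if not finite:
        raise ValueError("no usable expression values")
    # Nearest-rank position of the upper quantile, clamped to the last index.
    idx = min(len(finite) - 1, int(upper_quantile * len(finite)))
    return finite[idx] <= threshold


# Made-up example values, for illustration only.
raw_like = [35.0, 120.5, 980.0, 15230.0, 402.2, 66.1]
log_like = [5.1, 6.9, 9.9, 13.9, 8.7, 6.0]

print(looks_log_transformed(raw_like))
print(looks_log_transformed(log_like))
```

In practice the values would come from the expression matrix of the downloaded series; the lists above are placeholders.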
Hello,

just a small addendum: you may also want to have a look at the ArrayExpress package, which allows the user to retrieve data sets from the ArrayExpress database at EBI and returns the data in the form of an AffyBatch, NChannelSet, RGList or the like. Since GEO and ArrayExpress are regularly synchronized, you may be able to find your data sets of interest there as well.

Regards,
Joern
Great! Thank you for all the info and useful advice regarding arrayQualityMetrics and ArrayExpress!

Regards,
Ales
On Wed, Jul 8, 2009 at 6:16 AM, Joern Toedling wrote:

> just a small addendum: you may also want to have a look at the ArrayExpress
> package which allows the user to retrieve data sets from the ArrayExpress
> database at EBI and returns the data in form of an AffyBatch, NChannelSet,
> RGList or the like. Since GEO and ArrayExpress are regularly synchronized,
> you may be able to find your data sets of interest there as well.

Actually, ArrayExpress and GEO are NOT synchronized. There are some overlaps where investigators have submitted to both and for other reasons, but GEO is still the larger of the two and they each contain largely non-overlapping data sets.
Hi,

a caveat first: this is my understanding, and I might be quite wrong.

There is indeed no synchronization between the two databases, for lack of a common standard (each has its own flavour of MAGE-ML). In addition to investigators submitting to both repositories, ArrayExpress also imports experiments from GEO according to certain criteria. These are prefixed by 'E-GEOD' in the experiment ID. Querying ArrayExpress for these returns 5155 such experiments out of a total of 8372. GEO contains 12810 Series (experiments), so GEO does contain more data, I would say.

HTH,
James.

Sean Davis wrote:

> Actually, ArrayExpress and GEO are NOT synchronized. There are some
> overlaps where investigators have submitted to both and for other reasons,
> but GEO is still the larger of the two and they each contain largely
> non-overlapping data sets.
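To put the counts quoted above into proportion, here is a small worked calculation. It is only a sketch: it uses the three figures from James's message as given, and it assumes that each E-GEOD experiment corresponds to exactly one GEO Series, which the thread does not confirm. The variable names are illustrative.

```python
# Counts quoted in the message above (July 2009).
geo_series = 12810   # GEO Series (experiments)
ae_total = 8372      # all ArrayExpress experiments
ae_from_geo = 5155   # ArrayExpress experiments with an 'E-GEOD' prefix


def share(part, whole):
    """Return part/whole as a percentage, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# Portion of ArrayExpress that was imported from GEO.
imported_share_of_ae = share(ae_from_geo, ae_total)
# Portion of GEO Series that have an imported copy in ArrayExpress
# (assumes one E-GEOD experiment per GEO Series).
imported_share_of_geo = share(ae_from_geo, geo_series)
# ArrayExpress experiments that did not come from GEO.
ae_native = ae_total - ae_from_geo

print(imported_share_of_ae, imported_share_of_geo, ae_native)
```

Under these assumptions, roughly three fifths of ArrayExpress was imported from GEO at the time, while about two fifths of GEO Series had a copy in ArrayExpress.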
Hi,

have a look at the AE FAQ: http://www.ebi.ac.uk/microarray/doc/help/faq.html#submitter_FAQ_general

*How much over-lap is there between ArrayExpress and the Gene Expression Omnibus (GEO)?*

We import data on a weekly basis from GEO (NCBI). As a priority all GEO experiments which are in GEO datasets on catalogue Affymetrix and Agilent platforms are imported and we re-curate these before loading into ArrayExpress. We also import all GSE on these platforms and these are loaded uncurated if they pass our quality checks (e.g. no corrupt data files). All experiments imported from GEO have accession numbers in the format of E-GEOD-n, where n is a number. For more information see http://www.ebi.ac.uk/microarray/doc/help/GEO_data.html

I had a more detailed look at the "HG-U133A" chip type. There I found an overlap of more than 90%. In particular, all the new experiments are available in AE, too. When using R and Bioconductor for analyses, I found the file format in AE to be more suitable.

Best
Markus

--
Dipl.-Tech. Math. Markus Schmidberger
Ludwig-Maximilians-Universität München
IBE - Institut für medizinische Informationsverarbeitung, Biometrie und Epidemiologie
Marchioninistr. 15, D-81377 Muenchen
URL: http://www.ibe.med.uni-muenchen.de
Mail: Markus.Schmidberger [at] ibe.med.uni-muenchen.de
Tel: +49 (089) 7095 - 4497
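The E-GEOD-n accession format described in the FAQ excerpt makes it easy to derive the ArrayExpress ID to search for from a GEO Series ID. The sketch below assumes that n is simply the numeric part of the GSE accession; the FAQ text quoted here only says that n is a number, so treat that mapping as an assumption. The function names and the accession GSE1234 are hypothetical placeholders, and a derived ID exists in ArrayExpress only if the series met the import criteria.

```python
import re


def gse_to_arrayexpress(gse_accession):
    """Map a GEO Series accession (e.g. 'GSE1234') to 'E-GEOD-1234'.

    Assumes n in E-GEOD-n equals the numeric part of the GSE accession.
    """
    match = re.fullmatch(r"GSE(\d+)", gse_accession.strip().upper())
    if match is None:
        raise ValueError("not a GEO Series accession: %r" % gse_accession)
    return "E-GEOD-" + match.group(1)


def arrayexpress_to_gse(ae_accession):
    """Inverse mapping for accessions that carry the E-GEOD prefix."""
    match = re.fullmatch(r"E-GEOD-(\d+)", ae_accession.strip().upper())
    if match is None:
        raise ValueError("not a GEO-imported accession: %r" % ae_accession)
    return "GSE" + match.group(1)


# 'GSE1234' is a placeholder accession, used for illustration only.
print(gse_to_arrayexpress("GSE1234"))
print(arrayexpress_to_gse("E-GEOD-1234"))
```

The derived ID can then be used to look the experiment up in ArrayExpress, for example through the ArrayExpress package mentioned earlier in the thread.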