How to cope with arrays hybridized at significantly different times

Paparountas Triantafyllos ▴ 40

@paparountas-triantafyllos-2958

Last seen 9.6 years ago

Dear list,

I would like to have your opinions on the following subject.

In hospital studies we usually get more than 200 arrays per study. The arrays evidently differ significantly from one another because of different array batches and many other conditions, e.g. technical competence, hybridization differences over the time span of the study, circadian rhythm, whether the sample was fresh or not (i.e. different times from RNA extraction to hybridization), and others. How can we cope with the many uncontrollable factors and use 80, 200, or even more arrays in the same analysis while correcting for these uncontrollable effects?

I mostly use Affymetrix arrays (Hu133plus2, MOE Gene 1 St, Moe 430 2). Apart from Bioconductor, my current favorite software is Partek's Gene Suite (which, at least according to the manual, can correct for uncontrolled effects) and GeneSpring, because of the magnificent clustering algorithm it incorporates.

Thanks in advance.

T. Paparountas
www.bioinformatics.gr

GeneSpring GeneSpring • 1.2k views

ADD COMMENT • link updated 15.1 years ago by Steve Lianoglou ★ 13k • written 15.1 years ago by Paparountas Triantafyllos ▴ 40

Michal Okoniewski ▴ 50

@michal-okoniewski-3249

Last seen 9.6 years ago

Dear Triantafillos,

Your question describes a serious problem in real (clinical) applications of microarrays. To tell the truth, not many people have such big datasets, and many are not aware of the sources of variability, especially at the RNA extraction stage: Affy hybridization itself usually adds no more variability than the extraction conditions (the patient's stress, sample degradation, the habits and moods of the person who gathers the material and extracts the RNA). Anyway, there are some "rules of good practice" that can be applied, e.g.:

* keep precise and detailed annotation of samples; then you can use ANOVA to estimate the strength of the influencing factors
* try to extract RNA under the same or similar conditions; if that is not possible, randomize the extractions
* use as many replicates in the experiment as you can afford :)
* do not pool unless you have a really good reason for it
* define your goal and adjust the subset of your data and the types of analysis to it. E.g. if you just need an "expression signature" of 10-100 probesets, apply different methods and check how they overlap to avoid false positives; if you need an answer to a "biological question", use e.g. limma ANOVA with contrasts and play with pathways...

The list is far from complete, but I think it would be interesting to discuss good practices in applications of big microarray datasets, because this is where the science becomes really directly applicable and useful...

All the best,
Michal
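Editorial illustration (not part of the original reply): the first rule above, using ANOVA to estimate the strength of annotated factors, can be sketched roughly as follows. This is a minimal stdlib-only Python sketch, assuming a one-way decomposition per factor for a single probeset; the function name `eta_squared`, the factor names, and the expression values are all hypothetical. In practice one would more likely fit a multi-factor model across all probesets (for example with limma in R).

```python
from collections import defaultdict


def eta_squared(values, labels):
    """Fraction of the total variance in `values` explained by the grouping in `labels`.

    This is the one-way ANOVA ratio SS_between / SS_total.
    """
    grand_mean = sum(values) / len(values)
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(value)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    ss_between = sum(
        len(vs) * (sum(vs) / len(vs) - grand_mean) ** 2 for vs in groups.values()
    )
    return ss_between / ss_total if ss_total > 0 else 0.0


# Hypothetical log2 expression of one probeset on 8 arrays.
expr = [7.0, 7.2, 9.0, 9.2, 7.1, 7.3, 9.1, 9.3]

# Hypothetical sample annotation: one list of labels per recorded factor.
annotation = {
    "hyb_batch": ["A", "A", "B", "B", "A", "A", "B", "B"],
    "disease": ["ctl", "case", "ctl", "case", "ctl", "case", "ctl", "case"],
}

strength = {factor: eta_squared(expr, labels) for factor, labels in annotation.items()}
for factor, share in sorted(strength.items(), key=lambda item: -item[1]):
    print(f"{factor}: {share:.3f} of variance explained")
```

In this toy example the hybridization batch explains almost all of the variance and the disease status very little, which is the kind of pattern that good sample annotation makes visible.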

ADD COMMENT • link 15.1 years ago Michal Okoniewski ▴ 50

Good points. I would add: remember the three basic principles of experimental design:

1) Replication
2) Randomization
3) Blocking

If you have batch (or other "environmental") effects, you need multiple batches, with experimental conditions crossed with batches. Ideally, you want to randomize within batch and keep the within-batch variation as controlled as possible. A complete block design (where all experimental conditions are represented in all batches, i.e. batch = block) is probably better still. Then you have to account for the batch effect in the analysis: for example, if you are using a linear mixed model to analyze expression, you should include a batch effect (random or fixed) in it, as was suggested before.

Moreover, having repeats of the same experimental condition in each batch (example: multiple affected and control samples per batch) allows you to test for a batch*condition interaction (and if that is significant... good luck with the interpretation...).

Even if you are working with "observational data" (meaning a non-designed experiment), if you have many samples you can probably account for some sources of variation. In that case, having good annotation of "environmental conditions" is a must.

If your model (for example clustering) cannot account for multiple sources of variation, you may consider pre-whitening the data: fit a linear model with batch and other systematic effects first, then use the residuals from that model for your clustering and see whether the samples group together according to the experimental conditions of interest.

Hope this helps.

Cheers,
JP

--
Juan Pedro Steibel
Assistant Professor, Statistical Genetics and Genomics
Department of Animal Science & Department of Fisheries and Wildlife
Michigan State University
1205-I Anthony Hall, East Lansing, MI 48824 USA
Phone: 1-517-353-5102
E-mail: steibelj at msu.edu
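Editorial illustration (not part of the original reply): the pre-whitening idea above can be sketched in its simplest form, where the linear model contains batch as the only systematic effect. In that case the residuals are just the values with their per-batch mean subtracted. This is a stdlib-only Python sketch; `batch_residuals`, `group_mean`, and the data are hypothetical, and a real analysis would include further covariates and run over all probesets.

```python
from collections import defaultdict


def batch_residuals(values, batches):
    """Residuals of the one-factor linear model value ~ batch.

    With batch as the only term, the fitted value is the batch mean,
    so the residual is the value minus the mean of its batch.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for value, batch in zip(values, batches):
        totals[batch] += value
        counts[batch] += 1
    batch_mean = {b: totals[b] / counts[b] for b in totals}
    return [value - batch_mean[batch] for value, batch in zip(values, batches)]


def group_mean(values, labels, wanted):
    """Mean of the values whose label equals `wanted`."""
    picked = [v for v, label in zip(values, labels) if label == wanted]
    return sum(picked) / len(picked)


# Hypothetical complete block design: every condition appears in every batch.
batches = ["A", "A", "B", "B", "A", "A", "B", "B"]
condition = ["ctl", "case", "ctl", "case", "ctl", "case", "ctl", "case"]
probeset = [7.0, 7.2, 9.0, 9.2, 7.1, 7.3, 9.1, 9.3]  # hypothetical log2 values

residuals = batch_residuals(probeset, batches)

# The batch offset is gone; the case-vs-control shift is still in the residuals.
shift_after = group_mean(residuals, condition, "case") - group_mean(
    residuals, condition, "ctl"
)
print(round(shift_after, 3))
```

The case/control difference survives here only because conditions are balanced across batches. If a condition were confined to one batch, subtracting the batch mean would remove the biological signal as well, which is why the blocking advice above matters.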

ADD REPLY • link 15.1 years ago Juan Pedro Steibel ▴ 130

The problem with the type of studies described in the original post is that you don't really have control over the design, so experimental design principles are of limited help.

The best approach might be to apply a simple normalization to all arrays and then model potential batch effects with some meta-analytic method. Robert Gentleman, among others, has done considerable work in this area, which might serve as a starting point:

http://bioconductor.org/packages/2.3/bioc/vignettes/GeneMeta/inst/doc/GeneMeta.pdf
http://www.bepress.com/bioconductor/paper8/

-Christos

Christos Hatzis, Ph.D.
Nuvera Biosciences, Inc.
400 West Cummings Park, Suite 5350
Woburn, MA 01801
Tel: 781-938-3830
www.nuverabio.com
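Editorial illustration (not part of the original reply): one way to read the meta-analytic suggestion is to treat each batch as its own small study, estimate a standardized effect per batch, and combine the estimates with inverse-variance weights. The stdlib-only Python sketch below assumes a fixed-effects combination with the usual large-sample variance approximation for a standardized mean difference. It is not the GeneMeta API, and the function names and data are hypothetical.

```python
import math
from statistics import mean, variance


def effect_size(case, control):
    """Standardized mean difference (Cohen's d) and its approximate variance."""
    n1, n2 = len(case), len(control)
    pooled_var = ((n1 - 1) * variance(case) + (n2 - 1) * variance(control)) / (
        n1 + n2 - 2
    )
    d = (mean(case) - mean(control)) / math.sqrt(pooled_var)
    var_d = (n1 + n2) / (n1 * n2) + d * d / (2 * (n1 + n2))
    return d, var_d


def combine_fixed(effects):
    """Inverse-variance weighted mean of (d, var_d) pairs and its z-score."""
    weights = [1.0 / var_d for _, var_d in effects]
    pooled = sum(w * d for w, (d, _) in zip(weights, effects)) / sum(weights)
    standard_error = math.sqrt(1.0 / sum(weights))
    return pooled, pooled / standard_error


# Hypothetical log2 values of one probeset in two hybridization batches.
# The batches sit at very different absolute levels, but the comparison
# is always made within a batch, so the offset never enters the estimate.
per_batch = {
    "batch1": ([8.1, 8.3, 8.2, 8.4], [7.6, 7.8, 7.7, 7.9]),
    "batch2": ([10.0, 10.3, 10.1, 10.2], [9.5, 9.7, 9.6, 9.8]),
}

effects = [effect_size(case, control) for case, control in per_batch.values()]
pooled_d, z_score = combine_fixed(effects)
print(f"pooled d = {pooled_d:.2f}, z = {z_score:.2f}")
```

A random-effects model, which also allows the true effect to vary between batches, would be the natural next step when batches are heterogeneous.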

ADD REPLY • link 15.1 years ago Christos Hatzis ▴ 110

Steve Lianoglou ★ 13k

@steve-lianoglou-2771

Last seen 13 months ago

United States

Hi,

This isn't really a direct answer, but perhaps it can lead you to some helpful information.

Rafael Irizarry gave a talk at our uni near the end of last year, and I remember him mentioning that he has found it easier to predict which lab a microarray comes from than to predict which tissue it comes from, using the gene expression profiles (this is serious paraphrasing, and perhaps a gross overstatement of what he really said).

Anyway, if I remember correctly, he mentioned this while talking about his work for the gene expression barcode paper, which might be a good read to get you thinking about these problems:

http://www.nature.com/nmeth/journal/v4/n11/full/nmeth1102.html

He also spoke of frozen RMA, which I believe essentially normalizes new chips against a (potentially very large) compendium of already normalized ones. I'm not sure that it's officially available yet, but his grad student (Matthew McCall) at least lists it with an R implementation here:

http://biostat.jhsph.edu/~mmccall/research.html

So, no real answers here, just some food for thought that might give you some ideas.

Hope that helps,
-steve

--
Steve Lianoglou
Graduate Student: Physiology, Biophysics and Systems Biology
Weill Medical College of Cornell University
http://cbio.mskcc.org/~lianos
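Editorial illustration (not part of the original reply): the idea of normalizing a new chip against an existing compendium can be sketched as quantile normalization to a fixed ("frozen") reference distribution. The stdlib-only Python sketch below covers only that one step and assumes equal-length arrays with no special handling of ties. It is not the frozen RMA implementation, which as far as I know also relies on precomputed probe-level estimates; the function names and intensities are hypothetical.

```python
def build_reference(training_arrays):
    """Reference distribution: mean of the sorted intensities across training arrays."""
    sorted_arrays = [sorted(array) for array in training_arrays]
    n_arrays = len(sorted_arrays)
    return [sum(column) / n_arrays for column in zip(*sorted_arrays)]


def normalize_to_reference(intensities, reference):
    """Give each value the reference quantile that matches its rank on the array."""
    if len(intensities) != len(reference):
        raise ValueError("array and reference must have the same number of probes")
    order = sorted(range(len(intensities)), key=lambda i: intensities[i])
    reference_sorted = sorted(reference)
    normalized = [0.0] * len(intensities)
    for rank, probe_index in enumerate(order):
        normalized[probe_index] = reference_sorted[rank]
    return normalized


# Hypothetical compendium of two already processed arrays (4 probes each).
compendium = [[5.0, 7.0, 9.0, 11.0], [6.0, 8.0, 10.0, 12.0]]
frozen_reference = build_reference(compendium)  # computed once, then reused

# A new array hybridized much later, on a visibly different intensity scale.
new_array = [20.0, 14.0, 18.0, 16.0]
normalized = normalize_to_reference(new_array, frozen_reference)
print(normalized)
```

Because the reference is fixed, each new array can be processed on its own as it arrives, without re-normalizing the whole compendium.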

ADD COMMENT • link 15.1 years ago Steve Lianoglou ★ 13k
