Affy normalization question


Mark W Kimpel ▴ 830

@mark-w-kimpel-2027

Last seen 11.2 years ago

Not infrequently on this list the question arises as to how to perform RMA on a large number of CEL files. The simple answer, of course, is to use "justRMA" or buy more RAM.

As I have learned more about the wet-lab side of microarray experiments, it has come to my attention that there is a technical limitation in our lab as to how many chips can actually be run at one time, and that there is a substantial batch effect between batches.

So, in my case at least, it seems to me that it would be incorrect to normalize 60 CEL files at once when in fact they have been run in 4 batches of 16. Would it not be better to normalize them separately, within-batch, and then include a batch effect in an analytical model?

Is my situation unique or, in fact, is this the way most MA wet-labs are set up? If the latter is correct, should the recommendation be not to run justRMA on all 80 CEL files at once if they have been run in batches?

Thanks,
Mark

--
Mark W. Kimpel MD ** Neuroinformatics ** Dept. of Psychiatry
Indiana University School of Medicine
15032 Hunter Court, Westfield, IN 46074
(317) 490-5129 Work, & Mobile & VoiceMail
(317) 204-4202 Home (no voice mail please)
mwkimpel<at>gmail<dot>com


ADD COMMENT • link updated 18.0 years ago by James W. MacDonald 68k • written 18.0 years ago by Mark W Kimpel ▴ 830


James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 2 days ago

United States

Hi Mark,

Mark W Kimpel wrote:
> Not infrequently on this list the question arises as to how to perform RMA on a large number of CEL files. The simple answer, of course, is to use "justRMA" or buy more RAM.
>
> As I have learned more about the wet-lab side of microarray experiments it has come to my attention that there is a technical limitation in our lab as to how many chips can actually be run at one time and that there is a substantial batch effect between batches.
>
> So, in my case at least, it seems to me that it would be incorrect to normalize 60 CEL files at once when in fact they have been run in 4 batches of 16. Would it not be better to normalize them separately, within-batch, and then include a batch effect in an analytical model?

Ideally you would randomize the samples when you are processing them (we randomize at four different steps) so you don't have batches that are processed together all the way through.

Whether or not you fit a batch effect in a linear model depends on how the samples were processed. If the lab processed all the same type of samples in each of the batches (please say they didn't), then any batch effect will be aliased with the sample types and fitting an effect won't really help.

If the batches were at least semi-randomized, then with 60 samples you won't be losing that many degrees of freedom, and it probably won't hurt to do so, and it just might help.

> Is my situation unique or, in fact, is this the way most MA wet-labs are set up? If the latter is correct, should the recommendation not be to use justRMA on 80 CEL files if they have been run in batches?

Regardless of how the lab is set up, once you get to large sample sets there will always be batches. If you do proper randomization of the samples during processing IMO there should be no need to do any post-processing adjustments for the batches.

Best,

Jim

--
James W. MacDonald, M.S.
Biostatistician
Affymetrix and cDNA Microarray Core
University of Michigan Cancer Center
1500 E. Medical Center Drive
7410 CCGC
Ann Arbor MI 48109
734-647-5623
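The aliasing point in this reply can be checked mechanically. What follows is a minimal sketch in Python, not limma code: it builds a dummy-coded design matrix (intercept, treatment, batch) and reports whether it is rank deficient. The function names and the four-sample layouts are hypothetical, chosen only for illustration.

```python
from fractions import Fraction


def design_matrix(treatment, batch):
    """Build an intercept + treatment + batch dummy-coded design matrix."""
    t_levels = sorted(set(treatment))[1:]  # first level acts as the reference
    b_levels = sorted(set(batch))[1:]
    rows = []
    for t, b in zip(treatment, batch):
        row = [1] + [int(t == lv) for lv in t_levels] + [int(b == lv) for lv in b_levels]
        rows.append([Fraction(x) for x in row])
    return rows


def matrix_rank(rows):
    """Rank by exact Gaussian elimination (Fractions avoid tolerance issues)."""
    m = [r[:] for r in rows]
    rank = 0
    for col in range(len(m[0])):
        pivot = next((i for i in range(rank, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue  # no pivot in this column: it depends on earlier columns
        m[rank], m[pivot] = m[pivot], m[rank]
        for i in range(len(m)):
            if i != rank and m[i][col] != 0:
                factor = m[i][col] / m[rank][col]
                m[i] = [a - factor * b for a, b in zip(m[i], m[rank])]
        rank += 1
    return rank


def batch_is_aliased(treatment, batch):
    """True when the design is rank deficient, so the effects cannot be separated."""
    x = design_matrix(treatment, batch)
    return matrix_rank(x) < len(x[0])


# Hypothetical layout 1: each batch holds a single sample type.
confounded = batch_is_aliased(["ctl", "ctl", "trt", "trt"], ["b1", "b1", "b2", "b2"])
# Hypothetical layout 2: each batch holds both sample types.
balanced = batch_is_aliased(["ctl", "trt", "ctl", "trt"], ["b1", "b1", "b2", "b2"])
```

When the check returns True, no model can separate batch from sample type. When it returns False, adding batch as a factor costs one degree of freedom per batch beyond the first.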

ADD COMMENT • link 18.0 years ago James W. MacDonald 68k


This is interesting; after the fact in our case, I guess, but interesting.

We did 3 Duroc and 3 Yorkshire pigs, with a shallow and a deep wound on each, and the wounds were biopsied at 1, 2, 3, 12, and 20 weeks. So 60 samples were obtained over 10 months, 6 samples at a time. We pretty much processed them as we went, so it was rather like 10 batches of 6 each over 10 months. And then we normalized them all together.

Should we have done something for batches? Did we miss something?

Thank you,

--
Loren Engrav
Univ Washington
Seattle

ADD REPLY • link 18.0 years ago Loren Engrav ★ 1.0k


Jim,

My understanding is that our lab normally randomizes by

1. treatment
2. RNA extraction
3. labeling
4. hybridization

In addition, we sometimes have multiple brain regions, and, for the purpose of the MA run, each region is treated as an independent experiment, so there is no randomization across brain regions for the above factors.

My question arises because of two recent situations. First, in one experiment, for a reason not clear to me, the labeling and hybridization groups were combined, and there is a clear batch effect when this labeling-hybridization factor is put into Limma. In such a case, would separate normalization be suggested? It will make the batch effect larger, but that would seem to be addressed by using the batch effect as a factor.

Secondly, in another experiment I need to perform an analysis across 5 brain regions to look for overall gene expression differences resulting from genetic differences between strains. In that experiment the 4 factors mentioned at the beginning were randomized, so there is no batch effect within brain region, but there is across brain regions. In this experiment I am not trying to find differences across brain regions, which would be impossible to separate out from a batch effect, but rather between two treatments that are independent of brain region. One way I have done this in the past has been to simply average all 5 brain regions together to come up with an average-brain expression measure, but I wonder if it would be better to put brain region in as a factor. Regardless of whether I average or not, I need to decide whether to normalize all brain regions together or, because they were run as separate MA experiments, to normalize them individually.

Really, the question seems to be whether RMA should be used on a group of CEL files in the presence of a non-chip-related batch effect; if so, whether it will make a batch effect "go away" (not in my experience); and if not, how to incorporate the batch effect in a model.

Finally, I realize that by randomizing at each step mentioned at the top, one spreads any variance out so that it cannot be picked up with a batch effect. With the "n" we usually use, if one were to take each of the 4 factors into account one usually would run out of degrees of freedom. Nevertheless the variance induced at each step of the wet-lab is there; it is just not apparent and presumably doesn't induce bias. It does, however, decrease power, and I wonder if it wouldn't be better to block by treatment, so that equal numbers from each treatment are in a group, but then each group is processed totally together. There the batch effect would be large, but it would be present as only one factor, which with large enough "n" one could take into account in a statistical model. That, it seems, might increase power to detect differential expression. Maybe this is counter-intuitive, and it would probably only work if "n" were large enough to provide enough degrees of freedom, but it makes some sense to me. Am I nuts? (Many people think so, so don't be shy about saying so ;) ).

Thanks so much for your helpful input,
Mark
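The blocking scheme proposed in this message (equal numbers from each treatment in every group, with each group then processed together) can be sketched as a stratified random assignment. This is only an illustration under stated assumptions: the function name, the `(sample_id, treatment)` input layout, and the 16-sample, 4-batch example are all hypothetical and do not describe the lab's actual procedure.

```python
import random
from collections import Counter, defaultdict


def assign_batches(samples, n_batches, seed=0):
    """Randomly assign samples to batches, balancing treatments across batches.

    `samples` is a list of (sample_id, treatment) pairs (hypothetical layout).
    Returns a dict mapping sample_id to a batch index in range(n_batches).
    """
    rng = random.Random(seed)
    by_treatment = defaultdict(list)
    for sample_id, treatment in samples:
        by_treatment[treatment].append(sample_id)

    assignment = {}
    offset = 0
    for treatment in sorted(by_treatment):
        ids = by_treatment[treatment]
        rng.shuffle(ids)  # random order within each treatment
        for i, sample_id in enumerate(ids):
            # deal the shuffled samples round-robin so batches get equal shares
            assignment[sample_id] = (i + offset) % n_batches
        offset += len(ids)  # start the next treatment where this one stopped
    return assignment


# Hypothetical example: 8 control and 8 treated samples in 4 batches.
samples = [(f"s{i:02d}", "ctl" if i < 8 else "trt") for i in range(16)]
assignment = assign_batches(samples, n_batches=4, seed=1)
per_cell = Counter((assignment[sid], trt) for sid, trt in samples)
```

With these inputs every batch receives two samples of each treatment, so a single batch factor is estimable alongside treatment rather than aliased with it.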

ADD REPLY • link 18.0 years ago Mark W Kimpel ▴ 830


Mark W Kimpel wrote:
> Jim,
>
> My understanding is that our lab normally randomizes by
> 1. treatment
> 2. RNA extraction
> 3. labeling
> 4. hybridization
>
> In addition, we sometimes have multiple brain regions, and, for the purpose of the MA run, each region is treated as an independent experiment, thus there is no randomization across brain regions for the above factors.
>
> My question arises because of two recent situations. First, in one experiment, for a reason not clear to me, the labeling and hybridization groups were combined and there is a clear batch effect when this labeling-hybridization factor is put into Limma. In such a case, would separate normalization be suggested? It will make the batch effect larger, but would seem to be addressed by using the batch-effect as a factor.

I think there are two different questions here: first, when should one normalize things separately, and second, when should a batch effect be used?

For me, it takes a lot to want to run RMA separately on chips that were all processed in a single facility. In general, the normalization is intended to address technical differences between samples while retaining biological differences, so unless I can see some large differences between the sample distributions or I think that most genes will be differentially expressed between samples, I would tend to process them all together.

> Secondly, in another experiment I need to perform an analysis across 5 brain regions to look for overall gene expression differences resulting from genetic differences between strains. In that experiment the 4 factors mentioned at the beginning were randomized for so there is no batch effect within-brain region, but there is across brain region. In this experiment I am not trying to find differences across brain regions, which would be impossible to separate out from a batch effect, but rather between two treatments that are independent of brain region. One way I have done this in the past has been to simply average all 5 brain regions together to come up with an average-brain expression measure, but, I wonder if it would be better to put brain region in as a factor. Regardless of whether I average or not, I need to decide whether to normalize all brain regions together or, because they were run as separate MA experiments, to normalize them individually.

This is a situation where it makes sense to me to add a brain region effect so you are in effect blocking on brain region. I think it makes much less sense to average over all regions. In this case it might make sense to normalize separately, but I wonder just how different the expression of each region might be. I usually look at NUSE plots to see if I think the normalization should be done separately or not. If the NUSE plot looks reasonable, then I figure the model is fitting the data OK, so why bother with separate normalizations? Then again, we ran over 1800 chips last year, so I don't have a lot of time to ponder a given analysis. ;-D

> Really, the question seems to be whether RMA should be used on a group of CEL files in the presence of a non-chip related batch effect, if so, will it make a batch effect "go away" (not from my experience), and then if not, how to incorporate the batch effect in a model.
>
> Finally, I realize that by randomizing at each step mentioned at the top, one spreads any variance out so that it cannot be picked up with a batch effect. With the "n" we usually use, if one were to take each of the 4 factors into account one usually would run out of degrees of freedom. Nevertheless the variance induced at each step of the wet-lab is there, it is just not apparent and presumably doesn't induce bias. It does, however, decrease power, and I wonder if it wouldn't be better to block by treatment, so that equal numbers from each treatment are in a group, but that then each group is processed totally together. There the batch effect would be large, but it would be present as only one factor, which with large enough "n" one could take into account in a statistical model. That, it seems, might increase power to detect differential expression. Maybe this is counter-intuitive, and would probably only work if "n" were large enough to provide enough degrees of freedom, but it makes some sense to me. Am I nuts? (many people think so, so don't be shy about saying so ;) ).

Doing things that way is a split-plot design, and I don't recall anybody advocating batch effects for the plots in a split-plot design. But a split-plot design is intended for situations where you can only randomize at one step. I would tend to want to mix things up more, but others may have different opinions.

Best,

Jim

--
James W. MacDonald, MS
Biostatistician
UMCCC cDNA and Affymetrix Core
University of Michigan
1500 E Medical Center Drive
7410 CCGC
Ann Arbor MI 48109
734-647-5623
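Quantile normalization is the normalization step inside RMA, so a small sketch of it may help show what normalizing chips together actually enforces. The code below is a simplified pure-Python illustration, not the affy implementation: the intensities are made up, ties are broken by position rather than averaged, and the function name is hypothetical.

```python
def quantile_normalize(arrays):
    """Quantile-normalize equal-length arrays (one list of intensities per chip).

    Simplification: ties are broken by position; real implementations
    typically average the reference values of tied ranks.
    """
    n = len(arrays[0])
    sorted_cols = [sorted(a) for a in arrays]
    # reference distribution: mean of the k-th smallest value across chips
    reference = [sum(col[k] for col in sorted_cols) / len(arrays) for k in range(n)]
    normalized = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # probe indices, low to high
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = reference[rank]  # probe keeps its rank, takes reference value
        normalized.append(out)
    return normalized


# Toy log2 intensities for 4 probes on 3 chips (made-up numbers).
chips = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 4.5, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]
normalized = quantile_normalize(chips)
```

After this step every chip has the same set of values, and each probe keeps its within-chip rank. A shift that affects particular probes in one batch changes ranks rather than the overall distribution, so it can survive joint normalization, which is consistent with the observation quoted above that RMA does not make a batch effect go away and with handling it as a factor in the model instead.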

ADD REPLY • link 18.0 years ago James W. MacDonald 68k


Jim,

Thanks for your helpful advice. I'll be taking a few days off for Christmas and will dig into this again when I return. In the meantime, Merry Christmas/Happy Holidays to you and all on the BioC list who are celebrating.

Mark

ADD REPLY • link 17.9 years ago Mark W Kimpel ▴ 830
