Not infrequently on this list the question arises as to how to perform
RMA on a large number of CEL files. The simple answer, of course, is to
use "justRMA" or buy more RAM.
As I have learned more about the wet-lab side of microarray experiments,
it has come to my attention that there is a technical limitation in our
lab as to how many chips can actually be run at one time, and that there
is a substantial batch effect between batches.
So, in my case at least, it seems to me that it would be incorrect to
normalize 60 CEL files at once when in fact they have been run in 4
batches of 16. Would it not be better to normalize them separately,
within-batch, and then include a batch effect in an analytical model?
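In code, what I have in mind is roughly this (a sketch; the treatment
layout and batch sizes below are invented for illustration):

library(limma)
batch     <- factor(rep(1:4, each = 16))            # 4 hypothetical batches
treatment <- factor(rep(c("control", "treated"), times = 32))
design <- model.matrix(~ batch + treatment)         # batch as a blocking factor
fit <- eBayes(lmFit(eset, design))                  # eset from justRMA() above
topTable(fit, coef = "treatmenttreated")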
Is my situation unique or, in fact, is this the way most MA wet-labs are
set up? If the latter is correct, should the recommendation not be to
use justRMA on 80 CEL files if they have been run in batches?
Thanks,
Mark
--
Mark W. Kimpel MD ** Neuroinformatics ** Dept. of Psychiatry
Indiana University School of Medicine
15032 Hunter Court, Westfield, IN 46074
(317) 490-5129 Work, & Mobile & VoiceMail
(317) 204-4202 Home (no voice mail please)
mwkimpel<at>gmail<dot>com
Hi Mark,
Mark W Kimpel wrote:
> Not infrequently on this list the question arises as to how to perform
> RMA on a large number of CEL files. The simple answer, of course, is to
> use "justRMA" or buy more RAM.
>
> As I have learned more about the wet-lab side of microarray experiments,
> it has come to my attention that there is a technical limitation in our
> lab as to how many chips can actually be run at one time, and that there
> is a substantial batch effect between batches.
>
> So, in my case at least, it seems to me that it would be incorrect to
> normalize 60 CEL files at once when in fact they have been run in 4
> batches of 16. Would it not be better to normalize them separately,
> within-batch, and then include a batch effect in an analytical model?
Ideally you would randomize the samples when you are processing them (we
randomize at four different steps) so you don't have batches that are
processed together all the way through.
Whether or not you fit a batch effect in a linear model depends on how
the samples were processed. If the lab processed all the same type of
samples in each of the batches (please say they didn't), then any batch
effect will be aliased with the sample types and fitting an effect won't
really help.
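A toy check of that aliasing, if it helps (the layout here is invented):

batch <- factor(rep(1:2, each = 8))     # each batch holds only one sample type
type  <- factor(rep(c("A", "B"), each = 8))
design <- model.matrix(~ batch + type)
qr(design)$rank < ncol(design)          # TRUE: batch and type are confounded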
If the batches were at least semi-randomized, then with 60 samples you
won't be losing that many degrees of freedom, and it probably won't hurt
to do so, and it just might help.
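Back of the envelope, with hypothetical numbers:

n <- 60
df.resid <- n - 1 - (4 - 1) - (2 - 1)   # intercept, 4 batches, 2 treatments
df.resid                                # 55 residual df still left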
>
> Is my situation unique or, in fact, is this the way most MA wet-labs are
> set up? If the latter is correct, should the recommendation not be to
> use justRMA on 80 CEL files if they have been run in batches?
Regardless of how the lab is set up, once you get to large sample sets
there will always be batches. If you do proper randomization of the
samples during processing, IMO there should be no need to do any
post-processing adjustments for the batches.
Best,
Jim
>
> Thanks,
> Mark
--
James W. MacDonald, M.S.
Biostatistician
Affymetrix and cDNA Microarray Core
University of Michigan Cancer Center
1500 E. Medical Center Drive
7410 CCGC
Ann Arbor MI 48109
734-647-5623
This is interesting; after the fact in our case, I guess, but interesting.
We did 3 Duroc and 3 Yorkshire pigs,
shallow and deep wounds on each,
and wounds biopsied at 1, 2, 3, 12, and 20 weeks,
so 60 samples obtained over 10 months, 6 samples at a time.
And we pretty much processed them as we went, so it was rather like 10
batches of 6 each over 10 months.
And then we normalized them all together.
Should we have done something for batches? Did we miss something?
Thank you
--
Loren Engrav
Univ Washington
Seattle
Jim,
My understanding is that our lab normally randomizes by
1. treatment
2. RNA extraction
3. labeling
4. hybridization
In addition, we sometimes have multiple brain regions, and, for the
purpose of the MA run, each region is treated as an independent
experiment; thus there is no randomization across brain regions for the
above factors.
My question arises because of two recent situations. First, in one
experiment, for a reason not clear to me, the labeling and hybridization
groups were combined, and there is a clear batch effect when this
labeling-hybridization factor is put into Limma. In such a case, would
separate normalization be suggested? It will make the batch effect
larger, but that would seem to be addressed by including the batch
effect as a factor.
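What I have been doing looks roughly like this (a sketch; the phenoData
column names are made up):

library(limma)
library(Biobase)
lab.hyb   <- factor(pData(eset)$LabHybGroup)   # hypothetical batch column
treatment <- factor(pData(eset)$Treatment)     # hypothetical treatment column
plotMDS(exprs(eset), labels = lab.hyb)         # the batch grouping shows up here
design <- model.matrix(~ lab.hyb + treatment)
fit <- eBayes(lmFit(eset, design))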
Secondly, in another experiment I need to perform an analysis across 5
brain regions to look for overall gene expression differences resulting
from genetic differences between strains. In that experiment the 4
factors mentioned at the beginning were randomized, so there is no batch
effect within brain region, but there is one across brain regions. In
this experiment I am not trying to find differences across brain
regions, which would be impossible to separate from a batch effect, but
rather between two treatments that are independent of brain region. One
way I have done this in the past has been to simply average all 5 brain
regions together to come up with an average-brain expression measure,
but I wonder if it would be better to put brain region in as a factor.
Regardless of whether I average or not, I need to decide whether to
normalize all brain regions together or, because they were run as
separate MA experiments, to normalize them individually.
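As a sketch of the region-as-factor option (the layout below is
invented: 5 regions, 6 animals per strain):

region <- factor(rep(paste("R", 1:5, sep = ""), each = 12))
strain <- factor(rep(rep(c("A", "B"), each = 6), times = 5))
design <- model.matrix(~ region + strain)
fit <- eBayes(lmFit(eset, design))
topTable(fit, coef = "strainB")   # strain effect, blocked on brain region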
Really, the question seems to be whether RMA should be used on a group
of CEL files in the presence of a non-chip-related batch effect; if so,
whether it will make the batch effect "go away" (not in my experience);
and, if not, how to incorporate the batch effect in a model.
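(The check I have been using, for what it's worth, is a quick PCA on the
RMA output, colored by the batch factor from above:

pca <- prcomp(t(exprs(eset)))
plot(pca$x[, 1:2], col = as.integer(lab.hyb), pch = 19)

and in my hands the batch grouping is still plainly visible after
normalization.)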
Finally, I realize that by randomizing at each step mentioned at the
top, one spreads any variance out so that it cannot be picked up as a
batch effect. With the "n" we usually use, if one were to take each of
the 4 factors into account, one usually would run out of degrees of
freedom. Nevertheless, the variance induced at each step of the wet-lab
is there; it is just not apparent and presumably doesn't induce bias. It
does, however, decrease power, and I wonder if it wouldn't be better to
block by treatment, so that equal numbers from each treatment are in a
group, but then each group is processed entirely together. There the
batch effect would be large, but it would be present as only one factor,
which, with a large enough "n", one could take into account in a
statistical model. That, it seems, might increase power to detect
differential expression. Maybe this is counter-intuitive, and it would
probably only work if "n" were large enough to provide enough degrees of
freedom, but it makes some sense to me. Am I nuts? (Many people think
so, so don't be shy about saying so ;) ).
Thanks so much for your helpful input,
Mark
Mark W. Kimpel MD ** Neuroinformatics ** Dept. of Psychiatry
Indiana University School of Medicine
15032 Hunter Court, Westfield, IN 46074
(317) 490-5129 Work, & Mobile & VoiceMail
(317) 204-4202 Home (no voice mail please)
mwkimpel<at>gmail<dot>com
******************************************************************
Mark W Kimpel wrote:
> Jim,
>
> My understanding is that our lab normally randomizes by
> 1. treatment
> 2. RNA extraction
> 3. labeling
> 4. hybridization
>
> In addition, we sometimes have multiple brain regions, and, for the
> purpose of the MA run, each region is treated as an independent
> experiment; thus there is no randomization across brain regions for the
> above factors.
>
> My question arises because of two recent situations. First, in one
> experiment, for a reason not clear to me, the labeling and hybridization
> groups were combined, and there is a clear batch effect when this
> labeling-hybridization factor is put into Limma. In such a case, would
> separate normalization be suggested? It will make the batch effect
> larger, but that would seem to be addressed by including the batch
> effect as a factor.
I think there are two different questions here: first, when should one
normalize things separately, and second, when should a batch effect be
used.

For me, it takes a lot to want to run RMA separately on chips that were
all processed in a single facility. In general, the normalization is
intended to address technical differences between samples while
retaining biological differences, so unless I can see some large
differences between the sample distributions, or I think that most genes
will be differentially expressed between samples, I would tend to
process them all together.
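The sort of look I mean, as a sketch (assuming the raw data are still at
hand):

library(affy)
abatch <- ReadAffy()   # CEL files in the working directory
boxplot(abatch)        # per-array boxplots of raw log2 intensities
hist(abatch)           # per-array density curves

If the boxes sit roughly on top of one another, I normalize everything
together.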
>
> Secondly, in another experiment I need to perform an analysis across 5
> brain regions to look for overall gene expression differences resulting
> from genetic differences between strains. In that experiment the 4
> factors mentioned at the beginning were randomized, so there is no batch
> effect within brain region, but there is one across brain regions. In
> this experiment I am not trying to find differences across brain
> regions, which would be impossible to separate from a batch effect, but
> rather between two treatments that are independent of brain region. One
> way I have done this in the past has been to simply average all 5 brain
> regions together to come up with an average-brain expression measure,
> but I wonder if it would be better to put brain region in as a factor.
> Regardless of whether I average or not, I need to decide whether to
> normalize all brain regions together or, because they were run as
> separate MA experiments, to normalize them individually.
This is a situation where it makes sense to me to add a brain region
effect, so you are in effect blocking on brain region. I think it makes
much less sense to average over all regions. In this case it might make
sense to normalize separately, but I wonder just how different the
expression of each region might be. I usually look at NUSE plots to see
if I think the normalization should be done separately or not. If the
NUSE plot looks reasonable, then I figure the model is fitting the data
OK, so why bother with separate normalizations? Then again, we ran over
1800 chips last year, so I don't have a lot of time to ponder a given
analysis. ;-D
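For the record, the NUSE recipe is short (a sketch; abatch is the raw
AffyBatch from ReadAffy()):

library(affyPLM)
pset <- fitPLM(abatch)   # probe-level model fits
NUSE(pset)               # medians near 1 with similar spread = decent fit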
>
> Really, the question seems to be whether RMA should be used on a group
> of CEL files in the presence of a non-chip-related batch effect; if so,
> whether it will make the batch effect "go away" (not in my experience);
> and, if not, how to incorporate the batch effect in a model.
>
> Finally, I realize that by randomizing at each step mentioned at the
> top, one spreads any variance out so that it cannot be picked up as a
> batch effect. With the "n" we usually use, if one were to take each of
> the 4 factors into account, one usually would run out of degrees of
> freedom. Nevertheless, the variance induced at each step of the wet-lab
> is there; it is just not apparent and presumably doesn't induce bias. It
> does, however, decrease power, and I wonder if it wouldn't be better to
> block by treatment, so that equal numbers from each treatment are in a
> group, but then each group is processed entirely together. There the
> batch effect would be large, but it would be present as only one factor,
> which, with a large enough "n", one could take into account in a
> statistical model. That, it seems, might increase power to detect
> differential expression. Maybe this is counter-intuitive, and it would
> probably only work if "n" were large enough to provide enough degrees of
> freedom, but it makes some sense to me. Am I nuts? (Many people think
> so, so don't be shy about saying so ;) ).
Doing things that way is a split-plot design, and I don't recall anybody
advocating batch effects for the plots in a split-plot design. But a
split-plot design is intended for situations where you can only
randomize at one step. I would tend to want to mix things up more, but
others may have different opinions.
Best,
Jim
>
> Thanks so much for your helpful input,
> Mark
--
James W. MacDonald, MS
Biostatistician
UMCCC cDNA and Affymetrix Core
University of Michigan
1500 E Medical Center Drive
7410 CCGC
Ann Arbor MI 48109
734-647-5623
Jim,
Thanks for your helpful advice. I'll be taking a few days off for
Christmas and will dig into this again when I return.
In the meantime, Merry Christmas/Happy Holidays to you and all on the
BioC list who are celebrating.
Mark
Mark W. Kimpel MD ** Neuroinformatics ** Dept. of Psychiatry
Indiana University School of Medicine
15032 Hunter Court, Westfield, IN 46074
(317) 490-5129 Work, & Mobile & VoiceMail
(317) 204-4202 Home (no voice mail please)
mwkimpel<at>gmail<dot>com
******************************************************************