Hello!
I am processing Affymetrix Rat Gene 2.0 ST arrays with the
Bioconductor package xps using rma normalization. I have included the
ExFold ERCC external RNA controls with 2 mixes of different
concentrations. I am able to pull out intensities for the ERCC
controls at different points along the processing scheme. If I pull
the ERCC raw intensities, order them by increasing concentration, and
transform both the concentration and intensity by log base 2, I see a
nice sigmoid curve that I can fit with a cubic polynomial.
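For readers who want to reproduce this kind of check outside of R, here is a minimal sketch of the log2/log2 transform followed by a least-squares cubic fit. The concentrations and intensities are made-up numbers, not ERCC values or data from this experiment, and fit_polynomial and evaluate are hypothetical helper names written for this illustration.

```python
from math import log2

def fit_polynomial(xs, ys, degree=3):
    """Least-squares polynomial fit via the normal equations.

    Returns coefficients c such that y ~ c[0] + c[1]*x + ... + c[degree]*x**degree.
    """
    m = degree + 1
    # Normal equations A c = b with A[j][k] = sum(x^(j+k)), b[j] = sum(y * x^j).
    A = [[sum(x ** (j + k) for x in xs) for k in range(m)] for j in range(m)]
    b = [sum(y * x ** j for x, y in zip(xs, ys)) for j in range(m)]
    # Gaussian elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, m):
            factor = A[r][col] / A[col][col]
            for k in range(col, m):
                A[r][k] -= factor * A[col][k]
            b[r] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * m
    for r in range(m - 1, -1, -1):
        tail = sum(A[r][k] * coeffs[k] for k in range(r + 1, m))
        coeffs[r] = (b[r] - tail) / A[r][r]
    return coeffs

def evaluate(coeffs, x):
    """Evaluate the fitted polynomial at x."""
    return sum(c * x ** k for k, c in enumerate(coeffs))

# Hypothetical spike-in concentrations and raw intensities (illustrative only).
concentrations = [2.0 ** k for k in range(11)]
intensities = [30, 32, 38, 55, 110, 260, 600, 1200, 1900, 2400, 2600]

# Controls ordered by increasing concentration, both axes on the log2 scale.
log_conc = [log2(c) for c in concentrations]
log_int = [log2(v) for v in intensities]

coeffs = fit_polynomial(log_conc, log_int, degree=3)
residuals = [y - evaluate(coeffs, x) for x, y in zip(log_conc, log_int)]
print(coeffs)
print(max(abs(r) for r in residuals))
```

The flat ends of the fitted curve mark the concentration ranges where the controls can no longer be told apart.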
However, when I pull out the ERCC controls after summarization,
reorder them by concentration, and roughly calculate the log-fold
change, they are all close to 1?? My supposition is that I am
overfitting the data with RMA and that I need to find a better
normalization scheme. Does anyone have any ideas for different
normalization and summarization methods that I should look at, for
example iter-PLIER or FARMS? Any advice or comments are welcome.
Thanks,
Matt
matthew.thornton at med.usc.edu
Dear Matt,
If you want to use a different normalization method, I would suggest
to try MAS5. Alternatively, you are free to play around with different
methods.
As you can see in my vignette 'xpsPreprocess.pdf', you can do the
calculation stepwise, i.e. use different methods for background
correction, normalization and summarization. For example, I would try
sector background (mas4), then median normalization and lowess
summarization. For FARMS you can use no background correction,
quantile normalization and farms summarization, see chapter 5.5.
However, if you think that you may be overfitting the data, I would
not use quantile normalization.
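To show why a median-based normalization is gentler on intentional differences, here is a small sketch of generic median scaling. It is not the xps implementation, the intensities are invented, and median_scale is a hypothetical helper name. The second array is assumed to be 10% brighter overall, and the last probe stands in for a spike-in with a true 2-fold difference.

```python
from math import log2
from statistics import median

def median_scale(arrays):
    """Scale each array so that its median equals the mean of all array medians."""
    medians = [median(a) for a in arrays]
    target = sum(medians) / len(medians)
    return [[v * target / m for v in a] for a, m in zip(arrays, medians)]

# Hypothetical raw intensities (illustrative only); last entry is the spike-in.
array_mix1 = [120.0, 340.0, 560.0, 900.0, 30000.0]
array_mix2 = [132.0, 374.0, 616.0, 990.0, 16500.0]  # 10% brighter overall

scaled1, scaled2 = median_scale([array_mix1, array_mix2])

lfc_raw = log2(array_mix1[-1] / array_mix2[-1])
lfc_scaled = log2(scaled1[-1] / scaled2[-1])
print(lfc_raw, lfc_scaled)
```

Because every value on an array is multiplied by the same factor, the spike-in difference survives the normalization; only the global brightness offset is removed.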
With respect to PLIER please see my note in Appendix A.1 of vignette
'APTvsXPS.pdf'.
Best regards,
Christian
Matt,
Here are some comments that may be helpful, but they don't directly
address your question...
The "Subgroup B" ERCC spike-ins *should* have lfc=0. I like to look at
that group first.
I also like to look at the raw data "within subject" across the (log)
concentration range and see where the linearity breaks down and the
concentrations become indistinguishable (i.e., asymptotic parts of
sigmoid curve); I am suspicious of any differences among groups for a
gene with expression levels falling in those areas.
You might also consider looking at a density plot for each sample with
a rug plot showing the values of the ERCC controls. (Non-graphically,
use ecdf() to see where they fall in each sample.) Are the upper
tails dominated by ERCCs? If so, I would be concerned about using RMA
because quantile normalization may be too strong in the presence of
such (intentional) differences. For example, Mix 1 has a max
concentration of 30,000 while Mix 2 only goes up to 15,000. Based on
my understanding, if those controls are indeed the strongest signals
in your samples, then by definition they would be equal after RMA.
Indeed, Bolstad et al. (2003) mention this in their quantile
normalization paper, which is one of the three papers that make up the
RMA procedure:
"One possible problem with this method is that it forces
the values of quantiles to be equal. This would be
most problematic in the tails where it is possible that a
probe could have the same value across all the arrays.
However, in practice, since probeset expression measures
are typically computed using the value of multiple probes,
we have not found this to be a problem."
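The effect on the tails can be seen in a toy example. The sketch below applies plain quantile normalization to two made-up arrays in which the spike-in is the strongest signal on both. It isolates the quantile step only (real RMA also does background correction and median-polish summarization across multiple probes), ties are not handled, and quantile_normalize is a hypothetical helper name.

```python
from math import log2

def quantile_normalize(arrays):
    """Quantile-normalize equal-length lists, one list per array (no tie handling)."""
    n = len(arrays[0])
    # Sort each array, then average across arrays at each rank.
    sorted_arrays = [sorted(a) for a in arrays]
    rank_means = [sum(s[i] for s in sorted_arrays) / len(arrays) for i in range(n)]
    normalized = []
    for a in arrays:
        # Indices of this array ordered from smallest to largest value.
        order = sorted(range(n), key=lambda i: a[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]  # every array gets the same value per rank
        normalized.append(out)
    return normalized

# Hypothetical intensities (illustrative only); the last probe is a spike-in
# with a true 2-fold difference and is the top-ranked value on both arrays.
array_mix1 = [120.0, 340.0, 560.0, 900.0, 30000.0]
array_mix2 = [110.0, 360.0, 540.0, 950.0, 15000.0]

norm1, norm2 = quantile_normalize([array_mix1, array_mix2])

lfc_before = log2(array_mix1[-1] / array_mix2[-1])
lfc_after = log2(norm1[-1] / norm2[-1])
print(lfc_before, lfc_after)
```

Since the spike-in holds the top rank on both arrays, it is assigned the same rank mean on each, and the 2-fold difference disappears.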
Based on this, I would filter out ERCC controls that are in the
non-linear range or dominate the tails; you want the ERCC used to be
intermingled with "real" data to help avoid these problems.
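As a non-graphical version of the density/rug check, the following sketch computes where each control falls on a sample's empirical CDF and keeps only those inside the bulk of the data. The 5% and 95% cutoffs, the control names, and the values are assumptions chosen for illustration, and both helper functions are hypothetical.

```python
from bisect import bisect_right

def ecdf_position(sample_values, x):
    """Empirical CDF at x: fraction of sample values that are <= x."""
    ordered = sorted(sample_values)
    return bisect_right(ordered, x) / len(ordered)

def controls_inside_bulk(sample_values, controls, lower=0.05, upper=0.95):
    """Return {name: ECDF position} for controls that are not in the tails."""
    kept = {}
    for name, value in controls.items():
        p = ecdf_position(sample_values, value)
        if lower <= p <= upper:
            kept[name] = p
    return kept

# Hypothetical log2 expression values for one sample (illustrative only).
sample = [4.0 + 0.1 * i for i in range(100)]

# Hypothetical controls: one below the data, one in the middle, one above.
controls = {"ctrl_low": 3.5, "ctrl_mid": 8.02, "ctrl_high": 14.5}

kept = controls_inside_bulk(sample, controls)
print(kept)
```

Running the same check per sample and keeping only the controls that pass in every sample would give a set that is intermingled with the experimental data.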
Just some thoughts!
Wade
Thank you for the suggestions! I will look at where the ERCC controls
fall in the data. I am thinking of using a pared-down set of the ERCC
controls that are in the 'linear' range and fall within my experimental
data. I am planning to use the spike-in probes procedure in the vsn
package. I will also try mas5 and iterate over the different
processing procedures in xps. It is good to have an outside metric for
assessing normalization. If I can get observed log-fold changes
similar to my expected log-fold changes, it will give me a
little more confidence in my data. When you process data with the ERCC
controls, what normalization methods do you use?
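The comparison of observed and expected log-fold changes could be set up along these lines. This is only a sketch: the control names, concentrations, and log2 expression values are invented, and expected_lfc and observed_lfc are hypothetical helpers. The expected value comes from the concentration ratio between the two mixes; the observed value is the difference of mean log2 expression between the two groups of samples.

```python
from math import log2

def expected_lfc(conc_mix1, conc_mix2):
    """Expected log2 fold change from the spiked-in concentrations."""
    return log2(conc_mix1 / conc_mix2)

def observed_lfc(log2_expr_mix1, log2_expr_mix2):
    """Observed log2 fold change: difference of group means on the log2 scale."""
    mean1 = sum(log2_expr_mix1) / len(log2_expr_mix1)
    mean2 = sum(log2_expr_mix2) / len(log2_expr_mix2)
    return mean1 - mean2

# Hypothetical controls (illustrative only):
# name -> (conc in mix 1, conc in mix 2, log2 expr in mix-1 samples, in mix-2 samples)
controls = {
    "ctrl_4to1": (4.0, 1.0, [10.1, 9.9, 10.0], [8.2, 8.0, 8.1]),
    "ctrl_1to1": (2.0, 2.0, [9.0, 9.1, 8.9], [9.0, 9.0, 9.0]),
    "ctrl_1to2": (1.0, 2.0, [7.5, 7.4, 7.6], [8.3, 8.4, 8.2]),
}

for name, (c1, c2, expr1, expr2) in controls.items():
    exp = expected_lfc(c1, c2)
    obs = observed_lfc(expr1, expr2)
    print(name, round(exp, 2), round(obs, 2), round(obs - exp, 2))
```

A normalization that compresses the spike-ins would show observed values shrunk toward 0 relative to the expected ones.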
Thanks again!
Matt
matthew.thornton at med.usc.edu
________________________________________
From: Davis, Wade [davisjwa@health.missouri.edu]
Sent: Wednesday, June 25, 2014 8:26 AM
To: Thornton, Matthew; bioconductor at r-project.org
Subject: RE: [BioC] External RNA controls on Rat Gene ST 2.0 chip lfc
~ 1 after xps rma??