Here are some comments that may be helpful, but they don't directly
address your question...
The "Subgroup B" ERCC spike-ins *should* have lfc=0. I like to look at
that group first.
I also like to look at the raw data "within subject" across the (log)
concentration range and see where the linearity breaks down and the
concentrations become indistinguishable (i.e., the asymptotic parts of
the sigmoid curve); I am suspicious of any differences among groups for
a gene with expression levels falling in those areas.
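One rough way to locate where linearity breaks down is to look at the
local slope of log2(intensity) against log2(concentration) between
neighbouring controls and keep the longest stretch where the slope stays
above some cutoff. The sketch below is in plain Python for illustration
only; the function name, the 0.5 slope cutoff, and the toy numbers are
my own assumptions, not values from any ERCC documentation, and it
assumes every concentration is distinct.

```python
import math

def linear_range(conc, intensity, min_slope=0.5):
    """Return (low, high) concentrations bounding the longest run of
    neighbouring points whose log2-log2 slope is >= min_slope.
    Assumes all concentrations are distinct and all values are > 0.
    min_slope=0.5 is an arbitrary illustrative cutoff."""
    pts = sorted(zip(conc, intensity))
    x = [math.log2(c) for c, _ in pts]
    y = [math.log2(v) for _, v in pts]
    best = None   # (start index, end index) of the longest run so far
    start = None  # start index of the current run
    for k in range(len(pts) - 1):
        slope = (y[k + 1] - y[k]) / (x[k + 1] - x[k])
        if slope >= min_slope:
            if start is None:
                start = k
            if best is None or (k + 1 - start) > (best[1] - best[0]):
                best = (start, k + 1)
        else:
            start = None  # flat segment: the run is broken
    if best is None:
        return None
    return pts[best[0]][0], pts[best[1]][0]

# Made-up sigmoid-like data: flat at both ends, linear in the middle.
toy_conc = [1, 2, 4, 8, 16, 32, 64, 128]
toy_intensity = [30, 31, 33, 60, 120, 240, 250, 255]
print(linear_range(toy_conc, toy_intensity))
```

Controls outside the returned bounds sit on the flat parts of the
curve, which is where I would distrust any apparent fold change.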
You might also consider looking at a density plot for each sample with
a rug plot showing the values of the ERCC controls. (Non-graphically,
use ecdf() to see where they fall in each sample.) Are the upper
tails dominated by ERCCs? If so, I would be concerned about using RMA
because quantile normalization may be too strong in the presence of
such (intentional) differences. For example, Mix 1 has a max
concentration of 30,000 while Mix 2 only goes up to 15,000. Based on
my understanding, if those controls are indeed the strongest signals
in your samples, then by definition they would be equal after RMA.
Indeed, Bolstad et al. (2003) mention this in their quantile
normalization paper, which is one of the three papers that make up the
"One possible problem with this method is that it forces
the values of quantiles to be equal. This would be
most problematic in the tails where it is possible that a
probe could have the same value across all the arrays.
However, in practice, since probeset expression measures
are typically computed using the value of multiple probes,
we have not found this to be a problem."
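The tail effect described in that quote is easy to see on a toy
example. The sketch below implements only a bare quantile normalization
step (sort each array, average across arrays at each rank, write the
averages back in the original order); it is not the full RMA pipeline,
which also includes background correction and summarization. The
function name and the numbers are made up for illustration, and ties
are simply broken by position.

```python
def quantile_normalize(columns):
    """Quantile-normalize a list of equal-length lists (one per array).
    Illustrative only: ties are broken by position, not averaged."""
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean across arrays at each rank.
    ref = [sum(sc[r] for sc in sorted_cols) / len(columns) for r in range(n)]
    out = []
    for col in columns:
        order = sorted(range(n), key=lambda i: col[i])
        norm = [0.0] * n
        for rank, i in enumerate(order):
            norm[i] = ref[rank]  # value at this rank gets the reference value
        out.append(norm)
    return out

# Made-up log2 intensities; the last entry in each array plays the role
# of a spike-in that is the strongest signal and differs twofold.
array1 = [5.0, 7.0, 9.0, 15.0]
array2 = [5.2, 6.8, 9.1, 14.0]
norm1, norm2 = quantile_normalize([array1, array2])
print(norm1[3], norm2[3])  # the top value is now identical in both arrays
```

The one-unit (twofold) difference at the top of the two arrays is gone
after normalization, which is the concern with controls that dominate
the upper tail.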
Based on this, I would filter out ERCC controls that are in the
non-linear range or dominate the tails; you want the ERCCs you use to
be intermingled with "real" data to help avoid these problems.
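As a sketch of the tail filter, you can compute where each control
falls on a sample's empirical CDF (the same idea as ecdf() in R) and
keep only those that land away from the extremes. This is plain Python
for illustration; the function names, the control names, and the
0.05/0.95 cutoffs are assumptions of mine, so pick cutoffs that suit
your data.

```python
from bisect import bisect_right

def ecdf_position(values, x):
    """Fraction of values <= x (the empirical CDF evaluated at x)."""
    s = sorted(values)
    return bisect_right(s, x) / len(s)

def keep_intermingled(sample_values, controls, lower=0.05, upper=0.95):
    """Keep controls whose ECDF position in this sample lies in
    [lower, upper]. `controls` maps a control name to its value in the
    sample. The default cutoffs are arbitrary illustrative choices."""
    s = sorted(sample_values)
    n = len(s)
    return {name: v for name, v in controls.items()
            if lower <= bisect_right(s, v) / n <= upper}

# Made-up sample of 100 expression values and three hypothetical controls.
sample = list(range(1, 101))
controls = {"ctrl_low": 2, "ctrl_mid": 50, "ctrl_high": 100.5}
print(keep_intermingled(sample, controls))
```

Running this per sample and keeping only the controls that pass in all
samples would give a set that is mixed in with the "real" data.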
Just some thoughts!
From: Thornton, Matthew [mailto:Matthew.Thornton@med.usc.edu]
Sent: Tuesday, June 24, 2014 2:39 PM
To: bioconductor at r-project.org
Subject: [BioC] External RNA controls on Rat Gene ST 2.0 chip lfc ~ 1
after xps rma??
I am processing Affymetrix gene chip Rat Gene 2.0 ST chips with
bioconductor package xps using rma normalization. I have included the
ExFold ERCC external RNA controls with 2 mixes of different
concentrations. I am able to pull out intensities for the ERCC
controls at different points along the processing scheme. If I pull
the ERCC raw intensities, order them by increasing concentration, and
transform both the concentration and intensity by log base 2, I see a
nice sigmoid curve that I can fit with a cubic polynomial.
However, when I pull out the ERCC controls after summarization,
reorder them by concentration, and roughly calculate the log-fold
change, they are all close to 1?? My supposition is that I am
overfitting the
data with RMA and that I need to find a better normalization scheme.
Does anyone have any ideas for different normalization and
summarization methods that I should look at? Like iter-PLIER or FARMS
or ? Any advice or comments are welcome.
matthew.thornton at med.usc.edu