Guilherme Rocha
Dear all,
I am trying to pre-process (background correction, quantile
normalization, summarization) the readings from a large number
(~1,000s) of large microarray chips (~10^6 probes each).
As far as I can tell, the pre-processing functions in most packages
load the data from all chips at once, which is infeasible in this case.
In addition, I'd like to have flexibility in how summarization to the
gene or exon level is done at the last pre-processing step.
If I understand correctly, it should be possible to do all the
processing without loading the entire data set at once, as described
below. Can anyone comment on whether that sounds sensible?
The main questions are:
1) In the RMA background correction step, data from each chip is used
independently of data from other chips, correct?
1a) If so, the background-corrected intensities for each chip can be
computed using the affyio::read.celfile and
preprocessCore::rma.background.correct functions, correct?
2) In the quantile normalization step, which probes are included in the
ordered vector of intensities used to construct the "reference
distribution of intensities" shared across all chips?
Specifically, MM probes are NOT included, but what about control
probes?
For HTA 2.0 the probeset types are control->affx, control->affx->asc,
control->affx->bac_spike, ..., normgene->exon, normgene->intron. Which
of these are included in quantile normalization?
3) When a CEL file is read using affyio::read.celfile, in what order are
the mean intensities stored in the INTENSITY$MEAN vector?
3a) X first, as in (X=0, Y=0), (X=1, Y=0), ..., (X=max.X, Y=0),
(X=0, Y=1), (X=1, Y=1), ..., (X=max.X, Y=1), ...,
(X=0, Y=max.Y), (X=1, Y=max.Y), ..., (X=max.X, Y=max.Y)?
3b) Y first, as in (X=0, Y=0), (X=0, Y=1), ..., (X=0, Y=max.Y),
(X=1, Y=0), (X=1, Y=1), ..., (X=1, Y=max.Y), ...,
(X=max.X, Y=0), (X=max.X, Y=1), ..., (X=max.X, Y=max.Y)?
3c) Some different order?
4) In the RMA summarization step, the normalized intensities in a chip
are processed independently of data from other chips, correct?
The subColSummarize functions in preprocessCore can be used to do this
(as long as I can group probes into probesets or genes), correct?
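
To make the two candidate orderings in question 3 concrete, here is a
small Python sketch of each as a mapping from (X, Y) to a 0-based
position in a flat vector. The function names and the toy 3 x 2 chip are
purely illustrative assumptions; the sketch does not say which ordering
affyio::read.celfile actually uses, and R vectors are 1-based, so
positions there would be shifted by one.

```python
def index_x_first(x, y, n_cols):
    """Position when X varies fastest (option 3a): one row after another."""
    return y * n_cols + x


def index_y_first(x, y, n_rows):
    """Position when Y varies fastest (option 3b): one column after another."""
    return x * n_rows + y


def xy_from_x_first(i, n_cols):
    """Invert the X-first mapping: recover (x, y) from a flat position."""
    y, x = divmod(i, n_cols)
    return x, y


# Toy chip with 3 columns (X = 0..2) and 2 rows (Y = 0..1).
n_cols, n_rows = 3, 2

# Coordinates listed in each candidate order.
order_3a = [(x, y) for y in range(n_rows) for x in range(n_cols)]
order_3b = [(x, y) for x in range(n_cols) for y in range(n_rows)]
```

With the X and Y coordinates of a few known probes, comparing their
intensities at the positions given by each mapping would distinguish
3a from 3b.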
Any help appreciated,
Thanks,
G. Rocha
----------------------------------------------------------------------
"Distributed" RMA normalization algorithm:
a) Background correction:
For each chip, the CEL file can be read using affyio::read.celfile,
the background correction can be done using
preprocessCore::rma.background.correct, and the corrected intensities
can be saved in a separate (binary) file.
Is this any different from what rma does internally?
b) Quantile normalization:
This is the more involved step, as it requires data from all chips.
But it is possible to avoid loading the entire data set by making two
passes through the data:
Pass 1) For each chip, open the file of background-corrected
intensities, sort them, and add the ORDERED intensities to a running
sum; once all chips have been summed, divide by n_chips to get an
ordered vector of "reference intensities".
Pass 2) For each chip, open the file of background-corrected
intensities, compute the rank of each probe, and replace its intensity
with the value at the corresponding rank in the vector of "reference
intensities".
Save the background-corrected, normalized probe-level intensities for
each chip separately.
c) Summarization:
For each chip, open the file of background-corrected, normalized
probe-level intensities created in (b).
Summarize to the probeset, gene, exon, or junction level using your
favorite version of preprocessCore::subColSummarize.
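
The two passes in (b) and the per-chip grouping in (c) can be sketched
in plain Python as follows. This is a minimal illustration under stated
assumptions, not the preprocessCore implementation: all function names
are made up, chips are short in-memory lists standing in for the
per-chip files, ties are broken by position rather than averaged, and
the summary is a simple per-chip median of log2 intensities used as a
stand-in (it is not RMA's median polish).

```python
import math
from statistics import median


def reference_distribution(chips):
    """Pass 1: average the sorted intensities, visiting one chip at a time."""
    total, n_chips = None, 0
    for chip in chips:  # `chips` may be a generator reading one file per step
        ordered = sorted(chip)
        if total is None:
            total = ordered
        else:
            total = [t + v for t, v in zip(total, ordered)]
        n_chips += 1
    return [t / n_chips for t in total]


def normalize_chip(chip, reference):
    """Pass 2: replace each intensity by the reference value at its rank."""
    order = sorted(range(len(chip)), key=lambda i: chip[i])
    out = [0.0] * len(chip)
    for rank, i in enumerate(order):
        out[i] = reference[rank]
    return out


def summarize_chip(chip, probeset_ids):
    """Step (c) stand-in: median of log2 intensities per probeset, one chip."""
    groups = {}
    for value, pid in zip(chip, probeset_ids):
        groups.setdefault(pid, []).append(math.log2(value))
    return {pid: median(vals) for pid, vals in groups.items()}


# Toy data: two chips, four probes, two probesets.
chip1 = [2.0, 8.0, 4.0, 16.0]
chip2 = [4.0, 2.0, 32.0, 8.0]
probeset_ids = ["a", "a", "b", "b"]

reference = reference_distribution(iter([chip1, chip2]))  # [2.0, 4.0, 8.0, 24.0]
norm1 = normalize_chip(chip1, reference)                  # [2.0, 8.0, 4.0, 24.0]
norm2 = normalize_chip(chip2, reference)                  # [4.0, 2.0, 24.0, 8.0]
summary1 = summarize_chip(norm1, probeset_ids)
```

Only one chip plus the running sum is held in memory at any time during
pass 1, and only one chip plus the reference vector during pass 2, which
is the point of the two-pass scheme.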
--
Guilherme V. Rocha
gvrocha@gmail.com