Question

RMA normalization in large number of chips

Guilherme Rocha ▴ 40

Dear all,

I am trying to pre-process (background correction, quantile normalization, summarization) the readings from a large number (~1,000s) of large microarray chips (~10^6 probes). As far as I can tell, the pre-processing functions in most packages load the data from all chips at once, which is infeasible in this case. In addition, I'd like to have flexibility in how summarization to the gene or exon level is done at the last pre-processing step.

If I understand correctly, it should be possible to do all the processing without loading the entire data set at once, as described below. Can anyone comment on whether that sounds sensible?

The main questions are:

1) In the RMA background correction step, data from each chip is used independently of data from other chips, correct?
   1a) If so, the background-corrected intensities for each chip can be computed using the affyio::read.celfile and preprocessCore::rma.background.correct functions, correct?

2) In the quantile normalization step, which probes are included in the ordered vector of intensities used to construct the "reference distribution of intensities" shared across all probes? Specifically, MM probes are NOT included, but what about control probes? For HTA 2.0 the probeset types are control->affx, control->affx->asc, control->affx->bac_spike, ..., normgene->exon, normgene->intron. Which of these are included in quantile normalization?

3) When a CEL file is read using affyio::read.celfile, in what order are the mean intensities stored in the INTENSITY$MEAN vector?
   3a) X first, as in (X=0, Y=0), (X=1, Y=0), ..., (X=max.X, Y=0), (X=0, Y=1), (X=1, Y=1), ..., (X=max.X, Y=1), ..., (X=0, Y=max.Y), (X=1, Y=max.Y), ..., (X=max.X, Y=max.Y)?
   3b) Y first, as in (X=0, Y=0), (X=0, Y=1), ..., (X=0, Y=max.Y), (X=1, Y=0), (X=1, Y=1), ..., (X=1, Y=max.Y), ..., (X=max.X, Y=0), (X=max.X, Y=1), ..., (X=max.X, Y=max.Y)?
   3c) Some different order?

4) In the RMA summarization step, the normalized intensities in a chip are processed independently of data from other chips, correct? The subColSummarize functions in preprocessCore can be used to do this (as long as I can group probes into probesets or genes), correct?

Any help appreciated,

Thanks,

G. Rocha

----------------------------------------------------------------------

"Distributed" RMA normalization algorithm:

a) Background correction:
   For each chip, read the CEL file using affyio::read.celfile, do the background correction using preprocessCore::rma.background.correct, and save the intensities in a separate (binary) file.
   Is this any different from what is done internally in rma?

b) Quantile normalization:
   This is the more involved step, as it requires data from all chips. But it is possible to avoid loading the entire data set by making two passes through the data:
   Pass 1) Open the file of bg-corrected intensities for each chip and sum the ORDERED intensities along the way; once finished summing, divide by n_chips to get an ordered vector of "reference intensities".
   Pass 2) For each chip, open the file of bg-corrected measurements, compute the rank of each probe, and replace the probe's intensity with the value at the corresponding rank in the vector of "reference intensities". Save the bg-corrected, normalized probe-level intensities for each chip separately.

c) Summarization:
   For each chip, open the file of bg-corrected, normalized probe-level intensities created in (b). Summarize to the probeset, gene, exon, or junction level using your favorite version of preprocessCore::subColSummarize.

--
Guilherme V. Rocha
gvrocha@gmail.com
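The two-pass idea in step (b) can be sketched as follows. This is only an illustration of the arithmetic in plain Python, not what preprocessCore does internally: the function names `reference_distribution` and `normalize_chip` are hypothetical, every chip is assumed to have the same number of probes, and ties are broken by probe position rather than averaged. In practice `chips` would be a generator that loads one background-corrected file at a time, so only a single chip plus the running sum is held in memory.

```python
from typing import Iterable, List, Optional, Sequence


def reference_distribution(chips: Iterable[Sequence[float]]) -> List[float]:
    """Pass 1: average the sorted intensities across chips, one chip at a time."""
    total: Optional[List[float]] = None
    n_chips = 0
    for intensities in chips:
        ordered = sorted(intensities)
        if total is None:
            total = [0.0] * len(ordered)
        if len(ordered) != len(total):
            raise ValueError("all chips must have the same number of probes")
        # Accumulate the k-th smallest value of this chip into slot k.
        for k, value in enumerate(ordered):
            total[k] += value
        n_chips += 1
    if total is None:
        raise ValueError("no chips supplied")
    return [s / n_chips for s in total]


def normalize_chip(intensities: Sequence[float],
                   reference: Sequence[float]) -> List[float]:
    """Pass 2: replace each probe by the reference value at that probe's rank."""
    if len(intensities) != len(reference):
        raise ValueError("chip and reference must have the same length")
    # Probe indices ordered from lowest to highest intensity
    # (ties keep their original order; a simplification).
    order = sorted(range(len(intensities)), key=lambda i: intensities[i])
    normalized = [0.0] * len(intensities)
    for rank, probe_index in enumerate(order):
        normalized[probe_index] = reference[rank]
    return normalized


# Toy example with two "chips" of three probes each.
toy_chips = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
ref = reference_distribution(iter(toy_chips))   # [1.5, 3.5, 5.5]
print(normalize_chip(toy_chips[0], ref))        # [5.5, 1.5, 3.5]
print(normalize_chip(toy_chips[1], ref))        # [3.5, 1.5, 5.5]
```

After pass 2 every chip has exactly the same set of values, just assigned to different probes, which is the property quantile normalization is meant to enforce. Which probes go into the vectors (question 2 above) is a separate choice that this sketch does not address.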

Normalization probe preprocessCore

updated 10.3 years ago by James W. MacDonald • written 10.3 years ago by Guilherme Rocha

score 0 · Answer 1 · 2014-01-30

Hi Guilherme,

See SCAN.UPC, fRMA, xps, rmaExpress (http://rmaexpress.bmbolstad.com/) or aroma.affymetrix (http://www.aroma-project.org/publications) for memory-bounded implementations of RMA.

Best,

Jim

--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099