I work in cancer research with 450k array methylation data. I have been reading about normalization in 450k arrays, with an eye toward working out which study-design factors should determine my upstream workflow. A lot of the literature focuses on technical and analytical reproducibility, and some of it includes cancer cohorts.
My current workflow, starting from IDATs, is: Illumina normalization -> SWAN normalization -> filtering on detection p-value -> conversion to a GenomicRatioSet and filtering on probe type -> ComBat -> analysis.
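For concreteness, here is roughly how that pipeline could be expressed with minfi plus sva; the IDAT directory, detection p-value cutoff, SNP-based probe filter, and the `Batch` column are placeholders for illustration rather than the exact settings I run:

```r
library(minfi)
library(sva)

## Read raw IDATs (directory path is a placeholder)
rgSet <- read.metharray.exp(base = "idat_dir")

## Illumina-style background correction and control normalization
mSet <- preprocessIllumina(rgSet, bg.correct = TRUE, normalize = "controls")

## SWAN within-array normalization of Type I vs Type II probes
mSet <- preprocessSWAN(rgSet, mSet = mSet)

## Drop probes that fail the detection p-value in any sample
detP <- detectionP(rgSet)
detP <- detP[featureNames(mSet), ]
mSet <- mSet[rowSums(detP < 0.01) == ncol(detP), ]

## Map to the genome and convert to a GenomicRatioSet
gset <- ratioConvert(mapToGenome(mSet))

## Example probe filtering: remove probes overlapping common SNPs
gset <- dropLociWithSnps(gset, snps = c("SBE", "CpG"), maf = 0)

## ComBat batch correction on M-values ('Batch' assumed to be in the sample sheet)
mvals    <- getM(gset)
combat_m <- ComBat(dat = mvals, batch = pData(gset)$Batch)
```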
SWAN seems to be pretty commonly used, but I am seeing more recent studies using BMIQ instead. Given current concern about cell-type heterogeneity (and the recommendation *not* to use quantile normalization in cancer, or in other settings where global differential methylation is expected), I am interested in which normalization method(s) are appropriate.
I realize it is important to validate array findings wherever possible with technical replicates and higher-fidelity approaches such as sequencing, and that the approach should be tailored to the particular conditions of the study rather than following some blanket, one-size-fits-all recipe. However, results also need to be replicable, and an important part of that is knowing to what extent there is a standard for upstream data processing. Maybe there isn't a single "right" approach, but there may be trends in how labs process their data that should be known in order to assess independent findings side by side.
A further consideration: why isn't it more common to chain together normalizations that would seem to complement one another? (e.g., Noob -> SWAN -> BMIQ seems logical for background correction -> within-array normalization -> between-array normalization). A rough sketch of what I mean is below.
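Just to make that concrete, here is how such a chain might be wired up with minfi (Noob, SWAN) and the standalone BMIQ function distributed with wateRmelon, as I understand its interface; the IDAT path and the probe-type design vector are assumptions for illustration, and I am not claiming this particular chain is statistically sound (that is the question):

```r
library(minfi)
library(wateRmelon)

rgSet <- read.metharray.exp(base = "idat_dir")   # placeholder path

## Noob background / dye-bias correction, then SWAN on the Noob-corrected set
mSet <- preprocessNoob(rgSet)
mSet <- preprocessSWAN(rgSet, mSet = mSet)

## BMIQ per sample on the resulting betas, using the probe-type
## design vector BMIQ expects (1 = Type I, 2 = Type II)
gmSet  <- mapToGenome(mSet)
beta   <- getBeta(gmSet)
anno   <- getAnnotation(gmSet)
design <- ifelse(anno$Type == "I", 1, 2)

beta_bmiq <- apply(beta, 2, function(b) BMIQ(b, design.v = design, plots = FALSE)$nbeta)
rownames(beta_bmiq) <- rownames(beta)
```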
Thanks, Bioconductor community!
Sean
smaden@fredhutch.org
Thanks so much for your input, Kasper!
best,
Sean