My question related to normalization for EPIC methylation arrays, as performed by the minfi Bioc package.
I am helping with a longitudinal study where they are following changes in methylation over a time period of 3 years. I was given two sets of IDAT files for 192 total samples; 96 were performed last year, and 96 this year. I was not part of the design and there are no control samples that are common to each chip for the purposes of batch effect estimation. A collaborator decided to plot the distribution of Beta values for alleged "housekeeping" genes (which they believe should not show huge changes), and there is a very clear difference, as shown below:
Putting aside that there is only so much we can do since the chip and biological state are completely confounded, is there a sensible method for inter-chip normalization? I loaded both sample sheets together and used preprocessNoob
, as shown below,
> library(minfi)
> targets = read.metharray.sheet('/workspace')
[read.metharray.sheet] Found the following CSV files:
[1] "/workspace/Samplesheet_051319.csv"
[2] "/workspace/SampleSheet_012120.csv"
>
> RGSet <- read.metharray.exp(targets=targets, force=TRUE)
> normed_ms <- preprocessNoob(RGSet)
(full SessionInfo
below the post)
My impression from reading Fortin (Bioinformatics. 2017 Feb 15; 33(4): 558–560.) is that noob (is the ssNoob they reference the same as minfi's preprocessNoob
?) is "best" for my situation. In that paper, they state: "Single sample normalization is of great potential benefit to users, particularly for analyzing large datasets which arrive in batches, because data can be processed separately and independently of the previously processed data."
Am I on the right track? My collaborators clearly understand the confounding issue and indicate that there may, in fact, be biological reasons for this difference. However, they are still unsettled by it.
An additional question:
The minfi vignette says that the only column used by read.metharray.exp
is Basename
("The only important column in sheet data.frame used in the targets argument for the read.metharray.exp() function is a column names Basename"). Does this mean it effectively ignores the array/slide/plate hierarchy? Or does it use that somehow? I know the full data frame can contain that information, but it is not clear if it is used since only Basename
is necessary.
Thanks!
> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: Debian GNU/Linux 9 (stretch)
Matrix products: default
BLAS/LAPACK: /opt/software/miniconda/envs/r36/lib/R/lib/libRblas.so
locale:
[1] C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] minfi_1.32.0 bumphunter_1.28.0
[3] locfit_1.5-9.1 iterators_1.0.12
[5] foreach_1.4.8 Biostrings_2.54.0
[7] XVector_0.26.0 SummarizedExperiment_1.16.0
[9] DelayedArray_0.12.0 BiocParallel_1.20.0
[11] matrixStats_0.55.0 Biobase_2.46.0
[13] GenomicRanges_1.38.0 GenomeInfoDb_1.22.0
[15] IRanges_2.20.0 S4Vectors_0.24.0
[17] BiocGenerics_0.32.0
loaded via a namespace (and not attached):
[1] nlme_3.1-139 bitops_1.0-6 bit64_0.9-7
[4] RColorBrewer_1.1-2 progress_1.2.2 httr_1.4.1
[7] tools_3.6.1 doRNG_1.8.2 nor1mix_1.3-0
[10] R6_2.4.1 HDF5Array_1.14.0 DBI_1.1.0
[13] tidyselect_1.0.0 prettyunits_1.1.1 base64_2.0
[16] bit_1.1-15.2 curl_4.3 compiler_3.6.1
[19] preprocessCore_1.48.0 xml2_1.2.2 rtracklayer_1.46.0
[22] readr_1.3.1 genefilter_1.68.0 quadprog_1.5-8
[25] askpass_1.1 rappdirs_0.3.1 stringr_1.4.0
[28] digest_0.6.25 Rsamtools_2.2.0 illuminaio_0.28.0
[31] siggenes_1.60.0 GEOquery_2.54.0 pkgconfig_2.0.3
[34] scrime_1.3.5 dbplyr_1.4.2 limma_3.42.0
[37] rlang_0.4.5 RSQLite_2.1.5 DelayedMatrixStats_1.8.0
[40] mclust_5.4.5 dplyr_0.8.4 RCurl_1.98-1.1
[43] magrittr_1.5 GenomeInfoDbData_1.2.2 Matrix_1.2-17
[46] Rcpp_1.0.3 Rhdf5lib_1.8.0 lifecycle_0.1.0
[49] stringi_1.4.3 MASS_7.3-51.3 zlibbioc_1.32.0
[52] rhdf5_2.30.0 plyr_1.8.6 BiocFileCache_1.10.0
[55] grid_3.6.1 blob_1.2.1 crayon_1.3.4
[58] lattice_0.20-38 splines_3.6.1 multtest_2.42.0
[61] GenomicFeatures_1.38.0 annotate_1.64.0 hms_0.5.3
[64] beanplot_1.2 pillar_1.4.3 rngtools_1.5
[67] codetools_0.2-16 biomaRt_2.42.0 XML_3.98-1.19
[70] glue_1.3.1 data.table_1.12.8 vctrs_0.2.3
[73] tidyr_1.0.2 openssl_1.4.1 purrr_0.3.3
[76] reshape_0.8.8 assertthat_0.2.1 xtable_1.8-4
[79] survival_2.44-1.1 tibble_2.1.3 GenomicAlignments_1.22.0
[82] AnnotationDbi_1.48.0 memoise_1.1.0