Inter-chip methylation normalization
0
0
Entering edit mode
blawney • 0
@blawney-12090
Last seen 4.0 years ago

My question related to normalization for EPIC methylation arrays, as performed by the minfi Bioc package.

I am helping with a longitudinal study where they are following changes in methylation over a time period of 3 years. I was given two sets of IDAT files for 192 total samples; 96 were performed last year, and 96 this year. I was not part of the design and there are no control samples that are common to each chip for the purposes of batch effect estimation. A collaborator decided to plot the distribution of Beta values for alleged "housekeeping" genes (which they believe should not show huge changes), and there is a very clear difference, as shown below:

enter image description here

Putting aside that there is only so much we can do since the chip and biological state are completely confounded, is there a sensible method for inter-chip normalization? I loaded both sample sheets together and used preprocessNoob, as shown below,

> library(minfi)
> targets = read.metharray.sheet('/workspace')
[read.metharray.sheet] Found the following CSV files:
[1] "/workspace/Samplesheet_051319.csv"
[2] "/workspace/SampleSheet_012120.csv" 
>
> RGSet <- read.metharray.exp(targets=targets, force=TRUE)
> normed_ms <- preprocessNoob(RGSet)

(full SessionInfo below the post)

My impression from reading Fortin (Bioinformatics. 2017 Feb 15; 33(4): 558–560.) is that noob (is the ssNoob they reference the same as minfi's preprocessNoob?) is "best" for my situation. In that paper, they state: "Single sample normalization is of great potential benefit to users, particularly for analyzing large datasets which arrive in batches, because data can be processed separately and independently of the previously processed data."

Am I on the right track? My collaborators clearly understand the confounding issue and indicate that there may, in fact, be biological reasons for this difference. However, they are still unsettled by it.

An additional question:

The minfi vignette says that the only column used by read.metharray.exp is Basename ("The only important column in sheet data.frame used in the targets argument for the read.metharray.exp() function is a column names Basename"). Does this mean it effectively ignores the array/slide/plate hierarchy? Or does it use that somehow? I know the full data frame can contain that information, but it is not clear if it is used since only Basename is necessary.

Thanks!

> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: Debian GNU/Linux 9 (stretch)

Matrix products: default
BLAS/LAPACK: /opt/software/miniconda/envs/r36/lib/R/lib/libRblas.so

locale:
[1] C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] minfi_1.32.0                bumphunter_1.28.0          
 [3] locfit_1.5-9.1              iterators_1.0.12           
 [5] foreach_1.4.8               Biostrings_2.54.0          
 [7] XVector_0.26.0              SummarizedExperiment_1.16.0
 [9] DelayedArray_0.12.0         BiocParallel_1.20.0        
[11] matrixStats_0.55.0          Biobase_2.46.0             
[13] GenomicRanges_1.38.0        GenomeInfoDb_1.22.0        
[15] IRanges_2.20.0              S4Vectors_0.24.0           
[17] BiocGenerics_0.32.0        

loaded via a namespace (and not attached):
 [1] nlme_3.1-139             bitops_1.0-6             bit64_0.9-7             
 [4] RColorBrewer_1.1-2       progress_1.2.2           httr_1.4.1              
 [7] tools_3.6.1              doRNG_1.8.2              nor1mix_1.3-0           
[10] R6_2.4.1                 HDF5Array_1.14.0         DBI_1.1.0               
[13] tidyselect_1.0.0         prettyunits_1.1.1        base64_2.0              
[16] bit_1.1-15.2             curl_4.3                 compiler_3.6.1          
[19] preprocessCore_1.48.0    xml2_1.2.2               rtracklayer_1.46.0      
[22] readr_1.3.1              genefilter_1.68.0        quadprog_1.5-8          
[25] askpass_1.1              rappdirs_0.3.1           stringr_1.4.0           
[28] digest_0.6.25            Rsamtools_2.2.0          illuminaio_0.28.0       
[31] siggenes_1.60.0          GEOquery_2.54.0          pkgconfig_2.0.3         
[34] scrime_1.3.5             dbplyr_1.4.2             limma_3.42.0            
[37] rlang_0.4.5              RSQLite_2.1.5            DelayedMatrixStats_1.8.0
[40] mclust_5.4.5             dplyr_0.8.4              RCurl_1.98-1.1          
[43] magrittr_1.5             GenomeInfoDbData_1.2.2   Matrix_1.2-17           
[46] Rcpp_1.0.3               Rhdf5lib_1.8.0           lifecycle_0.1.0         
[49] stringi_1.4.3            MASS_7.3-51.3            zlibbioc_1.32.0         
[52] rhdf5_2.30.0             plyr_1.8.6               BiocFileCache_1.10.0    
[55] grid_3.6.1               blob_1.2.1               crayon_1.3.4            
[58] lattice_0.20-38          splines_3.6.1            multtest_2.42.0         
[61] GenomicFeatures_1.38.0   annotate_1.64.0          hms_0.5.3               
[64] beanplot_1.2             pillar_1.4.3             rngtools_1.5            
[67] codetools_0.2-16         biomaRt_2.42.0           XML_3.98-1.19           
[70] glue_1.3.1               data.table_1.12.8        vctrs_0.2.3             
[73] tidyr_1.0.2              openssl_1.4.1            purrr_0.3.3             
[76] reshape_0.8.8            assertthat_0.2.1         xtable_1.8-4            
[79] survival_2.44-1.1        tibble_2.1.3             GenomicAlignments_1.22.0
[82] AnnotationDbi_1.48.0     memoise_1.1.0
minfi • 527 views
ADD COMMENT

Login before adding your answer.

Traffic: 865 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6