Hi all,
I'm working on EPIC array methylation data with minfi version 1.16.0.
After reading the sample sheet, I import the EPIC data:
RGset <- read.450k.exp(targets = targets)
RGset@annotation <- c(array = "IlluminaHumanMethylationEPIC", annotation = "ilm10b2.hg19")
Now I get a strange result when I remove all failed positions, i.e. those with non-significant detection p-values (> 0.05):
detP <- detectionP(RGset)
RGset2 <- RGset[rowSums(detP < 0.05) == ncol(RGset), ]
dim(RGset)
Features  Samples
 1052641       18
dim(RGset2)
Features  Samples
   39866       18
As you can see, the number of Features is reduced drastically. These are the same simple commands I have always used to remove failed positions on 450k arrays, where they worked well every time. For example, with 18 samples on 450k arrays, RGset has 622399 Features, as expected, and RGset2 has 620199, which is a reasonable difference.
Why do the same commands behave so differently on EPIC arrays?
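One way to see why a "pass in every sample" filter can remove so many rows is to look at the failure rate per sample. The following is a minimal sketch on simulated data, not on the arrays from this post: the matrix `detP` is a toy stand-in for the output of `detectionP()`, and the names `sample7`, `pass`, `keep_all` and `keep_without_bad` are hypothetical. It uses base R only.

```r
# Toy stand-in for a detection p-value matrix (rows = probes, columns = samples).
# All values are simulated; with real data detP would come from detectionP(RGset).
set.seed(1)
n_probes  <- 1000
n_samples <- 18
detP <- matrix(runif(n_probes * n_samples, min = 0, max = 0.01),
               nrow = n_probes,
               dimnames = list(paste0("probe", seq_len(n_probes)),
                               paste0("sample", seq_len(n_samples))))

# Assumption for illustration: one sample performs badly on most probes.
detP[, "sample7"] <- runif(n_probes, min = 0, max = 1)

pass <- detP < 0.05

# Fraction of failed probes per sample; one outlying sample stands out here.
frac_failed_per_sample <- colMeans(!pass)
print(round(frac_failed_per_sample, 3))

# Strict rule from the post: a probe is kept only if it passes in every sample.
keep_all <- rowSums(pass) == ncol(detP)

# Same rule after setting the poorly performing sample aside.
good <- colnames(detP) != "sample7"
keep_without_bad <- rowSums(pass[, good, drop = FALSE]) == sum(good)

print(c(strict = sum(keep_all), without_sample7 = sum(keep_without_bad)))
```

In this toy case a single poor sample is enough to make the strict rule discard most probes, so checking `colMeans(detP > 0.05)` on the real matrix before filtering may be informative. It is also worth confirming that the rows of `detP` correspond one-to-one to the rows of the object being subset, for example by comparing row names.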
Thanks in advance, Regards
Giovanni
sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.4 LTS

locale:
 [1] LC_CTYPE=it_IT.UTF-8       LC_NUMERIC=C               LC_TIME=it_IT.UTF-8
 [4] LC_COLLATE=it_IT.UTF-8     LC_MONETARY=it_IT.UTF-8    LC_MESSAGES=it_IT.UTF-8
 [7] LC_PAPER=it_IT.UTF-8       LC_NAME=C                  LC_ADDRESS=C
[10] LC_TELEPHONE=C             LC_MEASUREMENT=it_IT.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods
[9] base

other attached packages:
 [1] IlluminaHumanMethylationEPICmanifest_0.3.0
 [2] minfi_1.16.0
 [3] bumphunter_1.10.0
 [4] locfit_1.5-9.1
 [5] iterators_1.0.8
 [6] foreach_1.4.3
 [7] Biostrings_2.38.3
 [8] XVector_0.10.0
 [9] SummarizedExperiment_1.0.2
[10] GenomicRanges_1.22.3
[11] GenomeInfoDb_1.6.1
[12] IRanges_2.4.6
[13] S4Vectors_0.8.7
[14] lattice_0.20-33
[15] Biobase_2.30.0
[16] BiocGenerics_0.16.1

loaded via a namespace (and not attached):
 [1] mclust_5.1              rgl_0.95.1441           base64_1.1
 [4] Rcpp_0.12.3             corpcor_1.6.8           Rsamtools_1.22.0
 [7] digest_0.6.9            plyr_1.8.3              futile.options_1.0.0
[10] ellipse_0.3-8           RSQLite_1.0.0           ggplot2_2.0.0
[13] zlibbioc_1.16.0         GenomicFeatures_1.22.8  annotate_1.48.0
[16] preprocessCore_1.32.0   splines_3.2.2           BiocParallel_1.4.3
[19] stringr_1.0.0           igraph_1.0.1            RCurl_1.95-4.7
[22] biomaRt_2.26.1          munsell_0.4.2           rtracklayer_1.30.1
[25] multtest_2.26.0         pkgmaker_0.22           GEOquery_2.36.0
[28] quadprog_1.5-5          codetools_0.2-14        matrixStats_0.50.1
[31] XML_3.98-1.3            reshape_0.8.5           GenomicAlignments_1.6.3
[34] MASS_7.3-45             bitops_1.0-6            grid_3.2.2
[37] nlme_3.1-122            xtable_1.8-0            gtable_0.1.2
[40] registry_0.3            DBI_0.3.1               magrittr_1.5
[43] scales_0.3.0            stringi_1.0-1           genefilter_1.52.0
[46] doRNG_1.6               limma_3.26.5            futile.logger_1.4.1
[49] nor1mix_1.2-1           lambda.r_1.1.7          RColorBrewer_1.1-2
[52] mixOmics_5.2.0          siggenes_1.44.0         tools_3.2.2
[55] illuminaio_0.12.0       rngtools_1.2.4          survival_2.38-3
[58] AnnotationDbi_1.32.3    colorspace_1.2-6        beanplot_1.2
Hi James,
OK, this is a stringent criterion, but it worked fine on the 450k array analyses I have run, including ones with more samples (50 or 60), always with a reasonable post-filtering difference. I'll consider what you suggest about filtering probes with a smallish percentage of detection p-values (50%-20%). I only hope that nothing in this minfi version biases the construction of the RGset when reading raw data from EPIC IDAT files; in this analysis in particular, the 18 samples are distributed across 8 arrays.
Sure, full support for EPIC arrays requires a newer minfi version, and this would prevent any 'possible' bias in reading EPIC raw data, since the number of probes has almost doubled compared to 450k.
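A percentage-based filter of the kind mentioned above could look roughly like the sketch below. It runs on simulated p-values in base R; the 20% cutoff, the variable names (`max_failed_fraction`, `keep_relaxed`, `keep_strict`) and the simulated matrix are assumptions for illustration, not values taken from this thread.

```r
# Simulated detection p-values (rows = probes, columns = samples); not real data.
set.seed(2)
n_probes  <- 2000
n_samples <- 18
detP <- matrix(runif(n_probes * n_samples, min = 0, max = 0.06),
               nrow = n_probes)

# Hypothetical tolerance: a probe may fail in at most 20% of the samples.
max_failed_fraction <- 0.2

# Per-probe fraction of samples whose detection p-value is not below 0.05.
failed_fraction <- rowMeans(detP >= 0.05)

keep_relaxed <- failed_fraction <= max_failed_fraction  # tolerant rule
keep_strict  <- failed_fraction == 0                    # "pass in every sample" rule

print(c(strict = sum(keep_strict), relaxed = sum(keep_relaxed)))
```

Every probe kept by the strict rule is also kept by the relaxed one, so the relaxed rule can only retain more rows. With real data the logical vector would be applied to an object whose rows match those of `detP`.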
Many Thanks,
Giovanni