I am using minfi to process data from Illumina EPIC methylation BeadChips, and I have found a discrepancy between the number of CpGs that GenomeStudio reports with a detection P value <0.01 and the number that I get when I read in the .idat files with minfi in R.
For example, in a recent batch I had one sample with only 801,850 CpGs with detection P values <0.01 (92.5%) according to GenomeStudio, but when I read in the data from the .idat files and used the detectionP() function in minfi, I got a count of 842,051 CpGs with detection P <0.01 (97.1%).
Is there an explanation for the discrepancy? This example has a pretty extreme difference, but the number of good CpGs I get when I read from the idat files directly is consistently higher than what comes out of GenomeStudio.
Example code for getting the count of good detection P values from minfi, along with my sessionInfo, is below.
    RGSet <- read.metharray.exp(targets = targets)
    detP <- detectionP(RGSet)
    dim(detP)
    # [1] 866836 137
    sum(detP[,"200861170017_R06C01"]<0.01)
    # [1] 842051
    > sessionInfo()
    R version 3.3.1 (2016-06-21)
    Platform: x86_64-redhat-linux-gnu (64-bit)
    Running under: Fedora 23 (Server Edition)

    locale:
     [1] LC_CTYPE=en_US.UTF-8  LC_NUMERIC=C  LC_TIME=en_US.UTF-8
     [4] LC_COLLATE=en_US.UTF-8  LC_MONETARY=en_US.UTF-8  LC_MESSAGES=en_US.UTF-8
     [7] LC_PAPER=en_US.UTF-8  LC_NAME=C  LC_ADDRESS=C
    [10] LC_TELEPHONE=C  LC_MEASUREMENT=en_US.UTF-8  LC_IDENTIFICATION=C

    attached base packages:
     [1] grid  stats4  parallel  stats  graphics  grDevices  utils  datasets
     [9] methods  base

    other attached packages:
     [1] Gviz_1.17.4  minfi_1.20.2  bumphunter_1.14.0
     [4] locfit_1.5-9.1  iterators_1.0.8  foreach_1.4.3
     [7] Biostrings_2.42.1  XVector_0.13.7  SummarizedExperiment_1.4.0
    [10] GenomicRanges_1.26.4  GenomeInfoDb_1.10.3  IRanges_2.8.2
    [13] S4Vectors_0.12.2  Biobase_2.34.0  BiocGenerics_0.19.2

    loaded via a namespace (and not attached):
     [1] nlme_3.1-131  bitops_1.0-6
     [3] matrixStats_0.52.1  RColorBrewer_1.1-2
     [5] httr_1.2.1  tools_3.3.0
     [7] backports_1.0.5  doRNG_1.6
     [9] nor1mix_1.2-2  R6_2.2.0
    [11] rpart_4.1-10  Hmisc_4.0-2
    [13] DBI_0.6-1  lazyeval_0.2.0
    [15] colorspace_1.3-2  nnet_7.3-12
    [17] gridExtra_2.2.1  base64_2.0
    [19] preprocessCore_1.36.0  htmlTable_1.9
    [21] pkgmaker_0.22  rtracklayer_1.34.2
    [23] scales_0.4.1  checkmate_1.8.2
    [25] genefilter_1.56.0  quadprog_1.5-5
    [27] stringr_1.2.0  digest_0.6.12
    [29] Rsamtools_1.26.1  foreign_0.8-67
    [31] illuminaio_0.16.0  siggenes_1.48.0
    [33] GEOquery_2.40.0  base64enc_0.1-3
    [35] dichromat_2.0-0  htmltools_0.3.5
    [37] BSgenome_1.41.2  ensembldb_1.5.9
    [39] limma_3.30.13  htmlwidgets_0.8
    [41] RSQLite_1.1-2  BiocInstaller_1.24.0
    [43] shiny_1.0.1  mclust_5.2.3
    [45] BiocParallel_1.8.1  acepack_1.4.1
    [47] VariantAnnotation_1.19.7  RCurl_1.95-4.8
    [49] magrittr_1.5  Formula_1.2-1
    [51] Matrix_1.2-8  Rcpp_0.12.10
    [53] munsell_0.4.3  stringi_1.1.3
    [55] MASS_7.3-45  zlibbioc_1.20.0
    [57] plyr_1.8.4  AnnotationHub_2.5.4
    [59] lattice_0.20-35  splines_3.3.0
    [61] multtest_2.30.0  GenomicFeatures_1.26.4
    [63] annotate_1.52.1  knitr_1.15.1
    [65] beanplot_1.2  rngtools_1.2.4
    [67] codetools_0.2-15  biomaRt_2.30.0
    [69] XML_3.98-1.6  biovizBase_1.22.0
    [71] latticeExtra_0.6-28  data.table_1.10.4
    [73] httpuv_1.3.3  gtable_0.2.0
    [75] openssl_0.9.6  reshape_0.8.6
    [77] assertthat_0.1  ggplot2_2.2.1
    [79] mime_0.5  xtable_1.8-2
    [81] survival_2.41-3  tibble_1.2
    [83] GenomicAlignments_1.10.1  AnnotationDbi_1.36.2
    [85] registry_0.3  memoise_1.0.0
    [87] cluster_2.0.6  interactiveDisplayBase_1.12.0

    > packageVersion("minfi")
    [1] '1.20.2'
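For anyone who wants to reproduce the percentages quoted in the question, they come from dividing each count by the 866,836 probes in the detection P matrix. Below is a minimal sketch of the same calculation done per sample. It uses a small made-up matrix in place of real detectionP() output, so the probe names, sample names, and values in detP_toy are assumptions for illustration only.

```r
# Toy stand-in for the matrix returned by detectionP(): rows are probes,
# columns are samples. Values and names here are invented for illustration.
detP_toy <- matrix(
  c(0.001, 0.200, 0.000, 0.009,
    0.050, 0.000, 0.010, 0.002),
  nrow = 4,
  dimnames = list(paste0("probe", 1:4), c("sampleA", "sampleB"))
)

threshold <- 0.01

# Number of probes per sample with detection P below the threshold
n_detected <- colSums(detP_toy < threshold)

# The same thing expressed as a fraction of all probes
frac_detected <- colMeans(detP_toy < threshold)

# The two counts from the question, as fractions of the 866,836 probes
minfi_frac <- 842051 / 866836
gs_frac <- 801850 / 866836
round(100 * c(minfi = minfi_frac, GenomeStudio = gs_frac), 1)
```

On a real matrix, colMeans(detP < 0.01) gives the detected fraction for every sample at once, which makes it easier to check whether the gap to GenomeStudio is similar across the whole batch.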
Thank you for the fast and helpful response, James. I also contacted Illumina tech support and they basically said the same thing -- that they are using a different algorithm to generate detection p values.
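To illustrate how two detection P definitions can disagree on the same intensities, here is a toy sketch in base R. It is not the code of either minfi or GenomeStudio. The functions p_normal and p_rank and the simulated intensities are assumptions chosen for illustration: one compares each probe to a normal distribution fitted to the negative controls, the other uses the probe's rank among the negative controls.

```r
set.seed(1)

# Simulated data (invented): 600 negative-control intensities and 1000 probes
neg_ctrl <- rnorm(600, mean = 300, sd = 80)
probes <- c(rnorm(900, mean = 3000, sd = 800),  # well above background
            rnorm(100, mean = 450, sd = 100))   # close to background

# Definition A (hypothetical): upper tail of a normal distribution whose
# centre and spread are estimated robustly from the negative controls
p_normal <- function(x, ctrl) {
  pnorm(x, mean = median(ctrl), sd = mad(ctrl), lower.tail = FALSE)
}

# Definition B (hypothetical): fraction of negative controls with an
# intensity at least as high as the probe (an empirical, rank-based p-value)
p_rank <- function(x, ctrl) {
  vapply(x, function(v) mean(ctrl >= v), numeric(1))
}

pA <- p_normal(probes, neg_ctrl)
pB <- p_rank(probes, neg_ctrl)

# Probes called "detected" at the same threshold under each definition
c(normal = sum(pA < 0.01), rank = sum(pB < 0.01))

# Number of probes on which the two definitions disagree
sum((pA < 0.01) != (pB < 0.01))
```

In a toy setup like this, any disagreement is confined to probes whose intensity sits close to the negative-control distribution, so a sample with many borderline probes would be expected to show the largest gap between two such definitions.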
Thanks, Kasper! Here is the documentation that Illumina tech support pointed me to:
"The information we have on the algorithms used in GenomeStudio Methylation are found in the user guide, linked here: https://support.illumina.com/content/dam/illumina-support/documents/documentation/software_documentation/genomestudio/genomestudio-2011-1/genomestudio-methylation-v1-8-user-guide-11319130-b.pdf
The actual calculation for detection P value is likely inherited from the formula used for Gene Expression, which is described in the gene expression user guide on page 106: https://support.illumina.com/content/dam/illumina-support/documents/documentation/software_documentation/genomestudio/genomestudio-2011-1/genomestudio-gx-module-v1-0-user-guide-11319121-a.pdf
Besides these, we do not have more detailed information about the algorithms used, but I hope this helps."
... It seems that not even tech support can say for sure what GenomeStudio does! I imagine this documentation is exactly what you looked at before, so there may not be anything to change.
Great! I just sent you an email. Let me know if you need anything else.
-Brooke