Disparity between MethylLumiSet (methylumi) and RGChannelSet (minfi) - treatment of missing values
1
0
Entering edit mode
@pilling-luke-6027
Last seen 9.6 years ago
Hello, I am loading 450k methylation data into R and have noticed a difference in the way that "methylumi" and "minfi" load the IDAT files. Each has a dedicated function for this (methylumIDAT() for methylumi, and read.450k.exp() for minfi), each of which creates an object which contains the relevant information (MethyLumiSet objects in the first instance, RGChannelSet objects in the 2nd). I am finding that the number of "missing" values reported in an array is much higher in the MethyLumiSet object than the RGChannelSet object, when I compare the 2 methods on the same IDAT files (this is before any exclusions, transformations, etc - raw data after import is different). To give an example; I loaded a single array (12 samples) using both methods. I extracted the betas, and then used the "na.omit()" command to remove any probes (rows) that have missing values. I find that 988 probes are excluded from the MethyLumiSet object, but not from the RGChannelSet object (47 probes are omitted from both). I then wanted to look at the raw data for a probe that is excluded from MethyLumiSet but not RGChannelSet; ###### First I picked the first probe in this list (excluded from MethyLumiSet but not RGChannelSet); > probes.excluded.methylumiset[! probes.excluded.methylumiset %in% probes.excluded.rgchannelset][1] [1] "cg00035864" ###### Then extracted the betas for this probe from both data objects; > methylumiset.betas[row.names(methylumiset.betas) == "cg00035864"] [1] NA NA NA NA 0.2101861 0.2584211 NA NA NA NA 0.1878819 NA > rgchannelset.betas [row.names(rgchannelset.betas) == "cg00035864"] [1] 0.5665025 0.3771760 0.5247209 0.6393211 0.2101861 0.2584211 0.5242291 0.4642263 0.4795640 0.5143571 0.1878819 0.4678663 I cannot find any reference on the web or in the documentation for these functions as to why they would treat the raw IDAT data import differently - any input/insight would be much appreciated. Many thanks, Luke ------------------ Mr. Luke C. Pilling Associate Research Fellow & PhD Researcher, Epidemiology and Public Health Group, Medical School, University of Exeter, U.K. [[alternative HTML version deleted]]
probe methylumi probe methylumi • 2.3k views
ADD COMMENT
0
Entering edit mode
@kasper-daniel-hansen-2979
Last seen 9 months ago
United States
This is extremely weird. As far as I know, both packages are using the same original code for reading in IDAT files, and as far as I can recall, we (minfi / illuminaio) have not had a bugfix which would address this. But you are looking at Beta's so my immediate thought is that you are doing something with divide by zero and/or na.omit. I would be shocked if there is a difference in the Red/Green channel measurements. Could we get (1) the offending IDAT file posted somewhere (2) full code (3) sessionInfo() Best, Kasper On Wed, Jul 3, 2013 at 9:32 AM, Pilling, Luke <l.pilling@exeter.ac.uk>wrote: > Hello, > > I am loading 450k methylation data into R and have noticed a difference in > the way that "methylumi" and "minfi" load the IDAT files. > > Each has a dedicated function for this (methylumIDAT() for methylumi, and > read.450k.exp() for minfi), each of which creates an object which contains > the relevant information (MethyLumiSet objects in the first instance, > RGChannelSet objects in the 2nd). > > I am finding that the number of "missing" values reported in an array is > much higher in the MethyLumiSet object than the RGChannelSet object, when I > compare the 2 methods on the same IDAT files (this is before any > exclusions, transformations, etc - raw data after import is different). > > To give an example; > > I loaded a single array (12 samples) using both methods. I extracted the > betas, and then used the "na.omit()" command to remove any probes (rows) > that have missing values. I find that 988 probes are excluded from the > MethyLumiSet object, but not from the RGChannelSet object (47 probes are > omitted from both). > > I then wanted to look at the raw data for a probe that is excluded from > MethyLumiSet but not RGChannelSet; > > > ###### First I picked the first probe in this list (excluded from > MethyLumiSet but not RGChannelSet); > > > probes.excluded.methylumiset[! probes.excluded.methylumiset %in% > probes.excluded.rgchannelset][1] > [1] "cg00035864" > > ###### Then extracted the betas for this probe from both data objects; > > > methylumiset.betas[row.names(methylumiset.betas) == "cg00035864"] > [1] NA NA NA NA 0.2101861 0.2584211 NA > NA NA NA 0.1878819 NA > > rgchannelset.betas [row.names(rgchannelset.betas) == "cg00035864"] > [1] 0.5665025 0.3771760 0.5247209 0.6393211 0.2101861 0.2584211 0.5242291 > 0.4642263 0.4795640 0.5143571 0.1878819 0.4678663 > > > I cannot find any reference on the web or in the documentation for these > functions as to why they would treat the raw IDAT data import differently - > any input/insight would be much appreciated. > > Many thanks, > Luke > > ------------------ > Mr. Luke C. Pilling > Associate Research Fellow & PhD Researcher, > Epidemiology and Public Health Group, Medical School, University of > Exeter, U.K. > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]
ADD COMMENT
0
Entering edit mode
Thanks for the reply. Following my e-mail to this mailing list, I also e-mailed one of the package authors (for Methylumi package - Tim Triche), who told me that which the Beta calculation was the same (hence we don't see any difference in the values that are present), they use an altered version of the detection P-value calculation. As such, a different number of Betas are actually present. So you are correct: same code for actually reading the file (and generating Betas), but slightly different parameters for the P-value calculation (although I don't believe this has been published - I didn't see it in the manuals!). Missing values in the RGChannelSet (loaded by read.450k.exp() ) are presented as NaN, rather than NA, but are treated the same by na.omit(). I'm afraid I cannot provide the data, as it is not "my" data - it is human cohort data that I am exploring normalization methods for. The commands were (for the most part) the default options; ###### First load using minfi (output included) ######################## test.rg <- read.450k.exp(base="data/meth/450k/", targets=targets[targets$Sample_Slide=="5790742015", ], extended=T) test.rg # RGChannelSetExtended (storageMode: lockedEnvironment) # assayData: 622399 features, 12 samples # element names: Green, GreenSD, NBeads, Red, RedSD # phenoData # sampleNames: 5790742015_R03C02 5790742015_R04C02 ... 5790742015_R02C02 (12 total) # varLabels: Sample_code98 Sample_Name ... filenames (10 total) # varMetadata: labelDescription # Annotation # array: IlluminaHumanMethylation450k # annotation: ilmn.v1.2 test.rg.pf <- pfilter(test.rg, pnthresh=.01, perc=5, pthresh=5) # 0 samples having 5 % of sites with a detection p-value greater than 0.01 were removed # Samples removed: # 6053 sites were removed as beadcount <3 in 5 % of samples # 1674 sites having 5 % of samples with a detection p-value greater than 0.01 were removed ######## Load using Methylumi test.mld <- methylumIDAT(Barcodes[Sample_Slide=="5790742015"], idatPath="data/meth/450k/5790742015/") test.mld # Object Information: # MethyLumiSet (storageMode: lockedEnvironment) # assayData: 485577 features, 12 samples # element names: betas, methylated, methylated.OOB, pvals, unmethylated, unmethylated.OOB # protocolData: none # phenoData # sampleNames: 5790742015_R03C02 5790742015_R04C02 ... 5790742015_R02C02 (12 total) # varLabels: barcode # varMetadata: labelDescription # featureData # featureNames: cg00000029 cg00000108 ... rs9839873 (485577 total) # fvarLabels: Probe_ID DESIGN COLOR_CHANNEL # fvarMetadata: labelDescription # experimentData: use 'experimentData(object)' # Annotation: IlluminaHumanMethylation450k # Major Operation History: # submitted finished command # 1 2013-07-03 11:32:10 2013-07-03 11:33:44 methylumIDAT(barcodes = Barcodes[Sample_Slide == "5790742015"], # 2 2013-07-03 11:33:53 2013-07-03 11:33:56 Subset of 485577 features. test.mld.pf <- pfilter(test.mld, pnthresh=.01, perc=5, pthresh=5) # 0 samples having 5 % of sites with a detection p-value greater than 0.01 were removed # Samples removed: # 1035 sites were removed as beadcount <3 in 5 % of samples # 1249 sites having 5 % of samples with a detection p-value greater than 0.01 were removed ######## SessionInfo R version 3.0.1 (2013-05-16) Platform: x86_64-w64-mingw32/x64 (64-bit) locale: [1] LC_COLLATE=English_United Kingdom.1252 [2] LC_CTYPE=English_United Kingdom.1252 [3] LC_MONETARY=English_United Kingdom.1252 [4] LC_NUMERIC=C [5] LC_TIME=English_United Kingdom.1252 attached base packages: [1] parallel stats graphics grDevices utils datasets methods [8] base other attached packages: [1] wateRmelon_1.0.3 [2] ROC_1.36.0 [3] IlluminaHumanMethylation450k.db_2.0.7 [4] org.Hs.eg.db_2.9.0 [5] RSQLite_0.11.4 [6] DBI_0.2-7 [7] AnnotationDbi_1.22.6 [8] lumi_2.12.0 [9] matrixStats_0.8.1 [10] limma_3.16.5 [11] methylumi_2.6.1 [12] ggplot2_0.9.3.1 [13] reshape2_1.2.2 [14] scales_0.2.3 [15] IlluminaHumanMethylation450kmanifest_0.4.0 [16] minfi_1.6.0 [17] Biostrings_2.28.0 [18] GenomicRanges_1.12.4 [19] IRanges_1.18.1 [20] reshape_0.8.4 [21] plyr_1.8 [22] lattice_0.20-15 [23] Biobase_2.20.0 [24] BiocGenerics_0.6.0 loaded via a namespace (and not attached): [1] affy_1.38.1 affyio_1.28.0 annotate_1.38.0 [4] beanplot_1.1 BiocInstaller_1.10.2 colorspace_1.2-2 [7] dichromat_2.0-0 digest_0.6.3 grid_3.0.1 [10] gtable_0.1.2 illuminaio_0.2.0 KernSmooth_2.23-10 [13] labeling_0.1 MASS_7.3-26 Matrix_1.0-12 [16] mclust_4.1 mgcv_1.7-22 multtest_2.16.0 [19] munsell_0.4 nleqslv_2.0 nlme_3.1-109 [22] nor1mix_1.1-4 preprocessCore_1.22.0 proto_0.3-10 [25] R.methodsS3_1.4.2 RColorBrewer_1.0-5 siggenes_1.34.0 [28] splines_3.0.1 stats4_3.0.1 stringr_0.6.2 [31] survival_2.37-4 tools_3.0.1 XML_3.96-1.1 [34] xtable_1.7-1 zlibbioc_1.6. ######################################## Best, Luke From: Kasper Daniel Hansen [mailto:kasperdanielhansen@gmail.com] Sent: Thursday, July 04, 2013 1:23 PM To: Pilling, Luke Cc: bioconductor@r-project.org Subject: Re: [BioC] Disparity between MethylLumiSet (methylumi) and RGChannelSet (minfi) - treatment of missing values This is extremely weird. As far as I know, both packages are using the same original code for reading in IDAT files, and as far as I can recall, we (minfi / illuminaio) have not had a bugfix which would address this. But you are looking at Beta's so my immediate thought is that you are doing something with divide by zero and/or na.omit. I would be shocked if there is a difference in the Red/Green channel measurements. Could we get (1) the offending IDAT file posted somewhere (2) full code (3) sessionInfo() Best, Kasper On Wed, Jul 3, 2013 at 9:32 AM, Pilling, Luke <l.pilling@exeter.ac.uk<mailto:l.pilling@exeter.ac.uk>> wrote: Hello, I am loading 450k methylation data into R and have noticed a difference in the way that "methylumi" and "minfi" load the IDAT files. Each has a dedicated function for this (methylumIDAT() for methylumi, and read.450k.exp() for minfi), each of which creates an object which contains the relevant information (MethyLumiSet objects in the first instance, RGChannelSet objects in the 2nd). I am finding that the number of "missing" values reported in an array is much higher in the MethyLumiSet object than the RGChannelSet object, when I compare the 2 methods on the same IDAT files (this is before any exclusions, transformations, etc - raw data after import is different). To give an example; I loaded a single array (12 samples) using both methods. I extracted the betas, and then used the "na.omit()" command to remove any probes (rows) that have missing values. I find that 988 probes are excluded from the MethyLumiSet object, but not from the RGChannelSet object (47 probes are omitted from both). I then wanted to look at the raw data for a probe that is excluded from MethyLumiSet but not RGChannelSet; ###### First I picked the first probe in this list (excluded from MethyLumiSet but not RGChannelSet); > probes.excluded.methylumiset[! probes.excluded.methylumiset %in% probes.excluded.rgchannelset][1] [1] "cg00035864" ###### Then extracted the betas for this probe from both data objects; > methylumiset.betas[row.names(methylumiset.betas) == "cg00035864"] [1] NA NA NA NA 0.2101861 0.2584211 NA NA NA NA 0.1878819 NA > rgchannelset.betas [row.names(rgchannelset.betas) == "cg00035864"] [1] 0.5665025 0.3771760 0.5247209 0.6393211 0.2101861 0.2584211 0.5242291 0.4642263 0.4795640 0.5143571 0.1878819 0.4678663 I cannot find any reference on the web or in the documentation for these functions as to why they would treat the raw IDAT data import differently - any input/insight would be much appreciated. Many thanks, Luke ------------------ Mr. Luke C. Pilling Associate Research Fellow & PhD Researcher, Epidemiology and Public Health Group, Medical School, University of Exeter, U.K. [[alternative HTML version deleted]] _______________________________________________ Bioconductor mailing list Bioconductor@r-project.org<mailto:bioconductor@r-project.org> https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor [[alternative HTML version deleted]]
ADD REPLY

Login before adding your answer.

Traffic: 829 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6