Question

NaN values in beta/M values matrices of RGChannelSet are imputed during normalization?

0

Entering edit mode

omorante • 0

@ee812cc0

Last seen 4.0 years ago

After loading different datasets of iDATS to minfi, I have noticed that there are several CpGs in the RGChannelSet beta and m values (obtained with the getBeta and getM functions)

However, after normalization with, for example, the Quantile or noob methods, these NAs are converted to numbers, and they are different depending on the normalization method used.

I would like to know if that is an expected behaviour, or if we should exclude manually these CpGs in the final analysis or impute the values .

minfiData methylationArrayAnalysis MethylationArrayData minfi • 1.9k views

ADD COMMENT • link updated 4.1 years ago by Kasper Daniel Hansen ★ 6.5k • written 4.1 years ago by omorante • 0

0

Entering edit mode

I am going to add an example to reproduce what I am saying, with the minfiData package:

rawb = as.data.frame(minfi::getBeta(minfiData::RGsetEx))
idx = row.names(rawb[rowSums(is.na(rawb))>0,]) #names of rows with NAs
head(rawb[idx,])

gset = minfi::preprocessQuantile(minfiData::RGsetEx)
normb = minfi::getBeta(gset)
head(normb[idx,])

head(rawb[idx,]): 5723646052_R02C02 5723646052_R04C01 5723646052_R05C02 5723646053_R04C02 5723646053_R05C02 5723646053_R06C02 cg00541303 0.62546370 0.67913903 0.2830123 0.39541413 NaN 0.20470972 cg12211867 0.28244096 0.33997644 0.7489393 0.65331968 0.4373757 NaN cg21422909 0.06863271 NaN 0.1095389 0.08551037 0.1019352 0.08451947 cg01119172 0.82459109 0.72714932 0.7325581 0.80993602 NaN 0.72307692 cg24494008 0.10291859 0.08823529 0.1718062 NaN 0.1111831 0.23105263 cg01157718 NaN 0.38190955 0.8803734 0.72485992 0.8875035 0.67151163

head(normb[idx,]) 5723646052_R02C02 5723646052_R04C01 5723646052_R05C02 5723646053_R04C02 5723646053_R05C02 5723646053_R06C02 cg00541303 0.60855203 0.63757631 0.3890528 0.40985515 0.6952252 0.27570889 cg12211867 0.34563935 0.39431849 0.7164490 0.59228549 0.4622814 0.47585709 cg21422909 0.07522882 0.51918442 0.1397407 0.08466117 0.0844101 0.09255525 cg01119172 0.76768695 0.68974381 0.7780127 0.81666941 0.6952252 0.82167865 cg24494008 0.12813159 0.09321829 0.1666078 0.51918442 0.1081707 0.19789654 cg01157718 0.67244123 0.38078366 0.8881847 0.73250020 0.8763427 0.73704905

I do not understand why these methylation values seem to appear from nowhere. Is there an imputation process?

Session info:

R Under development (unstable) (2020-11-08 r79408) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 18.04.3 LTS

Matrix products: default BLAS: /home/omorante/R-devel/lib/libRblas.so LAPACK: /home/omorante/R-devel/lib/libRlapack.so

locale: [1] LC_CTYPE=es_ES.UTF-8 LC_NUMERIC=C LC_TIME=en_AU.UTF-8 LC_COLLATE=es_ES.UTF-8 LC_MONETARY=en_AU.UTF-8
[6] LC_MESSAGES=es_ES.UTF-8 LC_PAPER=en_AU.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C

attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets methods base

other attached packages: [1] IlluminaHumanMethylation450kanno.ilmn12.hg19_0.6.0 IlluminaHumanMethylation450kmanifest_0.4.0
[3] minfi_1.37.0 bumphunter_1.33.0
[5] locfit_1.5-9.4 iterators_1.0.13
[7] foreach_1.5.1 Biostrings_2.59.0
[9] XVector_0.31.0 SummarizedExperiment_1.21.0
[11] Biobase_2.51.0 MatrixGenerics_1.3.0
[13] matrixStats_0.57.0 GenomicRanges_1.43.0
[15] GenomeInfoDb_1.27.0 IRanges_2.25.2
[17] S4Vectors_0.29.3 BiocGenerics_0.37.0

loaded via a namespace (and not attached): [1] nlme_3.1-150 bitops_1.0-6 bit64_4.0.5 RColorBrewer_1.1-2 progress_1.2.2
[6] httr_1.4.2 tools_4.1.0 doRNG_1.8.2 nor1mix_1.3-0 R6_2.5.0
[11] HDF5Array_1.19.0 DBI_1.1.0 rhdf5filters_1.3.0 tidyselect_1.1.0 prettyunits_1.1.1
[16] base64_2.0 minfiData_0.37.0 bit_4.0.4 curl_4.3 compiler_4.1.0
[21] preprocessCore_1.53.0 xml2_1.3.2 DelayedArray_0.17.0 rtracklayer_1.51.0 readr_1.4.0
[26] quadprog_1.5-8 genefilter_1.73.0 askpass_1.1 rappdirs_0.3.1 Rsamtools_2.7.0
[31] stringr_1.4.0 digest_0.6.27 illuminaio_0.33.0 siggenes_1.65.0 GEOquery_2.59.0
[36] pkgconfig_2.0.3 scrime_1.3.5 sparseMatrixStats_1.3.0 dbplyr_2.0.0 limma_3.47.0
[41] rlang_0.4.8 rstudioapi_0.11 RSQLite_2.2.1 DelayedMatrixStats_1.13.0 generics_0.1.0
[46] mclust_5.4.6 BiocParallel_1.25.1 dplyr_1.0.2 RCurl_1.98-1.2 magrittr_1.5
[51] GenomeInfoDbData_1.2.4 Matrix_1.2-18 Rcpp_1.0.5 Rhdf5lib_1.13.0 lifecycle_0.2.0
[56] stringi_1.5.3 MASS_7.3-53 zlibbioc_1.37.0 rhdf5_2.35.0 plyr_1.8.6
[61] BiocFileCache_1.15.0 grid_4.1.0 blob_1.2.1 crayon_1.3.4 lattice_0.20-41
[66] splines_4.1.0 multtest_2.47.0 GenomicFeatures_1.43.1 annotate_1.69.0 hms_0.5.3
[71] beanplot_1.2 pillar_1.4.6 rngtools_1.5 codetools_0.2-18 biomaRt_2.47.0
[76] XML_3.99-0.5 glue_1.4.2 data.table_1.13.2 vctrs_0.3.4 tidyr_1.1.2
[81] openssl_1.4.3 purrr_0.3.4 reshape_0.8.8 assertthat_0.2.1 xtable_1.8-4
[86] survival_3.2-7 tibble_3.0.4 GenomicAlignments_1.27.1 AnnotationDbi_1.53.0 memoise_1.1.0
[91] ellipsis_0.3.1

ADD REPLY • link 4.1 years ago omorante • 0

score 0 · Answer 1 · 2020-11-12

0

Entering edit mode

James W. MacDonald 67k

@james-w-macdonald-5106

Last seen 14 hours ago

United States

The short answer is that it's expected. The long answer is more complicated.

You should note that those aren't really NA values, but instead are NaN values. Which you should expect if you divide by zero. In addition, when you run getBeta on an RGChannelSet, the first thing that happens is preprocessRaw is used to convert to a MethylSet, because you can't get the betas without some sort of pre-processing. And preprocessRaw doesn't do anything but compute the raw methylated and unmethylated values. As an example:

> z <- preprocessRaw(RGsetEx)
> rbind(getMeth(z)["cg00541303",], getUnmeth(z)["cg00541303",])
      5723646052_R02C02 5723646052_R04C01 5723646052_R05C02 5723646053_R04C02
 [1,]              4721              5427              1616              2535
 [2,]              2827              2564              4094              3876
      5723646053_R05C02 5723646053_R06C02
 [1,]                 0               878
 [2,]                 0              3411

## The fifth sample will be NaN because the beta will be 0/0

## Those same values after quantile normalization

 > rbind(M["cg00541303",], U["cg00541303",])
      5723646052_R02C02 5723646052_R04C01 5723646052_R05C02 5723646053_R04C02
 [1,]          5004.833          4700.000          2208.167          2405.167
 [2,]          3219.333          2671.667          3467.583          3463.167
      5723646053_R05C02 5723646053_R06C02
 [1,]          89.41791          1525.750
 [2,]          39.19927          4008.167

## which will no longer generate an NaN beta value

Doing any sort of normalization will tend to remove any NaN beta values, because you won't have any zero methylation or unmethylation values. But you can still get rid of CpGs that appear to be undetected in more than one sample:

> detectionP(RGsetEx)["cg00541303",]
 5723646052_R02C02 5723646052_R04C01 5723646052_R05C02 5723646053_R04C02
      0.000000e+00      0.000000e+00     4.387266e-305     1.330263e-288
 5723646053_R05C02 5723646053_R06C02
      1.000000e+00      5.385087e-67

ADD COMMENT • link 4.1 years ago James W. MacDonald 67k

0

Entering edit mode

Thank you very much for your explanation.

I think that I have understood the behaviour, but I am still not sure why the normalization methods are normalizing a NaN beta (or a 0 value in the methylated signal and a 0 value in the unmethylated signal) to a value.

Moreover, different normalization methods produce really different results in these values that originally are "NaN" in the RGSet.

I was using a filter of average detectionP of 0.01. Should I remove every CpG with at least one sample with a high detectionP? And, if we would like to impute the NaN values, should we do with the normalized M-values/Beta-values using the information of NaN values in the RGSet?

ADD REPLY • link 4.1 years ago omorante • 0

0

Entering edit mode

This happens for probes where Red, Green is zero. The default Illumina definition of beta value handles this problem by adding 100 beta = M / (M + U + 100) so you can think of that as imputation. You can use this definition of beta in minfi (if you want), but the default is to not add 100.

Most normalization methods change R,G or M,U and if you change the numbers, it is not surprising that you no longer have zeroes.

ADD REPLY • link 4.1 years ago Kasper Daniel Hansen ★ 6.5k

0

Entering edit mode

Thanks! However, I suppose that if M and U are 0, then that position has failed and we can not know the actual beta value, can we?

ADD REPLY • link 4.1 years ago omorante • 0

0

Entering edit mode

The probe could have failed. Alternatively, genetic differences may have caused this specific probe to not work for this specific individual; for example, the CpG may not exist at all.

ADD REPLY • link 4.1 years ago Kasper Daniel Hansen ★ 6.5k