MAI error - vector size limit
@hans-ulrich-klein-1945
Last seen 7 weeks ago
United States

Hi all,

I am receiving this error message when running MAI to impute missing values:

gp.mai <- MAI(gp.raw, MCAR_algorithm="BPCA", MNAR_algorithm="Single", assay_ix=1)
Estimating pattern of missingness
Imposing missingness
Generating features
Training
Error in randomForest.default(x, y, mtry = min(param$mtry, ncol(x)), ...) :
  long vectors (argument 28) are not supported in .C

gp.raw is a SummarizedExperiment object with 9100x358 measurements. About 6% are missing values. Memory usage is substantial, and according to that message, the problem is that MAI exceeds the maximum vector length. Did anybody else run into this problem? I wonder whether there is an easy workaround, for example, using a subset of the data at the training step, but MAI() does not offer many options.

Best,
Hans

sessionInfo()
R version 4.1.3 (2022-03-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 9 (stretch)

Matrix products: default
BLAS:   /mnt/mfs/cluster/bin/R-4.1.3/lib/libRblas.so
LAPACK: /mnt/mfs/cluster/bin/R-4.1.3/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
 [1] caret_6.0-92                lattice_0.20-45
 [3] SummarizedExperiment_1.24.0 GenomicRanges_1.46.1
 [5] GenomeInfoDb_1.30.1         IRanges_2.28.0
 [7] S4Vectors_0.32.4            MatrixGenerics_1.6.0
 [9] matrixStats_0.62.0          preprocessCore_1.56.0
[11] MAI_1.0.0                   imputeLCMD_2.1
[13] impute_1.68.0               pcaMethods_1.86.0
[15] Biobase_2.54.0              BiocGenerics_0.40.0
[17] norm_1.0-10.0               tmvtnorm_1.5
[19] gmm_1.6-6                   sandwich_3.0-2
[21] Matrix_1.4-0                mvtnorm_1.1-3
[23] ggplot2_3.3.6

loaded via a namespace (and not attached):
 [1] googledrive_2.0.0   colorspace_2.0-3       ellipsis_0.3.2
 [4] class_7.3-20        XVector_0.34.0         fs_1.5.2
 [7] proxy_0.4-27        listenv_0.8.0          prodlim_2019.11.13
[10] fansi_1.0.3         lubridate_1.8.0        xml2_1.3.3
[13] codetools_0.2-18    splines_4.1.3          doParallel_1.0.17
[16] itertools_0.1-3     jsonlite_1.8.0         pROC_1.18.0
[19] broom_1.0.0         dbplyr_2.2.1           missForest_1.5
[22] readr_2.1.2         compiler_4.1.3         httr_1.4.3
[25] backports_1.4.1     assertthat_0.2.1       gargle_1.2.0
[28] cli_3.3.0           tools_4.1.3            gtable_0.3.0
[31] glue_1.6.2          GenomeInfoDbData_1.2.7 reshape2_1.4.4
[34] dplyr_1.0.9         doRNG_1.8.2            Rcpp_1.0.9
[37] cellranger_1.1.0    vctrs_0.4.1            nlme_3.1-155
[40] iterators_1.0.14    timeDate_4021.104      gower_1.0.0
[43] stringr_1.4.0       globals_0.15.1         rvest_1.0.2
[46] lifecycle_1.0.1     rngtools_1.5.2         googlesheets4_1.0.0
[49] future_1.27.0       MASS_7.3-55            zlibbioc_1.40.0
[52] zoo_1.8-10          scales_1.2.0           ipred_0.9-13
[55] hms_1.1.1           parallel_4.1.3         tidyverse_1.3.2
[58] rpart_4.1.16        stringi_1.7.8          randomForest_4.7-1.1
[61] foreach_1.5.2       e1071_1.7-11           hardhat_1.2.0
[64] lava_1.6.10         rlang_1.0.4            pkgconfig_2.0.3
[67] bitops_1.0-7        purrr_0.3.4            recipes_1.0.1
[70] tidyselect_1.1.2    parallelly_1.32.1      plyr_1.8.7
[73] magrittr_2.0.3      R6_2.5.1               generics_0.1.3
[76] DelayedArray_0.20.0 DBI_1.1.3              pillar_1.8.0
[79] haven_2.5.0         withr_2.5.0            survival_3.2-13
[82] RCurl_1.98-1.8      nnet_7.3-17            tibble_3.1.8
[85] future.apply_1.9.0  modelr_0.1.8           utf8_1.2.2
[88] tzdb_0.3.0          grid_4.1.3             readxl_1.4.0
[91] data.table_1.14.2   forcats_0.5.1          ModelMetrics_1.2.2.2
[94] reprex_2.0.1        digest_0.6.29          tidyr_1.2.0
[97] munsell_0.5.0

Here is a reproducible example. I believe any larger dataset will crash:

library(MAI)
values <- rnorm(8000*300)
values[sample(1:(8000*300), size=20000)] <- NA
dataMat <- matrix(values, nrow=8000, ncol=300)
imputed <- MAI(dataMat, MCAR_algorithm="BPCA", MNAR_algorithm="Single")

Consumes a larger amount of memory during training and then crashes with:

Estimating pattern of missingness
Imposing missingness
Generating features
Training
Error in randomForest.default(x, y, mtry = min(param$mtry, ncol(x)), ...) :
  long vectors (argument 28) are not supported in .C
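
For context, here is a rough sketch of the limit the error message refers to. R's .C interface cannot receive vectors with more than 2^31 - 1 elements. The input matrix from the example is far below that size, so presumably some intermediate object built during training crosses the limit; which object that is has not been established in this thread. The helper is_long_vector() is hypothetical and only illustrates the arithmetic.

```r
# Largest length of an ordinary (non-long) R vector: 2^31 - 1 elements.
max_len <- .Machine$integer.max

# Hypothetical helper: would an nrow x ncol matrix be stored as a long vector?
is_long_vector <- function(nrow, ncol) {
  # Multiply as doubles so the product itself cannot overflow an integer.
  as.numeric(nrow) * as.numeric(ncol) > max_len
}

# The input matrix from the reproducible example (2.4 million values).
is_long_vector(8000, 300)

# A square 50,000 x 50,000 matrix (2.5 billion values), for comparison.
is_long_vector(50000, 50000)
```

The first call is well under the limit and the second is over it, which suggests the problem lies in something the training step allocates rather than in the data matrix itself.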

@67e42748
Last seen 7 weeks ago
United States

Thank you for the reproducible example. This error occurs because R does not allow the random forest algorithm to exceed the maximum vector length: the compiled code is called through .C, which does not support long vectors (more than 2^31 - 1 elements). I was able to get the example you provided to work by decreasing the number of trees trained in the RF. I added a parameter, forest_list_args, so that you can pass any random forest parameter you want to the model. I pushed the changes to https://github.com/KechrisLab/MAI. I was unable to push to Bioconductor; I got the following error:

! [remote rejected] main -> main (hook declined)
error: failed to push some refs to 'git.bioconductor.org:packages/MAI.git'

I will need a couple of days to figure out what is happening there. In the meantime, please install the package through GitHub.
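
A minimal sketch of what that could look like with the example above. It assumes forest_list_args takes a named list that is forwarded to the random forest; ntree is a standard randomForest argument, and the value 100 is only illustrative, not the value tested here. Installing with the remotes package is also an assumption; any GitHub installer should work.

```r
# Install the development version from GitHub (uncomment to run):
# install.packages("remotes")
# remotes::install_github("KechrisLab/MAI")

# Same simulated data as in the reproducible example.
set.seed(1)
values <- rnorm(8000 * 300)
values[sample(seq_along(values), size = 20000)] <- NA
dataMat <- matrix(values, nrow = 8000, ncol = 300)

# Assumption: a named list of randomForest arguments; fewer trees than the
# default should reduce the memory needed during training.
rf_args <- list(ntree = 100)

# Only call MAI if the installed version actually has the new parameter.
if (requireNamespace("MAI", quietly = TRUE) &&
    "forest_list_args" %in% names(formals(MAI::MAI))) {
  imputed <- MAI::MAI(dataMat,
                      MCAR_algorithm = "BPCA",
                      MNAR_algorithm = "Single",
                      forest_list_args = rf_args)
}
```

If the error persists, lowering ntree further is the first thing to try.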

Let me know if anything else comes up.

Best of luck,
Jonathan