Hi all,
I am receiving this error message when running MAI to impute missing values:
gp.mai <- MAI(gp.raw, MCAR_algorithm="BPCA", MNAR_algorithm="Single", assay_ix=1)
Estimating pattern of missingness
Imposing missingness
Generating features
Training
Error in randomForest.default(x, y, mtry = min(param$mtry, ncol(x)), ...) :
long vectors (argument 28) are not supported in .C
gp.raw is a SummarizedExperiment object with 9100 x 358 measurements, of which about 6% are missing. Memory usage during training is substantial, and judging from the message, the problem is that the matrix MAI hands to randomForest exceeds R's maximum vector length for .C routines (2^31 - 1 elements). Did anybody else run into this problem? I wonder whether there is an easy workaround, for example using only a subset of the data at the training step, but MAI() does not offer many options.
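For what it's worth, the kind of workaround I have in mind would look something like this (untested sketch; the chunk size of 2000 rows is an arbitrary guess, and I have not checked how chunking interacts with MAI's missingness-pattern estimation):

## Impute in row-wise chunks so the training matrix stays well below
## R's long-vector limit (.C routines cannot take vectors with more
## than .Machine$integer.max elements)
chunks <- split(seq_len(nrow(gp.raw)), ceiling(seq_len(nrow(gp.raw)) / 2000))
gp.mai.chunks <- lapply(chunks, function(i) {
    MAI(gp.raw[i, ], MCAR_algorithm = "BPCA",
        MNAR_algorithm = "Single", assay_ix = 1)
})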
Best, Hans
sessionInfo()
R version 4.1.3 (2022-03-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 9 (stretch)
Matrix products: default
BLAS: /mnt/mfs/cluster/bin/R-4.1.3/lib/libRblas.so
LAPACK: /mnt/mfs/cluster/bin/R-4.1.3/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] caret_6.0-92 lattice_0.20-45
[3] SummarizedExperiment_1.24.0 GenomicRanges_1.46.1
[5] GenomeInfoDb_1.30.1 IRanges_2.28.0
[7] S4Vectors_0.32.4 MatrixGenerics_1.6.0
[9] matrixStats_0.62.0 preprocessCore_1.56.0
[11] MAI_1.0.0 imputeLCMD_2.1
[13] impute_1.68.0 pcaMethods_1.86.0
[15] Biobase_2.54.0 BiocGenerics_0.40.0
[17] norm_1.0-10.0 tmvtnorm_1.5
[19] gmm_1.6-6 sandwich_3.0-2
[21] Matrix_1.4-0 mvtnorm_1.1-3
[23] ggplot2_3.3.6
loaded via a namespace (and not attached):
[1] googledrive_2.0.0 colorspace_2.0-3 ellipsis_0.3.2
[4] class_7.3-20 XVector_0.34.0 fs_1.5.2
[7] proxy_0.4-27 listenv_0.8.0 prodlim_2019.11.13
[10] fansi_1.0.3 lubridate_1.8.0 xml2_1.3.3
[13] codetools_0.2-18 splines_4.1.3 doParallel_1.0.17
[16] itertools_0.1-3 jsonlite_1.8.0 pROC_1.18.0
[19] broom_1.0.0 dbplyr_2.2.1 missForest_1.5
[22] readr_2.1.2 compiler_4.1.3 httr_1.4.3
[25] backports_1.4.1 assertthat_0.2.1 gargle_1.2.0
[28] cli_3.3.0 tools_4.1.3 gtable_0.3.0
[31] glue_1.6.2 GenomeInfoDbData_1.2.7 reshape2_1.4.4
[34] dplyr_1.0.9 doRNG_1.8.2 Rcpp_1.0.9
[37] cellranger_1.1.0 vctrs_0.4.1 nlme_3.1-155
[40] iterators_1.0.14 timeDate_4021.104 gower_1.0.0
[43] stringr_1.4.0 globals_0.15.1 rvest_1.0.2
[46] lifecycle_1.0.1 rngtools_1.5.2 googlesheets4_1.0.0
[49] future_1.27.0 MASS_7.3-55 zlibbioc_1.40.0
[52] zoo_1.8-10 scales_1.2.0 ipred_0.9-13
[55] hms_1.1.1 parallel_4.1.3 tidyverse_1.3.2
[58] rpart_4.1.16 stringi_1.7.8 randomForest_4.7-1.1
[61] foreach_1.5.2 e1071_1.7-11 hardhat_1.2.0
[64] lava_1.6.10 rlang_1.0.4 pkgconfig_2.0.3
[67] bitops_1.0-7 purrr_0.3.4 recipes_1.0.1
[70] tidyselect_1.1.2 parallelly_1.32.1 plyr_1.8.7
[73] magrittr_2.0.3 R6_2.5.1 generics_0.1.3
[76] DelayedArray_0.20.0 DBI_1.1.3 pillar_1.8.0
[79] haven_2.5.0 withr_2.5.0 survival_3.2-13
[82] RCurl_1.98-1.8 nnet_7.3-17 tibble_3.1.8
[85] future.apply_1.9.0 modelr_0.1.8 utf8_1.2.2
[88] tzdb_0.3.0 grid_4.1.3 readxl_1.4.0
[91] data.table_1.14.2 forcats_0.5.1 ModelMetrics_1.2.2.2
[94] reprex_2.0.1 digest_0.6.29 tidyr_1.2.0
[97] munsell_0.5.0
Here is a reproducible example; I believe any sufficiently large dataset will trigger it. It consumes a large amount of memory during the training step and then crashes with the same "long vectors (argument 28) are not supported in .C" error shown above:
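A minimal sketch (the simulated dimensions and ~6% missingness are made up to match gp.raw, and I am assuming MAI() also accepts a plain numeric matrix, as in the package vignette):

library(MAI)
set.seed(1)

## Simulate a matrix the size of gp.raw: 9100 features x 358 samples
sim <- matrix(rnorm(9100 * 358, mean = 15, sd = 2), nrow = 9100)

## Knock out roughly 6% of the values at random
sim[sample(length(sim), round(0.06 * length(sim)))] <- NA

## Fails in the "Training" step with the long-vector error
sim.mai <- MAI(sim, MCAR_algorithm = "BPCA", MNAR_algorithm = "Single")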