Question

MAI error - vector size limit

0

Entering edit mode

Hans-Ulrich Klein ▴ 330

@hans-ulrich-klein-1945

Last seen 11 months ago

United States

Hi all,

I am receiving this error message when running MAI to impute missing values:

gp.mai <- MAI(gp.raw, MCAR_algorithm="BPCA", MNAR_algorithm="Single", assay_ix=1)
Estimating pattern of missingness
Imposing missingness
Generating features
Training
Error in randomForest.default(x, y, mtry = min(param$mtry, ncol(x)), ...) : 
  long vectors (argument 28) are not supported in .C

gp.raw is a SummarizedExperiment object with 9100x358 measurements. About 6% are missing values. Memory usage is substantial, and according to that message, the problem is that MAI exceeds the maximum vector length. Did anybody else run into this problem? I wonder whether there is an easy workaround, for example, using a subset of the data at the training step, but MAI() does not offer many options.

Best, Hans


sessionInfo()
R version 4.1.3 (2022-03-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 9 (stretch)

Matrix products: default
BLAS:   /mnt/mfs/cluster/bin/R-4.1.3/lib/libRblas.so
LAPACK: /mnt/mfs/cluster/bin/R-4.1.3/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] caret_6.0-92                lattice_0.20-45            
 [3] SummarizedExperiment_1.24.0 GenomicRanges_1.46.1       
 [5] GenomeInfoDb_1.30.1         IRanges_2.28.0             
 [7] S4Vectors_0.32.4            MatrixGenerics_1.6.0       
 [9] matrixStats_0.62.0          preprocessCore_1.56.0      
[11] MAI_1.0.0                   imputeLCMD_2.1             
[13] impute_1.68.0               pcaMethods_1.86.0          
[15] Biobase_2.54.0              BiocGenerics_0.40.0        
[17] norm_1.0-10.0               tmvtnorm_1.5               
[19] gmm_1.6-6                   sandwich_3.0-2             
[21] Matrix_1.4-0                mvtnorm_1.1-3              
[23] ggplot2_3.3.6              

loaded via a namespace (and not attached):
 [1] googledrive_2.0.0      colorspace_2.0-3       ellipsis_0.3.2        
 [4] class_7.3-20           XVector_0.34.0         fs_1.5.2              
 [7] proxy_0.4-27           listenv_0.8.0          prodlim_2019.11.13    
[10] fansi_1.0.3            lubridate_1.8.0        xml2_1.3.3            
[13] codetools_0.2-18       splines_4.1.3          doParallel_1.0.17     
[16] itertools_0.1-3        jsonlite_1.8.0         pROC_1.18.0           
[19] broom_1.0.0            dbplyr_2.2.1           missForest_1.5        
[22] readr_2.1.2            compiler_4.1.3         httr_1.4.3            
[25] backports_1.4.1        assertthat_0.2.1       gargle_1.2.0          
[28] cli_3.3.0              tools_4.1.3            gtable_0.3.0          
[31] glue_1.6.2             GenomeInfoDbData_1.2.7 reshape2_1.4.4        
[34] dplyr_1.0.9            doRNG_1.8.2            Rcpp_1.0.9            
[37] cellranger_1.1.0       vctrs_0.4.1            nlme_3.1-155          
[40] iterators_1.0.14       timeDate_4021.104      gower_1.0.0           
[43] stringr_1.4.0          globals_0.15.1         rvest_1.0.2           
[46] lifecycle_1.0.1        rngtools_1.5.2         googlesheets4_1.0.0   
[49] future_1.27.0          MASS_7.3-55            zlibbioc_1.40.0       
[52] zoo_1.8-10             scales_1.2.0           ipred_0.9-13          
[55] hms_1.1.1              parallel_4.1.3         tidyverse_1.3.2       
[58] rpart_4.1.16           stringi_1.7.8          randomForest_4.7-1.1  
[61] foreach_1.5.2          e1071_1.7-11           hardhat_1.2.0         
[64] lava_1.6.10            rlang_1.0.4            pkgconfig_2.0.3       
[67] bitops_1.0-7           purrr_0.3.4            recipes_1.0.1         
[70] tidyselect_1.1.2       parallelly_1.32.1      plyr_1.8.7            
[73] magrittr_2.0.3         R6_2.5.1               generics_0.1.3        
[76] DelayedArray_0.20.0    DBI_1.1.3              pillar_1.8.0          
[79] haven_2.5.0            withr_2.5.0            survival_3.2-13       
[82] RCurl_1.98-1.8         nnet_7.3-17            tibble_3.1.8          
[85] future.apply_1.9.0     modelr_0.1.8           utf8_1.2.2            
[88] tzdb_0.3.0             grid_4.1.3             readxl_1.4.0          
[91] data.table_1.14.2      forcats_0.5.1          ModelMetrics_1.2.2.2  
[94] reprex_2.0.1           digest_0.6.29          tidyr_1.2.0           
[97] munsell_0.5.0

randomForest MAI • 1.3k views

ADD COMMENT • link updated 3.3 years ago by jonathan.dekermanjian • 0 • written 3.3 years ago by Hans-Ulrich Klein ▴ 330

0

Entering edit mode

Here is a reproducible example. I believe any larger dataset will crash:

library(MAI) 
values <- rnorm(8000*300)
values[sample(1:(8000*300), size=20000)] <- NA
dataMat <- matrix(values, nrow=8000, ncol=300)
imputed <- MAI(dataMat, MCAR_algorithm="BPCA", MNAR_algorithm="Single")

Consumes a larger amount of memory during training and then crashes with:

Estimating pattern of missingness
Imposing missingness
Generating features
Training
Error in randomForest.default(x, y, mtry = min(param$mtry, ncol(x)), ...) : 
  long vectors (argument 28) are not supported in .C

ADD REPLY • link 3.3 years ago Hans-Ulrich Klein ▴ 330

score 0 · Answer 1 · 2022-08-09

Thank you for the reproducible example. This is an error with R not allowing the random forest algorithm to exceed the set memory size. I was able to to get the example you provided me to work by decreasing the number of trees trained in the RF. I added a parameter forest_list_args so that you can pass any random forest parameter you want to the model. I pushed the changes to https://github.com/KechrisLab/MAI. I was unable to push to Bioconductor I got an error of ! [remote rejected] main -> main (hook declined) error: failed to push some refs to 'git.bioconductor.org:packages/MAI.git'. I will need to a couple of days to figure out what is happening there. In the mean time please install the package through GitHub.

Let me know if anything else comes up.

Best luck, Jonathan.