Dear all,
I am using the Boruta package and want to set non-default values for ntree and mtry. From the vignette, I understood that these parameters can be passed through to ranger (the random forest implementation that Boruta calls internally):
> You can pass arguments to the importance provider by providing it to the Boruta call; for instance, ranger, the default importance provider, makes use of all available CPU threads, won’t always be the optimal choice. Setting num.threads in the Boruta call will cause it to relay this argument to the ranger function, and hence limit the training process parallelism.
As a reminder, the reference manual describes the default parameter values as follows:
> Random Forest methods has two main parameters, number of attributes tried at each split and the number of trees in the forest; first one is called mtry in both implementations, but the second ntree in randomForest and num.trees in ranger. To this end, to maintain compatibility, getImpRf* functions still accept ntree parameter relaying it into num.trees. Still, both parameters take the same defaults in both implementations (square root of the number of all attributes and 500 respectively)
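To make the "square root of the number of all attributes" default concrete, here is a minimal sketch in base R. It assumes a data frame with 9 columns (8 attributes plus the response), which is what the `0:ncol(data)` loop output further down suggests. The helper `default_mtry` is hypothetical and only mirrors ranger's documented default (rounded-down square root of the number of predictors). The truncation of fractional values is also an assumption, though it is consistent with the paired outputs in the mtry example.

```r
# Hypothetical helper mirroring ranger's documented default for mtry:
# the rounded-down square root of the number of predictors, at least 1.
default_mtry <- function(n_predictors) {
  max(1L, as.integer(floor(sqrt(n_predictors))))
}

n_col <- 9L  # assumed ncol(data): 8 attributes plus the response

# The four candidate values tried in the mtry example
candidates <- c(
  sqrt_ncol          = sqrt(n_col),
  sqrt_ncol_minus_1  = sqrt(n_col - 1),
  ncol_third         = n_col / 3,
  ncol_minus_1_third = (n_col - 1) / 3
)

# Assumption: a fractional mtry is truncated to an integer before use
truncated <- floor(candidates)

print(round(candidates, 2))
print(truncated)
print(default_mtry(n_col - 1L))
```

Under these assumptions the four candidates collapse to only two distinct values, 3 and 2, which would explain why the four calls in the mtry example produce only two distinct outputs.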
My issue is that, while this works with ntree, I cannot get it to work with mtry. See the examples below, where I inspect the first line of the attStats output to compare results between calls:
ntree example
# Call with default
set.seed(54); attStats(Boruta(formule, data = data, doTrace = 0))
> meanImp medianImp minImp maxImp normHits decision <br>
> Actinobacteria **4.4267420** 4.2501325 -0.4203045 8.6944647 0.6868687 Confirmed <br>
# Call specifying ntree with the default value (we expect the same output)
set.seed(54); attStats(Boruta(formule, data = data, doTrace = 0, num.trees = 500))
> meanImp medianImp minImp maxImp normHits decision <br>
> Actinobacteria **4.4267420** 4.2501325 -0.4203045 8.6944647 0.6868687 Confirmed <br>
# Call with a non-default value for ntree (we expect different output)
set.seed(54); attStats(Boruta(formule, data = data, doTrace = 0, num.trees = 1000))
> meanImp medianImp minImp maxImp normHits decision <br>
> Actinobacteria 6.2996109 6.1540756 1.670633 12.0373538 0.6868687 Confirmed <br>
=> It's looking good with ntree
mtry example
# Call with default
set.seed(54); attStats(Boruta(formule, data = data, doTrace = 0))
> meanImp medianImp minImp maxImp normHits decision <br>
> Actinobacteria **4.4267420** 4.2501325 -0.4203045 8.6944647 0.6868687 Confirmed <br>
# Call specifying mtry with the default value (we expect the same output)
# PS: I wasn't sure whether Boruta treats this as a regression or a classification task,
# nor whether it uses ncol or ncol - 1, so I tested all cases
set.seed(54); attStats(Boruta(formule, data = data, doTrace = 0, mtry = sqrt(ncol(data))))
set.seed(54); attStats(Boruta(formule, data = data, doTrace = 0, mtry = sqrt(ncol(data) - 1)))
set.seed(54); attStats(Boruta(formule, data = data, doTrace = 0, mtry = ncol(data) / 3))
set.seed(54); attStats(Boruta(formule, data = data, doTrace = 0, mtry = (ncol(data) - 1) / 3))
> meanImp medianImp minImp maxImp normHits decision <br>
> Actinobacteria 4.3579070 4.2577025 1.341730 7.5233281 0.65656566 Confirmed <br>
> meanImp medianImp minImp maxImp normHits decision <br>
> Actinobacteria 4.1890915 4.1481305 -0.0775629 8.5946043 0.69696970 Confirmed <br>
> meanImp medianImp minImp maxImp normHits decision <br>
> Actinobacteria 4.3579070 4.2577025 1.341730 7.5233281 0.65656566 Confirmed <br>
> meanImp medianImp minImp maxImp normHits decision <br>
> Actinobacteria 4.1890915 4.1481305 -0.0775629 8.5946043 0.69696970 Confirmed <br>
# Given that we didn't find the same output, let's try all possible mtry values to find the "default one"
for (i in 0:ncol(data)) {
  print(paste0("mtry set at : ", i))
  set.seed(54); print(attStats(Boruta(formule, data = data, doTrace = 0, mtry = i)))
}
> [1] "mtry set at : 0" <br>
> meanImp medianImp minImp maxImp normHits decision <br>
> Actinobacteria **4.4267420** 4.2501325 -0.4203045 8.6944647 0.6868687 Confirmed <br>
> [1] "mtry set at : 1" <br>
> meanImp medianImp minImp maxImp normHits decision <br>
> Actinobacteria 3.75337508 3.65007557 0.6998206 6.5739342 0.64646465 Tentative <br>
> [1] "mtry set at : 2" <br>
> meanImp medianImp minImp maxImp normHits decision <br>
> Actinobacteria 4.1890915 4.1481305 -0.0775629 8.5946043 0.69696970 Confirmed <br>
> [1] "mtry set at : 3" <br>
> meanImp medianImp minImp maxImp normHits decision <br>
> Actinobacteria 4.3579070 4.2577025 1.341730 7.5233281 0.65656566 Confirmed <br>
> [1] "mtry set at : 4" <br>
> meanImp medianImp minImp maxImp normHits decision <br>
> Actinobacteria 4.6081939 4.5531384 1.081664 9.6172220 0.6767677 Confirmed <br>
> [1] "mtry set at : 5" <br>
> meanImp medianImp minImp maxImp normHits decision <br>
> Actinobacteria 4.7811405 4.6350621 0.9283142 9.84832090 0.71717172 Confirmed <br>
> [1] "mtry set at : 6" <br>
> meanImp medianImp minImp maxImp normHits decision <br>
> Actinobacteria 4.6548880 4.3611922 1.2079338 9.0864045 0.67676768 Confirmed <br>
> [1] "mtry set at : 7" <br>
> meanImp medianImp minImp maxImp normHits decision <br>
> Actinobacteria 4.7861728 4.7295293 1.3558568 9.7099331 0.6767677 Confirmed <br>
> [1] "mtry set at : 8" <br>
> meanImp medianImp minImp maxImp normHits decision <br>
> Actinobacteria 4.6571985 4.47381402 0.4249924 11.7667187 0.6666667 Confirmed <br>
> [1] "mtry set at : 9" <br>
> meanImp medianImp minImp maxImp normHits decision <br>
> Actinobacteria 4.6967805 4.3565461 0.1421514 10.1955589 0.6464646 Tentative <br>
With mtry = 0 we get the same values as the default call. But mtry = 0 just makes ranger fall back to its own default value for mtry, as the example below shows:
(ranger::ranger(response ~ ., data, mtry = 0))$mtry
> 2
(ranger::ranger(response ~ ., data, mtry = 1))$mtry
> 1
(ranger::ranger(response ~ ., data, mtry = 2))$mtry
> 2
What am I missing? How can I set a custom value for mtry if I don't understand how Boruta handles this parameter?
I did not take into account that Boruta is not a Bioconductor package, sorry for that.
The question was forwarded to the package author (Miron B. Kursa), who kindly replied, and his answer solved my issue.
This question may thus be closed or deleted if it doesn't belong here. Otherwise I will post the answer at some point.
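In the meantime, here is a hedged sketch of one possible explanation (it is not the author's reply). Boruta trains each forest on the original attributes plus permuted shadow copies of them, so the importance provider sees more predictors than `ncol(data) - 1`, and that number can shrink as attributes are rejected during the run. Under that assumption, the default mtry is not a single fixed number. The helper `default_mtry` and the attribute counts in `n_active` are hypothetical and purely illustrative.

```r
# Hypothetical helper: ranger's documented default, floor(sqrt(p)), at least 1.
default_mtry <- function(n_predictors) {
  max(1L, as.integer(floor(sqrt(n_predictors))))
}

# Assumed numbers of attributes still under consideration in successive
# Boruta iterations (illustrative only, not taken from the run above).
n_active <- c(8, 8, 7, 6, 5)

# Assumption: each active attribute gets one shadow copy, so the forest
# sees twice as many predictors as there are active attributes.
n_seen <- 2 * n_active

result <- data.frame(
  n_active     = n_active,
  n_seen       = n_seen,
  default_mtry = vapply(n_seen, default_mtry, integer(1))
)
print(result)
```

If this assumption holds, passing a fixed mtry pins the value for the whole run, whereas the default can change between iterations. That would be consistent with no single value in the loop above reproducing the default output.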