Like a bagged model, a random forest model includes multiple trees. Unlike a bagged model, a random forest does not consider all available predictors when growing each tree: at each split, only a small subset of predictors, randomly drawn from the full set, is considered as candidates. Because strong predictors are not always available, they don't dominate every tree, so individual trees in a random forest are more diverse than trees in a bagged model (the trees are "de-correlated"). Aggregating de-correlated trees lowers the variance of the overall model and gives more reliable predictions.
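To get a feel for the de-correlation, consider the chance that any one predictor is even eligible at a given split. In ranger, mtry candidate predictors are drawn at each split; using the 28 predictors in this dataset and the default mtry of 5 (both discussed below):

```r
# With p = 28 predictors and mtry = 5 candidates drawn per split,
# the chance that any one fixed predictor is among the candidates:
p <- 28
mtry <- 5
mtry / p
#> 0.1785714
```

So even the strongest predictor is eligible at fewer than 1 in 5 splits, which keeps it from dominating every tree.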
Parameters that can be tuned include the number of trees in the "forest", the number of candidate predictors considered at each split (mtry), and the minimum number of cases a node must contain before it is allowed to split further.
Like in a bagged tree model, a bootstrapped sample is drawn from the original data for building each tree in the random forest.
There are two options for weights in the R package ranger to tackle imbalanced data:
Applying case weights (149/604 = 0.246688 for ‘No’ and 1.0 for ‘Yes’), which will sample ‘Yes’ cases with “higher probability” during the bootstrap sampling procedure, although the documentation doesn’t specify how much higher. This is similar to the downsampling of ‘No’ cases in the bagged model previously.
Applying class weights (could be any number that we think is appropriate, but for now, 0.246688 for ‘No’ and 1.0 for ‘Yes’), which adds a heavier cost for wrongly classifying ‘Yes’ cases during splitting in a tree, and also gives ‘Yes’ cases more influence in terminal nodes when the majority vote is taken to decide what that node is.
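In ranger, the two options look like this. This is a sketch, not the original code: the data frame `dat`, the outcome column `Outcome`, and the factor-level order are assumptions for illustration.

```r
library(ranger)

w_no <- 149 / 604  # ~0.246688

# Option 1: case weights. Rows with larger weights are drawn with higher
# probability when each tree's bootstrap sample is taken, so 'Yes' cases
# are oversampled relative to 'No' cases.
case_w <- ifelse(dat$Outcome == "Yes", 1, w_no)
rf_case <- ranger(Outcome ~ ., data = dat, case.weights = case_w)

# Option 2: class weights, supplied in the order of the factor levels
# (assumed here to be 'No', 'Yes'). Misclassifying 'Yes' costs more
# during splitting and in terminal-node majority votes.
rf_class <- ranger(Outcome ~ ., data = dat,
                   class.weights = c(w_no, 1))
```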
Like in a bagged tree model, one parameter to decide on for a random forest is how many trees to include in the model. The default number in R packages is 500, but it just has to be a high enough number that the error rate plateaus. Most gains in accuracy happen within the first 100 trees (ref).
Another parameter that needs tuning is mtry, the number of predictors randomly considered when splitting nodes as each tree is built. The default in R is the square root of the total number of predictors, rounded down; with the 28 predictors here, that works out to 5.
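A quick check of that default, assuming ranger's rule of taking the floor of the square root:

```r
# Default mtry: floor of the square root of the number of predictors
p <- 28
sqrt(p)         # 5.291503
floor(sqrt(p))  # 5 candidate predictors per split by default
```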
The last parameter to tune is the minimum number of cases in a node before the node is allowed to split further; the default minsplit in R's rpart is 20 cases (ranger's own default, min.node.size, is 1 for classification).
Technical notes: One more parameter that can be changed is the type of splitting rule. In this case, I kept to “gini”, the same as in the single classification tree and the bagged model previously. Note that case weights were used during bootstrapping for each tree in the random forest model, i.e., oversampling ‘Yes’ cases. Bootstrap was also the chosen method in caret’s trainControl( ) when doing a grid search of different tuning parameters, with seeds set to make all models reproducible. The accuracy shown in the plot below refers to that of out-of-bag samples.
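As a sketch of that setup (the object names, the data frame `dat`, and the exact grid values are my assumptions, not the original code):

```r
library(caret)

# Bootstrap resampling, as in trainControl(); seeding before train()
# makes the resamples and fitted models reproducible.
set.seed(2021)
ctrl <- trainControl(method = "boot")

# caret's "ranger" method tunes mtry, splitrule, and min.node.size;
# num.trees is passed through to ranger() directly.
tune_grid <- expand.grid(mtry = 2:4,
                         splitrule = "gini",
                         min.node.size = 20)

fit <- train(Outcome ~ ., data = dat, method = "ranger",
             trControl = ctrl, tuneGrid = tune_grid,
             num.trees = 200)
```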
The plot below shows the classification accuracy (and SD in shaded regions) of random forest models with different numbers of trees, different numbers of predictors for trees, and different minimum cases before a node is allowed to split. There are three broad trends:
Accuracy is worse only if the number of trees is very small, 5 or 10. Models with 50 or more trees are actually similar to models with hundreds of trees.
A smaller number of randomly selected predictors per tree works better than a larger number of predictors. There’s a steady decline in accuracy that’s especially noticeable after 4 predictors.
The minimum number of cases in a node doesn’t interact much with other parameters.
Suppose we stick with a minimum of 20 cases per node (for consistency with previous analyses), 4 or fewer predictors per tree, and between 50 and 400 trees to keep computation time reasonable. Under those constraints, a random forest with 200 trees and 4 predictors per tree is the most accurate at classifying "out-of-bag" cases (cases a tree was never trained on, because they were left out of its bootstrap sample).
```
##    mtry num.trees oob_accuracy
## 1     2        50    0.7422819
## 2     3        50    0.7261745
## 3     4        50    0.7583893
## 4     2       100    0.7583001
## 5     3       100    0.7277556
## 6     4       100    0.7636122
## 7     2       200    0.7543161
## 8     3       200    0.7436919
## 9     4       200    0.7742364
## 10    2       300    0.7583001
## 11    3       300    0.7490040
## 12    4       300    0.7702523
```
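The best setting can also be read off the grid programmatically. Here the results are rebuilt as a data frame from the table above:

```r
# Grid-search results, rebuilt as a data frame:
res <- data.frame(
  mtry         = rep(2:4, times = 4),
  num.trees    = rep(c(50, 100, 200, 300), each = 3),
  oob_accuracy = c(0.7422819, 0.7261745, 0.7583893,
                   0.7583001, 0.7277556, 0.7636122,
                   0.7543161, 0.7436919, 0.7742364,
                   0.7583001, 0.7490040, 0.7702523)
)

# The most accurate setting on out-of-bag cases:
res[which.max(res$oob_accuracy), ]
#>   mtry num.trees oob_accuracy
#> 9    4       200    0.7742364
```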
Just out of curiosity, I also ran the same random forest model (200 trees, 4 predictors each, min 20 cases in a node) with class weights instead of case weights, using the same values (0.246688 for "No", 1 for "Yes"). The out-of-bag accuracy was slightly worse at 0.7304, compared to 0.7742 with case weights. Adjusting the cost rather arbitrarily to 1 for "No" and 2 for "Yes" raised the out-of-bag accuracy to 0.7942. Without a good rationale for what the real misclassification costs should be, and to be consistent with previous analyses, I decided to stick with case weights.
A final random forest model was constructed using ranger. The model had 200 unpruned trees with a minimum node size of 20, with each tree built from 4 randomly selected predictors and a bootstrap sample of n=753 sampled with replacement from the original n=753. Case weights were applied, leading to an oversampling of ‘Yes’ cases. No cross-validation procedure was used.
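A minimal sketch of that final call; the data frame `dat`, the outcome name `Outcome`, and the importance mode are my assumptions:

```r
library(ranger)

# Case weights as before: 'No' cases down-weighted to 149/604
case_w <- ifelse(dat$Outcome == "Yes", 1, 149 / 604)

rf_final <- ranger(Outcome ~ ., data = dat,
                   num.trees = 200, mtry = 4, min.node.size = 20,
                   splitrule = "gini",       # same rule as earlier models
                   case.weights = case_w,    # oversamples 'Yes' in bootstraps
                   importance = "impurity",  # for the variable-importance plot
                   seed = 1)                 # reproducibility

rf_final$prediction.error  # out-of-bag error
```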
The out-of-bag error was 0.2351; i.e., the model accuracy was 76.49% when tested on cases not in the bootstrap samples used to construct individual trees.
Keeping in mind that accuracy is inflated when using a model to evaluate the same cases it was trained on, the model accuracy was 98.14% when tested on all cases, with 100% sensitivity, 97.68% specificity, and 98.84% AUC.
In the plot below, importance was scaled relative to the most important variable (Age).
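The scaling itself is straightforward; assuming `rf_final` is the fitted ranger model from above:

```r
# Variable importance scaled relative to the top variable (Age),
# so the most important variable sits at 100:
imp <- ranger::importance(rf_final)
imp_scaled <- 100 * imp / max(imp)
sort(imp_scaled, decreasing = TRUE)
```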
Randomly selecting subsets of predictors to form de-correlated trees improved model performance, compared to using all predictors for every tree as in the bagged model.
The top three variables were the same between the bagged model and random forest model, although Age overtook HHIE as the top variable in the random forest model.
| Model | Accuracy % | Sensitivity % | Specificity % | AUC % |
|---|---|---|---|---|
| Logistic reg (lasso) | 63.47 | 59.73 | 64.4 | 62.07 |
| Classification tree | 67.46 | 78.52 | 64.74 | 71.63 |
| Bagging (250 trees) | 77.42 | 32.21 | 88.58 | 60.4 |
| RF (200 trees, m=4) | 98.14 | 100 | 97.68 | 98.84 |
| Boosting | * | * | * | * |