About bagged tree models

A single classification tree can have high variance; that is, different subsets of data can lead to very different trees. Bagging – short for “bootstrap aggregating” – can decrease that variance by combining the results of multiple classification trees. In other words, the average of multiple “guesses” tends to be closer to the real answer than a single “guess”. The bagging procedure is as follows:
1. Take a random sample of the data with replacement (the “bootstrap”)
2. Grow a full, unpruned tree on that sample, using all predictors
3. Record the predicted outcomes from the tree; that is, the predicted class of each observation
4. Repeat for the number of trees specified by the user, e.g. a bagged model could have 10 or 100 trees
5. For a binary outcome, take the majority vote across the trees’ predictions; that is, if 7/10 trees say that observation 200 should be Class A, then that is the prediction for observation 200 (the “aggregating”)
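The five steps above can be sketched in a few lines. This is a minimal illustration using scikit-learn decision trees as the base learner (the original analysis appears to use R's rpart); the dataset and tree count here are placeholders, not the report's data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

n_trees = 25
predictions = []
for _ in range(n_trees):
    # 1. Bootstrap: sample n rows with replacement
    idx = rng.integers(0, len(X), size=len(X))
    # 2. Grow a full, unpruned tree on the bootstrap sample
    tree = DecisionTreeClassifier().fit(X[idx], y[idx])
    # 3. Record the tree's predicted class for every observation
    predictions.append(tree.predict(X))

# 4.-5. Aggregate: majority vote across trees (binary 0/1 outcome)
votes = np.mean(predictions, axis=0)
bagged_pred = (votes >= 0.5).astype(int)
```

Because each tree sees a different bootstrap sample, the individual predictions differ, and the majority vote smooths out much of the tree-to-tree variance.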

Number of trees to use

To find out how many repetitions of the bagging procedure are appropriate (i.e. how many individual classification trees should be created and aggregated into one model), models with different numbers of trees can be built, and the error measured. Too many trees won’t lead to overfitting but will take unnecessary computing time, so it’s better to use the smallest number of trees that leads to low error.
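The search described above could be sketched as follows: fit bagged models with increasing numbers of trees and track accuracy on the out-of-bag samples until it plateaus. This is an illustrative sketch on synthetic data, not a reproduction of the report's procedure.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=1)

for n_trees in [25, 50, 100, 250]:
    # BaggingClassifier's default base learner is a decision tree;
    # oob_score=True evaluates each observation only on trees whose
    # bootstrap sample did not include it
    model = BaggingClassifier(
        n_estimators=n_trees,
        oob_score=True,
        random_state=1,
    ).fit(X, y)
    print(n_trees, round(model.oob_score_, 3))
```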

The plot below shows that classification accuracy for this data increases as the number of trees in a bagged model increases, up to a point (the black line is mean accuracy across different subsets of data, and the grey lines are +/- 1 SD). Accuracy plateaus at about 250 trees, so 250 trees is a reasonable number to use.

Technical notes
For each model with a different number of trees (B), a cross-validation procedure (5-fold, repeated 3 times using different folds) was used to evaluate the model’s performance on slightly different random subsets of the data. For a fair comparison, the same folds were used for all models, and cross-validation occurred before the bootstrapping procedure in each model. During the bootstrapping procedure for each tree within a model, the majority class (“No”) was downsampled so that each tree was built on an equal number of “Yes” and “No” cases. The black line in the plot shows the model’s mean classification accuracy on samples that were not selected during the bootstrapping process when the model was being built (hence, “out of bag”). The grey lines are +/- 1 SD across the 15 cross-validation samples.

Final model

The final model consisted of 250 individual trees, each allowed to grow to its full depth. Each tree used the same parameters as the single classification tree: a minsplit of 20 observations for splitting a node, and split = “gini”. Each tree was built from a bootstrapped sample of n = 753, drawn with replacement from the original dataset of n = 753; because of the sampling with replacement, each tree saw roughly two-thirds of the unique observations. Downsampling was applied to the majority class (“No”) so that each tree was built on an equal number of “Yes” and “No” cases. No cross-validation procedure was used for this final model, as the dataset was relatively small and the individual trees were already built on different subsets of the data.
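The balanced bootstrap described above could be sketched as follows. The class sizes (604 “No”, 149 “Yes”) match the confusion matrix reported later, but the exact implementation of the report's downsampling step is an assumption; here, each tree's training sample is an equal-sized bootstrap draw from each class.

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative outcome vector with the report's class imbalance
y = np.array(["No"] * 604 + ["Yes"] * 149)

yes_idx = np.where(y == "Yes")[0]
no_idx = np.where(y == "No")[0]

# Bootstrap the minority class, then draw an equal-sized bootstrap
# sample from the majority class (downsampling "No")
yes_sample = rng.choice(yes_idx, size=len(yes_idx), replace=True)
no_sample = rng.choice(no_idx, size=len(yes_idx), replace=True)
train_idx = np.concatenate([yes_sample, no_sample])
# train_idx now holds a balanced sample: 149 "Yes" + 149 "No" rows
```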

Model performance

The table below shows the final model’s confusion matrix of correctly and incorrectly predicted cases. The model’s classification accuracy was 77.42% overall, with 32.21% sensitivity and 88.58% specificity.

##               Actual No Actual Yes
## Predicted No        535        101
## Predicted Yes        69         48
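The reported metrics can be recomputed directly from the confusion matrix above:

```python
# Cells from the confusion matrix above
tn, fn = 535, 101   # Predicted No:  actual No, actual Yes
fp, tp = 69, 48     # Predicted Yes: actual No, actual Yes

accuracy = (tn + tp) / (tn + fn + fp + tp)
sensitivity = tp / (tp + fn)   # true-positive rate on "Yes" cases
specificity = tn / (tn + fp)   # true-negative rate on "No" cases

# Matches the reported 77.42 / 32.21 / 88.58
print(round(accuracy * 100, 2),
      round(sensitivity * 100, 2),
      round(specificity * 100, 2))
```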

Variable importance

Variable importance was calculated in the same way as for the single classification tree, except that importance is now summed across the 250 trees in the model. In the plot below, importance was scaled relative to the most important variable (HHIE).
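The aggregation described above could be sketched as follows: sum the split-based (Gini) importances across the individual trees, then rescale so the most important variable equals 100. This uses synthetic data and scikit-learn, so it illustrates the idea rather than reproducing the report's figures.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=300, n_features=6, random_state=2)
model = BaggingClassifier(n_estimators=50, random_state=2).fit(X, y)

# Sum Gini-based importances across the individual trees
total_importance = np.sum(
    [tree.feature_importances_ for tree in model.estimators_], axis=0
)
# Scale relative to the most important variable (max = 100)
scaled = 100 * total_importance / total_importance.max()
```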

Takeaways

  1. A bagged model made of 250 trees appeared to be more accurate than a single tree, but only because its predictions skewed more heavily towards “No” cases (the vast majority of cases), despite the use of downsampling.

  2. HHIE, Age, and PTA(BE) were the most important variables in a bagged model, similar to a single classification tree.

Model                 Accuracy %  Sensitivity %  Specificity %  Area under curve
Logistic x=4               63.47          59.73          64.40             62.07
Class tree cp=0.013        67.46          78.52          64.74             71.63
Bagging (250 trees)        77.42          32.21          88.58             60.40
Random forest                  *              *              *                 *
Boosting                       *              *              *                 *