About boosting

Like bagged trees and random forests, a boosted tree model builds many trees and aggregates their results. The difference is that bagged trees and random forests “grow” their trees independently of one another, whereas a boosted model creates trees in sequence: each subsequent tree works on the results of the previous trees and tries to correct previously misclassified cases.

General procedure:

  1. Build a single classification tree from all the data (or from a bootstrap sample, depending on the option chosen). This first tree is usually fairly simple and not a very good classifier, only a little better than chance (50% correct).

  2. There are several ways to build the next tree. Here, the AdaBoost procedure weights the cases misclassified by the first tree more heavily than the correctly classified cases, so that the next round focuses on those cases.

  3. Build another simple tree, using the case weights from the previous round to try to improve performance on the trickier cases. At the end of the round, again weight the wrongly classified cases more heavily.

  4. Repeat this process until a set number of trees (iterations) is reached. That number is typically chosen by testing a few different values and seeing where the error rate plateaus.

  5. As a final step, aggregate the classification results of all trees, with each tree weighted by how well it performed overall. In other words, the final outcome is a weighted sum of the outcomes from all trees.
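The steps above can be sketched with the original (discrete) AdaBoost update applied to one-split “stumps” on a single feature. This is a minimal illustrative sketch in Python, not the blackboost() implementation used later; all function names and the toy data are my own.

```python
import math

def train_stump(X, y, w):
    """Pick the threshold and polarity on one feature that minimize
    the weighted classification error (a one-split 'tree')."""
    best = None
    for thresh in sorted(set(X)):
        for polarity in (1, -1):
            pred = [polarity if x >= thresh else -polarity for x in X]
            err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
            if best is None or err < best[0]:
                best = (err, thresh, polarity)
    return best

def adaboost_train(X, y, n_rounds):
    n = len(X)
    w = [1.0 / n] * n                              # start with equal case weights
    ensemble = []
    for _ in range(n_rounds):
        err, thresh, pol = train_stump(X, y, w)    # fit a simple tree (steps 1 and 3)
        err = min(max(err, 1e-10), 1 - 1e-10)      # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)    # tree weight: accurate trees count more
        ensemble.append((alpha, thresh, pol))
        for i in range(n):                         # step 2: reweight the cases
            pred = pol if X[i] >= thresh else -pol
            w[i] *= math.exp(-alpha * y[i] * pred) # up-weight misclassified cases
        total = sum(w)
        w = [wi / total for wi in w]               # renormalize weights
    return ensemble

def adaboost_predict(ensemble, x):
    # step 5: the final call is a weighted vote over all trees
    score = sum(a * (pol if x >= thresh else -pol) for a, thresh, pol in ensemble)
    return 1 if score >= 0 else -1

# A toy pattern that no single stump can fit: +, +, +, -, -, -, +, +
X = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1, 1, 1, -1, -1, -1, 1, 1]
model = adaboost_train(X, y, n_rounds=3)
preds = [adaboost_predict(model, x) for x in X]   # matches y after 3 rounds
```

No single stump classifies this toy pattern correctly, but the weighted vote of three sequentially reweighted stumps does, which is the whole point of the sequential procedure.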

Parameters to tune

Parameters that can be tuned include: the total number of trees (iterations) in the model, the complexity of individual trees (number of splits, or maximum depth), and the shrinkage parameter, a weight applied to all trees that adjusts the “learning rate”. The first and last parameters interact to some extent: a shrinkage parameter that is too small means a large number of trees is needed to fit the data appropriately, which takes more time to run, while a shrinkage parameter that is too large quickly overfits the data, so the model doesn’t generalize to other data.
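A toy illustration of that interaction (my own sketch, not the fitted model): treat each “tree” as a perfect one-leaf fit to the current residual, scaled by the shrinkage parameter nu. Smaller nu means each tree contributes less, so more trees are needed to approach the target.

```python
def boost_constant(target, nu, n_trees):
    """Toy boosting loop: each 'tree' perfectly predicts the current residual,
    but its contribution is shrunk by the learning rate nu."""
    prediction = 0.0
    for _ in range(n_trees):
        residual = target - prediction   # what the previous trees got wrong
        prediction += nu * residual      # add a shrunken correction
    return prediction

# After n trees the remaining error is (1 - nu)^n times the target,
# so a small nu needs many more trees to get close:
small_nu = boost_constant(1.0, nu=0.05, n_trees=10)   # still far from 1.0
large_nu = boost_constant(1.0, nu=0.5, n_trees=10)    # essentially converged
```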

Three parameters were tested to see which combination led to the best model: the total number of trees (mstop = 5, 10, 25, 50, 100, 200), the complexity of each tree (maxdepth = 1, 2, 3), and the learning rate (nu = 0.05, 0.1, 0.25, 0.5).
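That grid amounts to 6 × 3 × 4 = 72 candidate parameter combinations to evaluate. A quick sketch of the enumeration (shown in Python; in R, caret builds the analogous grid with expand.grid()):

```python
from itertools import product

mstop = [5, 10, 25, 50, 100, 200]   # total number of trees
maxdepth = [1, 2, 3]                # maximum depth of each tree
nu = [0.05, 0.1, 0.25, 0.5]         # learning rate

# Every combination of the three tuning parameters
grid = list(product(mstop, maxdepth, nu))
n_combos = len(grid)                # 6 * 3 * 4 = 72 candidate models
```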

The plot below shows that fewer trees worked better than more trees; model accuracy did not improve beyond 10 trees or so. More complex, deeper trees also worked better than a single stump. When the learning rate was set to about 0.25 or higher, the number of trees no longer mattered.

Technical notes

Technical notes for plot: The accuracy is the average held-out accuracy over the 15 resamples from a 5-fold, 3-repeat cross-validation procedure. Shaded areas are +/- 1 SD. Case weights were incorporated.

Technical notes for choice of package: ada() from package ‘ada’ and boosting() from package ‘adabag’ are both based on rpart(), so they would be most comparable to the previous analyses that were also based on rpart(). Unfortunately, neither method allows case weights, so nearly all cases end up classified as “No”. Alternatives are mboost() and blackboost() from package ‘mboost’, which do allow case weights. However, mboost() uses “stumps” (a single split on one predictor per tree), and maxdepth can’t be adjusted. blackboost() allows tuning of maxdepth, but its individual trees determine splits using statistical tests rather than the Gini index used by the previous rpart-based methods. So blackboost() results aren’t directly comparable to those earlier rpart-based results, but being able to incorporate case weights and tune maxdepth outweighed other considerations. I chose family = AdaExp() to use the original AdaBoost algorithm (Freund & Schapire, 1996).



Final model

A final boosted model was constructed using caret::train(method = "blackboost", family = AdaExp()). The final parameters were mstop = 10, maxdepth = 3, and learning rate nu = 0.1. Case weights were applied, and no resampling (e.g. bootstrapping) was done. Overall accuracy was 67.73%, with 49.66% sensitivity and 72.19% specificity.

##          actual
## predicted  No Yes
##       No  347  56
##       Yes 257  93
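For reference, the summary statistics relate to a 2x2 confusion matrix as follows. This is a generic Python sketch with made-up counts for illustration (not the matrix above), treating “Yes” as the positive class:

```python
def confusion_metrics(tn, fn, fp, tp):
    """Accuracy, sensitivity, specificity from 2x2 confusion-matrix counts.
    tp/fn refer to the positive ('Yes') class, tn/fp to the negative ('No') class."""
    total = tn + fn + fp + tp
    accuracy = (tp + tn) / total        # share of all cases classified correctly
    sensitivity = tp / (tp + fn)        # share of actual 'Yes' cases caught
    specificity = tn / (tn + fp)        # share of actual 'No' cases caught
    return accuracy, sensitivity, specificity

# Hypothetical counts, for illustration only
acc, sens, spec = confusion_metrics(tn=90, fn=10, fp=30, tp=20)
```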

Variable importance

As before, variable importance was scaled relative to that of Age.
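The rescaling works like this sketch; the raw scores below are made up for illustration, and only the variable names come from the analysis:

```python
# Hypothetical raw importance scores (not the fitted model's actual values)
raw_importance = {"Age": 0.48, "HHIE": 0.31, "PTA": 0.12, "Sex": 0.05}

# Rescale so that Age = 100 and every other variable is expressed relative to it
scaled = {var: 100 * score / raw_importance["Age"]
          for var, score in raw_importance.items()}
```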

Takeaways

  1. It was surprising that the boosted tree model performed this poorly; I had expected it to do at least better than a single classification tree. Note, though, that there were under-the-hood differences in how this algorithm split cases compared to the previous tree methods.

  2. The most important three variables are the same as in the classification tree and the random forest models: Age first, followed by HHIE and then PTA.

Model                            Accuracy %  Sensitivity %  Specificity %    AUC
Logistic reg (lasso, x=4)             63.47          59.73          64.40  62.07
Classification tree (CART)            67.46          78.52          64.74  71.63
Bagging (250 trees)                   77.42          32.21          88.58  60.40
RF (200 trees, m=4) ^                 98.14         100.00          97.68  98.84
Boosting (10 trees, depth=3) ^^       67.73          49.66          72.19  60.92

Footnotes:

^ Accuracy looks amazing here, but it’s inflated by evaluating the model on the same data it was trained on. As shown on the previous page, the accuracy of the random forest model dropped to 76.49% when tested on out-of-bag samples the trees had not been built on. For the earlier methods (LR, tree, bagging), accuracy wasn’t great even when the models were evaluated on their own training data, because even the best of those models didn’t work that well.

^^ It’s not entirely fair to compare the boosting results with the other tree methods’ results, since blackboost() uses a different method (statistical tests rather than the Gini index) to split cases.