A classification tree uses predictors to sequentially sort observations into increasingly homogeneous groups, with the goal of finding which sequence of labels or cut-off values ultimately leads to an outcome of “Yes” or “No”. This differs from logistic regression, which uses a weighted combination of predictors to calculate the probability of an outcome being “Yes” or “No”.
Advantages:
A classification tree has a built-in variable selection procedure, and
it can handle different types of predictors easily (categorical and
numerical). Compared to logistic regression, the information from a
classification tree is easier to interpret. A toy example: 20 out of a
sample of 100 people bought hearing aids and 80 did not, and 19 of those
20 were correctly identified using the cut-off values of PTA > 40
followed by HHIE > 20. This information is more practically useful
than knowing, say, that every 1-point increase in PTA led to a 3%
increase in the odds of someone buying a hearing aid. Trees can also
work around missing data by substituting “surrogate variables” for
missing ones, whereas logistic regression requires all variables to be
complete.
Disadvantages:
A single classification tree is not as “robust” as some other models.
That is, a small change in the data can lead to a big change in the
model, because earlier branches in the sequence will have flow-on
effects. Simplicity can also be its downfall. First, using labels and
cut-offs to classify cases isn’t very flexible for fitting complex data
and can make trees less accurate than some other types of models.
Second, allowing fewer branches to grow makes a tree easier to
interpret, but it is likely to classify fewer cases correctly.
The ratio of “Yes” to “No” cases was 149:604. “No” cases were assigned lighter weights (149/604 = 0.2466887) than “Yes” cases (1.0), so that the model would not be biased towards “No” cases. It’s as if there were equal numbers of each class (149 each) at the top of the tree.
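A minimal sketch of this weighting scheme, using the same class counts (149 “Yes”, 604 “No”); the variable names here are illustrative, not the report's own:

```r
# Give each "No" case a weight of n_yes / n_no so that the weighted class
# totals are equal, as if there were 149 of each class.
n_yes <- 149
n_no  <- 604
y <- factor(c(rep("Yes", n_yes), rep("No", n_no)))
w <- ifelse(y == "Yes", 1, n_yes / n_no)  # "No" weight = 149/604 = 0.2466887

# Weighted class totals are now equal (149 each):
c(Yes = sum(w[y == "Yes"]), No = sum(w[y == "No"]))

# A vector like w would be passed to rpart() via its `weights` argument.
```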
Technical note: A loss matrix can be specified for the tree model, adding a penalty for either false positives or missed true cases (false negatives). In this analysis, using a loss matrix instead of, or in addition to, case weights made no difference to sensitivity or specificity; only the case weights mattered.
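For reference, a hedged sketch of how a loss matrix is specified in rpart, shown on the kyphosis dataset bundled with the package (the report's own data are not reproduced here). In rpart, `L[i, j]` is the cost of predicting class `j` when the true class is `i`, with zeros on the diagonal; the 5x penalty on missed “present” cases below is an assumed, illustrative value:

```r
library(rpart)

# Rows = true class, columns = predicted class (factor level order:
# "absent", "present"); the diagonal must be 0.
loss <- matrix(c(0, 1,    # true "absent":  predict absent / predict present
                 5, 0),   # true "present": a miss is assumed to cost 5x
               nrow = 2, byrow = TRUE)

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class", parms = list(loss = loss))
```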
A “full” classification tree is one that is allowed to grow without any restriction. In other words, the tree continues to sort observations at each node without any minimum criterion for improvement. In our case, sorting only stops when there are fewer than 20 cases in a node (minsplit = 20 is the default in rpart, but a different value can be specified). However, a full tree is hard to interpret, and likely would not generalize to other data.
Some parameters can be specified to restrict the growth of a tree: specifying a minimum number of cases in a node before further sorting is allowed, or specifying a maximum depth (e.g., no more than three splits as the longest path through the tree; note that minsplit and maxdepth can interact).
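A sketch of these growth restrictions via `rpart.control`, again on the bundled kyphosis data; the minsplit and maxdepth values are illustrative:

```r
library(rpart)

ctrl <- rpart.control(minsplit = 20,  # need >= 20 cases in a node to split it
                      maxdepth = 3,   # longest path: no more than 3 splits
                      cp = 0)         # no improvement threshold, so growth is
                                      # limited only by minsplit and maxdepth

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class", control = ctrl)

# rpart node numbers encode depth: node n sits at depth floor(log2(n)),
# so the deepest node here can be at most depth 3.
max(floor(log2(as.numeric(rownames(fit$frame)))))
```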
Alternatively, a method of “pruning” can be used to test how well the tree classifies cases at different complexities, and then settle on the simplest tree that is reasonably accurate. A low complexity parameter (CP) value allows a tree to split whenever there’s a small improvement, while a high CP value requires a large improvement before allowing a further split. So low CP values lead to more complex trees, while high CP values lead to simpler trees. The preferred tree is the one with the highest CP whose accuracy is still close to the best, i.e., simple but accurate. The plot below shows the accuracy of different trees for our data, built using a typical range of CP values. (There is an “average” and “SE” because there’s a 5-fold cross-validation procedure.)
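The mechanics of CP-based pruning can be sketched as follows (kyphosis data; the CP values are illustrative): grow a large tree, inspect the cross-validated error in its CP table, then prune at a chosen CP.

```r
library(rpart)

set.seed(1)
# Grow a deliberately large tree (cp = 0 removes the improvement threshold).
fit_full <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                  method = "class",
                  control = rpart.control(cp = 0, minsplit = 5, xval = 5))

fit_full$cptable   # columns: CP, nsplit, rel error, xerror (CV), xstd (SE)

# Prune away any split that does not improve the fit by at least cp = 0.02.
fit_pruned <- prune(fit_full, cp = 0.02)
```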
Judging by accuracy, CP = 0.02 seems to be the optimal level of complexity, leading to a relatively simple tree whose accuracy is within range of more complex trees.
However, examining other measures, using a CP value of 0.013 instead of 0.02 gives a 19-point increase in sensitivity (59% to 78%), for no change in overall accuracy (67%).
## cpvalue nsplits Accuracy Sensitivity Specificity AUC
## 14 0.100 1 0.6507 0.4698 0.6954 0.5826
## 13 0.050 3 0.6720 0.5906 0.6921 0.6413
## 12 0.040 3 0.6720 0.5906 0.6921 0.6413
## 11 0.030 3 0.6720 0.5906 0.6921 0.6413
## 10 0.025 3 0.6720 0.5906 0.6921 0.6413
## 9 0.020 3 0.6720 0.5906 0.6921 0.6413
## 8 0.015 5 0.7025 0.5906 0.7301 0.6604
## 7 0.013 13 0.6746 0.7852 0.6474 0.7163
## 6 0.012 15 0.6494 0.8591 0.5977 0.7284
## 5 0.011 22 0.7158 0.8591 0.6805 0.7698
## 4 0.010 24 0.7331 0.8591 0.7020 0.7805
## 3 0.005 34 0.7663 0.9060 0.7318 0.8189
## 2 0.001 43 0.7888 0.8993 0.7616 0.8305
## 1 0.000 43 0.7888 0.8993 0.7616 0.8305
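The choice of CP = 0.013 over 0.02 can be reproduced from a few rows of the table above: among candidate trees whose overall accuracy is essentially unchanged (a 1-point tolerance is assumed here), take the one with the highest sensitivity.

```r
# A subset of the metrics table above (proportions, not percentages).
metrics <- data.frame(
  cpvalue     = c(0.020, 0.015, 0.013, 0.012),
  Accuracy    = c(0.6720, 0.7025, 0.6746, 0.6494),
  Sensitivity = c(0.5906, 0.5906, 0.7852, 0.8591)
)

base_acc <- metrics$Accuracy[metrics$cpvalue == 0.020]

# Keep trees whose accuracy is within 1 point of the CP = 0.02 baseline,
# then pick the one with the highest sensitivity.
eligible <- metrics[metrics$Accuracy >= base_acc - 0.01, ]
chosen   <- eligible[which.max(eligible$Sensitivity), "cpvalue"]
chosen  # 0.013
```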
Max depth = 4. In the original analysis where the growth of the classification tree was restricted to a maximum depth of 4, Age Stigma formed two branches at depth 3. After dropping Q4 (the first of five items in the Age Stigma scale), Age Stigma was no longer part of the decision tree, but other branches and the overall accuracy stayed similar.
| | Accuracy % | Sensitivity % | Specificity % | AUC |
|---|---|---|---|---|
| With Q4 | 71.45 | 63.09 | 73.51 | 68.30 |
| Without Q4 | 69.99 | 63.09 | 71.69 | 67.39 |
As stated earlier, using a CP value of 0.013 instead of 0.02 gives a 19-point increase in sensitivity, for no change in overall accuracy. Both tree models and their metrics are shown below for comparison.
| | Accuracy % | Sensitivity % | Specificity % | AUC |
|---|---|---|---|---|
| CP = 0.02 | 67.20 | 59.06 | 69.21 | 64.13 |
| CP = 0.013 | 67.46 | 78.52 | 64.74 | 71.63 |
Figure: tree with CP = 0.02, with actual and weighted counts.

Figure: tree with CP = 0.013, with actual and weighted counts.
In this case, variable importance was calculated from how much each variable increased the proportion of correctly classified cases, relative to the other variables. In the plots below, variable importance was scaled so that the total summed to 100%. (As an example, of all the cases that the CP = 0.013 tree model switched from an incorrect category to the correct one, Age accounted for 27.5% of them.)
Technical note: maxcompete = 0 and maxsurrogate = 0 were set, so that only the variables involved in the primary splits are considered.
## Variable_importance
## Age 27.5
## HHIE_total 17.5
## PTA4_better_ear 10.7
## Help_problems 8.6
## Sub_Age_avg 7.1
## Health 6.5
## Ability 6.1
## QoL 5.6
## Edu 5.3
## Soc_Somewhat_negative.f 5.0
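A sketch of how such an importance table can be extracted from an rpart fit and scaled to sum to 100% (kyphosis data; the report's variables are not reproduced). With maxcompete = 0 and maxsurrogate = 0, the importance values reflect primary splits only:

```r
library(rpart)

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class",
             control = rpart.control(maxcompete = 0, maxsurrogate = 0))

# Named vector of per-variable improvement totals from the primary splits.
imp <- fit$variable.importance

# Rescale so the values sum to (approximately, after rounding) 100%.
imp_pct <- round(100 * imp / sum(imp), 1)
imp_pct
```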
To check how both tree models changed (or not) with different subsets of the data, 5 different trees were constructed using the same parameters, dropping out a different, randomly-selected 20% portion of the data each time. The proportion of ‘Yes’ and ‘No’ cases was kept constant in all subsets of the data. The same cases were dropped out for both trees.
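The resampling step above can be sketched as follows, again on the bundled kyphosis data: refit the same tree on five stratified 80% subsets, so that the class proportions are preserved in each. (In the real analysis the report's own formula, case weights, and both CP values carry over, and the same cases are dropped for both trees; this sketch shows one tree only.)

```r
library(rpart)

set.seed(42)
fit_on_subset <- function(data) {
  # Sample 80% of the row indices within each outcome class (stratified),
  # so the 'Yes':'No' ratio stays constant across subsets.
  keep <- unlist(lapply(split(seq_len(nrow(data)), data$Kyphosis),
                        function(idx) sample(idx, round(0.8 * length(idx)))))
  rpart(Kyphosis ~ Age + Number + Start, data = data[keep, ],
        method = "class", control = rpart.control(cp = 0.013))
}

fits <- lapply(1:5, function(i) fit_on_subset(kyphosis))

# The root split may differ from subset to subset -- one sign of instability.
sapply(fits, function(f) as.character(f$frame$var[1]))
```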
Changes in model metrics (%) across data subsets, CP = 0.013
| metric | subset_1 | subset_2 | subset_3 | subset_4 | subset_5 |
|---|---|---|---|---|---|
| Accuracy | 69.32 | 70.27 | 73.30 | 69.60 | 75.08 |
| Sensitivity | 73.95 | 63.03 | 72.50 | 67.23 | 82.35 |
| Specificity | 68.18 | 72.05 | 73.50 | 70.19 | 73.29 |
| AUC | 71.07 | 67.54 | 73.00 | 68.71 | 77.82 |
Changes in variable importance (%) across data subsets, CP = 0.013
| variable | subset_1 | subset_2 | subset_3 | subset_4 | subset_5 |
|---|---|---|---|---|---|
| Ability | * | * | 9.15 | * | * |
| Accomp.f | * | * | * | * | * |
| Age | 25.18 | 40.85 | 21.86 | 29.31 | 13.59 |
| Age_stigma_avg | 20.41 | * | * | 12.02 | 6.59 |
| Concern | * | * | 4.48 | * | * |
| Edu | * | * | * | * | 4.15 |
| HA_stigma_avg | * | * | * | * | 8.22 |
| Health | * | * | * | * | 4.61 |
| Help_neighbours | * | * | * | * | * |
| Help_problems | 11.92 | 11.64 | * | 9.96 | * |
| HHIE_total | 21.50 | 26.56 | 13.63 | 22.81 | 9.63 |
| Lonely | * | * | * | * | 5.19 |
| Married.f | * | * | * | * | 8.42 |
| PTA4_better_ear | 3.48 | 20.96 | 22.59 | 25.91 | 12.87 |
| QoL | 7.94 | * | * | * | 3.73 |
| Sex.f | * | * | * | * | 4.58 |
| Soc_Discuss_HL.f | * | * | 6.12 | * | * |
| Soc_Hearing_test.f | * | * | * | * | * |
| Soc_Know_HL.f | * | * | * | * | * |
| Soc_Obtain_HA.f | * | * | * | * | * |
| Soc_Regular_use.f | * | * | * | * | * |
| Soc_Sometimes_use.f | * | * | 7.09 | * | 3.86 |
| Soc_Somewhat_negative.f | * | * | * | * | * |
| Soc_Somewhat_positive.f | * | * | * | * | * |
| Soc_Suspect_HL.f | 9.58 | * | 9.45 | * | * |
| Soc_Very_negative.f | * | * | * | * | * |
| Soc_Very_positive.f | * | * | * | * | * |
| Sub_Age_avg | * | * | 5.62 | * | 14.56 |
Age, HHIE and PTA(BE) were consistently important for classifying whether participants decided to purchase hearing aids. Even so, model performance and variable importance varied widely depending on which data were used, indicating that the models weren’t very stable. In other words, a small change in the data sometimes led to a fairly different model.
A standard method of tree pruning based only on accuracy suggested that a 3-split tree was an adequate compromise between complexity and accuracy. However, looking at sensitivity, a 13-split tree was much better at identifying cases of hearing aid purchase, without much increase in false positives.
A measure of variable importance showed that Age, HHIE and PTA(BE) were the most important variables for correctly classifying hearing aid purchase in both simpler and more complex tree models.
Checking on the stability of the tree models by dropping out different subsets of data, the effects of Age, HHIE and PTA(BE) varied quite a bit, indicating that these tree models weren’t very stable.
| | Accuracy % | Sensitivity % | Specificity % | Area Under Curve |
|---|---|---|---|---|
| Logistic x=4 | 63.47 | 59.73 | 64.4 | 62.07 |
| Class tree cp=0.013 | 67.46 | 78.52 | 64.74 | 71.63 |
| Bagging | * | * | * | * |
| Random forest | * | * | * | * |
| Boosting | * | * | * | * |