A classification tree uses predictors to sequentially sort observations into increasingly homogeneous groups, with the goal of finding which sequence of labels or cut-off values ultimately leads to an outcome of “Yes” or “No”. This differs from logistic regression, which uses a weighted combination of predictors to calculate the probability of an outcome being “Yes” or “No”.
Advantages:
A classification tree has a built-in variable selection procedure, and
it can handle different types of predictors easily (categorical and
numerical). Compared to logistic regression, the information from a
classification tree is easier to interpret. A toy example: 20 out of a
sample of 100 people bought hearing aids and 80 did not, and 19 of those
20 were correctly identified using the cut-off values of PTA > 40
followed by HHIE > 20. This information is more practically useful
than knowing, say, that every 1-point increase in PTA led to a 3%
increase in the odds of someone buying a hearing aid. Trees can also
work around missing data by substituting “surrogate variables” for
missing ones, whereas logistic regression requires all variables to be
complete.
Disadvantages:
A single classification tree is not as “robust” as some other models.
That is, a small change in the data can lead to a big change in the
model, because earlier branches in the sequence will have flow-on
effects. Simplicity can also be its downfall. First, using labels and
cut-offs to classify cases isn’t very flexible for fitting complex data
and can make trees less accurate than some other types of models.
Second, allowing fewer branches to grow makes a tree easier to
interpret, but it is likely to classify fewer cases correctly.
The ratio of “Yes” to “No” cases was 149:604. “No” cases were assigned lighter weights (149/604 = 0.2466887) than “Yes” cases (1.0), so that the model would not be biased towards “No” cases. It’s as if there were equal numbers of each class (149 each) at the top of the tree.
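A minimal sketch of this weighting scheme, using the same class counts (149 “Yes”, 604 “No”); the variable names here are illustrative, not the report's own:

```r
# Give each "No" case a weight of n_yes / n_no so that the weighted class
# totals are equal, as if there were 149 of each class.
n_yes <- 149
n_no  <- 604
y <- factor(c(rep("Yes", n_yes), rep("No", n_no)))
w <- ifelse(y == "Yes", 1, n_yes / n_no)  # "No" weight = 149/604 = 0.2466887

# Weighted class totals are now equal (149 each):
c(Yes = sum(w[y == "Yes"]), No = sum(w[y == "No"]))

# A vector like w would be passed to rpart() via its `weights` argument.
```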
Technical note: A loss matrix can be specified for the tree model, adding a penalty for either false positives or missed true cases (false negatives). In this analysis, using a loss matrix instead of, or in addition to, case weights made no difference to sensitivity or specificity; only the case weights mattered.
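For reference, a hedged sketch of how a loss matrix is specified in rpart, shown on the kyphosis dataset bundled with the package (the report's own data are not reproduced here). In rpart, `L[i, j]` is the cost of predicting class `j` when the true class is `i`, with zeros on the diagonal; the 5x penalty on missed “present” cases below is an assumed, illustrative value:

```r
library(rpart)

# Rows = true class, columns = predicted class (factor level order:
# "absent", "present"); the diagonal must be 0.
loss <- matrix(c(0, 1,    # true "absent":  predict absent / predict present
                 5, 0),   # true "present": a miss is assumed to cost 5x
               nrow = 2, byrow = TRUE)

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class", parms = list(loss = loss))
```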
A “full” classification tree is one that is allowed to grow without any restriction. In other words, the tree continues to sort observations at each node without any minimum criterion for improvement. In our case, sorting only stops when there are fewer than 20 cases in a node (minsplit = 20 is the default in rpart, but a different value can be specified). However, a full tree is hard to interpret, and likely would not generalize to other data.
Some parameters can be specified to restrict the growth of a tree: specifying a minimum number of cases in a node before further sorting is allowed, or specifying a maximum depth (e.g., no more than three splits as the longest path through the tree; note that minsplit and maxdepth can interact).
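A sketch of these growth restrictions via `rpart.control`, again on the bundled kyphosis data; the minsplit and maxdepth values are illustrative:

```r
library(rpart)

ctrl <- rpart.control(minsplit = 20,  # need >= 20 cases in a node to split it
                      maxdepth = 3,   # longest path: no more than 3 splits
                      cp = 0)         # no improvement threshold, so growth is
                                      # limited only by minsplit and maxdepth

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class", control = ctrl)

# rpart node numbers encode depth: node n sits at depth floor(log2(n)),
# so the deepest node here can be at most depth 3.
max(floor(log2(as.numeric(rownames(fit$frame)))))
```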
Alternatively, a method of “pruning” can be used to test how well the tree classifies cases at different complexities, and then settle on the simplest tree that is reasonably accurate. A low complexity parameter (CP) value allows a tree to split whenever there’s a small improvement, while a high CP value requires a large improvement before allowing a further split. So low CP values lead to more complex trees, while high CP values lead to simpler trees. The preferred tree is the one with the highest CP whose accuracy is still close to the best, i.e., simple but accurate. The plot below shows the accuracy of different trees for our data, built using a typical range of CP values. (There is an “average” and “SE” because there’s a 5-fold cross-validation procedure.)
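The mechanics of CP-based pruning can be sketched as follows (kyphosis data; the CP values are illustrative): grow a large tree, inspect the cross-validated error in its CP table, then prune at a chosen CP.

```r
library(rpart)

set.seed(1)
# Grow a deliberately large tree (cp = 0 removes the improvement threshold).
fit_full <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                  method = "class",
                  control = rpart.control(cp = 0, minsplit = 5, xval = 5))

fit_full$cptable   # columns: CP, nsplit, rel error, xerror (CV), xstd (SE)

# Prune away any split that does not improve the fit by at least cp = 0.02.
fit_pruned <- prune(fit_full, cp = 0.02)
```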
Judging by accuracy, CP = 0.02 seems to be the optimal level of complexity, leading to a relatively simple tree whose accuracy is within range of more complex trees.
However, examining other measures, using a CP value of 0.013 instead of 0.02 gives a 19-point increase in sensitivity (59% to 78%), for no change in overall accuracy (67%).
## cpvalue nsplits Accuracy Sensitivity Specificity AUC
## 14 0.100 1 0.6507 0.4698 0.6954 0.5826
## 13 0.050 3 0.6720 0.5906 0.6921 0.6413
## 12 0.040 3 0.6720 0.5906 0.6921 0.6413
## 11 0.030 3 0.6720 0.5906 0.6921 0.6413
## 10 0.025 3 0.6720 0.5906 0.6921 0.6413
## 9 0.020 3 0.6720 0.5906 0.6921 0.6413
## 8 0.015 5 0.7025 0.5906 0.7301 0.6604
## 7 0.013 13 0.6746 0.7852 0.6474 0.7163
## 6 0.012 15 0.6494 0.8591 0.5977 0.7284
## 5 0.011 22 0.7158 0.8591 0.6805 0.7698
## 4 0.010 24 0.7331 0.8591 0.7020 0.7805
## 3 0.005 34 0.7663 0.9060 0.7318 0.8189
## 2 0.001 43 0.7888 0.8993 0.7616 0.8305
## 1 0.000 43 0.7888 0.8993 0.7616 0.8305
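The choice of CP = 0.013 over 0.02 can be reproduced from a few rows of the table above: among candidate trees whose overall accuracy is essentially unchanged (a 1-point tolerance is assumed here), take the one with the highest sensitivity.

```r
# A subset of the metrics table above (proportions, not percentages).
metrics <- data.frame(
  cpvalue     = c(0.020, 0.015, 0.013, 0.012),
  Accuracy    = c(0.6720, 0.7025, 0.6746, 0.6494),
  Sensitivity = c(0.5906, 0.5906, 0.7852, 0.8591)
)

base_acc <- metrics$Accuracy[metrics$cpvalue == 0.020]

# Keep trees whose accuracy is within 1 point of the CP = 0.02 baseline,
# then pick the one with the highest sensitivity.
eligible <- metrics[metrics$Accuracy >= base_acc - 0.01, ]
chosen   <- eligible[which.max(eligible$Sensitivity), "cpvalue"]
chosen  # 0.013
```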
Max depth = 4. In the original analysis where the growth of the classification tree was restricted to a maximum depth of 4, Age Stigma formed two branches at depth 3. After dropping Q4 (the first of five items in the Age Stigma scale), Age Stigma was no longer part of the decision tree, but other branches and the overall accuracy stayed similar.
| | Accuracy % | Sensitivity % | Specificity % | AUC |
|---|---|---|---|---|
| With Q4 | 71.45 | 63.09 | 73.51 | 68.30 |
| Without Q4 | 69.99 | 63.09 | 71.69 | 67.39 |
As stated earlier, using a CP value of 0.013 instead of 0.02 gives a 19-point increase in sensitivity, for no change in overall accuracy. Both tree models and their metrics are shown below for comparison.
| | Accuracy % | Sensitivity % | Specificity % | AUC |
|---|---|---|---|---|
| CP = 0.02 | 67.20 | 59.06 | 69.21 | 64.13 |
| CP = 0.013 | 67.46 | 78.52 | 64.74 | 71.63 |
Figure: tree with CP = 0.02, with actual and weighted counts.

Figure: tree with CP = 0.013, with actual and weighted counts.
In this case, variable importance was calculated from how much each variable increased the proportion of correctly classified cases, relative to the other variables. In the plots below, variable importance was scaled so that the total summed to 100%. (As an example, of all the cases that the CP = 0.013 tree model switched from an incorrect category to the correct one, Age accounted for 27.5% of them.)
Technical note: maxcompete = 0 and maxsurrogate = 0 were set, so that only the variables involved in the primary splits are considered.
## Variable_importance
## Age 27.5
## HHIE_total 17.5
## PTA4_better_ear 10.7
## Help_problems 8.6
## Sub_Age_avg 7.1
## Health 6.5
## Ability 6.1
## QoL 5.6
## Edu 5.3
## Soc_Somewhat_negative.f 5.0
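A sketch of how such an importance table can be extracted from an rpart fit and scaled to sum to 100% (kyphosis data; the report's variables are not reproduced). With maxcompete = 0 and maxsurrogate = 0, the importance values reflect primary splits only:

```r
library(rpart)

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class",
             control = rpart.control(maxcompete = 0, maxsurrogate = 0))

# Named vector of per-variable improvement totals from the primary splits.
imp <- fit$variable.importance

# Rescale so the values sum to (approximately, after rounding) 100%.
imp_pct <- round(100 * imp / sum(imp), 1)
imp_pct
```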
To check how both tree models changed (or not) with different subsets of the data, 5 different trees were constructed using the same parameters, dropping out a different, randomly-selected 20% portion of the data each time. The proportion of ‘Yes’ and ‘No’ cases was kept constant in all subsets of the data. The same cases were dropped out for both trees.
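The resampling step above can be sketched as follows, again on the bundled kyphosis data: refit the same tree on five stratified 80% subsets, so that the class proportions are preserved in each. (In the real analysis the report's own formula, case weights, and both CP values carry over, and the same cases are dropped for both trees; this sketch shows one tree only.)

```r
library(rpart)

set.seed(42)
fit_on_subset <- function(data) {
  # Sample 80% of the row indices within each outcome class (stratified),
  # so the 'Yes':'No' ratio stays constant across subsets.
  keep <- unlist(lapply(split(seq_len(nrow(data)), data$Kyphosis),
                        function(idx) sample(idx, round(0.8 * length(idx)))))
  rpart(Kyphosis ~ Age + Number + Start, data = data[keep, ],
        method = "class", control = rpart.control(cp = 0.013))
}

fits <- lapply(1:5, function(i) fit_on_subset(kyphosis))

# The root split may differ from subset to subset -- one sign of instability.
sapply(fits, function(f) as.character(f$frame$var[1]))
```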
Changes in model metrics (%) across data subsets, CP = 0.013
| metric | subset_1 | subset_2 | subset_3 | subset_4 | subset_5 |
|---|---|---|---|---|---|
| Accuracy | 69.32 | 70.27 | 73.30 | 69.60 | 75.08 |
| Sensitivity | 73.95 | 63.03 | 72.50 | 67.23 | 82.35 |
| Specificity | 68.18 | 72.05 | 73.50 | 70.19 | 73.29 |
| AUC | 71.07 | 67.54 | 73.00 | 68.71 | 77.82 |
Changes in variable importance (%) across data subsets, CP = 0.013
| variable | subset_1 | subset_2 | subset_3 | subset_4 | subset_5 |
|---|---|---|---|---|---|
| Ability | * | * | 9.15 | * | * |
| Accomp.f | * | * | * | * | * |
| Age | 25.18 | 40.85 | 21.86 | 29.31 | 13.59 |
| Age_stigma_avg | 20.41 | * | * | 12.02 | 6.59 |
| Concern | * | * | 4.48 | * | * |
| Edu | * | * | * | * | 4.15 |
| HA_stigma_avg | * | * | * | * | 8.22 |
| Health | * | * | * | * | 4.61 |
| Help_neighbours | * | * | * | * | * |
| Help_problems | 11.92 | 11.64 | * | 9.96 | * |
| HHIE_total | 21.50 | 26.56 | 13.63 | 22.81 | 9.63 |
| Lonely | * | * | * | * | 5.19 |
| Married.f | * | * | * | * | 8.42 |
| PTA4_better_ear | 3.48 | 20.96 | 22.59 | 25.91 | 12.87 |
| QoL | 7.94 | * | * | * | 3.73 |
| Sex.f | * | * | * | * | 4.58 |
| Soc_Discuss_HL.f | * | * | 6.12 | * | * |
| Soc_Hearing_test.f | * | * | * | * | * |
| Soc_Know_HL.f | * | * | * | * | * |
| Soc_Obtain_HA.f | * | * | * | * | * |
| Soc_Regular_use.f | * | * | * | * | * |
| Soc_Sometimes_use.f | * | * | 7.09 | * | 3.86 |
| Soc_Somewhat_negative.f | * | * | * | * | * |
| Soc_Somewhat_positive.f | * | * | * | * | * |
| Soc_Suspect_HL.f | 9.58 | * | 9.45 | * | * |
| Soc_Very_negative.f | * | * | * | * | * |
| Soc_Very_positive.f | * | * | * | * | * |
| Sub_Age_avg | * | * | 5.62 | * | 14.56 |
Age, HHIE and PTA(BE) were consistently important for classifying whether participants decided to purchase hearing aids. Even so, model performance and variable importance varied widely depending on which data were used, indicating that the models weren’t very stable. In other words, a small change in the data sometimes led to a fairly different model.
A standard method of tree pruning based only on accuracy suggested that a 3-split tree was an adequate compromise between complexity and accuracy. However, looking at sensitivity, a 13-split tree was much better at identifying cases of hearing aid purchase, without much increase in false positives.
A measure of variable importance showed that Age, HHIE and PTA(BE) were the most important variables for correctly classifying hearing aid purchase in both simpler and more complex tree models.
Checking on the stability of the tree models by dropping out different subsets of data, the effects of Age, HHIE and PTA(BE) varied quite a bit, indicating that these tree models weren’t very stable.
| | Accuracy % | Sensitivity % | Specificity % | Area Under Curve |
|---|---|---|---|---|
| Logistic x=4 | 63.47 | 59.73 | 64.4 | 62.07 |
| Class tree cp=0.013 | 67.46 | 78.52 | 64.74 | 71.63 |
| Bagging | * | * | * | * |
| Random forest | * | * | * | * |
| Boosting | * | * | * | * |