A logistic regression model calculates the probability that an observation belongs to a particular category, given the set of predictors associated with that observation. The output from the model (the predicted probabilities) is constrained to values between 0 and 1 by the logistic function.
Technical notes: After some mathematical rearrangement, the model becomes a typical linear equation in which the left side is the log-odds, log(p/(1-p)), and the right side consists of an intercept plus a coefficient for each predictor. With the results of a glm(family = binomial) model, R’s predict() function returns log-odds by default; calling it with type = "response" returns probabilities that have already been back-transformed from the log-odds produced by the model.
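A minimal sketch of the two prediction scales, assuming the `pta` data frame and `Purchased_HA` outcome from the models below (the two predictors here are just an illustrative subset):

```r
# Fit a logistic regression; coefficients are on the log-odds scale
fit <- glm(Purchased_HA ~ Age + HHIE_total, family = "binomial", data = pta)

# predict() on the default link scale returns log-odds;
# type = "response" applies the logistic function to give probabilities in (0, 1)
log_odds <- predict(fit, type = "link")
probs    <- predict(fit, type = "response")

# The two scales are related by the logistic function: p = 1 / (1 + exp(-log_odds))
all.equal(probs, 1 / (1 + exp(-log_odds)))
```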
The true outcomes in this study are ‘0’ or ‘1’ (did not or did purchase), but a logistic regression model produces probabilities (via log-odds) as the predicted outcomes. A decision threshold has to be selected to classify cases as ‘0’ or ‘1’ based on their individual probabilities from the model, and this threshold affects sensitivity and specificity. As an extreme example, if all cases with probability < 0.02 are classed as ‘0’ and all cases with probability ≥ 0.02 as ‘1’, nearly every case would be classed as ‘1’, giving a sensitivity near 100% but also a false-positive rate near 100%. If the sample were split half-and-half between ‘0’s and ‘1’s, 0.5 could be a reasonable threshold, but the choice also depends on other considerations, such as how important it is not to miss a ‘1’ case, or the consequences of falsely flagging a ‘0’ case. We have no particular requirements about either missed cases or false positives. Pronk et al. (2017) suggested that, in the absence of a clinically well-accepted threshold, the prevalence in the sample could be used instead; in our sample this is 0.1978 (the actual proportion of those who purchased).
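The threshold-to-classification step can be sketched as below; `probs` and `truth` are assumed to be the model’s predicted probabilities and the observed 0/1 outcomes, and the helper function name is hypothetical:

```r
# Classify cases at a given threshold and compute sensitivity/specificity
classify_at <- function(probs, truth, threshold) {
  pred <- ifelse(probs >= threshold, 1, 0)
  c(sensitivity = sum(pred == 1 & truth == 1) / sum(truth == 1),  # true-positive rate
    specificity = sum(pred == 0 & truth == 0) / sum(truth == 0))  # true-negative rate
}

# e.g. using the sample prevalence as the threshold (Pronk et al., 2017):
# classify_at(probs, pta$Purchased_HA, threshold = 0.1978)
```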
The plot below illustrates how choosing different thresholds affects sensitivity and specificity for our sample, for a model that includes all 28 predictors.
Summary of full model with 28 predictors:
##
## Call:
## glm(formula = formula28, family = "binomial", data = pta)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4876 -0.6816 -0.5249 -0.3478 2.5764
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.021742 1.562749 -3.213 0.001312 **
## Age 0.044407 0.012704 3.495 0.000473 ***
## PTA4_better_ear 0.002953 0.013749 0.215 0.829944
## HHIE_total 0.054704 0.013229 4.135 3.55e-05 ***
## Ability -0.011181 0.072990 -0.153 0.878250
## Sex -0.280899 0.209306 -1.342 0.179579
## Edu 0.076460 0.094624 0.808 0.419068
## Married 0.132926 0.255678 0.520 0.603137
## Health 0.331364 0.154736 2.141 0.032235 *
## QoL -0.300496 0.170110 -1.766 0.077315 .
## Help_neighbours -0.128858 0.118300 -1.089 0.276046
## Help_problems 0.004244 0.134815 0.031 0.974888
## Concern -0.118125 0.082335 -1.435 0.151376
## Lonely -0.221659 0.206423 -1.074 0.282910
## Sub_Age_avg -0.020162 0.119229 -0.169 0.865715
## Age_stigma_avg 0.095348 0.100539 0.948 0.342943
## HA_stigma_avg -0.164076 0.092675 -1.770 0.076655 .
## Accomp 0.138590 0.270560 0.512 0.608488
## Soc_Suspect_HL 0.918827 0.370866 2.478 0.013230 *
## Soc_Know_HL -0.400545 0.320576 -1.249 0.211500
## Soc_Discuss_HL -0.335275 0.217980 -1.538 0.124025
## Soc_Hearing_test 0.102782 0.245653 0.418 0.675651
## Soc_Obtain_HA 0.024849 0.362753 0.069 0.945387
## Soc_Sometimes_use -0.173981 0.241780 -0.720 0.471781
## Soc_Regular_use -0.117689 0.361091 -0.326 0.744481
## Soc_Very_positive 0.450030 0.311597 1.444 0.148663
## Soc_Somewhat_positive -0.012779 0.254318 -0.050 0.959926
## Soc_Somewhat_negative -0.098050 0.258940 -0.379 0.704941
## Soc_Very_negative 0.291415 0.295772 0.985 0.324491
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 749.15 on 752 degrees of freedom
## Residual deviance: 685.86 on 724 degrees of freedom
## AIC: 743.86
##
## Number of Fisher Scoring iterations: 5
Akaike Information Criterion (AIC): AIC is a measure of how much information is “lost” by the model relative to the “true” state of things, since a model with a limited set of predictors cannot capture all the variance. A model with a lower AIC is usually considered better than one with a higher AIC. Because adding predictors to a model will always explain more variance, even when the added predictors are not meaningful, AIC also includes a penalty for simply increasing the number of predictors.
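The penalty can be checked by hand: for a binomial glm with 0/1 outcomes, AIC works out to the residual deviance plus 2 times the number of estimated parameters (coefficients plus intercept). For the full model in the output above:

```r
# AIC = residual deviance + 2 * (number of estimated parameters)
# Full model: 28 predictor coefficients + 1 intercept = 29 parameters
685.86 + 2 * 29   # = 743.86, matching the AIC in the glm output
```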
Backwards procedure: The current model with 28 predictors has an AIC of 743.86 (output above). To start the iterative process, fit 28 new models, each dropping one of the predictors, and compare each new model’s AIC with the current one. Often, dropping a predictor yields an AIC that is lower, i.e. better, than the current model’s; the predictor whose removal produces the largest AIC improvement is removed. The process is then repeated with 27 new models from the remaining predictors, dropping one each time, until dropping any remaining predictor would make the AIC worse than keeping it. This process leaves 6 predictors, with a better AIC of 713.84 (output below).
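This iteration does not need to be done by hand; R’s `step()` performs AIC-based backwards elimination. A sketch, assuming the `formula28` object from the full-model call above:

```r
# Backwards elimination by AIC, starting from the full 28-predictor model
full_model    <- glm(formula28, family = "binomial", data = pta)
reduced_model <- step(full_model, direction = "backward", trace = 0)
summary(reduced_model)  # the surviving predictors and their coefficients
```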
Summary of model with 6 predictors:
##
## Call:
## glm(formula = Purchased_HA ~ Age + HHIE_total + Health + HA_stigma_avg +
## Soc_Suspect_HL + Soc_Discuss_HL, family = "binomial", data = pta)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6376 -0.6704 -0.5562 -0.3917 2.3116
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.21123 1.07378 -5.784 7.28e-09 ***
## Age 0.04500 0.01107 4.066 4.78e-05 ***
## HHIE_total 0.05210 0.01040 5.010 5.44e-07 ***
## Health 0.18003 0.10578 1.702 0.0888 .
## HA_stigma_avg -0.17037 0.08940 -1.906 0.0567 .
## Soc_Suspect_HL 0.82504 0.34304 2.405 0.0162 *
## Soc_Discuss_HL -0.32214 0.19761 -1.630 0.1031
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 749.15 on 752 degrees of freedom
## Residual deviance: 699.84 on 746 degrees of freedom
## AIC: 713.84
##
## Number of Fisher Scoring iterations: 4
Odds ratios: The coefficients for Age and HHIE have relatively narrow confidence intervals, indicating robust findings (each additional year of Age and each additional point on the HHIE are associated with 4.6% and 5.3% greater odds of purchasing hearing aids, respectively). In contrast, the other four variables have wide confidence intervals. For some variables, the lower limit is below 1.00 and the upper limit above 1.00, meaning the interval includes no effect and the true effect could be in either direction.
## OR 2.5 % 97.5 %
## (Intercept) 0.002 0.000 0.016
## Age 1.046 1.024 1.069
## HHIE_total 1.053 1.032 1.075
## Health 1.197 0.976 1.478
## HA_stigma_avg 0.843 0.707 1.004
## Soc_Suspect_HL 2.282 1.210 4.702
## Soc_Discuss_HL 0.725 0.491 1.066
In terms of sensitivity and specificity, the simpler model with 6 predictors from a backwards step procedure isn’t much worse than the model with all 28 predictors.
Why stepwise variable selection is bad: With procedures such as backwards elimination, the F-statistic no longer follows an F distribution, p-values no longer represent the probability of observing a test statistic at least this extreme when the null hypothesis is true, and it is difficult to correct for multiple comparisons. Model coefficients tend to be biased away from 0 and confidence intervals tend to be narrower than they should be, so findings are “inflated”.
Penalized regression: In ordinary least squares regression, the algorithm finds coefficients that give the smallest “distances” between the datapoints and the regression line, minimizing the sum of squared residuals (logistic models like those above are instead fit by maximizing the likelihood, but the same idea applies). Two variations are ridge regression and lasso regression: each adds a penalty term so that smaller coefficients are favoured over larger ones during the minimization. Smaller coefficients make the model more “stable”, i.e. less variable across different datasets.
Ridge regression works better for highly correlated variables and when the number of observations is smaller than the number of predictors; neither applies to our data. Lasso regression shrinks some coefficients to 0, basically acting as a form of variable selection, but ridge regression does not do this. There is a tuning parameter attached to the penalty, lambda, which controls how heavy the penalty on large coefficients should be. If lambda is 0, there is no penalty and it’s just an ordinary least squares regression. When lambda is dialed up in lasso regression, more coefficients will shrink to nothing and be dropped from the model.
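A sketch of the lasso fit with cross-validated lambda selection, assuming the `glmnet` package and the `formula28` object from the full model (the object names `x`, `y`, and `cv_fit` are illustrative):

```r
library(glmnet)

# Predictor matrix (drop the intercept column) and 0/1 outcome
x <- model.matrix(formula28, data = pta)[, -1]
y <- pta$Purchased_HA

# alpha = 1 selects the lasso penalty; alpha = 0 would be ridge.
# cv.glmnet tests a range of lambda values with 5-fold CV, scored by AUC.
cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 1,
                    type.measure = "auc", nfolds = 5)

cv_fit$lambda.min               # lambda with the best cross-validated AUC
coef(cv_fit, s = "lambda.min")  # surviving (non-zero) coefficients
```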
To figure out which lambda value to use for the coefficient penalty, a standard method is to test a range of values and measure model performance at each one. The figure below shows that the area under the ROC curve (AUC) is highest at lambda = 0.02750. (There is a mean and SD for AUC because of a 5-fold cross-validation procedure.)
At lambda = 0.02750, where AUC is highest, 4 coefficients are left: Age, HHIE, HA Stigma, and Soc_Suspect_HL. Together they explain 3.71% of the variance.
Summary of model with 4 predictors:
##
## Call:
## glm(formula = Purchased_HA ~ Age + HHIE_total + HA_stigma_avg +
## Soc_Suspect_HL, family = "binomial", data = pta)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4831 -0.6814 -0.5593 -0.4042 2.3437
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.511323 0.937553 -5.878 4.14e-09 ***
## Age 0.044948 0.010868 4.136 3.53e-05 ***
## HHIE_total 0.045900 0.009955 4.611 4.02e-06 ***
## HA_stigma_avg -0.163669 0.088638 -1.846 0.0648 .
## Soc_Suspect_HL 0.742241 0.339078 2.189 0.0286 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 749.15 on 752 degrees of freedom
## Residual deviance: 705.28 on 748 degrees of freedom
## AIC: 715.28
##
## Number of Fisher Scoring iterations: 4
## OR 2.5 % 97.5 %
## (Intercept) 0.004 0.001 0.025
## Age 1.046 1.024 1.069
## HHIE_total 1.047 1.027 1.068
## HA_stigma_avg 0.849 0.713 1.010
## Soc_Suspect_HL 2.101 1.124 4.300
Comparing models with 28, 6, and 4 predictors, overall accuracy is similar, with sensitivity changing the most as fewer predictors are used:
| Model | Predictors | Accuracy % | Sensitivity % | Specificity % | AUC % |
|---|---|---|---|---|---|
| Full model | 28 | 65.47 | 65.10 | 65.56 | 65.33 |
| Backwards step | 6 | 65.86 | 61.07 | 67.05 | 64.06 |
| Lasso regression | 4 | 63.47 | 59.73 | 64.40 | 62.07 |
(In response to comments)
The following sample sizes were needed to detect the following effects with a two-tailed alpha of 0.05 and power = 0.80:
| Predictor | Odds ratio | Required sample size |
|---|---|---|
| Age | 1.046 | 413 |
| HHIE | 1.047 | 310 |
| HA_Stigma | 0.849 | 1010 |
| Soc_Suspect_HL | 2.101 | 769 |
Given that our sample size was n = 753, our study had an adequate sample size for Age, HHIE and Soc_Suspect_HL, but not for HA Stigma. (Note: This sort of post-hoc power analysis feels a bit circular; I’m not sure it adds any useful information.)
Predictors were not too highly correlated and there were no high-influence datapoints, which helped increase confidence in the model coefficients (odds ratios).
Dropping most of the 28 predictors did not lead to a worse model; having 4 variables works about as well as having 6 or 28. Even so, the few variables that do explain the outcome don’t produce a very good model, with metrics in the low 60s.
Using lasso regression, a less controversial method than stepwise variable selection: Age, HHIE, and “knowing at least 1 person with suspected HL” affected the decision to purchase hearing aids. HA Stigma also survived the variable selection procedure, but its confidence interval showed an uncertain effect.