Checking for multicollinearity

When predictors are highly correlated with each other (collinear), it’s not clear which predictor actually affects the outcome. A model with collinear predictors can also be unstable, with wide standard errors for its estimates. Pairs of collinear predictors can be flagged by large coefficients in a correlation matrix. However, three or more predictors can also be jointly multicollinear without any pair showing a large correlation. To check for this, the Variance Inflation Factor (VIF) can be calculated for each predictor. For each predictor (say, x1 out of x1 to x5), x1 is regressed on x2 to x5. x1’s VIF is then 1/(1 - R-squared), where R-squared is the proportion of x1’s variance explained by x2 to x5. So if x2 to x5 together explained 80% of the variance of x1, then x1’s VIF = 1/(1 - 0.8) = 5. Any predictor in the list with VIF ≥ 5 is multicollinear with the other predictors and should be dropped.

##                       Variance_inflation_factor
## Age                                    1.391028
## PTA4_better_ear                        1.303042
## HHIE_total                             1.805321
## Ability                                1.608187
## Sex                                    1.183470
## Edu                                    1.240213
## Married                                1.268702
## Health                                 2.265314
## QoL                                    2.352055
## Help_neighbours                        1.223540
## Help_problems                          1.253997
## Concern                                1.161110
## Lonely                                 1.272948
## Sub_Age_avg                            1.418641
## Age_stigma_avg                         1.080622
## HA_stigma_avg                          1.069033
## Accomp                                 1.170265
## Soc_Suspect_HL                         1.187087
## Soc_Know_HL                            1.422682
## Soc_Discuss_HL                         1.289768
## Soc_Hearing_test                       1.493677
## Soc_Obtain_HA                          2.868023
## Soc_Sometimes_use                      1.587398
## Soc_Regular_use                        3.112293
## Soc_Very_positive                      2.458085
## Soc_Somewhat_positive                  1.757352
## Soc_Somewhat_negative                  1.677875
## Soc_Very_negative                      1.460546
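The regress-each-predictor-on-the-others calculation can be sketched in a few lines of NumPy (the `vif` function and toy data below are illustrative, not the code behind the table above, which was presumably produced in R):

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor for each column of predictor matrix X
    (rows = observations, columns = predictors)."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        # Regress predictor j on all the other predictors (with an intercept)
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1 - resid.var() / y.var()   # variance of x_j explained by the rest
        out[j] = 1 / (1 - r2)
    return out

# Toy data: x2 is nearly a copy of x0, so both should show a large VIF,
# while the independent x1 stays near 1
rng = np.random.default_rng(0)
x0 = rng.normal(size=200)
x1 = rng.normal(size=200)
x2 = x0 + 0.1 * rng.normal(size=200)
X = np.column_stack([x0, x1, x2])
print(vif(X))
```

Note that the pairwise correlation matrix would also catch this toy case; the advantage of VIF is that it flags a predictor that is well explained by a *combination* of the others even when no single pairwise correlation is large.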

Checking for high-influence datapoints

Leverage: How “far away” each datapoint is from the others, considering all predictors at once (like spotting an outlying point in a scatterplot, but in more than two dimensions). A high-leverage datapoint has more potential to change the regression, but doesn’t necessarily do so.

Influence: How much an observation changes the regression when it’s dropped versus when it’s included. A high-influence datapoint definitely changes the regression, and the model should be refitted with and without it to see what happens. None of the values below are larger than 0.5 (one common criterion for “large”).
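Both diagnostics can be sketched for an ordinary least-squares fit; the logistic model in this analysis would use the GLM analogues (e.g. R’s `influence.measures()`). The function and toy data here are illustrative assumptions, with leverage taken from the hat-matrix diagonal and influence measured by Cook’s distance, one statistic that uses the 0.5 cutoff mentioned above:

```python
import numpy as np

def leverage_and_cooks(X, y):
    """Leverage (hat-matrix diagonal) and Cook's distance for an OLS fit."""
    n = X.shape[0]
    Xd = np.column_stack([np.ones(n), X])        # add an intercept column
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    # Leverage: diagonal of the hat matrix H = X (X'X)^-1 X'
    h = np.einsum("ij,ji->i", Xd, np.linalg.pinv(Xd))
    k = Xd.shape[1]                              # number of coefficients
    mse = resid @ resid / (n - k)
    # Cook's distance: how much the fitted values shift when observation i
    # is deleted, combining its residual and its leverage
    cooks = resid**2 / (k * mse) * h / (1 - h) ** 2
    return h, cooks

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
X[0] = 8.0                # one point far from the rest: high leverage
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100)
y[0] += 15                # ...that also pulls the fit: high influence
h, cooks = leverage_and_cooks(X, y)
print(h[0], cooks[0])
```

The toy point illustrates the distinction in the definitions above: moving `X[0]` far from the cloud raises its leverage regardless of `y`, but it only becomes influential once its response also deviates from the fitted line.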

Dropping the six observations highlighted in red did not change the coefficients of the full logistic regression model, so no observations needed to be excluded.

Takeaways

Predictors were not multicollinear and there were no high-influence datapoints, which increases confidence in the coefficients (and therefore the odds ratios) of the model.