Is multicollinearity a problem with logistic regression?


What is multicollinearity, how do you recognize it, and why is it a problem? You can find the answers to exactly these questions in this post!

Multicollinearity explained in simple terms

Multicollinearity is present when several predictors in a regression analysis correlate strongly with each other. For multicollinearity, you therefore do not look at the correlation of the predictors with the criterion, but at the correlations among the predictors themselves. If these correlations are high, there is multicollinearity.

Multicollinearity and Regression

Multicollinearity is a problem because strong intercorrelations among the predictors make the estimates of the regression coefficients uncertain. This reduces the meaningfulness of your results. As a result, you must always check for multicollinearity when you run a regression analysis. To do that, you look at a predictor's tolerance value or its VIF statistic (short for "variance inflation factor"). Tolerance values should be as large as possible and VIF values as small as possible. However, the two statistics say the same thing, so it is enough to interpret one of the two values.

It is important to know that the absence of multicollinearity is a requirement for regression analysis that you always have to check. There are also other assumptions that must be fulfilled so that you can interpret the regression analysis meaningfully. These include a linear relationship between the variables, independence of the residuals, and homoscedasticity.

What is multicollinearity?

Multicollinearity means that a predictor correlates strongly with the other predictors in a regression analysis. In other words, this predictor brings little new information into the regression. Instead, a large part of the predictor's information is already contained in the other predictors. To assess multicollinearity, you therefore consider how well a predictor can be predicted from the other predictors in the regression.

Imagine, for example, that you want to predict the criterion "life satisfaction of a person". To do so, you look at the predictors "income", "number of friends", and "parties attended per month". Now you want to examine the multicollinearity. To do this, you leave out the criterion “life satisfaction” for the time being and only consider the three predictors.

Then you examine how well each of the predictors can be predicted with the help of the other predictors. That means that, for a moment, one of the predictors itself becomes the criterion. In our example, you could calculate how well the number of parties attended can be predicted from income and the number of friends. The better the predictor under consideration can be predicted from the remaining predictors, the more similar the predictors are to each other.

In our example, it could well be that the "number of parties attended" correlates strongly with the "number of friends". After all, a person with more friends tends to have more opportunities to be invited to parties. If you find a high correlation between these predictors, this is called multicollinearity.
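The procedure from the example can be sketched in a few lines of Python. This is only an illustration: the numbers for income, friends, and parties are invented, and the helper `r_squared_from_two` is a hypothetical name for a small least-squares routine that handles exactly two predicting variables. It temporarily treats "parties attended" as the criterion and reports how much of its variance the other two predictors explain.

```python
from statistics import fmean

def r_squared_from_two(x1, x2, target):
    """R^2 of the least-squares prediction of `target` from x1 and x2 (with intercept)."""
    m1, m2, mt = fmean(x1), fmean(x2), fmean(target)
    d1 = [a - m1 for a in x1]
    d2 = [b - m2 for b in x2]
    dt = [t - mt for t in target]
    # Sums of squares and cross-products of the centered variables
    s11 = sum(a * a for a in d1)
    s22 = sum(b * b for b in d2)
    s12 = sum(a * b for a, b in zip(d1, d2))
    s1t = sum(a * t for a, t in zip(d1, dt))
    s2t = sum(b * t for b, t in zip(d2, dt))
    stt = sum(t * t for t in dt)
    # Solve the 2x2 normal equations for the two slopes
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1t - s12 * s2t) / det
    b2 = (s11 * s2t - s12 * s1t) / det
    # Explained variation divided by total variation
    return (b1 * s1t + b2 * s2t) / stt

# Invented illustration data for eight people (not from the post)
income = [2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5]   # in thousands per month
friends = [3, 8, 4, 10, 5, 12, 6, 9]                # number of friends
parties = [1, 4, 2, 5, 2, 6, 3, 4]                  # parties attended per month

r2_parties = r_squared_from_two(income, friends, parties)
print(f"R^2 for parties predicted from income and friends: {r2_parties:.2f}")
```

In this made-up data set, parties attended follow the number of friends closely, so the resulting R² is high, which is exactly the situation the example describes.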

Why is multicollinearity a problem?

High multicollinearity is a problem for your regression analysis. This is because predictors that are strongly correlated with each other partly explain the same portions of the criterion's variance. So that these explained portions are not counted twice, the shared explained variance is divided among the predictors involved when the regression coefficients are calculated. However, it is not clear which predictor should receive how much weight. As a result, the estimates of the regression coefficients become increasingly uncertain the more of the criterion's variance the predictors “share”. This makes your prediction of the criterion values less reliable when multicollinearity is present.
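A small simulation can make this uncertainty visible. The following is a minimal sketch under stated assumptions: it uses ordinary linear regression with two predictors, simulated normally distributed data, and the hypothetical helper names `fit_two_predictors` and `slope_spread`. It repeatedly draws new samples and records how much the estimated coefficient of the first predictor varies from sample to sample, once with uncorrelated predictors and once with strongly correlated ones.

```python
import random
import statistics

def fit_two_predictors(x1, x2, y):
    """Return the least-squares slopes (b1, b2) for y ~ x1 + x2 with an intercept."""
    m1, m2, my = statistics.fmean(x1), statistics.fmean(x2), statistics.fmean(y)
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2

def slope_spread(rho, n=50, reps=300, seed=0):
    """Standard deviation of the estimated b1 across `reps` simulated samples."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(reps):
        x1 = [rng.gauss(0, 1) for _ in range(n)]
        # x2 is built so that its population correlation with x1 is rho
        x2 = [rho * a + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1) for a in x1]
        # True model (an assumption of this sketch): both slopes equal 1
        y = [a + b + rng.gauss(0, 1) for a, b in zip(x1, x2)]
        slopes.append(fit_two_predictors(x1, x2, y)[0])
    return statistics.stdev(slopes)

spread_uncorrelated = slope_spread(rho=0.0)
spread_collinear = slope_spread(rho=0.95)
print(f"spread of b1, uncorrelated predictors: {spread_uncorrelated:.3f}")
print(f"spread of b1, correlated predictors:   {spread_collinear:.3f}")
```

The true coefficients are identical in both settings; only the correlation between the predictors changes. The estimates scatter noticeably more when the predictors are strongly correlated.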

If a predictor can be predicted perfectly from the other predictors, this is called "perfect multicollinearity". In this case, the predictor does not introduce any new information that is not already available from another predictor. As a result, with perfect multicollinearity the regression weights can no longer be estimated at all, and you can no longer carry out the regression analysis.
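Why the estimation breaks down can be shown with a tiny example. The sketch below assumes a linear regression with two predictors, where the second predictor is an exact linear function of the first; the data and the helper name `centered_cross_products` are made up for illustration. The determinant that the coefficient formulas divide by becomes exactly zero, so there is no unique solution.

```python
def centered_cross_products(x1, x2):
    """Return sums of squares and cross-products, scaled by n so integers stay exact."""
    n = len(x1)
    s11 = n * sum(a * a for a in x1) - sum(x1) ** 2
    s22 = n * sum(b * b for b in x2) - sum(x2) ** 2
    s12 = n * sum(a * b for a, b in zip(x1, x2)) - sum(x1) * sum(x2)
    return s11, s22, s12

x1 = [1, 2, 3, 4, 5]
x2 = [2 * a + 1 for a in x1]   # x2 is perfectly predictable from x1

s11, s22, s12 = centered_cross_products(x1, x2)
determinant = s11 * s22 - s12 ** 2

# The slope formulas divide by this determinant, so a value of 0
# means the regression weights cannot be estimated.
print(determinant)  # 0
```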

As you can see, multicollinearity is a problem for regression analysis. Therefore, before you run a regression, you always have to check whether there is multicollinearity among your variables.

How do I recognize multicollinearity?

There are different statistics from which you can read off whether multicollinearity is present for a predictor. On the one hand, there are the so-called tolerance values. They indicate what proportion of a predictor's variance is not explained when you predict this predictor from all other predictors: tolerance = 1 - R^2. Here, R^2 is the coefficient of determination of the prediction of the predictor from all other predictors. If you subtract the coefficient of determination from 1, you get the portion of the variance that could not be explained by the other predictors. The closer the tolerance value is to 1, the more independent the predictor is of the other predictors.

On the other hand, there is the VIF value ("variance inflation factor"). You calculate it by taking the reciprocal of the tolerance value: VIF = 1 / tolerance.
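The relationship between the two statistics can be written down directly. The following sketch assumes you already have the coefficient of determination from predicting one predictor with all the others; the function names `tolerance` and `vif` are chosen here for illustration and do not refer to a particular statistics package.

```python
def tolerance(r_squared):
    """Share of a predictor's variance NOT explained by the other predictors."""
    return 1.0 - r_squared

def vif(r_squared):
    """Variance inflation factor: the reciprocal of the tolerance value."""
    tol = tolerance(r_squared)
    if tol <= 0:
        # Perfect multicollinearity: the VIF is not defined
        raise ValueError("tolerance is 0, so the VIF is undefined")
    return 1.0 / tol

for r2 in (0.0, 0.5, 0.9):
    print(f"R^2 = {r2:.1f}  tolerance = {tolerance(r2):.2f}  VIF = {vif(r2):.2f}")
```

A predictor that is completely independent of the others has a tolerance of 1 and a VIF of 1. The more of its variance the other predictors explain, the smaller the tolerance and the larger the VIF.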

The tolerance value and the VIF value differ in how they are presented, but they say the same thing. Hence, it is sufficient to interpret either the tolerance value or the VIF value. The easiest way to get both the tolerance value and the VIF statistic is from a statistics program.

There is no blanket answer to how large or small the tolerance value and the VIF statistic should be. This always depends on how many predictors you are looking at in your regression analysis. As a general rule, tolerance values should be as close to 1 as possible and VIF values as small as possible.
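As a rule of thumb, you can remember that tolerance values should not fall below a lower cutoff and VIF values should not exceed an upper cutoff. Commonly cited cutoffs are 0.1 for the tolerance value and 10 for the VIF value.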
