Logistic Regression Analysis and Interpretation
Logistic regression analysis and interpretation is a complex task that involves a variety of methods and approaches. This document describes some of the common methods for binomial logistic regression analysis and is intended to simplify the analysis process.
Note – This document does not cover all aspects of logistic regression analysis and does not guarantee a complete and sound logistic regression model. Models should be validated with statisticians or data mining professionals.
Glossary and Terms
- Explained variable (dependent variable) – The variable in question, i.e. the variable being explored by the model. This variable is binary and can take only two values, 0 or 1.
- Explanatory variables (independent variables) – The variables used to model the explained variable (dependent variable).
- Null model – The "empty" model: a model that contains only the explained variable and an intercept, with no explanatory variables.
- Fitted model – The actual model, which contains the explained variable and the explanatory variables.
Logistic Regression – General Modeling Guidelines
- The data set should contain at least 30 rows of data.
- The logistic regression model should comprise no more than 1 variable per 30-50 data rows. For example, a logistic regression model based on a data set containing 300 data rows should have no more than 6-10 variables (300/50 – 300/30).
- A logistic regression model should have preselected variables that form the model core, defined by a professional in the field of application being modeled. The preselected variables are ones considered, prior to the modeling process, as affecting the decision being modeled. The general approach is that a logistic regression model has to be based on the field of application and cannot be defined solely by statistical tests.
- Logistic regression models should have a minimal set of variables. This rule cannot be quantified, yet variables that add little to model performance should not be included.
- The desired parameter values in the process of analysis are not absolute and depend on the field being modeled (for example, models involving human behavior might have larger p-values and lower accuracy compared to models involving physical phenomena).
- Classification tables (tables showing the rate of model classification hits/misses) should be assessed with caution. To achieve good, long-lasting results, statistical testing should be the main tool of analysis, and classification tables should be treated as an independent test conducted after model quality assessments are completed. Classification tables play an important role once a model is deployed and in use, since only reality shows the true quality of the model (after deployment). At that stage, classification tables computed over the model results are the main tool for logistic regression model performance analysis.
- Biased data – Many data sets include variables that imply future results; such variables bias the model and should be avoided. Notable cases of data corruption and biasing:
◦ Variables from the future – Variables that include future data or that were used to derive the target variable.
◦ Null values – Most applications for logistic regression modeling ignore null data and exclude it from the model. Null values can enter a model implicitly through calculated variables, especially when using dummy variables. In many cases null values imply that the target variable is 0; this may happen in commercial data sets when a field such as gender or address was filled in only after a customer joined the company.
◦ Large coefficient values – In many cases, a variable with a large coefficient value in a logistic regression model implies either that the variable is extremely effective (and should probably be removed from the model and used as an external rule or filter) or that the variable contains biased data and is highly correlated with the explained variable. In either case, high-coefficient variables in a logistic regression model should be treated carefully.
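One quick screen for this kind of bias is to check how strongly each explanatory variable correlates with the explained variable before modeling. Below is a minimal plain-Python sketch; the data values and the 0.9 threshold are illustrative assumptions, not values from this document:

```python
import math

def correlation(x, y):
    # Pearson correlation between an explanatory variable and the
    # 0/1 explained variable; values near +/-1 hint at biased or
    # "from the future" data that leaks the outcome
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    var_x = sum((xi - mx) ** 2 for xi in x)
    var_y = sum((yi - my) ** 2 for yi in y)
    return cov / math.sqrt(var_x * var_y)

y = [0, 0, 1, 1, 1, 0]                    # explained variable
leaky = [0.0, 0.1, 1.0, 0.9, 1.0, 0.0]    # suspiciously close to y
r = correlation(leaky, y)
if abs(r) > 0.9:                          # illustrative threshold
    print("possible bias: correlation =", round(r, 3))
```

A variable flagged this way is not automatically bad, but it deserves the careful treatment described above before it enters the model.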
Logistic Regression - Model Parameter Analysis
Different statistical packages and applications produce different outputs for logistic regression models. Many of the logistic regression model parameters are relative parameters that change dramatically between models. Comparing test parameter values between models should therefore be done only for models built on the same data set for the same explained (dependent) variable. The following suggestions for logistic regression model parameter analysis cover some of the common tests.
- Null Deviance – Without diving into the way this parameter is calculated, the null deviance is the performance of the "empty" (null) model.
- Model Deviance – Describes the performance of the current model. The lower the model deviance compared to the null deviance, the better the model. You can use the improvement parameter (below) instead of this value to compare model performance.
- Improvement – Determines how much the model improves the classification of the variable in question (the explained or dependent variable). Improvement is the difference between the null deviance and the model deviance; it should be as high as possible and can only be compared between logistic regression models for the same explained variable.
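The relationship between these three quantities can be sketched in plain Python. This is an illustrative computation over made-up observed values and fitted probabilities, not output from any particular statistical package; the deviance formula (-2 times the log-likelihood) is the standard one:

```python
import math

def deviance(y, p):
    # Deviance = -2 * log-likelihood of the Bernoulli model
    return -2 * sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
                    for yi, pi in zip(y, p))

y = [0, 0, 0, 1, 1, 1, 1, 0]                          # observed explained variable
p_fitted = [0.1, 0.2, 0.3, 0.8, 0.7, 0.9, 0.6, 0.4]   # made-up fitted probabilities

p_null = sum(y) / len(y)           # null model: one constant probability for all rows
null_dev = deviance(y, [p_null] * len(y))
model_dev = deviance(y, p_fitted)
improvement = null_dev - model_dev  # higher improvement = better model
```

Note that the fitted model's deviance comes out lower than the null deviance, which is exactly what a positive improvement value expresses.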
- p-value – A parameter of extreme importance, as it gives a good indication of the significance of the model. When a model has a high p-value (p-value > 0.2) there is a very good chance that the model is not significant and should not be used.
- Cox and Snell R Square – This R^2 test, like other logistic regression R^2 tests, tries to measure the strength of association of the model. Its values lie between 0 and 1. Today the Nagelkerke R^2 is more common and is considered a better indication of strength of association.
- Nagelkerke R Square – A modification of the Cox and Snell R^2.
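Both pseudo-R^2 values can be derived directly from the null and model deviances. The sketch below assumes the standard formulas (Cox and Snell from the likelihood ratio, Nagelkerke as the same value rescaled so its maximum is 1); the deviance values and sample size are made up for illustration:

```python
import math

def pseudo_r2(null_deviance, model_deviance, n):
    # Deviance = -2 * log-likelihood, so LL = -deviance / 2
    ll_null, ll_model = -null_deviance / 2, -model_deviance / 2
    cox_snell = 1 - math.exp(2 * (ll_null - ll_model) / n)
    # Nagelkerke divides by the maximum attainable Cox & Snell value
    max_cs = 1 - math.exp(2 * ll_null / n)
    return cox_snell, cox_snell / max_cs

# illustrative deviances for a 20-row data set
cs, nk = pseudo_r2(null_deviance=27.7, model_deviance=15.4, n=20)
```

Because the Cox and Snell value can never reach 1, the Nagelkerke value always comes out at least as large, which is part of why it is the preferred report.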
- Area Under the ROC Curve (AUC) – Gives a good indication of model performance (values range from 0.5 to 1). This value should be as high as possible, with some restrictions. Typical values indicate the following:
◦ 0.5 – No discrimination ability (the model has no meaning).
◦ 0.51 – 0.7 – Low discrimination ability (not a very good model, yet the model can be used).
◦ 0.71 – 0.9 – Very good discrimination ability.
◦ 0.91 – 1 – Excellent discrimination ability. In some fields logistic regression models can genuinely reach this range, however it might also indicate that the model is "too good to be true". Double and triple check such a model, making sure that no variables from the future are present and that the model has no other odd parameter values.
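The AUC has a simple interpretation: it is the probability that a randomly chosen 1-case receives a higher predicted probability than a randomly chosen 0-case. The brute-force pair-counting sketch below (with made-up data) relies on that interpretation; real packages compute the same value from the ranked ROC curve:

```python
def auc(y, p):
    # Fraction of (positive, negative) pairs ranked correctly,
    # counting ties as half a win
    pos = [pi for yi, pi in zip(y, p) if yi == 1]
    neg = [pi for yi, pi in zip(y, p) if yi == 0]
    wins = sum((pp > pn) + 0.5 * (pp == pn) for pp in pos for pn in neg)
    return wins / (len(pos) * len(neg))

y = [0, 0, 1, 1, 0, 1]
p = [0.2, 0.65, 0.7, 0.9, 0.3, 0.6]
print(auc(y, p))  # one misranked pair out of nine
```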
- Hosmer-Lemeshow Probability – The Hosmer-Lemeshow probability test is based on a chi-square test performed over the Hosmer-Lemeshow table (below). This important parameter tests how well the model fits the observed data; the null hypothesis is that the model fits well, and a low p-value rejects this hypothesis, indicating a poor fit. Values for this test should be higher than 0.5 – 0.6.
- Hosmer-Lemeshow table – A model classification table which describes both expected and observed model classifications. The Hosmer-Lemeshow table divides the data into 10 groups (deciles, one per row), each row representing the expected and observed frequency of both 1 and 0 values. The expected frequency assigned to each decile should match the observed frequency, and each decile should contain data.
Hosmer-Lemeshow table output (figure)
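The construction of the decile table itself is straightforward to sketch: sort the records by predicted probability, split them into ten groups, and compare observed against expected counts of 1s per group. The plain-Python version below uses randomly generated illustrative data and covers only the table, not the chi-square test on top of it:

```python
import random

def hosmer_lemeshow_table(y, p, groups=10):
    # Each row: (cases in decile, observed 1s, expected 1s)
    pairs = sorted(zip(p, y))                 # sort records by predicted probability
    size = len(pairs) // groups
    table = []
    for g in range(groups):
        # last group absorbs any remainder rows
        chunk = pairs[g * size:] if g == groups - 1 else pairs[g * size:(g + 1) * size]
        observed = sum(yi for _, yi in chunk)
        expected = sum(pi for pi, _ in chunk)
        table.append((len(chunk), observed, round(expected, 2)))
    return table

random.seed(0)
p = [random.random() for _ in range(200)]           # illustrative predicted probabilities
y = [1 if random.random() < pi else 0 for pi in p]  # outcomes consistent with p
for row in hosmer_lemeshow_table(y, p):
    print(row)  # observed and expected counts should be close in each decile
```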
- Classification tables – In binomial logistic regression, the classification table is a 2 x 2 table that contains the observed and predicted model results (shown in the figure below). The classification table is computed over a data set, usually either training data (the data the model was built on) or test data (a data set that was not used to compute the model coefficients and is used for model quality evaluation). The model is then used to classify each data record using the computed probability given by the model (a value between 0 and 1) and the cut value, the minimal probability at which a record is classified as 1. The default cut value of 0.5 determines that a data record with a probability larger than 0.5 is classified as 1.
The classification table has 4 data cells:
1. Observed 0 Predicted 0 – The number of cases that were both predicted and observed as 0. The model classification was correct for these records.
2. Observed 0 Predicted 1 – The number of cases that were predicted as 1 yet observed as 0. The records in this cell are referred to as false positives. The model classification was incorrect for these records.
3. Observed 1 Predicted 1 – The number of cases that were both predicted and observed as 1. The model classification was correct for these records.
4. Observed 1 Predicted 0 – The number of cases that were predicted as 0 yet observed as 1. The records in this cell are referred to as false negatives. The model classification was incorrect for these records.
Different fields of application require different rates of false positives and false negatives, since in some applications false positives cannot be tolerated while in others false negatives cannot be tolerated.
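The four cells described above can be computed directly from observed values, predicted probabilities, and a cut value. A minimal plain-Python sketch with illustrative data:

```python
def classification_table(y, p, cut=0.5):
    # Keys are (observed, predicted); the off-diagonal cells are the errors
    table = {(obs, pred): 0 for obs in (0, 1) for pred in (0, 1)}
    for yi, pi in zip(y, p):
        table[(yi, 1 if pi > cut else 0)] += 1
    return table

y = [0, 0, 1, 1, 0, 1]                 # observed values
p = [0.2, 0.6, 0.7, 0.4, 0.3, 0.9]     # model probabilities
table = classification_table(y, p)
false_positives = table[(0, 1)]        # observed 0, predicted 1
false_negatives = table[(1, 0)]        # observed 1, predicted 0
```

Raising the cut value trades false positives for false negatives, which is how the tolerance differences between application fields are handled in practice.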
Cases plot (histogram of the predicted data)
This plot shows how many data records were assigned to each probability interval. An example of a cases plot is shown below, where 30 data records were assigned a 0.2-0.3 probability of having a value of 1, yet these records are classified by the model as 0. When a model contains high-coefficient variables, the cases plot tends to have most values at either end of the chart (data is classified as having either a very low or a very high probability of 1).
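The counts behind such a plot are simply a histogram of the model's predicted probabilities over ten 0.1-wide intervals. A plain-Python sketch (the probability values are made up):

```python
def cases_histogram(p, bins=10):
    # Number of data records falling in each probability interval;
    # a probability of exactly 1.0 is placed in the last interval
    counts = [0] * bins
    for pi in p:
        counts[min(int(pi * bins), bins - 1)] += 1
    return counts

p = [0.05, 0.25, 0.27, 0.95, 1.0, 0.88]
print(cases_histogram(p))
```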
The hits ratio shows the model performance over the deciles
During logistic regression model analysis it is important to examine the hit/miss ratio. In many models (e.g. models that classify behavior), the hits ratio chart is expected to have lower values in the middle, rising towards each end (as shown in the image below). This behavior is expected since the logistic regression model misses more cases classified with 0.3 – 0.7 probability than cases classified with 0 – 0.2 or 0.8 – 1 probability.
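This U-shaped pattern can be checked directly by computing the hit rate per probability interval. A plain-Python sketch with illustrative data (intervals containing no records return None):

```python
def hits_ratio_by_decile(y, p, cut=0.5, bins=10):
    # Fraction of correctly classified records in each probability interval
    hits, totals = [0] * bins, [0] * bins
    for yi, pi in zip(y, p):
        b = min(int(pi * bins), bins - 1)
        totals[b] += 1
        hits[b] += (1 if pi > cut else 0) == yi   # True counts as 1
    return [h / t if t else None for h, t in zip(hits, totals)]

y = [0, 0, 1, 1, 0, 1]                     # observed values
p = [0.15, 0.15, 0.95, 0.45, 0.65, 0.85]   # made-up model probabilities
ratios = hits_ratio_by_decile(y, p)
```

In this tiny example the middle intervals (0.4 – 0.7) hold the misses while the extreme intervals hold the hits, matching the expected shape of the chart.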