Credit Scoring Series Part Four: Variable Selection
The main philosophy of credit intelligence is “doing more with less,” and credit risk models are the means to embody this philosophy. Using an automated process and focusing on key information, credit decisions can be made in seconds, which can reduce operational costs by speeding up the decision-making process. Ultimately, fewer questions and rapid credit decisions increase customer satisfaction; for lenders, this means expanding their customer base, welcoming “safer” customers, and increasing profits.
But how can we reach these goals? We can find answers in the next step of the credit risk modeling process – the variable selection process.
The mining view created as the result of the data preparation process is a multi-dimensional customer signature that’s used to discover potentially predictive relationships and test the strength of those relationships. A thorough analysis of a customer’s signature is an important step when creating a set of testable hypotheses based on the characteristics found in the customer’s signature. Often referred to as business insights, this analysis helps organizations interpret customer behavior trends, further informing the modeling process.
The purpose of the business insights analysis is to:
- Validate that the derived customer data is aligned with business understanding. For example, insight analysis should support the business statement that “customers with a higher debt-to-income ratio are more likely to default”;
- Provide model result analysis benchmarks; and
- Shape the modeling methodology
Business insights analysis utilizes similar techniques to exploratory data analysis by combining univariate and multivariate statistics and different data visualization techniques. Typical techniques are correlation, cross-tabulation, distribution, time-series analysis, and supervised and unsupervised segmentation analysis. Segmentation is especially important, since it determines when multiple scorecards are needed.
Variable selection, based on the results of the business insights analysis, starts by partitioning the mining view into at least two different partitions: the training and testing partitions. The training partition is used to develop the model, and the testing partition is used for assessing and validating model performance.
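As a rough illustration of the partitioning step, the sketch below splits a mining view into training and testing partitions while keeping the share of bad customers the same in both. The field names (`customer_id`, `bad`), the 70/30 split, and the toy data are assumptions made for this example, not part of the process described above.

```python
import random
from collections import defaultdict


def stratified_split(rows, target_key="bad", test_fraction=0.3, seed=42):
    """Split rows into (train, test), preserving the bad rate in each partition."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible

    # Group customers by outcome so each class is split separately.
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[target_key]].append(row)

    train, test = [], []
    for label in sorted(by_class):
        group = by_class[label][:]
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Hypothetical mining view: 1,000 customers, 10% flagged as bad.
mining_view = [{"customer_id": i, "bad": 1 if i % 10 == 0 else 0} for i in range(1000)]
train, test = stratified_split(mining_view)


def bad_rate(rows):
    return sum(r["bad"] for r in rows) / len(rows)
```

Splitting each outcome class separately is one way to keep a rare "bad" class from ending up under-represented in the testing partition; an out-of-time split is another common choice when enough history is available.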
Figure 1. Simplified scorecard model building process
Variable Selection
Variable selection works on a collection of candidate model variables that are tested for significance during model training. Candidate model variables are also known as independent variables, predictors, attributes, model factors, covariates, regressors, features, and/or characteristics.
Variable selection is a parsimonious process that aims to identify the minimum number of predictors that will yield maximum predictive accuracy. This approach is different from that of data preparation, which aims to add as many variables as possible to the mining view. These opposing requirements are achieved using optimization – finding the minimal selection bias under the given constraints.
The main goal is to find the variables that the credit scorecard model could use both to predict the likelihood that a customer takes on bad debt and to rank customers by that likelihood. This usually means selecting statistically significant variables in the predictive model and arriving at a balanced set of predictors (usually 8–15) that gives a comprehensive view of a customer. In addition to customer-specific risk characteristics, we should also consider including systematic risk factors to account for economic fluctuation and volatility.
But this is easier said than done – there are many limitations involved in variable selection. First, the data will usually contain some highly predictive variables whose use is prohibited by legal, ethical, or regulatory frameworks. Second, some variables might be unavailable or of poor quality during the modeling or production stages. Additionally, there might be important variables that haven’t been recognized because of a biased population sample or because their model effect would be counterintuitive as a result of multicollinearity. And finally, the business/organization will always have the last word, and it might insist on including only business-sound variables, among other things.
All these constraints are potential sources of bias, which makes it difficult for data scientists to minimize selection bias. Typical preventive measures during variable selection include:
- Collaboration with other industry experts to identify important variables;
- Assessing potential issues regarding data source, reliability, or mismeasurement;
- Data cleansing; and
- Using control variables to account for banned variables or specific events (like economic drift, for example)
It’s also important to recognize that variable selection is an iterative process that occurs throughout the model building process, which may look like this:
- Prior to model fitting, teams may reduce the number of variables in the mining view to a manageable set of candidate variables;
- Then, during the model training process, teams may reduce variables further due to statistical insignificance, multicollinearity, low contributions, or penalization to avoid overfitting;
- From there, the process carries on during model evaluation and validation; and
- Finally, the process concludes upon business approval, where model clarity and usability are key
Variable selection finishes after a team reaches the “sweet spot,” the point where model accuracy is maximized.
Figure 2. Iterative nature of variable selection process
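One of the reduction steps listed above, removing variables because of multicollinearity, can be sketched as a simple pairwise correlation filter. This is only one possible approach and not necessarily the one used in this series. The variable names, the priority order, the toy values, and the 0.8 threshold are all assumptions made for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists (assumes non-zero variance)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def drop_correlated(columns, priority, threshold=0.8):
    """Walk variables in priority order; keep one only if it is not
    strongly correlated with any variable already kept."""
    kept = []
    for name in priority:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical candidate variables for six customers.
columns = {
    "debt_to_income": [0.10, 0.25, 0.40, 0.55, 0.70, 0.85],
    "monthly_debt": [200, 480, 820, 1100, 1390, 1720],  # moves almost in step with debt_to_income
    "months_on_book": [36, 5, 48, 12, 60, 24],
}
# Assumed ordering, for example by business preference or univariate strength.
priority = ["debt_to_income", "monthly_debt", "months_on_book"]

selected = drop_correlated(columns, priority)
```

Here `monthly_debt` is dropped because it carries nearly the same information as `debt_to_income`, which ranks higher in the assumed priority order. In practice the priority order itself is a modeling decision that business experts would normally review.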
Variable Selection Methodologies
A plethora of variable selection methods are available. With advances in machine learning, this number has been constantly increasing. Variable selection techniques depend on whether we use variable reduction or variable elimination (filtering); whether the selection process is carried out inside or outside predictive models; whether we use supervised or unsupervised learning; or if the underlying methods are based on specific embedded techniques like cross-validation.
Variable selection method | Techniques |
---|---|
Supervised variable selection outside predictive models (Figure 3) | Information value; Chi-square statistics; Gini index |
Unsupervised variable selection/extraction outside predictive models | Correlation analysis; Cluster analysis; Principal component analysis; Neural networks |
Supervised variable selection inside predictive models | Recursive feature selection (forward, backward and stepwise); Regularization techniques (for example, AIC/BIC, lasso, ridge); Ensemble techniques (for example, random forest and gradient boosting); Cross validation |
Table 1. Typical credit risk modeling variable selection techniques
Figure 3. Variable selection via bivariate analysis
In credit risk modeling, two of the most used variable selection methods are information value for filtering prior to model training, and stepwise selection for variable selection during the training of a logistic regression model. Although both receive some criticism from practitioners, it’s important to recognize that no ideal methodology exists because each variable selection method has its benefits and drawbacks. Which one to use and how best to combine them isn’t an easy task to solve and requires solid domain knowledge, good data understanding, and extensive modeling experience.
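To make the information value filter more concrete, here is a minimal sketch using the common weight-of-evidence formulation: for each bin of a variable, WoE is the log of the ratio between the bin's share of good customers and its share of bad customers, and information value sums the difference of those shares multiplied by the WoE. The bin counts, the variable names, and the 0.5 adjustment for empty cells are assumptions made for this example.

```python
from math import log


def information_value(bins):
    """bins: list of (n_good, n_bad) counts, one tuple per bin of a variable.
    Returns (list of WoE values per bin, total information value)."""
    total_good = sum(good for good, _ in bins)
    total_bad = sum(bad for _, bad in bins)

    woe_values, iv = [], 0.0
    for good, bad in bins:
        # Assumption: replace empty cells with 0.5 to avoid log(0) or division by zero.
        dist_good = max(good, 0.5) / total_good
        dist_bad = max(bad, 0.5) / total_bad
        woe = log(dist_good / dist_bad)
        woe_values.append(woe)
        iv += (dist_good - dist_bad) * woe
    return woe_values, iv


# Hypothetical counts of (good, bad) customers per bin.
dti_bins = [(400, 10), (300, 30), (200, 60)]      # low, medium, high debt-to-income
noise_bins = [(300, 33), (300, 34), (300, 33)]    # a variable unrelated to default

dti_woe, dti_iv = information_value(dti_bins)
noise_woe, noise_iv = information_value(noise_bins)

# Rank candidates by information value, strongest first.
ranking = sorted({"debt_to_income": dti_iv, "noise": noise_iv}.items(),
                 key=lambda item: item[1], reverse=True)
```

In this toy data the debt-to-income variable separates good and bad customers far better than the noise variable, so it ranks first. Any cut-off applied to the ranking to decide which variables move on to model training is a judgment call that depends on the portfolio.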
Conclusion
Credit scoring is a dynamic, flexible, and powerful tool for lenders, but there are plenty of ins and outs that are worth covering in detail. To learn more about credit scoring and credit risk mitigation techniques, read the next installment of our credit scoring series, Part Five: Credit Scorecard Development.