Credit Scoring Series Part Three: Data Preparation and Exploratory Data Analysis
In computer science, there’s an axiom that rings especially true when it comes to data projects: “Garbage in, garbage out.” In every case – and especially in credit scorecard development – data preparation is paramount. And unfortunately, data preparation is the most challenging, time-consuming phase of the Cross Industry Standard Process for Data Mining (CRISP-DM) cycle. In most projects, teams dedicate 70% of their time to data preparation; in some cases, that figure can reach more than 90%. That’s because data preparation involves data collection, combining multiple data sources, aggregations, and transformations, data cleansing, “slicing and dicing,” and looking at the data’s breadth and depth so organizations can clearly understand how to turn data quantity into data quality. Ensuring quality data preparation helps teams move to the next phase of scorecard development – model building.
This series’ previous article discussed the importance of model design and identified its main components: unit of analysis, population frame, sample size, criterion variable, modeling windows, data sources, and data collection methods. It’s vital to consider each of these components for successful data preparation. The final product of this stage is a mining view that encompasses the right level of analysis, modeling population, and independent and dependent variables.
|Mining view component||Application scorecard case study example|
|Unit of analysis||Customer level|
|Population frame||Loan applicants with previous bad debt history|
|Sample size||Through the door applicants during 2015 and 2016|
|Data sources||Credit bureau data, applicant data, age debt history|
|Independent variables||Mixture of nominal, ordinal and interval data, such as aggregated values, flags, ratios, time and date values|
|Dependent variable||Default status (1 or 0)|
|Operational definitions||Default: 90 days past due|
|Observation window||Historical credit bureau customer information over the period of three years|
|Performance window||One year|
Table 1. Model design components
As the old saying goes, the more the merrier. As part of data understanding, any external and internal data sources should provide both quantity and quality. The data utilized must be relevant, accurate, timely, consistent, and complete; it must also be voluminous and diverse enough to provide useful results. For application scorecards where internal data is limited, external data can fill in the gaps. In contrast, behavior scorecards utilize more of the internal data – which is why behavior scorecards usually have more predictive power than application scorecards. The table below outlines the common data sources that would be required for customer verification, fraud detection, or credit granting.
|Source||Category||Supplied by authority|
|External||Address, postcode||Credit Bureaus|
|Electoral roll data|
|Court and insolvency|
|Generic bureau scores|
Table 2. Data source diversity
The data preparation process starts with data collection, commonly referred to as the Extract-Transform-Load (ETL) process. Data integration combines different sources using data merging and interlinking; typically, it requires teams to manipulate relational tables using several integrity rules such as entity, referential, and domain integrity. Using one-to-one, one-to-many, or many-to-many relationships, the data is then aggregated to the desired level of analysis so it produces a unique customer signature.
Figure 1. Data preparation process
Data exploration and data cleansing are mutually iterative steps. Data exploration includes both univariate and bivariate analysis and ranges from univariate statistics and frequency distributions to correlations, cross-tabulation, and characteristic analysis.
Figure 2. EDA (univariate view)
Figure 3. EDA (characteristic analysis)
Following exploratory data analysis (EDA), the data is treated to increase its quality. Data cleansing requires good business and data understanding so organizations and teams can correctly interpret the data. This is an iterative process designed to remove, replace, modify, and rectify irregularities. Two major issues with unclean data are missing values and outliers – both these things can dampen model accuracy, making it imperative that teams and organizations have a solid data cleansing process in place.
Before a decision is made on how to treat missing values, we need to understand the reason for missing data and understand the distribution of missing data so we can categorize it as:
- Missing completely at random (MCAR);
- Missing at random (MAR); or
- Missing not at random (MNAR)
Missing data treatment often assumes MCAR and MAR, but NMAR data is more difficult to deal with. The list below provides common treatments ordered by complexity.
|Missing Data Treatment||Description|
|Leave missing data||Small percentage of missing values could be tolerated|
Missing values have special meaning and would be treated as a separate category
|Delete missing data||Listwise (complete) or|
Pros: simple and fast
Cons: reduce statistical power, problematic on small datasets
|Single imputation||Mean, mode, median; add missing_flag for adjustment;|
Pros: simple, fast and uses the complete dataset
Cons: reduced variability, ignores relationship between variables; not effective where that data contains a large amount of missing values (typically more than 5% of the data)
Cons: reduced variance
Pros: imputes categorical and numeric data
Cons: performance issue on large datasets
Maximum likelihood estimation
Pros: unbiased, used the complete dataset
Pros: accurate, cutting-edge machine learning technique
Cons: difficult to code without a special function
Table 3. Missing data treatments
Outliers are another “beast” in our data, since they can violate the statistical assumptions that underpin models. Once identified, it’s important to understand why outliers are there before applying any treatment, because while in most cases they must be dealt with, they can occasionally be useful. For example, outliers could be a valuable source of information in fraud detection; in that case, it’d be a bad idea to replace them with a mean or median value.
Outliers should be analyzed using univariate and multivariate analysis. For detection, we can use visual methods like histograms, box plots, or scatter plots; we can also use statistical methods like mean and standard deviation, clustering, small decision tree leaf nodes, Mahalanobis distance, Cook’s D, and/or Grubbs’ test. But determining what is and isn’t an outlier isn’t as simple as identifying missing values. The decision should be based upon a specified criterion. For example: any value outside ±3 standard deviations, or ±1.5IQR, or 5th to 95th percentile range would be labeled as an outlier.
Outliers can be treated in a similar way as missing values, but other transformations can also be utilized, including binning, weights assignment, conversion to missing values, logarithm transformations, and/or Winsorization.
As discussed above, data cleansing may require teams to implement different statistical and machine learning techniques. Even though these transformations could create better scorecard models, teams must consider the practicality of implementing these, since complex data manipulations can be costly, difficult to implement, and they can slow down model processing performance.
Once the data is clean, we’re ready for a more creative part – data transformation. Data transformation (or feature engineering) is the creation of additional (hypothesized) model variables that are tested for significance. The most common transformations include binning and optimal binning, standardization, scaling, one hot encoding, interaction terms, mathematical transformations (from non-linear into linear relationships and from skewed data into normally distributed data) and data reduction using clustering and factor analysis.
Apart from some general recommendations on how to tackle this task, it’s the data scientist’s responsibility to suggest the best approach to transform the customer data signature into the powerful mining view. This is probably the most creative – and most challenging – aspect of the data scientist’s role since it requires both a solid grasp of business understanding and considerable statistical and analytical ability. Often, the key of a good model isn’t the power of a specific modeling technique, but the breadth and depth of derived variables that represent a higher level of knowledge about the phenomena being examined.
Credit scoring is a dynamic, flexible, and powerful tool for lenders, but there are plenty of ins and outs that are worth covering in detail. To learn more about credit scoring and credit risk mitigation techniques, read the next installment of our credit scoring series, Part Four: Variable Selection.
And click here to learn more about Altair’s credit scoring, credit risk, and financial services solutions.
Read prior Credit Scoring Series installments: