Data Preparation 101 – The Objective of Data Preparation
Data preparation is a fundamental aspect of the modeling process. In fact, it is the most important part of the process since it occupies up to 80% of the total time of the project. The objective of data preparation is to prepare what is known as the modeling view or mining view.
The modeling view is the table that contain all the variables needed for the development and validation of the predictive model. It contains the target variable (dependent variable) and the candidate predictors (independent variables). The modeling view also contains other fields not needed specifically for modeling, but are used for other related purposes such as ID variables and fields used in reporting or model deployment.
The input to the data preparation step is the raw data coming from operational systems, external data, survey data, and data generated within the organization from other analytical platforms such as BI and OLAP tools.
Data preparation for modeling is not a one-step process, it is an iterative process that interacts with the data exploration, and modeling and validation. As a result, there is no standardized process for data preparation. However, there are some common categories of operations that are used. These are:
- Sourcing and combining the data: these operations include data import, merging, appending, sampling, and combining data from different sources.
- Data cleansing and transformations: these include filtering variables, treatment of missing values, summarization and aggregation, generation of new variables, binning and reducing number of categories, ranks, and reshaping variable distributions (such as Box-Cox transformations), and weight of evidence transformations.
- Variable reduction: calculation of predictive power and variable selection, variable reduction using principal component analysis, factor analysis, and variable clustering.