Altair_Blog_hero_1920x225

Featured Articles

The Art and Science of Working with Missing Values

“Missing” or “null” are the most common values in any data generated by a business operational system. They are less common in engineering applications where data is generated by sensors and machine settings of known values. But where do these values come from? 

In business operational systems, missing values are generated because most of the data elements supposed to be filled when dealing with customers aren’t captured. This is especially common in bank forms, where many fields are left blank. These fields generate missing values.  Anther common source of missing values is errors that occur when transforming data from one storage system to another. For example, when importing data from a relational database into, say, a Python Pandas DataFrame, not all data formats supported by the database engine are available in, say, Panda’s data manipulation library. Therefore, these formats’ improper treatment leads to blank data fields in new DataFrames, even when those fields contained values in the original database. 

Missing values also occur because of calculation errors, such as division by zero, numerical overflow, or trying to calculate the logarithm of a nonpositive value. 

But not all missing values are a result of errors or values that are unknown or unfilled by the data generating system. They may have an important meaning that’s relevant to the business process in consideration. For example, when a financial institution sends a new credit card to a customer, the customer is expected to activate the card before they can use it. If the customer doesn’t activate the card, the field “Activation Date” will remain empty, i.e., having a value of “Missing” (or “null”). This value indicates that the institution issued the card, but the consumer didn’t activate it.  

Coding missing values is somehow arbitrary. Some systems fill fields with special values that indicate missing, such as “9999.” In most database engines, missing values are denoted by “null”. Some systems may have more than one value to denote missing, such as “null” and “NaN,” which stands for “not a number” and may be reserved when the missing value is the result of computational error. Some systems represent missing values with different symbols depending on value type. For example, in the SAS language, a missing string is representing by a null string “”, while a null numeric value is represented by a period “.”. This variability of coding missing values in different platforms presents an additional challenge to the process of treating missing values. 

When faced with missing values, we have three possibilities:

  1. We can ignore the record with missing values. This approach may not be feasible when we have many missing values in different variables. It’s best when the frequency of missing values is low, and they’re limited to a small number of variables in the dataset. 
  2. We may want to “replace” the missing value with a valid value for this variable. For example: if a customer’s age is missing (as well as the date of birth !) we may assume the average value of the entire population. 
  3. The last method is known as “data imputation,” where we develop a model that predicts the value that should have been in place of the missing value by using the other non-missing values in other variables as predictors. The technique known as “multiple imputation” goes beyond that and predicts a set of values to replace the missing value. These values satisfy a specific distribution that is expected to be present in the data. 

Today, missing values are simply a part of working with data. To learn more about how to treat missing values and to explore how Altair’s tools implement the strategies and solutions above, visit the webpage for Altair's data analytics capabilities and Altair Knowledge Studio.

Click here to learn more about Altair’s machine learning capabilities, solutions, and expertise.