1 of 15

Chapter 9: Missing Values Processing

Introduction

Missing values are encountered in virtually all real-world data collection processes. Missing values can result from non-responses in surveys, poor record-keeping, server outages, attrition in longitudinal surveys, faulty sensors of a measuring device, etc. Despite the intuitive nature of this problem and the fact that almost all quantitative studies are affected by it, applied researchers have given it remarkably little attention in practice. Burton and Altman (2004) state this predicament very forcefully in the context of cancer research: “We are concerned that very few authors have considered the impact of missing covariate data; it seems that missing data is generally either not recognized as an issue or considered a nuisance that it is best hidden.”

Given the abundance of “big data” in the field of analytics, missing values processing may not be a particularly fashionable topic. After all, who cares about a few missing data points if there are many more terabytes of observations waiting to be processed? One could be tempted to analyze complete data only and remove all incomplete observations. Regardless of how many more complete observations might be available, this naive approach would almost certainly lead to misleading interpretations or create a false sense of confidence in one’s findings.

Koller and Friedman (2009) provide an example of a hypothetical medical trial that evaluates the efficacy of a drug. In this trial, patients can drop out, in which case their results are not recorded. If patients withdraw at random, there is no problem ignoring the corresponding observations. On the other hand, if patients prematurely quit the trial because the drug does not seem to help them, discarding these observations introduces a strong bias in the efficacy evaluation. As this example illustrates, it is important to understand the mechanism that produces the missingness, i.e. the conditions under which some values are not observed.

As missing values processing — beyond the naive ad hoc approaches — can be a demanding task, both methodologically and computationally. Traditionally, the process of specifying an imputation model has been a scientific modeling effort on its own, and few non-statisticians dared to venture into this specialized field (van Buuren, 2007).

With Bayesian networks and BayesiaLab, handling missing values properly now becomes feasible for researchers who might otherwise not attempt to deal with missing values beyond the ad hoc approaches. Responding to Burton and Altman’s serious concern, we believe that the presented methods can help missing values processing become an integral part of more research projects in the future.

We have already mentioned missing values processing several times in earlier chapters, as it is one of the steps in the Data Import Wizard. However, we have delayed a formal discussion of the topic until now because the recommended missing values processing methods are tightly integrated with BayesiaLab’s learning algorithms. Indeed, all of BayesiaLab’s core functions for learning and inference are prerequisites for successfully applying missing values processing. With all the building blocks in place, we can now explore this subject in detail.

Types of Missingness

Four principal types of missing values are typically encountered in research:

Missing Completely at Random (MCAR)
Missing at Random (MAR)
Missing Not at Random (MNAR) or Not Missing at Random (NMAR). Both of these equivalent expressions, MNAR and NMAR, appear equally frequently in the literature. We use MNAR throughout this chapter.
Filtered Values

We can explain each of these conditions with the following causal Bayesian network. It illustrates:

the data-generating process (DGP);
the mechanism that causes the missingness;
the observable variables that contain the missing values.

Reference Network Model

Generating a Test Dataset

We use this reference network to simulate all missingness conditions and generate a test dataset from it for subsequent evaluation:

Learn about Generating the Test Dataset

With such a test dataset, the problems associated with missingness become very obvious. Given that we have specifically encoded all types of missingness mechanisms in the reference network, the resulting test dataset is a kind of worst-case scenario, which is ideal for testing purposes.

However, before we can apply any missing values processing methods, we need to bring the test dataset into BayesiaLab. While the Data Import Wizard is explained in detail in the BayesiaLab User Guide (see Open Data Source), we quickly summarize the steps in the following sub-topics:

Importing the Test Dataset into BayesiaLab

Missing Values Processing in BayesiaLab

Once imported into BayesiaLab, we attempt to recover the reference network's original distributions from the test dataset.

In a typical data analysis workflow in BayesiaLab, a researcher encounters Missing Values Processing in Step 3 of the Data Import Wizard, i.e., when importing a dataset. So, we evaluate each available Missing Values Processing method in the context of a prototypical workflow.

Filter (Listwise/Casewise Deletion)
Replace By (Mean/Modal Imputation)
Infer — Static Imputation
Infer — Dynamic Imputation
Infer — Structural EM
Infer — Entropy-Based Imputations
Approximate Dynamic Imputation

Each of the above methods yields an imputed dataset. Now we can examine how well the imputed datasets match the distributions from the reference model. We explore the advantages and disadvantages of each method on this basis.

Ultimately, this assessment of approaches is meant as a guide for choosing a Missing Values Processing method as a function of what we know about the data-generating process and the missingness mechanism in particular.

Before we even present results, we need to warn you that some of the to-be-evaluated methods, such as Filter (Listwise/Casewise Deletion) and Replace By (Mean/Modal Imputation), are not recommended for default use. We still include them for two reasons: first, they are almost universally used in statistical analysis, and second, under certain circumstances, they can be safe to use. Regardless of their suitability for research, they can help us understand the challenges of missing values processing.

Missing Completely at Random (MCAR)

Missing Completely at Random (MCAR) means that the missingness mechanism is entirely independent of all other variables. In our causal Bayesian network, we encode this independent mechanism with a boolean variable named MCAR_X1.

Furthermore, we assume that there is a variable X1 that represents the original data-generating process. This variable, however, is hidden, so we cannot observe it directly. Rather, we can only observe X1 via the variable X1_obs, which is a “clone” of X1 but with one additional state, “?”, which indicates that the value of X1 is not observed.

The Bayesian network shown below is a subnetwork of the complete network presented above. The behavior of the three variables we just described is encoded in this subnetwork.

In addition to this qualitative structure, we need to describe the quantitative part, i.e., the parameters of this subnetwork, including the missingness mechanism and the observable variable:

X1 is a continuous variable with values between 0 and 1. We have arbitrarily defined a Normal distribution for modeling the DGP.
MCAR_X1 is a boolean variable without any parent nodes. This means that MCAR_X1 is independent of all variables, whether hidden or not. Its probability of being true is 10%.
X1_obs has two parents: the data-generating variable X1 and the missingness mechanism MCAR_X1 . The following deterministic rule defines the conditional probability distribution of X1_obs: IF MCAR_X1 THEN X1_obs =? ELSE X1_obs = X1

Now that our causal Bayesian network is fully specified, we can evaluate the impact of the missingness mechanism on the observable variable X1obs. Given that we have created a complete model of this small domain, we automatically have perfect knowledge of the distribution of X1. Thus, we can directly compare X1 and X1_obs via the Monitors.

We see that X1 (left) and X1_obs (center) have the same mean and the same standard deviation. This suggests that the remaining observations in X1_obs (center) are not different from the non-missing cases in X1 (left). The only difference is that X1_obs (center) has one additional state (“?”) for missing values, representing 10% of the observations. Thus, deleting the missing observations of an MCAR variable should not bias the estimation of its distribution. In BayesiaLab, we can simulate this assumption by setting negative evidence on “?” (green arrow labeled “Delete”). As we can see, the distribution of X1_obs (right) is now exactly the same as the one of X1 (left).

Under real-world conditions, however, we typically do not know whether the missing values in our dataset were generated completely at random (MCAR). This would be a strong assumption to make, and it is generally not testable. As a result, we can rarely rely on this fairly benign condition of missingness and, thus, should never be too confident in deleting missing observations.

Missing at Random (MAR)

Secondly, data can be Missing at Random (MAR). Here, the missingness of data depends on observed variables. A brief narrative shall provide some intuition for the MAR condition: in a national survey of small business owners about the business climate, there is a question about the local cost of energy. Chances are that the owner of a business that uses little electricity, e.g., a yoga studio, may not know of the current cost of 1 kWh of electric energy and could not answer that question, thus producing a missing value in the questionnaire. On the other hand, the owner of an energy-intensive business, e.g., an electroplating shop, would presumably be keenly aware of the electricity price and able to respond accordingly. In this story, the probability of non-response is presumably inversely proportional to the energy consumption of the business.

In the subnetwork shown below, X3_obs is the observed variable that causes the missingness, e.g., the energy consumption in our story. X2_obs would be the stated price of energy if known. X2 would represent the actual price of energy in our narrative. Indeed, from the researcher’s point of view, the actual cost of energy in each local market and for each electricity customer is hidden.

To simulate this network, we need to define its parameters, i.e., the quantitative part of the network structure:

X2 is a continuous variable with values between 0 and 1. Here, too, we have arbitrarily defined a Normal distribution for modeling the DGP.
MAR_X2 is a boolean variable with one parent, which specifies that the missingness probability depends directly on the fully observed variable X3_obs. The exact values are not important here, as we only need to know that the probabilities of missingness are inversely proportional to the values of X3_obs:

X2_obs has two parents, i.e., the data-generating variable X2 and the missingness mechanism MAR_X2. The conditional probability distribution of X2_obs can be described by the following deterministic rule: IF MAR_X2 THEN X2_obs=? ELSE X2_obs=X2

Given the fully specified network, we can now simulate the impact of the missingness mechanism on the observable variable X2_obs.

As the above screenshot shows, the mean and standard deviation in the Monitor of X2_obs indicates that the distribution of the observed values of X2 differs significantly from the original distribution, leading to an overestimation of X2 in this example. We can simulate the deletion of incomplete observations by setting negative evidence on “?” in the Monitor of X2_obs (green arrow labeled “Delete”). The simulated distribution of X2_obs (right) clearly differs from the one of X2 (left).

Missing Not at Random (MNAR)

Missing Not at Random (MNAR) or Not Missing at Random (NMAR) are equivalent expressions; MNAR and NMAR appear equally frequently in the literature. We use MNAR throughout this chapter.

Missing Not at Random (MNAR) refers to situations in which the missingness of a variable depends on hidden causes (unobserved variables), such as the data-generating variable itself. This condition is depicted in the subnetwork below.

An example of the MNAR condition would be a hypothetical telephone survey about alcohol consumption. Heavy drinkers might decline to provide an answer out of fear of embarrassment. On the other hand, survey participants who drink very little or nothing at all might readily report their actual drinking habits. As a result, the missingness is a function of the very variable in which we are interested.

In order to proceed to simulation, we need to specify the parameters of the missingness mechanism and the observable variable:

X4 is a continuous variable with values between 0 and 1, and a Normal distribution models the DGP.
MNAR_X4 is a boolean variable with one parent, which specifies that the missingness probability depends directly on the hidden variable X4. However, the exact values are unimportant. We need to state that the probabilities of missingness are proportional to the values of X4:

X4_obs has two parents, i.e., the data-generating variable X4 and the missingness mechanism MNAR_X4.
The following deterministic rule defines the conditional probability distribution: IF MNAR_X4 THEN X4_obs=? ELSE X4_obs=X4

The impact of the mechanism of the missing value becomes apparent as we compare the Monitors of the network side by side.

As the above screenshot shows, the mean and standard deviation in the Monitor of X4_obs (center column) indicate that the distribution of the observed values of X4 differs significantly from the original distribution (left column), leading to an underestimation of X4 in this example. We can simulate the deletion of incomplete observations by setting negative evidence on “?” (green arrow labeled “Delete”). The simulated distribution of X4_obs (right column) indeed differs from the one of X4 (left column).

Filtered Values

There is a fourth type of missingness, which is less often mentioned in the literature. In BayesiaLab, we refer to missing data of this kind as Filtered Values (see also Filtered Values in Chapter 5). In fact, Filtered Values are technically not missing at all. Rather, Filtered Values are values that do not exist in the first place. Clearly, something nonexistent cannot become missing due to a missingness mechanism.

For instance, in a hotel guest survey, a question about one’s satisfaction with the hotel swimming pool cannot be answered if the hotel property does not have a swimming pool. This question is not applicable. The absence of a swimming pool rating for this hotel would not be a missing value. On the other hand, for a hotel with a swimming pool, the absence of an observation would be a missing value.

Conceptually, Filtered Values are quite similar to Missing at Random (MAR) values, as Filtered Values usually depend on other variables in the dataset, too, which may or may not be fully observed. However, Filtered Values should never be processed as missing values. In our example, it is certainly not reasonable to impute a value for the swimming pool rating if there is no swimming pool. Rather, a Filtered Value should be considered a special type of observation.

In BayesiaLab, an additional state, marked with a chequered icon, is added to this type of variable in order to denote Filtered Values (BayesiaLab’s learning algorithms implement a kind of local selection for excluding the observations with Filtered Values while estimating the probabilistic relationships). The following illustration shows an example of a network including Filtered Values.

Once again, we must describe the parameters of the subnetwork, including the Filtered Values mechanism and the observable variable:

Filter_X5 is a boolean variable with one parent, which specifies that it depends on the hidden variable X4. Here, X5 becomes a Filtered Value if X4 is greater than 0.7.
X5 is a continuous variable with values between 0 and 1. It has two parents, X4 and the Filtered Values mechanism: IF Filter_X5 THEN X5=Filtered Value ELSE X5=f(X4)
X5_obs is a pure clone of X5, i.e., X5 is fully observed: X5_obs=X5

For the sake of completeness, we present the Monitors of X5 (left) and X5_obs (right):

Missing Values Processing in BayesiaLab

Generating the Test Dataset

To begin this exercise, we use BayesiaLab to produce the test data that we will later use for evaluating the Missing Values Processing methods.

We can directly generate data according to the joint probability distribution encoded by the Reference Network: Main Menu > Data > Generate Data.

Next, we must specify whether to generate this data internally or externally. For now, we generate the data internally, which means that we associate data points with all nodes. This includes missing values and Filtered Values according to the reference network.

For the Number of Examples (i.e., cases or records), we set 10,000.

Now that this data exists inside BayesiaLab, we need to export it, so we can truly start “from scratch” with the test dataset. Also, regarding realism, we only want to make the observable variables available rather than all. We first select the nodes X1_obs through X5_obs and then select Main Menu > Data > Save Data from the main menu.

Next, we confirm that we only want to save the Selected Nodes, i.e., the observable variables.

Upon specifying a file name and saving the file, the export task is complete.

A quick look at the CSV file confirms that the newly generated data contain missing and Filtered Values, as indicated with question marks (?) and asterisks (*), respectively.

Now that we have produced a test dataset with all types of missingness, we forget our reference model to start “from scratch.” We approach this dataset as if this were the first time we see it, without any assumptions and without any background knowledge. This provides us with a suitable test case for BayesiaLab’s range of missing values processing methods.

Importing the Test Dataset into BayesiaLab

We show the first two steps of the Data Import Wizard only for reference, as their options have already been discussed in previous chapters.

Our test dataset consisting of 10,000 records was saved as a CSV file, so we start the import process via Main Menu > Data > Open Data Source > Text File.

Step 1: Data Structure Definition

Note the missing values in columns X1_obs, X2_obs, and X4_obs in the Data Panel. Column X5_obs features Filtered Values, which are marked with an asterisk (*).

Step 2: Definition of Variable Types

The next step of the Data Import Wizard requires no further input, but we can review the statistics provided in the Information Panel: we have 5,547 missing values (=11.09% of all cells in the Data panel) and 1,364 Filtered Values (=2.73%).

Step 3: Data Selection, Filtering, and Missing Values Processing

The next screen brings us to the core task of selecting the Missing Values Processing method. In the screenshot, the default option Structural EM is pre-selected, but we will explore all options systematically from the top. The default method can be specified under Main Menu > Window > Preferences > Data > Import & Associate > Missing & Filtered Values.

We explain and evaluate each Missing Values Processing method separately. Please select the topic below or open it in the navigation bar.

Filter (Listwise/Casewise Deletion)
Replace By (Mean/Modal Imputation)
Infer — Static Imputation
Infer — Dynamic Imputation
Infer — Structural EM
Infer — Entropy-Based Imputations
Approximate Dynamic Imputation

Filter (Listwise/Casewise Deletion)

The screenshot below shows Filter applied to X1_obs only. Given this selection, the Number of Rows, i.e., the number of cases or records in the dataset, drops from the original 10,000 to 8,950. Note that Filter can be applied variable by variable. Thus, it is possible to apply Filter to a subset of variables only and use other methods for the remaining variables.

In the Graph Panel, the absence of the question mark icon on X1obs signals that it no longer contains any missing values.

The Monitors now show the processed distributions. However, for a formal review of the processing effects, we must compare the distributions of the newly processed variables with their unprocessed counterparts.

In the overview below, we compare the original distributions (left column), followed by the distributions corresponding to the 10,000 generated samples (center column), and the distributions produced by the application of Missing Values Processing (right column). This is the format we will employ to evaluate all missing values processing methods.

Now we turn to test the application of Filter to all variables with missing values, i.e., X1_obs, X2_obs, and X4_obs.

Even before evaluating the resulting distributions, we see in the Information Panel that over half of the rows of data are being deleted due to applying Filter. It is easy to see that in a dataset with more variables, this could quickly reduce the number of remaining records—potentially down to zero. In a dataset in which not a single record is completely observed, Filter is obviously not applicable at all.

Replace By (Mean/Modal Imputation)

As opposed to deletion-type methods, such as Filter (Listwise/Casewise Deletion), we now consider the “opposite” approach, i.e., filling in the missing values with imputed values. Here, imputing means replacing the non-observed values with estimates in order to facilitate the analysis of the whole dataset.

In BayesiaLab, this function is available via the Replace By option. We can specify to impute any arbitrary value, e.g., based on expert knowledge or an automatically generated value. For a Continuous variable, BayesiaLab offers a default replacement of the missing values with the mean value of the variable. For a Discrete variable, the default is the modal value, i.e., the most frequently observed value of the variable. In our example, X1_obs has a mean value of 0.40878022. This is the value to be imputed for all missing values for X1_obs.

Note that Replace By can be applied variable by variable. Thus, it is possible to apply Replace By to a subset of variables only and use other methods for the remaining variables.

For the purposes of our example, we use Replace By for X1_obs, X2_obs, and X4_obs. As soon as this is specified, the number of the remaining missing values is updated in the Information Panel. By using the selected method, no missing values remain.

In the same way, we studied the performance of Filter, we now review the results of the Replace By method. Whereas this imputation method is optimal at the individual/observation level (it is the rational decision for minimizing the prediction error), it is not optimal at the population/dataset level. The right column in the following screenshot shows that imputing all missing values with the same value has a strong impact on the shape of the distributions. Even though the mean values of the processed variables (right column) remain unchanged compared to the observed values (center column), the standard deviation is much reduced.

Similar to our verdict on Filter (Listwise/Casewise Deletion), Replace By cannot be recommended either for general use. However, its application could be justified if expert knowledge were available for setting a specific replacement value or if the number of affected records were negligible compared to the overall size of the dataset.

Infer — Static Imputation

Static Imputation resembles the Replace By (Mean/Modal Imputation) method but differs in three important aspects:

The buttons under Infer are available whenever a variable with missing values is selected in the Data Panel.

While Replace By (Mean/Modal Imputation) is deterministic, Static Imputation performs random draws from the marginal distributions of the observed data and saves these randomly drawn values as “placeholder values.”
The imputation is only performed internally, and BayesiaLab still “remembers” exactly which observations are missing.
Whereas Replace By (Mean/Modal Imputation) can be applied to individual variables, any of the options under Infer apply to all variables with missing values, with the exception of those that have already been processed by Filter (Listwise/Casewise Deletion) or Replace By (Mean/Modal Imputation).

Although this probabilistic imputation method is not optimal at the observation/individual level (it is not the rational decision for minimizing the prediction error), it is optimal at the dataset/population level.

As illustrated below, drawing the imputed values from the current distribution keeps the distributions of variables pre and post-processing the same. As a result, Static Imputation returns distributions that match the ones produced by Filter (Listwise/Casewise Deletion) but without deleting any observations. As no records are discarded, Static Imputation does not introduce any additional biases. However, the distributions of X2 (MAR) and X4 (MNAR) remain strongly biased.

Infer — Dynamic Imputation

Dynamic Imputation is the first of a range of methods that take advantage of the structural learning algorithms available in BayesiaLab.

Like Infer — Static Imputation, Dynamic Imputation is probabilistic; imputed values are drawn from distributions. However, unlike Infer — Static Imputation, Dynamic Imputation does not only perform imputation once but rather whenever the current model is modified, i.e., after each arc addition, deletion, and reversal during structural learning. This way, Dynamic Imputation always uses the latest network structure for updating the distributions from which the imputed values are drawn.

Upon completion of the data import, the resulting unconnected network initially has exactly the same distributions as the ones we would have obtained with Static Imputation. In both cases, imputation is only based on marginal distributions. With Dynamic Imputation, however, the imputation quality gradually improves during learning as the structure becomes more representative of the data-generating process. For example, a correct estimation of the MAR variables is possible once the network contains the relationships that explain the missingness mechanisms.

Dynamic Imputation might also improve the estimation of MNAR variables if structural learning finds relationships with proxies of hidden variables that are part of the missingness mechanisms.

The question marks associated with X1_obs, X2_obs, and X4_obs confirm that the missingness is still present, even though the observations have been internally imputed.

On the basis of this unconnected network, we can perform structural learning. We select Main Menu > Learning > Unsupervised Structural Learning > Taboo.

While the network only takes a few moments to learn, we notice that it is somewhat slower compared to what we would have observed using a non-dynamic missing values processing method, e.g., Filter (Listwise/Casewise Deletion), Replace By (Mean/Modal Imputation), or Infer — Static Imputation. For our small example, the additional computation time requirement is immaterial. However, the computational cost increases with the number of variables in the network, the number of missing values, and, most importantly, the complexity of the network. As a result, Dynamic Imputation can slow down the learning process significantly.

The following screenshot reports the performance of the Dynamic Imputation. The distributions show a substantial improvement compared to all the other methods we have discussed so far. As expected, X2_obs is now correctly estimated, and it even improves the distribution estimation of the difficult-to-estimate MNAR variable X4_obs. More specifically, there is now much less of an underestimation of the mean value.

Infer — Structural EM

Structural Expectation Maximization (or Structural EM for short) is the next available option under Infer. This method is very similar to Dynamic Imputation, but instead of imputing values after each structural modification of the model, the set of observations is supplemented with one weighted observation per combination of the states of the jointly unobserved variables. Each weight equals the posterior joint probability of the corresponding state combination.

Upon completion of the data import process, we perform structural learning again, analogously to what we did in the context of Dynamic Imputation. As it turns out, the discovered structure is equivalent to the one previously learned. Hence, we can immediately proceed to evaluate the performance.

The distributions produced by Structural EM are quite similar to those obtained with Dynamic Imputation. At least, in theory, Structural EM should perform slightly better. However, the computational cost can be even higher than that of Dynamic Imputation because the computational cost of Structural EM also depends on the number of state combinations of the jointly unobserved variables.

Infer — Entropy-Based Imputations

Whereas the standard (non-entropy-based) approaches randomly choose the sequence in which missing values are imputed within a row of data, the entropy-based methods select the order based on the conditional uncertainty associated with the unobserved variable. More specifically, missing values are imputed first for those variables that meet the following conditions:

Variables that have a fully-observed/imputed Markov Blanket;
Variables that have the lowest conditional entropy, given the observations and imputed values.

The advantages of the entropy-based methods are (a) the speed improvement over their corresponding standard methods and (b) their improved ability to handle datasets with large proportions of missing values.

Approximate Dynamic Imputation

Usage

As a best-practice recommendation, we propose the following sequence of steps:

We run the Maximum Weight Spanning Tree algorithm to learn the first network structure.
Given the now-improved imputation quality, we start another structural learning algorithm, such as EQ, which may produce a more complex network.