With the abundance of “big data” in the field of analytics, and all the challenges today’s immense data volume is causing, it may not be particularly fashionable or pressing to discuss missing values. After all, who cares about missing data points when there are petabytes of more observations out there?
As the objective of any data gathering process is to gain knowledge about a domain, missing values are obviously undesirable. A missing datum does without a doubt reduce our knowledge about any individual observation, but implications for our understanding of the whole domain may not be so obvious, especially when there seems to be an endless supply of data.
Missing values are encountered in virtually all real-world data collection processes. Missing values could be the result of non-responses in surveys, poor record-keeping, server outages, attrition in longitudinal surveys or the faulty sensors of a measuring device, etc. What’s often overlooked is that not properly handling missing observations can lead to misleading interpretations or create a false sense of confidence in one’s findings, regardless of how many more complete observations might be available.
Despite the intuitive nature of this problem, and the fact that almost all quantitative studies are affected by it, applied researchers have given it remarkably little attention in practice. Burton and Altman (2004) state this predicament very forcefully in the context of cancer research: “We are concerned that very few authors have considered the impact of missing covariate data; it seems that missing data is generally either not recognized as an issue or considered a nuisance that it is best hidden.”
As missing values processing (beyond the naïve ad-hoc approaches) can be a demanding task, both methodologically and computationally, the principal objective of this paper is to propose a new and hopefully easier approach by employing Bayesian networks. It is not our intention to open the proverbial “new can of worms”, and thus distract researchers from their principal study focus, but rather we want to demonstrate that Bayesian networks can reliably, efficiently and intuitively integrate missing values processing into the main research task.