Data classification is one of the most common tasks in statistical analysis, and countless methods have been developed for it over time. A common approach is to develop a model from historical data, i.e. records whose class membership is known, and then to use this model to predict the class membership of new observations.
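The learn-then-predict pattern described above can be sketched in a few lines of Python. This is only a minimal illustration, assuming a Gaussian naive Bayes classifier (one of the simplest classifiers that can be drawn as a Bayesian network) and a handful of made-up records. The function names `fit` and `predict`, the two numeric features, and the toy values are all hypothetical and are not taken from the white paper or from any real dataset.

```python
import math
from collections import defaultdict


def fit(records, labels):
    """Estimate a prior and per-feature Gaussian parameters for each class."""
    by_class = defaultdict(list)
    for x, y in zip(records, labels):
        by_class[y].append(x)

    model = {}
    n = len(records)
    for y, rows in by_class.items():
        stats = []
        for col in zip(*rows):  # iterate over feature columns
            mean = sum(col) / len(col)
            # Small constant keeps the variance strictly positive.
            var = sum((v - mean) ** 2 for v in col) / len(col) + 1e-9
            stats.append((mean, var))
        model[y] = (len(rows) / n, stats)
    return model


def predict(model, x):
    """Return the class with the highest log-posterior score for record x."""
    best_label, best_score = None, -math.inf
    for y, (prior, stats) in model.items():
        score = math.log(prior)
        for v, (mean, var) in zip(x, stats):
            # Log of the Gaussian density for this feature value.
            score += -0.5 * math.log(2 * math.pi * var) - (v - mean) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = y, score
    return best_label


# Toy "historical" records with known classes (invented values).
history = [(1.0, 2.1), (1.2, 1.9), (0.9, 2.0), (3.1, 4.0), (2.9, 4.2), (3.0, 3.8)]
classes = ["benign", "benign", "benign", "malignant", "malignant", "malignant"]

model = fit(history, classes)

# New observations whose class membership is unknown.
new_observations = [(1.1, 2.0), (3.0, 4.1)]
predictions = [predict(model, x) for x in new_observations]
print(predictions)
```

The sketch treats every feature as independent given the class, which is a strong simplification. It is meant only to make the two phases, learning from labeled records and predicting for unlabeled ones, concrete.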
Applications of data classification permeate virtually all fields of study, including the social sciences, engineering, and biology. In the medical field, classification problems often appear in the context of disease identification, i.e. making a diagnosis about a patient’s condition. The medical sciences have a long history of developing a large body of knowledge that links observable symptoms with known types of illnesses. It is the physician’s task to use the available medical knowledge to make inferences based on the patient’s symptoms, i.e. to classify the medical condition in order to enable appropriate treatment.
Over the last two decades, so-called medical expert systems have emerged, which are meant to support physicians in their diagnostic work. Given the sheer amount of medical knowledge in existence today, it should not be surprising that significant benefits are expected from such machine-based support in terms of medical reasoning and inference.
In this context, several papers by Wolberg, Street, Heisey and Mangasarian became much-cited examples. They proposed an automated method for the classification of Fine Needle Aspirates (FNA) through image processing and machine learning, with the objective of achieving greater accuracy in distinguishing between malignant and benign cells for the diagnosis of breast cancer. At the time of their study, the practice of visual inspection of FNA yielded inconsistent diagnostic accuracy. The proposed new approach would increase this accuracy reliably to over 95%. This research was quickly translated into clinical practice and has since been applied with continued success.
As part of their studies in the late 1980s and 1990s, the research team generated what became known as the Wisconsin Breast Cancer Database, which contains measurements of hundreds of FNA samples and the associated diagnoses. This database has been extensively studied, even outside the medical field. Statisticians and computer scientists have proposed a wide range of techniques for this classification problem and have continuously raised the benchmark for predictive performance.
Our objective with this paper is to present Bayesian networks as a highly practical framework for working with this kind of classification problem. We intend to demonstrate how the BayesiaLab software can quickly and relatively simply create Bayesian network models that match the performance of the best custom-developed models, while requiring only a fraction of the development time.
Furthermore, we wish to illustrate how Bayesian networks can help researchers and practitioners generate a deeper understanding of the underlying problem domain. Beyond merely producing predictions, we can use Bayesian networks to precisely quantify the importance of individual variables and employ BayesiaLab to help identify the most efficient path towards a diagnosis.
BayesiaLab’s speed of model building, its excellent classification performance, and its ease of interpretation provide researchers with a powerful new tool. Bayesian networks and BayesiaLab have thus become a driver in accelerating research.