Data Clustering is an unsupervised learning algorithm that creates a new variable, [Factor_i]. The states of this variable represent the induced clusters.
Data Clustering can be used for different purposes:
From a technical perspective, the segment should be:
From a functional perspective, the segments should be:
Data Clustering with Bayesian networks is typically based on a Naive structure in which the newly created variable [Factor_i] is the parent of the variables that are used for clustering (usually called the Manifest variables).
This variable being hidden, i.e. with 100% of missing values, the marginal probability distribution of [Factor_i] and the conditional probability distributions of the Manifest variables are initialized with random distributions. Thus, an Expectation-Maximization (EM) algorithm is used to fit these distributions with the data:
Data Clustering has been updated in versions 5.1 and 5.2.
This new feature has been added for improving the stability of the induced solution (3rd technical quality). It consists in creating a data set made of a subset of the Factors that have been created while searching for the best segmentation, and then using Data Clustering on these new variables. The final solution is thus a summary of the best solutions that have been found (4th purpose).
The five Manifest variables (bottom of the graph) are used in the data set for describing the observations.
The Factor variables [Factor_1], [Factor_2], and [Factor_3] have been induced with Data Clustering. They are then imputed to create a new data set.
In this example, three Factor variables are used for creating the final solution [Factor_4].
Let's use a data set that contains house sale prices for King County, which includes Seattle. It includes homes sold between May 2014 and May 2015. More precisely, we have extracted the 94 houses that are more than 100 years old, that have been renovated, and come with a basement. For the sake of simplicity, we are just describing the houses here with the 5 Manifest variables below, discretized into 2 bins.
The wizard below describes the setting used for segmenting these houses:
After 100 steps, the best solution consists in segmenting the houses into 4 groups. The mapping below (Analysis | Report | Target | Relationship with Target Node | Mapping) shows the created states/segments:
This radar chart (Analysis | Report | Target | Posterior Mean Analysis | Radar Chart) allows interpreting the generated segments. As we can see, they are easily distinguishable (2nd technical quality).
This solution with 4 segments thus satisfies the first two technical qualities listed above. However, what about the 3rd one, the stability? Below are the scores of the 10 best solutions that have been generated while learning:
Even though the best solution is made of 4 segments, this is the only solution with 4 clusters, all the other ones having pretty much the same score, but with 3 clusters. Thus, we can assume that a solution with 3 clusters would be more stable.
Using Meta-Clustering on the 10 best solutions (10%) indeed generates a final solution made of 3 clusters.
This mapping juxtaposes the mapping of the initial solution with 4 segments (lower opacity) and of the one corresponding to the meta-clustering solution.
The relationships between the final and initial segments are as follows:
As stated in the Context, Data Clustering with Bayesian network is typically done with Expectation-Maximization (EM) on a Naive structure. Thus, this is based on the hypothesis that the Manifests variables are conditionally independent of each others given [Factor_i]. Therefore, the Naive structure is well suited for finding observations that look the same (1st purpose), but not so good for finding observations that behave similarly (2nd purpose). The behavior should be represented with direct relationships between the Manifests.
Our new Multinet clustering is an EM2 algorithm based both on a Naive structure (Look) and on a set of Maximum Weight Spanning Trees (MWST) (Behavior). Once the distributions of the Naive randomly set, the algorithm works as follows:
Two parameters allow changing the Look/Behavior equilibrium. They can be considered as probabilities to run the Naive and MWST steps at each iteration.
Setting a weight of 0 for Behavior defines a Data Clustering quite similar to the usual one, but based on hard imputation instead of soft imputation.
Let's use the same data set that describes houses in Seattle.
The wizard below describes the setting we used for segmenting these houses:
After 100 steps, the best solution consists in segmenting the houses into 3 groups. The final network is a Naive Augmented Network, with a direct link between 2 Manifests variables (that are thus not independent given the segmentation, i.e. the Behavior part). Note that this dependency is valid for C3 only (conclusion drawn after some inference with the network).
The radar chart allows analyzing the Look of the segments.
This new option allows selecting the segmentation that maximizes the Mutual Information of the Manifest variables with the Target Node.
Let's use the entire data set that describes houses in Seattle, with this subset of Manifest variables:
After setting Price (K$) as a Target Node and selecting all the other variables, we use the following settings for Data Clustering:
This returns a solution with 2 segments, generating an Heterogeneity Index of 60%. This indicates thus that using [Factor_i] as a breakout variable would allow increasing by 60% the sum of the Mutual Informations of the Manifest variables with the Target Node.
The Multi-Quadrant below highlights the improvement of the Mutual Information. The points correspond to the Mutual Informations on the entire data set, and the vertical scales shows the variations of the Mutual Informations by splitting the data based on the values of [Factor_i].
New Feature: Random Weights