Data Clustering
Data Clustering is an unsupervised learning algorithm that creates a new variable, [Factor_i]. The states of this variable represent the induced clusters.
Data Clustering can be used for different purposes:
- For finding observations that look the same;
- For finding observations that behave the same;
- For representing an unobserved dimension;
- For summarizing a set of variables;
- For compactly representing the joint probability distribution.
From a technical perspective, each segment should:
- Have clear differences from the other segments;
From a functional perspective, the segments should be:
- Easy to understand;
- A fair representation of the data.
Data Clustering with Bayesian networks is typically based on a Naive structure in which the newly created variable [Factor_i] is the parent of the variables that are used for clustering (usually called the Manifest variables).
Since this variable is hidden, i.e. it has 100% missing values, the marginal probability distribution of [Factor_i] and the conditional probability distributions of the Manifest variables are initialized with random distributions. An Expectation-Maximization (EM) algorithm is then used to fit these distributions to the data:
- Expectation: the network, with its current distributions, is used to compute the posterior probabilities of [Factor_i] for the entire set of observations in the data set. These probabilities are used for soft imputation of [Factor_i];
- Maximization: based on this imputation, the distributions of the network are updated via Maximum-Likelihood. The algorithm then returns to the Expectation step, iterating until no significant changes occur in the distributions.
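The EM loop above can be sketched in Python. This is a minimal NumPy illustration of EM on a Naive structure with a fully hidden factor, not BayesiaLab's implementation; all names and parameters are invented for the example:

```python
import numpy as np

def naive_em(data, n_states, n_clusters, n_iter=50, tol=1e-6, seed=0):
    """EM for a Naive structure: a hidden Factor pointing to each Manifest.

    data: (n_obs, n_vars) array of integer state indices.
    Returns the marginal of the Factor, the Manifest CPTs, and the
    posterior P(Factor | observation) per row (the soft imputation).
    """
    rng = np.random.default_rng(seed)
    n_obs, n_vars = data.shape
    # Random initialization of P(Factor) and P(Manifest_j | Factor)
    prior = rng.dirichlet(np.ones(n_clusters))
    cpts = [rng.dirichlet(np.ones(n_states), size=n_clusters) for _ in range(n_vars)]
    prev_ll = -np.inf
    for _ in range(n_iter):
        # Expectation: posterior of the Factor for each observation (soft imputation)
        log_post = np.tile(np.log(prior), (n_obs, 1))
        for j in range(n_vars):
            log_post += np.log(cpts[j][:, data[:, j]]).T
        log_lik = np.logaddexp.reduce(log_post, axis=1)
        post = np.exp(log_post - log_lik[:, None])
        # Maximization: Maximum-Likelihood update of all distributions
        prior = post.mean(axis=0)
        for j in range(n_vars):
            counts = np.full((n_clusters, n_states), 1e-6)  # smoothing avoids zeros
            for s in range(n_states):
                counts[:, s] += post[data[:, j] == s].sum(axis=0)
            cpts[j] = counts / counts.sum(axis=1, keepdims=True)
        # Stop when the log-likelihood no longer improves significantly
        if log_lik.sum() - prev_ll < tol:
            break
        prev_ll = log_lik.sum()
    return prior, cpts, post
```

The posterior matrix `post` is the soft imputation: each row distributes one observation over the states of [Factor_i] instead of committing to a single cluster.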
New Feature: Meta-Clustering
This new feature has been added to improve the stability of the induced solution (3rd technical quality). It consists of creating a data set made of a subset of the Factors created while searching for the best segmentation, and then running Data Clustering on these new variables. The final solution is thus a summary of the best solutions that have been found (4th purpose).
The five Manifest variables (bottom of the graph) are used in the data set for describing the observations.
The Factor variables [Factor_1], [Factor_2], and [Factor_3] have been induced with Data Clustering. They are then imputed to create a new data set.
In this example, three Factor variables are used for creating the final solution [Factor_4].
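Building the meta-clustering data set amounts to stacking the imputed Factor columns as new variables. A toy sketch with invented values, one row per observation:

```python
import numpy as np

# Hypothetical hard imputations of three induced Factors (values are invented)
factor_1 = np.array([0, 0, 1, 1, 2, 2])
factor_2 = np.array([0, 0, 1, 1, 1, 2])
factor_3 = np.array([0, 1, 1, 1, 2, 2])

# The Factors become the Manifest variables of a new data set;
# running Data Clustering on it induces the final summary variable, Factor_4.
meta_data = np.column_stack([factor_1, factor_2, factor_3])
```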
New Feature: Multinet
As stated in the Context, Data Clustering with Bayesian networks is typically done with Expectation-Maximization (EM) on a Naive structure. This rests on the hypothesis that the Manifest variables are conditionally independent of each other given [Factor_i]. The Naive structure is therefore well suited for finding observations that look the same (1st purpose), but not so good for finding observations that behave similarly (2nd purpose): behavior should be represented with direct relationships between the Manifest variables.
Our new Multinet clustering is an EM2 algorithm based both on a Naive structure (Look) and on a set of Maximum Weight Spanning Trees (MWST) (Behavior). Once the distributions of the Naive network have been randomly initialized, the algorithm works as follows:
- Expectation_Naive: the Naive network, with its current distributions, is used to compute the posterior probabilities of [Factor_i] for the entire set of observations in the data set. These probabilities are used for hard imputation of [Factor_i], i.e. choosing the state with the highest posterior probability;
- Maximization_MWST: [Factor_i] is used as a breakout variable, and an MWST is learned on each subset of the data;
- Expectation_MWST: the joint probabilities of the observations are computed with each MWST and used to update the imputation of [Factor_i];
- Maximization_Naive: based on this updated imputation, the distributions of the Naive network are updated via Maximum-Likelihood. The algorithm then returns to Expectation_Naive, iterating until no significant changes occur in the distributions.
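The Maximization_MWST step can be sketched as follows. This is a simplified illustration, not BayesiaLab's implementation: it assumes discrete Manifest variables and uses Chow-Liu-style pairwise mutual information as the edge weights of the spanning tree learned on one cluster's data subset:

```python
import numpy as np
from collections import Counter

def mutual_information(x, y):
    """Empirical mutual information (in nats) between two discrete columns."""
    n = len(x)
    pxy = Counter(zip(x, y)); px = Counter(x); py = Counter(y)
    return sum((c / n) * np.log((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in pxy.items())

def maximization_mwst(subset):
    """Sketch of Maximization_MWST for one cluster: learn a Maximum Weight
    Spanning Tree over the Manifest variables of this data subset, with
    edges weighted by pairwise mutual information."""
    n_vars = subset.shape[1]
    mi = np.zeros((n_vars, n_vars))
    for i in range(n_vars):
        for j in range(i + 1, n_vars):
            mi[i, j] = mi[j, i] = mutual_information(subset[:, i], subset[:, j])
    # Prim's algorithm on the MI matrix, maximizing instead of minimizing
    in_tree, edges = {0}, []
    while len(in_tree) < n_vars:
        w, i, j = max((mi[i, j], i, j)
                      for i in in_tree for j in range(n_vars) if j not in in_tree)
        edges.append((i, j))
        in_tree.add(j)
    return edges
```

In the full EM2 loop, one such tree is learned per state of [Factor_i], and the joint probability of each observation under each tree drives the updated imputation.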
Two parameters allow adjusting the Look/Behavior balance. They can be interpreted as the probabilities of running the Naive and MWST steps at each iteration.
Setting the Behavior weight to 0 defines a Data Clustering quite similar to the usual one, but based on hard imputation instead of soft imputation.
New Feature: Heterogeneity
This new option allows selecting the segmentation that maximizes the Mutual Information of the Manifest variables with the Target Node.
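As a simplified proxy for this selection criterion, the sketch below scores each candidate segmentation by the mutual information between its imputed Factor and the Target, and keeps the best one. The candidate names and values are invented, and this scoring is an assumption for illustration, not BayesiaLab's exact computation:

```python
import numpy as np
from collections import Counter

def mutual_info(x, y):
    """Empirical mutual information (in nats) between two discrete columns."""
    n = len(x)
    pxy = Counter(zip(x, y)); px = Counter(x); py = Counter(y)
    return sum((c / n) * np.log((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in pxy.items())

def select_segmentation(candidates, target):
    """Return the candidate Factor with the highest mutual information
    with the Target (hypothetical stand-in for the selection criterion)."""
    return max(candidates, key=lambda name: mutual_info(candidates[name], target))

# Invented example: Factor_A perfectly separates the Target, Factor_B is noise
target = [0, 0, 1, 1]
candidates = {"Factor_A": [0, 0, 1, 1], "Factor_B": [0, 1, 0, 1]}
select_segmentation(candidates, target)  # -> "Factor_A"
```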
New Feature: Random Weights