Learning | Clustering | Data Clustering
Data Clustering is a form of unsupervised learning used to segment data. The algorithm outputs a new variable, [Factor_i], whose states correspond to the created segments.
Data Clustering can be used for different purposes:
- For finding observations that look the same;
- For finding observations that behave the same;
- For representing an unobserved dimension;
- For summarizing a set of variables;
- For compactly representing the joint probability distribution.
From a technical point of view, each segment should:
- Have clear differences from the other segments.
From a functional point of view, the segments should be:
- Easy to understand;
- A fair representation of the data.
Data Clustering with Bayesian networks is classically based on a Naive structure in which the newly created variable [Factor_i] is the parent of the variables that are used for clustering (usually called the Manifest variables).
Since this variable is hidden, i.e., it has 100% missing values, the marginal probability distribution of [Factor_i] and the conditional probability distributions of the Manifest variables are initialized with random distributions. An Expectation-Maximization algorithm is then used to fit these distributions to the data:
- Expectation: the network, with its current distributions, is used to compute the posterior probabilities of [Factor_i] for the entire set of observations in the data set. These probabilities are used for soft imputation of [Factor_i];
- Maximization: based on this imputation, the distributions of the network are updated via Maximum Likelihood. The algorithm then returns to the Expectation step until no significant changes occur in the distributions.
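The Expectation-Maximization loop described above can be sketched as follows. This is a minimal illustration of a Naive-structure latent class model with one hidden factor and binary manifest variables; the data, the number of states, and the convergence threshold are all toy assumptions, not BayesiaLab's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data set: 200 observations described by 3 binary Manifest variables.
data = rng.integers(0, 2, size=(200, 3))
n, m = data.shape
K = 2  # number of states of the hidden [Factor] variable (an assumption)

# Random initialization of P(Factor) and P(Manifest_j = 1 | Factor).
prior = rng.dirichlet(np.ones(K))
cond = rng.uniform(0.2, 0.8, size=(K, m))

for _ in range(100):
    # Expectation: posterior P(Factor | observation) for every row,
    # used as a soft imputation of the hidden variable.
    log_post = (np.log(prior)
                + data @ np.log(cond).T
                + (1 - data) @ np.log(1 - cond).T)
    post = np.exp(log_post - log_post.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)

    # Maximization: Maximum-Likelihood update of the distributions
    # based on the soft imputation.
    new_prior = post.mean(axis=0)
    new_cond = (post.T @ data) / post.sum(axis=0)[:, None]
    new_cond = np.clip(new_cond, 1e-6, 1 - 1e-6)

    # Stop when the distributions no longer change significantly.
    if np.abs(new_prior - prior).max() < 1e-6:
        break
    prior, cond = new_prior, new_cond

# Hard imputation of [Factor]: the most probable state per observation.
segments = post.argmax(axis=1)
```

The soft imputation (the `post` matrix) is what distinguishes this from hard-assignment clustering such as k-means: each observation contributes fractionally to every segment's distribution update.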
New Feature: Meta-Clustering
This new feature has been added to improve the stability of the induced solution (the 3rd technical quality). It consists of applying Data Clustering to a data set made of a subset of the Factors that were created while searching for the best segmentation. The final solution is thus a summary of the best solutions that have been found.
The five Manifest variables at the very bottom are used in the data set for describing the observations.
The Factor variables [Factor_1], [Factor_2], and [Factor_3] have been induced with Data Clustering, and then imputed to create new columns in the data set.
In this example, three Factor variables are thus used for creating the final solution [Factor_4].
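The pipeline above can be sketched as follows. This is a hypothetical illustration with toy binary factor columns; grouping observations by identical factor profiles stands in for the full Data Clustering pass that would actually induce [Factor_4].

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Imputed factor columns produced by three earlier Data Clustering runs
# (toy values; real factors come from the EM-based clustering itself).
factors = np.column_stack([
    rng.integers(0, 2, n),  # imputed [Factor_1]
    rng.integers(0, 2, n),  # imputed [Factor_2]
    rng.integers(0, 2, n),  # imputed [Factor_3]
])

# Meta-Clustering treats these factor columns as the Manifest variables of a
# second clustering run. Here we use the degenerate exact-match case: each
# distinct combination of factor states becomes one state of [Factor_4].
profiles, factor_4 = np.unique(factors, axis=0, return_inverse=True)
```

In the real procedure, [Factor_4] would again be induced by Expectation-Maximization over the factor columns, so similar (not only identical) profiles can be merged into one summary segment.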
New Feature: Multinet
Usual Data Clustering with Bayesian network is
New Feature: Heterogeneity
New Feature: Random Weights