Child pages
  • Data Clustering (7.0)

You are viewing an old version of this page. View the current version.

Compare with Current View Page History

« Previous Version 14 Next »


The root page BlabC:BayesiaLab Home could not be found in space BayesiaLab.


Learning | Clustering | Data Clustering

Data Clustering is a form of unsupervised learning that is utilized to segment the data. The output of the algorithm is a new variable,  [Factor_i]. The states of this new variable correspond to the created segments. 

Data Clustering can be used for different purposes:

  1. For finding observations that look the same;
  2. For finding observations that behave the same;
  3. For representing an unobserved dimension;
  4. For summarizing a set of variables;
  5. For compactly representing the joint probability distribution.

From a technical point of view, the segment should be:

  1. Homogeneous/pure;
  2. Have clear differences with the other segments;
  3. Stable.

From a functional point of view, the segments should be:

  1. Easy to understand;
  2. Operational;
  3. A fair representation of the data.

Data Clustering with Bayesian networks is classically based on a Naive structure in which the newly created variable [Factor_i] is the parent of the variables that are used for clustering (usually called the Manifest variables). 

This variable being hidden, i.e. with 100% of missing values, the marginal probability distribution of [Factor_i] and the conditional probability distributions of the Manifest variables are initialized with random distributions. Thus, an Expectation-Maximization (EM) algorithm is used to fit these distributions with the data:

  1. Expectation: the network is used with its current distributions for computing the posterior probabilities of [Factor_i], for the entire set of observations described in the data set; These probabilities are used for soft imputing [Factor_i];
  2. Maximization: based on this imputation, the distributions of the network are updated via Maximum-Likelihood. Then, the algorithm goes back to Expectation, until no significant changes occur to the distributions.


Data Clustering has been updated in versions 5.1 and 5.2.

New Feature: Meta-Clustering

This new feature has been added for improving the stability of the induced solution (3rd technical quality). It consists in using Data Clustering on the data set made of a subset of the Factors that have been created while searching for the best segmentation. The final solution is thus a summary of the best solutions that have been found (4th purpose). 

The five Manifest variables at the very bottom are used in the data set for describing the observations.

The Factor variables  [Factor_1],  [Factor_2], and [Factor_3] have been induced with Data Clustering, and then imputed to create new columns in the data set.

In this example, three Factor variables are thus used for creating the final solution [Factor_4].


Let's use a data set that contains house sale prices for King County, which includes Seattle. It includes homes sold between May 2014 and May 2015. More precisely, we have extracted the 94 houses that are more than 100 years old, that have been renovated, and come with a basement. For the sake of simplicity, we are just describing the houses here with the 5 Manifest variables below, discretized into 2 bins.

  • grade:Overall grade given to the housing unit
  • sqrt_above: Square footage of house apart from basement     
  • sqft_living15: Living room area in 2015
  • sqft_lot: Square footage of the lot
  • lat: Latitude coordinate

The wizard below describes the setting we used for segmenting these houses:

After 100 steps, the best solution consists in segmenting the houses into 4 groups. The mapping below (Analysis | Report | Target | Relationship with Target Node | Mapping) shows the created segments

The size of each segment is proportional to its marginal probability (i.e. how many houses belong to this segment), the intensity of the blue is proportional to the purity of the associated cluster (1st technical quality), and the geometrical proximity is defined by the neighborhood.

This Radar Chart (Analysis | Report | Target | Posterior Mean Analysis | Radar Chart) allows interpreting the generated segments. As we can see, they are easily distinguishable (2nd technical quality).

This solution with 4 segments thus satisfies the first two technical qualities listed above. What about the 3rd one, the stability? Below are the scores of the 10 best solutions that have been generated while learning:

As we can see, even though the best solution is made of 4 segments, this is the only solution with 4 clusters, all the other ones having pretty much the same score but with 3 clusters. We can thus assume that a solution with 3 clusters would be more stable.

Using Meta-Clustering on the 10 best solutions (10%) indeed generates a final solution made of 3 clusters.

This mapping juxtaposes the mapping of the initial solution with 4 segments (lower opacity) and the one created via meta-clustering.

The relationships between the final and initial segments are as follows:

  • C1 groups C4 and C2 (the main difference between C4 and C2 was the Square footage of the lot),
  • C2 corresponds to C3
  • C3 corresponds to C1


New Feature: Multinet

As stated in the Context,  Data Clustering with Bayesian network is usually done with Expectation-Maximization (EM) on a Naive structure. This is thus based on the hypothesis that the Manifests variables are conditionally independent of each others given [Factor_i]. This structure is thus well suited for finding observations that look the same (1st purpose), but not so good for finding observations that behave the same (2nd purpose).

Our new Multinet clustering is an EM2 algorithm based both on a Naive structure (Look) and on a set of Maximum Weight Spanning Trees (MWST) (Behavior). Once the distributions of the Naive randomly set, the algorithm works as follows:

  1. Expectation_Naive: the Naive network is used with its current distributions for computing the posterior probabilities of [Factor_i], for the entire set of observations described in the data set; These probabilities are used for hard imputing [Factor_i], i.e. choosing the state with the highest posterior probability;
  2. Maximization_MWST[Factor_i] is used as a breakout variable. A MWST is learned on each subset of data. 
  3. Expectation_MWST: the joint probabilities of the observations are computed with each MWST and used for updating the imputation of [Factor_i].
  4. Maximization_Naive: based on this updated imputation, the distributions of the Naive network are updated via Maximum-Likelihood. Then, the algorithm goes back to Expectation_Naive, until no significant changes occur to the distributions.

Two parameters allow changing the Look/Behavior equilibrium. They can be considered as probabilities to run the Naive and MWST steps at each iteration.

Setting a weight of 0 for Behavior defines a Data Clustering quite similar to the usual one, but based on hard imputation instead of soft imputation.


Let's use the same data set that describes houses in Seattle.

The wizard below describes the setting we used for segmenting these houses:

After 100 steps, the best solution consists in segmenting the houses into 3 groups. The final network is a Naive Augmented Network, with a direct link between 2 Manifests variables (that are thus not independent given the segmentation, i.e. the Behavior part).



New Feature: Heterogeneity

New Feature: Random Weights