
Learning | Clustering | Data Clustering

Data Clustering is a form of unsupervised learning used to segment the data. The output of the algorithm is a new variable, [Factor_i], whose states correspond to the segments created.

There are various reasons to use Data Clustering:

  • For finding observations that look the same;
  • For finding observations that behave the same;
  • For representing an unobserved dimension;
  • For compactly representing the joint probability distribution.

From a technical point of view, the segments should be:

  1. Homogeneous (pure);
  2. Clearly different from the other segments;
  3. Stable.

From a functional point of view, the segments should be:

  1. Easy to understand;
  2. Operational;
  3. A fair representation of the data.


Data Clustering has been updated in versions 5.1 and 5.2.

New Feature: Meta-Clustering

This feature was added to improve the stability of the induced solution (the third technical quality listed above). It consists of running Data Clustering on a data set made up of a subset of the Factors created during learning. The final solution is thus a summary of the best solutions found.

The five variables at the very bottom are called Manifest variables. They are used in the data set for describing the observations.

The Factor variables [Factor_1], [Factor_2], and [Factor_3] have been induced with Data Clustering and then imputed to create new columns in the data set.

In this example, three Factor variables are thus used for creating the final solution [Factor_4].
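The idea of clustering on the Factor columns can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it swaps in a simple k-modes procedure (Hamming distance on the Factor states) for BayesiaLab's own Data Clustering algorithm, and the function name `meta_cluster`, the toy Factor states, and the choice of 3 final clusters are all hypothetical.

```python
from collections import Counter

def hamming(a, b):
    """Number of positions at which two tuples of Factor states differ."""
    return sum(x != y for x, y in zip(a, b))

def meta_cluster(rows, k, max_iter=20):
    """Cluster observations described only by their Factor states.

    A plain k-modes stand-in, NOT BayesiaLab's algorithm.
    rows: list of tuples, one state per Factor for each observation.
    """
    # Deterministic initialisation: the first k distinct rows become the modes.
    modes = []
    for r in rows:
        if r not in modes:
            modes.append(r)
        if len(modes) == k:
            break
    if len(modes) < k:
        raise ValueError("fewer than k distinct rows")

    labels = None
    for _ in range(max_iter):
        # Assign each observation to the closest mode.
        new_labels = [min(range(k), key=lambda c: hamming(r, modes[c]))
                      for r in rows]
        if new_labels == labels:
            break
        labels = new_labels
        # Recompute each mode as the most frequent state per Factor.
        for c in range(k):
            members = [r for r, l in zip(rows, labels) if l == c]
            if members:
                modes[c] = tuple(Counter(col).most_common(1)[0][0]
                                 for col in zip(*members))
    return labels

# Made-up Factor states for six observations (illustrative only).
factor_1 = ["C1", "C1", "C2", "C2", "C3", "C3"]
factor_2 = ["A", "A", "B", "B", "C", "C"]
factor_3 = ["x", "x", "y", "y", "y", "z"]

rows = list(zip(factor_1, factor_2, factor_3))
factor_4 = meta_cluster(rows, k=3)  # plays the role of the final [Factor_4]
print(factor_4)
```

Because each Factor is treated as a categorical column, the state names do not need to match across Factors; only the agreement between solutions matters.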


Let's use a data set that contains house sale prices for King County, which includes Seattle. It includes homes sold between May 2014 and May 2015. More precisely, we have extracted the 94 houses that are more than 100 years old, that have been renovated, and come with a basement. For the sake of simplicity, we are just describing the houses here with the 5 Manifest variables below, discretized into 2 bins.

  • grade: Overall grade given to the housing unit
  • sqft_above: Square footage of house apart from basement
  • sqft_living15: Living room area in 2015
  • sqft_lot: Square footage of the lot
  • lat: Latitude coordinate
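The selection and discretization described above could be sketched as follows. The column names `yr_built`, `yr_renovated`, and `sqft_basement`, the reference year 2015, and the median split used for the 2 bins are assumptions made for illustration; the page does not state which columns or discretization method were used. The sample records are made up.

```python
from statistics import median

def select_houses(rows, ref_year=2015):
    """Keep houses that are over 100 years old, renovated, and have a basement.

    Assumes (hypothetically) that yr_renovated == 0 means "never renovated"
    and sqft_basement == 0 means "no basement".
    """
    return [r for r in rows
            if ref_year - r["yr_built"] > 100
            and r["yr_renovated"] > 0
            and r["sqft_basement"] > 0]

def two_bins(values):
    """Discretize into 2 bins with a median split (an assumed method)."""
    cut = median(values)
    return ["low" if v <= cut else "high" for v in values]

# Made-up records, for illustration only.
houses = [
    {"yr_built": 1905, "yr_renovated": 1990, "sqft_basement": 400, "grade": 7},
    {"yr_built": 1910, "yr_renovated": 0,    "sqft_basement": 300, "grade": 8},  # not renovated
    {"yr_built": 1900, "yr_renovated": 2005, "sqft_basement": 0,   "grade": 6},  # no basement
    {"yr_built": 1960, "yr_renovated": 2000, "sqft_basement": 500, "grade": 9},  # too recent
    {"yr_built": 1912, "yr_renovated": 1985, "sqft_basement": 250, "grade": 9},
    {"yr_built": 1908, "yr_renovated": 2010, "sqft_basement": 600, "grade": 8},
]

selected = select_houses(houses)
grade_bins = two_bins([h["grade"] for h in selected])
print(len(selected), grade_bins)
```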

The wizard below describes the setting we used for segmenting these houses:

After 100 steps, the best solution segments the data into 4 groups.


However, below are the scores of the 10 best solutions that have been generated while learning:

As we can see, the best solution has 4 segments, but it is the only solution with 4 clusters. All the other solutions have 3 clusters and nearly the same score. We can thus assume that a solution with 3 clusters would be more stable.

Using Meta-Clustering on the 10 best solutions (10%) indeed generates a final solution made of 3 clusters.
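The reasoning behind keeping the top 10% of solutions can be illustrated with a short tally. This sketch is not part of BayesiaLab: the function `top_solutions`, the score values, and the assumption that a lower score is better are all hypothetical. The toy list only mirrors the pattern described above (one 4-cluster solution ranked first, followed by nine 3-cluster solutions).

```python
from collections import Counter

def top_solutions(solutions, fraction=0.10, lower_is_better=True):
    """Return the best `fraction` of (score, n_clusters) pairs.

    Whether a lower score is better is an assumption of this sketch.
    """
    n_keep = max(1, int(round(fraction * len(solutions))))
    ranked = sorted(solutions, key=lambda s: s[0], reverse=not lower_is_better)
    return ranked[:n_keep]

# Made-up scores for 100 learning steps (illustrative only).
solutions = (
    [(100.0, 4)]                                       # best solution, 4 clusters
    + [(100.1 + 0.01 * i, 3) for i in range(9)]        # nine close runners-up, 3 clusters
    + [(110.0 + i, 2 + i % 4) for i in range(90)]      # remaining, clearly worse
)

best = top_solutions(solutions)
cluster_counts = Counter(n for _, n in best)
print(cluster_counts.most_common())
```

In this toy tally, 3 clusters is the dominant choice among the retained solutions, which is the kind of agreement that Meta-Clustering summarizes.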


New Feature: Multinet

New Feature: Heterogeneity

New Feature: Random Weights