Let's use a dataset that contains house sale prices for King County, which includes the city of Seattle, Washington. It describes homes sold between May 2014 and May 2015. More specifically, we have extracted 94 houses that are more than 100 years old, have been renovated, and have a basement. For simplicity, we describe the houses with only the 5 Manifest variables below, each discretized into 2 bins.

  • grade: Overall grade given to the housing unit
  • sqft_above: Square footage of house, apart from basement
  • sqft_living15: Living room area in 2015
  • sqft_lot: Square footage of the lot
  • lat: Latitude coordinate

The wizard below shows the settings used for segmenting these houses:

After 100 steps, segmenting the houses into 4 groups is the best solution. Below, the mapping (Analysis | Report | Target | Relationship with Target Node | Mapping) shows the created states/segments:

  • the size of each segment is proportional to its marginal probability (i.e. how many houses belong to each segment),
  • the intensity of the blue is proportional to the purity of the associated cluster (1st technical quality), and
  • the layout reflects the neighborhood.

This radar chart (Analysis | Report | Target | Posterior Mean Analysis | Radar Chart) allows interpreting the generated segments. As we can see, they are easily distinguishable (2nd technical quality).

Thus, the solution with 4 segments satisfies the first two technical qualities listed above. However, what about the 3rd one, the stability? Below are the scores of the 10 best solutions that have been generated while learning:

Even though the best solution is made of 4 segments, it is the only solution with 4 clusters. All the others have nearly the same score, but with 3 clusters. Thus, we can assume that a solution with 3 clusters would be more stable.

Using Meta-Clustering on the 10 best solutions (10%) indeed generates a final solution made of 3 clusters.

This mapping juxtaposes the mapping of the initial solution with 4 segments (lower opacity) and that of the meta-clustering solution.

The relationships between the final and initial segments are as follows:

  • C1 groups C4 and C2 (the main difference between C4 and C2 was the Square footage of the lot),
  • C2 corresponds to C3,
  • C3 corresponds to C1.



As stated in the Context, Data Clustering with Bayesian networks is typically done with Expectation-Maximization (EM) on a Naive structure. Thus, it is based on the hypothesis that the Manifest variables are conditionally independent of each other given [Factor_i]. Therefore, the Naive structure is well suited for finding observations that look the same (1st purpose), but not so good for finding observations that behave similarly (2nd purpose). The behavior should be represented by direct relationships between the Manifest variables.

Our new Multinet clustering is an EM2 algorithm based both on a Naive structure (Look) and on a set of Maximum Weight Spanning Trees (MWST) (Behavior). Once the distributions of the Naive are randomly set, the algorithm works as follows:

  1. Expectation_Naive: the Naive network is used with its current distributions for computing the posterior probabilities of [Factor_i] for the entire set of observations described in the data set. These probabilities are used for hard-imputing [Factor_i], i.e. choosing the state with the highest posterior probability.
  2. Maximization_MWST: [Factor_i] is used as a breakout variable. An MWST is learned on each subset of data.
  3. Expectation_MWST: the joint probabilities of the observations are computed with each MWST and used for updating the imputation of [Factor_i].
  4. Maximization_Naive: based on this updated imputation, the distributions of the Naive network are updated via Maximum-Likelihood. Then, the algorithm goes back to Expectation_Naive, until no significant changes occur to the distributions.
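The four steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the implementation used by the product: the function names (`multinet_cluster`, `learn_mwst`, and so on) are hypothetical, the trees are Chow-Liu trees built from pairwise Mutual Information with Prim's algorithm, the counts are Laplace-smoothed, and step 3 simply re-assigns each observation to the tree that gives it the highest joint probability. Data is assumed to be an integer array with states coded 0..k-1.

```python
import numpy as np

def mutual_information(a, b, k, alpha=1e-3):
    """Smoothed empirical mutual information between two discrete columns."""
    joint = np.full((k, k), alpha)
    np.add.at(joint, (a, b), 1.0)
    joint /= joint.sum()
    indep = joint.sum(1, keepdims=True) @ joint.sum(0, keepdims=True)
    return float((joint * np.log(joint / indep)).sum())

def learn_mwst(X, k):
    """Prim's algorithm on pairwise MI; edges are (parent, child), rooted at column 0."""
    d = X.shape[1]
    w = np.array([[mutual_information(X[:, i], X[:, j], k) for j in range(d)]
                  for i in range(d)])
    in_tree, edges = [0], []
    while len(in_tree) < d:
        out = [j for j in range(d) if j not in in_tree]
        p, c = max(((i, j) for i in in_tree for j in out), key=lambda e: w[e])
        edges.append((p, c))
        in_tree.append(c)
    return edges

def tree_log_joint(X, X_fit, edges, k, alpha=1.0):
    """Log joint probability of each row of X under a tree fitted on X_fit."""
    root = np.bincount(X_fit[:, 0], minlength=k) + alpha
    ll = np.log(root / root.sum())[X[:, 0]]
    for p, c in edges:
        counts = np.full((k, k), alpha)
        np.add.at(counts, (X_fit[:, p], X_fit[:, c]), 1.0)
        cpt = counts / counts.sum(1, keepdims=True)      # P(child | parent)
        ll = ll + np.log(cpt)[X[:, p], X[:, c]]
    return ll

def fit_naive(X, z, n_clusters, k, alpha=1.0):
    """Smoothed maximum-likelihood Naive parameters from hard assignments z."""
    prior = np.bincount(z, minlength=n_clusters) + alpha
    cpts = np.full((n_clusters, X.shape[1], k), alpha)
    for j in range(X.shape[1]):
        np.add.at(cpts[:, j, :], (z, X[:, j]), 1.0)
    return prior / prior.sum(), cpts / cpts.sum(2, keepdims=True)

def naive_log_posterior(X, prior, cpts):
    """Unnormalised log P(factor state | row) under the Naive structure."""
    lp = np.tile(np.log(prior), (len(X), 1))
    for j in range(X.shape[1]):
        lp += np.log(cpts[:, j, :])[:, X[:, j]].T
    return lp

def multinet_cluster(X, n_clusters, k, n_iter=50, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    # Randomly set the distributions of the Naive network.
    prior = rng.dirichlet(np.ones(n_clusters))
    cpts = rng.dirichlet(np.ones(k), size=(n_clusters, X.shape[1]))
    for _ in range(n_iter):
        # 1. Expectation_Naive: hard-impute the factor from the Naive posteriors.
        z = naive_log_posterior(X, prior, cpts).argmax(1)
        # 2. Maximization_MWST: one tree per subset of data.
        trees = [learn_mwst(X[z == c], k) for c in range(n_clusters)]
        # 3. Expectation_MWST: update the imputation with the tree joint probabilities.
        ll = np.column_stack([tree_log_joint(X, X[z == c], trees[c], k)
                              for c in range(n_clusters)])
        z = ll.argmax(1)
        # 4. Maximization_Naive: refit the Naive distributions, then loop.
        new_prior, new_cpts = fit_naive(X, z, n_clusters, k)
        change = np.abs(new_cpts - cpts).max()
        prior, cpts = new_prior, new_cpts
        if change < tol:
            break
    return z, prior, cpts, trees
```

For 94 observations described by 5 binary variables, a call such as `multinet_cluster(X, n_clusters=3, k=2)` returns the hard assignments, the Naive distributions, and one tree (a list of 4 edges) per segment.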



Let's use the same data set that describes houses in Seattle.

The wizard below shows the settings we used for segmenting the houses:

After 100 steps, segmenting the houses into three groups is the best solution. The final network is a Naive Augmented Network, with a direct link between two Manifest variables, which are, therefore, not independent given the segmentation (i.e. the Behavior part). Note that this dependency is valid for C3 only, which can be seen after performing inference with the network.

The radar chart allows analyzing the Look of the segments.



The assumption that the data is homogeneous, given all the Manifest variables, can sometimes be unrealistic. There may be significant heterogeneity in the data across unobserved groups, and it can bias the machine-learned Bayesian networks. This phenomenon is known as Unobserved Heterogeneity, i.e. an unobserved variable in the dataset.

Data Clustering represents a solution for searching for such unobserved groups (3rd purpose). However, whereas the default scoring function in Data Clustering is based on the entropy of the data, finding heterogeneous groups requires modifying the scoring function.

We thus defined a Heterogeneity Index:




The Heterogeneity Weight allows setting a weight of the Heterogeneity Index in the score, which will, therefore, bias the selection of the solutions toward segmentations that maximize the Mutual Information of the Manifest variables with the Target Node.


Let's use the entire data set that describes houses in Seattle, with this subset of Manifest variables:

  • Renovated: indicates if the house has been renovated
  • Age: Age of the house
  • sqft_living15: Living room area in 2015
  • long: Longitude coordinate
  • lat: Latitude coordinate
  • Price (K$): Price of the house.

After setting Price (K$) as a Target Node and selecting all the other variables, we use the following settings for Data Clustering:

This returns a solution with 2 segments, generating a Heterogeneity Index of 60%. This indicates, therefore, that using [Factor_i] as a breakout variable would increase the sum of the Mutual Informations of the Manifest variables with the Target Node by 60%.
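To make this interpretation concrete, here is a minimal sketch that computes such a relative increase on discrete data. It is an assumption derived from the sentence above, not the documented formula: the hypothetical function `heterogeneity_index` compares the sum of the Mutual Informations on the entire data set with the segment-probability-weighted sum of the Mutual Informations computed inside each segment.

```python
import numpy as np

def mutual_information(x, t):
    """Empirical mutual information (bits) between two discrete 1-D arrays."""
    _, xi = np.unique(x, return_inverse=True)
    _, ti = np.unique(t, return_inverse=True)
    joint = np.zeros((xi.max() + 1, ti.max() + 1))
    np.add.at(joint, (xi, ti), 1.0)
    joint /= joint.sum()
    indep = joint.sum(1, keepdims=True) @ joint.sum(0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / indep[nz])).sum())

def heterogeneity_index(manifests, target, factor):
    """Relative increase of the summed MI when the data is split by `factor`.

    Assumed definition (hypothetical): (split - base) / base, where `split`
    weights each segment's summed MI by the segment's marginal probability.
    """
    base = sum(mutual_information(x, target) for x in manifests)
    split = 0.0
    for state in np.unique(factor):
        mask = factor == state
        split += mask.mean() * sum(mutual_information(x[mask], target[mask])
                                   for x in manifests)
    return (split - base) / base

# Toy data: x is unrelated to the target overall, but fully determines it
# inside each segment; y always equals the target.
factor = np.array([0, 0, 0, 0, 1, 1, 1, 1])
x = np.array([0, 0, 1, 1, 0, 0, 1, 1])
target = np.array([0, 0, 1, 1, 1, 1, 0, 0])
y = target.copy()
print(heterogeneity_index([x, y], target, factor))
```

With this definition, a segmentation that does not change the Mutual Informations gives an index of 0, and an index of 0.6 corresponds to the 60% increase described above.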

The Multi-Quadrant below highlights the improvement of the Mutual Information. The points correspond to the Mutual Informations on the entire data set, and the vertical scales show the variations of the Mutual Informations obtained by splitting the data based on the values of [Factor_i].


By default, the weight associated with a variable is set to 1. Whereas a weight of 0 renders the variable purely illustrative, a weight of 2 is equivalent to duplicating the variable in the dataset. The option Mutual Information Weight, introduced in version 5.1, allows weighting the variable by taking into account its Mutual Information with the Target Node.
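The equivalence between a weight of 2 and a duplicated variable can be checked on a Naive structure, where each variable contributes one log-likelihood term per segment. The sketch below assumes that a weight acts as a multiplier on that term. This is an illustration of the statement above, not a description of how the weights are applied internally, and `weighted_log_posterior` is a hypothetical helper.

```python
import numpy as np

def weighted_log_posterior(X, log_prior, log_cpts, weights):
    """Unnormalised log P(segment | row), with one weight per variable.

    log_cpts[j] has shape (n_segments, n_states_j) and holds
    log P(variable j = state | segment).
    """
    lp = np.tile(log_prior, (len(X), 1))
    for j, w in enumerate(weights):
        lp += w * log_cpts[j][:, X[:, j]].T
    return lp

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(10, 2))                    # two binary variables
log_prior = np.log(np.array([0.4, 0.6]))                # two segments
log_cpts = [np.log(rng.dirichlet(np.ones(2), size=2)) for _ in range(2)]

# Weight 2 on the first variable ...
weighted = weighted_log_posterior(X, log_prior, log_cpts, [2.0, 1.0])
# ... versus the first variable duplicated, all weights equal to 1.
X_dup = np.column_stack([X[:, 0], X])
duplicated = weighted_log_posterior(
    X_dup, log_prior, [log_cpts[0]] + log_cpts, [1.0, 1.0, 1.0])
print(np.allclose(weighted, duplicated))
```

In the same way, a weight of 0 removes the variable's term from the posterior, so the variable no longer influences the segmentation.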

As of version 7.0, a new option, Random Weights, allows modifying the weight values randomly while trying to find the best segmentation. The amplitude of the randomness is inversely proportional to the current number of trials, thus starting with the maximum level of randomness and ending with almost no weight modification. This option can be useful for escaping from local minima.
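A decaying perturbation of this kind could look like the following sketch. The exact scheme is not specified here, so the details are assumptions: the hypothetical function `random_weights` applies multiplicative uniform noise whose amplitude is `max_amplitude / trial`, with trials counted from 1, and clips the result at 0.

```python
import numpy as np

def random_weights(base_weights, trial, max_amplitude=1.0, rng=None):
    """Perturb the weights with noise that shrinks as 1 / trial (trial >= 1)."""
    rng = np.random.default_rng() if rng is None else rng
    base = np.asarray(base_weights, dtype=float)
    amplitude = max_amplitude / trial
    noise = rng.uniform(-amplitude, amplitude, size=base.shape)
    return np.clip(base * (1.0 + noise), 0.0, None)

rng = np.random.default_rng(0)
for trial in (1, 10, 100):
    print(trial, random_weights([1.0, 1.0, 1.0], trial, rng=rng))
```

Early trials explore weightings far from the default, while late trials leave the weights almost untouched, which matches the behavior described above.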