Question

I have built a segmentation model in BayesiaLab, where the resultant segments are derived from a set of induced factors. I now have a new set of data that I would like to classify into the segments created from the first set of data. In order to identify the best variables with which to do that, I ran an Augmented Markov Blanket on the manifest variables in the original model. Not surprisingly, given that the segments were created from factors, the overall precision was not great (~70%). Running the same procedure, but this time focusing only on the induced factors, yields a much higher precision (~96%).

So what I would like to be able to do is generate "factor scores" for each of the original induced factors for each respondent in the new dataset. Then, I would like to use the factors identified in the Augmented Markov Blanket procedure to classify new respondents into a given segment, and then save those classifications out to a data file.

Anyway, I was wondering: is this possible? And if so, how is it accomplished in BayesiaLab?

Answer

The process you are describing is what we call “Hierarchical Clustering” in BayesiaLab. We will first describe the workflow for Data Clustering based on Factors, and then show how to use the resulting network to classify a new set of observations into the created segments.

Here are the steps for creating the Hierarchical network:

  1. Run one of the Unsupervised Structural Learning algorithms on the Manifest variables (the original variables of your dataset) you want to include in your segmentation.
    The best choice is probably the Maximum Weight Spanning Tree. Even though the generated tree is usually not the best representation of the joint probability distribution,

    1. it is the fastest algorithm,
    2. its results are stable,
    3. this is only an intermediate step: the network will just be used for generating clusters of variables.
  2. Go to Validation Mode (F5).
  3. Run Variable Clustering.
  4. Go back to Modeling Mode (F4).
  5. Run Multiple Clustering to induce one Factor per cluster of variables:

    1. Check "Add all Nodes to the Final Network" to get the Factor and the Manifest in the final network.
    2. Check “Connect Factors to their Manifest Variables” to get a set of Naïve structures, one per Factor.
    3. Check "Forbid new Relations with Manifest Variables" to focus only on the Factors.
  6. Select all the nodes (Ctrl + A) and run Data Clustering. The constraints on the Manifests will prevent the connections with the cluster node.
Example

We do not want age and gender to be part of the clustering. We exclude them by pressing X while double-clicking on them; this can also be done via the node's contextual menu (Properties - Exclusion).

We run the Maximum Weight Spanning Tree.
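If you want to reproduce the idea of this step outside BayesiaLab, here is a minimal Python sketch, assuming discrete data in a hypothetical survey.csv: it weights every pair of variables by mutual information and keeps a maximum weight spanning tree (a Chow-Liu tree). This is only an open-source analogue, not BayesiaLab's implementation.

```python
import pandas as pd
import networkx as nx
from itertools import combinations
from sklearn.metrics import mutual_info_score

# Hypothetical file and column names; the data is assumed to be discrete.
df = pd.read_csv("survey.csv")
df = df.drop(columns=["age", "gender"])   # excluded from the clustering

# Weight every pair of variables by their mutual information.
G = nx.Graph()
for u, v in combinations(df.columns, 2):
    G.add_edge(u, v, weight=mutual_info_score(df[u], df[v]))

# Keep only the maximum-weight spanning tree (a Chow-Liu tree).
mwst = nx.maximum_spanning_tree(G)
print(sorted(mwst.edges(data="weight")))
```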

We run the Symmetric Layout algorithm (View - Layout - Symmetric Layout).

We go to Validation Mode (F5) and run the Variable Clustering (Learning - Clustering - Variable Clustering).

11 clusters are automatically created (we used a Maximum Cluster Size of 7; Options - Settings - Learning - Variable Clustering). The mapping below illustrates these 11 clusters. It is generated by clicking on the icon highlighted by the yellow circle. The name on each cluster corresponds to the strongest node in the cluster (Best Node Name option).

We validate the result by clicking on the icon highlighted by the green circle.
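Conceptually, Variable Clustering groups variables that are strongly related. As a rough open-source stand-in (not BayesiaLab's algorithm), one can run agglomerative clustering on a mutual-information-based distance between variables. The hypothetical sketch below simply cuts the dendrogram into 11 clusters, whereas BayesiaLab instead caps the number of variables per cluster.

```python
import pandas as pd
import numpy as np
from itertools import combinations
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from sklearn.metrics import mutual_info_score

df = pd.read_csv("survey.csv").drop(columns=["age", "gender"])  # hypothetical
cols = list(df.columns)
n = len(cols)

# Larger mutual information -> smaller distance between two variables.
dist = np.zeros((n, n))
for i, j in combinations(range(n), 2):
    mi = mutual_info_score(df[cols[i]], df[cols[j]])
    dist[i, j] = dist[j, i] = 1.0 / (1.0 + mi)

# Agglomerative clustering of the variables, cut into 11 clusters.
Z = linkage(squareform(dist), method="average")
labels = fcluster(Z, t=11, criterion="maxclust")
clusters = {k: [c for c, l in zip(cols, labels) if l == k] for k in set(labels)}
print(clusters)
```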


We go back to Modeling Mode (F4) and run the Multiple Clustering (Learning - Clustering - Multiple Clustering) with the options described in Step 5 above.
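For intuition only, here is a hypothetical sketch of the Factor induction. BayesiaLab learns each Factor as a discrete latent variable on a Naïve structure; the sketch below uses plain KMeans on each cluster's columns as a crude stand-in. The column names (q1 to q5) and the four states per Factor are assumptions.

```python
import pandas as pd
from sklearn.cluster import KMeans

df = pd.read_csv("survey.csv")                       # hypothetical file name
clusters = {0: ["q1", "q2", "q3"], 1: ["q4", "q5"]}  # from Variable Clustering

# Induce one discrete Factor per cluster of Manifest variables.
factors = pd.DataFrame(index=df.index)
models = {}
for k, cols in clusters.items():
    km = KMeans(n_clusters=4, n_init=10, random_state=0)  # 4 states (assumed)
    factors[f"Factor_{k}"] = km.fit_predict(df[cols])
    models[k] = km
print(factors.head())
```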

We then select all the nodes (Ctrl + A) and run Data Clustering (Learning - Clustering - Data Clustering).

The Cluster node, named [Factor_11], is at the center of the graph below. Here we used the Radial Layout (View - Layout - Radial Layout).
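Along the same lines, this final Data Clustering step can be mimicked (again only as an illustration, not as BayesiaLab's EM-based method) by clustering on the Factor columns alone, which mirrors the constraint that forbids new relations with the Manifest variables. The number of segments below is an assumption.

```python
import pandas as pd
from sklearn.cluster import KMeans

factors = pd.read_csv("factors.csv")     # hypothetical: the induced Factors
seg = KMeans(n_clusters=5, n_init=10, random_state=0)  # 5 segments (assumed)
labels = seg.fit_predict(factors)        # segment label for each respondent
factors["Factor_11"] = labels            # the Cluster node's states
print(factors["Factor_11"].value_counts())
```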


Here are the steps for using the Hierarchical network to classify a new dataset. Note that the new observations do not have values for the Factors.

  1. Associate the new dataset with the Hierarchical network (Data - Associate Data Source). All the Factors will appear as white nodes, indicating that they are Hidden nodes, i.e. nodes without any corresponding data in the associated dataset.
  2. Impute the Factors by using the evidence we have on the Manifest variables:

    1. Select all the Factors.
    2. Right-click on one of these Factors and select Imputation - Choose the Values with the Maximum Probability.
  3. Impute (classify) the Cluster node by right-clicking on it and selecting Imputation - Choose the Values with the Maximum Probability.

You can then save the entire dataset, or only the Cluster node by selecting it (Data - Save Data). A conceptual sketch of this classification workflow follows below.
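Here is a minimal end-to-end sketch, under the same assumptions as the earlier sketches (hypothetical file and column names, KMeans as a stand-in for BayesiaLab's clustering). The .predict() calls play the role of the maximum-probability imputation, and to_csv stands in for Data - Save Data.

```python
import pandas as pd
from sklearn.cluster import KMeans

train = pd.read_csv("survey.csv")                    # original respondents
new = pd.read_csv("new_respondents.csv")             # Manifests only, no Factors
clusters = {0: ["q1", "q2", "q3"], 1: ["q4", "q5"]}  # hypothetical clusters

# Impute one Factor per cluster, for both the original and the new data.
train_factors = pd.DataFrame(index=train.index)
new_factors = pd.DataFrame(index=new.index)
for k, cols in clusters.items():
    km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(train[cols])
    train_factors[f"Factor_{k}"] = km.predict(train[cols])
    new_factors[f"Factor_{k}"] = km.predict(new[cols])   # impute the Factor

# Impute (classify) the segment from the imputed Factors, then save.
seg = KMeans(n_clusters=5, n_init=10, random_state=0).fit(train_factors)
new["Factor_11"] = seg.predict(new_factors)
new.to_csv("classified_respondents.csv", index=False)
```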

Example

We associate the new dataset to the Hierarchical network we've just learned (Data - Associate Data Source - Text File).

The data association wizard informs us that some nodes, the 12 Factors, do not have any corresponding columns in the dataset. These 12 Factors appear as white nodes (hidden nodes) in the graph below.


We first select all the Factors except the one corresponding to the Cluster node ([Factor_11]). We use the contextual menu of one Factor node, Select - Classes - Factor, and unselect [Factor_11] with a Ctrl + left-click.

We impute the selected Factors by using the contextual menu of one of the Factors: Imputation - Choose the Values with the Maximum Probability.

All the Factors are now painted in blue again, except the Cluster node.

We impute the Cluster by using its contextual menu and Imputation - Choose the Values with the Maximum Probability. The node becomes blue again.

We can now save the classification of the new observations by using Data - Save Data.

If you do not want to use the original Hierarchical network (i.e. the one with the Cluster node connected to all the Factors, and the Factors connected to all their Manifest variables), you can impute the Cluster node after Step 6 of the cluster-induction workflow and then use a Supervised Learning algorithm to find the best predictive variables. However, as the segments have been induced based on the Naïve structure, the initial structure should give you the best results. Using one of the Markov Blanket based algorithms is then only useful, in that particular case, for reducing the number of variables necessary for the classification.
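As a rough open-source analogue of that variable reduction (not a Markov Blanket algorithm), one can rank the Manifest variables by their mutual information with the segment and keep only the strongest predictors. File and column names below are hypothetical, and the data is assumed to be numerically coded.

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

df = pd.read_csv("classified_respondents.csv")  # hypothetical file
X = df.drop(columns=["Factor_11"])              # the Manifest variables
y = df["Factor_11"]                             # the imputed segment

# Rank the Manifests by mutual information with the segment.
mi = mutual_info_classif(X, y, discrete_features=True, random_state=0)
ranking = pd.Series(mi, index=X.columns).sort_values(ascending=False)
print(ranking.head(10))                         # candidate predictors to keep
```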