This menu gives access to algorithms that cluster data in an unsupervised way in order to find partitions of homogeneous elements. These algorithms are based on a naive architecture in which the node CLUSTERS, which models the partitions, is the parent of all the other variables. Unlike in supervised learning, the values of the node CLUSTERS are never observed in the database. All these algorithms therefore rely on Expectation-Maximization methods to estimate these missing values.
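
The principle can be illustrated with a minimal Python sketch of Expectation-Maximization on a naive structure with discrete variables. It is not the algorithm used by the software: the function name, its parameters, the random initialization and the smoothing constant are all assumptions made for the illustration.

```python
import numpy as np

def em_naive_clustering(data, n_clusters, n_states, n_iter=50, seed=0):
    """Fit a naive latent-class model with EM (illustrative sketch only).

    data: (n_rows, n_vars) integer array of state indices.
    n_states: number of states of each variable.
    Returns the cluster priors, the conditional tables P(variable | cluster)
    and the posterior P(cluster | row) of each row.
    """
    rng = np.random.default_rng(seed)
    n_rows, n_vars = data.shape
    # Assumption: start from a random soft assignment of rows to clusters.
    post = rng.dirichlet(np.ones(n_clusters), size=n_rows)
    for _ in range(n_iter):
        # M-step: re-estimate P(CLUSTERS) and each P(variable | CLUSTERS)
        # from the expected counts given by the current posteriors.
        prior = post.mean(axis=0)
        tables = []
        for v in range(n_vars):
            onehot = np.eye(n_states[v])[data[:, v]]   # (n_rows, n_states[v])
            counts = post.T @ onehot + 1e-6            # smoothed expected counts
            tables.append(counts / counts.sum(axis=1, keepdims=True))
        # E-step: the cluster is never observed, so compute P(CLUSTERS | row).
        # The naive structure makes the variables independent given the cluster.
        log_p = np.log(prior + 1e-12) + sum(
            np.log(tables[v][:, data[:, v]]).T for v in range(n_vars)
        )
        log_p -= log_p.max(axis=1, keepdims=True)      # numerical stability
        post = np.exp(log_p)
        post /= post.sum(axis=1, keepdims=True)
    return prior, tables, post
```

Calling `em_naive_clustering(data, 2, [2, 2, 2])` on a table of three binary variables returns one posterior row per record; taking the most probable cluster of each row gives a hard partition.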

Output

It is possible to create clusters with ordered numerical states. The value of each state of the cluster node is the mean of the scores of the connected nodes for that state. Each score is weighted by the binary mutual information of its node so that the value is more representative of the relationships. If two of these values are strictly identical, an epsilon is added to one of them to make them distinct. Excluded nodes are not taken into account when computing the numerical values.
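
As a rough illustration of this computation, the sketch below takes one score per cluster state and per node, averages the scores with mutual-information weights, skips excluded nodes, and separates identical values with an epsilon. The function name, the input layout and the value of epsilon are assumptions; the exact definition of the score is not given here.

```python
import numpy as np

def cluster_state_values(node_scores, mutual_info, excluded=(), eps=1e-9):
    """Weighted mean score per cluster state (illustrative sketch only).

    node_scores: (n_clusters, n_nodes) score of each node for each state.
    mutual_info: (n_nodes,) weight of each node (assumed precomputed).
    excluded: indices of nodes that must not contribute.
    """
    scores = np.asarray(node_scores, dtype=float)
    weights = np.asarray(mutual_info, dtype=float).copy()
    for j in excluded:
        weights[j] = 0.0                      # excluded nodes are ignored
    values = scores @ weights / weights.sum() # weighted mean per state
    # Make strictly identical values distinct by adding an epsilon to one.
    order = np.argsort(values, kind="stable")
    for prev, cur in zip(order[:-1], order[1:]):
        if values[cur] <= values[prev]:
            values[cur] = values[prev] + eps
    return values
```

With scores `[[1, 10], [1, 10], [3, 20]]` and equal weights, the first two states would tie, so the second one is shifted by epsilon and the three states keep a strict order.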

Clustering Settings

The assistant gives access to the different search methods:

  • Fixed Class Number: the algorithm tries to segment the data into a given number of classes (ranging from 2 to 127). However, it is possible to obtain fewer clusters than requested;
     
  • Automatic Selection of the Class Number: a random walk is used to find the optimal number of classes. It starts with the specified number of clusters and increases that number until it obtains empty or unstable clusters or reaches the specified maximum number of clusters. The random walk is guided by the results obtained at each step.

Options

  • Sample Size: the sample size option makes it possible to search for the optimal number of classes on data subsets to improve the convergence speed (a sampling by step/trial). The partition obtaining the best score is then used as the initial partition for the search on the entire data set. It is possible to indicate either a percentage or the exact number of lines to use.
     
  • Steps Number: the number of steps for the random walk. Because the search can be stopped at any time by clicking on the red light of the status bar while preserving the best clustering found so far, this number can be set very high.
     
  • Maximum Drift: indicates the maximum difference allowed between the cluster probabilities during learning and those obtained after missing value completion, i.e. between the theoretical distribution during learning and the effective distribution after imputation over the learning data set.
     
  • Minimum Cluster Purity in Percentage: defines the minimum allowed purity for a cluster to be kept.
     
  • Minimum Cluster Size in Percentage: defines the minimum allowed size for a cluster to be kept.
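
The last three options can be pictured with the following sketch, which compares the learned cluster probabilities with the distribution obtained after assigning each row to its most probable cluster, and flags the clusters that meet the purity and size thresholds. It is only an illustration: the function name is hypothetical, the drift is assumed to be the largest absolute difference between the two distributions, and the thresholds are given as fractions rather than percentages.

```python
import numpy as np

def check_clusters(prior, post, max_drift, min_purity, min_size):
    """Apply drift, purity and size thresholds (illustrative sketch only).

    prior: (n_clusters,) cluster probabilities obtained during learning.
    post:  (n_rows, n_clusters) posterior P(cluster | row) on the learning set.
    Returns (drift_ok, keep) where keep is a boolean mask over the clusters.
    """
    prior = np.asarray(prior, dtype=float)
    post = np.asarray(post, dtype=float)
    n_rows, n_clusters = post.shape
    hard = post.argmax(axis=1)                    # imputed cluster of each row
    effective = np.bincount(hard, minlength=n_clusters) / n_rows
    # Assumption: drift = largest gap between theoretical and effective shares.
    drift = np.abs(prior - effective).max()
    # Purity: mean probability of the cluster over the rows assigned to it.
    purity = np.array([
        post[hard == k, k].mean() if (hard == k).any() else 0.0
        for k in range(n_clusters)
    ])
    keep = (purity >= min_purity) & (effective >= min_size)
    return drift <= max_drift, keep
```

A cluster that receives no row gets a purity of 0 in this sketch and is therefore never kept.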

Edit Node Weights

A button displays a dialog box for editing the weights associated with each variable.

These weights, which default to 1, make it possible to guide the clustering. A variable with a weight greater than 1 is given more importance during the clustering. A variable with a zero weight becomes purely illustrative.
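
One common way to obtain this behavior, shown here purely as an assumption and not as the method used by the software, is to multiply the log-likelihood contribution of each variable by its weight when computing the cluster probabilities of a row. The function name and arguments below are hypothetical.

```python
import numpy as np

def weighted_posterior(prior, tables, row, weights):
    """P(cluster | row) with per-variable weights (illustrative sketch only).

    prior:   (n_clusters,) cluster probabilities.
    tables:  one (n_clusters, n_states) array P(variable | cluster) per variable.
    row:     observed state index of each variable.
    weights: one weight per variable; 1 is neutral, 0 removes the variable.
    """
    log_p = np.log(np.asarray(prior, dtype=float))
    for table, state, w in zip(tables, row, weights):
        # Assumption: the weight scales the variable's log-likelihood.
        log_p = log_p + w * np.log(np.asarray(table, dtype=float)[:, state])
    p = np.exp(log_p - log_p.max())               # numerical stability
    return p / p.sum()
```

Under this assumption, a weight of 0 leaves the cluster probabilities unchanged by the variable, which matches the idea of a purely illustrative variable, while a weight of 2 counts the variable as if it had been observed twice.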

Result

At the end of the clustering, an algorithm automatically determines whether one of the states of the Clusters node is a filtered state. If so, this state is marked as filtered.

An automatic analysis of the obtained segmentation is then carried out and returns a textual report. This report is a Target Report Analysis with some additional information. It consists of:

  • A summary of the learning characteristics (method and parameters, best number of clusters, score and time to find the solution);
     
  • A sorted list of the obtained results (number of clusters and corresponding score);
     
  • A list of the clusters sorted according to the marginal probabilities of each cluster (cf. Target Report Analysis);
     
  • A list of the clusters sorted according to the number of examples actually associated with each cluster when using a decision rule based on the maximum likelihood for each case described in the learning set;
     
  • A list of the clusters sorted according to the purity of each cluster. The purity of a cluster is the mean of the cluster probability computed over the examples of the learning set associated with it. This list also comes with the neighborhood of each cluster. The neighborhood of a cluster is the set of clusters that have a nonzero probability when an example is associated with that cluster;
     
  • A list of the nodes sorted according to the amount of information they bring to the knowledge of the Cluster node (cf. Target Report Analysis);
     
  • The probabilistic profile of each cluster (cf. Target Report Analysis).
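
The example counts, purity and neighborhood listed above can be derived from the cluster probabilities of each example. The sketch below follows the definitions given in the list and is not the software's implementation: the function name and the output layout are assumptions, and each example is associated with its most probable cluster.

```python
import numpy as np

def cluster_report(post, tol=0.0):
    """Count, purity and neighborhood per cluster (illustrative sketch only).

    post: (n_rows, n_clusters) probability of each cluster for each example.
    tol:  probabilities above this value count as nonzero.
    """
    post = np.asarray(post, dtype=float)
    hard = post.argmax(axis=1)                 # decision rule: best cluster
    report = {}
    for k in range(post.shape[1]):
        members = post[hard == k]              # examples associated with k
        if len(members) == 0:
            continue                           # empty cluster: nothing to report
        # Clusters that receive a nonzero probability for at least one member.
        others = (members > tol).any(axis=0)
        others[k] = False                      # a cluster is not its own neighbor
        report[k] = {
            "count": len(members),
            "purity": float(members[:, k].mean()),
            "neighborhood": sorted(int(j) for j in np.flatnonzero(others)),
        }
    return report
```

Sorting the entries of the returned dictionary by `count` or by `purity` reproduces the ordering of the corresponding lists in the report.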

The Mapping button of the report window displays a graphical representation of the created clusters.

This graph displays three properties of the clusters found:

  • the color represents the purity of the clusters: the bluer a cluster, the purer it is;
     
  • the size represents the prior probability of the cluster;
     
  • the distance between two clusters represents the mean neighborhood of the clusters.

The rotation buttons at the bottom right allow rotating the entire graph.

To make the obtained clusters easier to understand, if at least one variable used in the clustering has numerical values associated with its states, long names are automatically associated with the states of the node Cluster. Each long name contains the mean value of all the clustered variables obtained when observing the corresponding state of the Cluster node.