Context

Tools | Multi-Run | Structural Coefficient Analysis

This tool helps choose the best Structural Coefficient by running structural learning algorithms over a range of coefficients, which changes the structural complexity of the machine-learned networks.

Renamed Menu Item

This feature was previously under Tools | Cross-Validation.

New Feature: New Metrics

As of version 7.0, two additional metrics are available for evaluating the impact of the coefficients on the quality of the induced networks:

  • The Contingency Table Fit, which measures how well the network represents the data (the right part of the MDL score)
  • The Degree of Freedom Reduction Efficiency, a new metric that measures how the complexity of the model affects the representation quality of the data.

    It is defined as follows:

    [equation not preserved in the source]

    where the terms are:

    the degree of freedom of the fully connected network F,
    the degree of freedom of the current network B,
    and the Contingency Table Fit of the current network B.
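The two degrees of freedom used by this metric can be illustrated with a short sketch. It assumes the usual definition for discrete Bayesian networks (the number of free probability parameters), which may differ in detail from the tool's internal computation. The function names and the toy structure are hypothetical.

```python
from math import prod


def degrees_of_freedom(states, parents):
    """Free probability parameters of a discrete Bayesian network.

    states:  dict mapping node name -> number of states
    parents: dict mapping node name -> list of parent names
    (Assumed definition: (states - 1) per parent configuration.)
    """
    total = 0
    for node, k in states.items():
        # prod() of an empty sequence is 1, i.e. a root node has one configuration
        parent_configs = prod(states[p] for p in parents.get(node, []))
        total += (k - 1) * parent_configs
    return total


def fully_connected_degrees_of_freedom(states):
    """A fully connected network encodes the complete joint table."""
    return prod(states.values()) - 1


# Hypothetical toy network: four variables, three bins each.
states = {"price": 3, "grade": 3, "view": 3, "waterfront": 3}
parents = {"grade": ["price"], "view": ["price"]}  # "waterfront" is unconnected

df_b = degrees_of_freedom(states, parents)          # current network B
df_f = fully_connected_degrees_of_freedom(states)   # fully connected network F
print(df_b, df_f)
```

The gap between the two values shows how much complexity the current network saves compared with the fully connected one.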

New Feature: Rediscretize Continuous Nodes

This new option allows running automatic discretization before executing the selected structural learning algorithm with the current Structural Coefficient.

The discretization is only run for the continuous variables that have an associated automatic discretization algorithm, i.e. those whose discretization thresholds have not been manually defined or modified. Note that the Target Node is never rediscretized in this context.
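The selection rule above can be sketched as a simple filter. The `Node` class and its fields are hypothetical and only serve the illustration. They are not part of the tool's API.

```python
from dataclasses import dataclass


@dataclass
class Node:
    # Hypothetical description of a node, for illustration only
    name: str
    continuous: bool
    auto_discretization: bool  # False once thresholds were set or edited by hand
    is_target: bool = False


def nodes_to_rediscretize(nodes):
    """Names of the nodes that would be rediscretized before each trial."""
    return [
        n.name
        for n in nodes
        if n.continuous and n.auto_discretization and not n.is_target
    ]


nodes = [
    Node("price", continuous=True, auto_discretization=True, is_target=True),
    Node("sqft", continuous=True, auto_discretization=True),
    Node("age", continuous=True, auto_discretization=False),  # manual thresholds
    Node("waterfront", continuous=False, auto_discretization=False),
]
selected = nodes_to_rediscretize(nodes)
print(selected)
```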

The main purpose of this option is to test the impact of the Structural Coefficient on the discretization algorithms. It is thus geared toward supervised learning problems, where the variables are discretized with tree-based approaches: the Structural Coefficient is also used in the MDL score that drives the induction of the trees.

However, when the Seed is not fixed, this option can also affect the discretization algorithms that are stochastic by nature.

Example

Let's use a data set that contains house sale prices for King County, which includes Seattle. It describes homes sold between May 2014 and May 2015. More precisely, we have extracted the 94 houses that are more than 100 years old, that have been renovated, and come with a basement.

All the continuous variables have been discretized into three bins, with R2-GenOpt.

Given the small number of observations in this data set, we set five prior samples for the Smoothed Probability Estimation (5.0.4) in order to use a non-informative prior in the estimation of the parameters.

Below is the network learned with EQ, with the default Structural Coefficient:

Four nodes remain unconnected.

This means that, from the MDL score perspective, with the default Structural Coefficient, relationships with the other nodes are too weak, and therefore it is "too expensive" to represent these relationships. In other words, the additional bits required to represent the structure, if we were to add a link with one of these nodes, will not be compensated by the reduction of the number of bits to represent the data.
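The trade-off can be sketched in a few lines. This sketch assumes that the MDL score has the form "Structural Coefficient × structure bits + data bits", with the data term as its right part. The bit counts are invented for illustration and do not come from the house data set.

```python
def mdl_score(structure_bits, data_bits, structural_coefficient=1.0):
    """Assumed form of the score: weighted structure cost plus data cost."""
    return structural_coefficient * structure_bits + data_bits


def link_is_worth_adding(extra_structure_bits, data_bits_saved,
                         structural_coefficient=1.0):
    """A link lowers the score only if the weighted structure cost it adds
    is smaller than the number of data bits it saves."""
    return structural_coefficient * extra_structure_bits < data_bits_saved


# Hypothetical weak relationship: 20 extra bits of structure, 14 data bits saved.
extra_structure_bits = 20.0
data_bits_saved = 14.0
for coefficient in (1.0, 0.8, 0.6):
    decision = link_is_worth_adding(extra_structure_bits, data_bits_saved, coefficient)
    print(coefficient, decision)
```

With these made-up numbers the link is rejected at 1.0 and 0.8 and accepted at 0.6, which is the effect that lowering the Structural Coefficient is meant to produce.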

One way to try getting these nodes connected would be to decrease the number of bins used for discretization. This would automatically reduce the "price" for adding links with these nodes.

However, if we want to keep the same discretization, we can try to reduce the Structural Coefficient. Instead of manually selecting a value by trial and error, the Structural Coefficient Analysis tool can be used for automatically testing different coefficients.

With this setting, 25 networks are learned with EQ, starting with a Structural Coefficient of 1, then 0.967, then 0.933, and so on, the last network being learned with a Structural Coefficient of 0.2.
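As a quick check, the tested values can be listed with the sketch below. It assumes the coefficients are evenly spaced between the two bounds. Under that assumption the step is 0.8 / 24 ≈ 0.033, which matches the coefficients 0.933, 0.6, 0.567 and 0.533 quoted for the structures further down. The function name is hypothetical.

```python
def coefficient_sweep(start, stop, count):
    """Evenly spaced coefficients from start down to stop, both included."""
    step = (stop - start) / (count - 1)
    return [round(start + i * step, 3) for i in range(count)]


coefficients = coefficient_sweep(1.0, 0.2, 25)
print(coefficients)
```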

Prior to each trial, the nodes are discretized into three bins with R2-GenOpt. As the seed of our random number generator is fixed, unchecking Rediscretize Continuous Nodes would return the exact same results.

The three selected metrics are computed for each of these 25 networks.

The normalized values of these metrics are available by clicking the Curve button:

All these curves suggest a coefficient between 0.5 and 0.6.

Updated Feature: Structure Comparison

Comparing the structures of the learned networks usually helps in deciding which coefficient to use in the end. The networks are now stored from the largest Structural Coefficient to the smallest.

Example

Let's continue with our house example. The structures of the networks can be compared by clicking Structure Comparison.

The two highlighted arrows are used to go through the different structures.

The Synthesis Structure is not a Bayesian network. It is a graph that contains all the links that have been generated during the different trials.

A link can have three different colors:

  • Black: the link belongs to the initial structure and has been found in at least one generated solution;
  • Red: the link belongs to the initial structure but has never been found in the generated solutions;
  • Blue: the link does not belong to the initial structure but has been found in at least one generated solution.

Furthermore, the thickness of the link is proportional to its frequency in the generated structures.
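The color and thickness rules can be sketched as follows. The function, the representation of links as name pairs, and the toy structures are assumptions made for illustration. The frequency returned here stands in for the thickness, whose exact scaling in the tool is not described.

```python
from collections import Counter


def synthesize(initial_links, generated_structures):
    """Map each link to (color, frequency) following the rules above.

    Links are undirected pairs written as sorted tuples, e.g. ("a", "b").
    """
    counts = Counter(
        link for structure in generated_structures for link in set(structure)
    )
    trials = len(generated_structures)
    summary = {}
    for link in set(initial_links) | set(counts):
        in_initial = link in initial_links
        found = counts[link] > 0
        if in_initial and found:
            color = "black"
        elif in_initial:
            color = "red"
        else:
            color = "blue"
        # Thickness is proportional to this frequency
        frequency = counts[link] / trials if trials else 0.0
        summary[link] = (color, frequency)
    return summary


# Hypothetical initial structure and two generated structures
initial = {("a", "b"), ("b", "c")}
generated = [{("a", "b")}, {("a", "b"), ("a", "c")}]
summary = synthesize(initial, generated)
print(summary)
```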

When an arc is added between two links, this indicates a V-Structure. Without an arc, the link can have both orientations in its Equivalence Class.

This is the network that is in the Graph Panel when the analysis is run.

This structure has been found twice, and the maximum Structural Coefficient was 1. They correspond to the two highlighted points in the graph below:

This structure has been found twice, and the maximum Structural Coefficient was 0.933. They correspond to the two highlighted points in the graph below:

This structure has been found once with a Structural Coefficient = 0.6. It corresponds to the highlighted point in the graph below:

This structure has been found once with a Structural Coefficient = 0.567. It corresponds to the highlighted point in the graph below:

This structure has been found seven times, and the maximum Structural Coefficient was 0.533. They correspond to the highlighted points in the graph below:

Given these structures, a conservative choice would be to select the solution that has a path between every pair of nodes and the largest coefficient, i.e. the solution with a Structural Coefficient set to 0.567.
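This selection rule can be sketched as a connectivity check over the learned structures. The sketch ignores link orientation and keeps the largest coefficient whose structure leaves no node isolated from the others. The node names and links below are hypothetical stand-ins, not the actual structures learned from the house data.

```python
def is_connected(nodes, links):
    """True if every node can be reached from every other, ignoring direction."""
    nodes = list(nodes)
    if not nodes:
        return True
    adjacency = {n: set() for n in nodes}
    for a, b in links:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen = {nodes[0]}
    stack = [nodes[0]]
    while stack:
        current = stack.pop()
        for neighbor in adjacency[current] - seen:
            seen.add(neighbor)
            stack.append(neighbor)
    return len(seen) == len(nodes)


def conservative_choice(nodes, structures_by_coefficient):
    """Largest coefficient whose structure connects all nodes, or None."""
    connected = [
        coefficient
        for coefficient, links in structures_by_coefficient.items()
        if is_connected(nodes, links)
    ]
    return max(connected) if connected else None


# Hypothetical trials: coefficient -> links of the learned structure
nodes = ["a", "b", "c", "d"]
trials = {
    1.0: [("a", "b")],
    0.6: [("a", "b"), ("b", "c")],
    0.567: [("a", "b"), ("b", "c"), ("c", "d")],
    0.533: [("a", "b"), ("b", "c"), ("c", "d"), ("a", "d")],
}
choice = conservative_choice(nodes, trials)
print(choice)
```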

Clicking the highlighted icon directly opens a new graph with the visualized structure.

When choosing a Structural Coefficient lower than 1, it is highly recommended to double-check that the relationships that have been represented by decreasing the Structural Coefficient are significant. This can be done by running Analysis | Report | Relationship.

The p-value highlighted in green confirms that the relationships are significant with a threshold set to 1%. Note that, even though there is a link between view and waterfront, the p-value computed with the model and the one computed directly on the data (assuming a direct link between the two variables) are not exactly the same. This is because the model has been learned with five prior samples for the Smoothed Probability Estimation, which slightly smooths the relationship.

Doing the same analysis on the network learned with a Structural Coefficient set to 0.533 returns a p-value of 2% for the weakest relationship estimated on the data.