# Context

#### Tools | Multi-Run | Structural Coefficient Analysis

This tool helps choose the best **Structural Coefficient** by running the structural learning algorithms with a range of coefficients, thus varying the structural complexity of the machine-learned networks.

# Renamed Menu Item

This feature was previously located under **Tools | Cross-Validation**.

# New Feature: New Metrics

As of version 7.0, two additional metrics are available for evaluating the impact of the coefficients on the quality of the induced networks:

- The **Contingency Table Fit**, for measuring the representation quality of the data (the right-hand part of the MDL score);
- The **Degree of Freedom Reduction Efficiency**, a new metric for measuring how the complexity of the model impacts the representation quality of the data.

The **Degree of Freedom Reduction Efficiency** is defined as a function of:

- the degrees of freedom of the fully connected network *F*;
- the degrees of freedom of the current network *B*;
- the **Contingency Table Fit** of the current network *B*.
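To make the degrees of freedom concrete, the sketch below counts the free parameters of a discrete Bayesian network using the standard formula: each node contributes (number of states − 1) × (number of parent configurations), while the fully connected network has one free parameter per joint configuration, minus one. The three-node network is a hypothetical example, not taken from BayesiaLab.

```python
from math import prod

def df_network(cards, parents):
    """Degrees of freedom of a discrete Bayesian network:
    sum over nodes of (states - 1) * (number of parent configurations)."""
    return sum((cards[n] - 1) * prod(cards[p] for p in parents[n])
               for n in cards)

def df_full(cards):
    """Degrees of freedom of the fully connected network:
    one free parameter per joint configuration, minus one."""
    return prod(cards.values()) - 1

# Hypothetical example: three binary nodes, with the v-structure A -> C <- B.
cards = {"A": 2, "B": 2, "C": 2}
parents = {"A": [], "B": [], "C": ["A", "B"]}

print(df_network(cards, parents))  # 1 + 1 + 1*4 = 6
print(df_full(cards))              # 2*2*2 - 1 = 7
```

The gap between the two counts (6 vs. 7 here) is what the metric measures the "efficiency" of: how much representation quality the model retains while spending far fewer degrees of freedom than the fully connected network.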

# New Feature: Rediscretize Continuous Nodes

This new option allows running automatic discretization before executing the selected structural learning algorithm with the current **Structural Coefficient**.

The discretization is only run for the continuous variables that have an associated automatic discretization algorithm, i.e., those for which the discretization thresholds have not been manually defined (or modified). Note that the **Target Node** is never rediscretized in this context.

The main purpose of this option is to allow testing the impact of the **Structural Coefficient** on the discretization algorithms. It is thus geared toward supervised learning problems, where the variables are discretized with **Tree**-based approaches. The **Structural Coefficient** is indeed also used in the MDL score employed for inducing the trees.

However, when the **Seed** is not fixed, this option can also have an impact on the discretization algorithms that are stochastic by nature.

**Example**

Let's use a data set that contains house sale prices for King County, which includes Seattle. It describes homes sold between May 2014 and May 2015. More precisely, we have extracted the 94 houses that are more than 100 years old, have been renovated, and come with a basement.

All the continuous variables have been discretized into three bins with **R²-GenOpt**.

Given the small number of observations in this data set, we set five prior samples for the **Smoothed Probability Estimation** (5.0.4) in order to use a non-informative prior in the estimation of the parameters.
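The effect of these prior samples can be sketched as follows. This assumes the smoothing behaves like standard Dirichlet smoothing with a uniform prior, i.e., the five virtual samples are spread uniformly over the states; this is a plausible reading for illustration, not BayesiaLab's documented internals.

```python
def smoothed_probs(counts, prior_samples=5.0):
    """Estimate probabilities from raw counts, adding `prior_samples`
    virtual observations spread uniformly over all states."""
    k = len(counts)
    total = sum(counts) + prior_samples
    return [(c + prior_samples / k) / total for c in counts]

# With only 94 observations, smoothing keeps rarely observed (or unobserved)
# states from getting a zero probability estimate:
print(smoothed_probs([90, 4, 0]))       # no state has probability 0
print(smoothed_probs([90, 4, 0], 0.0))  # raw MLE: the last state gets 0
```

This is also why, later in this example, a p-value computed with the model differs slightly from the one computed directly on the data: the prior samples slightly smooth every estimated relationship.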

Below is the network learned with **EQ**, with the default **Structural Coefficient**:

Four nodes remain unconnected.

This means that, from the MDL score perspective, with the default **Structural Coefficient**, the relationships between these nodes and the others are too weak, and it is therefore "too expensive" to represent them. In other words, the additional bits required to represent the structure, if we were to add a link to one of these nodes, would not be compensated by the reduction in the number of bits needed to represent the data.
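This trade-off can be written as a simple inequality: adding a link pays off only if the extra structure bits, weighted by the **Structural Coefficient**, are smaller than the data bits saved. A minimal sketch (the bit counts are made-up numbers for illustration, not BayesiaLab output):

```python
def link_pays_off(delta_structure_bits, data_bits_saved, structural_coefficient=1.0):
    """MDL trade-off: adding a link is worthwhile only if the (weighted)
    structural encoding cost is smaller than the data-encoding savings."""
    return structural_coefficient * delta_structure_bits < data_bits_saved

# Made-up numbers: the link costs 12 extra structure bits and saves 9 data bits.
print(link_pays_off(12, 9, structural_coefficient=1.0))  # False: the node stays unconnected
print(link_pays_off(12, 9, structural_coefficient=0.6))  # True: 0.6 * 12 = 7.2 < 9
```

As the second call shows, lowering the coefficient does not change the data or the link's strength; it only discounts the structural cost, which is exactly why weak relationships become representable.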

One way to get these nodes connected would be to decrease the number of bins used for discretization. This would automatically reduce the "price" of adding links to these nodes.

However, if we want to keep the same discretization, we can try to reduce the **Structural Coefficient**. Instead of manually selecting a value by trial and error, the **Structural Coefficient Analysis** tool can be used for automatically testing different coefficients.

With this setting, 25 networks are learned with **EQ**, starting with a **Structural Coefficient** of 1, then 0.967, then 0.933, and so on, the last network being learned with a **Structural Coefficient** of 0.2.
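The 25 coefficients form an evenly spaced grid from 1 down to 0.2 (a step of 0.8/24 ≈ 0.0333, which matches the coefficients reported later in this example: 0.933, 0.567, 0.533):

```python
# Evenly spaced grid of 25 structural coefficients, from 1 down to 0.2.
n_trials, start, stop = 25, 1.0, 0.2
step = (start - stop) / (n_trials - 1)          # 0.8 / 24, i.e. about 0.0333
coefficients = [start - i * step for i in range(n_trials)]

print(len(coefficients))                        # 25
print([round(c, 3) for c in coefficients[:3]])  # [1.0, 0.967, 0.933]
print(round(coefficients[-1], 3))               # 0.2
```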

Prior to each trial, the nodes are discretized into three bins with **R²-GenOpt**. As the seed of our random number generator is fixed, unchecking **Rediscretize Continuous Nodes** would return the exact same results.

The three selected metrics are computed for each of these 25 networks.

The normalized values of these metrics are available by clicking the **Curve** button:

All these curves suggest a coefficient between 0.5 and 0.6.

# Updated Feature: Structure Comparison

Comparing the structure of the learned networks usually helps decide which coefficient to finally use. The networks are now stored from the largest **Structural Coefficient** to the smallest.

**Example**

Let's continue with our house example. The structures of the networks can be compared by clicking **Structure Comparison**.

The two highlighted arrows are used to go through the different structures.

The **Synthesis Structure** is not a Bayesian network. It is a graph that contains all the links that have been generated during the different trials.

A link can have three different colors:

- **Black**: the link belongs to the initial structure and has been found in at least one generated solution;
- **Red**: the link belongs to the initial structure but has never been found in the generated solutions;
- **Blue**: the link does not belong to the initial structure but has been found in at least one generated solution.

Furthermore, the thickness of the link is proportional to its frequency in the generated structures.

When an arrowhead is displayed on a link, this indicates a **V-Structure**. Without an arrowhead, the link can take either orientation within its **Equivalence Class**.
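The **Equivalence Class** mentioned here groups DAGs with the same skeleton and the same v-structures (the Verma–Pearl characterization of Markov equivalence). A minimal sketch of that criterion, as a generic illustration rather than BayesiaLab code:

```python
def v_structures(edges):
    """Return the v-structures (colliders with non-adjacent parents)
    of a DAG given as a set of directed edges (parent, child)."""
    parents = {}
    for a, b in edges:
        parents.setdefault(b, set()).add(a)
    adjacent = {frozenset(e) for e in edges}
    return {(a, child, b)
            for child, ps in parents.items()
            for a in ps for b in ps
            if a < b and frozenset((a, b)) not in adjacent}

def same_equivalence_class(e1, e2):
    """Two DAGs are Markov-equivalent iff they share the same skeleton
    (undirected links) and the same v-structures."""
    skeleton = lambda e: {frozenset(x) for x in e}
    return skeleton(e1) == skeleton(e2) and v_structures(e1) == v_structures(e2)

chain    = {("A", "C"), ("C", "B")}  # A -> C -> B
fork     = {("C", "A"), ("C", "B")}  # A <- C -> B
collider = {("A", "C"), ("B", "C")}  # A -> C <- B

print(same_equivalence_class(chain, fork))      # True: same skeleton, no v-structure
print(same_equivalence_class(chain, collider))  # False: the collider adds A -> C <- B
```

This is why only v-structures are singled out in the display: within an equivalence class, all other links are free to take either orientation.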

This is the network that is in the **Graph Panel** when the analysis is run.

This structure has been found twice, and the maximum **Structural Coefficient** was 1. They correspond to the two highlighted points in the graph below:

This structure has been found twice, and the maximum **Structural Coefficient** was 0.933. They correspond to the two highlighted points in the graph below:

This structure has been found once with a **Structural Coefficient** = 0.6. It corresponds to the highlighted point in the graph below:

This structure has been found once with a **Structural Coefficient** = 0.567. It corresponds to the highlighted point in the graph below:

This structure has been found seven times, and the maximum **Structural Coefficient** was 0.533. They correspond to the highlighted points in the graph below:

Given these structures, a conservative choice would be to select the solution with a path between all nodes and the largest coefficient, i.e., the solution with a **Structural Coefficient** set to 0.567.

Clicking the highlighted icon directly opens a new graph with the visualized structure.

When choosing a **Structural Coefficient** lower than 1, it is highly recommended to double-check that the relationships that have been represented by decreasing the **Structural Coefficient** are significant. This can be done by running **Analysis | Report | Relationship**.
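BayesiaLab reports these p-values directly. For intuition, an analogous data-side check is a chi-squared independence test on the contingency table of two discretized variables; the table below is made-up for illustration, not extracted from the King County data:

```python
from scipy.stats import chi2_contingency

# Made-up 3x3 contingency table for two variables discretized into three bins.
observed = [[30,  8,  2],
            [10, 25,  7],
            [ 3,  9, 28]]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(dof)             # (3-1) * (3-1) = 4
print(p_value < 0.01)  # True: the relationship is significant at the 1% threshold
```

A p-value below the chosen threshold (1% in this example) indicates that the association is unlikely to be an artifact of the small sample, which is exactly the check recommended above before committing to a reduced coefficient.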

The p-value highlighted in green confirms that the relationships are significant with a threshold set to 1%. Note that, even though there is a link between **view** and **waterfront**, the p-value computed with the model and the one computed directly on the data (assuming a direct link between the two variables) are not exactly the same. This is because the model has been learned with five prior samples for the **Smoothed Probability Estimation**, which slightly smooths the relationship.

Doing the same analysis on the network learned with a **Structural Coefficient** set to 0.533 returns a p-value of 2% for the weakest relationship estimated on the data.