Context

Tools | Resampling

Prior to version 7.0, this menu item was named Cross-Validation. However, the associated tools now belong to a broader class of methods usually called Resampling.

Let's assume the current Bayesian network B has an associated data set D made of N observations. Resampling consists in generating K data sets D_1, ..., D_K that are utilized for learning K networks B_1, ..., B_K.

BayesiaLab now offers three different ways to create the data sets D_k (see the sketch after this list):

  1. Jackknife and K-Fold: D is divided into K folds of N/K observations. K data sets are created by iteratively excluding one fold.
  2. Bootstrap: each data set D_k is created by sampling N observations with replacement from the original data set D.
  3. Data Perturbation: each data set D_k is created by perturbing the observations of D by multiplying their current weight by a random perturbation. It is a smooth bootstrap, where weights can take any continuous value between 0 and 2.
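
To make these three schemes concrete, here is a minimal, tool-agnostic sketch in Python/NumPy. The function names and the toy data are illustrative assumptions, and the uniform shape of the perturbation weights is an assumption as well (only the 0-to-2 range is stated above); BayesiaLab performs these steps internally.

```python
import numpy as np

def kfold_jackknife_sets(D, K, rng):
    """K-Fold / Jackknife: divide D into K folds and drop one fold per data set.
    Setting K equal to the number of observations gives Leave-One-Out."""
    idx = rng.permutation(len(D))
    folds = np.array_split(idx, K)
    return [np.delete(D, fold, axis=0) for fold in folds]

def bootstrap_sets(D, K, rng):
    """Bootstrap: each data set draws N observations with replacement from D."""
    N = len(D)
    return [D[rng.integers(0, N, size=N)] for _ in range(K)]

def perturbation_weights(D, K, rng):
    """Data Perturbation ('smooth bootstrap'): keep every observation but multiply
    its weight by a random factor in [0, 2] (a uniform draw is assumed here)."""
    N = len(D)
    return [rng.uniform(0.0, 2.0, size=N) for _ in range(K)]

rng = np.random.default_rng(0)
D = rng.normal(size=(94, 5))                            # toy stand-in for N = 94 observations
kfold_sets = kfold_jackknife_sets(D, K=10, rng=rng)     # 10 data sets of ~85 rows each
boot_sets = bootstrap_sets(D, K=100, rng=rng)           # 100 data sets of 94 rows each
weights = perturbation_weights(D, K=100, rng=rng)       # 100 weight vectors in [0, 2]
```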

Resampling can be used for two types of analysis:

  1. Measuring the variability of estimations with Jackknife, Bootstrap, or Data Perturbation;
  2. Estimating the quality of a learning configuration, i.e. learning algorithm and settings, with K-Fold.

The maximum number of data sets that can be generated with Jackknife and K-Fold is N, i.e., one fold per observation. This configuration is known as Leave-One-Out Cross-Validation.

Bootstrap and Data Perturbation do not have such a limitation and can be used for generating an arbitrary number of data sets.

The data sets created with Jackknife and K-Fold contain fewer observations than D. The Structural Coefficient is thus updated (see Arc Confidence) in order to take this smaller number of observations into account. The number of prior samples, if any, is also updated by using the same equation.

Once the data sets are generated, all the Continuous Variables that have not been discretized manually are re-discretized! The discretization is thus another source of instability.

If you want to exclude this source of instability and only measure the variability associated with the learning method, you can right-click the data set icon in the lower right corner of the Graph Panel and select Remove Associated Discretization Type.

History

Cross-Validation has been updated in versions 5.0.2, 5.0.4, 5.1, 5.3, 5.4, and 6.0.

Moved Feature

The Structural Coefficient Analysis is a tool that is not based on data resampling but rather on multiple runs on the same data set. It has thus been moved to the new Multi-Run menu.

New Feature

The Multi-Target Evaluation has been added to the set of tools that can be used for analyzing the K learned networks B_1, ..., B_K.

As of version 7.0, there are thus four available kinds of resampling analysis:

Example

Let's use the same data set for illustrating these four analyses. This data set contains house sale prices for King County, which includes Seattle. It describes homes sold between May 2014 and May 2015. More precisely, we have extracted the 94 houses that are more than 100 years old, that have been renovated, and that come with a basement.
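
For readers who want to reproduce a comparable extraction outside of BayesiaLab, here is a hedged sketch using pandas on the public King County data set. The file name, the column names (date, yr_built, yr_renovated, sqft_basement), and the exact age criterion are assumptions and may not match the extraction actually used for the 94 houses.

```python
import pandas as pd

# Hypothetical reproduction of the extraction step. The file and column names
# come from the public King County house-sales data set and are assumptions here;
# adjust them to your copy if they differ.
df = pd.read_csv("kc_house_data.csv")

sale_year = pd.to_datetime(df["date"]).dt.year
old = sale_year - df["yr_built"] > 100       # more than 100 years old when sold
renovated = df["yr_renovated"] > 0           # has been renovated
with_basement = df["sqft_basement"] > 0      # comes with a basement

subset = df[old & renovated & with_basement]
print(len(subset))                           # expected to be close to the 94 houses
```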

All the continuous variables have been discretized into three bins, with R2-GenOpt.
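
R2-GenOpt is BayesiaLab's own discretization algorithm and cannot be reproduced in a few lines; the sketch below only illustrates what a three-bin discretization of a continuous variable looks like, using plain equal-frequency binning as a rough stand-in.

```python
import pandas as pd

# Rough stand-in only: R2-GenOpt searches for bin boundaries that maximize the
# relationship with the target, whereas plain equal-frequency binning does not.
prices = pd.Series([250_000, 310_000, 420_000, 515_000, 640_000, 828_000])
bins = pd.qcut(prices, q=3, labels=["low", "medium", "high"])
print(bins.tolist())   # each toy price mapped to one of the three bins
```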

Given the small number of observations in this data set, we set five prior samples for the Smoothed Probability Estimation in order to utilize a non-informative prior in the estimation of the parameters.
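
As an illustration of what such a non-informative prior does, here is a hedged, Laplace-style sketch in which the five prior samples are spread uniformly over the states of a variable; BayesiaLab's exact smoothing formula is not reproduced here and may differ.

```python
import numpy as np

def smoothed_distribution(counts, prior_samples=5):
    """Estimate a discrete distribution from raw counts after adding a uniform,
    non-informative prior worth `prior_samples` virtual observations.
    Illustrative only; BayesiaLab's exact smoothing formula may differ."""
    counts = np.asarray(counts, dtype=float)
    k = len(counts)
    return (counts + prior_samples / k) / (counts.sum() + prior_samples)

# A 3-state variable observed 94 times: no state is estimated as exactly zero,
# and the effect of the prior shrinks as the number of observations grows.
print(smoothed_distribution([60, 30, 4]))
```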

Based on the Multi-Run Structural Coefficient Analysis for selecting the best Structural Coefficient, we set the Structural Coefficient to 0.567 and used EQ to learn the network below.

The colors associated with the nodes correspond to the four clusters of variables, and Price (K$) is the Target Node.