
# Context

#### Tools | Resampling

Prior to version 7.0, this menu item was named Cross-Validation. However, the associated tools now belong to a broader class of methods usually called Resampling.

Let's assume the current Bayesian network B has an associated data set D made of N observations. Resampling consists of generating K data sets Dk, which are then used to learn K networks Bk.

BayesiaLab now offers three different ways to create the data sets Dk:

1. Jackknife and K-Fold: D is divided into K folds of N/K observations. K data sets are created by iteratively excluding one fold.
2. Bootstrap: each data set Dk is created by sampling N observations with replacement from the original data set D.
3. Data Perturbation: each data set Dk is created by perturbing the observations of D, multiplying their current weight by a random perturbation. It is a smooth bootstrap, where weights can take any continuous value from 0 to 2.
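The three schemes above can be sketched in a few lines of Python with NumPy. This is only an illustration, not BayesiaLab's implementation: the function names are hypothetical, the shuffling before fold assignment is an assumption, and so is the uniform distribution used for the perturbation factor (the text only states that weights range continuously from 0 to 2).

```python
import numpy as np

def k_fold_datasets(n, k, rng):
    """Return k index arrays, each excluding one fold of about n/k rows.

    Assumes 2 <= k <= n. Rows are shuffled first (an assumption).
    """
    shuffled = rng.permutation(n)
    folds = np.array_split(shuffled, k)
    return [np.concatenate(folds[:i] + folds[i + 1:]) for i in range(k)]

def bootstrap_datasets(n, k, rng):
    """Return k index arrays, each sampling n rows with replacement."""
    return [rng.integers(0, n, size=n) for _ in range(k)]

def perturbed_weights(weights, k, rng):
    """Return k weight vectors, each row scaled by a random factor in [0, 2).

    The uniform distribution is an assumption made for this sketch.
    """
    return [weights * rng.uniform(0.0, 2.0, size=weights.shape) for _ in range(k)]

rng = np.random.default_rng(0)
fold_sets = k_fold_datasets(n=94, k=5, rng=rng)
boot_sets = bootstrap_datasets(n=94, k=5, rng=rng)
weight_sets = perturbed_weights(np.ones(94), k=5, rng=rng)
```

Setting `k` equal to `n` in `k_fold_datasets` leaves exactly one observation out of each data set.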

Resampling can be used for two types of analysis:

1. Measuring the variability of estimations with Jackknife, Bootstrap, or Data Perturbation;
2. Estimating the quality of a learning configuration, i.e. learning algorithm and settings, with K-Fold.
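As a rough illustration of the first type of analysis, the sketch below measures how much an estimate varies across bootstrap resamples. It uses a sample mean as a stand-in for a quantity estimated from a learned network; the function name `bootstrap_spread` and the synthetic data are assumptions made for this example only.

```python
import numpy as np

def bootstrap_spread(values, statistic, k, rng):
    """Standard deviation of `statistic` over k bootstrap resamples of `values`."""
    n = len(values)
    estimates = [statistic(values[rng.integers(0, n, size=n)]) for _ in range(k)]
    return float(np.std(estimates, ddof=1))

rng = np.random.default_rng(42)
# Synthetic values for illustration only, not taken from a real data set.
prices = rng.normal(loc=500.0, scale=100.0, size=94)
spread = bootstrap_spread(prices, np.mean, k=200, rng=rng)
print(f"Bootstrap standard deviation of the mean: {spread:.2f}")
```

A small spread suggests the estimate is stable with respect to the sampling of the data; a large one suggests it depends heavily on which observations happen to be present.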

The maximum number of data sets that can be generated with Jackknife and K-Fold is N. This configuration is known as Leave-One-Out Cross-Validation.

Bootstrap and Data Perturbation have no such limitation and can be used to generate an arbitrary number of data sets.

Since the data sets created with Jackknife and K-Fold contain fewer observations than D, the structural coefficient αk is updated (see Arc Confidence) to take into account that Nk < N. The number of prior samples, if any, is also updated using the same equation.

Once Dk is generated, all the Continuous Variables that have not been discretized manually are re-discretized. The discretization is thus another source of instability.

If you want to exclude this source of instability and only measure the variability associated with the learning method, right-click the data set icon in the lower right corner of the Graph Panel and select Remove Associated Discretization Type.
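To see why re-discretization adds variability of its own, the following sketch recomputes three-bin cut points on several bootstrap resamples of a synthetic variable. It uses simple equal-frequency binning as a stand-in and does not reproduce BayesiaLab's discretization algorithms; the function name `tercile_edges` and the data are assumptions.

```python
import numpy as np

def tercile_edges(values):
    """Inner cut points of a three-bin equal-frequency discretization."""
    return np.quantile(values, [1 / 3, 2 / 3])

rng = np.random.default_rng(7)
# Synthetic continuous variable, for illustration only.
data = rng.lognormal(mean=6.0, sigma=0.5, size=94)

# Recompute the cut points on 50 bootstrap resamples of the same variable.
edges_per_resample = np.array([
    tercile_edges(data[rng.integers(0, len(data), size=len(data))])
    for _ in range(50)
])

# Standard deviation of each cut point across the resamples.
edge_spread = edges_per_resample.std(axis=0)
print("Spread of lower and upper cut points:", edge_spread)
```

Because the cut points move from one resample to the next, the same raw value can fall into different bins in different data sets, which is a source of variation independent of the learning method.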

# History

Cross-Validation has been updated in versions 5.0.2, 5.0.4, 5.1, 5.3, 5.4 and 6.0.

# Moved Feature

The Structural Coefficient Analysis is a tool that is not based on data sampling but rather on multiple runs on the same data set. It has thus been moved to the new Multi-Run menu.

# New Feature

The Multi-Target Evaluation has been added to the set of tools that can be used for analyzing the K learned networks Bk.

As of version 7.0, there are thus four available kinds of resampling analysis:

# Example

Let's use the same data set to illustrate these four analyses. This data set contains house sale prices for King County, which includes Seattle. It describes homes sold between May 2014 and May 2015. More precisely, we have extracted the 94 houses that are more than 100 years old, have been renovated, and come with a basement.

All the continuous variables have been discretized into three bins, with R2-GenOpt.

Given the small number of observations in this data set, we set five prior samples for the Smoothed Probability Estimation in order to utilize a non-informative prior in the estimation of the parameters.

Based on the Multi-Run Analysis for selecting the best Structural Coefficient, we set the Structural Coefficient to 0.567, and used EQ to learn the network below.

The colors of the nodes correspond to the four clusters of variables, and Price (K\$) is the Target Node.