Question

Do we always need to bin the continuous variables or is BayesiaLab able to directly handle these variables?

Answer

Continuous variables always have to be binned, i.e., discretized. BayesiaLab offers a broad set of tools for discretizing continuous variables. These tools are available in the Data Import Wizard, in the Node Editor, and through the Re-Discretization function. The available methods are:

  • Decision Tree (supervised discretization when you have a target variable)
  • Density Approximation (for fitting the continuous density function)
  • K-Means (another way to try fitting the continuous density function)
  • Equal Distances
  • Normalized Equal Distances (to prevent the negative impact of outliers)
  • Equal Frequencies
  • Manual Discretization (to let you set the bins directly on the distribution or density function)

Example

Let's take the example of the continuous variable C104, which is selected in the screenshot above.

Selecting Manual Discretization displays the distribution function.

Clicking on the Switch View button displays the density function.

The default threshold, indicated by the blue vertical line, is set to the mean value of the variable.

Manual Discretization

Thresholds can be added and removed by right-clicking on the graph, and can be modified by selecting them.

An automatic discretization algorithm can be selected either with the Type button or, in Manual mode, by clicking Generate a Discretization:

  • Type: the selected discretization algorithm is applied only at the end of the import process.
  • Generate a Discretization: the discretization algorithm is run and the computed thresholds are displayed in the graph.

Decision Tree

Here we have a discrete target variable Y. The decision tree algorithm searches for the thresholds that maximize the mutual information between Y and the binned C104. It also finds the optimal number of thresholds to use, given the relationship between C104 and Y and the number of rows in the dataset.

We ask for 5 intervals (and thus 4 thresholds).

The Decision Tree returns only 2 bins.

The threshold is set to 461.
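
The idea of choosing a threshold by mutual information can be illustrated with a minimal Python sketch. This is not BayesiaLab's implementation: the function names `mutual_information` and `best_threshold` and the toy data are hypothetical, and the sketch only searches for a single split (2 bins), whereas the actual algorithm also decides how many thresholds to keep.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


def best_threshold(values, target):
    """Try every midpoint between distinct sorted values and keep the one
    whose two-bin split shares the most information with the target."""
    candidates = sorted(set(values))
    best_t, best_mi = None, 0.0
    for a, b in zip(candidates, candidates[1:]):
        t = (a + b) / 2
        binned = [v > t for v in values]  # False = lower bin, True = upper bin
        mi = mutual_information(binned, target)
        if mi > best_mi:
            best_t, best_mi = t, mi
    return best_t, best_mi


# Toy data (hypothetical): low values go with "a", high values with "b".
values = [1, 2, 3, 4, 10, 11, 12, 13]
target = ["a", "a", "a", "a", "b", "b", "b", "b"]
print(best_threshold(values, target))  # (7.0, 1.0)
```

A result with a single threshold, as in the C104 example, corresponds to the case where further splits do not add enough information about Y to be worth keeping.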

Density Approximation

This algorithm will analyze the density function of C104 to find the best thresholds.

We ask for 10 intervals (and thus 9 thresholds).

This algorithm is also able to find the optimal number of bins to approximate the density function.

K-Means

This algorithm applies the K-Means data clustering algorithm to the C104 data only.

We ask for 5 intervals.
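
As a rough illustration, one-dimensional K-Means can be turned into bin thresholds as sketched below. The function name `kmeans_1d_thresholds`, the deterministic initialization, and the choice of placing each threshold halfway between neighboring cluster centers are all assumptions made for this sketch; BayesiaLab may derive its thresholds differently.

```python
def kmeans_1d_thresholds(values, k, max_iter=100):
    """Cluster 1-D data with Lloyd's algorithm and return the midpoints
    between adjacent cluster centers as bin thresholds."""
    data = sorted(values)
    n = len(data)
    # Assumed initialization: spread the starting centers over the sorted data.
    centroids = [data[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for v in data:
            nearest = min(range(k), key=lambda j: abs(v - centroids[j]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous center.
        new_centroids = [
            sum(c) / len(c) if c else centroids[j] for j, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    centers = sorted(set(centroids))
    return [(a + b) / 2 for a, b in zip(centers, centers[1:])]


# Toy data (hypothetical) with three obvious groups.
print(kmeans_1d_thresholds([1, 2, 3, 10, 11, 12, 20, 21, 22], k=3))  # [6.5, 16.0]
```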

Normalized Equal Distances

This algorithm first applies a smoothing algorithm to "clean the outliers" and then computes equal distances on the result.

We ask for 5 intervals.
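
The smoothing step is not specified here, so the sketch below substitutes a simple stand-in: it ignores a fixed fraction of the points at each end of the sorted data before computing equal-width intervals. The function name `trimmed_equal_distance_thresholds` and the `trim` parameter are hypothetical and only illustrate why cleaning outliers first gives more useful bins.

```python
def trimmed_equal_distance_thresholds(values, n_bins, trim=0.1):
    """Equal-width thresholds computed on the data range after ignoring
    a fraction `trim` of the points at each end (assumed outlier cleaning)."""
    data = sorted(values)
    k = int(len(data) * trim)  # number of points ignored at each end
    core = data[k:len(data) - k] if k else data
    lo, hi = core[0], core[-1]
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(1, n_bins)]


# Toy data (hypothetical): eleven regular values plus one extreme outlier.
values = list(range(0, 101, 10)) + [1000]
print(trimmed_equal_distance_thresholds(values, 5))  # [28.0, 46.0, 64.0, 82.0]
```

Without the trimming step, the same data would yield thresholds at 200, 400, 600, and 800, leaving almost every point in the first bin.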

Equal Distances

This algorithm computes equal-width intervals directly from the range of the variable.

We ask for 5 intervals.

Caution

This algorithm is sensitive to outliers.
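
The effect of an outlier on equal-width bins can be seen in a small Python sketch. The helper names `equal_distance_thresholds` and `bin_counts` and the toy data are hypothetical, and the convention that a value equal to a threshold falls in the lower bin is an assumption of the sketch.

```python
from bisect import bisect_left


def equal_distance_thresholds(values, n_bins):
    """Return n_bins - 1 thresholds that split [min, max] into equal widths."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(1, n_bins)]


def bin_counts(values, thresholds):
    """Count the points per bin; a value equal to a threshold goes to the lower bin."""
    counts = [0] * (len(thresholds) + 1)
    for v in values:
        counts[bisect_left(thresholds, v)] += 1
    return counts


values = list(range(0, 101, 10))  # 0, 10, ..., 100
thresholds = equal_distance_thresholds(values, 5)
print(thresholds, bin_counts(values, thresholds))
# [20.0, 40.0, 60.0, 80.0] [3, 2, 2, 2, 2]

with_outlier = values + [1000]  # a single extreme value stretches the range
thresholds = equal_distance_thresholds(with_outlier, 5)
print(thresholds, bin_counts(with_outlier, thresholds))
# [200.0, 400.0, 600.0, 800.0] [11, 0, 0, 0, 1]
```

With the outlier included, three of the five bins are empty and nearly all of the data lands in the first one.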

Equal Frequencies

This algorithm defines the thresholds so that each bin contains, as far as possible, the same number of points.

We ask for 5 intervals.
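
A minimal sketch of equal-frequency binning is shown below. The function name `equal_frequency_thresholds` is hypothetical, and placing each threshold halfway between two neighboring sorted values is an assumption of the sketch rather than a documented BayesiaLab rule. When many values are identical, the bins cannot be perfectly balanced, which is why the method only tries to equalize the counts.

```python
def equal_frequency_thresholds(values, n_bins):
    """Return thresholds that put roughly len(values) / n_bins points in each bin."""
    data = sorted(values)
    n = len(data)
    thresholds = []
    for i in range(1, n_bins):
        cut = i * n // n_bins  # index of the first point of the next bin
        if cut == 0:
            continue  # fewer points than bins
        t = (data[cut - 1] + data[cut]) / 2
        # Skip duplicate thresholds caused by repeated values.
        if not thresholds or t > thresholds[-1]:
            thresholds.append(t)
    return thresholds


# Toy data (hypothetical): the outlier 1000 does not distort the bins.
print(equal_frequency_thresholds([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000], 5))
# [2.5, 4.5, 6.5, 8.5]
```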

You can use these discretization tools during the data import process and also once the network has been created. In the latter case, we speak of "rediscretization". This allows you to try different binning schemes (discretization method and number of bins) and to select the one that yields the best model.