Tree is one of the Automatic Discretization algorithms for Continuous variables in Step 4 — Discretization and Aggregation of the Data Import Wizard.
Tree is a bivariate discretization method: it machine-learns a decision tree that uses the to-be-discretized variable to represent the conditional probability distributions of the Target variable. Once the tree is learned, it is analyzed to extract the most useful thresholds.
It is the method of choice in the context of Supervised Learning, i.e., if you plan to machine-learn a model to predict the Target variable.
At the same time, we do not recommend using Tree in the context of Unsupervised Learning. The Tree algorithm creates bins that are biased toward the designated Target variable. Naturally, emphasizing one particular variable would run counter to the intent of Unsupervised Learning.
Note that if the to-be-discretized variable is independent of the selected Target variable, it will be impossible to build a tree, and BayesiaLab will prompt you to select a univariate discretization algorithm.
All manually discretized variables can be used as a Target variable for Tree discretization.
Using a Target variable for Discretization does not create a Target Node in the network.
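The core of the approach — growing a tree on the Target variable and harvesting its split points as bin thresholds — can be sketched in plain Python. This is an illustrative approximation, not BayesiaLab's implementation; all function names are hypothetical:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(pairs):
    """Return the threshold on x that maximizes information gain on y."""
    pairs = sorted(pairs)
    y = [label for _, label in pairs]
    base = entropy(y)
    best_gain, best_thr = 0.0, None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between equal values
        left, right = y[:i], y[i:]
        gain = base - (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        if gain > best_gain:
            best_gain, best_thr = gain, (pairs[i - 1][0] + pairs[i][0]) / 2
    return best_thr

def tree_thresholds(pairs, max_bins=4):
    """Recursively split, collecting thresholds until no informative split remains."""
    if max_bins <= 1:
        return []
    thr = best_split(pairs)
    if thr is None:
        return []
    left = [p for p in pairs if p[0] <= thr]
    right = [p for p in pairs if p[0] > thr]
    return sorted(tree_thresholds(left, max_bins - 1) + [thr] +
                  tree_thresholds(right, max_bins - 1))

# The target flips at x = 5, so a single threshold suffices:
data = [(x, 'A' if x < 5 else 'B') for x in range(10)]
print(tree_thresholds(data))  # → [4.5]
```

Note that when the to-be-discretized variable is independent of the Target, no split yields positive information gain and the sketch returns no thresholds — mirroring the behavior described above, where BayesiaLab prompts for a univariate algorithm instead.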
Automatic Discretization covers numerous discretization algorithms that are part of Step 4 — Discretization and Aggregation of the Data Import Wizard.
Except for Manual, all items in the Type menu represent Automatic Discretization algorithms.
Most of these algorithms can also be accessed via the Generate a Discretization function within the Manual Discretization screen.
Selecting a Discretization algorithm applies variable by variable, i.e., you can use a different algorithm for each Continuous variable.
To select a variable, click on the variable header or anywhere inside the column.
You can perform the selection and deselection of multiple variables with keystroke combinations commonly used in spreadsheet editing:
Ctrl+Click: add a variable to the current selection.
Shift+Click: add all variables between the currently selected and the clicked variable to the selection.
Ctrl+A: select all variables in the Data panel. However, selecting all variables is not useful here in Step 4, as there are no actions that can apply to all variable types.
Shift+End: select all variables from the currently selected variable to the rightmost variable in the table.
Shift+Home: select all variables from the currently selected variable to the leftmost variable in the table.
Click the Select All Continuous button to select all Continuous variables.
Note that this action also selects any variables you have already discretized manually. As a result, you may overwrite your previous choices.
Note that Continuous variables already discretized manually are highlighted in soft blue.
If a variable is neither manually discretized nor assigned an algorithm, the default Discretization algorithm with its default settings will be used.
You can set the default Discretization algorithm under Main Menu > Window > Preferences > Discretization.
For the following algorithms, a Log Transformation is available as an option:
Applying the Log Transformation is useful if you have a high density of values at the bottom end of the variable domain. This "stretches" the scale for small values approaching zero.
Note that the Log Transformation is only used temporarily for discretization purposes. Thus, the thresholds and interval values can all be interpreted on the original scale.
For the following algorithms, the option Isolate Zeros is available:
Separating 0 into a separate interval can be useful for zero-inflated distributions so as to clearly separate small values from "absolutely nothing."
Click Finish to perform the Discretization.
A progress bar displays the status of the Discretization process:
If a Filtered Value is defined for a Continuous variable, a new artificial interval with an infinitesimally small width of 10⁻⁷ will be added after the intervals defined in this step. This newly created state serves as the Filtered State, with the asterisk character "*" as its State Name.
At its conclusion, BayesiaLab opens up a Graph Window with all imported variables now represented as nodes.
Simultaneously, a window pops up that offers you an optional Import Report, which is Step 5 of the Data Import Wizard.
Perturbed Tree is one of the Automatic Discretization algorithms for Continuous variables in Step 4 — Discretization and Aggregation of the Data Import Wizard.
The Perturbed Tree algorithm is designed to optimize the representation of the probabilistic dependency between a Target variable and the to-be-discretized variable. It is an extension of the Tree discretization algorithm, and it functions as follows:
Data Perturbation generates a range of datasets.
For each perturbed dataset, a tree is learned to predict the Target variable.
Extracting the most frequent thresholds produces the final discretization.
The Perturbed Tree algorithm takes into account the Minimum Interval Weight and can reduce the number of bins if necessary. It can also be more robust than the simple Tree discretization.
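The resample-and-vote idea can be sketched as follows. This is a minimal illustration under stated assumptions — bootstrap resampling stands in for Data Perturbation, and `stump` is a toy stand-in for the single-tree discretizer; none of these names are BayesiaLab APIs:

```python
import random
from collections import Counter

def perturbed_thresholds(pairs, find_thresholds, n_datasets=50, seed=0):
    """Bootstrap-resample the data, discretize each replicate, and keep the
    threshold(s) that recur most often across replicates."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_datasets):
        sample = [rng.choice(pairs) for _ in pairs]   # one perturbed dataset
        for thr in find_thresholds(sample):
            votes[round(thr, 1)] += 1                 # group nearby thresholds
    if not votes:
        return []
    top = max(votes.values())
    return sorted(t for t, c in votes.items() if c >= top)

def stump(sample):
    """Toy single-threshold finder: midpoint between the two class means."""
    a = [x for x, y in sample if y == 'A']
    b = [x for x, y in sample if y == 'B']
    if not a or not b:
        return []
    return [(sum(a) / len(a) + sum(b) / len(b)) / 2]

data = [(x, 'A' if x < 5 else 'B') for x in range(10)]
print(perturbed_thresholds(data, stump))
```

Because thresholds must be confirmed across many perturbed replicates, spurious splits that appear in only one resampling are voted out, which is why this scheme can be more robust than a single tree.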
Supervised Multivariate is one of the Automatic Discretization algorithms for Continuous variables in Step 4 — Discretization and Aggregation of the Data Import Wizard.
The Supervised Multivariate discretization algorithm focuses on representing the multivariate probabilistic dependencies involving a Target variable.
It utilizes Random Forests to find the most useful thresholds for predicting the Target variable.
Its function can be summarized as follows:
Data Perturbation generates a range of datasets.
For each perturbed dataset, a multivariate tree is learned to predict the Target variable with a subset of variables. If a structure is already defined, it is used to bias the selection of the variables for each dataset.
Extracting the most frequent thresholds produces the final discretization.
The Supervised Multivariate algorithm takes into account the Minimum Interval Weight and can improve the generalization capability of the model.
Being based on Random Forests, this algorithm is computationally expensive and stochastic by nature.
After the conclusion of the Data Import Wizard, the Supervised Multivariate discretization algorithm is also available from Main Menu > Learning > Discretization.
Note that the Supervised Multivariate discretization algorithm is not available via Node Context Menu > Node Editor > States > Curve > Generate a Discretization.
K-Means is one of the Automatic Discretization algorithms for Continuous variables in Step 4 — Discretization and Aggregation of the Data Import Wizard.
The K-Means algorithm is based on the classical K-Means data clustering algorithm but uses only one dimension, which is the to-be-discretized variable.
K-Means returns a discretization that directly depends on the Probability Density Function of the variable.
More specifically, it employs the Expectation-Maximization algorithm with the following steps:
Initialization: random creation of K centers
Expectation: each point is associated with the closest center
Maximization: each center position is computed as the barycenter of its associated points
Steps 2 and 3 are repeated until convergence is reached.
Based on the K centers, the discretization thresholds are defined as the midpoints between adjacent centers.
The following figure illustrates how the algorithm works with K=3.
For example, applying a three-bin K-Means Discretization to a normally distributed variable would create a central bin representing approximately 50% of the data points and two tail bins of roughly 25% each.
Without a Target variable, or if little else is known about the variation domain and distribution of the Continuous variables, K-Means is recommended as the default method.
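The EM loop described above can be sketched in a few lines of Python. This is an illustrative one-dimensional implementation, not BayesiaLab's; the function name is hypothetical:

```python
import random

def kmeans_1d(values, k=3, iters=100, seed=0):
    """One-dimensional K-Means via the EM steps above; returns the
    discretization thresholds as midpoints between adjacent centers."""
    rng = random.Random(seed)
    centers = rng.sample(values, k)                   # 1. Initialization
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:                              # 2. Expectation
            nearest = min(range(k), key=lambda j: abs(v - centers[j]))
            clusters[nearest].append(v)
        new = [sum(c) / len(c) if c else centers[i]   # 3. Maximization
               for i, c in enumerate(clusters)]
        if new == centers:                            # convergence reached
            break
        centers = new
    centers.sort()
    return [(a + b) / 2 for a, b in zip(centers, centers[1:])]

# Three well-separated clusters yield two thresholds between them:
values = [1, 2, 3, 10, 11, 12, 20, 21, 22]
print(kmeans_1d(values, k=3))
```

Because the centers settle where the data is dense, the resulting thresholds track the variable's Probability Density Function, as noted above.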
Density Approximation is one of the Automatic Discretization algorithms for Continuous variables in Step 4 — Discretization and Aggregation of the Data Import Wizard.
The Density Approximation algorithm detects changes in the sign of the derivative of the Probability Density Function (PDF) in order to identify local minima and maxima.
The algorithm creates a threshold between each local minimum and maximum.
It also automatically detects the optimal number of bins, although you can specify the maximum number of bins.
The minimum permitted bin size is 1% of the data points.
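The sign-change idea can be illustrated on a pre-binned density estimate: a threshold goes wherever the density has a local minimum (a valley between two modes). This is a simplified sketch, not BayesiaLab's implementation:

```python
def density_thresholds(counts, bin_edges):
    """Place a threshold at each local minimum of a (pre-binned) density
    estimate, i.e., where the density's slope changes from negative to
    positive."""
    thresholds = []
    for i in range(1, len(counts) - 1):
        if counts[i] < counts[i - 1] and counts[i] <= counts[i + 1]:
            thresholds.append(bin_edges[i])  # valley between two modes
    return thresholds

# A bimodal histogram with modes near bins 1 and 5 and a valley at bin 3:
counts = [2, 8, 5, 1, 4, 9, 3]
edges = [0, 1, 2, 3, 4, 5, 6, 7]
print(density_thresholds(counts, edges))  # → [3]
```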
Equal Distance is one of the Automatic Discretization algorithms for Continuous variables in Step 4 — Discretization and Aggregation of the Data Import Wizard.
The Equal Distance algorithm computes equal-width intervals based on the range of the variable.
This method is particularly useful for discretizing variables that share the same variation domain (e.g. satisfaction measures in surveys).
Additionally, this method is suitable for obtaining a discrete representation of the density function.
However, the Equal Distance algorithm is extremely sensitive to outliers and can generate intervals that do not contain any data points. Please see the Normalized Equal Distance algorithm, which addresses this particular issue.
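A minimal sketch makes both the method and its outlier sensitivity visible (illustrative code, not BayesiaLab's implementation):

```python
def equal_distance_thresholds(values, bins):
    """Equal-width thresholds over the variable's full range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    return [lo + i * width for i in range(1, bins)]

print(equal_distance_thresholds([0, 1, 2, 3, 4], 4))    # → [1.0, 2.0, 3.0]
# A single outlier stretches the range, leaving most bins nearly empty:
print(equal_distance_thresholds([0, 1, 2, 3, 100], 4))  # → [25.0, 50.0, 75.0]
```

In the second call, all but one data point fall into the first bin, which is exactly the failure mode the Normalized Equal Distance algorithm is designed to address.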
Normalized Equal Distance is one of the Automatic Discretization algorithms for Continuous variables in Step 4 — Discretization and Aggregation of the Data Import Wizard.
The Normalized Equal Distance algorithm pre-processes the data with a smoothing algorithm to remove outliers before computing equal partitions.
As a result, the algorithm is less sensitive to outliers than the Equal Distance algorithm.
The algorithm also takes into account the Minimum Interval Weight that defines the minimum prior probability of a bin.
You can adjust the default Minimum Interval Weight under Main Menu > Window > Preferences > Discretization.
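One plausible reading of the pre-processing step can be sketched as follows — here, clipping the distribution's tails stands in for the smoothing algorithm (an assumption; the function name and the trimming scheme are illustrative, not BayesiaLab's):

```python
def normalized_equal_distance(values, bins, trim=0.05):
    """Clip the distribution's tails (a stand-in for the smoothing step)
    before computing equal-width thresholds over the trimmed range."""
    s = sorted(values)
    n = len(s)
    lo = s[int(trim * n)]                      # 5th-percentile value
    hi = s[min(n - 1, int((1 - trim) * n))]    # 95th-percentile value
    width = (hi - lo) / bins
    return [lo + i * width for i in range(1, bins)]

data = list(range(100)) + [10_000]             # one extreme outlier
print(normalized_equal_distance(data, 4))      # → [27.5, 50.0, 72.5]
```

Unlike plain Equal Distance, the outlier at 10,000 no longer dictates the bin width, so the thresholds stay within the bulk of the data.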
Unsupervised Multivariate is one of the Automatic Discretization algorithms for Continuous variables in Step 4 — Discretization and Aggregation of the Data Import Wizard.
This multivariate discretization method is based on analyzing the relationship between variables.
The Unsupervised Multivariate discretization algorithm focuses on representing multivariate probabilistic dependencies using Random Forests.
Its functionality can be described as follows:
A new dataset is created as a clone of the original one.
In this new dataset, each variable is independently shuffled to render all the variables independent while keeping the same statistics for each variable.
The cloned dataset is concatenated with the original dataset. Then, a target variable is created to differentiate the clone from the original, indicating the independent set versus the original dependent set.
Various datasets are generated from this concatenated dataset with Data Perturbation.
For each perturbed dataset, a multivariate tree is learned to predict the target variable with a subset of variables. If a structure is already defined, it is used to bias the selection of the variables for each dataset.
Extracting the most frequent thresholds produces the discretization.
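The clone-shuffle-label construction in the first three steps can be sketched directly (illustrative code under stated assumptions; the function name is hypothetical and the tree-learning steps are omitted):

```python
import random

def make_discrimination_dataset(rows, seed=0):
    """Clone the rows, shuffle each column of the clone independently to
    break every dependency while preserving each column's own statistics,
    then append an artificial target column: 1 for original rows, 0 for
    shuffled ones."""
    rng = random.Random(seed)
    shuffled_cols = []
    for col in zip(*rows):            # column-wise view of the clone
        col = list(col)
        rng.shuffle(col)              # destroys cross-column dependencies only
        shuffled_cols.append(col)
    clone = [list(r) for r in zip(*shuffled_cols)]
    return [list(r) + [1] for r in rows] + [r + [0] for r in clone]

rows = [[1, 10], [2, 20], [3, 30]]
combined = make_discrimination_dataset(rows)
print(len(combined))  # → 6 rows; the last column is the artificial target
```

Any tree that can tell the original rows from the shuffled ones must exploit dependencies between variables, so its thresholds capture exactly the multivariate structure this method targets.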
Being based on Random Forests, this algorithm is computationally expensive and stochastic by nature, especially when the number of variables is large.
The Unsupervised Multivariate discretization algorithm is also available after the data import via Main Menu > Learning > Discretization.
However, it is not available in the Node Editor (Node Context Menu > Edit > Curve > Generate a Discretization).
R2-GenOpt is one of the Automatic Discretization algorithms for Continuous variables in Step 4 — Discretization and Aggregation of the Data Import Wizard.
The R2-GenOpt algorithm utilizes a Genetic Algorithm to find a discretization that maximizes the R2 between the discretized variable and its corresponding (hidden) Continuous variable.
As such, it is the optimal approach for achieving the first objective of discretization, i.e., finding a precise representation of the values of a Continuous variable.
This algorithm takes into account the Minimum Interval Weight and can also create a specific bin for representing zeros if the Isolate Zeros option is set.
In Validation Mode, the R2 value between the discretized variable and its corresponding Continuous variable can be retrieved by hovering over the monitor while in Information Mode.
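The objective being maximized can be illustrated by computing R2 for a candidate set of thresholds. Note the assumption here: each value is represented by the mean of its bin, which is one common reading of "precise representation"; the function name is hypothetical, and the Genetic Algorithm search itself is omitted:

```python
def r2_of_discretization(values, thresholds):
    """R2 between a variable and its discretized version, with each value
    replaced by the mean of its bin (assumed interpretation of the
    criterion that R2-GenOpt maximizes)."""
    def bin_index(v):
        return sum(v > t for t in thresholds)
    bins = {}
    for v in values:
        bins.setdefault(bin_index(v), []).append(v)
    means = {i: sum(b) / len(b) for i, b in bins.items()}
    approx = [means[bin_index(v)] for v in values]
    mean_all = sum(values) / len(values)
    ss_tot = sum((v - mean_all) ** 2 for v in values)
    ss_res = sum((v - a) ** 2 for v, a in zip(values, approx))
    return 1 - ss_res / ss_tot

# A threshold at 5 separates the two clusters almost perfectly:
print(r2_of_discretization([1, 2, 9, 10], [5]))  # → ≈ 0.9846
```

A Genetic Algorithm would evaluate many candidate threshold sets with a score like this one and keep the fittest, subject to the Minimum Interval Weight constraint.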
R2-GenOpt* is one of the Automatic Discretization algorithms for Continuous variables in Step 4 — Discretization and Aggregation of the Data Import Wizard.
R2-GenOpt* is a modified version of R2-GenOpt and uses a specific MDL score to choose the number of bins.
With 100 observations, even though we selected 8 bins, only 3 were created for the variable 8- Wrist girth.
With 1,500 observations, even though we selected 10 bins, only 5 were created for AGN, and 6 for ALL.
Equal Frequency is one of the Automatic Discretization algorithms for Continuous variables in Step 4 — Discretization and Aggregation of the Data Import Wizard.
The Equal Frequency algorithm defines thresholds so that each interval contains the same number of observations.
This approach typically produces a uniform distribution.
As a result, the shape of the original density function is no longer apparent upon discretization.
This also leads to an artificial increase in the entropy of the system, directly affecting the complexity of machine-learned models.
However, this type of discretization can be useful — once a structure is learned — for further increasing the precision of the representation of continuous values.
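The quantile-based computation can be sketched as follows (illustrative code, not BayesiaLab's implementation; with ties in the data, bin counts are only approximately equal):

```python
def equal_frequency_thresholds(values, bins):
    """Quantile-based thresholds so each interval holds (roughly) the same
    number of observations."""
    s = sorted(values)
    n = len(s)
    return [s[(i * n) // bins] for i in range(1, bins)]

# Thresholds fall at the quartiles regardless of how skewed the values are:
data = [1, 2, 2, 3, 50, 60, 700, 800]
print(equal_frequency_thresholds(data, 4))  # → [2, 50, 700]
```

Note how the widely unequal interval widths flatten the original density into a near-uniform distribution, which is the entropy increase discussed above.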