1 of 3

Minimum Description Length Score

Definition

The Minimum Description Length Score (MDL Score) is derived from Information Theory and has been used extensively in the Artificial Intelligence community.

It consists of the sum of two components that estimate:

the minimum number of bits required to represent a model, and
the minimum number of bits required to represent the data given the model.

However, in the specific context of Bayesian networks, we need to explain the exact meaning and the notation of these two components:

Calculating Complexity: DL(B)Calculating Fit: DL(D|B)"the minimum number of bits required to represent a model" is denoted $DL\left( {B} \right)$ (="Description Length of the Bayesian network $B$ ") and refers to the structural complexity of the Bayesian network model $B$ , which includes the network graph and all probability tables.
- For brevity, we often use the shorthand "complexity" or "structure" to refer to $DL\left( {B} \right)$ .
- Small values of $DL\left( {B} \right)$ suggest a simple model structure, and large values a complex model.
- The goal of this structural part is to apply Occam's Razor, or the law of parsimony, i.e., to choose the simplest hypothesis, all other things being equal.
"the minimum number of bits required to represent the data given the model" is denoted $DL\left( {D|B} \right)$ (="Description Length of the data $D$ given the Bayesian network $B$ ") and refers to the likelihood of the data $D$ with respect to the Bayesia network model $B$ .
- The data likelihood is inversely proportional to the probability of the observed dataset, as inferred by the Bayesian network model.
- Put simply, $DL\left( {D|B} \right)$ refers to the "fit" of the model to the data.
- Small values of $DL\left( {D|B} \right)$ suggest a well-fitting model; large values, conversely, imply a poor fit.

BayesiaLab attempts to minimize the MDL Score by evaluating candidate networks during structural learning.

Learn More About Calculating Complexity & Fit

Calculating Complexity: DL(B)

Description Length of the Bayesian Network — DL(B)

$DL(B)$ is the number of bits required to represent a Bayesian network. We can break down this value into the sum of two components:

$DL(G)$ , which stands for the number of bits required to represent the graph G of the Bayesian network,
$DL(P|G)$ represents the number of bits required to represent the set of probability tables P.

$DL(B) = DL(G) + DL(P|G)$

Calculating DL(G)

To calculate $DL(G)$ , we need to determine the number of nodes and the number of their parent nodes.

DL(G) = \sum\limits_i^n {\left( {{{\log }_2}(n) + {{\log }_2}\left( {\begin{array}{*{20}{c}} n\\ {\left\| {P{a_i}} \right\|} \end{array}} \right)} \right)}

where

n is the number of random variables (nodes): ${X_1},...,{X_n}$
$P{a_i}$ is the set of the random variables that are parents of ${X_i}$ in graph G
and $P{a_i}$ is the number of parents of the random variable ${X_i}$ .

Calculating DL(P|G)

Computing $DL(P|G)$ is straightforward as it is proportional to the number of cells in all probability tables.

$DL(P|G) = \sum\limits_i^n {\left( {\prod\limits_j^{\left\| {P{a_i}} \right\|} {{S_j} \times ({S_i} - 1) \times DL(p)} } \right)}$

where

${{S}_{i}}$ is the number of states of the random variable ${X_i}$
$p$ is the probability associated with the cell.

As the probability p cannot be known prior to learning the network, we use the following classical heuristic in BayesiaLab:

$DL(p) = \frac{{{{\log }_2}(N)}}{2}$

Calculating Fit: DL(D|B)

To calculate the description length of the data given the Bayesian network, we utilize the fact that the description length is inversely proportional to the probability of the observed data inferred by the model.

where

The chain rule allows rewriting this equation with:

Calculating Fit: DL(D|B)

$\begin{array}{l} DL(D|B) = \sum\limits_{j = 1}^N {DL({e_j}|B)} \\ DL(D|B) = \sum\limits_{j = 1}^N {{{\log }_2}\left( {\frac{1}{{{P_B}({e_j})}}} \right)} \\ DL(D|B) = - \sum\limits_{j = 1}^N {{{\log }_2}\left( {{P_B}({e_j})} \right)} \end{array}$

where

${e_j}$ is the n-dimensional observation described in row ${j}$ , and
$PB\left( {{e_j}} \right)$ is the joint probability of this observation returned by the Bayesian network $B$ .

The chain rule allows rewriting this equation with:

$\begin{array}{l} DL(D|B) = - \sum\limits_{j = 1}^N {{{\log }_2}\left( {\prod\limits_{i = 1}^n {{P_B}({x_{ij}}|{\pi _{ij}})} } \right)} \\ DL(D|B) = - \sum\limits_{j = 1}^N {\sum\limits_{i = 1}^n {{{\log }_2}\left( {{P_B}({x_{ij}}|{\pi _{ij}})} \right)} } \end{array}$

Minimum Description Length Score

Definition

The Minimum Description Length Score (MDL Score) is derived from Information Theory and has been used extensively in the Artificial Intelligence community.

It consists of the sum of two components that estimate:

the minimum number of bits required to represent a model, and
the minimum number of bits required to represent the data given the model.

However, in the specific context of Bayesian networks, we need to explain the exact meaning and the notation of these two components:

Calculating Complexity: DL(B)Calculating Fit: DL(D|B)"the minimum number of bits required to represent a model" is denoted $DL\left( {B} \right)$ (="Description Length of the Bayesian network $B$ ") and refers to the structural complexity of the Bayesian network model $B$ , which includes the network graph and all probability tables.
- For brevity, we often use the shorthand "complexity" or "structure" to refer to $DL\left( {B} \right)$ .
- Small values of $DL\left( {B} \right)$ suggest a simple model structure, and large values a complex model.
- The goal of this structural part is to apply Occam's Razor, or the law of parsimony, i.e., to choose the simplest hypothesis, all other things being equal.
"the minimum number of bits required to represent the data given the model" is denoted $DL\left( {D|B} \right)$ (="Description Length of the data $D$ given the Bayesian network $B$ ") and refers to the likelihood of the data $D$ with respect to the Bayesia network model $B$ .
- The data likelihood is inversely proportional to the probability of the observed dataset, as inferred by the Bayesian network model.
- Put simply, $DL\left( {D|B} \right)$ refers to the "fit" of the model to the data.
- Small values of $DL\left( {D|B} \right)$ suggest a well-fitting model; large values, conversely, imply a poor fit.

BayesiaLab attempts to minimize the MDL Score by evaluating candidate networks during structural learning.

Learn More About Calculating Complexity & Fit

pageCalculating Complexity: DL(B)

pageCalculating Fit: DL(D|B)

Calculating Complexity: DL(B)

Description Length of the Bayesian Network — DL(B)

$DL(B)$ is the number of bits required to represent a Bayesian network. We can break down this value into the sum of two components:

$DL(G)$ , which stands for the number of bits required to represent the graph G of the Bayesian network,
$DL(P|G)$ represents the number of bits required to represent the set of probability tables P.

$DL(B) = DL(G) + DL(P|G)$

Calculating DL(G)

To calculate $DL(G)$ , we need to determine the number of nodes and the number of their parent nodes.

DL(G) = \sum\limits_i^n {\left( {{{\log }_2}(n) + {{\log }_2}\left( {\begin{array}{*{20}{c}} n\\ {\left\| {P{a_i}} \right\|} \end{array}} \right)} \right)}

where

n is the number of random variables (nodes): ${X_1},...,{X_n}$
$P{a_i}$ is the set of the random variables that are parents of ${X_i}$ in graph G
and $P{a_i}$ is the number of parents of the random variable ${X_i}$ .

Calculating DL(P|G)

Computing $DL(P|G)$ is straightforward as it is proportional to the number of cells in all probability tables.

$DL(P|G) = \sum\limits_i^n {\left( {\prod\limits_j^{\left\| {P{a_i}} \right\|} {{S_j} \times ({S_i} - 1) \times DL(p)} } \right)}$

where

${{S}_{i}}$ is the number of states of the random variable ${X_i}$
$p$ is the probability associated with the cell.

As the probability p cannot be known prior to learning the network, we use the following classical heuristic in BayesiaLab:

$DL(p) = \frac{{{{\log }_2}(N)}}{2}$