Question

BayesiaLab 5.0.7 provides great relationship analysis reports on individual models. I am trying to figure out how to run a classification and produce a relationship analysis report in BayesiaLab that compares a training data set with a test data set.

Say you have two data sets. The first (a-training.csv) is used to build the belief network with the data-learning feature of BayesiaLab. You then want to classify the test data set (a-test.csv) to find out:

  1. How well do the test features align with the trained model / belief network?
  2. Are there any anomalies? If so, why do they occur, and how can sensitivity analysis be used to examine the differences in mutual information, Pearson's correlation, etc.?

Answer

Supervised Learning

If you have a target node, you can use the Analysis - Network Performance – Target menu to evaluate the performance of the model on both the learning and test sets (global precision, confusion matrix, and Lift, Gain and ROC curves).
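To make two of these measures concrete, here is a minimal Python sketch that computes a confusion matrix and a global precision outside BayesiaLab. It assumes that "global precision" means the share of rows whose predicted class matches the actual class; the labels and the function names are hypothetical and only serve as an illustration.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count (actual, predicted) label pairs; keys are tuples of class labels."""
    return Counter(zip(actual, predicted))

def global_precision(actual, predicted):
    """Share of rows whose predicted class equals the actual class (assumed definition)."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

# Hypothetical labels for a five-row test set, for illustration only.
test_actual = ["yes", "no", "yes", "no", "yes"]
test_predicted = ["yes", "no", "no", "no", "yes"]

matrix = confusion_matrix(test_actual, test_predicted)
precision = global_precision(test_actual, test_predicted)
```

Running the same two functions on the learning set and on the test set gives a quick check on whether the model performs noticeably worse on data it has not seen.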

 

Unsupervised Structural Learning

If you do not have a target, you can use the Analysis - Network Performance – Overall menu to compare the joint probability distributions returned by your Bayesian belief network (BBN) on the learning and test sets.
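One possible way to summarize such a comparison in a single number is the mean log-loss, the average of -log2(p) over the rows of each set. The Python sketch below is only an illustration under that assumption and is not a description of what BayesiaLab computes internally; the probabilities are made up.

```python
import math

def mean_log_loss(joint_probabilities):
    """Average of -log2(p) over rows; a lower value means the rows are better explained."""
    total = sum(-math.log2(p) for p in joint_probabilities)
    return total / len(joint_probabilities)

# Hypothetical per-row joint probabilities, for illustration only.
learning_joint = [0.25, 0.125, 0.25, 0.5]
test_joint = [0.125, 0.0625, 0.25, 0.03125]

learning_loss = mean_log_loss(learning_joint)
test_loss = mean_log_loss(test_joint)

# A large positive gap suggests that the test rows fit the network less well.
gap = test_loss - learning_loss
```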

Outliers/Anomalies Detection

The first solution consists of:

  1. Associate your test data set with your network.
  2. Run the Overall Performance analysis and export the tail of the distribution to a file.
  3. Associate this new data set (the tail) with your network and use interactive inference to analyze, line by line, why the joint probability is low (right-click the monitor panel and choose the Sort by less probable evidence menu). The first monitor corresponds to the variable that lowers the joint probability the most.
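The idea of "the tail of the distribution" in step 2 can be sketched in Python as follows. This is a rough illustration, assuming the tail is simply the fraction of rows with the smallest joint probability; the function name, the "joint" key and the sample rows are hypothetical.

```python
def lowest_joint_rows(rows, fraction):
    """Return the given fraction of rows with the smallest joint probability."""
    count = max(1, int(len(rows) * fraction))  # always keep at least one row
    return sorted(rows, key=lambda row: row["joint"])[:count]

# Hypothetical test rows with their joint probability, for illustration only.
test_rows = [
    {"id": 1, "joint": 0.02},
    {"id": 2, "joint": 0.0004},
    {"id": 3, "joint": 0.01},
    {"id": 4, "joint": 0.00001},
]

tail = lowest_joint_rows(test_rows, fraction=0.5)
tail_ids = [row["id"] for row in tail]
```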

The second solution consists of:

  1. Save your test data set together with the corresponding joint probability (Inference – Batch Joint Probability menu).
  2. Open this file in Excel.
  3. Sort the data by joint probability.
  4. Choose a threshold for the joint probability and define a Boolean variable indicating whether the joint probability is below that threshold.
  5. Import this data set and run supervised learning to characterize this new target variable. Using the Markov Blanket lets you identify the variables that explain why the joint probability is low.
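Steps 2 to 4 can also be scripted instead of done by hand in Excel. The Python sketch below is a minimal illustration, assuming the exported file is a comma-separated file with a column holding the joint probability; the column names "joint" and "low_joint", the threshold and the sample data are all hypothetical.

```python
import csv
import io

def flag_low_joint(csv_text, threshold, joint_column="joint", flag_column="low_joint"):
    """Sort rows by joint probability and append a Boolean low-probability flag."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = sorted(reader, key=lambda row: float(row[joint_column]))
    for row in rows:
        # "True" when the joint probability is below the chosen threshold.
        row[flag_column] = str(float(row[joint_column]) < threshold)
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=reader.fieldnames + [flag_column], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

# Hypothetical export of a test set with its joint probability, for illustration only.
sample = "age,income,joint\n30,high,0.02\n45,low,0.0004\n52,high,0.01\n"
flagged = flag_low_joint(sample, threshold=0.001)
flagged_lines = flagged.splitlines()
```

The resulting text can be saved to a file and imported for step 5, with the new flag column used as the target variable.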

The first solution diagnoses, line by line, why the joint probability is low.

The second finds a model that characterizes why a set of lines has a low joint probability.