Step 2 — Definition of Variable Types

Context

In Step 2 of the five-step Data Import Wizard, you define the type of each variable.
Step 2 contains four panels that relate to each other in their content and available actions.

Overview of Elements in Step 2

Type

With the radio buttons in the Type panel, you can define the type of each variable.
Before you start making your determinations, BayesiaLab has already made some guesses regarding the appropriate variable type, i.e., Discrete versus Continuous.
Furthermore, some variables have limited options regarding the variable type because of their distributions:
- If a variable has the same value for all observations, it falls into the Unused variable type. Such a not-distributed variable cannot be imported at all into BayesiaLab.
- Variables that contain any text values cannot be declared Continuous variables.
- Variables with Missing Values cannot be of the type Weight, Row Identifier, or Learning/Test.
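
The three restrictions above can be sketched as a small rule check. This is only an illustration of the rules as stated, not BayesiaLab's implementation: the function name `allowed_types`, the use of `None` for a Missing Value, and the choice to treat a Missing Value as its own distinct value when testing for a constant column are all assumptions of this sketch.

```python
from typing import Optional, Sequence, Set

ALL_TYPES = {"Discrete", "Continuous", "Weight",
             "Learning/Test", "Row Identifier", "Unused"}


def is_number(value: str) -> bool:
    """Return True if the cell text parses as a number."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def allowed_types(values: Sequence[Optional[str]]) -> Set[str]:
    """Return the variable types still available for one column.

    Hypothetical helper: None stands for a Missing Value.
    """
    # Same value for all observations: the variable can only be Unused.
    if len(set(values)) <= 1:
        return {"Unused"}

    allowed = set(ALL_TYPES)
    observed = [v for v in values if v is not None]

    # Any text value rules out the Continuous type.
    if any(not is_number(v) for v in observed):
        allowed.discard("Continuous")

    # Missing Values rule out Weight, Row Identifier, and Learning/Test.
    if len(observed) < len(values):
        allowed -= {"Weight", "Row Identifier", "Learning/Test"}

    return allowed
```

For example, `allowed_types(["1", None, "3"])` leaves only Discrete, Continuous, and Unused, because the column contains a Missing Value.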

Usage

To select a variable, click on the variable header or click anywhere inside the column in the Data panel.
You can perform the selection of multiple variables with keystroke combinations commonly used in spreadsheet editing:
- Ctrl+Click: add a variable to the current selection.
- Shift+Click: add all variables between the currently selected and the clicked variable to the selection.
- Ctrl+A: select all variables in the Data panel.
- Shift+End: select all variables from the currently selected variable to the rightmost variable in the table.
- Shift+Home: select all variables from the currently selected variable to the leftmost variable in the table.
The current selection is highlighted by showing the selected columns in a darker shade of their current color.

Discrete

The Discrete type considers each unique value of the variable a distinct state.
Any variable that contains text will be considered Discrete by default.
The maximum number of unique values that can be accommodated can be specified under Main Menu > Window > Preferences > Editing > Node > Maximum Number of States.

Continuous

The Continuous type applies to numerical variables, which must be discretized in Step 4 — Discretization and Aggregation.
If a variable contains integer values above a certain threshold, the variable will be considered Continuous.
You can specify this threshold under Main Menu > Window > Preferences > Data > Import & Associate > Threshold for Assuming Integers as Continuous. The default threshold value is 5.
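
The default guess between Discrete and Continuous can be approximated as follows. This is a rough sketch, not BayesiaLab's actual logic. It rests on two assumptions that the text does not confirm: that the threshold is compared with the number of distinct integer values in the column, and that numeric columns with non-integer values default to Continuous. The function name `guess_type` is hypothetical.

```python
from typing import Sequence


def guess_type(values: Sequence[str], integer_threshold: int = 5) -> str:
    """Guess Discrete vs. Continuous for a column of non-missing cells."""
    numbers = []
    for cell in values:
        try:
            numbers.append(float(cell))
        except ValueError:
            # Any text value makes the variable Discrete by default.
            return "Discrete"

    if all(x.is_integer() for x in numbers):
        # Assumption: the threshold applies to the count of distinct integers.
        distinct = len(set(numbers))
        return "Continuous" if distinct > integer_threshold else "Discrete"

    # Assumption: numeric columns with decimal values default to Continuous.
    return "Continuous"
```

Under these assumptions, a 0/1 indicator column is guessed as Discrete, while an integer column with ten distinct values is guessed as Continuous at the default threshold of 5.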

Learn more about Discrete and Continuous nodes in the Node Editor topic.

Weight

Weighting is often applied to surveys to make a survey sample representative of the demographics of the underlying population.

If your dataset contains such a Weight variable, select it by clicking on the corresponding column.
Then, select the Weight button in the Type panel.
Later, in Step 4 — Discretization and Aggregation, you can specify whether or not to normalize the Weight variable.

Learning/Test

For a dataset that has already been split into a Learning Set and a Test Set, you can use such an existing definition to import your data into BayesiaLab.

Both the Learning Set and the Test Set need to be in the same data table, rather than in separate files.
A binary indicator variable needs to identify each set with a unique code.
With a Learning/Test variable defined, in Step 4 — Discretization and Aggregation of the Data Import Wizard, you need to assign which of your codes corresponds to BayesiaLab's Learning and Test states.
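
If your Learning Set and Test Set currently sit in separate tables, they need to be combined before import. The following is a minimal sketch of that preparation step outside of BayesiaLab. The column name `Set` and the codes `L` and `T` are purely illustrative choices; any two distinct codes would do, since you map them to the Learning and Test states later in the wizard.

```python
from typing import Dict, List


def combine_sets(
    learning_rows: List[Dict[str, str]],
    test_rows: List[Dict[str, str]],
    indicator: str = "Set",      # hypothetical column name
    learning_code: str = "L",    # hypothetical code for the Learning Set
    test_code: str = "T",        # hypothetical code for the Test Set
) -> List[Dict[str, str]]:
    """Stack two tables and add a binary indicator column."""
    combined = []
    for row in learning_rows:
        combined.append({**row, indicator: learning_code})
    for row in test_rows:
        combined.append({**row, indicator: test_code})
    return combined


learning = [{"age": "34"}, {"age": "51"}]
test = [{"age": "47"}]
table = combine_sets(learning, test)
```

Every row of the resulting table carries exactly one of the two codes, so the indicator column has no Missing Values and remains eligible for the Learning/Test type.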

Row Identifier

You can assign one or more variables to serve as Row Identifiers. The values of Row Identifiers are imported but not processed in any way. They serve as labels that are attached to each record.

Numerous functions in BayesiaLab allow you to look up which record in the dataset corresponds to what is currently displayed on the screen.
For instance, Automatic Evidence-Setting displays the Row Identifier in the Status Bar.

Unused

By selecting the Unused button, you can skip the import of the selected variables. In previous versions of BayesiaLab, this option was also known as "Not Distributed."

Unused is automatically applied to variables containing only a single value across all observations, i.e., when the variable is "not distributed," hence the original name.
Unused variables will appear grayed out in the remaining steps of the Data Import Wizard.

Multiple Typing

The Multiple Typing panel allows you to quickly assign variable types across multiple variables.

Click Set All to Discrete to apply the Discrete type to all variables, if possible.
Click Set All to Continuous to apply the Continuous type to all variables, if possible.

Clicking either button replaces all previous type assignments.

You can automatically remove variables, i.e., set them to the Unused type, if they exceed a certain column percentage of Missing Values.

Click the Set Missing Values Threshold button.
From the pop-up window, set the percentage.

All variables that exceed the specified threshold are set to Unused.
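
The effect of the threshold can be illustrated with a short sketch that computes the percentage of Missing Values per column and flags the columns above the threshold. The function name and the use of `None` for a Missing Value are assumptions of this example, and "exceed" is read here as strictly greater than the threshold.

```python
from typing import Dict, List, Optional, Sequence


def columns_over_missing_threshold(
    table: Dict[str, Sequence[Optional[str]]],
    threshold_pct: float,
) -> List[str]:
    """Return the columns whose share of Missing Values exceeds the threshold."""
    flagged = []
    for name, cells in table.items():
        if not cells:
            continue  # skip empty columns to avoid dividing by zero
        missing_pct = 100.0 * sum(c is None for c in cells) / len(cells)
        if missing_pct > threshold_pct:
            flagged.append(name)
    return flagged


data = {
    "age": ["1", None, None, None],   # 75% missing
    "city": ["a", "b", None, "d"],    # 25% missing
    "id": ["1", "2", "3", "4"],       # 0% missing
}
to_unused = columns_over_missing_threshold(data, threshold_pct=50)
```

With a threshold of 50 percent, only `age` would be set to Unused in this example.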

Information

The Information panel provides a range of statistics relating to the current type assignment of variables:

Number of Rows refers to the number of records in the to-be-imported dataset. In the context of datasets, rows, records, cases, samples, and observations all have equivalent meanings.
Discrete shows the absolute count of variables currently assigned to the Discrete type. The percentage refers to the proportion of Discrete variables among all variables, including the type Unused.
Continuous shows the absolute count of variables currently assigned to the Continuous type. The percentage refers to the proportion of Continuous variables among all variables, including the type Unused.
Others displays the count of all the variables assigned to the types Row Identifier, Weight, or Learning/Test.
Unused shows the absolute count of variables currently assigned to the Unused type. The percentage refers to the proportion of Unused variables among all variables.
Missing Values displays the count of cells in the dataset that contain Missing Values. The percentage refers to the proportion of cells in the dataset that contain Missing Values, including all variable types, even Unused, Row Identifier, and Learning/Test.
Filtered Values displays the count of cells in the dataset that contain Filtered Values, as indicated by the asterisk (*). The percentage refers to the proportion of cells in the dataset that contain Filtered Values, including all variable types, even Unused, Row Identifier, and Learning/Test.
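
To make the definitions above concrete, the following sketch recomputes the same statistics from a small table. It only mirrors the definitions given here and is not how BayesiaLab computes them internally. The function name `information_summary`, the dictionary layout, and the use of `None` for a Missing Value are assumptions of this example.

```python
from typing import Dict, Optional, Sequence

OTHER_TYPES = {"Row Identifier", "Weight", "Learning/Test"}


def information_summary(
    types: Dict[str, str],
    table: Dict[str, Sequence[Optional[str]]],
) -> Dict[str, object]:
    """Recompute the Information panel statistics for a small table."""
    n_vars = len(types)
    n_rows = len(next(iter(table.values()))) if table else 0
    cells = [c for column in table.values() for c in column]

    def share(count: int, total: int) -> float:
        return 100.0 * count / total if total else 0.0

    summary: Dict[str, object] = {"Number of Rows": n_rows}
    # Percentages are taken over all variables, including Unused ones.
    for name in ("Discrete", "Continuous", "Unused"):
        n = sum(1 for t in types.values() if t == name)
        summary[name] = (n, share(n, n_vars))
    summary["Others"] = sum(1 for t in types.values() if t in OTHER_TYPES)

    # Cell-level counts cover every column, whatever its type.
    missing = sum(1 for c in cells if c is None)
    filtered = sum(1 for c in cells if c == "*")
    summary["Missing Values"] = (missing, share(missing, len(cells)))
    summary["Filtered Values"] = (filtered, share(filtered, len(cells)))
    return summary


types = {"id": "Row Identifier", "age": "Continuous",
         "city": "Discrete", "const": "Unused"}
table = {"id": ["1", "2"], "age": ["30", None],
         "city": ["*", "Paris"], "const": ["x", "x"]}
info = information_summary(types, table)
```

In this example, one Missing Value among eight cells gives a Missing Values share of 12.5 percent, even though two of the four columns are not Discrete or Continuous.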

Data

The Data panel visualizes the current variable selection and type assignment with colors (see Usage above).
Horizontal and vertical scrolling allows you to view the entire dataset that will be imported.
