Step 3 — Data Selection, Filtering, and Missing Value Processing

Context

  • Step 3 of the five-step Data Import Wizard deals with Data Selection, Filtering, and Missing Values Processing.

Overview of Elements

Data

We start with the Data panel — although it is at the bottom of the window — as it can help inform decisions about Missing Values Processing.

This Data panel resembles the Data panel from Step 2 — Definition of Variable Types.

However, there are several important additional pieces of information available:

  • A Missing Values icon indicates the presence of at least one Missing Value in the corresponding variable.

  • A triangle icon indicates that variable-specific statistics are available. It appears on all variable headers with the exception of variables of the type Row Identifier and Unused.

  • Clicking on the triangle icon or the associated variable header brings up a table with variable statistics:

    • For Discrete variables, it shows the frequencies of all states, including Missing Values and Filtered Values:

  • The Filter checkboxes allow you to uncheck/deselect specific values.

  • The checked box means that the value is included, which is the default condition.

  • The unchecked box means that the value is excluded and that all rows that contain that value will be filtered, i.e., removed.

  • As you experiment with checking/unchecking, you can see how the Number of Rows in the Information panel changes.

In terms of a data query, the Filter checkbox would be the equivalent of a nominal value row filter.

Note that the number of Filtered Values does not refer to the number of excluded rows due to an unchecked Filter checkbox.

    • For Continuous variables, it shows the standard statistics, such as Minimum, Maximum, Mean, and Standard Deviation. Additionally, the table displays the frequencies of non-missing values, Missing Values, and Filtered Values.
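As a rough illustration of what the statistics table reports, the following Python sketch computes the same kinds of summaries outside BayesiaLab. It is not BayesiaLab code: the function names `discrete_summary` and `continuous_summary`, the use of `None` to stand for a Missing Value, and the sample data are all assumptions made for this example. Filtered Values are left out, and the sketch uses the sample standard deviation, which may differ from what the wizard displays.

```python
from collections import Counter
from statistics import mean, stdev

def discrete_summary(values):
    """Count each state; Missing Values (None) get their own key."""
    return dict(Counter("<missing>" if v is None else v for v in values))

def continuous_summary(values):
    """Basic statistics over the non-missing values, plus the missing count."""
    observed = [v for v in values if v is not None]
    return {
        "count": len(observed),
        "missing": len(values) - len(observed),
        "min": min(observed),
        "max": max(observed),
        "mean": mean(observed),
        "std": stdev(observed),  # sample standard deviation (assumption)
    }

# Hypothetical columns for illustration only.
colors = ["red", "blue", None, "red"]
ages = [30, None, 40, 50]

print(discrete_summary(colors))
print(continuous_summary(ages))
```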

Select Values

The Select Values panel relates to the Filter checkboxes plus any Required Minima and Maxima applied in the Data panel.

Three actions are available in this panel:

  • You can choose the logic for combining the Filters and Minima/Maxima assigned in the Data panel:

    • OR: a row will be removed if ANY of the selected Filters or specified Minima/Maxima across all variables apply to that row.

    • AND: a row will only be removed if ALL of the selected Filters and specified Minima/Maxima across all variables apply to that row.

  • Click the Show Selections button to review what Filters and Minima/Maxima are currently in place.

    • Note the syntax for Discrete variables: The variable name is followed by "in" (i.e., is an element of) followed by the included values shown as an array in square brackets.

    • Further logical expressions are shown as conjunctions (AND) or disjunctions (OR) in separate lines.

  • Clicking the Delete Selections button removes all Filters and Minima/Maxima currently in place.
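The difference between the OR and AND logic can be sketched in a few lines of Python. This is only an illustration of the rule described above, not BayesiaLab's implementation. The names `violates` and `select_rows`, the dictionary layout of a rule, and the sample rows are hypothetical. The `included` list loosely mirrors the "in [...]" syntax, Minima/Maxima are assumed to be inclusive bounds, and the columns are assumed to contain no Missing Values.

```python
def violates(row, rule):
    """Return True if this rule would exclude the row's value."""
    value = row[rule["variable"]]
    if "included" in rule:  # Discrete: variable in [included values]
        return value not in rule["included"]
    low = rule.get("min", float("-inf"))
    high = rule.get("max", float("inf"))
    return not (low <= value <= high)  # Continuous: outside the bounds

def select_rows(rows, rules, logic="OR"):
    """Keep the rows that are not removed under the chosen logic."""
    if not rules:
        return list(rows)
    combine = any if logic == "OR" else all
    return [r for r in rows if not combine(violates(r, rule) for rule in rules)]

# Hypothetical data: "green" is unchecked, and Age has a maximum of 65.
rows = [
    {"Color": "red", "Age": 30},
    {"Color": "blue", "Age": 70},
    {"Color": "green", "Age": 25},
    {"Color": "green", "Age": 80},
]
rules = [
    {"variable": "Color", "included": ["red", "blue"]},
    {"variable": "Age", "max": 65},
]

kept_or = select_rows(rows, rules, logic="OR")    # removed if ANY rule applies
kept_and = select_rows(rows, rules, logic="AND")  # removed only if ALL apply
print(len(kept_or), len(kept_and))
```

With OR, only the first row survives because every other row breaks at least one rule. With AND, only the last row is removed because it is the only one that breaks both.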

Missing Values Processing

In the Missing Values Processing panel, you can specify which kind of processing to apply to variables with Missing Values: Filter, Replace By, or Infer.

This panel is only active if you select one of the variables that feature a small question mark icon. This icon indicates that the corresponding variable contains at least one Missing Value.

Filter

The Filter function allows you to remove rows from the dataset that contain Missing Values. This is equivalent to what is commonly known as casewise deletion.

You can apply the Filter individually to any variable that contains Missing Values.

Usage

  • In the Data panel, click the header of the variable with Missing Values or click anywhere in its column.

  • Then, check the Filter checkbox in the Missing Values Processing panel.

  • Next, if you apply the Filter to multiple variables, choose the logical condition that combines them:

    • OR: a row will be removed if ANY of the selected variables contain a Missing Value in that row.

    • AND: a row will only be removed if ALL of the selected variables contain a Missing Value in that row.
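To make the effect of the two conditions concrete, here is a minimal Python sketch of casewise deletion over a selection of variables. It is not BayesiaLab code: `filter_missing`, the use of `None` for a Missing Value, and the sample rows are assumptions made for this example.

```python
def filter_missing(rows, variables, logic="OR"):
    """Drop rows with Missing Values (None) in the selected variables.

    OR:  drop a row if ANY selected variable is missing in it.
    AND: drop a row only if ALL selected variables are missing in it.
    """
    if not variables:
        return list(rows)
    combine = any if logic == "OR" else all
    return [r for r in rows if not combine(r[v] is None for v in variables)]

# Hypothetical dataset with two variables subject to the Filter.
rows = [
    {"Income": 50, "Age": 30},
    {"Income": None, "Age": 40},
    {"Income": None, "Age": None},
]
selected = ["Income", "Age"]

kept_or = filter_missing(rows, selected, logic="OR")
kept_and = filter_missing(rows, selected, logic="AND")
print(len(kept_or), len(kept_and))
```

OR is the stricter choice here: it keeps only the fully observed row, while AND also keeps the row in which just one of the two variables is missing.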

Before applying Filter, please consider the implications discussed in Chapter 9: Missing Values Processing.

Replace By

With the Replace By function, you can specify a value for replacing the Missing Values in the selected variable.

You have several options in this regard:

  • For a Discrete variable:

    • You can select a specific value from a drop-down list of the values observed in the variable.

    • Alternatively, you can choose the Modal value, i.e., the most frequently occurring value of the variable in the dataset.

  • For a Continuous variable:

    • You can use the Mean value computed from the dataset.

    • Alternatively, you can specify any arbitrary value.
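The replacement options can be illustrated with a short Python sketch. It only mimics the idea of Replace By and is not how BayesiaLab implements it. The helper names `replace_missing`, `modal_value`, and `mean_value`, the use of `None` for a Missing Value, and the sample columns are hypothetical.

```python
from statistics import mean, mode

def replace_missing(values, replacement):
    """Return a copy of the column with every Missing Value (None) replaced."""
    return [replacement if v is None else v for v in values]

def modal_value(values):
    """Most frequently occurring non-missing value (for Discrete variables)."""
    return mode([v for v in values if v is not None])

def mean_value(values):
    """Mean of the non-missing values (for Continuous variables)."""
    return mean([v for v in values if v is not None])

# Hypothetical columns for illustration only.
colors = ["red", None, "blue", "red"]
ages = [30, None, 40, 50]

colors_modal = replace_missing(colors, modal_value(colors))
ages_mean = replace_missing(ages, mean_value(ages))
ages_fixed = replace_missing(ages, 0)  # an arbitrary, user-specified value

print(colors_modal)
print(ages_mean)
print(ages_fixed)
```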

Infer

For practical analysis purposes, the Infer option is the most common method for Missing Values Processing.

To learn about Missing Values Processing beyond Filter and Replace By, please see Missing Values Processing in Chapter 9 of our e-book.

The Methods in Detail:

  • Infer — Static Imputation

  • Infer — Dynamic Imputation

  • Infer — Structural EM

  • Infer — Entropy-Based Imputations

Information

The Information panel is identical in its functionality to the Information panel in Step 2 — Definition of Variable Types. Please refer to that topic for details.






Copyright © 2024 Bayesia S.A.S., Bayesia USA, LLC, and Bayesia Singapore Pte. Ltd. All Rights Reserved.