The Data Import Wizard is a comprehensive tool for preparing external data for use in BayesiaLab.
You can use BayesiaLab's Data Import Wizard to load data from two types of sources.
- text files consisting of data fields separated by delimiters, e.g. comma, semicolon, tab or pipe "|"
- databases that can be accessed on a database server
The data import process consists of five steps:
- Step 1: Preview of text file content (a) or, alternatively, the database content (b)
- Step 2: Selection and type assignment of variables
- Step 3: Definition of missing value processing and data filtering
- Step 4: Discretization of continuous variables, state aggregation of discrete variables, and data type selection
- Step 5: Import report
Step 1 (a): Text File Preview
The Data Import Wizard shows you a preview of the fields' content, as defined by the Separator. If necessary, e.g. with a foreign language data source, you can change the encoding. If a Title Line containing the names of the variables already exists, these names will be shown as column headers; otherwise the variables are automatically named.
In this first step of the Data Import Wizard, you can also specify the strings that identify a missing value, e.g. N/A, NULL, -9999, etc.
Similarly, you can define Filtered Values by adding them to the list of Filtered Values. If a matching string is found in the to-be-imported data, it will be considered a Filtered Value and represented with "*", the asterisk character.
In the Data panel,
- Columns correspond to variables
- Rows correspond to samples, records, or observations.
However, in bioinformatics, and especially in the field of microarray analysis, a different data layout is typical. Here, rows correspond to variables (e.g. genes), and columns correspond to observations. To handle this data format, you can simply select the Transpose option.
In this step of the Data Import Wizard, you can
- Specify the Separators, i.e. the characters that delineate the variables in the data.
- Specify the values that will be treated as Missing Values.
- Specify the values that will be treated as Filtered Values, for discrete or continuous variables.
- Indicate the presence of a Title Tine, so the values of the first row will be used variable headers.
- Indicate an End Line Character.
- Specify to remove Single and/or Double Quote String Delimiters.
- Specify how to handle Consecutive Separators.
Furthermore, you can import only a subset of the available data by selecting Define Sample:
There types of sampling are available:
- Random Sample (defined in percent): you specify a percentage, and BayesiaLab randomly selects the corresponding number of rows from the data file.
- Random Sample (defined by size): you specify the number of rows, and BayesiaLab randomly selects from the data file.
- Custom Sample: you specify the first and last row to be imported.
You can also set aside a portion of the data file to serve as a test sample, i.e. you would split the data file into a learning and a test set. To do so, simply set the percentage that will determine the how many rows will be randomly selected to be part of the test set.
Step 1 (b): Database File Preview
When connecting to a database (as opposed to reading in a text file as in Step 1 (a)), the Data Import Wizard manages all aspects of the data loading process.
In the following, we highlight 11 elements that make up the first step of the import process.
- JDBC Driver Specification
- Clicking the
button brings up the database driver selection, which includes
- the available drivers
- the syntax of the database URL,
- the installation directory
Database URL. The driver-specific URL syntax is displayed here.
All URL's are stored and will be available for future sessions.
- User: database user names
- Password: password corresponding to the user name
- Once the connection is established with the database, this panel lists all data tables available in the database. A Catalog and Schema drop-down menu allow filtering the available tables. No selection implies that all tables are shown.
- Selecting the database fields from the list automatically generates the corresponding SQL SELECT statement.
- Alternatively, you can directly enter an SQL query. The
button opens the SQL query editor window:
- Open: opens a text file containing a previously-saved SQL query.
- Save: saves the query to a file.
Validate: closes the SQL query editor, and pastes the query into the data selection panel.
- Cancel: closes the editor without any change.
- Limit defines the maximum number of rows to be displayed in the preview.
- Specifies the values that will be treated as Missing Values.
- Transpose switches the layout so that rows represent variables and columns represent observations (see Step 1 (a) re bioinformatics data layout)
The display of available tables in control 6 is limited to 500. If your database contains more tables, you will see an error message, and the drop-down menu will appear blank. However, this does not affect the database connection, and you can still submit SQL statements, incl. table names, via the Query field (control 8).
Step 2: Selection and Typing of the Variables
This second wizard consists of three elements:
- Format: to choose the variables to include in the future Bayesian network and to indicate their type:
- Not Distributed: specifies that this column will not be included in the Bayesian network.
- Discrete: each unique value of the variable will be considered as a distinct state
- Continuous: the values are considered numerical and will subsequently be discretized
- Weight: declares the selected columns as weighting variable, assigned a case weight to each row. Note that there can be only one weight variable.
- Data Type: defines which rows will be assigned to the learning and the test set. Note that there is only one data type variable. The column must contain only two different states.
- Row Identifier: define a record identifier for each row. Identifiers can be of any type (string or numbers), but cannot contain missing values. However, unicity is not required. Two rows can have the same identifier. These identifiers will be saved with the database and kept in any derived database (generated by some analysis or tools). This identifier can be used to select a row in the database that will be observed (during Interactive Inference, Interactive Updating or manual selection on the database). The current identifier is displayed in the status bar.
- Data View: to view the database and to select variables. Thanks to + it is also possible to select all the columns and to choose the same processing for all the variables.
- Information: provides information about the data that will be imported while processing the variables. Others groups the statistics for Row Identifier, Weight and Data Type columns.
Step 3: Missing Value Processing and Data Filtering
- Missing Value Processing specifies which kind of processing to apply, i.e. Filter, Replace, and Infer, for variables with missing values:
- Two Filters are available
- OR Filter: each row with at least one missing value for one of the variables belonging to the OR rule is discarded.
- AND Filter: each row with at least one missing value for all the variables belonging to the AND rule is discarded.
- Replace by specifies the value that will be used to replace the missing values in the selected column.
- You can set a specific value.
- In the case of continuous variables, you can specify to use the mean value computed from the available data,
- In the case of discrete variables, you can specify to use the modal value computed from the available data.
- Alternatively, you can choose one of the proposed values in the combo box. The proposed values represent the available states in the selected column. If you select multiple columns are selected, only common states are proposed.
- Static Imputation: the probability distributions of the missing values are estimated based on the available data, by considering that all the variables are independent (fully unconnected network). Whereas the previous "Replace by" option allows replacing the missing values with the most probable values, the missing values are replaced here by values that are randomly chosen ac- cording to the probability distributions. Even if this decision rule is not optimal at the line level (the optimal rule being the one used by the "Replace by" option), it is however the best rule at the population level. This imputation process only occurs at the end of the data loading. However, it can also be launched, with respect to the current Bayesian network, by using the Learning menu, with the Learning the Probabilities menu item.
- Dynamic Imputation: the conditional probability of the missing values is dynamically estimated based on the current network and the available data of the line. Each missing value is then re- placed by a value randomly selected with respect to the probability distribution. During learning, a new imputation is realized after each structure modification. This option then brings a rigorous solution to attack imputation tasks, as it is possible to save the database with all the data pro- cessing results included, i.e. without missing values in that case.
- Structural EM: the probability of each state is dynamically estimated during learning by using the current structure and to the available data. These probabilities are directly used for learning the structure and the parameters, i.e. there is no completion with a specific state.
- Two Filters are available
Dynamic Imputation and Structural EM represent the most rigorous way to process missing values. However, these methods are time costly as they need to carry out inference while learning the structure of the network. Note also that the choice of one of these methods is applied to all the variables for which an inference processing has been set. This choice can be changed once the data are loaded by using the menu. Note that the missing values replacement is dynam- ically taken into account in the information panel
- The Data View: a left click on the icon , gives access to statistical information about the distribution of the variable. If a variable has not replaced missing values, then the icon will indicate it in the header of the column.
When we want to check or uncheck several states at the same time, you have just to select the convenient states (by clicking on their names) and to check or uncheck one of the filters keeping the key pressed in order to do not lose the selection.
The Minimum and Maximum required zones allow creating filters.
It is also possible to useto define filters. Two types of filters are available: OR and AND filters. The filtering system consists in defining the lines that will be imported. If we want to describe the lines that we want to discard, we have to invert the corresponding logical expression. The example below describes how to filter the smokers that are less than 15 years old. We then have to keep the lines where the individual does not smoke OR is older than 15.
Choose the OR filter
Click on Smoker and uncheck True
Click on Age and specify 15 years old in the Required minimum field
Verify the created filter
Step 4: Discretization of the Continuous Variables, State Aggregation of the Discrete Variables and Data Type
If a weight column is specified, it will be used in the different discretization and aggregation algorithms.
If a data type column is specified, the current learning rows will be used in the different discretization and aggregation algorithms.
Discretization of Continuous Variables
This last stage occurs when there are continuous variables to take into account into the network.
A manual and four automatic discretization methods are proposed:
- Decision tree : supervised induction of the most informative discretization thresholds with respect to a target variable. The target variable must be discrete, if not it must be manually discretized before.
- KMeans : data clustering with the KMeans algorithm. The data are standardized. Each obtained cluster defines an interval.
- Equal distances : design of intervals that have the same width.
- Equal frequencies : design of intervals with the same weight, i.e. containing the same number of points.
- Manual : manual design of intervals using a graphical interface.
The selected discretization method applies for the selected column. If you want to use a unique method (except Manual ) for all the continuous variables, you can click on Select all continuous.
If you want to design your own intervals, the distribution function will then be displayed in the right window (lines are sorted according to the values of the continuous variable; the X-axis represents the number of individuals and the Y-axis represents the values of the continuous variable). All the manually discretized variables can be used as target variable for the decision tree discretization.
The user can switch the view of the data to a representation of the density curve generated by the Batch-Means method. In this view, the data's density curve is displayed. The continuous variable's values are represented along X-axis and the density of probability is represented along the Y-axis. The two red areas at each extremity indicate that the curve may not be accurate and can't be used to place here some discretization points.
This window is fully interactive and allows, in both view:
- Adding a threshold: Right Click
- Removing a threshold: Right Click on the threshold
- Selecting a threshold: Left Click on a threshold
- Moving a threshold: Left Click down and mouse move, the current Y-Coordinate appears in the Point box :
- Zooming: Ctrl + Left Click down + move + release to define the area that we want to enlarge. In the distribution function, the zoom will be done vertically and in the density curve, it will be done horizontally. It is possible to zoom successively as much as you need.
- Unzooming: Ctrl + Double Left Click
Besides this distribution function, the button: allows having access to the three automatic discretization methods through a new dialog. This part can be considered a wizard for the manual discretization as it is possible to launch these methods, to see the resulting discretization on the distribution function, and then to modify the result by moving, deleting and adding thresholds.
If the chosen discretization fails, a dialog box is displayed to warn the user. In this dialog it can change the chosen discretization.
It is also possible to transfer the defined discretization points to other variables. The button: allows getting the list of the continuous variables. Simply select from that list the variables to process.
The transfer applies only if the variation field of the selected variables is compatible with the variation field of the original variable.
If ais defined for a continuous variable, a filtered state will be created at the end of import as a new interval. This interval will be added following the intervals defined by the discretization. The name of the state associated with the interval is * by default.
State Aggregation of a Discrete Variable
It is possible to aggregate the states of a discrete variable into one new state. This processing has to be carried out for the discrete variables that have too many states. When a discrete variable is selected (left click on the corresponding column), the following window appears
- Creating an aggregate: The list of states appears in the Aggregation zone. To make a selection, click on the chosen state and maintain pressed for a partial selection or maintain pressed for a multiple selection. Once the selection done, click on , the new aggregate appears in the Aggregates list.
- Modifying an aggregate: The selection of an aggregate allows adding new states.
- Renaming an aggregate: The name edition is available thanks to a double-click on it
- Removing an aggregate: You can remove an aggregate by selecting it and pressing
Besides this a priori aggregation process, it is also possible to use the correlation of the variable with a target variable which can be discrete or continuous but manually discretized before. By checking and selecting a target variable, BayesiaLab displays the conditional probability table of the target variable knowing the current variable.
In this screen capture, the probability distribution of the target node (Cancer) is displayed with respect to each state of the current variable. For binary or "binarized" (a target state has been set) target variables, the column "Correlation" is used to highlight the difference between the conditional probability of the first target state and its marginal probability. Green bars (the conditional probability is higher than the marginal one) and red bars (lower) greatly improve the readability of the results. It is then possible to aggregate states that have the same relation with the target just by using the colors.
The exact value of the difference is given in a tooltip when you point to the corresponding bar.
Besides the visual help, it is also possible to have a special aggregation wizard for binary or "binarized" target variables by clicking on
The colored bar represents the variation field of the differences between the conditional and marginal probabilities of the first target state. This bar is interactive and allows defining aggregation thresholds:
- Adding a threshold: right click,
- Modifying a threshold: left click down and mouse move on the deltas-axis, or by selecting the threshold value from the list and by directly editing it thanks to a double left click.
- Removing a threshold: right click on the threshold.
Defining just one threshold equal to 0 allows grouping all the "green" and "red" states automatically.
The button: displays a new dialog that allows detecting automatically how to group the states. This algorithm uses a decision tree to find what the best threshold is in order to create the given number of aggregates.
Once the wanted final state number chosen, the found best thresholds are displayed in the table and you can click on the OK button to apply the new generated aggregates.
If a is defined for a discrete variable, a filtered state will be created with * as name by default. This state could not be aggregate with the other states.
State Aggregation of Several Discrete Variables
It is possible to automatically perform the aggregation of the states of several discrete variables by se- lecting several discrete columns in the table or by clicking on the button Select all discrete. It brings up a new panel:
It allows performing automatic aggregation like in the over the selected variables that the initial state number is greater or equals to the specified number. You have simply to specify the target node and the state and indicate how many states you want at the maximum. If the target node is one of the selected variables, it will remain unmodified.
Sometimes, the algorithm cannot find any grouping because the current variable and the specified target are independent. In this case, the unmodified variables will be displayed in a dialog at the end of the automatic aggregation process.
It is possible to stop the process by clicking on the close button of the progress bar. The already treated variable will be conserved and the other will stay as before.
If a column if a used as data type, you can configure select which state you want to use for learning and for test. To do this, use the corresponding combo boxes to change the type associated to each state.
After a successful database import, it is possible to display the HTML import report.
The first column displays the names of the imported variables. The second column displays the type associated with each variable. The third column shows the different states of each variable if necessary. Information contained into the last column may vary according to the type of the variable:
- Data Type: indicates, for each state, if this state will be used for learning or test and how many they are in the database.
- Weight: no necessary information.
- Discrete: indicates, for each state, the possible aggregated states that were aggregated. The color of the last cell indicates if the states of the node were really aggregated.
- Continuous: indicates, for each state, the intervals limits. After that, the asked discretization and the really obtained discretization are shown. The background of this cell is colored according to the obtained discretization to allow a quick identification.