Predictive modelling using evolutionary algorithms
Iconic is an open source evolutionary algorithm framework developed by a group of 10 students at the University of Newcastle (UoN) in collaboration with Dr. Markus Wagner from the University of Adelaide (UoA) and Prof. Pablo Moscato from the University of Newcastle. It was developed as part of their Final Year Project for the Bachelor of Engineering (Software)(Honours) program. The Iconic Software Ecosystem has 3 main components: The Iconic CLI - a command line interface that allows users to run the Iconic System in a bare-bones, lightweight manner, the Iconic Workbench - a graphical user interface that gives users an accessible and easy way to operate the system, and the Iconic API - the underlying logic that both interfaces use to perform calculations and analysis.
This User Manual (UM) provides the information necessary for users to effectively use the Iconic Workbench and the Iconic CLI. It is up-to-date as of Iconic v0.7.0 released 02/11/2018. This document provides an overview of all screens in the Iconic Workbench, a guide on how to get started, and a list of each feature provided by the workbench. It also provides information and examples on how to use the Iconic CLI.
The Iconic Workbench provides a standalone graphical user interface over the Iconic API. It allows users to easily import and modify datasets, preprocess data and start evolutionary searches to find an expression which represents the data. The Iconic Workbench aims to provide users an easy way to generate predictive models for their data.
The key features it provides are:
The Iconic CLI is a simple, light-weight command-line tool designed to give more advanced users the ability to expedite the generation of models without relying on a graphical user interface. The current version uses the Global Simple Evolutionary algorithm for Multiple Objectives (GSEMO) to minimise mean squared error and genome size. It takes a single input file, a population size and a number of generations.
Below is a high-level diagram of how both the Iconic Workbench and Iconic CLI integrate with the Iconic API to provide their functionality.
The term ‘user’ is used throughout this document to refer to a person who requires and/or has acquired access to the Iconic Workbench or Iconic CLI.
Below is a list of known bugs with the Iconic Workbench. For an optimal experience, please refrain from replicating the following scenarios:
Below is an overview of each of the screens in the Iconic Workbench, followed by a short guide on how to run your first search.
Java Runtime Environment version 8 or higher must be installed to use the Iconic Workbench. It can be downloaded here
To optimise utilisation of the Iconic Workbench:
Below is a simple start to finish guide to running a search.
Before exiting the system, it is recommended to stop all running searches.
The following sub-sections provide detailed, step-by-step instructions on how to use the various functions or features of the Iconic Workbench.
Projects provide logical groupings for multiple datasets and searches.
Alternatively, if no project exists, you will be prompted to create one when importing or creating a dataset
You may either import an existing dataset in CSV format, or create one from scratch and enter or paste in values.
Note: If a project does not exist, you will be asked to create one when importing or creating a dataset.
You may add multiple searches to a dataset. There are two types of searches to choose from: Gene Expression Programming or Cartesian Genetic Programming.
You may view and edit the dataset in a spreadsheet view. This supports copy and paste both to and from Microsoft Excel.
You may save a dataset from the Iconic Workbench to your system
This will save the dataset in a CSV format
You may view a plot of the data for each individual feature in the dataset
This will plot the data from that feature below. Additionally, a label “(missing values)” will appear next to the feature in the table if the feature is missing values.
You may apply a number of transformations to each feature in the dataset. These can be enabled in a user specified order. This ordering will appear on the right hand side of the screen. Features with preprocessors applied will be labelled with “(modified)”. This will not affect the dataset in the “Input Data” screen, nor will modified values be exported when exporting the dataset.
You may smooth feature data using a sliding window approach. By default, the window is 2. This means that for every data point, we take the average of the two points before, the two points after and the point itself. This average becomes the new value for the data point. All averages are calculated in advance, before any changes are made to the dataset. If there are no values immediately before or after the data point, the window will wrap around to the nearest point.
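The smoothing step can be sketched as follows. This is a minimal Python illustration rather than Iconic's own code: the function name `smooth` is hypothetical, and the edge handling (reusing the nearest existing point when the window runs past either end of the feature) is only one possible reading of the wrap-around behaviour described above.

```python
def smooth(values, window=2):
    """Centred moving average using `window` points on each side."""
    n = len(values)
    smoothed = []
    for i in range(n):
        total = 0.0
        for j in range(i - window, i + window + 1):
            # Assumption: positions past either end reuse the nearest point.
            k = min(max(j, 0), n - 1)
            total += values[k]
        smoothed.append(total / (2 * window + 1))
    # Every average is taken from the original values, never from values
    # that have already been smoothed.
    return smoothed
```

With the default window of 2, each new value is the mean of five original points.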
If the dataset is missing values, you MUST apply this preprocessor before any others can be applied. This is also true if the “Remove Outliers” (see below) preprocessor is applied and removes values from the dataset. There are 5 methods for handling missing values:
You may normalise the scale of the data points between two user defined values. For example, if your data range is 0 to 100, you can use this preprocessor to scale it between 0 and 1.
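As an illustration, rescaling of this kind is commonly done with a min-max formula. The sketch below assumes that formula; the function name `rescale` and the treatment of a constant feature are hypothetical and not taken from the Iconic source.

```python
def rescale(values, new_min=0.0, new_max=1.0):
    """Linearly map values from [min(values), max(values)] to [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        # Assumption: a constant feature is mapped to the lower bound.
        return [new_min for _ in values]
    span = old_max - old_min
    return [new_min + (v - old_min) * (new_max - new_min) / span for v in values]
```

For a feature ranging from 0 to 100, `rescale([0, 50, 100])` gives `[0.0, 0.5, 1.0]`.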
You may use this preprocessor to remove outlying data points from the feature. You may specify a threshold. A data point is considered an outlier if the distance between the data point and the mean value is greater than the threshold multiplied by IQR (interquartile range). If outliers are removed, you must handle missing values before continuing.
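The outlier rule above can be sketched in a few lines of Python. This is an illustration under stated assumptions: the function name `remove_outliers` and the default threshold of 1.5 are hypothetical, the quartiles come from Python's `statistics.quantiles` (Iconic may compute them differently), and removed points are marked as `None` to stand for missing values.

```python
import statistics


def remove_outliers(values, threshold=1.5):
    """Replace points further than threshold * IQR from the mean with None."""
    mean = statistics.mean(values)
    # quantiles(n=4) returns the three quartile cut points Q1, Q2, Q3.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return [None if abs(v - mean) > threshold * iqr else v for v in values]
```

Any `None` left behind corresponds to a missing value that has to be handled before the next preprocessor is applied.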
You may offset each data point in the feature by a specified positive or negative amount.
You may define the parameters to use when searching. Availability and display of some parameters are restricted to certain search types. For quick reference, you may hover over a label for an explanation. Additional information may be found here: Using the Command-Line
You MUST select the dataset to search on
Note: You will not be able to select a dataset if it is missing values. You must handle these missing values via the “Process Data” screen
This specifies the target function to search for. It allows you to define the input features and output (classifier) feature to use for searching.
Target Expression Syntax
These are the functional primitives to be included in the search. Each building block can be enabled or disabled individually, have their complexities set and will display a description when clicked.
Below is a list of all currently implemented building blocks in alphabetical order
ABS (a):
Returns the absolute value of a.
ADD (a, b):
Returns a + b.
AND (a, b):
Returns 1 if both a and b are greater than 0, 0 otherwise.
ACOS (a):
Returns the inverse cosine function of a.
ASIN (a):
Returns the inverse sine function of a.
ATAN (a):
Returns the inverse single argument tangent function of a.
CEIL (a):
Returns the integer of a rounded up.
COS (a):
Returns the cosine of a.
DIV (a, b):
Returns the division of a / b.
EQUAL (a, b):
Returns 1 if a is equal to b, 0 otherwise.
EXP (a):
Returns e^a.
FLOOR (a):
Returns the integer of a rounded down.
GAUSS (a):
Returns exp(-a^2), a Gaussian (bell-shaped) curve.
GREATER (a, b):
Returns 1 if a > b, 0 otherwise.
GREATER_EQUAL (a, b):
Returns 1 if a >= b, 0 otherwise.
IF (a, b, c):
Returns b if a > 0, c otherwise.
LESS (a, b):
Returns 1 if a < b, 0 otherwise.
LESSEQUAL (a, b):
Returns 1 if a <= b, 0 otherwise.
LOGISTIC (a):
Returns 1 / (1 + exp(-a)).
This is a common sigmoid squashing function.
MAX (a, b):
Returns the maximum value of a and b.
MIN (a, b):
Returns the minimum value of a and b.
MOD (a, b):
Returns the remainder of a / b.
MUL (a, b):
Returns a * b.
LN (a):
Returns the natural logarithm (base e) of a.
NEG (a):
Returns -a.
NOT (a):
Returns 0 if a is greater than 0, 1 otherwise.
OR (a, b):
Returns 1 if either a or b is greater than 0, 0 otherwise.
POW (a, b):
Returns a^b.
ROOT (a, b):
Returns the b-th root of a if a is greater than 0, NaN otherwise.
SGN (a):
Returns -1 if a is negative, 1 if a is positive, 0 otherwise.
SIN (a):
Returns the sine of a.
SQRT (a):
Returns the square root of a.
STEP (a):
Returns 1 if a is positive, 0 otherwise.
SUB (a, b):
Returns a - b.
TAN (a):
Returns the tangent of a.
TANH (a):
Returns the hyperbolic tangent of a.
This is a common squashing function returning a value between -1 and 1.
ATAN2 (a, b):
Returns the two argument inverse tangent function.
XOR (a, b):
Returns 1 if (a <= 0 and b > 0) or (a > 0 and b <= 0), 0 otherwise.
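To make the numeric conventions concrete, here are a few of the building blocks written out as plain Python functions. These follow the descriptions in the list above and are a sketch only; they are not the Iconic implementation, and edge cases such as NaN inputs are not considered.

```python
import math


def AND(a, b):
    # Any value greater than 0 counts as "true".
    return 1 if a > 0 and b > 0 else 0


def XOR(a, b):
    # True when exactly one argument is greater than 0.
    return 1 if (a > 0) != (b > 0) else 0


def IF(a, b, c):
    return b if a > 0 else c


def LOGISTIC(a):
    return 1 / (1 + math.exp(-a))


def GAUSS(a):
    return math.exp(-(a * a))
```

Note that the logical blocks treat any value greater than 0 as true, so `AND(2, 0.5)` returns 1.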
Below are the parameters which apply to both Gene Expression Programming and Cartesian Genetic Programming searches
This is the algorithm to use for determining result error. Currently, only Mean Squared Error is available
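For reference, mean squared error is the average of the squared differences between the values predicted by a solution and the actual values of the output feature. A minimal Python sketch (the function name is illustrative):

```python
def mean_squared_error(predicted, actual):
    """Average of squared differences between predictions and targets."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)
```

A lower value means the solution fits the data more closely, with 0 indicating a perfect fit.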
This is the number of “children” to be generated each generation
This is the number of generations to run the search for. A value of 0 will run the search indefinitely
This is the algorithm to use for mutation of chromosomes. Currently, only Single Active Gene Mutation is available. Mutation is a small random change in a chromosome
This is the chance of mutation as a percentage. The higher the mutation rate, the more chance a mutation will occur
Below are the parameters which only apply to Gene Expression Programming searches
This is the length of the Header to use.
The total length of the chromosome is at least 1 and at most Header length + Tail length (where Tail length is Header length + 1). The Header part of the chromosome can pick building blocks, features of the dataset or constants. The Tail can only pick features or constants. The Tail ensures that there are no leaf nodes still expecting children.
In this diagram, the green F1 (meaning feature 1 in the dataset) doesn’t use its children because it doesn’t need to. If it were a building block such as “+”, it would need them.
This is the algorithm to use for crossover of chromosomes. Currently, only Simple Expression Crossover is available. Crossover is the act of replacing part of the child’s genes with those of a parent within the population.
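The role of the Tail can be illustrated by counting how many genes an expression actually consumes. The Python sketch below is an assumption-laden illustration, not Iconic's decoder: the `ARITY` table is a hypothetical subset of building blocks, no block is assumed to take more than two arguments, and any gene not in the table is treated as a feature or constant.

```python
# Hypothetical subset of building blocks and their argument counts.
ARITY = {"ADD": 2, "MUL": 2, "SIN": 1}


def active_length(chromosome):
    """Number of genes consumed before the expression is complete."""
    needed = 1  # one open slot for the root of the expression
    used = 0
    for gene in chromosome:
        if needed == 0:
            break
        used += 1
        # A gene fills one slot and opens one slot per argument it takes.
        needed += ARITY.get(gene, 0) - 1
    if needed:
        raise ValueError("chromosome ended with unfilled arguments")
    return used
```

With a Header of length h made up entirely of two-argument blocks, h + 1 slots are still open once the Header ends, which is exactly what a Tail of length h + 1 can fill. For example, `active_length(["ADD", "MUL", "SIN", "F1", "F2", "F3", "F4"])` is 6, so the last Tail gene is unused.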
This is the chance of crossover as a percentage. The higher the crossover rate, the more chance a crossover will occur
Below are the parameters which only apply to Cartesian Genetic Programming searches
The number of outputs that the CGP chromosome can have, effectively splitting a solution into multiple parts
The number of columns in the CGP chromosome’s dimensions.
The number of rows in the CGP chromosome’s dimensions. It is recommended to set this value to 1 as default.
The number of levels back that any node in the CGP chromosome can reach to connect to another node.
To start a search:
To stop a search:
You may pause a search and resume from where it left off.
The Iconic Workbench gives live information to the user about the search progress. This includes: progress over time, time elapsed, number of generations, generations per second, time since last improvement, average improvement time and number of CPU cores.
You may view live information of the search results via the “Results” screen. This displays a table of results as well as a solution fit plot when a result is selected.
Currently, Iconic only supports copy & paste of solutions from the results table
The Iconic Workbench supports different colour schemes. These are: Default, Dark, Bootstrap 2, Bootstrap 3
$ java -jar iconic-cli.jar -i <file> --population <number> --generations <number> --outputs <number> --primitives <symbol,...> [--graph] [--csv]
iconic-cli is a simple command-line tool designed to expedite the generation of models without relying on a graphical user interface. The current version 0.7.0 only uses Global Simple Evolutionary Algorithm for Multiple Objectives (GSEMO) with cartesian (graph-based) chromosomes on two pre-defined objectives that minimise the mean squared error and the genome size.
iconic-cli takes a single input file, a population size and a number of generations.
While running it prints the current progress as a percentage of generations elapsed versus total generations, the current least error and smallest size, and the total amount of time elapsed.
The output of each run will be placed in a new folder named after the input file, arranged in subfolders according to the date the run was initiated. Unless additional flags are included only a README file will be output containing each of the parameters used and their values.
$ java -jar iconic-cli.jar ...
$ ...
$ ls .
After running:
$ ls .
Directory: C:\Path\to\iconic-cli
Mode LastWriteTime Length Name
---- ------------- ------ ----
d----- 31/10/2018 11:46 AM inputFile # Output directory
-a---- 22/10/2018 2:21 PM inputFile.csv
-a---- 16/10/2018 9:29 AM iconic-cli.jar
(-a | --algorithm) <GENE_EXPRESSION_PROGRAMMING | CARTESIAN_GENETIC_PROGRAMMING>
The type of algorithm determines which algorithm should be used to perform the search.
As of version 0.7.0 this parameter is ignored as only GSEMO with cartesian chromosomes is used.
(-i | --input) <string>
The input file is a comma delimited list of values where each line is a new sample.
The current version 0.7.0 doesn’t support input files with column headers.
0, 1, 0.25
1, 1, 0.5
0, 0, 1
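If you want to check an input file before handing it to the CLI, the format above is easy to parse. The following Python sketch is a hypothetical helper, not part of Iconic: it reads headerless, comma-delimited text, tolerates a space after each comma, and fails with a `ValueError` if a cell is not numeric (for example, if a header row is present).

```python
import csv
import io


def read_samples(text):
    """Parse headerless comma-delimited text into rows of floats."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    # Blank lines yield empty rows, which are skipped.
    return [[float(cell) for cell in row] for row in reader if row]
```

Each inner list is one sample, in the same column order as the file.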
--outputs <integer>
The number of outputs is used to specify how many outputs a chromosome can have. If the chromosome doesn’t support multiple outputs this parameter will be ignored.
In the current version 0.7.0 chromosomes with multiple outputs have each output summed together to produce a single output.
(-g | --generations) <integer>
The number of generations is used to specify how many generations to let the population evolve.
Unlike the Iconic Workbench, the number of generations must be greater than zero.
(-p | --population) <integer>
The population size is used to specify the size of the initial starting population.
This is less meaningful with GSEMO, as the population grows dynamically with only Pareto-optimal solutions being kept for the next generation. Setting an initial population size greater than one can still be used to increase the genetic diversity of the initial population.
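The way a Pareto-optimal population grows and shrinks can be sketched as follows. This is a simplified Python illustration of the usual GSEMO population update, not the Iconic source: solutions are reduced to hypothetical `(error, size)` tuples where lower is better for both, and the function names are illustrative.

```python
def weakly_dominates(a, b):
    """a is at least as good as b in every objective (lower is better)."""
    return all(x <= y for x, y in zip(a, b))


def dominates(a, b):
    """a is at least as good as b everywhere and strictly better somewhere."""
    return weakly_dominates(a, b) and a != b


def update_population(population, candidate):
    """Add candidate unless it is dominated; drop members it makes redundant."""
    if any(dominates(p, candidate) for p in population):
        return population
    kept = [p for p in population if not weakly_dominates(candidate, p)]
    return kept + [candidate]
```

A candidate that trades error against size is added alongside the existing members, while one that is better on both objectives replaces every member it dominates. This is why the population size changes from generation to generation.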
--primitives <symbol>,...
The primitive set used by the algorithm must be specified as a list of comma-delimited symbols. If no primitives are specified all available primitives will be used by default.
--primitives ADD,MUL,DIV,SUB,LOG,SIN
A full list of available primitives can be seen by using --listPrimitives.
(-cP | --crossoverProbability) <percentage in range [0.0, 1.0]>
The probability of crossover being used on an offspring during each instance of the evolutionary cycle.
A crossover probability of 1.0 will cause crossover to always occur in each cycle, whereas a probability of 0.0 will prevent crossover from ever occurring.
The crossover probability will only take effect if the algorithm uses a crossover operator.
In version 0.7.0 no crossover is included by default with no way to change it.
(-mP | --mutationProbability) <percentage in range [0.0, 1.0]>
The probability of mutation being used on an offspring during each instance of the evolutionary cycle.
A mutation probability of 1.0 will cause mutation to always occur in each cycle, whereas a probability of 0.0 will prevent mutation from ever occurring.
The mutation probability will only take effect if the algorithm uses a mutator.
In version 0.7.0 a mutator is included by default with no way to change it.
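Both probabilities behave the same way: a random number is drawn for each offspring, and the operator is applied only if the draw falls below the probability. The Python sketch below illustrates this idea; the helper `maybe_apply` is hypothetical and is not how Iconic necessarily implements it.

```python
import random


def maybe_apply(operator, individual, probability, rng=random):
    """Apply operator to individual with the given probability in [0.0, 1.0]."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be in the range [0.0, 1.0]")
    # rng.random() returns a value in [0.0, 1.0), so 1.0 always applies
    # the operator and 0.0 never does.
    if rng.random() < probability:
        return operator(individual)
    return individual
```

Here `operator` stands for any mutation or crossover function that takes an individual and returns a modified copy.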
(-r | --repeat) <integer>
The number of repetitions (trials) to repeat the experiment for. Each trial will use the same parameters as specified.
If the --graph or --csv flags are enabled then the results from each trial will be included within the same output file(s).
--graph
If this flag is included iconic-cli will export the results to several charts in the PNG format.
These charts will be placed in the same output folder as the default output files.
The charts generated include a plot of every generation’s non-dominated set, the last generation’s non-dominated set, and a solution-fit plot of the overall Pareto-optimal set.
Non-dominated solutions from every generation
Non-dominated solutions from the last generation
Overall Pareto-optimal set’s solution fit
--csv
If this flag is included iconic-cli will export the results to several CSV files.
These CSV files will be placed in the same output folder as the default output files.
The CSV files generated include a list of chromosomes from every generation’s non-dominated set, and the chromosomes from the last generation’s non-dominated set. Chromosomes are formatted as a 3-tuple of (mean squared error, size, model).
These parameters are exclusive to cartesian chromosomes. If any other type of chromosome is used any option specified here will be ignored.
Version 0.7.0 doesn’t support the use of other chromosomal types so these parameters will never be ignored.
(--columns) <integer>
The number of columns that the chromosome should use. A cartesian chromosome stores its genotype as a graph; this parameter specifies the columnar dimensions of that graph.
(--rows) <integer>
The number of rows that the chromosome should use. A cartesian chromosome stores its genotype as a graph; this parameter specifies the row dimensions of that graph.
In general there is no reason to use a number of rows other than one as there’s always a functionally equivalent one-dimensional graph. If in doubt set this parameter to one.
(--levelsBack) <integer>
The maximum number of levels back that any node within the chromosome can connect to. A cartesian chromosome stores its genotype as a graph, this parameter specifies how far back in terms of columns that any node in the graph can reach. If a column is in range the node may connect to any other node in that column.
A maximum number of levels back that’s equal to or greater than the number of columns in the chromosome means that any node in the graph can connect to any other node preceding it. Reducing the maximum number of levels back will force the chromosome to produce larger models.
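The effect of the levels-back limit can be shown with a small Python sketch. It is an illustration only: the function name and 0-based column numbering are assumptions, and connections to the program inputs (the dataset features) are left out.

```python
def connectable_columns(column, levels_back):
    """Columns (0-based) that a node in `column` may take its inputs from."""
    first = max(0, column - levels_back)
    return list(range(first, column))
```

For a node in column 5, a limit of 2 allows connections to columns 3 and 4 only, whereas a limit of 5 or more allows connections to every preceding column.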
The Iconic team stopped official implementation of the Iconic project on 02/11/2018. The Iconic team is not obligated in any way to maintain the system or provide support for end users. Contact information is provided below as a courtesy. Feel free to contact the people listed.
|Contact - Org|Email|Role|Responsibility|
|:---:|---|---|---|
|Jayden Urch - Iconic|jayden.urch@uon.edu.au|Project Manager|Project management|
|Tim Pitts - Iconic|timothy.pitts@uon.edu.au|Client Communications|External resourcing and communications|
|Jasbir Shah - Iconic|jasbir.shah@uon.edu.au|Developer|Search algorithm and CLI expert|
|Lachlan Meyer - Iconic|lachlan.meyer@uon.edu.au|Developer|Repository Owner|
|Dr Pablo Moscato - UoN|pablo.moscato@newcastle.edu.au|Course Manager|Subject matter expert|
|Dr Markus Wagner - UoA|markus.wagner@adelaide.edu.au|Primary Client|Project direction and features|