项目作者: worikgh

项目描述 :
Genetic programming
高级语言: Rust
项目地址: git://github.com/worikgh/wGP.git
创建时间: 2018-04-19T03:04:29Z
项目社区:https://github.com/worikgh/wGP

开源协议:GNU General Public License v3.0

下载


Genetic Programming - Again

Learning Classifier Rules

A system using genetic programming to automatically generate
classifiers.

Each classifier is a genetic programming tree expressed recursively:

  1. struct Node {
  2. o:Operator,
  3. l:Option<Node>,
  4. r:Option<Node>,
  5. d:Option<Node>,
  6. }

Each classifier is evolved using the genetic programming algorithm
with training data. Each classifier is associated with a Score object:

  1. #[derive(PartialEq, Debug, Clone)]
  2. pub struct Score {
  3. pub quality:f64,
  4. pub class:String,
  5. }

Score::quality is defined as the 1/(1+S), S is the sum of the
classification error over the training data.

Classification error is currently calculated using Hinge Loss.
Given T in 1.0, -1.0 the true classification of the case and Y the
estimate of the classifier, Hinge Loss is:

  1. 1.0-T*Y < 0 ? 0.0 : 1.0-T*Y

A classifier is assigned to a class by evaluating a Score for each
class and choosing the one with the highest score.

FIXME: There is no measure of differentiation. How good a classifier
is at telling one class from another.

When it classifies a case the ideal classifier will output 1 if the
case is of the class and -1 if the case is not. Classifiers are
implemented as programme trees and prepared using genetic programming.

To classify a new case a collection, Forest, of trees is used. Each
classifier examines the case and produces a result in [-1.0, 1.0].

For each class, calculate the mean score for classifiers in the Forest
specialised for that class, as:

  1. sum(result * Score::quality)/#classifiers

The class with the highest score is selected.

The value returned from classify in population.rs is

  1. Option<(String, String)>

The first String in the 2-tuple is the winning class. The second
string is in the format: <class> <score>,... listing all classes and
the score for the class in descending score order.

Operators implemented

  • Terminals All terminals are implemented as floating point values.
    On terminal nodes Node::l, Node::r and Node::d are None.
    • Inputs. From the domain of the function.
    • Constants. C
  • Arity one functions. Apply to Node::l. Node::r and Node::d are None.
    • Invert. Returns one over the value of l.
    • Negate. Returns negative one times ‘l’
  • Arity two functions. Apply to Node::l and Node::r. Node::d is None

    • Multiply. Returns l times r.
    • Gt. If the value of l is greater than the value of ‘r return 1.
      Otherwise return -1

    • Lt. If the value of l is less than the value of ‘r return 1.
      Otherwise return -1

    • Addition. Returns l plus r

  • Arity three functions. Apply to Node::l and Node::r, and
    Node::d

    • If. Calculate the value or ‘d’. If that is less than or equal to
      zero return the value of l, otherwise the value of r

Classes and Quality

Configuration File

Format is:

<key> <value>

num_generations

  1. The number of generations to simulate
  2. Example: num_generations 20000

max_population

  1. The maximum size of the population
  2. Example: max_population 10000

crossover_percent

  1. Each generation new individuals are created by combining two
  2. individuals. The number of new individuals created is at most
  3. population x crossover_percent / 100. New individuals are only
  4. added if they are not already in the population, duplicates are
  5. not allowed in the population.
  6. Example: crossover_percent 50

training_percent

  1. The data supplied for the simulation is divided into training and
  2. testing. This sets the percentage of data used to train the
  3. model. During model development the training data is used. When
  4. a new model that performs best in the training data is discovered
  5. it is run against the testing data and the results recorded
  6. Example: training_percent 80

data_file

  1. The file name of the training and testing data. Comma separated
  2. line per record. The first line is the field names (these
  3. constitute the inputs to the generated functions). The last
  4. column is the objective value.
  5. Example: data_file Abalone.in

generations_file

  1. The name of the file out to which a line is written every
  2. generation
  3. Example: generations_file AbaloneGenerations.txt

birthsanddeaths_file

  1. Every individual has a line in this file when it is created and
  2. when it is destroyed.
  3. Example: birthsanddeaths_file AbaloneBirthsAndDeaths.txt

copy_prob

data_file

filter

mutate_prob

rescore

save_file

training_percent

work_dir