This page describes Orange's classification trees. It first covers the basic components and procedures: it starts with the structure that represents the tree, then explains how the tree is used for classification, and then how it is built and pruned. The order might seem strange, but these components are rather complex and this order is perhaps a bit easier to follow. Once you have some idea of what the principal components do, we describe the concrete classes that you can use as components for a tree learner. The last part of the page contains examples.
Classification trees are represented as a tree-like hierarchy of classes. TreeNode stores information about the learning examples belonging to the node, a branch selector, a list of branches (if the node is not a leaf) with their descriptions and strengths, and a classifier.
distribution: Stores the class distribution for the learning examples belonging to the node. Storing distributions can be disabled by setting the TreeLearner's storeDistributions flag to false.

contingency: Stores complete contingency matrices for the learning examples belonging to the node. Storing contingencies can be enabled by setting the TreeLearner's storeContingencies flag to true. Note that even when the flag is not set, the contingencies get computed and stored to the TreeNode, but are removed shortly afterwards. The details are given in the description of the TreeLearner object.

examples, weightID: Store a set of learning examples for the node and the corresponding ID of the weight meta-attribute. Except for the root, the nodes' ExampleTables contain references to examples in the root's ExampleTable. Examples are only stored if a corresponding flag (storeExamples) has been set while building the tree; to conserve space, storing is disabled by default.

nodeClassifier: A classifier (usually, but not necessarily, a DefaultClassifier) that can be used to classify examples coming to the node. If the node is a leaf, this is used to decide the final class (or class distribution) of an example. If it's an internal node, it is stored if the flag storeNodeClassifier is set. Since the nodeClassifier is needed by some TreeDescenders and for pruning (see far below), this is the default behaviour; the space consumption of the default DefaultClassifier is rather small. You should never disable this if you intend to prune the tree later.

If the node is a leaf, the remaining fields are None. If it's an internal node, there are several additional fields.
branches: Stores a list of subtrees, given as TreeNodes. An element can be None; in this case the node is empty.

branchDescriptions: A list of string descriptions for the branches, constructed by the TreeSplitConstructor. It can contain different kinds of descriptions, but basically, expect things like 'red' or '>12.3'.

branchSizes: Gives the (weighted) number of training examples that went into each branch. It can be used, for instance, to model probabilities when classifying examples with unknown values.

branchSelector: Determines the branch for each example. It is of type Classifier, since its job is similar to that of a classifier: it gets an example and returns a discrete Value in range [0, len(branches)-1]. When an example cannot be classified to any branch, the selector can return a Value containing a special value (sVal) which should be a discrete distribution (DiscDistribution). This should represent the branchSelector's opinion of how to divide the example between the branches. Whether the proposition will be used or not depends upon the chosen TreeExampleSplitter (when learning) or TreeDescender (when classifying).

The three lists (branches, branchDescriptions and branchSizes) are of the same length; all of them are defined if the node is internal and none if it is a leaf.
TreeNode has a method treesize() that returns the number of nodes in the subtrees (including the node, excluding null-nodes).
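To make the structure concrete, here is a minimal sketch that builds a tree with the default components and pokes at the root node; the dataset (lenses, also used in the examples at the end of this page) and the variable names are our choice:

import orange

data = orange.ExampleTable("lenses")        # any dataset with a discrete class will do
treeClassifier = orange.TreeLearner(data)   # calling the learner with data returns a TreeClassifier
root = treeClassifier.tree                  # the root TreeNode

print root.distribution                     # class distribution of the learning examples
print root.branchSelector.classVar.name     # the attribute tested at the root
print root.branchDescriptions               # descriptions of the branches
print root.treesize()                       # number of nodes, excluding null-nodes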
A TreeClassifier is an object that classifies examples according to a tree stored in a field tree.

Classification would be straightforward if there were no unknown values or, in general, examples that cannot be placed into a single branch. The response in such cases is determined by a component descender.
TreeDescender is an abstract object which is given an example and whose basic job is to descend as far down the tree as possible, according to the values of the example's attributes. The TreeDescender calls the node's branchSelector to get the branch index. If it's a simple index, the corresponding branch is followed. If not, it's up to the descender to decide what to do, and that's where descenders differ. A descender can choose a single branch (for instance, the one that is most recommended by the branchSelector) or it can let the branches vote.
In general there are three possible outcomes of a descent.
1. The descender reaches a leaf. This happens when there were no problems during the descent, or when the descender resolved them by choosing a single branch and continuing. In this case, it returns the reached TreeNode.

2. branchSelector returned a distribution and the TreeDescender decided to stop the descent at this (internal) node. Again, the descender returns the current TreeNode and nothing else.

3. branchSelector returned a distribution and the TreeDescender wants to split the example (i.e., to decide the class by voting). It returns a TreeNode and the vote-weights for the branches. The weights can correspond to the distribution returned by branchSelector, to the number of learning examples that were assigned to each branch, or to something else.

TreeClassifier uses the descender to descend from the root. If it returns only a TreeNode and no distribution, the descent stops; it does not matter whether it's a leaf (the first case above) or an internal node (the second case). The node's nodeClassifier is used to decide the class. If the descender returns a TreeNode and a distribution, the TreeClassifier recursively calls itself for each of the subtrees and the predictions are weighted as requested by the descender.
When voting, subtrees do not predict the class but probabilities of classes. The predictions are multiplied by weights, summed and the most probable class is returned.
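From a script, all of this stays hidden behind the classifier's call operator. A minimal sketch (the dataset and variable names are ours):

import orange

data = orange.ExampleTable("lenses")
treeClassifier = orange.TreeLearner(data)

example = data[0]
print treeClassifier(example)                           # the predicted class
print treeClassifier(example, orange.GetProbabilities)  # a distribution of class probabilities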
The rest of this section is only for those interested in the C++ code.
If you'd like to understand how the classification works in C++, start reading at TTreeClassifier::vote. It gets a TreeNode, an Example and a distribution of vote weights. For each node, it calls TTreeClassifier::classDistribution and then multiplies and sums the distributions. vote returns a normalized distribution of predictions.
A new overload of TTreeClassifier::classDistribution gets an additional parameter, a TreeNode. This is done for the sake of recursion. The normal version of classDistribution simply calls the overloaded version with the tree root as the additional parameter. classDistribution uses the descender. If the descender reaches a leaf, it calls the nodeClassifier; otherwise it calls vote.
Thus, the TreeClassifier's vote and classDistribution are written as a double recursion. The recursive calls do not happen at each node of the tree but only at nodes where a vote is needed (that is, at nodes where the descender halts).

For predicting a class, operator() calls the descender. If it reaches a leaf, the class is predicted by the leaf's nodeClassifier. Otherwise, it calls vote. From then on, vote and classDistribution interweave down the tree and return a distribution of predictions. operator() then simply chooses the most probable class.
The main learning object is TreeLearner. It is basically a skeleton into which the user must plug the components for particular functions. For easier use, defaults are provided.
Components that govern the structure of the tree are split (of type TreeSplitConstructor), stop (of type TreeStopCriteria) and exampleSplitter (of type TreeExampleSplitter).
The job of TreeSplitConstructor is to find a suitable criterion for dividing the learning (and later testing) examples coming to the node. The data it gets is a set of examples (and, optionally, an ID of the weight meta-attribute), a domain contingency computed from the examples, apriori class probabilities, a list of candidate attributes it should consider and a node classifier (if it was constructed, that is, if storeNodeClassifier is left true).
The TreeSplitConstructor should use the domain contingency when possible. The reasons are two-fold: one is that it's faster, and the other is that the contingency matrices are not necessarily constructed by simply counting the examples. Why and how is explained later. There are, however, cases when the domain contingency does not suffice; for example, when ReliefF is used as a measure of attribute quality. In this case, there's no other way but to use the examples and ignore the precomputed contingencies.
The split constructor should consider only the attributes in the candidate list (the list is a vector of booleans, one for each attribute).
TreeSplitConstructor returns most of the data we talked about when describing the TreeNode. It returns a classifier to be used as the TreeNode's branchSelector, a list of branch descriptions and a list with the number of examples that go into each branch. Just what we need for the TreeNode. It can return an empty list for the number of examples in branches; in this case, the TreeLearner will find the numbers itself after splitting the example set into subsets. However, if a split constructor can provide the numbers at no extra computational cost, it should do so.
In addition, it returns a quality of the split; a number without any fixed meaning except that higher numbers mean better splits.
If the constructed splitting criterion uses an attribute in such a way that the attribute is 'completely spent' and should not be considered as a split criterion in any of the subtrees (typically, discrete attributes that are used as they are, that is, without any binarization or subsetting), then it should report the index of this attribute. Some splits do not spend any attribute; this is indicated by returning a negative index.
A TreeSplitConstructor can veto further tree induction by returning no classifier. This can happen for many reasons. A general one is related to the number of examples in the branches. TreeSplitConstructor has a field minSubset, which sets the minimal number of examples in a branch; null nodes, however, are allowed. If there is no split where this condition is met, TreeSplitConstructor stops the induction.
TreeStopCriteria is a much simpler component that, given a set of examples, a weight ID and contingency matrices, decides whether to continue the induction or not. The basic criterion checks whether there are any examples and whether they belong to at least two different classes (if the class is discrete). Derived components check things like the number of examples and the proportion of majority classes.
TreeExampleSplitter is analogous to the TreeDescender described a while ago. Just like the TreeDescender decides the branch for an example during classification, the TreeExampleSplitter sorts the learning examples into branches.
TreeExampleSplitter is given a TreeNode (from which it can use various things, but most splitters only use the branchSelector), a set of examples to be divided, and the weight ID. The result is a list of subsets of examples and, optionally, a list of new weight IDs.
Subsets are usually stored as ExamplePointerTables. Most TreeExampleSplitters simply call the node's branchSelector and assign examples to the corresponding branches. When the value is unknown, they choose a particular branch or simply skip the example.
Some enhanced splitters can split examples: an example (actually, a pointer to it) is copied to more than one subset. To facilitate real splitting, weights are needed. Each branch is assigned a weight ID (each would usually have its own ID) and all examples that are in that branch (either completely or partially) should have this meta-attribute. If an example hasn't been split, it has only one additional attribute, with the weight ID corresponding to the subset to which it went. An example that is split between, say, three subsets has three new meta-attributes, one for each subset. The IDs of the weight meta-attributes are returned by the TreeExampleSplitter to be used at induction of the corresponding subtrees.
Note that weights are used only when needed. When no splitting occurred - because the splitter is not able to do it or because there was no need for splitting - no weight IDs are returned.
TreeLearner has a number of components.

split: An object of type TreeSplitConstructor. The default, provided by TreeLearner, is TreeSplitConstructor_Combined with separate constructors for discrete and continuous attributes. Discrete attributes are used as they are, while continuous attributes are binarized. Gain ratio is used to select attributes. A minimum of two examples in a leaf is required for discrete attributes and five examples in a leaf for continuous attributes.

stop: An object of type TreeStopCriteria. The default stopping criterion stops induction when all examples in a node belong to the same class.

exampleSplitter: An object of type TreeExampleSplitter. The default splitter is TreeExampleSplitter_UnknownsAsSelector, which splits the learning examples according to the distributions given by the selector.

nodeLearner: Induces a classifier from the examples belonging to a node. The default nodeLearner is MajorityLearner.

descender: The descending component that the induced TreeClassifier will use. The default descender is TreeDescender_UnknownMergeAsSelector, which votes using the branchSelector's distribution for vote weights.

storeDistributions, storeContingencies, storeExamples, storeNodeClassifier: Flags that decide whether to store class distributions, contingencies and examples in TreeNodes, and whether the nodeClassifier should be built for internal nodes. By default, distributions and node classifiers are stored, while contingencies and examples are not. You won't save any memory by not storing distributions but storing contingencies, since distributions actually point to the same distribution that is stored in contingency.classes.

The TreeLearner first sets the defaults for missing components. Although stored in the actual TreeLearner's fields, they are removed when the induction is finished.
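The sketch below wires up components equivalent to the described defaults by hand; the keyword-style construction and the mapping of the leaf minima onto the sub-constructors' minSubset are our assumptions - normally you would simply leave the slots empty:

import orange

learner = orange.TreeLearner()

# split: combined constructor with gain ratio; discrete attributes used as they
# are, continuous attributes binarized with a threshold
learner.split = orange.TreeSplitConstructor_Combined(
    discreteSplitConstructor=orange.TreeSplitConstructor_Attribute(
        measure=orange.MeasureAttribute_gainRatio(), minSubset=2),
    continuousSplitConstructor=orange.TreeSplitConstructor_Threshold(
        measure=orange.MeasureAttribute_gainRatio(), minSubset=5))

# common stopping criterion; with its default parameters it behaves like the basic one
learner.stop = orange.TreeStopCriteria_common()
learner.exampleSplitter = orange.TreeExampleSplitter_UnknownsAsSelector()
learner.nodeLearner = orange.MajorityLearner()
learner.descender = orange.TreeDescender_UnknownMergeAsSelector()

data = orange.ExampleTable("lenses")
tree = learner(data)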
Then it ensures that the examples are stored in a table. This is needed because the algorithm juggles with pointers to examples. If the examples are in a file or are fed through a filter, they are copied to a table. Even if they are already in a table, they are copied if storeExamples is set. This is to ensure that the pointers keep pointing to the examples even if the user later changes the example table. If they are in a table and the storeExamples flag is clear, we just use them as they are. This will obviously crash in a multi-threaded system if one changes the table during the tree induction. Well... don't do it.
Apriori class probabilities are computed. At this point we check the sum of example weights; if it's zero, there are no examples and we cannot proceed. A list of candidate attributes is set; in the beginning, all attributes are candidates for the split criterion.
Now comes the recursive part of the TreeLearner. Its arguments are a set of examples, a weight meta-attribute ID (a tricky thing: it can always stay the same as the original or it can change to accommodate splitting of examples among branches), the apriori class distribution and a list of candidates (represented as a vector of Boolean values).
The contingency matrix is computed next. This happens even if the flag storeContingencies is false. If the contingencyComputer is given, we use it; otherwise, we construct just an ordinary contingency matrix.
The stop component is called to see whether it's worth continuing. If not, a nodeClassifier is built and the TreeNode is returned. Otherwise, a nodeClassifier is only built if the forceNodeClassifier flag is set.
To get a TreeNode's nodeClassifier, the nodeLearner's smartLearn function is called with the given examples, the weight ID and the just computed matrix. If the learner can use the matrix (and the default, MajorityLearner, can), it won't touch the examples. Thus, the choice of contingencyComputer will, in many cases, affect the nodeClassifier. The nodeLearner can return no classifier; if so, and if the classifier would be needed for classification, the TreeClassifier's function returns DK or an empty distribution. If you're writing your own tree classifier - pay attention.
If the induction is to continue, a split component is called. If it fails to return a branch selector, induction stops and the TreeNode is returned.
TreeLearner then uses the exampleSplitter to divide the examples as described above.
The contingency gets removed at this point if it is not to be stored. Thus, the split, stop and exampleSplitter can use the contingency matrices if they wish.
The TreeLearner then recursively calls itself for each of the non-empty subsets. If the splitter returns a list of weights, the corresponding weight is used for each branch. Besides, the attribute spent by the splitter (if any) is removed from the list of candidates for the subtree.
A subset of examples is stored in its corresponding tree node, if so requested. If not, the new weight attributes are removed (if any were created).
Tree pruners derived from TreePruner can be given either a TreeNode (presumably, but not necessarily, a root) or a TreeClassifier. The result is a new, pruned TreeNode or a new TreeClassifier with a pruned tree. The original tree remains intact.
Note, however, that pruners construct only a shallow copy of a tree. The pruned tree's TreeNodes contain references to the same contingency matrices, node classifiers, branch selectors, ... as the original tree. Thus, you may modify a pruned tree's structure (manually cut it, add new nodes, replace components), but modifying, for instance, some node's nodeClassifier (the nodeClassifier itself, not a reference to it!) would modify the node's nodeClassifier in the corresponding node of the original tree.
Talking about node classifiers - pruners cannot construct a nodeClassifier nor merge the nodeClassifiers of the pruned subtrees into classifiers for the new leaves. Thus, if you want to build a prunable tree, the internal nodes must have their nodeClassifiers defined. Fortunately, all you need to do is nothing; if you leave the TreeLearner's flags as they are by default, the nodeClassifiers are created.
Several classes described above are already functional and can (and mostly will) be used as they are: TreeNode, TreeLearner and TreeClassifier. This section describes the other classes.
The classes TreeSplitConstructor, TreeStopCriteria, TreeExampleSplitter, TreeDescender, Learner and Classifier are among the Orange classes that can be subtyped in Python and have the call operator overloaded in such a way that it is called back from C++ code. You can thus program your own components for TreeLearners and TreeClassifiers. Detailed information on how this is done and what can go wrong is given on a separate page, dedicated to callbacks to Python.
Split construction is almost as exciting as waiting for a delayed flight. Boring, that is. Split constructors are programmed as spaghetti code that juggles with contingency matrices, with separate cases for discrete and continuous classes... Most split constructors work either for discrete or for continuous attributes. The suggested practice is to use a TreeSplitConstructor_Combined that can handle both by simply delegating attributes to specialized split constructors.
Note: split constructors that cannot handle attributes of a particular type (discrete, continuous) do not report an error or a warning but simply skip the attribute. It is your responsibility to use a correct split constructor for your dataset. (May we again suggest using TreeSplitConstructor_Combined?)
The same components can be used for inducing both classification and regression trees. The only component that needs to be chosen accordingly is the 'measure' attribute for the TreeSplitConstructor_Measure class (and derived classes).
The TreeSplitConstructor's function has been described in detail in the description of the learning process.
Attributes

minSubset: The minimal (weighted) number of examples required in a non-null branch; splits that would produce smaller branches are not acceptable.

Methods

__call__(examples[, weightID, apriori-distribution, candidates]): Constructs a split. Examples can be given in any acceptable form (an ExampleGenerator, such as ExampleTable, or a list of examples). WeightID is optional; the default of 0 means that all examples have a weight of 1.0. Apriori-distribution should be of type orange.Distribution and candidates should be a Python list of objects which are interpreted as booleans.

The function returns a tuple (branchSelector, branchDescriptions, subsetSizes, quality, spentAttribute). SpentAttribute is -1 if no attribute is completely spent by the split criterion. If no split is constructed, the branchSelector, branchDescriptions and subsetSizes are None, while quality is 0.0 and spentAttribute is -1.

TreeSplitConstructor_Measure is an abstract base class for split constructors that employ a MeasureAttribute to assess the quality of a split. At present, all split constructors except for TreeSplitConstructor_Combined are derived from this class.
Attributes

measure: A component of type MeasureAttribute used for the evaluation of a split. Note that you must select a subclass of MeasureAttribute capable of handling your class type - you cannot use MeasureAttribute_gainRatio for building regression trees or MeasureAttribute_MSE for classification trees.

worstAcceptable: The lowest split quality that is still acceptable; this value only makes sense in connection with the measure component. Default is 0.0.

TreeSplitConstructor_Attribute attempts to use a discrete attribute as a split; each value of the attribute corresponds to a branch in the tree. Attributes are evaluated with the measure and the one with the highest score is used for a split. If there is more than one attribute with the highest score, one of them is selected at random.
The constructed branchSelector is an instance of ClassifierFromVarFD that returns a value of the selected attribute. If the attribute is an EnumVariable, the branchDescriptions are the attribute's values. The attribute is marked as spent, so that it cannot reappear in the node's subtrees.
TreeSplitConstructor_ExhaustiveBinary works on discrete attributes. For each attribute, it determines which binarization of the attribute gives the split with the highest score. If more than one split has the highest score, one of them is selected at random. After trying all the attributes, it returns one of those with the highest score.
The constructed branchSelector is again an instance of ClassifierFromVarFD that returns a value of the selected attribute. This time, however, its transformer contains an instance of MapIntValue that maps the values of the attribute into a binary attribute. Branch descriptions are of the form "[<val1>, <val2>, ...<valn>]" for branches corresponding to more than one value of the attribute. Branches that correspond to a single value of the attribute are described with this value. If the attribute was originally binary, it is spent and cannot be used in the node's subtrees. Otherwise, it can reappear in the subtrees.
TreeSplitConstructor_Threshold is currently the only constructor for splits on continuous attributes. It divides the range of attribute values with a threshold that maximizes the split's quality. As always, if there is more than one split with the highest score, a random threshold is selected. The attribute that yields the best binary split is returned.
The constructed branchSelector is again an instance of ClassifierFromVarFD with an attached transformer. This time, the transformer is of type ThresholdDiscretizer. The branch descriptions are "<threshold" and ">=threshold". The attribute is not spent.
TreeSplitConstructor_Combined delegates the task of finding the optimal split to separate split constructors for discrete and for continuous attributes. Each split constructor is called and given only the attributes of the appropriate type as candidates. Both construct a candidate for a split; the better of them is selected.
(Note that there is a problem when more candidates have the same score. Suppose there are nine discrete attributes with the highest score; the split constructor for discrete attributes will select one of them. Now, let us suppose that there is a single continuous attribute with the same score. TreeSplitConstructor_Combined would randomly select between the proposed discrete attribute and the continuous attribute, not aware of the fact that the discrete one has already competed with eight other discrete attributes. So, the probability of selecting (each) discrete attribute would be 1/18 instead of 1/10. Although not really correct, we doubt that this affects the tree's performance; many other machine learning systems simply choose the first attribute with the highest score anyway.)
The branchSelector, the branchDescriptions and whether the attribute is spent are decided by the winning split constructor.
Attributes

discreteSplitConstructor: The split constructor for discrete attributes; it can be, for instance, TreeSplitConstructor_Attribute or TreeSplitConstructor_ExhaustiveBinary.

continuousSplitConstructor: The split constructor for continuous attributes; it can be either TreeSplitConstructor_Threshold or a split constructor you programmed in Python.
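For instance, a sketch of a combined constructor with explicitly chosen sub-constructors (the keyword-style construction and the choice of measure are ours):

import orange

split = orange.TreeSplitConstructor_Combined()
split.discreteSplitConstructor = orange.TreeSplitConstructor_ExhaustiveBinary(
    measure=orange.MeasureAttribute_gainRatio())
split.continuousSplitConstructor = orange.TreeSplitConstructor_Threshold(
    measure=orange.MeasureAttribute_gainRatio())

learner = orange.TreeLearner(split=split)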
TreeStopCriteria determines when to stop the induction of subtrees, as described in detail in the description of the learning process.
As opposed to TreeSplitConstructor and similar basic classes, TreeStopCriteria is not an abstract but a fully functional class that provides the basic stopping criteria: the tree induction stops when there is at most one example left (in this case, it is not the weighted but the actual number of examples that counts). Besides that, the induction stops when all examples are in the same class (for discrete problems) or have the same value of the outcome (for regression problems).
Methods

__call__(examples[, weightID, domain contingencies]): Decides whether to stop (true) or continue (false) the induction. If contingencies are given, they are used for checking whether the examples are in the same class (but not for counting the examples). Derived classes should use the contingencies whenever possible; if contingencies are not given, TreeStopCriteria will work without them. Derived classes should also use them if they are available, but otherwise compute them only when they really need them.
TreeStopCriteria_common contains additional criteria for pre-pruning: it checks the proportion of majority class and the number of weighted examples.
Attributes

maxMajority: Induction stops when the proportion of the majority class exceeds this threshold.

minExamples: Nodes with fewer than minExamples examples are not split any further. The example count is weighed.

TreeExampleSplitter is the third crucial component of TreeLearner. Its function is described in the description of the learning process.
An abstract base class for objects that split sets of examples into subsets. The derived classes differ in treatment of examples which cannot be unambiguously placed into a single branch (usually due to unknown value of the crucial attribute).
Methods

__call__(node, examples[, weightID]): Uses the information in node (particularly the branchSelector) to split the given set of examples into subsets. The function returns a tuple with a list of example generators and a list of weights. The list of weights is either an ordinary Python list of integers or None when no splitting of examples occurs and thus no weights are needed.
Derived classes differ in how they treat examples that cannot be unambiguously placed into a single branch; they include, among others, TreeExampleSplitter_UnknownsAsSelector (the default, which splits such examples according to the distribution proposed by the branchSelector) and TreeExampleSplitter_UnknownsToBranch (which puts them into a separate branch for unknown values).
TreeDescender is the classifier's counterpart of TreeExampleSplitter. It decides the destiny of examples that need to be classified and cannot be unambiguously put into a branch. The detailed function of this class is given in the description of classification with trees.
An abstract base object for tree descenders.
Methods
__call__(node, example): Descends down the tree until it reaches a leaf or a node at which a vote of the subtrees is required. In both cases, a tuple of two elements is returned; in the former it contains the reached node and None, in the latter it contains a node and the weights of votes for the subtrees (a list of floats).

TreeDescenders that never split examples always descend to a leaf, but they differ in the treatment of examples with unknown values (or, in general, examples for which a branch cannot be determined at some node(s) of the tree). TreeDescenders that do split examples differ in the returned vote weights.

Derived classes differ accordingly. Some stop the descent when an example cannot be placed into a single branch; the node's nodeClassifier will then be used to make a decision. It is your responsibility to see that even the internal nodes have their nodeClassifiers (i.e., don't disable creating node classifiers or manually remove them after the induction, that's all). Another descender sends such examples down a special branch for unknown values; this only makes sense if the tree was built with TreeExampleSplitter_UnknownsToBranch.

Classes derived from TreePruner prune trees as described in the section on pruning - make sure you read it to understand what the pruners will do to your trees.
This is an abstract base class which defines nothing useful, only a pure virtual call operator.
Methods

The call operator takes either a TreeNode (presumably, but not necessarily, a root) or a TreeClassifier, and returns an object of the same type with a pruned tree; the original is left intact.
In Orange, a tree can have non-trivial subtrees (i.e. subtrees with more than one leaf) in which all the leaves have the same majority class. (The reason why this is allowed is that those leaves can still have different class distributions and thus predict different probabilities.) However, this can be undesirable when we're only interested in the class prediction or in a simple tree interpretation. TreePruner_SameMajority prunes the tree so that there is no subtree in which all the nodes would have the same majority class.
This pruner will only prune the nodes in which the node classifier is of class DefaultClassifier (or from a derived class).
Note that the leaves with more than one majority class require some special handling. The pruning goes backwards, from leaves to the root. When siblings are compared, the algorithm checks whether they have (at least one) common majority class. If so, they can be pruned.
TreePruner_m prunes a tree by comparing m-estimates of static and dynamic error, as defined in (Bratko, 2002).
Attributes

m: Parameter m for m-estimation.
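A small sketch of applying a pruner (the value of m is arbitrary and the variable names are ours; the tree is built from the lenses data used in the examples below):

import orange

data = orange.ExampleTable("lenses")
treeClassifier = orange.TreeLearner(data)

pruner = orange.TreePruner_m(m=2.0)
prunedClassifier = pruner(treeClassifier)   # a new TreeClassifier; the original stays intact

print treeClassifier.tree.treesize(), prunedClassifier.tree.treesize()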
This page does not provide examples for programming your own components, such as, for instance, a TreeSplitConstructor. Those examples can be found on a page dedicated to callbacks to Python.
To have something to work on, we'll take the data from the lenses dataset and build a tree using the default components.
part of treestructure.py (uses lenses.tab)
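A sketch of what that part of the script might look like; the variable name treeClassifier is our choice and is reused in the snippets below:

import orange

data = orange.ExampleTable("lenses")
treeClassifier = orange.TreeLearner(data)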
How big is our tree?
part of treestructure.py (uses lenses.tab)
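Continuing the script, a sketch of the size-counting function explained below:

def treeSize(node):
    if not node:
        return 0                        # a null node contributes nothing
    size = 1                            # count this node
    if node.branchSelector:             # internal node: add the subtrees
        for branch in node.branches:
            size += treeSize(branch)
    return size

print "Tree size:", treeSize(treeClassifier.tree)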
If node is None, we have a null node; null nodes don't count, so we return 0. Otherwise, the size is 1 (this node) plus the sizes of all subtrees. The node is an internal node if it has a node.branchSelector; if there's no selector, it's a leaf. Don't attempt to skip the if statement: leaves don't have an empty list of branches, they don't have a list of branches at all.
Don't forget that this was only an exercise - TreeNode has a built-in method treesize() that does exactly the same.
Let us now write a simple script that prints out a tree. The recursive part of the function will get a node and its level.
part of treestructure.py (uses lenses.tab)
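A sketch of such a function; it follows the explanation below, and the exact formatting (the \n's and the trailing commas) is only there to make the output readable:

def printTree0(node, level):
    if not node:
        print " " * level + "<null node>"
        return

    if node.branchSelector:
        # internal node: print the tested attribute and the class distribution
        nodeDesc = node.branchSelector.classVar.name
        nodeCont = node.distribution
        print "\n" + "   " * level + "%s (%s)" % (nodeDesc, nodeCont),
        for i in range(len(node.branches)):
            print "\n" + "   " * level + ": %s" % node.branchDescriptions[i],
            printTree0(node.branches[i], level + 1)
    else:
        # leaf: print the distribution and the predicted class
        nodeCont = node.distribution
        majorClass = node.nodeClassifier.defaultValue
        print "--> %s (%s)" % (majorClass, nodeCont),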
Don't waste time studying the formatting tricks (the \n's etc.); they are just for nicer output. What matters is everything but the print statements. First, we check whether the node is a null node (a node to which no learning examples were classified). If so, we just print out "<null node>" and return.
After handling null nodes, the remaining nodes are internal nodes and leaves. For internal nodes, we print a node description consisting of the attribute's name and the distribution of classes. The TreeNode's branch selector is, for all currently defined splits, an instance of a class derived from Classifier (in fact, it is a ClassifierFromVarFD, but a Classifier would suffice), and its classVar points to the attribute we seek, so we print its name. We will also assume that storing class distributions has not been disabled and print them as well. (A more capable function for printing trees, such as the one defined in orngTree, has an alternative means to get the distribution when this fails.) Then we iterate through the branches; for each we print a branch description and recursively call printTree0 with the level increased by 1 (to increase the indent).
Finally, if the node is a leaf, we print out the distribution of learning examples in the node and the class to which the examples in the node would be classified. We again assume that the nodeClassifier is the default one - a DefaultClassifier. A better print function should be aware of possible alternatives.
Now, we just need to write a simple function to call our printTree0. We could write something like...
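For instance, a two-liner that assumes it always gets a TreeClassifier and simply unwraps its tree field:

def printTree(tree):
    printTree0(tree.tree, 0)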
... but we won't. Let us learn how to handle arguments of different types. Let's write a function that will accept either a TreeClassifier or a TreeNode (just like TreePruners, remember?)
part of treestructure.py (uses lenses.tab)
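A sketch of the type-dispatching version described below:

def printTree(x):
    if isinstance(x, orange.TreeClassifier):
        printTree0(x.tree, 0)
    elif isinstance(x, orange.TreeNode):
        printTree0(x, 0)
    else:
        raise TypeError("invalid parameter")
    print    # finish the last line started by printTree0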
It's fairly straightforward: if x is of a type derived from orange.TreeClassifier, we print x.tree; if it's a TreeNode, we just call printTree0 with x. If it's of some other type, we don't know how to handle it and thus raise an exception. (Note that we could also use if type(x) == orange.TreeClassifier:, but this would only work if x were of type orange.TreeClassifier and not of any derived type. The latter, however, do not exist yet...)
part of treestructure.py (uses lenses.tab)
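Calling it on both accepted types (with the treeClassifier built earlier) might look like this:

printTree(treeClassifier)
printTree(treeClassifier.tree)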
For a final exercise, let us write a simple pruning procedure. It will be written entirely in Python, unrelated to any TreePruners. Our procedure will limit the tree depth - the maximal depth (here defined as the number of internal nodes on any path down the tree) shall be given as an argument. For example, to get a two-level tree, we would call "cutTree(root, 2)". The function will be recursive, with the second argument (level) decreasing at each call; when zero, the current node will be made a leaf.
part of treestructure.py (uses lenses.tab)
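A sketch of the depth-cutting function described below:

def cutTree(node, level):
    if node and node.branchSelector:        # nothing to cut at null nodes or leaves
        if level:
            for branch in node.branches:
                cutTree(branch, level - 1)
        else:
            # turn this internal node into a leaf
            node.branchSelector = None
            node.branches = None
            node.branchDescriptions = None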
There's nothing to prune at null nodes or leaves, so we act only when node and node.branchSelector are defined. If level is not zero, we call the function for each branch. Otherwise, we clear the selector, the branches and the branch descriptions.
part of treestructure.py (uses lenses.tab)
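And a quick test of it, again using the treeClassifier from above:

cutTree(treeClassifier.tree, 2)
printTree(treeClassifier)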
You've already seen a simple example of using a TreeLearner. You can just call it and let it fill the empty slots with the default components. This section will teach you three things: what the missing components are (and how to set the same components yourself), how to use alternative components to get a different tree and, finally, how to write a skeleton for tree induction in Python.
Let us construct a TreeLearner to play with.
treelearner.py (uses lenses.tab)
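A sketch of the setup (the variable names are ours and are reused below):

import orange

data = orange.ExampleTable("lenses")
learner = orange.TreeLearner()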
There are three crucial components in learning: the split and stop criteria, and the exampleSplitter (there are some others, which become important during classification; we'll talk about them later). They are not defined; if you use the learner, the slots are filled temporarily but later cleared again.
The stop is trivial. The default can be set by hand as in the sketch below.
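Assuming the default corresponds to the common stopping criterion (whose parameters we examine next):

learner.stop = orange.TreeStopCriteria_common()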
Well, this is actually done in C++ and it uses a global component that is constructed once for all, but apart from that we did effectively the same thing.
We can now examine the default stopping parameters.
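For instance:

print learner.stop.maxMajority, learner.stop.minExamples
# with the defaults we would expect something like: 1.0 0.0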
Not very restrictive. This keeps splitting the examples until there's nothing left to split or all the examples are in the same class. Let us set the minimal subset that we allow to be split to five examples and see what comes out.
part of treelearner.py (uses lenses.tab)
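A sketch of that step; here we only check the resulting tree size instead of printing the whole tree:

learner.stop.minExamples = 5.0
tree = learner(data)
print tree.tree.treesize()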
OK, that's better. If we want an even smaller tree, we can also limit the maximal proportion of majority class.
part of treelearner.py (uses lenses.tab)
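A sketch of the same, with a rather aggressive threshold (the value 0.5 is our choice):

learner.stop.maxMajority = 0.5
tree = learner(data)
print tree.tree.treesize()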
Well, this might have been an overkill...