Module orngTree implements a class TreeLearner
for
building both decision and regression
trees. orngTree.TreeLearner
is essentially a wrapper
around orange.TreeLearner
,
provided for easier use of the latter.
The module also contains functions for counting the number of nodes and leaves in the tree.
The module includes functions for printing out the tree, which are rather versatile and can print out practically anything you'd like to know, from the number of examples, proportion of examples of majority class in nodes and similar, to more complex statistics like the proportion of examples in a particular class divided by the proportion of examples of this class in a parent node. And even more, you can define your own callback functions to be used for printing.
is a class that assembles the generic
classification tree learner (from Orange's objects for induction of
decision trees). It sets a number of parameters used in induction that
can also be set after the creation of the object, most often through
the object's attributes. If upon initialization
TreeLearner
is given a set of examples, then an instance
of TreeClassifier
object is returned instead.
TreeSplitConstructor
(see documentation on
TreeLearner). Useful when prototyping new tree induction
algorithms. When this parameter is defined, other parameters that
affect the procedures for growing of the tree are ignored. These
include binarization
, measure
,
worstAcceptable
and minSubset
(default:
None).Used in pre-pruning, sets the lowest required attribute score. If the score of the best attribute is below this margin, the tree at that node is not grown further (default: 0).
So, to allow splitting only when gainRatio (the default measure) is greater than 0.6, one should run the learner like this:
minExamples
examples are not split any further, that is, all leaves in the tree
will contain at least that many of examples (default: 0).This is an example of the tree on iris data set, built with the above arguments - the numbers show the majority class proportion at each node. You can find more details in the script tree2.py, which induces and prints this tree.
TreeStopCriteria
. Useful when prototyping new
tree induction algorithms. See a documentation on TreeStopCriteria for more
info on this function. When used, parameters
maxMajority
and minExamples
will not be
considered (default: None).contingency.classes.
(default: True except for
storeExamples, which defaults to False).For a bit more complex example, here's how to write your own stop function. The example itself is more funny than useful. It constructs and prints two trees. For the first one we define the defStop
function, which is used by default, and combine it with a random function so that the stop criteria will also be met in additional 20% of the cases when defStop
is false. The second tree is build such that it considers only the random function as the stopping criteria. Note that in the second case lambda function still has three parameters, since this is a necessary number of parameters for the stop function (for more, see section on Orange Trees in Orange Reference).
The output is not shown here since the resulting trees are rather big.
countNodes(tree)
returns the number of nodes of tree.
Arguments
countLeaves(tree)
returns the number of leaves in the tree.
Arguments
Function dumpTree
dumps a tree to a string, and printTree
prints out the tree (printTxt
is alias for printTree
, and it's there for compatibility). Functions have same arguments.
Before we go on: you can read all about the function and use it to its full extent, or you can just call it, giving it the tree as the sole argument and it will print out the usual textual representation of the tree. If you're satisfied with that, you can stop here.
Arguments
"."
, the same string is used as for leaves.True
(default), the branches with a single node are printed before the branches with larger subtrees. If you set it to False
(which I don't know why you would), the branches are printed in order of appearance.The magic is in the format string. It is a string which is printed out at every leaf or internal node with the certain format specifiers replaced by data from the tree node. Specifiers are generally of form
%[^]<precision><quantity><divisor>.
^ at the start tells that the number should be multiplied by 100. It's useful for printing proportions like percentages, but it makes no sense to multiply, say, the number of examples at the node (although the function will allow it).
<precision> is in the same format as in Python (or C) string formatting. For instance, %N
denotes the number of examples in the node, hence %6.2N
would mean output to two decimal digits and six places altogether. If left out, a default format 5.3
is used, unless you multiply the numbers by 100, in which case the default is .0
(no decimals, the number is rounded to the nearest integer).
<divisor> tells what to divide the quantity in that node with. bP
means division by the same quantity in the parent node; for instance, %NbP
will tell give the number of examples in the node divided by the number of examples in parent node. You can add use precision formatting, e.g. %6.2NbP
. bA
is division by the same quantity over the entire data set, so %NbA
will tell you the proportion of examples (out of the entire training data set) that fell into that node. If division is impossible since the parent node does not exist or some data is missing, a dot is printed out instead of the quantity.
<quantity> is the only required element. It defines what to print. For instance, %N
would print out the number of examples in the node. Possible quantities are
%I(95)
of (more complicated) %5.3I(95)bP
.%5.3C="Iris-virginica"bP
- this will tell the number of examples of Iris-virginica by the number of examples this class in the parent node. If you are interested in examples that are not Iris-virginica, say %5.3CbP!="Iris-virginica"
For regression trees, you can use operators =, !=, <, <=, >, and >=, as in %C<22
- add the precision and divisor if you will. You can also check the number of examples in a certain interval: %C[20, 22]
will give you the number of examples between 20 and 22 (inclusive) and %C(20, 22)
will give the number of such examples excluding the boundaries. You can of course mix the parentheses, e.g. %C(20, 22]
. If you would like the examples outside the interval, add a !
, like %C!(20, 22]
.Now for some examples. We shall build a small tree from the iris data set - we shall limit the depth to three levels.
part of orngTree1.py
The easiest way to call the function is to pass the tree as the only argument. Calling orngTree.printTree(tree)
will print
Let's now print out the predicted class at each node, the number of examples in the majority class with the total number of examples in the node,
orngTree.printTree(tree, leafStr="%V (%M out of %N)")
.
Would you like to know how the number of examples declines as compared to the entire data set and to the parent node? We find it with this: orng.printTree("%V (%^MbA%, %^MbP%)")
Let us first read the format string. %M
is the number of examples in the majority class. We want it divided by the number of all examples from this class on the entire data set, hence %MbA
. To have it multipied by 100, we say %^MbA
. The percent sign after that is just printed out literally, just as the comma and parentheses (see the output). The string for showing the proportion of this class in the parent is the same except that we have bP
instead of bA
.
And now for the output: all examples of setosa for into the first node. For versicolor, we have 98% in one node; the rest is certainly not in the neighbouring node (petal length>=5.350) since all versicolors from the node petal width<1.750 went to petal length<5.350 (we know this from the 100%
in that line). Virginica is the majority class in the three nodes that together contain 94% of this class (4+4+86). The rest must had gone to the same node as versicolor.
If you find this guesswork annoying - so do I. Let us print out the number of versicolors in each node, together with the proportion of versicolors among the examples in this particular node and among all versicolors. So,
'%C="Iris-versicolor" (%^c="Iris-versicolor"% of node, %^CbA="Iris-versicolor"% of versicolors)
gives the following output:
Finally, we may want to print out the distributions, using a simple string %D
.
What is the order of numbers here? If you check data.domain.classVar.values
, you'll learn that the order is setosa, versicolor, virginica; so in the node at peta length<5.350 we have 49 versicolors and 3 virginicae. To print out the proportions, we can use, for instance %.2d
- this gives us proportions within node, rounded on two decimals.
We haven't tried printing out some information for internal nodes. To start with the most trivial case, we shall print the prediction at each node
nodeStr
should be the same as leafStr
(not very useful here, since leafStr
is trivial anyway).
Note that the output is somewhat different now: there appeared another node called root and the tree looks one level deeper. This is needed to print out the data for that node to.
Now for something more complicated: let us observe how the number of virginicas decreases down the tree:
Let's first interpret the format string: CbA="Iris-virginica"
is the number of examples from class virginica, divided by the total number of examples in this class. Add ^.1
and the result will be multiplied and printed with one decimal. The trailing %
is printed out. In parentheses we print the same thing except that we divide by the examples in the parent node. Note the use of single quotes, so we can use the double quotes inside the string, when we specify the class.
See what's in the parentheses in the root node? If printTree
cannot compute something (in this case it's because the root has no parent), it prints out a dot. You can also replace =
by !=
and it will count all classes except virginica.
For one final example with classification trees, we shall print the distributions in that nodes, the distribution compared to the parent and the proportions compared to the parent (the latter things are not the same - think why). In the leaves we shall also add the predicted class. So now we'll have to call the function like this.
Here's the result:
To explore the possibilities when printing regression trees, we are gonna induce a tree from the housing data set. Called with the tree as the only argument, printTree
prints the tree like this:
Let us add the standard error in both internal nodes and leaves, and the 90% confidence intervals in the leaves. So we want to call it like this:
What's the difference between %V
, the predicted value and %A
the average? Doesn't a regression tree always predict the leaf average anyway? Not necessarily, the tree predict whatever the nodeClassifier
in a leaf returns. But you're mostly right. The difference is in the number of decimals: %V
uses the FloatVariable
's function for printing out the value, which results the printed number to have the same number of decimals as in the original file from which the data was read.
Regression trees cannot print the distributions in the same way as classification trees. They instead offer a set of operators for observing the number of examples within a certain range. For instance, let us check the number of examples with values below 22, and compare this number with values in the parent nodes.
The last line, for instance, says the the number of examples with the class below 22 is among those with tax above 534 is 30 times higher than the number of such examples in its parent node.
For another exercise, let's count the same for all examples outside interval [20, 22] (given like this, the interval includes the bounds). And let us print out the proportions as percents.
OK, let's observe the format string for one last time. %c![20, 22]
would be the proportion of examples (within the node) whose values are below 20 or above 22. By %cbP![20, 22]
we derive this by the same statistics computed on the parent. Add a ^
and you have the percentages.
dumpTree
's argument userFormats
can be used to print out some other information in the leaves or nodes. If provided, userFormat
should contain a list of tuples with a regular expression and a callback function to be called when that expression is found in the format string. Expressions from userFormats
are checked before the built-in expressions discussed above, so you can override the built-ins if you want to.
The regular expression should describe a string like those we used above, for instance the string %.2DbP
. When a leaf or internal node is printed out, the format string (leafStr
or nodeStr
) is checked for these regular expressions and when the match is found, the corresponding callback function is called.
The callback function will get five arguments: the format string (leafStr
or nodeStr
), the match object, the node which is being printed, its parent (can be None
) and the tree classifier. The function should return the format string in which the part described by the match object (that is, the part that is matched by the regular expression) is replaced by whatever information your callback function is supposed to give.
The function can use several utility function provided in the module.
s
which is covered by mo
by the string sub
.insertStr(s, mo, "."). You should use this when the function cannot compute the desired quantity; it is called, for instance, when it needs to divide by something in the parent, but the parent doesn't exist.
s
matched by mo
by the number n
, formatted as specified by the user, that is, it multiplies it by 100, if needed, and prints with the right number of places and decimals. It does so by checking the mo
for a group named m100
(representing the ^
in the format string) and a group named fs
represented the part giving the number of decimals (e.g. 5.3
).by
equals bp
, it returns parent
, else it returns tree.tree
. This is used to find what to divide the quantity with, when division is required.There are also a few pieces of regular expression that you may want to reuse. The two you are likely to use are
^
) and the format for the number of decimals (e.g. 5.3
). The corresponding groups are named m100
and fs
.bP
or bA
or nothing; the result is in groups by
.For a trivial example, "%V" is implemented like this. There is the following tuple in the list of built-in formats: (re.compile("%V"), replaceV)
. replaceV
is a function defined by:
It therefore takes the value predicted at the node (node.nodeClassifier.defaultValue
), converts it to a string and passes it to insertStr
to do the replacement.
A more complex regular expression is the one for the proportion of majority class, defined as "%"+fs+"M"+by
. It uses the two partial expressions defined above.
Let's say with like to print the classification margin for each node, that is, the difference between the proportion of the largest and the second largest class in the node.
part of orngTree2.py
We first defined getMargin
which gets the distribution and computes the margin. The callback replaces, replaceB
, computes the margin for the node. If we need to divided the quantity by something (that is, if the by
group is present), we call orngTree.byWhom
to get the node with whose margin this node's margin is to be divided. If this node (usually the parent) does not exist of if its margin is zero, we call insertDot
to insert a dot, otherwise we call insertNum
which will insert the number, obeying the format specified by the user.
myFormat
is a list containing the regular expression and the callback function.
We can now print out the iris tree, for instance using the following call.
And this is what we get.
Function printDot
prints the tree to a file in a format used by GraphViz.
Uses the same parameters as printTxt
defined above, and
in addition two parameters which define the shape used for internal
nodes and laves of the tree:
Arguments
Check Polygon-based Nodes for various outlines supported by GraphViz.
Suppose you saved the tree in a file tree5.dot
. You can then print it out as a gif if you execute the following command line
E Koutsofios, SC North. Drawing Graphs with dot. AT&T Bell Laboratories, Murray Hill NJ, U.S.A., October 1993.
Graphviz - open source graph drawing software. A home page of AT&T's dot and similar software packages.