C4.5 Classifier and Learner

C4.5 is a standard benchmark in machine learning. For this reason, it is incorporated in Orange, although Orange has its own implementation of decision trees.

C45Learner uses Quinlan's original code for learning, so the tree you get is exactly like the one that would be built by standalone C4.5. Upon return, however, the original tree is copied to Orange components that contain exactly the same information plus what is needed to make them accessible from Python. To be sure that the algorithm behaves just as the original, we use a dedicated class orange.C45TreeNode instead of reusing the components used by Orange's tree inducer (i.e., orange.TreeNode). This could be done, though, and probably will be in the future; we shall still retain orange.C45TreeNode but offer a transformation to orange.TreeNode so that routines that work on Orange trees will also be usable for C4.5 trees.

C45Learner and C45Classifier behave like any other Orange learner and classifier. Unlike most Orange learning algorithms, however, C4.5 does not accept weighted examples.

Building the C4.5 plug-in

We haven't been able to obtain the legal rights to distribute C4.5 and therefore couldn't statically link it into Orange. Instead, it's incorporated as a plug-in which you'll need to build yourself. The procedure is trivial, but you'll need a C compiler. On Windows, the scripts we provide work with MS Visual C and the files CL.EXE and LINK.EXE must be on the PATH. On Linux you're equipped with gcc. Mac OS X comes without gcc, but you can download it for free from Apple.

Orange must be installed prior to building C4.5. (This is because the build script will copy the created file next to Orange, which it obviously can't if Orange isn't there yet.)

  1. Download the C4.5 (Release 8) sources from the RuleQuest site and extract them into some temporary directory. The files will be modified in the process, so don't use a copy of Quinlan's sources that you need for another purpose.
  2. Download buildC45.zip and unzip its contents into the directory R8/Src of Quinlan's sources (the directory that contains, for instance, the file average.c).
  3. Run buildC45.py, which will build the plug-in and put it next to orange.pyd (or orange.so on Linux/Mac).
  4. Run python, import orange and create orange.C45Learner(). If this fails, something went wrong; see the diagnostic messages from buildC45.py and read the paragraph below.
  5. Finally, you can remove Quinlan's sources, along with everything created by buildC45.py.

If the process fails, here's what buildC45.py really does. First, it creates .h files that wrap Quinlan's .i files and ensure that they are not included twice, and it modifies the C4.5 sources to include the .h's instead of the .i's. This step can hardly fail. Then follows the platform-dependent step, which compiles ensemble.c (which includes all the Quinlan's .c files it needs) into c45.dll or c45.so and puts it next to Orange. If this fails, but you do have a C compiler and linker and know how to use them, you can compile ensemble.c into a dynamic library yourself; see the compile and link steps in buildC45.py if that helps. In any case, after doing this, check that the built C4.5 gives the same results as the original.
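The first, wrapping step can be sketched in a few lines. The helper below only illustrates the idea behind the generated wrappers (an include guard around each .i file); the function name is ours and this is not the actual code from buildC45.py.

```python
# Illustrative sketch of the wrapping idea used by buildC45.py: each generated
# .h file guards the content of the corresponding .i file so that it cannot be
# included twice. The helper below is ours, not buildC45.py's actual code.
def wrap_with_guard(name, body):
    """Return C header text that wraps 'body' in an include guard."""
    guard = "__%s__" % name.upper().replace(".", "_")
    return "\n".join([
        "#ifndef %s" % guard,
        "#define %s" % guard,
        body,
        "#endif",
    ])

print(wrap_with_guard("types.i", "typedef int ItemNo;"))
```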

C45Learner

C45Learner's attributes have double names: the single-letter switches you know from the C4.5 command line and the corresponding names of C4.5's internal variables. All defaults are set as in C4.5; if you change nothing, you are running C4.5.

Attributes

gainRatio (g)
Determines whether to use information gain (false, default) or gain ratio (true) for selection of attributes.
batch (b)
Turn on batch mode (no windows, no iterations); this option is not documented in C4.5 manuals. It conflicts with "window", "increment" and "trials".
subset (s)
Enables subsetting (default: false, no subsetting)
probThresh (p)
Probabilistic threshold for continuous attributes (default: false)
minObjs (m)
Minimal number of objects (examples) in leaves (default: 2)
window (w)
Initial window size (default: the maximum of 20% of the examples and twice the square root of the number of data objects)
increment (i)
The maximum number of objects that can be added to the window at each iteration (default: 20% of the initial window size)
cf (c)
Pruning confidence level (default: 25%)
trials (t)
Set the number of trials in iterative (i.e. non-batch) mode (default: 10)
prune
Return pruned tree (not an original C4.5 option) (default: true)

C45Learner also offers another way of setting the arguments: its method commandline is given a string and parses it the same way as C4.5 would parse its command line.
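To illustrate what such parsing amounts to, here is a hedged sketch of a parser that maps a C4.5-style option string onto the attribute names listed above. The option table follows the documentation above, but the parser itself is our illustration, not C45Learner's actual implementation.

```python
# Hedged sketch: mapping C4.5 command-line switches to learner attributes.
# The table mirrors the documented options; the parser is illustrative only,
# not C45Learner's real code.
OPTIONS = {
    "g": ("gainRatio", bool), "b": ("batch", bool), "s": ("subset", bool),
    "p": ("probThresh", bool), "m": ("minObjs", int), "w": ("window", int),
    "i": ("increment", int), "c": ("cf", float), "t": ("trials", int),
}

def parse_commandline(line):
    """Parse a string such as '-m 1 -s' into a dict of attribute settings."""
    settings, tokens, i = {}, line.split(), 0
    while i < len(tokens):
        name, typ = OPTIONS[tokens[i].lstrip("-")]
        if typ is bool:                 # flag options take no argument
            settings[name] = True
            i += 1
        else:                           # numeric options consume the next token
            settings[name] = typ(tokens[i + 1])
            i += 2
    return settings

print(parse_commandline("-m 1 -s"))  # {'minObjs': 1, 'subset': True}
```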

C45Classifier

C45Classifier contains a faithful reimplementation of Quinlan's function from C4.5. The only difference (and the only reason it's been rewritten) is that it uses a tree composed of orange.C45TreeNodes instead of C4.5's original tree structure.

Attributes

tree
C4.5 tree stored as a tree of C45TreeNodes.

C45TreeNode

This class is a reimplementation of the corresponding struct from Quinlan's C4.5 code.

Attributes

nodeType
Type of the node: C45TreeNode.Leaf (0), C45TreeNode.Branch (1), C45TreeNode.Cut (2) or C45TreeNode.Subset (3). "Leaves" are leaves, "branches" split examples based on values of a discrete attribute, "cuts" cut them according to a threshold value of a continuous attribute, and "subsets" use discrete attributes but with subsetting, so that several values can go into the same branch.
leaf
Value returned by that leaf. The field is defined for internal nodes as well.
items
Number of (learning) examples in the node.
classDist
Class distribution for the node (of type DiscDistribution).
tested
The attribute used in the node's test. If the node is a leaf, tested is None; if the node is of type Branch or Subset, tested is a discrete attribute; and if the node is of type Cut, tested is a continuous attribute.
cut
A threshold for continuous attributes, if node is of type Cut. Undefined otherwise.
mapping
Mapping for nodes of type Subset. Element mapping[i] gives the index of the branch for examples in which the value of tested has index i. Here, i denotes the index of a value, not a Value.
branch
A list of branches stemming from this node.
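The mapping attribute is easiest to understand on a small example. The sketch below is pure Python with made-up values, not Orange code: given a hypothetical discrete attribute with four values and mapping [0, 1, 0, 1], the values with indices 0 and 2 go down branch 0 and the other two down branch 1.

```python
# Illustrative example of a Subset node's mapping, with a made-up discrete
# attribute; this is plain Python, not Orange code.
values = ("a", "b", "c", "d")   # hypothetical attribute values
mapping = [0, 1, 0, 1]          # mapping[i] = branch index for value index i

# Group attribute values by the branch they are mapped to.
branches = {}
for value_index, branch_index in enumerate(mapping):
    branches.setdefault(branch_index, []).append(values[value_index])

print(branches)  # {0: ['a', 'c'], 1: ['b', 'd']}
```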

Examples

The simplest way to use C45Learner is to call it. The following script constructs the same tree as you would get by running standalone C4.5 with default settings.

part of c45.py (uses lenses.tab)

import orange
data = orange.ExampleTable("lenses")
tree = orange.C45Learner(data)
for i in data[:5]:
    print tree(i), i.getclass()

Arguments can be set by the usual mechanism (the two lines below do the same thing, except that one uses the command-line symbol and the other the internal variable name):

tree = orange.C45Learner(data, m=100)
tree = orange.C45Learner(data, minObjs=100)

Veteran C4.5 users might prefer to set the options through the method commandline.

lrn = orange.C45Learner()
lrn.commandline("-m 1 -s")
tree = lrn(data)

There's nothing special about using C45Classifier - it behaves just like any other classifier. To demonstrate what the structure of C45TreeNodes looks like, we will show a script that prints a tree out in the same format as C4.5 does. (You can find the script in the module orngC45.)

def printTree0(node, classvar, lev):
    var = node.tested

    if node.nodeType == 0:
        print "%s (%.1f)" % (classvar.values[int(node.leaf)], node.items),

    elif node.nodeType == 1:
        for i, val in enumerate(var.values):
            print ("\n"+"|    "*lev + "%s = %s:") % (var.name, val),
            printTree0(node.branch[i], classvar, lev+1)

    elif node.nodeType == 2:
        print ("\n"+"|    "*lev + "%s <= %.1f:") % (var.name, node.cut),
        printTree0(node.branch[0], classvar, lev+1)
        print ("\n"+"|    "*lev + "%s > %.1f:") % (var.name, node.cut),
        printTree0(node.branch[1], classvar, lev+1)

    elif node.nodeType == 3:
        for i, branch in enumerate(node.branch):
            inset = filter(lambda a:a[1]==i, enumerate(node.mapping))
            inset = [var.values[j[0]] for j in inset]
            if len(inset)==1:
                print ("\n"+"|    "*lev + "%s = %s:") % (var.name, inset[0]),
            else:
                print ("\n"+"|    "*lev + "%s in {%s}:") % (var.name, reduce(lambda x,y:x+", "+y, inset)),
            printTree0(branch, classvar, lev+1)


def printTree(tree):
    printTree0(tree.tree, tree.classVar, 0)
    print

Leaves are the simplest. We just print out the value contained in node.leaf. Since this is not a qualified value (i.e., C45TreeNode does not know to which attribute it belongs), we need to convert it to a string through classVar, which is passed as an extra argument to the recursive part of printTree.

For discrete splits without subsetting, we print out all attribute values and recursively call the function for all branches. Continuous splits are equally easy to handle.

For discrete splits with subsetting, we iterate through the branches, collect into inset the values that go into each branch, turn the values into strings and print them out, treating separately the case when only a single value goes into a branch.
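To see the recursion in action without Orange installed, the sketch below rebuilds the same printing scheme on hand-made stand-ins for C45TreeNode. The Mock* classes and format_tree are ours, defined only for this illustration, and only leaves (nodeType 0) and cuts (nodeType 2) are handled.

```python
# Hand-made stand-ins for C45TreeNode and an attribute descriptor; these mock
# classes exist only for this sketch and are not part of Orange.
class MockVar:
    def __init__(self, name, values=()):
        self.name, self.values = name, values

class MockNode:
    def __init__(self, nodeType, leaf=0, items=0.0, tested=None,
                 cut=None, branch=()):
        self.nodeType, self.leaf, self.items = nodeType, leaf, items
        self.tested, self.cut, self.branch = tested, cut, branch

def format_tree(node, classvar, lev=0):
    """Render a (mock) tree in C4.5's indented text format."""
    if node.nodeType == 0:                           # leaf: class (items)
        return " %s (%.1f)" % (classvar.values[int(node.leaf)], node.items)
    out = []
    if node.nodeType == 2:                           # cut: threshold split
        for sign, child in zip(("<=", ">"), node.branch):
            out.append("\n" + "|    " * lev +
                       "%s %s %.1f:" % (node.tested.name, sign, node.cut))
            out.append(format_tree(child, classvar, lev + 1))
    return "".join(out)

age = MockVar("age")
cls = MockVar("class", values=("no", "yes"))
tree = MockNode(2, tested=age, cut=30.0, branch=(
    MockNode(0, leaf=0, items=5.0), MockNode(0, leaf=1, items=7.0)))
print(format_tree(tree, cls))
```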