Prev: My first Orange classifier, Next: Testing and Evaluating, Up: Classification
Orange supports a number of classification techniques, for instance classification trees, variants of naive Bayes, k-nearest neighbors, classification through association rules, function decomposition, logistic regression, and support vector machines. We have already seen naive Bayes; here we will look at a few more methods. Bear in mind that often the best (and sometimes the only) way to access the different methods is through their associated modules, so you should look there for more detailed documentation.
Let us look briefly at a different learning method. The classification tree learner (yes, this is the same as a decision tree) is another native Orange learner, but because it is a rather complex object that, for the sake of versatility, is composed of a number of other objects (for attribute estimation, stopping criteria, etc.), a wrapper module called orngTree was built around it to simplify the use of classification trees and to assemble the learner from the usual (default) components. Here is a script that uses it:
tree.py (uses voting.tab)
Note that this script is almost the same as the one for naive Bayes (classifier2.py), except that we have imported another module (orngTree) and used the learner orngTree.TreeLearner to build a classifier called tree.
For those of you who are at home with machine learning: the default parameters of the tree learner assume that a single example is enough to form a leaf, that gain ratio is used to measure the quality of the attributes considered for internal nodes of the tree, and that no pruning takes place after the tree is constructed (see the orngTree documentation for details). A tree built with default parameters would be rather big, so we have additionally requested that leaves sharing a common predecessor (node) be pruned if they classify to the same class, and that the tree be post-pruned using the m-error estimate pruning method with parameter m set to 2.0.
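As an aside, gain ratio (the default attribute quality measure mentioned above) is easy to sketch in plain Python. The toy data set and helper names below are our own illustration of the measure, not part of the Orange API:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain_ratio(values, labels):
    """Information gain of an attribute, normalized by its split information."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = entropy(values)  # entropy of the attribute's own distribution
    return gain / split_info if split_info > 0 else 0.0

# A toy data set: the first attribute predicts the class perfectly,
# the second is pure noise.
labels = ["dem", "dem", "rep", "rep"]
a1 = ["y", "y", "n", "n"]
a2 = ["y", "n", "y", "n"]
print(gain_ratio(a1, labels))  # 1.0 (perfect split)
print(gain_ratio(a2, labels))  # 0.0 (uninformative)
```

An attribute with higher gain ratio is a better candidate for an internal node; the normalization penalizes attributes with many values.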
The output of the script that uses classification tree learner is:
Notice that all of the instances are classified correctly. The last line of the script prints out the tree that was used for classification:
output of running the tree.py script
Notice that the printout states the decision at internal nodes and, for leaves, the class label the tree would predict. Each leaf also has an associated class probability, estimated from the learning set of examples.
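One common way to estimate such leaf probabilities (and the idea behind the m-error estimate pruning used above) is the m-estimate, which shrinks the raw relative frequency toward a prior. A minimal sketch, with made-up counts for illustration:

```python
def m_estimate(n_class, n_total, prior, m=2.0):
    """m-estimate of probability: shifts the relative frequency
    n_class / n_total toward the prior, more strongly for small leaves."""
    return (n_class + m * prior) / (n_total + m)

# A leaf with 3 examples, all of one class, and a 0.5 prior for that class:
print(m_estimate(3, 3, 0.5))  # 0.8, pulled below the raw estimate of 1.0
```

Small leaves thus get less extreme probability estimates, which is exactly why the m-error estimate is useful for deciding whether a subtree is worth keeping.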
If you are more of a "visual" type, you may like the following presentation of the tree better. This was achieved by printing the tree to a so-called dot file (the line of the script required for this is orngTree.printDot(tree, fileName='tree.dot', internalNodeShape="ellipse", leafShape="box")), which was then compiled to PNG using AT&T's Graphviz program called dot (see the orngTree documentation for more):
Let us here check on two other classifiers. The first one, called the majority classifier, will seem rather useless, as it always classifies to the majority class of the learning set. It predicts class probabilities that are equal to the class distribution of the learning set. While useless as such, it is often good to compare this simplest classifier to any other classifier you test: if your classifier is not significantly better than the majority classifier, then this may be a reason to sit back and think.
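In spirit, the majority classifier does no more than the following pure-Python sketch (the class names match the voting data set, but the counts and class layout here are our own illustration, not output from Orange):

```python
from collections import Counter

class MajorityClassifier:
    """Fits only the class distribution of the learning set and
    completely ignores the instance being classified."""
    def __init__(self, labels):
        counts = Counter(labels)
        n = len(labels)
        self.distribution = {c: k / n for c, k in counts.items()}
        self.majority = counts.most_common(1)[0][0]

    def __call__(self, instance=None):
        # Same prediction and probabilities for every instance.
        return self.majority, self.distribution

# Illustrative learning set: more democrats than republicans.
labels = ["democrat"] * 267 + ["republican"] * 168
majority = MajorityClassifier(labels)
print(majority({"any": "instance"}))  # always predicts "democrat"
```

Whatever instance you pass in, the prediction and the probabilities stay the same, which is precisely what makes it a useful baseline.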
The second classifier we introduce here is based on the k-nearest neighbors algorithm, an instance-based method that finds the k examples from the learning set that are most similar to the instance to be classified. From the set obtained in this way, it estimates class probabilities and uses the most frequent class for prediction.
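The idea can be sketched in a few lines of plain Python. The mismatch-count distance over discrete attribute values and the toy data below are our own simplification; Orange's k-NN learner uses its own, more general distance measures:

```python
from collections import Counter

def knn_classify(train, instance, k=3):
    """Predict the most frequent class among the k training examples
    closest to the instance (discrete attributes, mismatch-count distance)."""
    def distance(a, b):
        return sum(1 for x, y in zip(a, b) if x != y)
    neighbors = sorted(train, key=lambda ex: distance(ex[0], instance))[:k]
    votes = Counter(label for _, label in neighbors)
    probabilities = {c: v / len(neighbors) for c, v in votes.items()}
    return votes.most_common(1)[0][0], probabilities

# Illustrative training data: tuples of attribute values plus a class label.
train = [
    (("y", "y", "n"), "democrat"),
    (("y", "n", "n"), "democrat"),
    (("n", "y", "y"), "republican"),
    (("n", "n", "y"), "republican"),
]
print(knn_classify(train, ("y", "y", "y"), k=3))
```

For the instance ("y", "y", "y"), two of its three nearest neighbors are democrats, so the sketch predicts "democrat" with probability 2/3.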
The following script takes the naive Bayes and classification tree learners (which we have already met) together with the majority and k-nearest neighbors classifiers (the new ones) and prints predictions for the first 10 instances of the voting data set.
handful.py (uses voting.tab)
The code is somewhat long, due to our effort to print the results nicely. The first part of the code sets up our four classifiers and gives them names. The classifiers are then put into a list held in the variable classifiers (this is convenient: if we needed to add another classifier, we would just define it and put it in the list, and the rest of the code would not need to change). The script then prints a header with the names of the classifiers, and finally uses the classifiers to compute the probabilities of classes. Note the special function apply, which we have not met yet: it simply calls the function given as its first argument, passing it the arguments given in the list. In our case, apply invokes our classifiers with a data instance and a request to compute probabilities. The output of our script is:
Notice that the prediction of the majority class classifier does not depend on the instance it classifies (of course!). Other than that, it would be inappropriate to say anything conclusive about the quality of the classifiers; for this, we will need to resort to statistical methods for comparing classification models, which you can read about in our next lesson.
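The classifiers-in-a-list pattern from the script generalizes well beyond Orange. Here is a sketch with stand-in classifiers of our own invention; note that Python 2's built-in apply(f, args), used in the script, is written f(*args) in modern Python:

```python
# Stand-in "classifiers": callables that map an instance to the probability
# of the first class. These are illustrative functions, not Orange learners.
classifiers = [
    ("majority", lambda instance: 0.61),
    ("always_half", lambda instance: 0.50),
]

instance = {"handicapped-infants": "y"}  # illustrative voting-style instance
print("instance  " + "  ".join(name for name, _ in classifiers))

args = (instance,)
# f(*args) is the modern spelling of Python 2's apply(f, args).
row = "  ".join("%.2f" % c(*args) for _, c in classifiers)
print("0         " + row)
```

Adding a fifth classifier would mean appending one (name, callable) pair to the list; the printing loop would pick it up unchanged, just as in the script above.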