Subtyping Orange classes in Python

This page describes how to subtype Orange's classes in Python and make the overloaded methods callable from C++ components.

Since Orange only has an interface to Python but is otherwise independent from it, subtyping might sometimes not work as you would expect. If you subtype an Orange class in Python and overload some methods, the C++ code (say, if you use your instance as a component of some Orange object) will call the C++ method and not the one you provided in Python.

Exceptions to this are Filter, Learner, Classifier, TreeSplitConstructor, TreeStopCriteria, TreeExampleSplitter, TreeDescender, MeasureAttribute, TransformValue, ExamplesDistance and ExamplesDistance_Constructor. If you subtype one of these classes and overload its call operator, your operator will get called from the C++ code. If you subclass any other class or overload any other method, the C++ code won't know about it.

If your subclass will only be called from Python code and never from C++, you can subclass anything you want and it will work as it should.

If you are satisfied with that, you can skip to the examples on this page to learn how to subclass what you need. If you wonder why it has to be like this, read on.


General Problem with Subtyping

Orange was first conceived as a C++ library of machine learning components. It was only after several years of development that Python was first used as a glue language. But even after being interfaced to Python, Orange still maintains its independence. It would, in principle, be possible to export Orange's components as, for example, COM objects that wouldn't require Python to run [1]. Orange components are not aware that they are being called from Python. Even more, they are not aware that they're being exposed to Python.

This becomes important when subtyping the components. Let's say we derive a Python class MyDomain from Domain and redefine the call operator, which is used to convert an example from another domain. Our operator uses the original Domain to convert the example but afterwards sets the class to unknown.

cb-mydomain.py (uses lenses.tab)

class MyDomain(orange.Domain):
    def __call__(self, example):
        ex = orange.Domain.__call__(self, example)
        ex.setclass("?")
        return ex

md = MyDomain(data.domain)

Subtyping built-in classes in Python is technically complex. When you call MyDomain, the Domain's constructor is called. It will construct a C++ instance of Domain, but when returning it to Python, it will mark it as an instance of MyDomain [2]. So, md's memory representation is the same as that of ordinary instances of orange.Domain, but its type, as known to Python, is MyDomain. When the Python interpreter calls md, it treats it as a MyDomain and correctly calls the method we defined above.

>>> print md(data[0])
['psby', 'hyper', 'y', 'normal', '?']

Not so with C++. The C++ code knows nothing about Python wrappers, types and overloaded methods. For C++, md is an ordinary instance of Domain. So, what happens when C++ code tries to call it? To check this, we will convert an example in another way: we'll call the Example's constructor. It accepts different sets of arguments; if you provide a domain and an existing example, a new example is constructed by converting the existing one into the specified domain. The domain is called internally to perform the actual conversion.

>>> print orange.Example(md, data[0])
['psby', 'hyper', 'y', 'normal', 'no']

The class is still 'no', not unknown. [3]

The obvious solution to the problem would be to make Orange's components "Python aware". This would be extremely slow: at each call, even at calls of operator[], for example, C++ code would have to check whether the corresponding operator has been overloaded in Python.

The solution we preferred is efficient, yet limited. The following fragment is a fully functional filter, derived from orange.Filter.

class FilterYoung(orange.Filter):
    def __call__(self, ex):
        return ex["age"]=="young"

You can, for instance, use it as an argument for the ExampleTable.filter method:

>>> fy = FilterYoung()
>>> for e in data.filter(fy):
...     print e
['young', 'myope', 'no', 'reduced', 'none']
['young', 'myope', 'no', 'normal', 'soft']
['young', 'myope', 'yes', 'reduced', 'none']
['young', 'myope', 'yes', 'normal', 'hard']
['young', 'hypermetrope', 'no', 'reduced', 'none']
['young', 'hypermetrope', 'no', 'normal', 'soft']
['young', 'hypermetrope', 'yes', 'reduced', 'none']
['young', 'hypermetrope', 'yes', 'normal', 'hard']

orange.Filter is an abstract class. You cannot construct it, e.g. by calling Filter(). But when the Filter's constructor is called to construct an instance of FilterYoung, such as fy, it constructs an instance of a special callback class, Filter_Python (although Orange doesn't admit it: when returning it to Python it says it's a FilterYoung, and that its base class is Filter [4]).

>>> type(FilterYoung())
<class '__main__.FilterYoung'>
>>> type(FilterYoung()).__base__
<type 'Filter'>

Filter_Python's call operator (written in C++ and thus seen and respected by C++ code) calls back the overloaded __call__ written in Python.

If it works for orange.Filter, why does it fail for orange.Domain? It simply hasn't been programmed. Only the call operators of the classes listed at the beginning of this page can do this. Why only those? A rough estimate is that making all the methods of the existing 300 Orange classes overloadable would inflate Orange's source code to three times its size! On the other hand, the chosen classes are those that are most likely to be overloaded. For others, you can either find another solution or ask us to make them overloadable as well. Adding this functionality to a single method of a single class is a small undertaking.

You might be tempted to overload a call operator of a class that is derived from Filter. For instance

class MyFilter(orange.Filter_index):
    ...

This will fail just as for Domain. The class Filter_Python is derived from Filter and can only inherit its functionality. In the previous example, Filter_Python was a hidden class between FilterYoung and orange.Filter. In this example, you would need a class between MyFilter and orange.Filter_index, and this role obviously cannot be played by Filter_Python.

How to do it then? The simplest way is to wrap a Filter_index, like this:

class MyFilter(orange.Filter):
    def __init__(self):
        self.subfilter = orange.Filter_index()
        ...

    def __call__(self, ex):
        ... here you can call self.subfilter or do some of your stuff ...

MyFilter is now derived from orange.Filter but still has its own copy of a Filter_index. Not pretty, but it works.

There's another nice way to construct your own filters (and, in general, other overloadable components). You don't need to derive a new class; you can seemingly construct an instance of an abstract class, giving a callback function as an argument.

filt = orange.Filter(lambda ex: ex["age"]=="young")
for e in data.filter(filt):
    print e

The Filter's constructor is called directly, as if to construct an instance of Filter, which it would usually refuse (since the class is abstract). But when given a function as an argument (here we used a lambda function, but you can, of course, use ordinary functions as well), it constructs a Filter_Python and stores the given function in its dictionary. When the Filter_Python is called, it calls the function that was passed to its constructor.

There's another twist. Sometimes you don't need to wrap the function into a class at all. You can, for example, construct a tree's nodeLearner on the fly. nodeLearner should be derived from Classifier, but you can assign it a callback function.

treeLearner = orange.TreeLearner()
treeLearner.nodeLearner = lambda gen, weightID: orange.MajorityLearner(gen, weightID)

This replaces the nodeLearner with a learner that calls orange.MajorityLearner. The example is artificial: the MajorityLearner could be given directly, with treeLearner.nodeLearner = orange.MajorityLearner; besides, this is the default anyway. Still, how can we assign a Python function to nodeLearner, which can only hold a pure C++ Learner, not a Python function? And how can we then even expect the TreeLearner (written in C++) to call it? Checking what the above snippet actually stored in treeLearner.nodeLearner is revealing.

>>> treeLearner.nodeLearner
<Learner instance at 0x019B1930>

When Orange assigns a value to a component's attribute like treeLearner.nodeLearner, it tries to convert the given argument to the correct class, a Learner in this case. If the user actually gives an instance of (something derived from) Learner, that's great. Otherwise, Orange calls the Learner's constructor with what was given as an argument. If the constructor can use it to construct an object, all is well. The above assignment is thus equivalent to

treeLearner = orange.TreeLearner()
treeLearner.nodeLearner = orange.Learner(lambda gen, weightID: orange.MajorityLearner(gen, weightID))

which, as we know from the last example with Filter, works as intended.

You might have used this feature before without knowing it. Have you ever constructed an EnumVariable and assigned some values to its field values? The field values is stored as an orange.StringList, a pure C++ vector<string>, not the Python list of strings you provided. The same thing as with nodeLearner happens here: since you tried to assign a Python list instead of a StringList, the StringList's constructor was called with the Python list as an argument.
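A quick way to convince yourself of this (a small sketch; the variable and its values are made up for illustration):

v = orange.EnumVariable("size")
v.values = ["small", "large"]   # assigning a plain Python list...
print type(v.values)            # ...reports a StringList, not a list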

A final piece of advice for using derived classes: don't wonder too much about how it works. Just be happy that it does. Use it, try things that you think might work, but be sure to check (a simple print here and there will suffice) that your call operators are actually called. That's all you need to care about.

Calling Inherited Methods

All classes for which you can (really) overload the call operator are abstract. The only exception is TreeStopCriteria, so this is the only class for which calling the inherited call operator makes sense. The Examples section shows how to do it.

For all other classes, calling the inherited method is an error. Similarly, forgetting to define the call operator but then trying to use it leads to a call of the inherited operator - again, an error.

Examples

The examples below suppose that you have loaded the 'lenses' data into a variable data.

Examples are somewhat simplified. For instance, many classes below will silently assume that the attribute they deal with is discrete. This is to make the code clearer.

Filter

We've already shown how to derive filters. A filter is a simple object that decides whether a given example is "acceptable" or not. The class below accepts examples for which the value of "age" is "young".

class FilterYoung(orange.Filter):
    def __call__(self, ex):
        return ex["age"]=="young"

Filter can be used, for instance, for selecting examples from an example table.

>>> fy = FilterYoung()
>>> for e in data.filter(fy):
...     print e
['young', 'myope', 'no', 'reduced', 'none']
['young', 'myope', 'no', 'normal', 'soft']
['young', 'myope', 'yes', 'reduced', 'none']
['young', 'myope', 'yes', 'normal', 'hard']
['young', 'hypermetrope', 'no', 'reduced', 'none']
['young', 'hypermetrope', 'no', 'normal', 'soft']
['young', 'hypermetrope', 'yes', 'reduced', 'none']
['young', 'hypermetrope', 'yes', 'normal', 'hard']

Note two things. First, you don't need to write your own filters to select examples based on values. You'd get the same effect with

>>> for e in data.select(age="young"):
...     print e

Second, you don't need to derive a class from a filter when a function would suffice. You can write either

>>> def f(ex):
...     return ex["age"]=="young"
...
>>> for e in data.filter(orange.Filter(f)):
...     print e

or, for cases as simple as this, squeeze the whole function into a lambda function

>>> for e in data.filter(orange.Filter(lambda ex: ex["age"]=="young")):
...     print e

Classifier

A "classifier" in Orange has a rather non-standard meaning. A classifier is an object with a call operator that gets an example and returns a value, a distribution of values or both - the return type is regulated by an optional second argument. Beside the standard use of classifiers - "class predictors" - this also covers predictors in regression, objects used in constructive induction (which use some of example's attributes to compute a value of a new attribute), and others.

For this tutorial, we will define a classifier that can be used for simple constructive induction. Its constructor will accept two attributes and construct a new attribute as the Cartesian product of the two. The new attribute's name will be built from the names of the two attributes, and the names of its values from pairs of their values. The call operator will return the value of the new attribute that corresponds to the values the two attributes have on the example.

cb-classifier.py (uses lenses.tab)

class CartesianClassifier(orange.Classifier):
    def __init__(self, var1, var2):
        self.var1, self.var2 = var1, var2
        self.noValues2 = len(var2.values)
        self.classVar = orange.EnumVariable("%sx%s" % (var1.name, var2.name))
        self.classVar.values = ["%s-%s" % (v1, v2) \
            for v1 in var1.values for v2 in var2.values]

    def __call__(self, ex, what = orange.Classifier.GetValue):
        val = ex[self.var1] * self.noValues2 + ex[self.var2]
        if what == orange.Classifier.GetValue:
            return orange.Value(self.classVar, val)
        probs = orange.DiscDistribution(self.classVar)
        probs[val] = 1.0
        if what == orange.Classifier.GetProbabilities:
            return probs
        else:
            return (orange.Value(self.classVar, val), probs)

No surprises in the constructor, except for the list comprehension that constructs classVar.values.

In the call operator, the first line uses an implicit conversion of values to integers. When ex[self.var1], which is of type orange.Value, is multiplied by noValues2, which is an integer, the former is converted to an integer. The same happens at the addition.

val is an index of the value to be returned. What follows is the usual procedure for constructing a correct return type for a classifier - you will often do something very similar in your classifiers.

cb-classifier.py (uses lenses.tab)

>>> tt = CartesianClassifier(data.domain[2], data.domain[3])
>>> for i in range(6):
...     print "%s ---> %s" % (data[i], tt(data[i]))
...
['young', 'myope', 'no', 'reduced', 'none'] ---> young-myope
['young', 'myope', 'no', 'normal', 'soft'] ---> young-myope
['young', 'myope', 'yes', 'reduced', 'none'] ---> young-myope
['young', 'myope', 'yes', 'normal', 'hard'] ---> young-myope
['young', 'hypermetrope', 'no', 'reduced', 'none'] ---> young-hypermetrope
['young', 'hypermetrope', 'no', 'normal', 'soft'] ---> young-hypermetrope

Learner

ClassifierByLookupTable is a classifier whose predictions are based on the value of a single attribute. It contains a simple table named lookupTable for conversion from attribute value to class prediction. The last element of the table is the value that is returned when the attribute value is unknown or out of range. Similarly, distributions is a list of distributions, used when ClassifierByLookupTable is used to predict a distribution.

Let us write a learner which chooses an attribute using a specified measure of quality and constructs a ClassifierByLookupTable that would use this single attribute for making predictions.

cb-learner.py (uses lenses.tab)

class OneAttributeLearner(orange.Learner):
    def __init__(self, measure):
        self.measure = measure

    def __call__(self, gen, weightID=0):
        selectBest = orngMisc.BestOnTheFly()
        for attr in gen.domain.attributes:
            selectBest.candidate(self.measure(attr, gen, None, weightID))
        bestAttr = gen.domain.attributes[selectBest.winnerIndex()]
        classifier = orange.ClassifierByLookupTable(gen.domain.classVar, bestAttr)

        contingency = orange.ContingencyAttrClass(bestAttr, gen, weightID)
        for i in range(len(contingency)):
            classifier.lookupTable[i] = contingency[i].modus()
            classifier.distributions[i] = contingency[i]
        classifier.lookupTable[-1] = contingency.innerDistribution.modus()
        classifier.distributions[-1] = contingency.innerDistribution
        for d in classifier.distributions:
            d.normalize()

        return classifier

The constructor stores the measure to be used for choosing the attribute. The call operator assesses the qualities of the attributes and feeds them to orngMisc.BestOnTheFly. This is a simple class with a method candidate to which we feed some objects, and a method winnerIndex that returns the index of the greatest of the "candidates" (there's also a method winner that returns the winner itself, but we cannot use it here). The benefit of using BestOnTheFly is that it is fair: if there is more than one winner, it returns a random one, not the first or the last (however, if you call winnerIndex repeatedly without adding any (winning) candidates, it will keep returning the same winner).
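To illustrate, here is roughly how BestOnTheFly behaves on plain numbers; this is only a sketch based on the description above, using just the methods candidate, winnerIndex and winner:

import orngMisc

selectBest = orngMisc.BestOnTheFly()
for quality in [0.2, 0.8, 0.8, 0.5]:
    selectBest.candidate(quality)
print selectBest.winnerIndex()   # 1 or 2, chosen fairly among the tied winners
print selectBest.winner()        # 0.8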

The chosen attribute is stored in bestAttr. A ClassifierByLookupTable is constructed next.

We then need to fill the lookupTable and distributions. For this, we construct a contingency matrix of type ContingencyAttrClass that has the given attribute as the outer and the class as the inner attribute. Thus, contingency[i] gives the distribution of classes for the i-th value of the attribute. We then iterate through the contingency to find the most probable class for each value of the attribute (obtained as the modus of the distribution). When predicting class probabilities, our classifier will return normalized distributions.

When the value of the attribute is unknown or out of range, the classifier will return the most probable class and the apriori class distribution; these can be found in the inner distribution of the contingency.

cb-learner.py (uses lenses.tab)

>>> oal = OneAttributeLearner(orange.MeasureAttribute_gainRatio())
>>> c = oal(data)
>>> c.variable
EnumVariable 'tear_rate'
>>> c.variable.values
<reduced, normal>
>>> print c.lookupTable
<none, soft, none>
>>> print c.distributions
<<1.000, 0.000, 0.000>, <0.250, 0.417, 0.333>, <0.625, 0.208, 0.167>>

When trained on the 'lenses' data, our learner chose the attribute 'tear_rate'. When its value is 'reduced', the predicted class is 'none' and the distribution shows that the classifier is pretty sure about it (100%). When the value is 'normal', the predicted class is 'soft', but with much less certainty (42%). When the value is unknown or out of range (for example, if the user adds new values), the classifier will predict the class 'none' with 62.5% certainty.

ExamplesDistance and ExamplesDistance_Constructor

ExamplesDistance_Constructor receives four arguments: an example generator, a weights meta id, domain distributions (of type DomainDistributions) and basic attribute statistics (an instance of DomainBasicAttrStat). The latter two can be None; you should write your code so that it computes them from the examples itself if it needs them. The function should return an instance of ExamplesDistance.

ExamplesDistance gets two examples and should return a number representing the distance between them.
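For a concrete, if simplistic, illustration, here is a sketch of a Hamming-like distance. The class names are ours, and we assume, as elsewhere on this page, that all attributes are discrete:

class HammingDistance(orange.ExamplesDistance):
    def __call__(self, e1, e2):
        # count the attributes on which the two examples differ
        dist = 0.0
        for i in range(len(e1.domain.attributes)):
            if e1[i] != e2[i]:
                dist += 1
        return dist

class HammingDistance_Constructor(orange.ExamplesDistance_Constructor):
    def __call__(self, gen, weightID, distributions, bstat):
        # neither the distributions nor the basic statistics are needed here
        return HammingDistance()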

MeasureAttribute

MeasureAttribute is slightly more complex since it can be given different sets of parameters. The class defines the way it will be called by setting the "needs" field (see the documentation on attribute evaluation for more details). (Note: this has been changed from the mess we had in the past. Any existing code should still work, or will need to be simplified if it does not.)

__call__(attributeNumber, domainContingency, aprioriProbabilities)
These arguments are sent if needs is set to orange.MeasureAttribute.DomainContingency. The data from which the attribute is to be evaluated is given by the contingencies of all attributes in the dataset. attributeNumber tells which of those attributes the function needs to evaluate. Finally, there are the apriori class probabilities, if the method can make use of them; the third argument can sometimes be None.
__call__(contingencyMatrix, classDistribution, aprioriProbabilities)
In this form, which is used if needs equals orange.MeasureAttribute.Contingency_Class, you are given a class distribution and the contingency matrix for the attribute that is to be evaluated. In the context of decision tree induction, these are the class distribution in a node and the class distributions in the branches if this attribute is chosen. The third argument again gives the apriori class distribution, and can be None if the apriori distribution is unknown.
__call__(attribute, examples, aprioriProbabilities, weightID)
This form is used if needs is orange.MeasureAttribute.Generator. The attribute can be given as an instance of int or of Variable - you might want to check the argument type before using it.

In all cases, the method must return a real number representing the quality of the attribute; higher numbers mean better attributes. If higher values mean worse attributes in your measure of quality, you can negate or invert the number.

As an example, we will write a measure based on the cardinality of attributes. It will also have a flag by which the user decides whether attributes with higher or with lower cardinality are preferred.

cb-measureattribute.py (uses lenses.tab)

class MeasureAttribute_Cardinality(orange.MeasureAttribute):
    def __init__(self, moreIsBetter = 1):
        self.moreIsBetter = moreIsBetter

    def __call__(self, a1, a2, a3):
        if type(a1) == int:
            attrNo, domainContingency, apriorClass = a1, a2, a3
            q = len(domainContingency[attrNo])
        else:
            contingency, classDistribution, apriorClass = a1, a2, a3
            q = len(contingency)
        if self.moreIsBetter:
            return q
        else:
            return -q

Alternatively, we can write the measure in the form of a function, but without the flag. To make it shorter, we will skip the fancy renaming of the parameters.

cb-measureattribute.py (uses lenses.tab)

def measure_cardinality(a1, a2, a3):
    if type(a1) == int:
        return len(a2[a1])
    else:
        return len(a1)

To test the class and the function we shall induce a decision tree using the specified measure.

cb-measureattribute.py (uses lenses.tab)

treeLearner = orange.TreeLearner()
treeLearner.split = orange.TreeSplitConstructor_Attribute()
treeLearner.split.measure = MeasureAttribute_Cardinality(1)
tree = treeLearner(data)
orngTree.printModel(tree)

There are three two-valued attributes and one three-valued attribute. If we set moreIsBetter to 1, as above, the attribute in the root of the tree will be the three-valued age, while the attributes for the rest of the tree are chosen at random (the remaining three are tied). If we set it to 0, the attribute age is used only when the values of all remaining attributes have been checked.

To use the function measure_cardinality we don't need to wrap it into anything. If we simply set

treeLearner.split = orange.TreeSplitConstructor_Attribute()
treeLearner.split.measure = measure_cardinality

the function is automatically wrapped.

TransformValue

TransformValue is a simple class whose call operator gets a Value and returns another (or the same) Value. An example of its use is given in the page about classifiers from attribute.
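For a quick illustration, here is a minimal sketch of our own (not the example from the page referred to above): a transformation that negates a continuous value and passes unknown values through unchanged:

class Negate(orange.TransformValue):
    def __call__(self, val):
        # assumes a continuous value; unknowns are returned as they are
        if val.isSpecial():
            return val
        return orange.Value(-float(val))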

TreeSplitConstructor

The usual tree split constructors choose an attribute on which the split will be based and construct a ClassifierFromVarFD that returns the chosen attribute's value. But split constructors are capable of much more. To demonstrate this, we will write a split constructor that constructs a split based on the values of two attributes, joined in a Cartesian product. We will utilize the CartesianClassifier we've already written above.

cb-splitconstructor.py (uses lenses.tab)

class SplitConstructor_CartesianMeasure(orange.TreeSplitConstructor):
    def __init__(self, measure):
        self.measure = measure

    def __call__(self, gen, weightID, contingencies, apriori, candidates):
        attributes = gen.domain.attributes
        selectBest = orngMisc.BestOnTheFly(orngMisc.compare2_firstBigger)
        for var1, var2 in orange.SubsetsGenerator_constSize(2, attributes):
            if candidates[attributes.index(var1)] and candidates[attributes.index(var2)]:
                cc = CartesianClassifier(var1, var2)
                cc.classVar.getValueFrom = cc
                meas = self.measure(cc.classVar, gen)
                selectBest.candidate((meas, cc))

        if not selectBest.best:
            return None

        bestMeas, bestSelector = selectBest.winner()
        return (bestSelector, bestSelector.classVar.values, None, bestMeas)

We again use the class BestOnTheFly from the orngMisc module. This time we give it a compare function, orngMisc.compare2_firstBigger, which compares the first elements of the objects, since we will feed it tuples of (split quality, selector). The best selector and its quality are retrieved by the method winner.

The class orange.SubsetsGenerator_constSize is used to generate pairs of attributes. For each pair, we check that both attributes are among the candidates.

Now comes the tricky business. We construct a CartesianClassifier to compute a Cartesian product of the two attributes. CartesianClassifier's constructor prepares a new attribute, which is stored in its classVar. The quality of the split needs to be determined as the quality of this attribute, as measured by self.measure; "meas = self.measure(cc.classVar, gen)" does the job. The problem is that the given examples (gen) do not have the attribute cc.classVar.

Not all, but many of Orange's methods act like this: when asked to do something with an attribute that does not exist in the given domain, they try to compute its value from the attributes that are available. More precisely, the attribute needs to have a pointer to a classifier that is able to compute its value. In our case, we set cc.classVar's field getValueFrom to cc.

When self.measure notices that the attribute cc.classVar does not exist in domain gen.domain, it will use cc.classVar.getValueFrom to compute its values on the fly.
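Spelled out on its own, the trick looks like this (a sketch reusing the CartesianClassifier defined earlier on this page):

cc = CartesianClassifier(data.domain[0], data.domain[1])
cc.classVar.getValueFrom = cc
# the measure computes the values of cc.classVar on the fly, through cc
print orange.MeasureAttribute_gainRatio()(cc.classVar, data)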

If you don't understand the last few paragraphs, here's a short summary: using some magic, we construct a classifier that can be used as a split criterion (the node's branchSelector), assess its quality and show it to selectBest. More about this is written in the documentation on attribute descriptors.

When the loop ends, we return None if no split was found (possibly because there were not enough candidates). Otherwise, we retrieve the winning quality and selector, and return a tuple consisting of

  • the branchSelector; a classifier that returns a value computed from the two used attributes;
  • the descriptions of branches; the values of the constructed attribute fit this purpose well;
  • the numbers of examples in branches; we don't have this available, so we'll let the TreeLearner find it itself;
  • the quality of the split; the measured quality will do;
  • the index of the spent attribute; we've spent two, but can return only a single number, so we act as if we've spent none - we simply omit the index.

The code can be tested with the following script.

cb-splitconstructor.py (uses lenses.tab)

treeLearner = orange.TreeLearner()
treeLearner.split = SplitConstructor_CartesianMeasure(orange.MeasureAttribute_gainRatio())
tree = treeLearner(data)
orngTree.printTxt(tree)

TreeStopCriteria

TreeStopCriteria is a simple class. Its arguments are examples, the id of the meta-attribute with weights (or 0 if the examples are not weighted) and a DomainContingency. The induction stops if TreeStopCriteria returns 1 (or anything that counts as "true" in Python). The class is peculiar in being the only non-abstract class whose call operator can be (really) overloaded. Thus, it is possible to call the inherited call operator - and, what's more, you should do so.

For a brief example, let us write a stop criterion that calls the common stop criteria but, besides that, stops the induction at random in 20% of the cases.

part of cb-stopcriteria.py (uses lenses.tab)

from random import randint

defStop = orange.TreeStopCriteria()
treeLearner = orange.TreeLearner()
treeLearner.stop = lambda e, w, c: defStop(e, w, c) or randint(1, 5)==1

We've defined a default stop criterion defStop to avoid constructing it at each call of our function. The whole stopping criterion is hidden in the lambda function, which stops when the default says so or when a random number between 1 and 5 equals 1.

To demonstrate a call of the inherited call operator, let us do the same thing by deriving a new class.

part of cb-stopcriteria.py (uses lenses.tab)

class StoppingCriterion_random(orange.TreeStopCriteria):
    def __call__(self, gen, weightID, contingency):
        return orange.TreeStopCriteria.__call__(self, gen, weightID, contingency) \
               or randint(1, 5)==1

treeLearner.stop = StoppingCriterion_random()

TreeExampleSplitter

The example splitter's task is to split a list of examples (usually an ExamplePointerTable or an ExampleTable) into subsets and return them as an ExampleGeneratorList. The arguments it gets are a TreeNode (it will need at least its branchSelector; some splitters also use branchSizes), a list of examples, and the id of the meta-attribute with example weights (or 0 if the examples are not weighted).

If some examples are split among the branches so that only a part of an example belongs to a branch, the splitter should construct new weight meta-attributes and fill them with the example weights. A list of weight ids should then be returned in a tuple with the ExampleGeneratorList. The exact mechanics of this are given on the page describing tree induction.
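For illustration, here is a sketch of a minimal splitter that sidesteps the weighting issue altogether: it sorts the examples into branches and silently drops those whose branch cannot be determined. We assume that a plain Python list of ExampleTables is converted to an ExampleGeneratorList just as a list of strings was converted to a StringList above:

class Splitter_Simple(orange.TreeExampleSplitter):
    def __call__(self, node, examples, weightID):
        # one (initially empty) subset of examples per branch
        subsets = [orange.ExampleTable(examples.domain)
                   for b in node.branches]
        for ex in examples:
            branch = node.branchSelector(ex)
            if not branch.isSpecial() and int(branch) < len(node.branches):
                subsets[int(branch)].append(ex)
        return subsets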

TreeDescender

Descenders are about the trickiest components of Orange's trees. They get two arguments, a starting node (not necessarily the tree's root) and an example, and return a tuple with the finishing node and, optionally, a discrete distribution.

If there's no distribution, the TreeClassifier (which usually calls the descender) will use the returned node to classify the example. The node's nodeClassifier will thus probably need to be defined (unless you've patched the TreeClassifier or written your own version of it).

If a distribution is returned, the branches below the returned node will vote on the example's class, and the distribution represents the weights of the votes for the individual branches. Voting requires additional calls of the descender, but that's something the TreeClassifier needs to worry about.

The descender's real job is to decide what should happen when the descent halts because a branch for the example cannot be determined. It can either return the node (so it will be used to classify the example without looking any further), silently decide for some branch, or request a vote.

A general descender looks like this:

class MyDescender(orange.TreeDescender):
    def __call__(self, node, example):
        while node.branchSelector:
            branch = node.branchSelector(example)
            if branch.isSpecial() or int(branch)>=len(node.branches):
                < do something >
            nextNode = node.branches[int(branch)]
            if not nextNode:
                break
            node = nextNode
        return node

Descenders descend until they reach a node with no branchSelector - a leaf. They call each node's branchSelector to find the branch to follow. If the value is defined, they check whether the node below is a null-node. If this is so, they act as if the current node is a leaf.

Descenders differ in what they do when the branch index is unknown or out of range.

In this section, we will suppose that the tree has already been induced (using, say, default settings for TreeLearner) and stored in a TreeClassifier tree.

>>> tree = orange.TreeLearner(data)
>>> orngTree.printTxt(tree)
tear_rate=reduced: none (100.00%)
tear_rate=normal
|    astigmatic=no
|    |    age=young: soft (100.00%)
|    |    age=pre-presbyopic: soft (100.00%)
|    |    age=presbyopic: none (50.00%)
|    astigmatic=yes
|    |    prescription=myope: hard (100.00%)
|    |    prescription=hypermetrope: none (66.67%)
Continuing the descent

For the first exercise, we will implement a descender that decides for a random branch when the descent stops. The decision will be completely random; it will ignore any probabilities that might be computed from branchSizes or from the values of the example's other attributes.

part of cb-descender.py (uses lenses.tab)

class Descender_RandomBranch(orange.TreeDescender):
    def __call__(self, node, example):
        while node.branchSelector:
            branch = node.branchSelector(example)
            if branch.isSpecial() or int(branch)>=len(node.branches):
                branch = orange.Value(node.branchSelector.classVar,
                                      randint(0, len(node.branches)-1))
                print "Descender decides for", int(branch)
            nextNode = node.branches[int(branch)]
            if not nextNode:
                break
            node = nextNode
        return node

Everything goes according to the above template. When the branchSelector does not return a (valid) branch, we select a random branch (and print it out for debugging purposes).

To see how it works, we'll take the fourth example from the table and remove the value of the attribute needed at the root of the tree.

part of cb-descender.py (uses lenses.tab)

>>> ex = orange.Example(data.domain, list(data[3]))
>>> ex[tree.tree.branchSelector.classVar] = "?"
>>> print ex
['young', 'myope', 'yes', '?', 'hard']

We'll now tell the classifier to use our descender and classify the example; we'll call the classifier three times.

part of cb-descender.py (uses lenses.tab)

>>> tree.descender = Descender_RandomBranch()
>>> for i in range(3):
...     print tree(ex)
...
Descender decides for 1
hard
Descender decides for 1
hard
Descender decides for 0
none

When the descender decides for the second branch (branch 1), astigmatism and age are checked and the example is classified as "hard". When the descender takes the first branch (branch 0), the classifier returns "none".

Voting

Our next descender will request a vote. It will, however, disregard any known probabilities and assign random weights to the branches.

part of cb-descender.py (uses lenses.tab)

class Descender_RandomVote(orange.TreeDescender):
    def __call__(self, node, example):
        while node.branchSelector:
            branch = node.branchSelector(example)
            if branch.isSpecial() or int(branch)>=len(node.branches):
                votes = orange.DiscDistribution(
                    [randint(0, 100) for i in node.branches])
                votes.normalize()
                print "Weights:", votes
                return node, votes
            nextNode = node.branches[int(branch)]
            if not nextNode:
                break
            node = nextNode
        return node

In the first interesting line we construct a discrete distribution with random integers between 0 and 100, one for each branch. We normalize it and return the current node and the weights of votes. It's as simple as that.

We'll check the descender on the same example as above.

>>> tree.descender = Descender_RandomVote()
>>> print tree(ex, orange.GetProbabilities)
Weights: <0.338, 0.662>
<0.338, 0.000, 0.662>

The first output line gives the weights of the branches - 0.338 for the first one and 0.662 for the second - which is reflected in the final answer.

A Reporting Descender

As the last example, here's a handy descender that prints out the descriptions of branches on the way. When branchSelector does not return (a valid) branch, it simply returns the current node, as if it was a leaf (you can change this if you want to).

part of cb-descender.py (uses lenses.tab)

class Descender_Report(orange.TreeDescender):
    def __call__(self, node, example):
        print "Descent: root ",
        while node.branchSelector:
            branch = node.branchSelector(example)
            if branch.isSpecial() or int(branch)>=len(node.branches):
                break
            nextNode = node.branches[int(branch)]
            if not nextNode:
                break
            print ">> %s = %s" % (node.branchSelector.classVar.name,
                                  node.branchDescriptions[int(branch)]),
            node = nextNode
        print
        return node

We'll test it on the second example from the table (without removing any values).

>>> tree.descender = Descender_Report()
>>> print "Classifying example", data[1]
Classifying example ['young', 'myope', 'no', 'normal', 'soft']
>>> print "----> %s" % tree(data[1])
Descent: root  >> tear_rate = normal >> astigmatic = no >> age = young
----> soft
1 Why "in principle"? The main reason is that the Python-to-Orange interface is so big that no one, at least not the principle authors of Orange, are ready to program and maintain another such interface. The other reason is that we've committed a small sin regarding independency; at certain point we stopped developing our own garbage collection system and now use the Python's instead. Getting independency from Python would mean rewriting the garbage collection which is something we'd prefer not to have to do.

[2] For those who are familiar with C++ terms but not with Python's API: each Python object has a pointer to a type description - the object that you get when calling Python's built-in function type(). The type description gives the name of the type, the memory size of its instances, several flags and pointers to the functions that the object provides. This is somewhat similar to a C++ table of virtual methods, except that here the methods are defined in advance and have fixed positions. For example, the tp_compare pointer points to the function that should be called when this object is to be compared with another; if it is NULL, the object does not support comparison.

[3] md has, in effect, two tables of virtual methods (vfptr), one for Python and one for C++. When Python calls it, it uses the __call__ you defined; when C++ calls it, it calls the function defined in C++.

[4] There's a case when the intermediate class is revealed, a TreeStopCriteria_Python; this is needed because the class TreeStopCriteria is not abstract, but we won't discuss the details here.