This page describes how to subtype Orange's classes in Python and make the overloaded methods callable from C++ components.
Since Orange only has an interface to Python but is otherwise independent of it, subtyping might sometimes not work as you would expect. If you subtype an Orange class in Python and overload some methods, the C++ code (say, if you use your instance as a component of some Orange object) will call the C++ method and not the one you provided in Python.
Exceptions to that are Filter, Learner, Classifier, TreeSplitConstructor, TreeStopCriteria, TreeExampleSplitter, TreeDescender, MeasureAttribute, TransformValue, ExamplesDistance and ExamplesDistance_Constructor. If you subtype one of these classes and overload its call operator, your operator will get called from the C++ code. If you subclass any other class or overload any other method, the C++ code won't know about it.
If your subclass will only be called from Python code and never from C++, you can subclass anything you want and it will work as it should. If you are satisfied with that, you can skip to the examples on this page to learn how to subclass what you need. If you wonder why it has to be like this, read on.
Orange was first conceived as a C++ library of machine learning components. It was only after several years of development that Python was first used as a glue language. But even after being interfaced to Python, Orange still maintains its independence. It would, in principle, be possible to export Orange's components as, for example, COM objects that wouldn't require Python to run[1]. Orange components are not aware that they are being called from Python. Even more, they are not aware that they are being exposed to Python at all.
This becomes important when subtyping the components. Let's say you derive a Python class MyDomain from Domain. We would like to redefine the call operator, which is used to convert an example from another domain. Our operator uses the original Domain to convert the example, but afterwards sets the class to unknown.
cb-mydomain.py (uses lenses.tab)
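A minimal sketch of such a subclass, following the description above (the setclass call and "?" as the unknown value are assumptions):

    import orange

    class MyDomain(orange.Domain):
        def __call__(self, example):
            # let the original Domain do the conversion...
            ex = orange.Domain.__call__(self, example)
            # ...then set the class to unknown ("?" as unknown is an assumption)
            ex.setclass("?")
            return ex

    data = orange.ExampleTable("lenses")
    md = MyDomain(data.domain)
    print md(data[0])   # called from Python: the class is unknown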
Subtyping built-in classes in Python is technically complex. When you call MyDomain, Domain's constructor is called. It constructs a C++ instance of Domain, but when returning it to Python, it marks it as an instance of MyDomain[2]. So md's memory representation is the same as that of ordinary instances of orange.Domain, but its type, as known to Python, is MyDomain. When the Python interpreter calls md, it treats it as a MyDomain and correctly calls the method we defined above.
Not so with C++. The C++ code knows nothing about Python wrappers, types and overloaded methods. For C++, md is an ordinary instance of Domain. So, what happens when the C++ code tries to call it? To check this, we will convert an example in another way: we'll call Example's constructor. It can be given different arguments; if you provide a domain and an existing example, a new example is constructed by converting the existing one into the specified domain. The domain is called internally to perform the actual conversion.
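A sketch of the test, assuming md and data from the previous snippet:

    # C++ calls md's call operator internally - our Python __call__ is bypassed
    ex = orange.Example(md, data[0])
    print ex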
The class is still 'none', not unknown.[3]
The obvious solution to the problem would be to make Orange's components "Python aware". This would be extremely slow: at each call, even at calls of operator[], for example, the C++ code would have to check whether the corresponding operator has been overloaded in Python.
The solution we preferred is efficient, yet limited. The following fragment is a fully functional filter, derived from orange.Filter.
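The fragment may have looked like this (a sketch; that the overloaded __call__ receives just the example and returns a truth value is an assumption based on the surrounding text):

    import orange

    class FilterYoung(orange.Filter):
        def __call__(self, ex):
            # accept only examples whose age is "young"
            return ex["age"] == "young"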
You can, for instance, use it as an argument for the ExampleTable.filter method.
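Usage might look like this (a sketch):

    data = orange.ExampleTable("lenses")
    fy = FilterYoung()
    youngTable = data.filter(fy)
    for ex in youngTable:
        print ex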
orange.Filter is an abstract class; you cannot construct it, e.g. by calling Filter(). But when Filter's constructor is called to construct an instance of FilterYoung, such as fy, it constructs an instance of a special callback class, Filter_Python (although Orange doesn't admit it - when returning the object to Python, it says it's a FilterYoung and that its base class is Filter[4]).
Filter_Python's call operator (written in C++ and thus seen and respected by C++ code) calls back the overloaded __call__ written in Python.
If it works for orange.Filter - why does it fail for orange.Domain? It simply hasn't been programmed. Only the call operators of the classes listed at the beginning of this page can do this. Why only those? A rough estimate is that making all the methods of the existing 300 Orange classes overloadable would inflate Orange's source code to three times its size! On the other hand, the chosen classes are those that are most likely to be overloaded. For the others, you can either find another solution or ask us to make them overloadable as well; adding this functionality to a single method of a single class is a small undertaking.
You might be tempted to overload the call operator of a class that is derived from Filter, for instance Filter_index.
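A sketch of such an attempt:

    class MyFilter(orange.Filter_index):
        def __call__(self, ex):
            # this override will never be seen by the C++ code
            return ex["age"] == "young"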
This will fail just as it did for Domain. The class Filter_Python is derived from Filter and can only inherit its functionality. In the previous example, Filter_Python was a hidden class between FilterYoung and orange.Filter. In this example, you would need a class between MyFilter and orange.Filter_index, and this role obviously cannot be played by Filter_Python.
How to do it then? The simplest way is to wrap a Filter_index, like this.
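A sketch of such a wrapper:

    class MyFilter(orange.Filter):
        def __init__(self, *args):
            # keep a private Filter_index and delegate to it
            self.wrappedFilter = orange.Filter_index(*args)

        def __call__(self, ex):
            # room for any additional logic around the wrapped filter
            return self.wrappedFilter(ex)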
MyFilter is now derived from orange.Filter but still has its own copy of a Filter_index. Not pretty, but it works.
There is another nice way to construct your own filters (and, in general, other overloadable components). You don't need to derive a new class; you can seemingly construct an instance of an abstract class, giving a callback function as an argument.
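For instance (a sketch; that the callback receives just the example is an assumption):

    fy = orange.Filter(lambda ex: ex["age"] == "young")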
Filter's constructor is called directly, as if to construct an instance of Filter, which it would usually refuse since the class is abstract. But when given a function as an argument (here we used a lambda function, but you can, of course, use ordinary functions as well), it constructs a Filter_Python and stores the given function in its dictionary. When the Filter_Python is called, it calls the function that was passed to its constructor.
There is another twist. Sometimes you don't need to wrap the function into a class at all. You can, for example, construct a tree's nodeLearner on the fly. nodeLearner should be an instance of (something derived from) Learner, but you can assign it a callback function.
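For instance (a sketch; the callback plays the role of a learner, receiving examples and a weight id and returning a classifier):

    treeLearner = orange.TreeLearner()
    treeLearner.nodeLearner = lambda gen, weightID: orange.MajorityLearner(gen, weightID)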
This replaces the nodeLearner with a learner that calls orange.MajorityLearner. The example is artificial: the MajorityLearner could be given directly, with treeLearner.nodeLearner = orange.MajorityLearner, and besides, this is the default anyway. But how can we assign a Python function to nodeLearner, which can only hold a pure C++ Learner, not a Python function? And how can we then even expect TreeLearner (written in C++) to call it? Checking what the above snippet actually stored in treeLearner.nodeLearner is revealing.
When Orange assigns a value to a component's attribute, like treeLearner.nodeLearner, it tries to convert the given argument to the correct class, a Learner in this case. If the user actually gives an instance of (something derived from) Learner, that's great. Otherwise, Orange calls Learner's constructor with the given object as an argument. If the constructor can use it to construct an object, everything is OK. The above assignment is thus equivalent to
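    # a sketch of the equivalent explicit wrapping
    treeLearner.nodeLearner = orange.Learner(
        lambda gen, weightID: orange.MajorityLearner(gen, weightID))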
which, as we know from the last example with Filter, works as intended.
You might have used this feature before without knowing it. Have you ever constructed an EnumVariable and assigned some values to its field values? The field values is stored as an orange.StringList, a pure C++ vector<string>, not the Python list of strings that you provided. The same thing as with nodeLearner happens here: since you tried to assign a Python list instead of a StringList, StringList's constructor was called with the Python list as an argument.
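For example (a sketch):

    v = orange.EnumVariable("age")
    v.values = ["young", "pre-presbyopic", "presbyopic"]   # a Python list...
    print type(v.values)                                   # ...stored as an orange.StringList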
A final piece of advice for using derived classes: don't wonder too much about how it works. Just be happy that it does. Use it, try things that you think might work, but be sure to check (a simple print here and there will suffice) that your call operators are actually called. That's all you need to care about.
All classes for which you can (really) overload the call operator are abstract, with the single exception of TreeStopCriteria, so this is the only class for which calling the inherited call operator makes sense; the Examples section shows how to do it. For all other classes, calling the inherited method is an error. Similarly, forgetting to define the call operator but trying to use it leads to a call of the inherited operator - an error again.
The examples below suppose that you have loaded the 'lenses' data into a variable data. The examples are somewhat simplified: for instance, many of the classes below silently assume that the attributes they deal with are discrete. This is to make the code clearer.
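Loading the data might look like this:

    import orange
    data = orange.ExampleTable("lenses")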
We've already shown how to derive filters. A filter is a simple object that decides whether a given example is "acceptable" or not. The class FilterYoung above accepts examples for which the value of "age" is "young".
Such a filter can be used, for instance, for selecting examples from an example table, as we did with data.filter above.
Note two things. First, you don't need to write your own filters to select examples based on attribute values; you'd get the same effect with something like
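    # a sketch; that select accepts keyword arguments naming attribute values
    # is an assumption
    youngTable = data.select(age="young")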
Second, you don't need to derive a class from a filter when a function would suffice. You can write either
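    # a plain function used as a filter (a sketch)
    def filterYoung(ex):
        return ex["age"] == "young"

    fy = orange.Filter(filterYoung)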
or, for cases as simple as this, squeeze the whole thing into a lambda function
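    youngTable = data.filter(orange.Filter(lambda ex: ex["age"] == "young"))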
A "classifier" in Orange has a rather non-standard meaning. A classifier is an object with a call operator that gets an example and returns a value, a distribution of values or both - the return type is regulated by an optional second argument. Beside the standard use of classifiers - "class predictors" - this also covers predictors in regression, objects used in constructive induction (which use some of example's attributes to compute a value of a new attribute), and others.
For this tutorial, we will define a classifier that can be used for simple constructive induction. Its constructor will accept two attributes and construct a new attribute as the Cartesian product of the two. Its name and the names of its values will be constructed from the pairs of names of the original attributes. The call operator will return the value of the new attribute that corresponds to the values the two attributes have in the example.
cb-classifier.py (uses lenses.tab)
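A sketch of what the classifier may have looked like, reconstructed from the description below (the name noValues2 follows the text; the return-type handling assumes the usual orange.GetValue/GetProbabilities/GetBoth convention):

    import orange

    class CartesianClassifier(orange.Classifier):
        def __init__(self, var1, var2):
            self.var1, self.var2 = var1, var2
            self.noValues2 = len(var2.values)
            # the new attribute; its values are the Cartesian product of the two
            self.classVar = orange.EnumVariable(
                "%sx%s" % (var1.name, var2.name),
                values=["%s-%s" % (v1, v2)
                        for v1 in var1.values for v2 in var2.values])

        def __call__(self, ex, what=orange.GetValue):
            # implicit conversion of Values to integers gives the value's index
            val = int(ex[self.var1] * self.noValues2 + ex[self.var2])
            if what == orange.GetValue:
                return orange.Value(self.classVar, val)
            # a distribution that puts all the probability on the computed value
            probs = orange.DiscDistribution(self.classVar)
            probs[val] = 1.0
            if what == orange.GetProbabilities:
                return probs
            return (orange.Value(self.classVar, val), probs)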
There are no surprises in the constructor, except for a trick used to construct classVar.values.
In the call operator, the first line uses an implicit conversion of values to integers: when ex[self.var1], which is of type orange.Value, is multiplied by noValues2, which is an integer, the former is converted to an integer. The same happens at the addition. val is thus the index of the value to be returned. What follows is the usual procedure for constructing the correct return type for a classifier - you will often do something very similar in your own classifiers.
cb-classifier.py (uses lenses.tab)
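Testing it might look like this (a sketch; the attribute names are those of the lenses data):

    data = orange.ExampleTable("lenses")
    cc = CartesianClassifier(data.domain["age"], data.domain["astigmatic"])
    for ex in data[:5]:
        print ex[cc.var1], ex[cc.var2], "-->", cc(ex)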
ClassifierByLookupTable is a classifier whose predictions are based on the value of a single attribute. It contains a simple table named lookupTable for the conversion from attribute values to class predictions; the last element of the table is the value returned when the attribute value is unknown or out of range. Similarly, distributions is a list of distributions, used when the ClassifierByLookupTable is asked to predict a distribution. Let us write a learner which chooses an attribute using a specified measure of quality and constructs a ClassifierByLookupTable that uses this single attribute for making predictions.
cb-learner.py (uses lenses.tab)
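A sketch of the learner, reconstructed from the description below (the BestOnTheFly methods candidate and winnerIndex follow the text; the remaining calls are assumptions):

    import orange, orngMisc

    class OneAttributeLearner(orange.Learner):
        def __init__(self, measure):
            self.measure = measure

        def __call__(self, gen, weightID=0):
            # pick the best attribute according to the given measure
            selectBest = orngMisc.BestOnTheFly()
            for attr in gen.domain.attributes:
                selectBest.candidate(self.measure(attr, gen))
            bestAttr = gen.domain.attributes[selectBest.winnerIndex()]

            classifier = orange.ClassifierByLookupTable(gen.domain.classVar, bestAttr)

            # contingency with the attribute as the outer and the class as the
            # inner variable: contingency[i] is the class distribution for the
            # i-th value of the attribute
            contingency = orange.ContingencyAttrClass(bestAttr, gen, weightID)
            for i in range(len(contingency)):
                classifier.lookupTable[i] = contingency[i].modus()
                dist = orange.DiscDistribution(contingency[i])  # a copy (assumption)
                dist.normalize()
                classifier.distributions[i] = dist

            # the last elements are used for unknown/out-of-range values
            classifier.lookupTable[-1] = contingency.innerDistribution.modus()
            apriori = orange.DiscDistribution(contingency.innerDistribution)
            apriori.normalize()
            classifier.distributions[-1] = apriori

            return classifier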
The constructor stores the measure to be used for choosing the attribute. The call operator assesses the qualities of the attributes and feeds them to orngMisc.BestOnTheFly. This is a simple class with a method candidate, to which we feed some objects, and winnerIndex, which tells the index of the greatest of the "candidates" (there is also a method winner that returns the winner itself, but we cannot use it here). The benefit of using BestOnTheFly is that it is fair: in case there is more than one winner, it returns a random one, not the first or the last (however, if you call winnerIndex repeatedly without adding any (winning) candidates, it will keep returning the same winner).
The chosen attribute is stored in bestAttr, and a ClassifierByLookupTable is constructed next. We then need to fill lookupTable and distributions. For this, we construct a contingency matrix of type ContingencyAttrClass that has the given attribute as the outer and the class as the inner attribute; thus, contingency[i] gives the distribution of classes for the i-th value of the attribute. We then iterate through the contingency to find the most probable class for each value of the attribute (obtained as the modus of the distribution). When predicting probabilities of classes, our classifier will return normalized distributions. When the value of the attribute is unknown or out of range, it will return the most probable class and the apriori class distribution; the latter can be found as the inner distribution of the contingency.
cb-learner.py (uses lenses.tab)
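And the test (a sketch; that ClassifierByLookupTable stores the chosen attribute in a field named variable is an assumption):

    learner = OneAttributeLearner(orange.MeasureAttribute_gainRatio())
    classifier = learner(data)
    print classifier.variable
    print classifier.lookupTable
    print classifier.distributions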
When trained on the 'lenses' data, our learner chooses the attribute 'tear_rate'. When its value is 'reduced', the predicted class is 'none' and the distribution shows that the classifier is pretty sure about it (100%). When the value is 'normal', the predicted class is 'soft', but with much less certainty (42%). When the value is unknown or out of range (for example, if the user adds some values), the classifier predicts the class 'none' with 62.5% certainty.
ExamplesDistance_Constructor receives four arguments: an example generator, the id of the meta-attribute with weights, domain distributions (of type DomainDistributions) and basic attribute statistics (an instance of DomainBasicAttrStat). The latter two can be None; you should write your code so that it computes them from the examples itself if they are needed. The function should return an instance of ExamplesDistance.
ExamplesDistance gets two examples and should return a number representing the distance between them.
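For illustration, a sketch of a pair of such classes computing a Hamming-like distance (the call signatures follow the description above; iterating over examples as sequences is an assumption):

    import orange

    class HammingConstructor(orange.ExamplesDistance_Constructor):
        def __call__(self, gen, weightID=0, ddist=None, basStat=None):
            # nothing needs to be precomputed for this simple distance
            return HammingDistance()

    class HammingDistance(orange.ExamplesDistance):
        def __call__(self, ex1, ex2):
            # count the positions (including the class) on which the examples differ
            return sum(1 for v1, v2 in zip(list(ex1), list(ex2)) if v1 != v2)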
MeasureAttribute is slightly more complex, since it can be given different sets of parameters. The class defines the way it will be called by setting the "needs" field (see the documentation on attribute evaluation for more details). (Note: this has been changed from the mess we had in the past. Any existing code should still work, or will need to be simplified if it does not.)

If needs is set to orange.MeasureAttribute.DomainContingency, the data from which the attribute is to be evaluated is given by the contingencies of all attributes in the dataset. The attributeNumber tells which of those attributes the function needs to evaluate. Finally, there are the apriori class probabilities, if the method can make use of them; the third argument can sometimes be None.

If needs equals orange.MeasureAttribute.Contingency_Class, you are given a class distribution and the contingency matrix for the attribute that is to be evaluated. In the context of decision tree induction, this is the class distribution in a node and the class distributions in the branches if this attribute is chosen. The third argument again gives the apriori class distribution, and can sometimes be None if the apriori distribution is unknown.

The last option is that needs is orange.MeasureAttribute.Generator; the measure is then given the examples themselves. The attribute can be given as an instance of int or of Variable - you might want to check the argument type before using it.

In all cases, the method must return a real number representing the quality of the attribute; higher numbers mean better attributes. If in your measure of quality higher values mean worse attributes, you can either negate or invert the number.
As an example, we will write a measure that is based on the cardinality of attributes. It will also have a flag by which the user decides whether attributes with higher or with lower cardinalities are preferred.
cb-measureattribute.py (uses lenses.tab)
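A sketch of the class (the exact signature for the Generator form - attribute, generator, apriori distribution, weight id - is an assumption based on the description above):

    import orange

    class MeasureAttribute_cardinality(orange.MeasureAttribute):
        def __init__(self, moreIsBetter=1):
            self.needs = orange.MeasureAttribute.Generator
            self.moreIsBetter = moreIsBetter

        def __call__(self, attr, gen, apriori=None, weightID=0):
            # the attribute can come as an int index or as a Variable
            if type(attr) == int:
                attr = gen.domain.attributes[attr]
            if self.moreIsBetter:
                return len(attr.values)
            else:
                return -len(attr.values)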
Alternatively, we can write the measure in the form of a function, but without the flag. To make it shorter, we will skip the fancy renaming of parameters.
cb-measureattribute.py (uses lenses.tab)
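The function form might look like this (a sketch; the same signature assumptions as above):

    def measure_cardinality(attr, gen, apriori=None, weightID=0):
        if type(attr) == int:
            attr = gen.domain.attributes[attr]
        return len(attr.values)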
To test the class and the function we shall induce a decision tree using the specified measure.
cb-measureattribute.py (uses lenses.tab)
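A sketch of the test (that TreeSplitConstructor_Attribute exposes a measure field is an assumption):

    treeLearner = orange.TreeLearner()
    treeLearner.split = orange.TreeSplitConstructor_Attribute()
    treeLearner.split.measure = MeasureAttribute_cardinality(1)
    tree = treeLearner(data)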
There are three two-valued attributes and one three-valued attribute. If we set moreIsBetter to 1, as above, the attribute in the root of the tree will be the three-valued age, while the attributes for the rest of the tree are chosen at random. If we set it to 0, the attribute age is used only when the values of all the remaining attributes have been checked.
To use the function measure_cardinality we don't need to wrap it into anything. If we simply set
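    treeLearner.split.measure = measure_cardinality   # assuming the treeLearner from above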
the function is automatically wrapped.
TransformValue is a simple class whose call operator gets a Value and returns another (or the same) Value. An example of its use is given in the page about classifiers from attributes.
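For a trivial sketch, a transformation that returns values unchanged:

    class Identity(orange.TransformValue):
        def __call__(self, val):
            return val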
The usual tree split constructors choose an attribute on which the split is based and construct a ClassifierFromVarFD to return the chosen attribute's value. But split constructors are capable of much more. To demonstrate this, we will write a split constructor that constructs the split based on the values of two attributes, joined in a Cartesian product. We will utilize the CartesianClassifier that we have already written above.
cb-splitconstructor.py (uses lenses.tab)
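A sketch of the split constructor, reconstructed from the description below (CartesianClassifier is the class defined earlier; the exact signatures of SubsetGenerator_constSize and BestOnTheFly are assumptions):

    import orange, orngMisc

    class SplitConstructor_CartesianMeasure(orange.TreeSplitConstructor):
        def __init__(self, measure):
            self.measure = measure

        def __call__(self, gen, weightID, contingencies, apriori,
                     candidates, nodeClassifier):
            attrs = list(gen.domain.attributes)
            selectBest = orngMisc.BestOnTheFly(orngMisc.compare2_firstBigger)
            haveCandidate = False

            # generate pairs of attributes; the constructor's arguments are
            # an assumption
            for var1, var2 in orange.SubsetGenerator_constSize(attrs, B=2):
                if not (candidates[attrs.index(var1)]
                        and candidates[attrs.index(var2)]):
                    continue
                cc = CartesianClassifier(var1, var2)
                # let the measure compute the new attribute's values on the fly
                cc.classVar.getValueFrom = cc
                meas = self.measure(cc.classVar, gen)
                selectBest.candidate((meas, cc))
                haveCandidate = True

            if not haveCandidate:
                return None

            quality, cc = selectBest.winner()
            # (branchSelector, branch descriptions, branch sizes, quality,
            #  index of the spent attribute)
            return (cc, cc.classVar.values, None, quality, -1)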
We again use the class BestOnTheFly from the orngMisc module. This time we need to give it a compare function that compares the first elements of tuples, orngMisc.compare2_firstBigger, since we will feed it tuples of (quality of the split, selector). The best selector and its quality are retrieved by the method winner.
The class orange.SubsetGenerator_constSize is used to generate pairs of attributes. For each pair, we check that both attributes are among the candidates.
Now comes the tricky business. We construct a CartesianClassifier to compute the Cartesian product of the two attributes. CartesianClassifier's constructor prepares a new attribute, which is stored in its classVar. The quality of the split needs to be determined as the quality of this attribute, as measured by self.measure - "meas = self.measure(cc.classVar, gen)" does the job. The problem, however, is that the given examples (gen) do not have the attribute cc.classVar.
Not all, but many of Orange's methods act like this: when asked to do something with an attribute that does not exist in the given domain, they try to compute its values from the attributes that are available. More precisely, the attribute needs to have a pointer to a classifier that is able to compute its value. In our case, we set cc.classVar's field getValueFrom to cc.
When self.measure notices that the attribute cc.classVar does not exist in the domain gen.domain, it uses cc.classVar.getValueFrom to compute its values on the fly.
If you don't understand the last few paragraphs, here is a short summary: using some magic, we construct a classifier that can be used as a split criterion (a node's branchSelector), assess its quality and show it to selectBest. More about this is written in the documentation on attribute descriptors.
When the loop ends, we return None if no split was found (possibly because there were not enough candidates). Otherwise, we retrieve the winning quality and selector, and return a tuple consisting of
- the branchSelector: a classifier that returns a value computed from the two used attributes;
- the descriptions of the branches: the names of the values of the constructed attribute;
- the sizes of the branches: we return None and let TreeLearner find them itself;
- the quality of the split;
- the index of the spent attribute: -1, since the split is based on two attributes and neither of them is spent.
The code can be tested with the following script.
cb-splitconstructor.py (uses lenses.tab)
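A sketch of the test:

    treeLearner = orange.TreeLearner()
    treeLearner.split = SplitConstructor_CartesianMeasure(
        orange.MeasureAttribute_gainRatio())
    tree = treeLearner(data)
    for ex in data[:5]:
        print ex.getclass(), tree(ex)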
TreeStopCriteria is a simple class. Its arguments are the examples, the id of the meta-attribute with weights (or 0 if the examples are not weighted) and a DomainContingency. The induction stops if TreeStopCriteria returns 1 (or anything that represents "true" in Python). The class is peculiar in being the only non-abstract class whose call operator can be (really) overloaded; thus, it is possible to call the inherited call operator. Even more, you should do so.
For a brief example, let us write a stop criterion that calls the common stop criteria but, besides that, randomly stops the induction in 20% of cases.
part of cb-stopcriteria.py (uses lenses.tab)
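A sketch of the snippet:

    from random import randint
    import orange

    defStop = orange.TreeStopCriteria()
    treeLearner = orange.TreeLearner()
    treeLearner.stop = lambda gen, weightID, contingency: \
        defStop(gen, weightID, contingency) or randint(1, 5) == 1
    tree = treeLearner(data)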
We've defined a default stop criterion, defStop, to avoid constructing one at each call of our function. The whole stopping criterion is hidden in the lambda function, which stops when the default says so or when a random number between 1 and 5 equals 1.
To demonstrate a call of the inherited call operator, let us do the same thing by deriving a new class.
part of cb-stopcriteria.py (uses lenses.tab)
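A sketch of the derived class:

    class StoppingCriterion_random(orange.TreeStopCriteria):
        def __call__(self, gen, weightID, contingency):
            # call the inherited operator first, then add the random 20% stop
            return orange.TreeStopCriteria.__call__(self, gen, weightID, contingency) \
                   or randint(1, 5) == 1

    treeLearner.stop = StoppingCriterion_random()
    tree = treeLearner(data)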
An example splitter's task is to split a list of examples (usually an ExamplePointerTable or an ExampleTable) into subsets and return them as an ExampleGeneratorList. The arguments it gets are a TreeNode (it will need at least the branchSelector; some splitters also use branchSizes), a list of examples and the id of the meta-attribute with example weights (or 0, if they are not weighted).
If some examples are split among the branches so that only a part of an example belongs to a branch, the splitter should construct new weight meta-attributes and fill them with the example weights. A list of weight ids should then be returned in a tuple with the ExampleGeneratorList. The exact mechanics of this are given on the page describing tree induction.
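For illustration, a sketch of a splitter that puts each example into the branch chosen by branchSelector and simply skips examples for which the branch is unknown (constructing an ExampleGeneratorList from a Python list is an assumption):

    class Splitter_simple(orange.TreeExampleSplitter):
        def __call__(self, node, gen, weightID=0):
            # one subset per branch
            subsets = [orange.ExampleTable(gen.domain)
                       for d in node.branchDescriptions]
            for ex in gen:
                branch = node.branchSelector(ex)
                if not branch.isSpecial():
                    subsets[int(branch)].append(ex)
            return orange.ExampleGeneratorList(subsets)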
Descenders are about the trickiest components of Orange's trees. They get two arguments, a starting node (not necessarily the tree root) and an example, and return a tuple with the finishing node and, optionally, a discrete distribution.
If there is no distribution, the TreeClassifier (which usually calls the descender) uses the returned node to classify the example. The node's nodeClassifier will thus probably need to be defined (unless you've patched TreeClassifier or written your own version of it).
If a distribution is returned, the branches below the returned node vote on the example's class, and the distribution represents the weights of the votes for the individual branches. Voting requires additional calls of the descender, but that is something the TreeClassifier needs to worry about.
The descender's real job is to decide what should happen when the descent halts because a branch for the example cannot be determined. It can either return the node (so it will be used to classify the example without looking any further), silently decide for some branch, or request a vote.
A general descender looks like this:
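A sketch of the skeleton the text describes:

    class Descender_template(orange.TreeDescender):
        def __call__(self, node, example):
            while node.branchSelector:
                branch = node.branchSelector(example)
                if branch.isSpecial() or int(branch) >= len(node.branches):
                    # the interesting part: return the node, pick a branch or vote
                    break
                nextNode = node.branches[int(branch)]
                if not nextNode:
                    break   # a null node below - treat this node as a leaf
                node = nextNode
            return node, None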
Descenders descend until they reach a node with no branchSelector - a leaf. They call each node's branchSelector to find the branch to follow. If the value is defined, they check whether the node below is a null-node; if it is, they act as if the current node were a leaf. Descenders differ in what they do when the branch index is unknown or out of range.
In this section, we will suppose that the tree has already been induced (using, say, the default settings for TreeLearner) and stored in a TreeClassifier named tree.
For the first exercise, we will implement a descender that decides for a random branch when the descent stops. The decision will be truly random: it will ignore any probabilities that might be computed from branchSizes or from the values of the example's other attributes.
part of cb-descender.py (uses lenses.tab)
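A sketch, following the template above:

    from random import randint

    class Descender_random(orange.TreeDescender):
        def __call__(self, node, example):
            while node.branchSelector:
                branch = node.branchSelector(example)
                if branch.isSpecial() or int(branch) >= len(node.branches):
                    # no valid branch - pick one at random
                    branch = randint(0, len(node.branches) - 1)
                    print "Randomly selected branch", branch
                nextNode = node.branches[int(branch)]
                if not nextNode:
                    break
                node = nextNode
            return node, None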
Everything goes according to the above template: when the branchSelector does not return a (valid) branch, we select a random branch (and print it out for debugging purposes).
To see how it works, we'll take the third example from the table and remove the value of the attribute needed at the root of the tree.
part of cb-descender.py (uses lenses.tab)
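A sketch (that 'tear_rate' is the attribute tested at the root of this tree, and that assigning "?" sets an unknown value, are assumptions):

    ex = orange.Example(data[2])
    ex["tear_rate"] = "?"
    print ex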
We'll now tell the classifier to use our descender, and classify the example - we'll call the classifier five times.
part of cb-descender.py (uses lenses.tab)
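A sketch:

    tree.descender = Descender_random()
    for i in range(5):
        print tree(ex)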
When the descender decides for the second branch (branch 1), astigmatism and age are checked and the example is classified as 'hard'. When the descender takes the first branch (0), the classifier returns 'none'.
Our next descender will request a vote. It will, however, disregard any known probabilities and assign random weights to the branches.
part of cb-descender.py (uses lenses.tab)
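A sketch:

    from random import randint

    class Descender_randomVote(orange.TreeDescender):
        def __call__(self, node, example):
            while node.branchSelector:
                branch = node.branchSelector(example)
                if branch.isSpecial() or int(branch) >= len(node.branches):
                    # random vote weights, one for each branch
                    votes = orange.DiscDistribution(
                        [randint(0, 100) for b in node.branches])
                    votes.normalize()
                    print "Voting with weights", votes
                    return node, votes
                nextNode = node.branches[int(branch)]
                if not nextNode:
                    break
                node = nextNode
            return node, None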
In the first interesting line, we construct a discrete distribution with random integers between 0 and 100, one for each branch. We normalize it and return the current node and the weights of the votes. It's as simple as that.
We'll check the descender on the same example as above.
The first output line gives the weights of the branches - 0.338 for the first one and 0.662 for the second - which is reflected in the final answer.
As the last example, here is a handy descender that prints out the descriptions of the branches on the way. When the branchSelector does not return a (valid) branch, it simply returns the current node, as if it were a leaf (you can change this if you want to).
part of cb-descender.py (uses lenses.tab)
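A sketch:

    class Descender_report(orange.TreeDescender):
        def __call__(self, node, example):
            while node.branchSelector:
                branch = node.branchSelector(example)
                if branch.isSpecial() or int(branch) >= len(node.branches):
                    break   # return the current node, as if it were a leaf
                nextNode = node.branches[int(branch)]
                if not nextNode:
                    break
                print node.branchDescriptions[int(branch)],
                node = nextNode
            print
            return node, None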
We'll test it on the first example from the table (without removing any values).
[2] For those who are familiar with C++ terms but not with Python's API: each Python object has a pointer to a type description - the object that you get by calling Python's built-in function type(). The type description gives the name of the type, the memory size of its instances, several flags and pointers to the functions that the object provides. This is somewhat similar to a C++ list of virtual methods, except that here the methods are defined in advance and have fixed positions. For example, the tp_cmp pointer points to the function that should be called when this object is to be compared to another; if it is NULL, the object does not support comparison.
[3] md has two tables of virtual methods (vfptr), one for Python and one for C++. When Python calls it, it uses the __call__ you defined; when C++ calls it, it calls the function defined in C++.
[4] There is a case when the intermediate class is revealed: TreeStopCriteria_Python. This is needed because the class TreeStopCriteria is not abstract, but we won't discuss the details here.