
Build Your Own Learner

This part of the tutorial shows how to build your own learners and classifiers in Python. This is especially important for those of you who want to test your own methods or combine existing techniques in Orange. Developing your own learners in Python makes prototyping of new methods fast and enjoyable.

There are different ways to build learners and classifiers in Python. We will take the route that shows how to do this correctly, in the sense that you will be able to use your learner as if it were any learner that Orange originally provides. What is distinct about Orange learners is the way they are invoked and what they return. Let us start with an example. Say that we have Learner(), which is some learner in Orange. The learner can be called in two different ways:

learner = Learner()
classifier = Learner(data)

In the first line, the learner is invoked without a data set; in that case it should return an instance of the learner, so that later you may say classifier = learner(data), or you may pass the learner itself to some validation procedure (say orngEval.CrossValidation([learner], data)). In the second line, the learner is called with the data and returns a classifier.
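For instance, here is a minimal sketch of the first calling style, using the voting data set and Orange's built-in naive Bayes as used elsewhere in this tutorial:

import orange, orngEval

data = orange.ExampleTable("voting")
learner = orange.BayesLearner()                      # a learner, not yet given any data
results = orngEval.CrossValidation([learner], data)  # the validation procedure calls learner(data) itself
print "CA = %5.3f" % orngEval.CA(results)[0]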

Classifiers should be called with a data instance to classify, and should return either a class value (the default), class probabilities, or both:

value = classifier(instance)
value = classifier(instance, orange.GetValue)
probabilities = classifier(instance, orange.GetProbabilities)
value, probabilities = classifier(instance, orange.GetBoth)

Here is a short example:

> python
>>> import orange
>>> data = orange.ExampleTable("voting")
>>> learner = orange.BayesLearner()
>>> classifier = learner(data)
>>> classifier(data[0])
republican
>>> classifier(data[0], orange.GetBoth)
(republican, [0.99999994039535522, 7.9730767765795463e-008])
>>> classifier(data[0], orange.GetProbabilities)
[0.99999994039535522, 7.9730767765795463e-008]
>>> 
>>> c = orange.BayesLearner(data)
>>> c(data[12])
democrat
>>>

Throughout our examples, we will assume that the learner and the corresponding classifier are defined in a single file (module) that does not contain any other code. This helps with code reuse: if you want to use your new method anywhere else, you just import it from that file. Each such module will contain a learner class and a class Classifier.

We will use this schema to define a learner that extends the built-in naive Bayes with embedded categorization of the training data. Then we will show how to write a naive Bayesian classifier in Python from scratch. We conclude with a Python implementation of bagging.

Naive Bayes with Discretization

Let us build a learner/classifier that is an extension of the built-in naive Bayes and which categorizes the data before learning (see also the lesson on Categorization). We will define a module nbdisc.py that implements two classes, Learner and Classifier. The following is the Python code for the Learner class:

class Learner from nbdisc.py

import orange   # nbdisc.py relies on Orange's discretization and naive Bayes

class Learner(object):
    def __new__(cls, examples=None, name='discretized bayes', **kwds):
        learner = object.__new__(cls)
        if examples:
            learner.__init__(name)   # force init
            return learner(examples)
        else:
            return learner  # invokes the __init__

    def __init__(self, name='discretized bayes'):
        self.name = name

    def __call__(self, data, weight=None):
        # discretize the data, then build a naive Bayesian model on it
        disc = orange.Preprocessor_discretize( \
            data, method=orange.EntropyDiscretization())
        model = orange.BayesLearner(disc, weight)
        return Classifier(classifier=model)

Learner has three methods. Method __new__ creates the object and returns a learner or a classifier, depending on whether examples were passed to the call. If the examples were passed as an argument, the method calls the learner (invoking its __call__ method). Method __init__ is invoked when the learner object is constructed. Notice that all it does is remember the only argument this class can be called with, i.e., the argument name, which defaults to 'discretized bayes'. If you expect any other arguments for your learners, you should handle them here (store them as the object's attributes using self).
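To make the control flow concrete, here is a small sketch of the two ways the class can be entered (assuming nbdisc.py and the iris data used in the test below):

import orange, nbdisc

data = orange.ExampleTable("iris")

learner = nbdisc.Learner()          # no examples: __new__ returns an initialized learner
classifier = learner(data)          # __call__ builds and returns a Classifier

classifier2 = nbdisc.Learner(data)  # examples given: __new__ trains and returns a Classifier at once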

If we have created an instance of the learner (and did not pass the examples as arguments), the next call of this learner will invoke the method __call__, where the essence of our learner is implemented. Notice also that we have included an argument for a vector of instance weights, which is passed on to the naive Bayesian learner. In our learner, we first discretize the data using Fayyad & Irani's entropy-based discretization, then build a naive Bayesian model and finally pass it to the class Classifier. You may expect that at its first invocation the Classifier will just remember the model we have called it with:

class Classifier from nbdisc.py

class Classifier:
    def __init__(self, **kwds):
        self.__dict__.update(kwds)

    def __call__(self, example, resultType=orange.GetValue):
        return self.classifier(example, resultType)

The method __init__ in Classifier is rather general: it makes Classifier remember all arguments it was called with. They can then be accessed as the classifier's attributes (self.argument_name). When Classifier is called, it expects an example and an optional argument that specifies the type of result to be returned.

This completes our code for the naive Bayesian classifier with discretization. You can see that the code is fairly short (fewer than 20 lines), and it can easily be extended or changed if we want to do something else as well (like feature subset selection, ...).

Here are now a few lines to test our code:

uses iris.tab and nbdisc.py

> python
>>> import orange, nbdisc
>>> data = orange.ExampleTable("iris")
>>> classifier = nbdisc.Learner(data)
>>> print classifier(data[100])
Iris-virginica
>>> classifier(data[100], orange.GetBoth)
(<Iris-virginica>, <0.000, 0.001, 0.999>)
>>>

For a more elaborate test that also shows the use of a learner (that is not given the data at its initialization), here is a script that does 10-fold cross validation:

nbdisc_test.py (uses iris.tab and nbdisc.py)

import orange, orngEval, nbdisc

data = orange.ExampleTable("iris")
results = orngEval.CrossValidation([nbdisc.Learner()], data)
print "Accuracy = %5.3f" % orngEval.CA(results)[0]

The accuracy on this data set is about 92%. You may try to obtain a better accuracy by using some other type of discretization, or try some other learner on this data (hint: k-NN should perform better).
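If you want to follow the hint, a sketch along the lines of the test scripts used elsewhere in this tutorial could look like this (the choice of k=10 neighbors is just for illustration):

import orange, orngEval, nbdisc

data = orange.ExampleTable("iris")
knn = orange.kNNLearner(k=10)
knn.name = "knn"
learners = [nbdisc.Learner(), knn]
results = orngEval.CrossValidation(learners, data)
for i in range(len(learners)):
    print learners[i].name, orngEval.CA(results)[i]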

Python Implementation of Naive Bayesian Classifier

The naive Bayesian classifier we will implement in this lesson uses the standard naive Bayesian algorithm, also described in Mitchell: Machine Learning, 1997 (pages 177-180). Essentially, if an instance is described with n attributes a_i (i from 1 to n), then the class v from the set of possible classes V to which the naive Bayes classifier assigns the instance is:

v = \arg\max_{v_j \in V} P(v_j) \prod_{i=1}^{n} P(a_i \mid v_j)

We will also compute a vector of elements

p_j = P(v_j) \prod_{i=1}^{n} P(a_i \mid v_j)

which, after normalization so that the sum of p_j equals 1, represent the class probabilities. The class probabilities and conditional probabilities in the above formulas are estimated from the training data: the class probability is equal to the relative class frequency, while the conditional probability of an attribute value given the class is computed as the proportion of instances whose i-th attribute has value a_i among the instances from class v_j.
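For illustration, in Mitchell's PlayTennis data (used again below), 9 of the 14 training examples have PlayTennis=yes and 5 have PlayTennis=no, so the estimated class probabilities are P(yes) = 9/14 ≈ 0.64 and P(no) = 5/14 ≈ 0.36.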

To complicate things just a little bit, the m-estimate (see Mitchell, and Cestnik, IJCAI-1990) will be used instead of relative frequency when computing the conditional probabilities. So (following the example in Mitchell), when assessing P = P(Wind=strong | PlayTennis=no), we find that the total number of training examples with PlayTennis=no is n=5, and of these there are n_c=3 for which Wind=strong; using relative frequency, the corresponding probability would be

P = \frac{n_c}{n} = \frac{3}{5} = 0.6

Relative frequency has a problem when the number of instances is small; to alleviate this, the m-estimate assumes that there are m imaginary cases (m is also referred to as the equivalent sample size), each distributed according to a prior probability p. Our conditional probability using the m-estimate is then computed as

P = \frac{n_c + m \, p}{n + m}

Often, instead of a uniform prior p, the relative class frequency as estimated from the training data is used (this is what our code below does).
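To continue the numeric example above: with n_c = 3, n = 5, m = 2 (an arbitrary choice for illustration) and p taken as the relative frequency of PlayTennis=no, p = 5/14 ≈ 0.36, the m-estimate gives P ≈ (3 + 2 · 0.36) / (5 + 2) ≈ 0.53, a somewhat more conservative value than the relative frequency of 0.60.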

We will develop a module called bayes.py that will implement our naive Bayes learner and classifier. The structure of the module will be as in the previous example: again, we will implement two classes, one for learning and the other for classification. Here is the Learner_Class:

class Learner_Class from bayes.py

import orange   # bayes.py builds on Orange's data structures

class Learner_Class:
    def __init__(self, m=0.0, name='std naive bayes', **kwds):
        self.__dict__.update(kwds)
        self.m = m
        self.name = name

    def __call__(self, examples, weight=None, **kwds):
        for k in kwds.keys():
            self.__dict__[k] = kwds[k]
        domain = examples.domain

        # first, compute class probabilities
        n_class = [0.] * len(domain.classVar.values)
        for e in examples:
            n_class[int(e.getclass())] += 1

        p_class = [0.] * len(domain.classVar.values)
        for i in range(len(domain.classVar.values)):
            p_class[i] = n_class[i] / len(examples)

        # count examples with specific attribute and
        # class value, pc[attribute][value][class]

        # initialization of pc
        pc = []
        for i in domain.attributes:
            p = [[0.] * len(domain.classVar.values) for j in range(len(i.values))]
            pc.append(p)

        # count instances, store them in pc
        for e in examples:
            c = int(e.getclass())
            for i in range(len(domain.attributes)):
                if not e[i].isSpecial():
                    pc[i][int(e[i])][c] += 1.0

        # compute conditional probabilities with the m-estimate
        for i in range(len(domain.attributes)):
            for j in range(len(domain.attributes[i].values)):
                for k in range(len(domain.classVar.values)):
                    pc[i][j][k] = (pc[i][j][k] + self.m * p_class[k]) / \
                                  (n_class[k] + self.m)

        return Classifier(m=self.m, domain=domain, p_class=p_class, \
                          p_cond=pc, name=self.name)

Initialization of Learner_Class saves the two arguments, m and the name of the classifier. Notice that both parameters are optional, and the default value for m is 0, making the naive Bayes m-estimate equal to relative frequency unless the user specifies some other value for m. The __call__ function is called with the training data set, computes the class and conditional probabilities, and constructs the Classifier, passing the probabilities along with some other variables required for classification.
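Note that the test scripts below call bayes.Learner(...) rather than bayes.Learner_Class(...), so the module also needs a small entry point that follows the calling convention from the beginning of this lesson. Here is a minimal sketch of such a wrapper (the function name Learner and its behavior mirror the nbdisc example; this is our assumption of how the module is completed):

def Learner(examples=None, **kwds):
    learner = Learner_Class(**kwds)
    if examples:                 # data given: train at once and return a Classifier
        return learner(examples)
    else:                        # no data: return the learner itself
        return learner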

class Classifier from bayes.py

class Classifier:
    def __init__(self, **kwds):
        self.__dict__.update(kwds)

    def __call__(self, example, result_type=orange.GetValue):
        # compute the class probabilities
        p = map(None, self.p_class)
        for c in range(len(self.domain.classVar.values)):
            for a in range(len(self.domain.attributes)):
                if not example[a].isSpecial():
                    p[c] *= self.p_cond[a][int(example[a])][c]

        # normalize probabilities to sum to 1
        sum = 0.
        for pp in p:
            sum += pp
        if sum > 0:
            for i in range(len(p)):
                p[i] = p[i] / sum

        # find the class with highest probability
        v_index = p.index(max(p))
        v = orange.Value(self.domain.classVar, v_index)

        # return the value based on requested return type
        if result_type == orange.GetValue:
            return v
        if result_type == orange.GetProbabilities:
            return p
        return (v, p)

    def show(self):
        print 'm=', self.m
        print 'class prob=', self.p_class
        print 'cond prob=', self.p_cond

Upon first invocation, the classifier stores the values of the parameters it was called with (__init__). When called with a data instance, it first computes the class probabilities using the probabilities sent by the learner. The probabilities are normalized to sum to 1. The class with the highest probability is then found, and the classifier predicts this class. Notice that we have also added a method called show, which reports m, the class probabilities and the conditional probabilities:

uses voting.tab

> python
>>> import orange, bayes
>>> data = orange.ExampleTable("voting")
>>> classifier = bayes.Learner(data)
>>> classifier.show()
m= 0.0
class prob= [0.38620689655172413, 0.61379310344827587]
cond prob= [[[0.79761904761904767, 0.38202247191011235], ...]]
>>>

The following script tests our naive Bayes and compares it to 10-nearest neighbors. Running the script (do run it yourself) reports classification accuracies of about 90% (somewhat surprisingly, on this data set kNN does a bit better).

bayes_test.py (uses bayes.py and voting.tab)

import orange, orngEval, bayes

data = orange.ExampleTable("voting")
bayes = bayes.Learner(m=2, name='my bayes')
knn = orange.kNNLearner(k=10)
knn.name = "knn"
learners = [knn, bayes]
results = orngEval.CrossValidation(learners, data)
for i in range(len(learners)):
    print learners[i].name, orngEval.CA(results)[i]

Bagging

Here we show how to use the schema that allows us to build our own learners/classifiers for bagging. While you can find bagging, boosting, and other ensemble-related methods in the orngEnsemble module, we thought that explaining how to code bagging in Python might make for a nice example. The following pseudo-code (from Witten & Frank: Data Mining) illustrates the main idea of bagging:

MODEL GENERATION
Let n be the number of instances in the training data.
For each of t iterations:
    Sample n instances with replacement from training data.
    Apply the learning algorithm to the sample.
    Store the resulting model.

CLASSIFICATION
For each of the t models:
    Predict class of instance using model.
Return class that has been predicted most often.

Using the above idea, our Learner_Class will need to build t classifiers and pass them to Classifier, which, once given a data instance, will use them for classification. We will allow the parameter t to be specified by the user, with 10 as the default.

The code for the Learner_Class is therefore:

class Learner_Class from bagging.py

import random    # used for sampling with replacement
import orange

class Learner_Class:
    def __init__(self, learner, t=10, name='bagged classifier'):
        self.t = t
        self.name = name
        self.learner = learner

    def __call__(self, examples, weight=None):
        n = len(examples)
        classifiers = []
        for i in range(self.t):
            # sample n instance indices with replacement
            selection = []
            for j in range(n):
                selection.append(random.randrange(n))
            data = examples.getitems(selection)
            classifiers.append(self.learner(data))
        return Classifier(classifiers=classifiers, \
                          name=self.name, domain=examples.domain)

Upon invocation, __init__ stores the base learner (the one that will be bagged), the value of the parameter t, and the name of the classifier. Note that while the learner requires the base learner to be specified, the parameters t and name are optional.

When the learner is called with examples, a list of t classifiers is built and stored in the variable classifiers. Notice that for sampling the data with replacement, a list of data instance indices is built (selection) and then used to sample the data from the training examples (examples.getitems). Finally, a Classifier is constructed with the list of classifiers, the name and the domain information.
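As with bayes.py, the test script at the end of this lesson calls bagging.Learner(learner=tree, t=5), so bagging.py needs the same kind of entry point. A minimal sketch, following the same convention as before (again our assumption of how the module is completed):

def Learner(examples=None, **kwds):
    learner = Learner_Class(**kwds)
    if examples:
        return learner(examples)
    else:
        return learner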

class Classifier from bagging.py

class Classifier:
    def __init__(self, **kwds):
        self.__dict__.update(kwds)

    def __call__(self, example, resultType=orange.GetValue):
        freq = [0.] * len(self.domain.classVar.values)
        for c in self.classifiers:
            freq[int(c(example))] += 1
        index = freq.index(max(freq))
        value = orange.Value(self.domain.classVar, index)
        for i in range(len(freq)):
            freq[i] = freq[i] / len(self.classifiers)
        if resultType == orange.GetValue:
            return value
        elif resultType == orange.GetProbabilities:
            return freq
        else:
            return (value, freq)

For initialization, Classifier stores all parameters it was invoked with. When called with a data instance, a list freq is initialized whose length equals the number of classes and which records the number of models that classify the instance into each class. The class that the majority of models voted for is returned. While it would be possible to return the class's index, or even its name, by convention classifiers in Orange return a Value object instead.

Notice that while bagging was originally not intended to compute class probabilities, we compute them as the proportion of models that voted for a certain class (this is probably not a proper probability estimate, but it suffices for our example and does no harm if only class values, and not probabilities, are used).
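For instance, with t=5 models of which four predict the first class and one the second, the returned probability vector would be <0.8, 0.2>.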

Here is the code that tests the bagging we have just implemented. It compares a decision tree and its bagged variant. Run it yourself to see which one does better!

bagging_test.py (uses bagging.py and adult_sample.tab)

import orange, orngTree, orngEval, bagging

data = orange.ExampleTable("adult_sample")
tree = orngTree.TreeLearner(mForPrunning=10, minExamples=30)
tree.name = "tree"
baggedTree = bagging.Learner(learner=tree, t=5)
learners = [tree, baggedTree]
results = orngEval.CrossValidation(learners, data, folds=5)
for i in range(len(learners)):
    print learners[i].name, orngEval.CA(results)[i]
