orngImpute: An Imputation Wrapper for Learning Algorithms

This module used to be larger, but most of its code was moved into Orange's core for various reasons. It now contains only a wrapper to be used with learning algorithms that cannot handle missing values: it imputes the missing values, calls the learner and, if imputation is also needed at classification time, wraps the resulting classifier into another wrapper that imputes the missing values in the examples to be classified.

Even so, the module is somewhat redundant: all learners that cannot handle missing values should, in principle, provide a slot for an imputer constructor. For instance, orange.LogRegLearner has an attribute imputerConstructor; even if you do not set it, it performs some imputation by default.

The module consists of two classes. The first is ImputeLearner. It is basically a learner, so its constructor returns either an instance of ImputeLearner or, if called with examples, an instance of some classifier. There are a few attributes that need to be set, though.

Attributes

baseLearner
The wrapped learner.
imputerConstructor
An instance of a class derived from ImputerConstructor (or a class with the same call operator).
dontImputeClassifier
If given and set (this attribute is optional), the classifier will not be wrapped into an imputer. Do this if the classifier doesn't mind if the examples it is given have missing values.

The learner is best illustrated by its code - here's its complete __call__ operator.

def __call__(self, data, weight=0):
    trained_imputer = self.imputerConstructor(data, weight)
    imputed_data = trained_imputer(data, weight)
    baseClassifier = self.baseLearner(imputed_data, weight)
    if self.dontImputeClassifier:
        return baseClassifier
    else:
        return ImputeClassifier(baseClassifier, trained_imputer)

So "learning" goes like this. ImputeLearner first constructs the imputer (that is, it calls self.imputerConstructor to get a trained imputer). Then it uses the imputer to impute the data, and calls the given baseLearner to construct a classifier. For instance, baseLearner could be a learner for logistic regression and the result would be a logistic regression model. If the classifier can handle unknown values (that is, if dontImputeClassifier is set), we return it as it is; otherwise we wrap it into ImputeClassifier, which is given the base classifier and the imputer it can use to impute the missing values in (testing) examples.
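The same pattern can be illustrated outside of Orange. The following is a minimal plain-Python sketch with hypothetical names and data (mean_imputer_constructor, threshold_learner and the toy rows are all made up for illustration; this is not the Orange API): the "imputer constructor" learns per-column means from the data, and the learner wrapper imputes, trains a base learner, and wraps the result together with the trained imputer.

```python
def mean_imputer_constructor(data):
    """Train an imputer: replace None with the per-column training mean."""
    n_cols = len(data[0])
    means = []
    for col in range(n_cols):
        known = [row[col] for row in data if row[col] is not None]
        means.append(sum(known) / float(len(known)))

    def imputer(row):
        # return a copy of the row with missing values filled in
        return [means[c] if v is None else v for c, v in enumerate(row)]
    return imputer

def threshold_learner(data):
    """A toy base learner: classify by comparing the first attribute
    to its mean over the (already imputed) training data."""
    mean0 = sum(row[0] for row in data) / float(len(data))
    return lambda row: row[0] > mean0

class ImputeClassifier:
    def __init__(self, base_classifier, imputer):
        self.base_classifier = base_classifier
        self.imputer = imputer

    def __call__(self, row):
        # impute the example, then classify it
        return self.base_classifier(self.imputer(row))

class ImputeLearner:
    def __init__(self, base_learner, imputer_constructor):
        self.base_learner = base_learner
        self.imputer_constructor = imputer_constructor

    def __call__(self, data):
        trained_imputer = self.imputer_constructor(data)
        imputed_data = [trained_imputer(row) for row in data]
        base_classifier = self.base_learner(imputed_data)
        # wrap the classifier together with the trained imputer
        return ImputeClassifier(base_classifier, trained_imputer)

# usage on toy data with missing values (None):
data = [[1.0, 2.0], [None, 4.0], [3.0, None]]
learner = ImputeLearner(threshold_learner, mean_imputer_constructor)
clf = learner(data)
```

After training, clf([None, 5.0]) first imputes the missing first attribute with the training mean 2.0 and only then classifies; the base learner never sees a missing value.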

The other class in the module is, of course, the classifier with imputation, ImputeClassifier.

Attributes

baseClassifier
The wrapped classifier.
imputer
The imputer for imputation of unknown values.

This class is even more trivial than the learner. Its constructor accepts two arguments, the classifier and the imputer, which are stored into the corresponding attributes. The call operator which does the classification then looks like this:

def __call__(self, ex, what=orange.GetValue):
    return self.baseClassifier(self.imputer(ex), what)

It imputes the missing values by calling the imputer and passes the imputed example to the base classifier.

Note that in this setup the imputer is trained on the training data: even if you do cross validation, the imputer is trained on the correct data. In the classification phase we again use the imputer that was trained on the training data only.
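This discipline can be checked with a small self-contained sketch (plain Python, hypothetical data, not the Orange API): in each fold, the imputation statistics are computed from that fold's training part only, so the same missing value is filled differently in different folds.

```python
def column_mean(rows, col):
    """Mean of the known (non-None) values in the given column."""
    known = [r[col] for r in rows if r[col] is not None]
    return sum(known) / float(len(known))

# two hypothetical (train, test) folds; the test example is missing its value
folds = [
    ([[1.0], [3.0]], [[None]]),   # fold 1: training mean is 2.0
    ([[5.0], [7.0]], [[None]]),   # fold 2: training mean is 6.0
]

imputed = []
for train, test in folds:
    m = column_mean(train, 0)  # statistic from this fold's training data only
    imputed.append([m if v is None else v for r in test for v in r])

# each fold imputes with its own training statistic: [[2.0], [6.0]]
```

The test example never contributes to the statistic it is imputed with, which is exactly why training the imputer inside each fold keeps cross-validation honest.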

Now for an example. Although most of Orange's learning algorithms will take care of imputation internally, if needed, it can sometimes happen that an expert will be able to tell you exactly what to put in the data instead of the missing values. The documentation on imputers in the Reference Guide presents various classes for imputation; for this example we shall suppose that we want to impute the minimal value of each attribute. We will try to determine whether the naive Bayesian classifier with its implicit internal imputation works better than one that uses imputation by minimal values.

part of imputation.py

import orange, orngImpute, orngTest, orngStat

ba = orange.BayesLearner()
imba = orngImpute.ImputeLearner(baseLearner=ba,
    imputerConstructor=orange.ImputerConstructor_minimal)

data = orange.ExampleTable("voting")
res = orngTest.crossValidation([ba, imba], data)
CAs = orngStat.CA(res)

print "Without imputation: %5.3f" % CAs[0]
print "With imputation: %5.3f" % CAs[1]

Note that we constructed just one instance of orange.BayesLearner, but this same instance is used twice in each fold: once it is given the examples as they are and returns an instance of orange.BayesClassifier; the second time it is called by imba, and the orange.BayesClassifier it returns is wrapped into orngImpute.ImputeClassifier. We thus have only one learner, but it produces two different classifiers in each round of testing.