Lookup Classifiers

Lookup classifiers predict classes by looking into stored lists of cases. There are two kinds of such classifiers in Orange. The simpler and faster ClassifierByLookupTable uses up to three discrete attributes and stores a mapping from the values of those attributes to the class value. The more complex ClassifierByExampleTable stores an ExampleTable and predicts the class by matching the example against the examples in the table.

The natural habitat of these classifiers is feature construction: they usually reside in the getValueFrom fields of constructed attributes to facilitate their automatic computation. For instance, the following script shows how to translate the Monk 1 dataset features into a more useful subset that includes the attributes a, b and e, plus attributes that tell whether a and b are equal and whether e is 1 (don't worry about the details; they follow later).

part of ClassifierByLookupTable.py (uses monk1.tab)

import orange

data = orange.ExampleTable("monk1")
a, b, e = data.domain["a"], data.domain["b"], data.domain["e"]

ab = orange.EnumVariable("a==b", values = ["no", "yes"])
ab.getValueFrom = orange.ClassifierByLookupTable(ab, a, b,
    ["yes", "no", "no", "no", "yes", "no", "no", "no", "yes"])

e1 = orange.EnumVariable("e==1", values = ["no", "yes"])
e1.getValueFrom = orange.ClassifierByLookupTable(e1, e,
    ["yes", "no", "no", "no", "?"])

data2 = data.select([a, b, ab, e, e1, data.domain.classVar])

We can check the correctness of the script by printing out several random examples from data2.

>>> for i in range(5):
...     print data2.randomexample()
['1', '1', 'yes', '4', 'no', '1']
['3', '3', 'yes', '2', 'no', '1']
['2', '1', 'no', '4', 'no', '0']
['2', '1', 'no', '1', 'yes', '1']
['1', '1', 'yes', '3', 'no', '1']

The first ClassifierByLookupTable takes the values of attributes a and b and computes the value of ab according to the given table. The first three values correspond to a=1 and b=1, 2, 3; for the first combination the value of ab should be "yes", while for the other two a and b are different. The next triplet corresponds to a=2; here, the middle value is "yes"...

The second lookup is simpler: since it involves only a single attribute, the list is a simple one-to-one mapping from the four-valued e to the two-valued e1. The last value in the list is returned when e is unknown, and tells that e1 should then be unknown as well.
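The mechanics of such a single-attribute lookup can be sketched in plain Python (a hypothetical stand-in for ClassifierByLookupTable1, not Orange's actual implementation):

```python
# Minimal sketch of a one-attribute lookup: one table slot per
# attribute value, plus a trailing slot that is returned when the
# attribute value is unknown. (Hypothetical helper, not Orange's API.)

def make_lookup1(table):
    """table has one entry per value, plus a final 'unknown' entry."""
    def classify(value_index):
        if value_index is None:      # unknown attribute value
            return table[-1]
        return table[value_index]
    return classify

# e is four-valued; e1 is "yes" only when e == 1 (0-based index 0)
e1_lookup = make_lookup1(["yes", "no", "no", "no", "?"])
```

Calling e1_lookup with an index returns the mapped value, and calling it with None returns the trailing "unknown" entry, mirroring the extra list element described above.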

Note that you don't need a ClassifierByLookupTable for this. An equivalent attribute, say e2, could be computed with a callback to Python, for instance:

e2.getValueFrom = lambda ex, rw: orange.Value(e2, ex["e"]=="1")

While functionally the same, using classifiers by lookup table is faster.


Classifiers by Lookup Table

Although the above example used ClassifierByLookupTable as if it were a concrete class, ClassifierByLookupTable is actually abstract. Calling its constructor is a typical Orange trick: what you get is never a ClassifierByLookupTable, but one of ClassifierByLookupTable1, ClassifierByLookupTable2 or ClassifierByLookupTable3. As their names suggest, the first classifies using a single attribute (that is what we had for e1), the second uses a pair of attributes (and was constructed for ab above), and the third uses three attributes. Class predictions for each combination of attribute values are stored in a (one-dimensional) table. To classify an example, the classifier computes the index of the table element that corresponds to the combination of attribute values.

These classifiers are built to be fast, not safe. If you, for instance, change the number of values for one of the attributes, Orange will most probably crash. To protect you somewhat, many of these classes' attributes are read-only and can only be set when the object is constructed.

Attributes

variable1[, variable2[, variable3]] (read only)
The attribute(s) that the classifier uses for classification. ClassifierByLookupTable1 only has variable1, ClassifierByLookupTable2 also has variable2 and ClassifierByLookupTable3 has all three.
variables (read only)
The above variables, returned as a tuple.
noOfValues1, noOfValues2[, noOfValues3] (read only)
The number of values for variable1, variable2 and variable3. These are stored here to make the classifier faster. They are defined only for ClassifierByLookupTable2 (the first two) and ClassifierByLookupTable3 (all three).
lookupTable (read only)
A list of values (ValueList), one for each possible combination of attributes. For ClassifierByLookupTable1, there is an additional element that is returned when the attribute's value is unknown. Values are ordered by the values of the attributes, with variable1 being the most significant. In the case of two three-valued attributes, the list order is therefore 1-1, 1-2, 1-3, 2-1, 2-2, 2-3, 3-1, 3-2, 3-3, where the first digit corresponds to variable1 and the second to variable2.

The list is read-only in the sense that you cannot assign a new list to this field. You can, however, change its elements. Don't change its size, though.

distributions (read only)
Similar to lookupTable, but is of type DistributionList and stores a distribution for each combination of values.
dataDescription
An object of type EFMDataDescription, defined only for ClassifierByLookupTable2 and ClassifierByLookupTable3. They use it to make predictions when one or more attribute values are unknown. ClassifierByLookupTable1 doesn't need it, since that case is covered by the additional element in lookupTable and distributions, as described above.
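The ordering of lookupTable entries described above can be sketched in plain Python (an illustrative helper, not part of Orange's API):

```python
# Enumerate attribute-value combinations in the order used by
# lookupTable: row-major, with variable1 the most significant.
# (Hypothetical helper for illustration; not Orange's API.)

def combination_order(n1, n2):
    """All (value1, value2) pairs for attributes with n1 and n2 values."""
    return [(v1, v2) for v1 in range(1, n1 + 1)
                     for v2 in range(1, n2 + 1)]

# Two three-valued attributes give 1-1, 1-2, 1-3, 2-1, 2-2, ...
order = combination_order(3, 3)
```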

Methods

ClassifierByLookupTable(classVar, variable1[, variable2[, variable3]] [, lookupTable[, distributions]])
A general constructor that, based on the number of attribute descriptors, constructs one of the three classes discussed above. If lookupTable and distributions are omitted, the constructor initializes them to two lists of the right sizes, filled with 'don't know' values and empty distributions, respectively. If they are given, they must be of the correct size.
ClassifierByLookupTable1(classVar, variable1[, lookupTable[, distributions]])
ClassifierByLookupTable2(classVar, variable1, variable2[, lookupTable[, distributions]])
ClassifierByLookupTable3(classVar, variable1, variable2, variable3[, lookupTable[, distributions]])
Class-specific constructors that you can call instead of the general constructor. The number of attributes must match the constructor called.
getindex(example)
Returns an index into lookupTable or distributions. The formula depends upon the type of the classifier. If valuei is int(example[variablei]), then the corresponding formulae are
ClassifierByLookupTable1:
index = value1, or len(lookupTable)-1 if value is unknown
ClassifierByLookupTable2:
index = value1*noOfValues2 + value2, or -1 if any value is unknown
ClassifierByLookupTable3:
index = (value1*noOfValues2 + value2) * noOfValues3 + value3, or -1 if any value is unknown
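Assuming all attribute values are known, the index computations above can be sketched in plain Python (the helper names are hypothetical, not Orange's):

```python
# Row-major index computation corresponding to getindex, for the
# case where all attribute values are known (with unknown values,
# the two- and three-attribute classifiers return -1 instead).
# (Illustrative sketch; not Orange's implementation.)

def index2(value1, value2, n2):
    # two attributes: variable1 is the most significant digit
    return value1 * n2 + value2

def index3(value1, value2, value3, n2, n3):
    # three attributes: extend the same scheme by one more digit
    return (value1 * n2 + value2) * n3 + value3

# For three-valued a and b, an example with a=3, b=3 has 0-based
# values (2, 2), giving index 2*3 + 2 = 8.
```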

Let's see some indices for randomly chosen examples from the original table.

part of ClassifierByLookupTable.py (continued from above) (uses monk1.tab)

>>> for i in range(5):
...     ex = data.randomexample()
...     print "%s: ab %i, e1 %i" % (ex, \
...         ab.getValueFrom.getindex(ex), \
...         e1.getValueFrom.getindex(ex))
['1', '1', '2', '2', '4', '1', '1']: ab 0, e1 3
['3', '3', '1', '2', '2', '1', '1']: ab 8, e1 1
['2', '1', '2', '3', '4', '2', '0']: ab 3, e1 3
['2', '1', '1', '2', '1', '1', '1']: ab 3, e1 0
['1', '1', '1', '2', '3', '1', '1']: ab 0, e1 2

Classifier by ExampleTable

ClassifierByExampleTable is the alternative to ClassifierByLookupTable, to be used when the classification is based on more than three attributes. Instead of a lookup table, it stores an ExampleTable, which is optimized for faster access.

This class is used in similar contexts as ClassifierByLookupTable. If you write, for instance, a constructive induction algorithm, it is recommended that the values of the new attribute are computed either by one of the classifiers by lookup table or by ClassifierByExampleTable, depending on the number of bound attributes.

Attributes

sortedExamples
An ExampleTable with sorted examples for lookup. Examples in the table can be merged: if there were multiple examples with the same attribute values (but possibly different classes), they are merged into a single example. Regardless of merging, the class values in this table carry distributions: their svalue contains a Distribution.
classifierForUnknown
This classifier is used to classify examples that were not found in the table. If classifierForUnknown is not set, 'don't know' values are returned.
variables (read only)
A tuple with the attributes in the domain. This field is here so that ClassifierByExampleTable appears more similar to ClassifierByLookupTable. If a constructive induction algorithm returns its result in one of these classifiers and you would like to check which attributes are used, you can use variables regardless of the class you actually got.

There are no specific methods for ClassifierByExampleTable. Since this is a classifier, it can be called. When the example to be classified includes unknown values, classifierForUnknown will be used if it is defined.

Although ClassifierByExampleTable is not really a classifier in the sense that you would use it to classify examples, but rather a function for computing intermediate values, it has an associated learner, LookupLearner. The learner's task is, basically, to construct the ExampleTable for sortedExamples. It sorts the examples, merges them and, of course, takes example weights into account in the process.
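The sorting-and-merging step can be sketched with stdlib Python (an unweighted, hypothetical sketch of the idea, not Orange's code): group examples by their attribute values and count the classes in each group.

```python
from collections import Counter, defaultdict

# Sketch of what a lookup learner does when building its sorted
# table: merge examples with identical attribute values and keep a
# class distribution for each merged example. (Unweighted sketch;
# the real LookupLearner also accounts for example weights.)

def merge_examples(examples):
    """examples: list of (attribute_values_tuple, class_value)."""
    groups = defaultdict(Counter)
    for attrs, cls in examples:
        groups[attrs][cls] += 1       # accumulate the class distribution
    return sorted(groups.items())     # sort by attribute values

merged = merge_examples([
    (("1", "1"), "1"), (("1", "2"), "0"),
    (("1", "1"), "1"), (("1", "2"), "1"),
])
```

Here the four input examples collapse into two merged entries, each carrying a class count, much like the svalue distributions shown below.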

part of ClassifierByExampleTable.py (uses monk1.tab)

import orange

data = orange.ExampleTable("monk1")
a, b, e = data.domain["a"], data.domain["b"], data.domain["e"]

data_s = data.select([a, b, e, data.domain.classVar])
abe = orange.LookupLearner(data_s)

In data_s, we have prepared a table in which examples are described only by a, b, e and the class. The learner constructs a ClassifierByExampleTable, abe, and stores the examples from data_s into its sortedExamples. The examples are merged so that there are no duplicates.

>>> print len(data_s)
432
>>> print len(abe.sortedExamples)
36
>>> for i in abe.sortedExamples[:10]:
...     print i
['1', '1', '1', '1']
['1', '1', '2', '1']
['1', '1', '3', '1']
['1', '1', '4', '1']
['1', '2', '1', '1']
['1', '2', '2', '0']
['1', '2', '3', '0']
['1', '2', '4', '0']
['1', '3', '1', '1']
['1', '3', '2', '0']

Well, there's a bit more here than meets the eye: each example's class value also stores the distribution of classes for all the examples that were merged into it. In our case, the three attributes suffice to unambiguously determine the classes and, since the examples cover the entire attribute space, each distribution has 12 examples in one class and none in the other.

>>> for i in abe.sortedExamples[:10]:
...     print i, i.getclass().svalue
['1', '1', '1', '1'] <0.000, 12.000>
['1', '1', '2', '1'] <0.000, 12.000>
['1', '1', '3', '1'] <0.000, 12.000>
['1', '1', '4', '1'] <0.000, 12.000>
['1', '2', '1', '1'] <0.000, 12.000>
['1', '2', '2', '0'] <12.000, 0.000>
['1', '2', '3', '0'] <12.000, 0.000>
['1', '2', '4', '0'] <12.000, 0.000>
['1', '3', '1', '1'] <0.000, 12.000>
['1', '3', '2', '0'] <12.000, 0.000>

ClassifierByExampleTable will usually be used through getValueFrom. So we would probably continue by constructing a new attribute and putting the classifier into its getValueFrom.

>>> y2 = orange.EnumVariable("y2", values = ["0", "1"]) >>> y2.getValueFrom = abe

There's something disturbing here. Although abe determines the value of y2, abe.classVar is still y. Orange doesn't mind (the whole example is artificial; you will seldom pack an entire dataset into a ClassifierByExampleTable), so neither should you. But still, for the sake of hygiene, you can conclude with

>>> abe.classVar = y2

The whole story can be greatly simplified. LookupLearner can also be called differently from other learners: besides the examples, you can pass the new class attribute and the attributes that should be used for classification. This saves us from constructing data_s and reassigning the classVar. It doesn't set getValueFrom, though.

part of ClassifierByExampleTable.py (uses monk1.tab)

import orange

data = orange.ExampleTable("monk1")
a, b, e = data.domain["a"], data.domain["b"], data.domain["e"]

y2 = orange.EnumVariable("y2", values = ["0", "1"])
abe2 = orange.LookupLearner(y2, [a, b, e], data)

Finally, let us show another use of LookupLearner. With the alternative call arguments, it offers an easy way to observe attribute interactions. For this purpose, we shall omit e and construct a ClassifierByExampleTable from a and b only.

part of ClassifierByExampleTable.py (uses monk1.tab)

y2 = orange.EnumVariable("y2", values = ["0", "1"])
abe2 = orange.LookupLearner(y2, [a, b], data)
for i in abe2.sortedExamples:
    print i, i.getclass().svalue

The script's output shows how the classes are distributed for different values of a and b.

['1', '1', '1'] <0.000, 48.000>
['1', '2', '0'] <36.000, 12.000>
['1', '3', '0'] <36.000, 12.000>
['2', '1', '0'] <36.000, 12.000>
['2', '2', '1'] <0.000, 48.000>
['2', '3', '0'] <36.000, 12.000>
['3', '1', '0'] <36.000, 12.000>
['3', '2', '0'] <36.000, 12.000>
['3', '3', '1'] <0.000, 48.000>

For instance, when a is '1' and b is '3', the majority class is '0', and the class distribution is 36:12 in favor of '0'.