Relevance of Attributes

There are a number of different measures for assessing the relevance of attributes with respect to how much information they contain about the corresponding class. These procedures are also known as attribute scoring. Orange implements several methods that all stem from MeasureAttribute. The most common ones compute certain statistics on conditional distributions of class values given the attribute values; in Orange, these are derived from MeasureAttributeFromProbabilities.


Base Classes

MeasureAttribute

MeasureAttribute is the base class for a wide range of classes that measure quality of attributes. The class itself is, naturally, abstract. Its fields merely describe what kinds of attributes it can handle and what kind of data it requires.

Attributes

handlesDiscrete
Tells whether the measure can handle discrete attributes.
handlesContinuous
Tells whether the measure can handle continuous attributes.
computesThresholds
Tells whether the measure implements the thresholdFunction.
needs
Tells what kind of data the measure needs. This can be either MeasureAttribute.NeedsGenerator, MeasureAttribute.NeedsDomainContingency or MeasureAttribute.NeedsContingency_Class. The first needs an example generator (Relief is an example of such a measure), the second can compute the quality from a DomainContingency, and the last needs only the contingency (ContingencyAttrClass), the attribute distribution and the apriori class distribution. Most measures are content with the latter. These fields can be inspected as in the sketch below.
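
For illustration, here is a minimal sketch that inspects these fields on two concrete measures described later on this page; it assumes only the classes and constants named above.

import orange

for meas in (orange.MeasureAttribute_info(), orange.MeasureAttribute_relief()):
    print type(meas).__name__
    print "  handles discrete attributes:  ", bool(meas.handlesDiscrete)
    print "  handles continuous attributes:", bool(meas.handlesContinuous)
    print "  computes thresholds:          ", bool(meas.computesThresholds)
    # compare 'needs' with one of the constants listed above
    print "  needs only the contingency:   ", \
        meas.needs == orange.MeasureAttribute.NeedsContingency_Class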

Several (but not all) measures can treat unknown attribute values in different ways, depending on the field unknownsTreatment (this field is not defined in MeasureAttribute but in many derived classes). Undefined values can be ignored (IgnoreUnknowns), can reduce the attribute's quality in proportion to the share of unknown values (ReduceByUnknown), can be imputed with the most common attribute value (UnknownsToCommon), or can be treated as a separate attribute value (UnknownsAsValue).

The default treatment is ReduceByUnknown, which is optimal in most cases and does not make additional presumptions (unlike, for instance, UnknownsToCommon, which supposes that missing values are not, say, results of measurements that were skipped because the needed information could be deduced from the other attributes). Use other treatments if you know that they make better sense for your data.
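
As a minimal sketch of switching the treatment (assuming the constants are exposed on MeasureAttribute, as the field descriptions above suggest):

import orange

data = orange.ExampleTable("lenses")

meas = orange.MeasureAttribute_info()
# ignore examples with unknown values of the scored attribute
meas.unknownsTreatment = orange.MeasureAttribute.IgnoreUnknowns
print "%6.4f" % meas("astigmatic", data)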

The only method supported by all measures is the call operator, to which we pass the data and which returns a number representing the quality of the attribute. The number does not have any absolute meaning and can vary widely for different attribute measures. The only common characteristic is that the higher the value, the better the attribute. If the attribute is so bad that its quality cannot be measured, the measure returns MeasureAttribute.Rejected. None of the measures described here do so.
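
Because scores from the same measure are comparable across attributes, a typical use is to rank attributes. A minimal sketch, using the lenses data set that also appears in the examples below:

import orange

data = orange.ExampleTable("lenses")
meas = orange.MeasureAttribute_gainRatio()

# score every attribute with the same measure and print them best-first
scores = [(meas(attr, data), attr.name) for attr in data.domain.attributes]
scores.sort(reverse = True)
for score, name in scores:
    print "%-12s %6.4f" % (name, score)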

There are different sets of arguments that the call operator can accept. Not all classes will accept all kinds of arguments. Relief, for instance, cannot be computed from contingencies alone. Besides, the attribute and the class need to be of the correct type for a particular measure.

Methods

__call__(attribute, examples[, apriori class distribution][, weightID])
__call__(attribute, domain contingency[, apriori class distribution])
__call__(contingency, class distribution[, apriori class distribution])
There are three call operators just to make your life simpler and faster. When working with the data, your method might have already computed, for instance, a contingency matrix. If so, and if the quality measure you use can work with it (as most measures can), you can pass the contingency matrix and the measure will be computed much faster. If, on the other hand, you only have examples and have not computed any statistics on them, you can pass the examples (and, optionally, an id for the meta-attribute with weights) and the measure will compute the contingency itself, if needed.

The argument attribute gives the attribute whose quality is to be assessed. It can be either a descriptor, an index into the domain or a name. In the first form, when the attribute is given by a descriptor, it does not need to be in the domain; it must, however, be computable from the attributes that are in the domain.

The data is given either as examples (and, optionally, an id for the meta-attribute with weights), a domain contingency (a list of contingencies), or a contingency matrix and a class distribution. If you use the latter form, what you should pass as the class distribution depends on how you treat unknown values (if there are any). If unknownsTreatment is IgnoreUnknowns, the class distribution should be computed on examples for which the attribute value is defined. Otherwise, the class distribution should be the overall class distribution.

The optional argument with the apriori class distribution is most often ignored. It comes in handy if the measure makes probability estimates based on apriori class probabilities (such as the m-estimate).
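
A brief sketch of the third call form with the optional apriori distribution; here we simply reuse the overall class distribution for it:

import orange

data = orange.ExampleTable("lenses")
meas = orange.MeasureAttribute_info()

cont = orange.ContingencyAttrClass("tear_rate", data)
classdistr = orange.Distribution(data.domain.classVar, data)
# contingency, class distribution, and the optional apriori class distribution
print "%6.4f" % meas(cont, classdistr, classdistr)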

thresholdFunction(attribute, examples[, weightID])
This function computes the qualities of different binarizations of the continuous attribute attribute. The attribute should of course be continuous. The result of the function is a list of tuples. The first element of each tuple is a threshold (thresholds lie in the middle between two neighbouring attribute values), the second is the measured quality of the corresponding binary attribute, and the last is the distribution giving the numbers of examples below and above the threshold. The last element may be missing; generally, if the particular measure can get the distribution without additional computational burden, it will do so and the caller can use it. If not, the caller needs to compute it itself.

The script below shows different ways to assess the quality of astigmatic, tear rate and the first attribute (whichever it is) in the dataset lenses.

part of measureattribute1.py (uses lenses.tab)

import orange, random data = orange.ExampleTable("lenses") meas = orange.MeasureAttribute_info() astigm = data.domain["astigmatic"] print "Information gain of 'astigmatic': %6.4f" % meas(astigm, data) classdistr = orange.Distribution(data.domain.classVar, data) cont = orange.ContingencyAttrClass("tear_rate", data) print "Information gain of 'tear_rate': %6.4f" % meas(cont, classdistr) dcont = orange.DomainContingency(data) print "Information gain of the first attribute: %6.4f" % meas(0, dcont) print

As for many other classes in Orange, you can construct the object and use it on-the-fly. For instance, to measure the quality of attribute "tear_rate", you could write simply

>>> print orange.MeasureAttribute_info("tear_rate", data)
0.548794984818

You shouldn't use this shortcut with ReliefF, though; see the explanation in the section on ReliefF.

It is also possible to assess the quality of attributes that do not exist in the dataset. For instance, you can assess the quality of discretized attributes without constructing a new domain and dataset that would include them.

measureattribute1a.py (uses iris.tab)

import orange, random data = orange.ExampleTable("iris") d1 = orange.EntropyDiscretization("petal length", data) print orange.MeasureAttribute_info(d1, data)

The quality of the new attribute d1 is assessed on data, which does not include the new attribute at all. (Note that ReliefF won't do that since it would be too slow. ReliefF requires the attribute to be present in the dataset.)

Finally, you can compute the quality of meta-attributes. The following script adds a meta-attribute to an example table, initializes it to random values and measures its information gain.

part of measureattribute1.py (uses lenses.tab)

mid = orange.newmetaid()
data.domain.addmeta(mid, orange.EnumVariable(values = ["v0", "v1"]))
data.addMetaAttribute(mid)

rg = random.Random()
rg.seed(0)
for ex in data:
    ex[mid] = orange.Value(rg.randint(0, 1))

print "Information gain for a random meta attribute: %6.4f" % \
    orange.MeasureAttribute_info(mid, data)

To show the computation of thresholds, we shall use the Iris data set.

measureattribute1a.py (uses iris.tab)

import orange data = orange.ExampleTable("iris") meas = orange.MeasureAttribute_relief() for t in meas.thresholdFunction("petal length", data): print "%5.3f: %5.3f" % t

If we had not constructed the measure in advance, we could simply write orange.MeasureAttribute_relief().thresholdFunction("petal length", data). This is not recommended for ReliefF, though, since it can be a lot slower.

The script below finds and prints out the best threshold for binarization of an attribute, that is, the threshold with which the resulting binary attribute will have the optimal ReliefF (or any other measure).

thresh, score, distr = meas.bestThreshold("petal length", data)
print "\nBest threshold: %5.3f (score %5.3f)" % (thresh, score)

MeasureAttributeFromProbabilities

MeasureAttributeFromProbabilities is the abstract base class for attribute quality measures that can be computed from contingency matrices only. It relieves the derived classes of computing the contingency matrix by defining the first two forms of the call operator. (Well, that's not something you need to know if you only work in Python.) An additional feature of this class is that you can set probability estimators. If none are given, class probabilities and conditional class probabilities are estimated by relative frequencies.

Attributes

unknownsTreatment
Defines what to do with unknown values. See the possibilities described above.
estimatorConstructor, conditionalEstimatorConstructor
The classes used to estimate unconditional and conditional class probabilities, respectively. You can set them to, for instance, ProbabilityEstimatorConstructor_m and ConditionalProbabilityEstimatorConstructor_ByRows (with its estimator constructor again set to ProbabilityEstimatorConstructor_m), as in the sketch below.
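
A minimal sketch that plugs m-estimation into gain ratio; the value m = 2 is an arbitrary choice for illustration, not something prescribed by the measure.

import orange

data = orange.ExampleTable("lenses")

# gain ratio with m-estimated (conditional) class probabilities
meas = orange.MeasureAttribute_gainRatio()
meas.estimatorConstructor = orange.ProbabilityEstimatorConstructor_m(m = 2)
meas.conditionalEstimatorConstructor = \
    orange.ConditionalProbabilityEstimatorConstructor_ByRows(
        estimatorConstructor = orange.ProbabilityEstimatorConstructor_m(m = 2))

for attr in data.domain.attributes:
    print "%-12s %6.4f" % (attr.name, meas(attr, data))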

Measures for Classification Problems

The following section describes the attribute quality measures suitable for discrete attributes and outcomes. See MeasureAttribute1.py, MeasureAttribute1a.py, MeasureAttribute1b.py, MeasureAttribute2.py and MeasureAttribute3.py for more examples of their use.

Information Gain

The most popular measure, information gain (MeasureAttribute_info), measures the expected decrease of the entropy.

Gain Ratio

Gain ratio (MeasureAttribute_gainRatio) was introduced by Quinlan in order to avoid overestimation of multi-valued attributes. It is computed as the information gain divided by the entropy of the attribute's values. (It has been shown, however, that such a measure still overestimates attributes with multiple values.)

Gini index

Gini index (MeasureAttribute_gini) was first introduced by Breiman and can be interpreted as the probability that two randomly chosen examples will have different classes.

Relevance

Relevance of attributes (MeasureAttribute_relevance) is a measure that discriminates between attributes on the basis of their potential value in the formation of decision rules.
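
To get a feel for these measures, here is a minimal sketch that scores the attributes of the lenses data set with each of them; the absolute values are not comparable across measures, only the ordering within one measure is meaningful.

import orange

data = orange.ExampleTable("lenses")

measures = [("info gain", orange.MeasureAttribute_info()),
            ("gain ratio", orange.MeasureAttribute_gainRatio()),
            ("gini", orange.MeasureAttribute_gini()),
            ("relevance", orange.MeasureAttribute_relevance())]

for attr in data.domain.attributes:
    scores = ["%s=%6.4f" % (name, meas(attr, data)) for name, meas in measures]
    print "%-12s %s" % (attr.name, "  ".join(scores))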

Costs

MeasureAttribute_cost evaluates attributes based on the "saving" achieved by knowing the value of the attribute, according to the specified cost matrix.

Attributes

cost
Cost matrix (see the page about cost matrices for details)

If the cost of predicting the first class for an example that actually belongs to the second class is 5, and the cost of the opposite error is 1, then an appropriate measure can be constructed and used for attribute 3 as follows.

>>> meas = orange.MeasureAttribute_cost()
>>> meas.cost = ((0, 5), (1, 0))
>>> meas(3, data)
0.083333350718021393

This tells us that knowing the value of attribute 3 would decrease the classification cost by approximately 0.083 per example.

ReliefF

ReliefF (MeasureAttribute_relief) was first developed by Kira and Rendell and then substantially generalized and improved by Kononenko. It measures the usefulness of attributes based on their ability to distinguish between very similar examples belonging to different classes.

Attributes

k
Number of neighbours for each example. Default is 5.
m
Number of reference examples. Default is 100. Set to -1 to take all the examples. (A construction sketch with k and m follows after this list.)
checkCachedData
A flag best left alone unless you know what you are doing.
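
A minimal construction sketch, with arbitrary values of k and m chosen for illustration:

import orange

data = orange.ExampleTable("iris")

# 10 neighbours per reference example, 50 reference examples
meas = orange.MeasureAttribute_relief(k = 10, m = 50)
for attr in data.domain.attributes:
    print "%-13s %6.4f" % (attr.name, meas(attr, data))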

Computation of ReliefF is rather slow since it needs to find the k nearest neighbours for each of m reference examples (or all examples, if m is set to -1). Since we normally compute ReliefF for all attributes in the dataset, MeasureAttribute_relief caches the results. When it is called to compute the quality of a certain attribute, it computes the qualities of all attributes in the dataset. When called again, it uses the stored results if the domain is still the same and the example table has not changed. Checking is done by comparing the data table version (see the page on ExampleTable for details) and then computing a checksum of the data and comparing it with the previous checksum. The latter can take some time on large tables, so you may want to disable it by setting checkCachedData to False. In most cases this will do no harm, except when the data is changed in a way that passes unnoticed by the version control, in which case the computed ReliefF values can be wrong. Hence: disable it if you know that the data does not change or if you know what kinds of changes are detected by the version control.

Caching will only have an effect if you use the same instance for all attributes in the domain. So, don't do this:

for attr in data.domain.attributes:
    print orange.MeasureAttribute_relief(attr, data)

In this script, the cached data dies together with the instance of MeasureAttribute_relief, which is constructed and destructed for each attribute separately. It is much faster to do it like this:

meas = orange.MeasureAttribute_relief()
for attr in data.domain.attributes:
    print meas(attr, data)

When called for the first time, meas will compute ReliefF for all attributes and the subsequent calls simply return the stored data.

Class MeasureAttribute_relief works on both discrete and continuous classes and thus implements the functionality of the algorithms ReliefF and RReliefF.

Note that ReliefF can also compute the threshold function, that is, the attribute quality at different thresholds for binarization.

Finally, here is an example which shows what can happen if you disable the computation of checksums.

data = orange.ExampleTable("iris") r1 = orange.MeasureAttribute_relief() r2 = orange.MeasureAttribute_relief(checkCachedData = False) print "%.3f\t%.3f" % (r1(0, data), r2(0, data)) for ex in data: ex[0] = 0 print "%.3f\t%.3f" % (r1(0, data), r2(0, data))

The first print statement outputs the same number, 0.321, twice. Then we zero out the first attribute in all examples. r1 notices the change and returns -1 as its ReliefF, while r2 does not and returns the same number, 0.321, which is now wrong.

Measures for Regression Problems

Except for ReliefF, the only attribute quality measure available for regression problems is based on the mean square error.

Mean Square Error

The mean square error measure is implemented in the class MeasureAttribute_MSE; a short usage sketch follows the attribute list below.

Attributes

unknownsTreatment
Tells what to do with unknown attribute values. See the description at the top of this page.
m
Parameter for m-estimate of error. Default is 0 (no m-estimate).
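
A minimal usage sketch, assuming a regression data set with discrete attributes, such as servo.tab (distributed with Orange), is available; the value of m is an arbitrary choice for illustration.

import orange

data = orange.ExampleTable("servo")

meas = orange.MeasureAttribute_MSE()
meas.m = 2.0   # the default is 0, i.e. no m-estimate

for attr in data.domain.attributes:
    print "%-8s %8.4f" % (attr.name, meas(attr, data))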