There are a number of different measures for assessing the relevance of attributes with respect to how much information they contain about the corresponding class. These procedures are also known as attribute scoring. Orange implements several methods, all of which stem from MeasureAttribute. The most common ones compute certain statistics on the conditional distributions of class values given the attribute values; in Orange, these are derived from MeasureAttributeFromProbabilities.
MeasureAttribute is the base class for a wide range of classes that measure the quality of attributes. The class itself is, naturally, abstract. Its fields merely describe what kinds of attributes it can handle and what kind of data it requires.
Attributes

needs
Tells what kind of data the measure needs: MeasureAttribute.NeedsGenerator, MeasureAttribute.NeedsDomainContingency or MeasureAttribute.NeedsContingency_Class. The first needs an example generator (Relief is an example of such a measure), the second can compute the quality from a DomainContingency, and the last needs only the contingency (ContingencyAttrClass), the attribute distribution and the apriori class distribution. Most measures are content with the latter.

unknownsTreatment
Several (but not all) measures can treat unknown attribute values in different ways, depending on this field (it is not defined in MeasureAttribute but in many derived classes). Undefined values can be
ignored (MeasureAttribute.IgnoreUnknowns); this has the same effect as if the examples for which the attribute value is unknown were removed;
punished (MeasureAttribute.ReduceByUnknown); the attribute quality is reduced by the proportion of unknown values. In impurity measures, this can be interpreted as if the impurity is decreased only on examples for which the value is defined and stays the same for the others, and the attribute quality is the average impurity decrease;
imputed (MeasureAttribute.UnknownsToCommon); here, undefined values are replaced by the most common attribute value. If you want a cleverer imputation, you should do it in advance;
treated as a separate value (MeasureAttribute.UnknownsAsValue).
The default treatment is ReduceByUnknown, which is optimal in most cases and makes no additional presumptions (unlike, for instance, UnknownsToCommon, which supposes that values are not missing for a reason, such as measurements skipped because the needed information could be extracted from the other attributes). Use other treatments if you know that they make better sense for your data.
The only method supported by all measures is the call operator, to which we pass the data and get a number representing the quality of the attribute. The number does not have any absolute meaning and can vary widely between different attribute measures. The only common characteristic is that the higher the value, the better the attribute. If an attribute is so bad that its quality cannot be measured, the measure returns MeasureAttribute.Rejected. None of the measures described here does so.
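A minimal sketch of such a call (assuming the lenses data; information gain and the IgnoreUnknowns setting are illustrative choices):

    import orange

    data = orange.ExampleTable("lenses")
    meas = orange.MeasureAttribute_info()
    # optional; supported by many derived measures, as described above
    meas.unknownsTreatment = orange.MeasureAttribute.IgnoreUnknowns

    print(meas("tear_rate", data))   # higher values mean a better attribute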
There are different sets of arguments that the call operator can accept. Not all classes will accept all kinds of arguments. Relief, for instance, cannot be computed from contingencies alone. Besides, the attribute and the class need to be of the correct type for a particular measure.
Methods
Argument attribute gives the attribute whose quality is to be assessed. This can be either a descriptor, an index into the domain or a name. If the attribute is given by descriptor, it does not need to be in the domain, but it must be computable from the attributes that are.
The data is given either as examples (and, optionally, the id of a meta-attribute with example weights), a domain contingency (a list of contingencies), or a contingency matrix together with the class distribution. If you use the latter form, what you should pass as the class distribution depends upon what you do with unknown values (if there are any): if unknownsTreatment is IgnoreUnknowns, the class distribution should be computed on examples for which the attribute value is defined; otherwise, it should be the overall class distribution.
The optional argument with the apriori class distribution is most often ignored. It comes in handy if the measure makes probability estimates based on apriori class probabilities (such as the m-estimate).
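A sketch of two of these call forms (assuming the lenses data; the measure and attribute are illustrative):

    import orange

    data = orange.ExampleTable("lenses")
    meas = orange.MeasureAttribute_info()
    attr = data.domain["tear_rate"]

    # attribute and examples
    print(meas(attr, data))

    # contingency matrix and the (overall) class distribution
    cont = orange.ContingencyAttrClass(attr, data)
    class_dist = orange.Distribution(data.domain.classVar, data)
    print(meas(cont, class_dist))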
thresholdFunction(attribute, data) assesses the quality of a continuous attribute at different binarization thresholds. The attribute should, of course, be continuous. The result is a list of tuples, where the first element is a threshold (thresholds lie in the middle between two existing attribute values), the second is the measured quality of the corresponding binary attribute, and the last is the distribution giving the number of examples below and above the threshold. The last element may be missing; generally, if the particular measure can obtain the distribution without any computational burden, it will do so and the caller can use it. If not, the caller needs to compute it itself.

The script below shows different ways to assess the quality of astigmatic, tear_rate and the first attribute (whichever it is) in the dataset lenses.
part of measureattribute1.py (uses lenses.tab)
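A sketch of what such a listing might look like (the original script is not reproduced here; information gain is an illustrative choice of measure):

    import orange

    data = orange.ExampleTable("lenses")
    meas = orange.MeasureAttribute_info()

    astigm = data.domain["astigmatic"]
    print("Information gain of 'astigmatic': %6.4f" % meas(astigm, data))
    print("Information gain of 'tear_rate':  %6.4f" % meas("tear_rate", data))
    print("Information gain of the first attribute: %6.4f" % meas(0, data))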
As for many other classes in Orange, you can construct the object and use it on-the-fly. For instance, to measure the quality of attribute "tear_rate", you could write simply
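    # a sketch of the shortcut (information gain chosen for illustration)
    print(orange.MeasureAttribute_info("tear_rate", data))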
You shouldn't use this shortcut with ReliefF, though; see the explanation in the section on ReliefF.
It is also possible to assess the quality of attributes that do not exist in the dataset. For instance, you can assess the quality of discretized attributes without constructing a new domain and dataset that would include them.
measureattribute1a.py (uses iris.tab)
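A sketch of such an assessment (assuming entropy-based discretization with orange.EntropyDiscretization; the attribute and the measure are illustrative):

    import orange

    data = orange.ExampleTable("iris")

    # construct a discretized version of 'petal length' without adding it to the data
    d1 = orange.EntropyDiscretization("petal length", data)

    meas = orange.MeasureAttribute_info()
    print(meas(d1, data))   # d1 is computed from 'petal length' on the fly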
The quality of the new attribute d1
is assessed on data
, which does not include the new attribute at all. (Note that ReliefF won't do that since it would be too slow. ReliefF requires the attribute to be present in the dataset.)
Finally, you can compute the quality of meta-attributes. The following script adds a meta-attribute to an example table, initializes it to random values and measures its information gain.
part of measureattribute1.py (uses lenses.tab)
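A sketch of such a script (the name and the two values of the meta-attribute are illustrative):

    import orange, random

    data = orange.ExampleTable("lenses")

    # create a discrete attribute and register it as a meta-attribute
    rand_attr = orange.EnumVariable("random", values=["0", "1"])
    mid = orange.newmetaid()
    data.domain.addmeta(mid, rand_attr)

    # initialize it to random values
    random.seed(0)
    for ex in data:
        ex[rand_attr] = random.choice(["0", "1"])

    print(orange.MeasureAttribute_info(rand_attr, data))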
To show the computation of thresholds, we shall use the Iris data set.
measureattribute1a.py (uses iris.tab)
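A sketch of the threshold computation (assuming ReliefF with default settings and the attribute 'petal length'):

    import orange

    data = orange.ExampleTable("iris")
    meas = orange.MeasureAttribute_relief()

    for t in meas.thresholdFunction("petal length", data):
        # each tuple holds the threshold, the score and, optionally, the distribution
        print("%5.3f: %5.3f" % (t[0], t[1]))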
If we hadn't constructed the measure in advance, we could write orange.MeasureAttribute_relief().thresholdFunction("petal length", data). This is not recommended for ReliefF, though, since it can be a lot slower.
The script below finds and prints out the best threshold for binarization of an attribute, that is, the threshold with which the resulting binary attribute will have the optimal ReliefF (or any other measure).
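One way to do it, sketched here, is to simply take the maximum over the threshold function; the original script may be organized differently:

    import orange

    data = orange.ExampleTable("iris")
    meas = orange.MeasureAttribute_relief()

    best = max(meas.thresholdFunction("petal length", data), key=lambda t: t[1])
    print("Best threshold %5.3f (ReliefF %5.3f)" % (best[0], best[1]))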
MeasureAttributeFromProbabilities is the abstract base class for attribute quality measures that can be computed from contingency matrices alone. It relieves the derived classes from having to compute the contingency matrix by implementing the first two forms of the call operator. (This is not something you need to know if you only work in Python.) An additional feature of this class is that you can set the probability estimators; if none are given, probabilities and conditional probabilities of classes are estimated by relative frequencies.
Attributes

The estimators of unconditional and conditional class probabilities are set through the corresponding attributes; they can be set, for instance, to ProbabilityEstimatorConstructor_m and ConditionalProbabilityEstimatorConstructor_ByRows (with its estimator constructor again set to ProbabilityEstimatorConstructor_m), respectively.
The following section describes the attribute quality measures suitable for discrete attributes and outcomes. See MeasureAttribute1.py, MeasureAttribute1a.py, MeasureAttribute1b.py, MeasureAttribute2.py and MeasureAttribute3.py for more examples of their use.
The most popular measure, information gain (MeasureAttribute_info), measures the expected decrease of entropy.
Gain ratio (MeasureAttribute_gainRatio) was introduced by Quinlan in order to avoid overestimation of multi-valued attributes. It is computed as the information gain divided by the entropy of the attribute's value distribution. (It has been shown, however, that such a measure still overestimates attributes with multiple values.)
Gini index (MeasureAttribute_gini
) was first introduced by Breiman and can be interpreted as the probability that two randomly chosen examples will have different classes.
Relevance of attributes (MeasureAttribute_relevance) is a measure that discriminates between attributes on the basis of their potential value in the formation of decision rules.
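A sketch comparing these measures on a single attribute of the lenses data (the class names are as given above; the attribute is an illustrative choice):

    import orange

    data = orange.ExampleTable("lenses")
    measures = [("Information gain", orange.MeasureAttribute_info()),
                ("Gain ratio", orange.MeasureAttribute_gainRatio()),
                ("Gini index", orange.MeasureAttribute_gini()),
                ("Relevance", orange.MeasureAttribute_relevance())]

    for name, meas in measures:
        print("%-16s %6.4f" % (name, meas("tear_rate", data)))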
The cost-based measure (MeasureAttribute_cost) evaluates attributes based on the "saving" achieved by knowing the value of the attribute, according to the specified cost matrix.
Attributes

cost
The cost matrix that specifies the costs of misclassification.
If the cost of predicting the first class for an example that is actually in the second is 5, and the cost of the opposite error is 1, then an appropriate measure can be constructed and used for attribute 3 as follows.
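    # A sketch of the construction described above. The two-class dataset
    # ("voting") is an illustrative stand-in, and assigning the cost matrix as
    # a nested sequence is assumed to be accepted; the saving of about 0.083
    # quoted below refers to the original example's data.
    import orange

    data = orange.ExampleTable("voting")

    meas = orange.MeasureAttribute_cost()
    meas.cost = ((0, 5), (1, 0))  # rows: predicted class, columns: actual class
    print(meas(3, data))          # "saving" for the attribute with index 3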
This tells us that knowing the value of attribute 3 would decrease the classification cost by approximately 0.083 per example.
ReliefF (MeasureAttribute_relief) was first developed by Kira and Rendell and then substantially generalized and improved by Kononenko. It measures the usefulness of attributes based on their ability to distinguish between very similar examples belonging to different classes.
Attributes

k
The number of neighbours to be found for each (reference) example.

m
The number of reference examples; -1 means that all examples are used.

checkCachedData
Tells whether a checksum of the data should be computed to verify that cached results are still valid (see below).
Computation of ReliefF is rather slow, since it needs to find the k nearest neighbours for each of m reference examples (or for all examples, if m is set to -1). Since we normally compute ReliefF for all attributes in the dataset, MeasureAttribute_relief caches the results: when it is called to compute the quality of a certain attribute, it computes the qualities of all attributes in the dataset, and when called again, it uses the stored results provided the data has not changed, that is, the domain is still the same and the example table has not been modified. Checking is done by comparing the data table version (see ExampleTable for details) and then computing a checksum of the data and comparing it with the previous one. The latter can take some time on large tables, so you may want to disable it by setting checkCachedData to False. In most cases this will do no harm, except when the data is changed in a way that passes unnoticed by the version control, in which case the computed ReliefF values can be wrong. Hence: disable it only if you know that the data does not change, or if you know what kinds of changes are detected by the version control.
Caching will only have an effect if you use the same instance for all attributes in the domain. So, don't do this:
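    # The anti-pattern described above, as a sketch: a fresh MeasureAttribute_relief
    # is constructed for every attribute, so nothing can be cached
    # ("data" is any example table, e.g. orange.ExampleTable("lenses")).
    for attr in data.domain.attributes:
        print("%15s: %5.3f" % (attr.name, orange.MeasureAttribute_relief(attr, data)))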
In this script, the cached data dies together with the instance of MeasureAttribute_relief, which is constructed and destroyed for each attribute separately. It is much faster to go like this:
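    # The preferred pattern (a sketch): a single instance is reused, so ReliefF
    # is computed once for all attributes and later calls read from the cache.
    meas = orange.MeasureAttribute_relief()
    for attr in data.domain.attributes:
        print("%15s: %5.3f" % (attr.name, meas(attr, data)))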
When called for the first time, meas
will compute ReliefF for all attributes and the subsequent calls simply return the stored data.
Class MeasureAttribute_relief works on both discrete and continuous classes and thus implements the functionality of the ReliefF and RReliefF algorithms.
Note that ReliefF can also compute the threshold function, that is, the attribute quality at different thresholds for binarization.
Finally, here is an example which shows what can happen if you disable the computation of checksums.
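What follows is a sketch of such an experiment; the dataset, the measured attribute and the way it is spoiled are illustrative, and the numbers quoted below (0.321 and -1) are those from the original example.

    import orange

    data = orange.ExampleTable("iris")
    attr = data.domain[0]

    r1 = orange.MeasureAttribute_relief()
    r2 = orange.MeasureAttribute_relief()
    r2.checkCachedData = False      # r2 trusts its cache without checksumming the data

    print("%.3f %.3f" % (r1(attr, data), r2(attr, data)))

    # Overwrite the first attribute in place; the table's version does not change,
    # so only the checksum (which r2 skips) can reveal the modification.
    for ex in data:
        ex[0] = 0.0

    print("%.3f %.3f" % (r1(attr, data), r2(attr, data)))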
The first print statement prints out the same number, 0.321, twice. Then we annul the first attribute. r1 notices this and returns -1 as its ReliefF, while r2 does not and returns the same number, 0.321, which is now wrong.
Except for ReliefF, the only attribute quality measure available for regression problems is based on a mean square error.
The mean square error measure is implemented in the class MeasureAttribute_MSE.
Attributes