This page describes a set of classes that implement different metrics for measuring distances between examples.
Most (although not all) measures of distance between examples require some "learning", that is, adjusting the measure to the data. For instance, when the dataset contains continuous attributes, the distances between continuous values should be normalized, e.g. by dividing the distance by the range of possible values or by some interquartile distance, so that all attributes have, in principle, a similar impact.
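The effect of range normalization can be illustrated with a minimal sketch in plain Python. It does not use Orange; the function name range_normalized_difference and the sample values are hypothetical and serve only to show why raw differences on different scales become comparable once divided by the attribute's range.

```python
def range_normalized_difference(a, b, values):
    """Absolute difference between a and b, divided by the range of `values`.

    `values` stands in for the "training" data of one continuous attribute.
    """
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # A constant attribute cannot separate examples.
        return 0.0
    return abs(a - b) / float(span)


# Two hypothetical continuous attributes on very different scales.
ages = [20, 35, 50, 70]        # range is 50
incomes = [1000, 3000, 9000]   # range is 8000

age_diff = range_normalized_difference(20, 45, ages)
income_diff = range_normalized_difference(1000, 5000, incomes)
```

Without normalization, the raw income difference (4000) would dominate the raw age difference (25); after dividing by the ranges, both contribute equally.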
Different measures of distance thus appear in pairs: a class that measures the distance and a class that constructs it based on the data. The abstract classes representing such a pair are ExamplesDistance and ExamplesDistanceConstructor. Since most measures work on normalized distances between corresponding attributes, there is an abstract intermediate class, ExamplesDistance_Normalized, that takes care of normalizing. The remaining classes correspond to different ways of defining the distances, such as Manhattan or Euclidean distance.
Unknown values are treated correctly only by Euclidean and Relief distance. For the other measures of distance, the distance between an unknown and a known value, or between two unknown values, is always 0.5.
Methods
ExamplesDistance. Not all the data needs to be given. Most measures can be constructed from DomainBasicAttrStat; if it is not given, they can fall back on either examples or DomainDistributions. Some (e.g. ExamplesDistance_Hamming) do not need any arguments at all.
This abstract class provides a function that is given two examples and returns a list of normalized distances between the values of their attributes. Many distance-measuring classes need such a function and are therefore derived from this class.
Attributes
Methods
Hamming distance between two examples is defined as the number of attributes in which the two examples differ. Note that this measure is not really appropriate for examples that contain continuous attributes.
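The definition above can be sketched in a few lines of plain Python. This is an illustration only, not the ExamplesDistance_Hamming implementation; the function name hamming_distance and the example values are hypothetical, and examples are represented simply as lists of attribute values.

```python
def hamming_distance(example1, example2):
    """Number of attributes in which the two examples differ."""
    if len(example1) != len(example2):
        raise ValueError("examples must have the same number of attributes")
    return sum(1 for a, b in zip(example1, example2) if a != b)


# Two hypothetical examples with four discrete attributes each.
e1 = ["young", "myope", "no", "reduced"]
e2 = ["young", "hypermetrope", "no", "normal"]
discrete_distance = hamming_distance(e1, e2)

# With continuous values, any difference counts as a full mismatch,
# however small it is.
continuous_distance = hamming_distance([1.70], [1.71])
```

The last call shows why the measure is a poor fit for continuous attributes: two nearly identical values are treated exactly like two completely different ones.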
This class is derived directly from ExamplesDistance, not from ExamplesDistance_Normalized.
Note: in some previous versions of Orange, this distance was wrongly referred to as Hamiltonian, not Hamming. This has been corrected without providing any aliases for backward compatibility.
The maximal distance (also called infinite distance) between two examples is defined as the largest of the distances between the values of corresponding attributes. If dist is the result of ExamplesDistance_Normalized.attributeDistances, then ExamplesDistance_Maximal returns max(dist).
Manhattan distance between two examples is the sum of absolute values of distances between pairs of attributes, e.g. apply(add, [abs(x) for x in dist]), where dist is the result of ExamplesDistance_Normalized.attributeDistances.
Euclidean distance is the square root of the sum of squared per-attribute distances, i.e. sqrt(apply(add, [x*x for x in dist])), where dist is the result of ExamplesDistance_Normalized.attributeDistances.
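The maximal, Manhattan and Euclidean definitions all reduce a list of per-attribute distances to a single number. The sketch below shows the three reductions side by side in plain Python; the list dist is a hypothetical stand-in for the result of ExamplesDistance_Normalized.attributeDistances, and the built-in sum is used in place of apply(add, ...), since apply is not available in Python 3.

```python
import math

# Hypothetical normalized per-attribute distances between two examples.
dist = [0.5, 0.0, 1.0, 0.25]

# Maximal distance: the largest per-attribute distance.
maximal = max(dist)

# Manhattan distance: the sum of absolute per-attribute distances.
manhattan = sum(abs(x) for x in dist)

# Euclidean distance: the square root of the sum of squares.
euclidean = math.sqrt(sum(x * x for x in dist))
```

For any such list, the maximal distance is never larger than the Euclidean distance, which in turn is never larger than the Manhattan distance.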
Methods
DomainDistributions that holds the distributions for all discrete attributes. This is needed to compute distances between known and unknown values.
This measure of distance deals with unknown values by computing the expected square of distance based on the distribution obtained from the "training" data. Squared distance between
Continuous cases can be handled by the averages and variances inherited from ExamplesDistance_Normalized. The data for discrete cases are stored in distributions (used for unknown vs. known value) and in bothSpecial (the precomputed distance between two unknown values).
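The idea of an expected squared distance can be sketched in plain Python under explicitly stated assumptions: an unknown value is treated as an independent draw from the distribution observed in the "training" data, and the distance between two discrete values is 0 if they are equal and 1 otherwise. The function names are hypothetical, and the formulas follow from these assumptions rather than from the Orange source, so the actual implementation may differ (for instance in how continuous values are normalized).

```python
def expected_sq_known_unknown_discrete(known_value, probabilities):
    """Probability that a random value differs from `known_value`."""
    return 1.0 - probabilities.get(known_value, 0.0)


def expected_sq_both_unknown_discrete(probabilities):
    """Probability that two independently drawn values differ."""
    return 1.0 - sum(p * p for p in probabilities.values())


def expected_sq_known_unknown_continuous(x, mean, variance):
    """E[(x - U)^2] for a random U with the given mean and variance."""
    return (x - mean) ** 2 + variance


def expected_sq_both_unknown_continuous(variance):
    """E[(U - V)^2] for independent U, V with the same variance."""
    return 2.0 * variance


# Hypothetical distribution of one discrete attribute.
probabilities = {"a": 0.5, "b": 0.25, "c": 0.25}

known_vs_unknown = expected_sq_known_unknown_discrete("a", probabilities)
both_unknown = expected_sq_both_unknown_discrete(probabilities)

# Hypothetical statistics of one continuous attribute.
continuous_known_vs_unknown = expected_sq_known_unknown_continuous(1.0, 0.5, 0.25)
continuous_both_unknown = expected_sq_both_unknown_continuous(0.25)
```

Under these assumptions, the value for two unknown discrete values depends only on the distribution, which is consistent with storing it precomputed, as bothSpecial does.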
See the output of examplesdistance-missing.py for an example.
ExamplesDistance_Relief is similar to Manhattan distance, but incorporates a more correct treatment of undefined values, which is used by the ReliefF measure.
If attributes are discrete, ExamplesDistance_Manhattan basically counts the number of attributes in which two examples differ. It is therefore easy to "check" its results.
examplesdistance.py (uses lenses.tab)
The printout begins with: