Distances between Examples

This page describes a set of classes that implement different metrics for measuring distances between examples.

Most (although not all) measures of distance between examples require some "learning", that is, adjusting the measure to the data. For instance, when the dataset contains continuous attributes, the distances between continuous values should be normalized, e.g. by dividing the distance by the range of possible values or by some interquartile distance, to ensure that all attributes have, in principle, similar impact.

Different measures of distance thus appear in pairs - a class that measures the distance and a class that constructs it based on the data. The abstract classes representing such a pair are ExamplesDistance and ExamplesDistanceConstructor. Since most measures work on normalized distances between corresponding attributes, there is an abstract intermediate class ExamplesDistance_Normalized that takes care of normalizing. The remaining classes correspond to different ways of defining the distances, such as Manhattan or Euclidean distance.
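
The split into a constructor and a measure can be illustrated without Orange. The following is a minimal sketch in plain Python, not Orange's implementation: the class names RangeScaledManhattan and RangeScaledManhattanConstructor are hypothetical, and examples are assumed to be plain lists of continuous values without unknowns.

```python
class RangeScaledManhattan(object):
    """Hypothetical distance measure: holds factors learned from data."""

    def __init__(self, factors):
        self.factors = factors

    def __call__(self, example1, example2):
        # Sum of absolute differences, each scaled by its attribute's factor.
        return sum(abs(a - b) * f
                   for a, b, f in zip(example1, example2, self.factors))


class RangeScaledManhattanConstructor(object):
    """Hypothetical constructor: learns 1/(max - min) for each attribute."""

    def __call__(self, examples):
        factors = []
        for column in zip(*examples):
            value_range = max(column) - min(column)
            # A constant attribute gets factor 0, so it is ignored.
            factors.append(1.0 / value_range if value_range else 0.0)
        return RangeScaledManhattan(factors)


data = [[1.0, 100.0], [2.0, 300.0], [3.0, 200.0]]
distance = RangeScaledManhattanConstructor()(data)
d = distance(data[0], data[1])  # about 1/2 + 200/200 = 1.5
```

The constructor is called once on the data; the object it returns can then be called on any pair of examples from the same domain.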

Unknown values are treated correctly only by Euclidean and Relief distance. For other measures, the distance between an unknown and a known value, or between two unknown values, is always 0.5.


ExamplesDistance

Methods

__call__(example1, example2)
Returns the distance between the given examples as a floating point number.

ExamplesDistanceConstructor

Methods

__call__([examples, weightID][, DomainDistributions][, DomainBasicAttrStat])
Constructs an instance of ExamplesDistance. Not all the data needs to be given. Most measures can be constructed from DomainBasicAttrStat; if it is not given, they can compute what they need from either examples or DomainDistributions. Some (e.g. ExamplesDistance_Hamming) do not need any arguments at all.

ExamplesDistance_Normalized

This abstract class provides a function which is given two examples and returns a list of normalized distances between values of their attributes. Many distance measuring classes need such a function and are therefore derived from this class.

Attributes

normalizers
A precomputed list of normalizing factors for attribute values:
  • if a factor is positive, differences in the attribute's values are multiplied by it; for continuous attributes the factor is 1/(max_value-min_value) and for ordinal attributes the factor is 1/number-of-values. If either (or both) of the values is unknown, the distance is 0.5.
  • if a factor is -1, the attribute is nominal; the distance between two values is 0 if they are the same (or at least one is unknown) and 1 if they are different.
  • if a factor is 0, the attribute is ignored.
bases, averages, variances
The minimal values, averages and variances (continuous attributes only)
domainVersion
Stores the domain version for which the normalizers were computed. The domain version is increased each time a domain description is changed (i.e. attributes are added or removed); this is used for a quick check that the user is not attempting to measure distances between examples that do not correspond to the normalizers. Since domains are practically immutable (especially from Python), you don't need to care about this anyway.

Methods

attributeDistances(example1, example2)
Returns a list of floats representing distances between pairs of attribute values of the two examples.
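
The rules listed under normalizers can be sketched in plain Python as follows. This is an illustration under stated assumptions, not Orange's code: the function name attribute_distances is hypothetical, unknown values are represented by None, continuous and ordinal values are assumed to be numbers, and ignored attributes are assumed to contribute 0.

```python
def attribute_distances(example1, example2, normalizers):
    """Per-attribute distances following the rules listed for normalizers."""
    distances = []
    for v1, v2, factor in zip(example1, example2, normalizers):
        unknown = v1 is None or v2 is None
        if factor > 0:
            # Continuous or ordinal: scaled difference, 0.5 if a value is unknown.
            distances.append(0.5 if unknown else abs(v1 - v2) * factor)
        elif factor == -1:
            # Nominal: 0 if equal or at least one unknown, 1 otherwise.
            distances.append(0.0 if unknown or v1 == v2 else 1.0)
        else:
            # Factor 0: the attribute is ignored (assumed to contribute 0).
            distances.append(0.0)
    return distances


normalizers = [0.25, -1, 0, 0.25]   # 0.25 = 1/(max - min) for a range of 4
e1 = [1.0, "red", 7, None]
e2 = [3.0, "blue", 9, 2.0]
dist = attribute_distances(e1, e2, normalizers)  # [0.5, 1.0, 0.0, 0.5]
```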

ExamplesDistance_Hamming / ExamplesDistanceConstructor_Hamming

Hamming distance between two examples is defined as the number of attributes in which the two examples differ. Note that this measure is not really appropriate for examples that contain continuous attributes.

This class is derived directly from ExamplesDistance, not from ExamplesDistance_Normalized.
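
As a plain-Python illustration of the definition (a sketch, not Orange's implementation; the function name hamming_distance is hypothetical and unknown values are not handled), the distance simply counts positions with different values. The last call shows why the measure suits continuous attributes poorly: any difference, however small, counts as 1.

```python
def hamming_distance(example1, example2):
    """Number of positions at which the two examples have different values."""
    return float(sum(1 for v1, v2 in zip(example1, example2) if v1 != v2))


a = ["young", "myope", "no", "reduced"]
b = ["young", "hypermetrope", "yes", "reduced"]
d = hamming_distance(a, b)                            # 2.0
tiny = hamming_distance([1.0, 5.0], [1.0001, 5.0])    # 1.0
```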

Note: in some previous versions of Orange, this distance was wrongly referred to as Hamiltonian, not Hamming. This has been corrected without providing any aliases for backward compatibility.

ExamplesDistance_Maximal / ExamplesDistanceConstructor_Maximal

The maximal distance (also called infinite distance) between two examples is defined as the largest of the distances between their corresponding attribute values. If dist is the result of ExamplesDistance_Normalized.attributeDistances, then ExamplesDistance_Maximal returns max(dist).

ExamplesDistance_Manhattan / ExamplesDistanceConstructor_Manhattan

Manhattan distance between two examples is the sum of absolute values of distances between pairs of attribute values, e.g. sum([abs(x) for x in dist]), where dist is the result of ExamplesDistance_Normalized.attributeDistances.

ExamplesDistance_Euclidean / ExamplesDistanceConstructor_Euclidean

Euclidean distance is the square root of the sum of squared per-attribute distances, i.e. sqrt(sum([x*x for x in dist])), where dist is the result of ExamplesDistance_Normalized.attributeDistances.
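
The maximal, Manhattan and Euclidean distances differ only in how they combine the per-attribute distances. The sketch below applies all three to the same list; here dist is a made-up stand-in for the result of attributeDistances, and the function names are hypothetical.

```python
from math import sqrt


def maximal(dist):
    # Largest per-attribute distance.
    return max(dist)


def manhattan(dist):
    # Sum of absolute per-attribute distances.
    return sum(abs(x) for x in dist)


def euclidean(dist):
    # Square root of the sum of squared per-attribute distances.
    return sqrt(sum(x * x for x in dist))


dist = [0.0, 0.75, 1.0]
results = (maximal(dist), manhattan(dist), euclidean(dist))  # (1.0, 1.75, 1.25)
```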

Attributes

distributions
An object of type DomainDistributions that holds the distributions for all discrete attributes. This is needed to compute distances between known and unknown values.
bothSpecialDist
A list containing the distance between two unknown values for each discrete attribute.

This measure of distance deals with unknown values by computing the expected square of distance based on the distribution obtained from the "training" data. Squared distance between

  • a known and unknown continuous attribute equals squared distance between the known and the average, plus variance
  • two unknown continuous attributes equals double variance
  • a known and unknown discrete attribute equals the probability that the unknown attribute has a different value than the known one (i.e. 1 - probability of the known value)
  • two unknown discrete attributes equals the probability that two randomly chosen values are different, which can be computed as 1 - sum of squares of probabilities.

Continuous cases can be handled by averages and variances inherited from ExamplesDistance_Normalized. The data for discrete cases are stored in distributions (used for unknown vs. known value) and in bothSpecialDist (the precomputed distance between two unknown values).
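
The four rules are expectations of the squared difference; for a known value x and an unknown value with average m and variance s2, for instance, the expected value of (x - Y)^2 is (x - m)^2 + s2. A minimal plain-Python sketch of the rules follows. It is not Orange's code: the function names are hypothetical, unknown values are represented by None, normalization is left out, and the probabilities are given as a dictionary rather than a DomainDistributions object.

```python
from math import sqrt


def sq_dist_continuous(x, y, average, variance):
    """Expected squared distance for one continuous attribute."""
    if x is None and y is None:
        return 2.0 * variance
    if x is None or y is None:
        known = y if x is None else x
        return (known - average) ** 2 + variance
    return (x - y) ** 2


def sq_dist_discrete(x, y, probabilities):
    """Expected squared distance for one discrete attribute."""
    if x is None and y is None:
        # Probability that two randomly drawn values differ.
        return 1.0 - sum(p * p for p in probabilities.values())
    if x is None or y is None:
        known = y if x is None else x
        # Probability that the unknown value differs from the known one.
        return 1.0 - probabilities[known]
    return 0.0 if x == y else 1.0


probs = {"a": 0.5, "b": 0.25, "c": 0.25}
c_known = sq_dist_continuous(3.0, None, 2.0, 0.5)   # (3 - 2)^2 + 0.5 = 1.5
c_both = sq_dist_continuous(None, None, 2.0, 0.5)   # 2 * 0.5 = 1.0
d_known = sq_dist_discrete("a", None, probs)        # 1 - 0.5 = 0.5
d_both = sq_dist_discrete(None, None, probs)        # 1 - 0.375 = 0.625

# Distance between two examples with one continuous and one discrete attribute.
distance = sqrt(c_known + d_known)
```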

See the output of examplesdistance-missing.py for an example.

ExamplesDistance_Relief / ExamplesDistanceConstructor_Relief

ExamplesDistance_Relief is similar to Manhattan distance, but incorporates a more correct treatment of undefined values, which is used by the ReliefF measure.


Example

If attributes are discrete, ExamplesDistance_Manhattan basically counts the number of attributes in which two examples differ. It is therefore easy to "check" its results.

examplesdistance.py (uses lenses.tab)

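import orange
# Load the data and "learn" a Manhattan distance from it.
data = orange.ExampleTable("lenses")
distance = orange.ExamplesDistanceConstructor_Manhattan(data)
# Print the distance of every example to the first one.
ref = data[0]
print "*** Reference example: ", ref
for ex in data:
    print ex, distance(ex, ref)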

The printout begins with:

*** Reference example:  ['young', 'myope', 'no', 'reduced', 'none']
['young', 'myope', 'no', 'reduced', 'none'] 0.0
['young', 'myope', 'no', 'normal', 'soft'] 1.0
['young', 'myope', 'yes', 'reduced', 'none'] 1.0
['young', 'myope', 'yes', 'normal', 'hard'] 2.0
['young', 'hypermetrope', 'no', 'reduced', 'none'] 1.0
['young', 'hypermetrope', 'no', 'normal', 'soft'] 2.0
['young', 'hypermetrope', 'yes', 'reduced', 'none'] 2.0
['young', 'hypermetrope', 'yes', 'normal', 'hard'] 3.0
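
The printed distances can be checked with a few lines of plain Python that do not need Orange. The sketch below copies the eight examples from the printout and counts differing values. It assumes that the last value of each example is the class and does not contribute to the distance; this assumption is consistent with the printed numbers but is not stated in the text above. The function name count_differences is hypothetical.

```python
rows = [
    ['young', 'myope', 'no', 'reduced', 'none'],
    ['young', 'myope', 'no', 'normal', 'soft'],
    ['young', 'myope', 'yes', 'reduced', 'none'],
    ['young', 'myope', 'yes', 'normal', 'hard'],
    ['young', 'hypermetrope', 'no', 'reduced', 'none'],
    ['young', 'hypermetrope', 'no', 'normal', 'soft'],
    ['young', 'hypermetrope', 'yes', 'reduced', 'none'],
    ['young', 'hypermetrope', 'yes', 'normal', 'hard'],
]
printed = [0.0, 1.0, 1.0, 2.0, 1.0, 2.0, 2.0, 3.0]


def count_differences(example, reference):
    # Compare all values except the last one (assumed to be the class).
    pairs = zip(example[:-1], reference[:-1])
    return float(sum(1 for a, b in pairs if a != b))


computed = [count_differences(row, rows[0]) for row in rows]
matches = computed == printed  # True
```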