Discretization


Example-based automatic discretization is in essence similar to learning: given a set of examples, a discretization method proposes a list of suitable intervals to cut the attribute's values into. For this reason, Orange structures for discretization resemble its structures for learning. Objects derived from orange.Discretization play the role of a "learner" that, upon observing the examples, constructs an orange.Discretizer whose role is to convert continuous values into discrete ones according to the rule found by the Discretization.

Orange core now supports several methods of discretization; here's a list of methods with the corresponding classes.

Equi-distant discretization (EquiDistDiscretization, EquiDistDiscretizer)
The range of the attribute's values is split into a prescribed number of equal-sized intervals.

Quantile-based discretization (EquiNDiscretization, IntervalDiscretizer)
The range is split into intervals containing an equal number of examples.

Entropy-based discretization (EntropyDiscretization, IntervalDiscretizer)
Developed by Fayyad and Irani, this method balances the entropy within intervals against the MDL of the discretization.

Bi-modal discretization (BiModalDiscretization, BiModalDiscretizer/IntervalDiscretizer)
Two cut-off points are set so as to maximize the difference between the class distribution in the middle interval and the distributions outside it.

Fixed discretization (FixedDiscretization, IntervalDiscretizer)
Discretization with user-prescribed cut-off points.

General Schema

Instances of classes derived from orange.Discretization define a single method: the call operator. The call can also be invoked through the constructor, by passing the call operator's arguments to it.

__call__(attribute, examples[, weightID])
Given a continuous attribute, examples and, optionally, the id of the attribute with example weights, this function returns a discretized attribute. The argument attribute can be a descriptor, an index or a name of the attribute.

Here's an example.

part of discretization.py (uses iris.tab)

import orange
data = orange.ExampleTable("iris")
sep_w = orange.EntropyDiscretization("sepal width", data)
data2 = data.select([data.domain["sepal width"], sep_w, data.domain.classVar])
for ex in data2[:10]:
    print ex

The discretized attribute sep_w is constructed with a call to EntropyDiscretization (instead of constructing it and calling it afterwards, we passed the arguments for calling to the constructor, as is often allowed in Orange). We then constructed a new ExampleTable with attributes "sepal width" (the original continuous attribute), sep_w and the class attribute. Script output is:

[3.500000, '>3.30', 'Iris-setosa']
[3.000000, '(2.90, 3.30]', 'Iris-setosa']
[3.200000, '(2.90, 3.30]', 'Iris-setosa']
[3.100000, '(2.90, 3.30]', 'Iris-setosa']
[3.600000, '>3.30', 'Iris-setosa']
[3.900000, '>3.30', 'Iris-setosa']
[3.400000, '>3.30', 'Iris-setosa']
[3.400000, '>3.30', 'Iris-setosa']
[2.900000, '<2.90', 'Iris-setosa']
[3.100000, '(2.90, 3.30]', 'Iris-setosa']

EntropyDiscretization named the new attribute's values by the interval range (it also named the attribute as "D_sepal width"). The new attribute's values get computed automatically when they are needed.

As those who have read about Variable know, the answer to the question "How does this work?" is hidden in the attribute's getValueFrom field. This little dialog reveals the secret.

>>> sep_w
EnumVariable 'D_sepal width'
>>> sep_w.getValueFrom
>>> sep_w.getValueFrom.whichVar
FloatVariable 'sepal width'
>>> sep_w.getValueFrom.transformer
>>> sep_w.getValueFrom.transformer.points
<2.90000009537, 3.29999995232>

So, the select statement in the above example converted all examples from data to the new domain. Since the new domain includes the attribute sep_w, which is not present in the original, sep_w's values need to be computed on the fly. For each example in data, sep_w.getValueFrom is called to compute sep_w's value (if you ever need this yourself, don't call getValueFrom directly, but call computeValue instead). sep_w.getValueFrom looks up the value of "sepal width" in the original example. The original, continuous sepal width is passed to the transformer, which determines the interval from its points field. The transformer returns the discrete value, which is in turn returned by getValueFrom and stored in the new example.

You don't need to understand this mechanism exactly. It's important to know that there are two classes of objects for discretization. Those derived from Discretizer (such as IntervalDiscretizer that we've seen above) are used as transformers that translate continuous value into discrete. Discretization algorithms are derived from Discretization. Their job is to construct a Discretizer and return a new variable with the discretizer stored in getValueFrom.transformer.
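The two-class design can be sketched in a few lines of plain Python. This is a hypothetical mini-model for illustration only (SketchDiscretizer and SketchEquiDistDiscretization are made-up names, not Orange classes): the "discretization" observes the data and constructs a "discretizer", which then performs the value-by-value conversion.

```python
from bisect import bisect_left

class SketchDiscretizer:
    """Plays the role of a Discretizer: converts one continuous
    value into the index of the interval it falls into."""
    def __init__(self, points):
        self.points = points          # sorted cut-off points

    def __call__(self, value):
        # values less than or equal to a cut-off point belong
        # to the interval below it
        return bisect_left(self.points, value)

class SketchEquiDistDiscretization:
    """Plays the role of a Discretization: observes the data and
    constructs a discretizer with equal-width cut-off points."""
    def __init__(self, numberOfIntervals=4):
        self.numberOfIntervals = numberOfIntervals

    def __call__(self, values):
        lo, hi = min(values), max(values)
        step = (hi - lo) / self.numberOfIntervals
        points = [lo + step * i for i in range(1, self.numberOfIntervals)]
        return SketchDiscretizer(points)
```

A usage sketch: calling SketchEquiDistDiscretization(4) on a list of values returns a discretizer whose points split the range into four intervals, and the discretizer then maps any single value to an interval index.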


Discretizers

Different discretizers support different methods for the conversion of continuous values into discrete ones. The most general is IntervalDiscretizer, which is also used by most discretization methods. Two other discretizers, EquiDistDiscretizer and ThresholdDiscretizer, could easily be replaced by IntervalDiscretizer, but are kept for speed and simplicity. The fourth discretizer, BiModalDiscretizer, is specialized for discretizations induced by BiModalDiscretization.

All discretizers support a handy method for construction of a new attribute from an existing one.

Methods

constructVariable(attribute)
Constructs a new attribute descriptor; the new attribute is a discretized version of attribute. The new attribute's name equals attribute.name prefixed by "D_", and its symbolic values are discretizer-specific. The above example shows what comes out of IntervalDiscretizer. Discretization algorithms actually first construct a discretizer and then call its constructVariable to construct the attribute descriptor.

An example of how this method is used is shown in the following section about IntervalDiscretizer.

IntervalDiscretizer

IntervalDiscretizer is the most common discretizer. It made its first appearance in the example in the General Schema section, and you will see more of it later. It has a single interesting attribute.

Attributes

points
Cut-off points. All values below or equal to the first point belong to the first interval, those between the first and the second (including those equal to the second) go to the second interval, and so forth, up to the last interval, which covers all values greater than the last element of points. The number of intervals is thus len(points)+1.

Let us manually construct an interval discretizer with cut-off points at 3.0 and 5.0. We shall use the discretizer to construct a discretized sepal length.

part of discretization.py (uses iris.tab)

idisc = orange.IntervalDiscretizer(points = [3.0, 5.0])
sep_l = idisc.constructVariable(data.domain["sepal length"])
data2 = data.select([data.domain["sepal length"], sep_l, data.domain.classVar])

That's all. First five examples of data2 are now

[5.100000, '>5.00', 'Iris-setosa']
[4.900000, '(3.00, 5.00]', 'Iris-setosa']
[4.700000, '(3.00, 5.00]', 'Iris-setosa']
[4.600000, '(3.00, 5.00]', 'Iris-setosa']
[5.000000, '(3.00, 5.00]', 'Iris-setosa']

Can you use the same discretizer for more than one attribute? Yes, as long as they share the same cut-off points, of course. Simply call constructVariable for each continuous attribute.

part of discretization.py (uses iris.tab)

idisc = orange.IntervalDiscretizer(points = [3.0, 5.0])
newattrs = [idisc.constructVariable(attr) for attr in data.domain.attributes]
data2 = data.select(newattrs + [data.domain.classVar])

Each attribute now has its own ClassifierFromVar in its getValueFrom, but all of them use the same IntervalDiscretizer, idisc. Changing an element of its points affects all attributes.

Do not change the length of points if the discretizer is used by any attribute. The length of points should always match the number of values of the attribute, which is determined by the length of the attribute's field values. Therefore, if attr is a discretized attribute, then len(attr.values) must equal len(attr.getValueFrom.transformer.points)+1. It always does, unless you deliberately change it. If the sizes don't match, Orange will probably crash, and it will be entirely your fault.

EquiDistDiscretizer

EquiDistDiscretizer is a bit faster but more rigid than IntervalDiscretizer: it uses intervals of fixed width.

Attributes

firstCut
The first cut-off point.
step
Width of intervals.
numberOfIntervals
Number of intervals.
points (read-only)
The cut-off points; this is not a real attribute, although it behaves like one. Reading it constructs a list of cut-off points and returns it, but changing the list doesn't affect the discretizer - it's a separate list. This attribute exists only to give EquiDistDiscretizer the same interface as that of IntervalDiscretizer.

All values below firstCut belong to the first interval. Any other value val falls into the interval 1 + floor((val-firstCut)/step); if this index turns out to be greater than or equal to numberOfIntervals, it is decreased to numberOfIntervals-1, so out-of-range values end up in the last interval.

This discretizer is returned by EquiDistDiscretization; you can see an example in the corresponding section. You can also construct an EquiDistDiscretizer manually and call its constructVariable, just as shown above for IntervalDiscretizer.

ThresholdDiscretizer

ThresholdDiscretizer converts continuous values into binary ones by comparing them with a threshold. This discretizer is not used by any discretization method, but you can use it for manual discretization. Orange needs this discretizer for the binarization of continuous attributes in decision trees.

Attributes

threshold
Threshold; values below or equal to the threshold belong to the first interval and those that are greater go to the second.
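The rule is simple enough to write out in plain Python (a hypothetical one-liner, not Orange code):

```python
def threshold_index(threshold, value):
    """Sketch of the binarization rule: 0 for values below or equal
    to the threshold, 1 for greater ones."""
    return 0 if value <= threshold else 1
```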

BiModalDiscretizer

This is the first discretizer on this list that could not be replaced by IntervalDiscretizer. It has two cut-off points, and values are discretized according to whether or not they belong to the middle region (which includes the lower but not the upper boundary). The discretizer is returned by BiModalDiscretization if its field splitInTwo is true (which it is by default); see an example there.

Attributes

low, high
Lower and upper boundary of the interval. The lower is included in the interval and the upper is not.

Discretization Algorithms

Discretization with Intervals of Equal Size

EquiDistDiscretization discretizes the attribute by cutting it into the prescribed number of intervals of equal width. The examples are needed to determine the span of attribute values. The interval between the smallest and the largest is then cut into equal parts.

Attributes

numberOfIntervals
Number of intervals into which the attribute is to be discretized. Default value is 4.

As an example, we shall discretize all attributes of the Iris dataset into six intervals. We shall construct an ExampleTable with the discretized attributes and print a description of each.

disc = orange.EquiDistDiscretization(numberOfIntervals = 6)
newattrs = [disc(attr, data) for attr in data.domain.attributes]
data2 = data.select(newattrs + [data.domain.classVar])
for attr in newattrs:
    print "%s: %s" % (attr.name, attr.values)

The script's output is

D_sepal length: <<4.90, [4.90, 5.50), [5.50, 6.10), [6.10, 6.70), [6.70, 7.30), >7.30>
D_sepal width: <<2.40, [2.40, 2.80), [2.80, 3.20), [3.20, 3.60), [3.60, 4.00), >4.00>
D_petal length: <<1.98, [1.98, 2.96), [2.96, 3.94), [3.94, 4.92), [4.92, 5.90), >5.90>
D_petal width: <<0.50, [0.50, 0.90), [0.90, 1.30), [1.30, 1.70), [1.70, 2.10), >2.10>

Is there a more decent way for a script to find the interval boundaries than parsing the symbolic values? Sure - they are hidden in the discretizer, which is, as usual, stored in attr.getValueFrom.transformer.

Compare the following with the values above.

>>> for attr in newattrs:
...     print "%s: first interval at %5.3f, step %5.3f" % \
...     (attr.name, attr.getValueFrom.transformer.firstCut, \
...     attr.getValueFrom.transformer.step)
D_sepal length: first interval at 4.900, step 0.600
D_sepal width: first interval at 2.400, step 0.400
D_petal length: first interval at 1.980, step 0.980
D_petal width: first interval at 0.500, step 0.400

Like all discretizers, EquiDistDiscretizer also has the method constructVariable. The following example discretizes all attributes into five intervals of width 1, with the first cut-off point at 2.0.

edisc = orange.EquiDistDiscretizer(firstCut = 2.0, step = 1.0, numberOfIntervals = 5)
newattrs = [edisc.constructVariable(attr) for attr in data.domain.attributes]
data2 = data.select(newattrs + [data.domain.classVar])
for ex in data2[:10]:
    print ex

Discretization with Intervals Containing (Approximately) Equal Number of Examples

EquiNDiscretization discretizes the attribute by cutting it into the prescribed number of intervals so that each of them contains an equal number of examples. The examples are, obviously, needed for this discretization, too.

Attributes

numberOfIntervals
Number of intervals into which the attribute is to be discretized. Default value is 4.

The use of this discretization is equivalent to the above, except that we use EquiNDiscretization instead of EquiDistDiscretization. The resulting discretizer is an IntervalDiscretizer, so it has points instead of firstCut, step and numberOfIntervals.
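What such a quantile-based discretization computes can be sketched in plain Python. This is a hypothetical helper that ignores ties between equal values; Orange's actual placement of cut-off points may differ.

```python
def equal_frequency_points(values, n_intervals):
    """Cut-off points splitting the sorted values into n_intervals
    groups of (roughly) equal size; each cut is placed halfway between
    the two neighbouring values at the group boundary."""
    vals = sorted(values)
    n = len(vals)
    points = []
    for k in range(1, n_intervals):
        i = k * n // n_intervals        # index where the k-th group ends
        points.append((vals[i - 1] + vals[i]) / 2)
    return points
```

For instance, splitting the values 1 through 8 into four groups places cut-offs at 2.5, 4.5 and 6.5, so each interval holds two examples.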

Entropy-based Discretization

Fayyad and Irani's discretization method works without a predefined number of intervals. Instead, it recursively splits intervals at the cut-off point that minimizes the entropy, until the decrease in entropy is smaller than the increase in MDL induced by the new cut-off point.
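The idea can be sketched in plain Python. The following is a hypothetical, simplified implementation of the Fayyad-Irani stopping criterion, for illustration only (entropy in bits; cut-offs placed halfway between neighbouring values; Orange's implementation may differ in details):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def mdlp_cuts(values, labels):
    """Recursively split at the boundary that minimizes class entropy,
    keeping a cut only while its information gain exceeds the MDL cost
    of encoding it (a sketch of the Fayyad-Irani criterion)."""
    pairs = sorted(zip(values, labels))
    cuts = []

    def split(lo, hi):
        part = pairs[lo:hi]
        n = hi - lo
        if n < 2:
            return
        labs = [l for _, l in part]
        base_ent = entropy(labs)
        best = None
        for i in range(1, n):
            if part[i - 1][0] == part[i][0]:
                continue  # no boundary between equal values
            left, right = labs[:i], labs[i:]
            e = (len(left) * entropy(left) + len(right) * entropy(right)) / n
            if best is None or e < best[0]:
                best = (e, i)
        if best is None:
            return
        e, i = best
        left, right = labs[:i], labs[i:]
        gain = base_ent - e
        k, k1, k2 = len(set(labs)), len(set(left)), len(set(right))
        delta = math.log2(3 ** k - 2) - (
            k * base_ent - k1 * entropy(left) - k2 * entropy(right))
        if gain <= (math.log2(n - 1) + delta) / n:
            return  # MDL says the cut does not pay off
        cuts.append((part[i - 1][0] + part[i][0]) / 2)
        split(lo, lo + i)
        split(lo + i, hi)

    split(0, len(pairs))
    return sorted(cuts)
```

On a toy attribute whose two classes are cleanly separated, a single cut between the clusters is accepted, while further splits of the now-pure intervals are rejected by the MDL test.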

An interesting thing about this discretization technique is that an attribute can be discretized into a single interval, if no suitable cut-off points are found. If this is the case, the attribute is rendered useless and can be removed. This discretization can therefore also serve for feature subset selection.

Attributes

forceAttribute
Forces the algorithm to induce at least one cut-off point, even when its information gain is lower than MDL (default: false).

part of discretization.py (uses iris.tab)

entro = orange.EntropyDiscretization()
for attr in data.domain.attributes:
    disc = entro(attr, data)
    print "%s: %s" % (attr.name, disc.getValueFrom.transformer.points)

The output shows that all attributes are discretized into three intervals:

sepal length: <5.5, 6.09999990463>
sepal width: <2.90000009537, 3.29999995232>
petal length: <1.89999997616, 4.69999980927>
petal width: <0.600000023842, 1.70000004768>

Bi-Modal Discretization

BiModalDiscretization sets two cut-off points so that the class distribution of the examples between them is as different from the overall distribution as possible. The difference is measured by the chi-square statistic. All possible pairs of cut-off points are tried, thus the discretization runs in O(n^2).
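The search can be sketched in plain Python. This is a hypothetical O(n^2) illustration, with the middle interval taken as (low, high]; Orange's exact boundary convention and choice of candidate points may differ.

```python
from collections import Counter
from itertools import combinations

def chi_square(dist_a, dist_b, classes):
    """Chi-square statistic comparing two class distributions."""
    n_a, n_b = sum(dist_a.values()), sum(dist_b.values())
    total = n_a + n_b
    s = 0.0
    for c in classes:
        both = dist_a.get(c, 0) + dist_b.get(c, 0)
        for dist, n in ((dist_a, n_a), (dist_b, n_b)):
            expected = both * n / total
            if expected:
                s += (dist.get(c, 0) - expected) ** 2 / expected
    return s

def bimodal_cuts(values, labels):
    """Try every pair of cut-off points; keep the pair whose middle
    interval's class distribution differs most, by chi-square, from
    the distribution outside it (a sketch of the bi-modal idea)."""
    candidates = sorted(set(values))
    classes = sorted(set(labels))
    best = None
    for low, high in combinations(candidates, 2):
        middle = Counter(l for v, l in zip(values, labels)
                         if low < v <= high)
        outside = Counter(l for v, l in zip(values, labels)
                          if not (low < v <= high))
        if not middle or not outside:
            continue  # both regions must be populated
        score = chi_square(middle, outside, classes)
        if best is None or score > best[0]:
            best = (score, low, high)
    return best[1], best[2]
```

On a toy attribute where class "b" occupies the middle of the range, the search recovers the boundaries of the region where "b" dominates.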

This discretization method is especially suitable for attributes in which the middle region corresponds to normal and the outer regions to abnormal values. Depending on the nature of the attribute, we can treat the lower and the higher values separately, discretizing the attribute into three intervals, or together, yielding a binary attribute whose values correspond to normal and abnormal.

Attributes

splitInTwo
Decides whether the resulting attribute should have three or two values. If true (default), it has three intervals and the discretizer is of type BiModalDiscretizer. If false, the result is an ordinary IntervalDiscretizer.

The Iris dataset has a three-valued class attribute; the classes are setosa, versicolor and virginica. Sepal lengths of versicolors lie between the lengths of setosas and virginicas (this can be seen in a graph drawn using LOESS probability estimation; see the documentation on the naive Bayesian learner).

If we merge classes setosa and virginica into one, we can observe whether the bi-modal discretization would correctly recognize the interval in which versicolors dominate.

newclass = orange.EnumVariable("is versicolor", values = ["no", "yes"])
newclass.getValueFrom = lambda ex, w: ex["iris"] == "Iris-versicolor"
newdomain = orange.Domain(data.domain.attributes, newclass)
data_v = orange.ExampleTable(newdomain, data)

In this script, we constructed a new class attribute that tells whether an iris is a versicolor or not. We specified, with a simple lambda function, how this attribute's value is computed from the original class value. Finally, we constructed a new domain and converted the examples. Now for the discretization.

bimod = orange.BiModalDiscretization()
for attr in data_v.domain.attributes:
    disc = bimod(attr, data_v)
    print "%s: (%5.3f, %5.3f)" % (attr.name, \
        disc.getValueFrom.transformer.low, \
        disc.getValueFrom.transformer.high)

Script prints out the middle intervals:

sepal length: (5.400, 6.200]
sepal width: (2.000, 2.900]
petal length: (1.900, 4.700]
petal width: (0.600, 1.600]

Judging by the graph, the cut-off points for "sepal length" make sense.