Preprocessors are classes that take examples, usually stored in an ExampleTable, and return a new ExampleTable with the examples somehow preprocessed - filtered, weighted etc. All preprocessors can therefore be called with an example generator and, optionally, the id of a meta-attribute with weights. They return either an ExampleTable or a tuple with an ExampleTable and a meta-attribute id; the tuple is returned if an id was passed to the preprocessor or if the preprocessor itself added a weight. All other parameters (such as, for example, the level of noise or the attributes that are to be removed) are properties of the preprocessor and not (direct) arguments to the call.
Like many other classes in Orange, preprocessors can be used as objects or called as functions. Examples of both ways follow shortly.
There is also a method selectionVector, which instead of examples returns a list of booleans denoting which examples are accepted and which are not. This is, of course, only supported by preprocessors that filter examples, not by those that modify them.
Most of the code samples work with the lenses dataset. We thus suppose that Orange is imported, the data is loaded, and that there are variables corresponding to the attribute descriptors:
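For instance, along these lines:

import orange
data = orange.ExampleTable("lenses")
# descriptors for the four attributes and the class
age, prescr, astigm, tears, y = data.domain.variables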
Selection/removal of attributes is taken care of by the preprocessors Preprocessor_select and Preprocessor_ignore.
Attributes
attributes (the attributes, given as descriptors, to be selected or removed)
The long way to use Preprocessor_select is to construct the object, assign the attributes and call it.
pp-attributes.py (uses lenses.tab)
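A sketch of the long way:

pp = orange.Preprocessor_select()
pp.attributes = [age, tears]
data2 = pp(data)   # a new table, described only by the selected attributes (and the class)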
Note that you cannot pass attribute names (e.g. pp.attributes = ["age", "tears"]), since the domain is not known at the time the preprocessor is constructed. Variables age and tears are attribute descriptors.
A quicker way to use a preprocessor is to construct the object, pass the arguments and set the options in a single call.
pp-attributes.py (uses lenses.tab)
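A sketch of the single-call form:

data2 = orange.Preprocessor_select(data, attributes = [age, tears])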
In most cases, however, we'll have the examples stored in an ExampleTable and will use its select method instead of these two preprocessors.
This section covers preprocessors for selection of examples. Selection can be random, based on criteria matching or on checking for the presence of (un)defined values.
As with attribute selection, there are again two preprocessors: Preprocessor_take keeps the examples that match the given criteria and Preprocessor_drop removes them.
Attributes
values (a ValueFilterList - don't mind about it if you don't need to - containing the criteria that an example must match to be selected/removed)
In the examples below we shall concentrate on Preprocessor_take; Preprocessor_drop works analogously.
pp-select.py (uses lenses.tab)
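A sketch matching the description below:

pp = orange.Preprocessor_take()
pp.values[prescr] = "hyper"
pp.values[age] = ["young", "psby"]
data2 = pp(data)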
We required "prescr" to be "hyper", and "age" to be "young" or "psby". The latter was given as a list of strings, and the former as a single string (although we could also pass it in a one-element list). The field pp.values
behaves like a dictionary, where keys are attributes and values are conditions.
This should be enough for most users. If you need to know everything: the condition is not simply a list of strings (["young", "psby"]), but an object of type ValueFilter_discrete. The acceptable values are stored in a field surprisingly called values, and there is another field, acceptSpecial, that decides what to do with special attribute values (don't know, don't care). You can check that pp.values[age].acceptSpecial is -1, which means that special values are simply ignored. If acceptSpecial is 0, the example is rejected, and if it is 1, it is accepted (as if the attribute's value were one of those listed in values). Should you for some reason want to specify the condition directly, you can do it with pp.values[age] = orange.ValueFilter_discrete(values = orange.ValueList(["young", "psby"], age)). (Did I just hear something about preferring the shortcut?)
More information on ValueFilterList can be found on the page about filters.
As you suspected, it is also possible to filter by values of continuous attributes. So, if age were continuous, we could select teenagers by giving the boundaries of the interval:
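Presumably along these lines (a sketch; the (min, max) pair corresponds to the fields described below):

pp.values[age] = (10, 19)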
Both boundaries are inclusive. How to select those outside an interval? By reversing the order:
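Like this (a sketch):

pp.values[age] = (19, 10)   # reversed bounds: take ages outside [10, 19]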
Again, this should be enough for most. For "hackers": the condition is stored as a ValueFilter_continuous, which has a common ancestor, ValueFilter, with ValueFilter_discrete. The boundaries are in the fields min and max. Here, min is always smaller than or equal to max; there is a flag outside, which is false by default. Again, you can construct the condition manually, with pp.values[age] = orange.ValueFilter_continuous(min=10, max=19).
Finally, here's a shortcut. The preprocessor's field values behaves like a dictionary and can thus also be initialized as one. The shortest way to achieve the same result as above is:
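For instance (a sketch):

pp = orange.Preprocessor_take(values = {prescr: "hyper", age: ["young", "psby"]})
data2 = pp(data)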
Preprocessor_removeDuplicates merges multiple occurrences of the same example into a single example whose weight is the sum of the weights of all merged examples. If the examples were originally non-weighted, a new weight meta-attribute is prepared. This preprocessor always returns a tuple with the examples and a weight id.
To show how to use it, we shall first remove the attribute age, thereby introducing duplicate examples.
pp-duplicates.py (uses lenses.tab)
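A sketch of the relevant steps:

data2 = orange.Preprocessor_ignore(data, attributes = [age])
data2, weightID = orange.Preprocessor_removeDuplicates(data2)
print(weightID)   # id of the new weight meta-attribute, e.g. -2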
The new weight attribute has id -2 (which can be checked by looking at weightID), and the resulting examples are merges of up to three original examples. Note that you may get an id other than -2 if you run the script multiple times.
There are four preprocessors that select or remove examples with missing values: Preprocessor_dropMissing removes examples with any missing values and Preprocessor_dropMissingClasses removes examples with an unknown class value. The other pair, Preprocessor_takeMissing and Preprocessor_takeMissingClasses, selects only the examples with at least one missing value or with an unknown class value, respectively. See the examples in the section about adding missing values.
Shuffling of attribute values among examples is done by Preprocessor_shuffle.
Attributes
attributes (the attributes whose values are to be permuted)
Here's how to use it:
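Perhaps like this (a sketch; d is an ExampleTable):

d2 = orange.Preprocessor_shuffle(d, attributes = [d.domain[0]])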
Executing this will create a new table d2, which will contain all examples from d, but with the values of the first attribute permuted.
Orange has separate preprocessors for discrete and continuous noise. When discrete noise is applied, a proportion of noisy values needs to be provided. If it is, for instance, 0.25, then every fourth value will be changed to a random value. Note that this does not mean that a quarter of the values will be changed, since a random value can be equal to the original one. For continuous noise, the value is modified by a random value drawn from a Gaussian distribution; the user provides the deviation.
Preprocessor_addClassNoise sets the example's class to a random value with a given probability.
Attributes
proportion (the proportion of examples whose class value is set to a random value)
randomGenerator (the random generator to use; if None, a new generator is constructed each time the preprocessor is called, and initialized with random seed 0)
part of pp-noise.py (uses lenses.tab)
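A sketch (proportion 0.5, as discussed below):

pp = orange.Preprocessor_addClassNoise()
pp.proportion = 0.5
data2 = pp(data)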
Note again that this doesn't mean that the class of half of the examples will be changed. The preprocessor works like this: if the dataset has N examples, the class attribute has v values and the noise proportion is p, then N*p/v randomly chosen examples will be assigned to class 1, another N*p/v randomly chosen examples will be assigned to class 2 and so forth, while N*(1-p) examples will be left alone. When the numbers do not divide evenly, they are not rounded to the closest integer; instead, the groups that get one example more or less are chosen at random.
For instance, the dataset lenses has 24 examples and its class attribute has three distinct values. Four randomly chosen examples are assigned to each of the three classes, while the remaining 12 examples are left as they are.
Preprocessor_addGaussianClassNoise adds Gaussian noise with the given deviation to a continuous class.
Attributes
deviation (Gaussian noise with this deviation is added to the class value of each example)
randomGenerator (if None, a new generator is constructed each time the preprocessor is called, and initialized with random seed 0)
To show how this works, we shall construct a simple example table with 20 examples, described only by the "class" attribute. It will be continuous and will always have the value 100. To this, we will apply Gaussian noise with deviation 10.
part of pp-noise.py (uses lenses.tab)
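A sketch, constructing the table by hand:

classVar = orange.FloatVariable("class")
domain = orange.Domain([classVar])   # the only variable becomes the class
data = orange.ExampleTable(domain, [[100] for i in range(20)])
pp = orange.Preprocessor_addGaussianClassNoise(deviation = 10)
data2 = pp(data)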
Preprocessor_addNoise sets attributes to random values; the probability can be prescribed for each attribute individually and for all attributes in general.
Attributes
proportions (the proportion of changed values, prescribed per attribute)
defaultProportion (the proportion of changed values for all attributes not listed in proportions)
randomGenerator (if None, a new generator is constructed each time the preprocessor is called, and initialized with random seed 0)
Note the treatment of the class attribute. If you want to add class noise with this preprocessor, it does not suffice to set defaultProportion, as this only applies to the other attributes. You need to specifically request the noise for the class attribute in proportions.
part of pp-noise.py (uses lenses.tab)
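A sketch (assuming proportions can be indexed by attribute descriptors, like values above):

pp = orange.Preprocessor_addNoise()
pp.proportions[age] = 0.3
pp.proportions[prescr] = 0.5
pp.defaultProportion = 0.2
data2 = pp(data)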
This preprocessor will set 30% of the values of "age" to random values, as well as 50% of the values of "prescr" and 20% of the values of the other attributes. See the description of Preprocessor_addClassNoise above for details on how the examples are selected. The class attribute will be left intact. Note that age and prescr are attribute descriptors, not strings or indices.
To add noise to continuous attributes, use Preprocessor_addGaussianNoise.
Attributes
deviations (the deviation of the noise, prescribed per attribute)
defaultDeviation (the deviation of the noise for all attributes not listed in deviations)
randomGenerator (if None, a new generator is constructed each time the preprocessor is called, and initialized with random seed 0)
The following script adds Gaussian noise with deviation 1.0 to all attributes in the iris dataset except "petal_width". To achieve this, it sets defaultDeviation to 1.0, but specifically sets the noise level for "petal_width" to 0.
part of pp-noise.py (uses lenses.tab)
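A sketch (the attribute name follows the text; in iris.tab it may be spelled "petal width"):

iris = orange.ExampleTable("iris")
pp = orange.Preprocessor_addGaussianNoise()
pp.defaultDeviation = 1.0
pp.deviations[iris.domain["petal width"]] = 0.0
iris2 = pp(iris)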
Preprocessors for adding missing values (that is, replacing known values with don't-knows or don't-cares) are similar to those for introducing noise. There are two preprocessors, one for all attributes and another that only manipulates the class attribute values.
Removing class values is taken care of by Preprocessor_addMissingClasses.
Attributes
proportion (the proportion of examples whose class value is removed)
specialType (the type of the special value that replaces it: orange.ValueTypes.DK or orange.ValueTypes.DC for "don't know" and "don't care")
randomGenerator (if None, a new generator is constructed each time the preprocessor is called, and initialized with random seed 0)
The following script replaces 50% of the class values with "don't know", prints out the classes of all examples, then removes the examples with missing classes and prints the classes out again.
part of pp-missing.py (uses lenses.tab)
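A sketch of the script's core:

pp = orange.Preprocessor_addMissingClasses()
pp.proportion = 0.5
pp.specialType = orange.ValueTypes.DK
data2 = pp(data)
print([str(ex.getclass()) for ex in data2])   # about half are '?'
data3 = orange.Preprocessor_dropMissingClasses(data2)
print([str(ex.getclass()) for ex in data3])   # no '?' left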
Preprocessor_addMissing replaces known values of attributes with unknowns of the prescribed type.
Attributes
proportions (the proportion of removed values, prescribed per attribute)
defaultProportion (the proportion of removed values for all attributes not listed in proportions)
specialType (orange.ValueTypes.DK or orange.ValueTypes.DC for "don't know" and "don't care")
randomGenerator (if None, a new generator is constructed each time the preprocessor is called, and initialized with random seed 0)
As with adding noise, this preprocessor does not manipulate the class value unless the class attribute is specifically listed in proportions.
The following example removes 20% of the values of "age" and 50% of the values of "astigm" in the lenses dataset, replacing them with "don't care". It then prints out the examples with missing values.
part of pp-missing.py (uses lenses.tab)
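A sketch (Preprocessor_takeMissing, described above, then picks out the affected examples):

pp = orange.Preprocessor_addMissing()
pp.proportions[age] = 0.2
pp.proportions[astigm] = 0.5
pp.specialType = orange.ValueTypes.DC
data2 = pp(data)
for ex in orange.Preprocessor_takeMissing(data2):
    print(ex)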
Orange stores the weights of examples as meta-attributes. Weights can be stored to and read from a file (like any other meta-attribute) if they are given in advance or computed outside Orange. Orange itself has two preprocessors for weighting examples.
Weighting by classes (Preprocessor_addClassWeight) assigns weights according to the classes of examples.
Attributes
classWeights (a list of weights, one for each class value)
equalize (if set, the class distribution is equalized prior to weighting; see below)
If you have, for instance, loaded the now famous lenses domain and want all examples of the first class to have a weight of 2.0 and those of the second and the third a weight of 1.0, you would achieve this as follows:
part of pp-weights.py (uses lenses.tab)
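A sketch:

pp = orange.Preprocessor_addClassWeight()
pp.classWeights = [2.0, 1.0, 1.0]
data2, weightID = pp(data)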
The script prints the class distribution computed with the new weights: the number of examples in the first class has seemingly doubled. Printing out the examples reveals that those belonging to the first class ("no") have a weight of 2.0, while the others weigh 1.0. Weighted examples can then be used for learning, either directly or through some sampling procedure, such as cross-validation.
In a highly unbalanced dataset, where the majority class prevails by a large margin over the minority, it is often desirable to reduce the number of majority class examples. The traditional way of doing this is to randomly select only a certain proportion of the examples belonging to the majority class. The alternative is to assign a smaller weight to the examples of the majority class (this, naturally, requires a learning algorithm capable of processing weights). For this purpose, Preprocessor_addClassWeight can be told to equalize the class distribution prior to weighting. Let's see how this works on lenses.
part of pp-weights.py (uses lenses.tab)
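A sketch:

pp = orange.Preprocessor_addClassWeight()
pp.equalize = 1
data2, weightID = pp(data)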
Equalizing computes such weights that the weighted number of examples in each class is equal. In our case, we originally had a total of 24 examples, of which 15 belong to the first class, and 5 and 4 to the other two. The preprocessor reweighted the examples so that there are 8 (one third) in each of the three classes. Examples in the first class therefore got a weight of 8/15=0.53, as can be quickly checked:
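For instance (a sketch; getweight reads a weight meta-attribute):

for ex in data2[:5]:
    print(ex.getclass(), ex.getweight(weightID))   # first-class examples weigh 8/15 = 0.533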
Usually, you would use both equalization and class weights; this way, you prescribe the exact proportions of the classes:
part of pp-weights.py (uses lenses.tab)
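A sketch matching the numbers below:

pp = orange.Preprocessor_addClassWeight()
pp.equalize = 1
pp.classWeights = [0.5, 0.25, 0.25]
data2, weightID = pp(data)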
This script yields weighted class counts of 12, 6 and 6.
Formally, the preprocessor functions like this:
- If equalization is not requested, each example gets the weight prescribed for its class in classWeights.
- If equalization is requested and no classWeights is passed, the examples are reweighted so that the (weighted) number of examples (i.e. the sum of example weights) stays the same and the weighted distribution of classes is uniform.
- If equalization is requested and classWeights is given, the distribution is first equalized and then multiplied by classWeights; if the weights of two classes in the classWeights list are a and b, then the ratio of weighted examples of those two classes will be a:b. The sum of weights of all examples does not stay the same, but is multiplied by the sum of the elements of classWeights.
The latter case sounds complicated, but isn't. As we have seen in the last example on the lenses domain, the number of examples stayed the same (12+6+6=24) when classWeights was [0.5, 0.25, 0.25]. If classWeights were [2, 1, 1], the (weighted) number of examples would quadruple. The actual number of examples (the length of the example table) would naturally stay the same; what changes is only the sum of weights.
Special care is taken of empty classes. If we have a three-class problem with 24 examples, but one of the classes is empty, pure equalization would put 12 (not 8!) examples in each of the two remaining classes. The same holds when equalization and class weights are combined: if classWeights sums to 1, the sum of weights will stay the same.
The other weight-introducing preprocessor deals with censoring. In some areas, like medicine, we often deal with examples of different credibility. If, for instance, we follow a group of cancer patients treated with chemotherapy and the class attribute tells whether the disease recurred, then we might have patients who were followed for a period of five years and others who moved out of the country or died of an unrelated cause within a few months. Although both may be classified as non-recurring, it is obvious that the weight of the former should be greater than that of the latter.
This is handled by Preprocessor_addCensorWeight.
Attributes
outcomeVar (the attribute that tells whether the event (failure) occurred; if not given, the class attribute is used)
eventValue (the value of outcomeVar that denotes failure; all other values denote censoring; e.g. if the symbolic value "fail" denotes failure, then eventValue must be set to outcomeVar.values.index("fail") or, equivalently, int(orange.Value(outcomeVar, "fail")))
timeVar (the attribute with the observation time)
maxTime (the observation time after which a non-failing example is taken as a certain non-failure; see below)
method (orange.Preprocessor_addCensorWeight.Linear, orange.Preprocessor_addCensorWeight.KM or orange.Preprocessor_addCensorWeight.Bayes for linear, Kaplan-Meier and Bayesian weighting, respectively; see below)
addComplementary (if true (default is false), for each censored example a complementary failing example is added with the weight equal to the amount by which the original example's weight was decreased)
There are different approaches to weighting censored examples; Orange implements three of them. In any case, examples that failed are good examples of failing: they failed for sure and have a weight of 1. The same goes for examples that did not fail and were observed for at least maxTime (given by the user). Weighting is needed for examples that did not fail but were not observed long enough. If the addComplementary flag is false (the default), the example's weight is decreased by a factor computed by one of the methods described below. If it is true, a complementary failing example is added with the weight equal to the amount by which the original example's weight was decreased.
Linear weighting assigns each censored example a weight proportional to its observation time, t/maxTime. If maxTime is not given, the maximal time in the data is taken.
Kaplan-Meier weighting gives non-failing examples that were observed for time t<maxTime a weight of KM(maxTime)/KM(t) - this is the conditional probability of not failing until maxTime given that the example did not fail before time t.
Bayesian weighting computes the prior probability of failure as the proportion of failures among the examples observed for at least maxTime. Likewise, the conditional probability that an example that will eventually fail has not failed at (or before) time t is computed from the corresponding proportion. The third needed probability is the probability of not failing at (or before) time t, which is computed as the proportion of examples that did not fail at t among those that were observed for at least time t. Practical experiments showed that all three weighting methods give similar results.
The following script loads the new Wisconsin breast cancer dataset, which tells whether the cancer recurred or not; if it recurred, it gives the time of recurrence, and if not, the disease-free time. Weights are assigned using the Kaplan-Meier method with 20 as the maximal time. The name of the attribute with the time is "time". Failing examples are those whose class value is "R"; we don't need to assign the outcomeVar, since the event is stored in the class attribute.
To see the results, we print out all non-recurring examples with a disease-free time of less than 10.
part of pp-weights.py (uses lenses.csv)
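A sketch (the file name "wpbc" and the non-recurring class value are assumptions based on the description above):

data = orange.ExampleTable("wpbc")   # assumed file name
pp = orange.Preprocessor_addCensorWeight()
pp.method = orange.Preprocessor_addCensorWeight.KM
pp.maxTime = 20
pp.timeVar = data.domain["time"]
pp.eventValue = int(orange.Value(data.domain.classVar, "R"))
data2, weightID = pp(data)
for ex in data2:
    if ex.getclass() != "R" and ex[pp.timeVar] < 10:
        print(ex[pp.timeVar], ex.getweight(weightID))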
The script prints out these examples together with their weights.
The discretization preprocessor Preprocessor_discretize is a substitute for the discretizers in module orngDisc. It has three attributes.
Attributes
attributes (the attributes to be discretized; None (the default) to discretize all)
discretizeClass (tells whether to discretize the class attribute as well; the default is false)
method (the discretization method, an object of type Discretization, e.g. EquiDistDiscretization, EquiNDiscretization or EntropyDiscretization)
This is the simplest way to discretize the iris dataset:
part of pp-discretization.py (uses iris.tab)
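A sketch:

iris = orange.ExampleTable("iris")
pp = orange.Preprocessor_discretize()
pp.method = orange.EquiNDiscretization(numberOfIntervals = 4)
iris2 = pp(iris)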
To discretize only "petal length" and "sepal length", set the attributes field:
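Like this (a sketch):

pp.attributes = [iris.domain["petal length"], iris.domain["sepal length"]]
iris3 = pp(iris)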
The last preprocessor, Preprocessor_filter, offers a way of applying example filters.
Attributes
filter (an object of type Filter; each example is passed to the filter, which is asked whether to keep it or not)
For instance, to exclude the examples with defined class values, you can call:
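Perhaps like this (a sketch; Filter_hasClassValue accepts examples with a defined class, so we negate it):

pp = orange.Preprocessor_filter()
pp.filter = orange.Filter_hasClassValue(negate = 1)
data2 = pp(data)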
Note that you can employ preprocessors for most tasks that you could use filters for, and that filters can also be applied by ExampleTable's method filter. The preferred way is the way which you prefer.