Preprocessing

Preprocessors are classes that take examples, usually stored in an ExampleTable, and return a new ExampleTable with the examples preprocessed in some way - filtered, weighted etc. All preprocessors can therefore be called with an example generator and, optionally, the id of a meta-attribute holding weights. They return either an ExampleTable or a tuple with an ExampleTable and a meta-attribute id; the tuple is returned if an id was passed to the preprocessor or if the preprocessor itself added a weight. All other parameters (such as the level of noise or the attributes to be removed) are properties of the preprocessor object and not (direct) arguments to the call.

Preprocessors can be constructed as objects or called, like many other classes in Orange, as functions. Examples of both ways follow shortly.

There is also a method selectionVector which, instead of examples, returns a list of booleans denoting which examples are accepted and which are not. This is, of course, only supported by preprocessors that filter examples, not by those that modify them.

Most of the code samples below work with the lenses dataset. We thus suppose that Orange is imported, the data is loaded and there are variables corresponding to the attribute descriptors:

import orange
data = orange.ExampleTable("lenses")
age, prescr, astigm, tears = data.domain.attributes
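To illustrate the selectionVector method mentioned above, here is a minimal sketch reusing this setup; it uses the Preprocessor_take preprocessor and its values field, both described later in this section:

pp = orange.Preprocessor_take()
pp.values[prescr] = "hyper"
accepted = pp.selectionVector(data)   # one boolean per example
print "%i of %i examples accepted" % (sum(accepted), len(data))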

Selection of attributes

Selection/removal of attributes is handled by the preprocessors Preprocessor_select and Preprocessor_ignore.

Attributes

attributes
Attributes to be selected/removed

The long way to use Preprocessor_select is to construct the object, assign the attributes and call it.

pp-attributes.py (uses lenses.tab)

>>> pp = orange.Preprocessor_select()
>>> pp.attributes = [age, tears]
>>>
>>> data2 = pp(data)
>>> print "Attributes: %s, classVar %s" % (data2.domain.attributes, data2.domain.classVar)
Attributes: <EnumVariable 'age', EnumVariable 'tears'>, classVar None

Note that you cannot pass attribute names (e.g. pp.attributes = ["age", "tears"]) since the domain is not known at the time the preprocessor is constructed. Variables age and tears are attribute descriptors.
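If you only have the names, the descriptors can be looked up in the domain by name once the data is loaded; a small sketch:

# attribute descriptors retrieved by name from the loaded domain
age, tears = data.domain["age"], data.domain["tears"]
pp = orange.Preprocessor_select()
pp.attributes = [age, tears]
data2 = pp(data)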

A quicker way to use a preprocessor is to construct the object, pass the data and set the options in a single call.

pp-attributes.py (uses lenses.tab)

>>> data2 = orange.Preprocessor_ignore(data, attributes = [age, tears])
>>> print "Attributes: %s, classVar %s" % (data2.domain.attributes, data2.domain.classVar)
Attributes: <EnumVariable 'prescr', EnumVariable 'astigm'>, classVar EnumVariable 'y'

In most cases, however, we'll have examples stored in an ExampleTable and will use its select method instead of these two preprocessors.
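A sketch of one such alternative, using domain conversion rather than a preprocessor (the second argument 0 to the orange.Domain constructor means "no class attribute"):

# column selection without preprocessors: build a new domain and convert
newDomain = orange.Domain([age, tears], 0)
data2 = orange.ExampleTable(newDomain, data)
print data2.domain.attributes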

Selection of examples

This section covers preprocessors for selection of examples. Selection can be random, based on criteria matching or on checking for the presence of (un)defined values.

Selection by values

As with selecting an attribute subset, there are again two preprocessors - Preprocessor_take keeps the examples that match the given criteria and Preprocessor_drop removes them.

Attributes

values
A dictionary-like list of type ValueFilterList (don't worry about the type if you don't need to know it) containing the criteria that an example must match to be selected/removed.

In the examples below we shall concentrate on Preprocessor_take; Preprocessor_drop works analogously.

pp-select.py (uses lenses.tab)

>>> pp = orange.Preprocessor_take()
>>> pp.values[prescr] = "hyper"
>>> pp.values[age] = ["young", "psby"]
>>> data2 = pp(data)
>>>
>>> for ex in data2:
...     print ex
['psby', 'hyper', 'y', 'normal', 'no']
['psby', 'hyper', 'y', 'reduced', 'no']
['psby', 'hyper', 'n', 'normal', 'soft']
['psby', 'hyper', 'n', 'reduced', 'no']
['young', 'hyper', 'y', 'normal', 'hard']
['young', 'hyper', 'y', 'reduced', 'no']
['young', 'hyper', 'n', 'normal', 'soft']
['young', 'hyper', 'n', 'reduced', 'no']

We required "prescr" to be "hyper" and "age" to be either "young" or "psby". The latter was given as a list of strings, the former as a single string (though we could also have passed it as a one-element list). The field pp.values behaves like a dictionary whose keys are attributes and whose values are conditions.

This should be enough for most users. If you need to know everything: the condition is not stored simply as a list of strings (["young", "psby"]) but as an object of type ValueFilter_discrete. The acceptable values are stored in a field surprisingly called values, and there is another field, acceptSpecial, that decides what to do with special attribute values (don't know, don't care). You can check that pp.values[age].acceptSpecial is -1, which means that special values are simply ignored. If acceptSpecial is 0, such examples are rejected; if it is 1, they are accepted (as if the attribute's value were one of those listed in values). Should you for any reason want to specify the condition directly, you can do it by pp.values[age] = orange.ValueFilter_discrete(values = orange.ValueList(["young", "psby"], age)). (Did I just hear something about preferring the shortcut?)
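A minimal sketch of the manual construction, assuming the fields behave as just described:

# explicit condition object; reject examples where age is undefined
cond = orange.ValueFilter_discrete(values = orange.ValueList(["young", "psby"], age))
cond.acceptSpecial = 0
pp = orange.Preprocessor_take()
pp.values[age] = cond
data2 = pp(data)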

More information on ValueFilterList can be found on the page about filters.

As you may have suspected, it is also possible to filter by values of continuous attributes. If age were continuous, we could select teenagers by

>>> pp.values[age] = (10, 19)

Both boundaries are inclusive. How do we select the examples outside an interval? By reversing the order of the boundaries:

>>> pp.values[age] = (19, 10)

Again, this should be enough for most. For "hackers": the condition is stored as a ValueFilter_continuous, which shares a common ancestor, ValueFilter, with ValueFilter_discrete. The boundaries are in the fields min and max. Here, min is always less than or equal to max; there is also a flag outside, false by default, that selects the examples outside the interval. Again, you can construct the condition manually, by pp.values[age] = orange.ValueFilter_continuous(min=10, max=19).
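A hedged sketch using the outside flag, equivalent to the reversed-boundaries shortcut above (assuming the constructor accepts outside as a keyword, like min and max):

# select examples whose age lies outside [10, 19]
pp.values[age] = orange.ValueFilter_continuous(min=10, max=19, outside=1)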

Finally, here's a shortcut. The preprocessor's field values behaves like a dictionary and can thus also be initialized as one. The shortest way to achieve the same result as above is

data2 = orange.Preprocessor_take(data, values = {prescr: "hyper", age: ["young", "psby"]})

Removal of duplicates

Preprocessor_removeDuplicates merges multiple occurrences of the same example into a single example whose weight is the sum of the weights of all merged examples. If the examples were originally unweighted, a new weight meta-attribute is created. This preprocessor always returns a tuple with the examples and the weight id.

To show how to use it, we shall first remove the attribute age, thereby introducing duplicate examples.

pp-duplicates.py (uses lenses.tab)

>>> data2 = orange.Preprocessor_ignore(data, attributes = [age])
>>> data2, weightID = orange.Preprocessor_removeDuplicates(data2)
>>> for ex in data2:
...     print ex
['hyper', 'y', 'normal', 'no'], {-2:2.00}
['hyper', 'y', 'reduced', 'no'], {-2:3.00}
['hyper', 'n', 'normal', 'soft'], {-2:3.00}
['hyper', 'n', 'reduced', 'no'], {-2:3.00}
['myope', 'y', 'normal', 'hard'], {-2:3.00}
['myope', 'y', 'reduced', 'no'], {-2:3.00}
['myope', 'n', 'normal', 'no'], {-2:1.00}
['myope', 'n', 'reduced', 'no'], {-2:3.00}
['myope', 'n', 'normal', 'soft'], {-2:2.00}
['hyper', 'y', 'normal', 'hard'], {-2:1.00}

The new weight attribute has id -2 (which can be checked by looking at weightID), and the resulting examples are merges of up to three original examples. Note that you may get an id other than -2 if you run the script multiple times.
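As a quick sanity check (a sketch reusing weightID from above), the merged weights should sum to the original number of examples:

# sum of merged weights equals the original 24 lenses examples
total = sum(ex.getmeta(weightID) for ex in data2)
print total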

Selection by missing values

There are four preprocessors that select or remove examples with missing values. Preprocessor_dropMissing removes examples with any missing values and Preprocessor_dropMissingClasses removes examples with a missing class value. The other pair, Preprocessor_takeMissing and Preprocessor_takeMissingClasses, selects only the examples with at least one missing value or with a missing class value, respectively. See the examples in the section about adding missing values.
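A minimal usage sketch of all four, using the names from the paragraph above:

complete = orange.Preprocessor_dropMissing(data)          # no missing values at all
classed  = orange.Preprocessor_dropMissingClasses(data)   # class value must be defined
someMiss = orange.Preprocessor_takeMissing(data)          # at least one missing value
noClass  = orange.Preprocessor_takeMissingClasses(data)   # class value missing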

Shuffling

Some statistical tests require making a certain attribute useless by permuting its values across examples. There is a preprocessor to do this, called Preprocessor_shuffle.

Attributes

attributes
A list of attributes (usually a single attribute) whose values need to be shuffled.

Here's how to use it:

d2 = orange.Preprocessor_shuffle(d, attributes=[d.domain[0]])

Executing this will create a new table d2, which will contain all examples from d but with the values of the first attribute permuted.

Adding noise

Orange has separate preprocessors for discrete and continuous noise. For discrete noise, a proportion of noisy values needs to be provided. If it is, for instance, 0.25, every fourth value is changed to a random value. Note that this does not mean that a quarter of the values will actually change, since the random value can be equal to the original one. For continuous noise, the value is modified by a random value drawn from a Gaussian distribution; the user provides the deviation.

Class noise

Preprocessor Preprocessor_addClassNoise sets the example's class to a random value with a given probability.

Attributes

proportion
Proportion of changed values. If set to, for instance, 0.25, the preprocessor assigns random classes to (approximately) one quarter of the examples; the exact procedure is described below.
randomGenerator
Random number generator to be used for adding noise. If left None, a new generator is constructed each time the preprocessor is called, and initialized with random seed 0.

part of pp-noise.py (uses lenses.tab)

>>> data2 = orange.Preprocessor_addClassNoise(data, proportion = 0.5)

Note again that this doesn't mean that the class of half of the examples will be changed. The preprocessor works like this: if the dataset has N examples, the class attribute has v values and the noise proportion is p, then N*p/v randomly chosen examples will be assigned to class 1, another N*p/v randomly chosen examples will be assigned to class 2 and so forth, while the remaining N*(1-p) examples are left alone. When the numbers do not divide evenly, they are not rounded to the closest integer; instead, the groups that get one example more or less are chosen at random.

For instance, the dataset lenses has 24 examples and its class attribute has three distinct values. Four randomly chosen examples are assigned to each of the three classes, while the remaining 12 examples are left as they are.
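A small sketch to see this in action, assuming the returned table keeps the original example order:

# count how many classes actually differ after adding 50% class noise;
# at most 12 for lenses, fewer when the random class equals the original
data2 = orange.Preprocessor_addClassNoise(data, proportion = 0.5)
changed = sum(1 for e1, e2 in zip(data, data2) if e1.getclass() != e2.getclass())
print "%i classes changed" % changed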

Preprocessor Preprocessor_addGaussianClassNoise adds Gaussian noise with the given deviation to a continuous class.

Attributes

deviation
Sets the deviation for the noise; a random number drawn from N(0, deviation) is added to the class of each example.
randomGenerator
Random number generator to be used for adding noise. If left None, a new generator is constructed each time the preprocessor is called, and initialized with random seed 0.

To show how this works, we shall construct a simple example table with 20 examples described only by a continuous class attribute, which always has the value 100. To this, we apply Gaussian noise with deviation 10.

part of pp-noise.py (uses lenses.tab)

>>> cdomain = orange.Domain([orange.FloatVariable()])
>>> cdata = orange.ExampleTable(cdomain, [[100]] * 20)
>>> cdata2 = orange.Preprocessor_addGaussianClassNoise(cdata, deviation = 10)
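To observe the effect, a small sketch reusing cdata2 from above; the printed classes should scatter around 100:

# the noisy classes are drawn from roughly N(100, 10)
for i in range(5):
    print cdata2[i].getclass()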

Attribute noise

Preprocessor Preprocessor_addNoise sets attributes to random values; probability can be prescribed for each attribute individually and for all attributes in general.

Attributes

proportions
A dictionary-like list giving, for each individual attribute, the proportion of its values to be set to random values. The list can also include the class attribute.
defaultProportion
Proportion of changed values for all attributes not specifically listed above. The default proportion does not cover the class attribute.
randomGenerator
Random number generator to be used for adding noise. If left None, a new generator is constructed each time the preprocessor is called, and initialized with random seed 0.

Note the treatment of the class attribute. If you want to add class noise with this preprocessor, it does not suffice to set defaultProportion, as this only applies to the other attributes. You need to specifically request noise for the class attribute in proportions.

part of pp-noise.py (uses lenses.tab)

age, prescr, astigm, tears, y = tuple(data.domain.variables)
pp = orange.Preprocessor_addNoise()
pp.proportions[age] = 0.3
pp.proportions[prescr] = 0.5
pp.defaultProportion = 0.2
data2 = pp(data)

This preprocessor sets 30% of the values of "age" to random values, as well as 50% of the values of "prescr" and 20% of the values of the other attributes. See the description of Preprocessor_addClassNoise above for details on how the examples are selected.

The class attribute will be left intact. Note that age and prescr are attribute descriptors, not strings or indices.

To add noise to continuous attributes, use the preprocessor Preprocessor_addGaussianNoise.

Attributes

deviations
Deviations for individual attributes
defaultDeviation
Deviation for attributes not specifically listed in deviations
randomGenerator
Random number generator to be used for adding noise. If left None, a new generator is constructed each time the preprocessor is called, and initialized with random seed 0.

The following script adds Gaussian noise with deviation 1.0 to all attributes of the iris dataset except "petal width". To achieve this, it sets defaultDeviation to 1.0, but specifically sets the noise level for "petal width" to 0.

part of pp-noise.py (uses iris.tab)

iris = orange.ExampleTable("iris")
pp = orange.Preprocessor_addGaussianNoise()
pp.deviations[iris.domain["petal width"]] = 0.0
pp.defaultDeviation = 1.0
data2 = pp(iris)

Adding missing values

Preprocessors for adding missing values (that is, for replacing known values with don't-knows or don't-cares) are similar to those for introducing noise. There are two preprocessors: one manipulates all attributes, the other only the class values.

Removing class values

Removal of class values is handled by Preprocessor_addMissingClasses.

Attributes

proportion
Proportion of examples for which the class value will be removed.
specialType
The type of special value to be used. Can be orange.ValueTypes.DK or orange.ValueTypes.DC for "don't know" and "don't care".
randomGenerator
Random number generator to be used for selecting examples for class value removal. If left None, a new generator is constructed each time the preprocessor is called, and initialized with random seed 0.

The following script replaces 50% of the class values with "don't know", prints out the classes of all examples, then removes the examples with missing classes and prints the classes again.

part of pp-missing.py (uses lenses.tab)

pp = orange.Preprocessor_addMissingClasses()
pp.proportion = 0.5
pp.specialType = orange.ValueTypes.DK
data2 = pp(data)

print "Removing 50% of class values:",
for ex in data2:
    print ex.getclass(),
print

data2 = orange.Preprocessor_dropMissingClasses(data2)
print "Removing examples with unknown class values:",
for ex in data2:
    print ex.getclass(),
print

Removing attribute values

Preprocessor_addMissing replaces known values of attributes with unknowns of prescribed type.

Attributes

proportions
A dictionary-like list giving, for each attribute, the proportion of examples in which that attribute's value will be replaced by an undefined value.
defaultProportion
Proportion of changed values for attributes not specifically listed in proportions.
specialType
The type of special value to be used. Can be orange.ValueTypes.DK or orange.ValueTypes.DC for "don't know" and "don't care".
randomGenerator
Random number generator to be used to select values for removal. If left None, a new generator is constructed each time the preprocessor is called, and initialized with random seed 0.

As with adding noise, this preprocessor does not manipulate class values unless the class attribute is specifically listed in proportions.

The following example removes 20% of the values of "age" and 50% of the values of "astigm" in the lenses dataset, replacing them with "don't care". It then prints out the examples with missing values.

part of pp-missing.py (uses lenses.tab)

age, prescr, astigm, tears, y = data.domain.variables
pp = orange.Preprocessor_addMissing()
pp.proportions = {age: 0.2, astigm: 0.5}
pp.specialType = orange.ValueTypes.DC
data2 = pp(data)

print "\n\nSelecting examples with unknown values"
data3 = orange.Preprocessor_takeMissing(data2)
for ex in data3:
    print ex

Assigning weights

Orange stores example weights as meta-attributes. Like any other meta-attribute, weights can be stored in and read from a file if they are given in advance or computed outside Orange. Orange itself offers two preprocessors for weighting examples.

Weighting by classes

Weighting by classes (Preprocessor_addClassWeight) assigns weights to examples according to their classes.

Attributes

classWeights
A list of weights for each class.
equalize
Make the class distribution homogeneous by decreasing and increasing example weights.

If you have, for instance, loaded the now famous lenses domain and want all examples of the first class to have a weight of 2.0 and those of the second and third a weight of 1.0, you can achieve this by

part of pp-weights.py (uses lenses.tab)

pp = orange.Preprocessor_addClassWeight()
pp.classWeights = [2.0, 1.0, 1.0]
data2, weightID = pp(data)
print " - original class distribution: ", orange.Distribution(y, data2)
print " - weighted class distribution: ", orange.Distribution(y, data2, weightID)

The script prints

- original class distribution: <15.000, 5.000, 4.000>
- weighted class distribution: <30.000, 5.000, 4.000>

The number of examples in the first class has seemingly doubled. Printing out the examples reveals that those belonging to the first class ("no") have a weight of 2.0 while the others weigh 1.0. The weighted examples can then be used for learning, either directly

>>> ba = orange.BayesLearner(data2, weightID)

or through some sampling procedure, such as cross-validation:

>>> res = orngTest.crossValidation([orange.BayesLearner()], (data2, weightID))

In a highly unbalanced dataset, where the majority class greatly outnumbers the minority, it is often desirable to reduce the number of majority class examples. The traditional way of doing this is to randomly select only a certain proportion of the examples belonging to the majority class. The alternative is to assign a smaller weight to the examples of the majority class (this, naturally, requires a learning algorithm capable of processing weights). For this purpose, Preprocessor_addClassWeight can be told to equalize the class distribution prior to weighting. Let's see how this works on lenses.

part of pp-weights.py (uses lenses.tab)

data2, weightID = orange.Preprocessor_addClassWeight(data, equalize=1)

Equalizing computes weights such that the weighted number of examples in each class is equal. In our case, we originally had a total of 24 examples, of which 15 belonged to the first class, and 5 and 4 to the other two. The preprocessor reweights the examples so that each of the three classes gets a weighted count of 8 (one third). Examples in the first class therefore get a weight of 8/15=0.53, as can be quickly checked:

>>> data2[0]
['psby', 'hyper', 'y', 'normal', 'no'], {6:0.53}
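A quick check (a sketch reusing y and weightID from above; the exact printout format may differ): the weighted class distribution should now be uniform.

>>> print orange.Distribution(y, data2, weightID)
<8.000, 8.000, 8.000>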

Usually, you would use both equalization and weighting. This way, you prescribe the exact proportions of the classes.

part of pp-weights.py (uses lenses.tab)

pp = orange.Preprocessor_addClassWeight()
pp.classWeights = [0.5, 0.25, 0.25]
pp.equalize = 1
data2, weightID = pp(data)
print " - original class distribution: ", orange.Distribution(y, data2)
print " - weighted class distribution: ", orange.Distribution(y, data2, weightID)

This script prints

- original class distribution: <15.000, 5.000, 4.000>
- weighted class distribution: <12.000, 6.000, 6.000>

Formally, the preprocessor functions as follows:

  • If equalization is not requested, each example's existing weight is multiplied by the corresponding weight in classWeights.
  • If equalization is requested and no (or an empty) classWeights is passed, the examples are reweighted so that the (weighted) number of examples (i.e. the sum of example weights) stays the same and the weighted class distribution is uniform.
  • If equalization is requested and class weights are given, the ratios of the class frequencies will be as given in classWeights: if the weights of two classes in the classWeights list are a and b, then the ratio of the weighted examples of those two classes will be a:b. The sum of weights of all examples does not stay the same, but is multiplied by the sum of the elements of classWeights.

The latter case sounds complicated, but isn't. As we saw in the last example on the lenses domain, the number of examples stayed the same (12+6+6=24) when classWeights was [0.5, 0.25, 0.25], which sums to 1. If classWeights were [2, 1, 1], the (weighted) number of examples would quadruple. The actual number of examples (the length of the example table) naturally stays the same; only the sum of weights changes.

Special care is taken of empty classes. If we have a three-class problem with 24 examples but one of the classes is empty, pure equalization puts 12 (not 8!) examples in each non-empty class. The same holds when both equalization and class weights are given: if classWeights sums to 1, the sum of weights stays the same.

Censoring

The other weight-assigning preprocessor, Preprocessor_addCensorWeight, deals with censoring. In some areas, like medicine, we often deal with examples of different credibility. If, for instance, we follow a group of cancer patients treated with chemotherapy and the class attribute tells whether the disease recurred, then we might have patients who were followed for a period of five years and others who moved from the country or died of an unrelated cause within a few months. Although both may be classified as non-recurring, it is obvious that the weight of the former should be greater than that of the latter.

Attributes

outcomeVar
Descriptor of the attribute containing the outcome; if left unset, the class variable is used as the outcome. It can be either a meta-attribute or a normal attribute.
timeVar
The attribute with the follow-up time. This will usually (but not necessarily) be a meta-attribute.
eventValue
An integer index of the value of outcomeVar that denotes failure; all other values denote censoring (e.g. if the symbolic value "fail" denotes failure, then eventValue must be set to outcomeVar.values.index("fail") or, equivalently, int(orange.Value(outcomeVar, "fail"))).
method
Sets the weighting method used; can be orange.Preprocessor_addCensorWeight.Linear, orange.Preprocessor_addCensorWeight.KM or orange.Preprocessor_addCensorWeight.Bayes for linear, Kaplan-Meier and Bayesian weighting, respectively (see below).
maxTime
Time after which a censored example is treated as "survived". This attribute's meaning depends on the selected method.
addComplementary
If true (default is false), for each censored example a complementary failing example is added with a weight equal to the amount by which the original example's weight was decreased.

There are different approaches to weighting censored examples; Orange implements three of them. In any case, examples that failed are good examples of failing: they failed for sure and have a weight of 1. The same goes for examples that did not fail and were observed for at least maxTime (given by the user). Weighting is needed for the examples that did not fail but were not observed long enough. If the addComplementary flag is false (the default), such an example's weight is decreased by a factor computed by one of the methods described below. If it is true, a complementary failing example is added as well, with a weight equal to the amount by which the original example's weight was decreased.

Linear weighting
Linear weighting is a simple ad hoc method that assigns a weight proportional to the observation time: an example that was observed for time t gets a weight of t/maxTime. If maxTime is not given, the maximal time in the data is used.
Kaplan-Meier
The Kaplan-Meier curve models the probability of not failing at or before time ti. It is computed iteratively: the probability of not failing at or before time ti equals the probability of not failing at or before ti-1, multiplied by the probability of not failing in the interval between those two times. The latter probability is estimated as the proportion of examples (say, patients) that were still OK at ti-1 but failed in the interval between ti-1 and ti.

Non-failing examples observed for time t < maxTime get a weight of KM(maxTime)/KM(t) - this is the conditional probability of not failing until maxTime, given that the example did not fail before time t. See the sketch after this list.

Weighting by Bayesian probabilities
This method is similar to Kaplan-Meier, but simpler. For each time point t we can compute the probability of failing by the Bayesian formula, where the prior probability of failing is computed as the proportion of examples observed at maxTime that failed. Likewise, the conditional probability that an example that will eventually fail has not failed at (or before) time t is computed from the corresponding proportion. The third needed probability, the probability of not failing at (or before) time t, is computed as the proportion of examples that had not failed by t among those observed for at least time t.

Practical experiments showed that all weighting methods give similar results.
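To make the Kaplan-Meier weighting concrete, here is an illustrative, pure-Python sketch (not the Orange implementation); the event times, failure counts and risk-set sizes below are made up:

def km_curve(times, failures, at_risk):
    # iteratively build KM(t): the probability of surviving past each event time
    km, curve = 1.0, {}
    for t, d, n in zip(times, failures, at_risk):
        km *= 1.0 - float(d) / n      # probability of surviving the interval ending at t
        curve[t] = km
    return curve

curve = km_curve([1, 3, 5, 8, 10], [1, 2, 1, 1, 2], [20, 18, 15, 12, 9])
maxTime = 10
# a censored, non-failing example observed only until t < maxTime
# gets the weight KM(maxTime) / KM(t)
for t in (3, 5, 8):
    print "t=%2i  weight=%.3f" % (t, curve[maxTime] / curve[t])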

The following script loads the Wisconsin prognostic breast cancer dataset (wpbc), which tells whether the cancer recurred; if it recurred, it gives the time of recurrence, otherwise it gives the disease-free time. Weights are assigned using the Kaplan-Meier method with 20 as the maximal time. The name of the attribute with the time is "time". Failing examples are those whose class value is "R"; we don't need to set outcomeVar, since the event is stored in the class attribute.

To see the results, we print out all non-recurring examples with a disease-free time of at most 10.

part of pp-weights.py (uses wpbc.tab)

import orange
data = orange.ExampleTable("wpbc")
time = data.domain["time"]
fail = data.domain.classVar.values.index("R")
data2, weightID = orange.Preprocessor_addCensorWeight(
    data, 0,
    eventValue = fail, timeVar = time, maxTime = 20,
    method = orange.Preprocessor_addCensorWeight.KM)

print "class\ttime\tweight"
for ex in data2.select(recur = "N", time = (0, 10)):
    print "%s\t%5.2f\t%5.3f" % (ex.getclass(), float(ex["time"]), ex.getmeta(weightID))
print

The script prints

class  time   weight
N      10.00  0.927
N       5.00  0.875
N       5.00  0.875
N       5.00  0.875
N       8.00  0.895
N       1.00  0.852
N      10.00  0.927
N       7.00  0.885
N       8.00  0.895
N       1.00  0.852
N       9.00  0.910
N      10.00  0.927
N       6.00  0.880
N       8.00  0.895
N       3.00  0.861
N       3.00  0.861
N      10.00  0.927
N       8.00  0.895
N       6.00  0.880

Discretization

The discretization preprocessor Preprocessor_discretize is a substitute for the discretizers in module orngDisc. It has three attributes.

Attributes

attributes
A list of attributes to be discretized. Leave it None (default) to discretize all.
discretizeClass
Tells whether to discretize the class attribute as well. Default is false.
method
The discretization method. Needs to be set to a component derived from Discretization, e.g. EquiDistDiscretization, EquiNDiscretization or EntropyDiscretization.

This is the simplest way to discretize the iris dataset:

part of pp-discretization.py (uses iris.tab)

import orange
iris = orange.ExampleTable("iris")
pp = orange.Preprocessor_discretize()
pp.method = orange.EquiDistDiscretization(numberOfIntervals = 5)
data2 = pp(iris)

To discretize only "petal length" and "sepal length", set the attributes:

pp.attributes = [iris.domain["petal length"], iris.domain["sepal length"]]

Applying filters

The last preprocessor, Preprocessor_filter, offers a way of applying example filters.

Attributes

filter
For each example from the example generator, the filter is asked whether to keep it or not.

For instance, to exclude the examples with undefined class values, you can call

data2 = orange.Preprocessor_filter(data, filter = orange.Filter_hasClassValue())

Note that you can employ preprocessors for most tasks that you would use filters for, and that filters can also be applied through ExampleTable's method filter. The preferred way is whichever you prefer.
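Conceptually, the preprocessor is equivalent to applying the filter to each example and collecting the accepted ones; a sketch, assuming filter objects are callable on single examples:

# manual equivalent of Preprocessor_filter
filt = orange.Filter_hasClassValue()
kept = [ex for ex in data if filt(ex)]
data2 = orange.ExampleTable(data.domain, kept)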