Value Transformers

Class TransformValue is the base class of a hierarchy of classes used throughout Orange for simple transformations of values. Discretization, for instance, creates a transformer that converts continuous values into discrete ones, while continuizers do the opposite. Classification trees use transformers for binarization, where values of discrete attributes are converted into binary ones.

Transformers are most commonly used in conjunction with Classifiers from Attribute. It is also possible to subtype this class in Python.

Although these classes can occasionally come in very handy, you will mostly encounter them when they are created by other methods, such as discretization.

Transforming Individual Attributes

TransformValue

TransformValue is the abstract root of the hierarchy, itself derived from Orange. When called with a Value as an argument, it returns the transformed value.

See Classifiers from Attribute for an example of how to derive new Python classes from TransformValue.

Attributes

subTransformer
Specifies the transformation that takes place prior to this. This way, transformations can be chained, although this will seldom be needed.
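
The chaining idea can be pictured in plain Python. The sketch below is not Orange code: the class name ChainedTransformer and its attributes are hypothetical and only illustrate the assumption that a sub-transformer, when present, is applied before the transformer's own step.

```python
class ChainedTransformer:
    """Hypothetical stand-in for a transformer with a subTransformer."""

    def __init__(self, func, sub_transformer=None):
        self.func = func                        # this transformer's own step
        self.sub_transformer = sub_transformer  # applied first, if given

    def __call__(self, value):
        if self.sub_transformer is not None:
            value = self.sub_transformer(value)
        return self.func(value)


# First turn an integer index into a float, then scale the result by 0.5.
to_float = ChainedTransformer(float)
halve = ChainedTransformer(lambda v: v * 0.5, sub_transformer=to_float)
print(halve(2))
```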

Ordinal2Continuous

Ordinal2Continuous converts ordinal values to equidistant continuous values. A four-valued attribute with, say, values 'small', 'medium', 'large' and 'extra large' would be converted to 0.0, 1.0, 2.0 and 3.0. You can also specify a factor by which the values are multiplied. If the factor for the above attribute is set to 1/3 (or, in general, to 1/(N-1), where N is the number of values), the new continuous attribute will have values from 0.0 to 1.0.

Attributes

factor
The factor by which the values are multiplied.

part of transformvalues-o2c.py (uses lenses.tab)

    import orange

    data = orange.ExampleTable("lenses")

    age = data.domain["age"]

    age_c = orange.FloatVariable("age_c")
    age_c.getValueFrom = orange.ClassifierFromVar(whichVar = age)
    age_c.getValueFrom.transformer = orange.Ordinal2Continuous()

    newDomain = orange.Domain([age, age_c], data.domain.classVar)
    newData = orange.ExampleTable(newDomain, data)

The values of attribute 'age' ('young', 'pre-presbyopic' and 'presbyopic') are in the new domain transformed to 0.0, 1.0 and 2.0. If we additionally set age_c.getValueFrom.transformer.factor to 0.5, the new values will be 0.0, 0.5 and 1.0.
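
The arithmetic behind these numbers is easy to check without Orange. The following is a plain-Python sketch, not the library's implementation; the function name ordinal_to_continuous is made up for illustration, and values are represented by their integer indices.

```python
def ordinal_to_continuous(index, factor=1.0):
    """Map the integer index of an ordinal value to index * factor."""
    return index * factor


age_values = ["young", "pre-presbyopic", "presbyopic"]

# Default factor: equidistant values starting at 0.0.
plain = [ordinal_to_continuous(i) for i in range(len(age_values))]

# A factor of 1/(N-1) squeezes the values into the range 0.0 to 1.0.
factor = 1.0 / (len(age_values) - 1)
scaled = [ordinal_to_continuous(i, factor) for i in range(len(age_values))]

print(plain)
print(scaled)
```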

Discrete2Continuous

Discrete2Continuous converts a discrete value to a continuous so that some designated value is converted to 1.0 and all others to 0.0 or -1.0, depending on the settings.

Attributes

value
The value that is converted to 1.0; others are converted to 0.0 or -1.0. The value needs to be specified by an integer index, not a Value.
zeroBased
Decides whether the other values will be transformed to 0.0 (True, default) or -1.0 (False). When False, undefined values are transformed to 0.0. Otherwise, undefined values yield an error.
invert
If True (default is False), the transformations are reversed - the selected value becomes 0.0 (or -1.0) and others 1.0.

The following example loads the Monk 1 dataset and prepares various transformations for attribute "e".

part of transformvalues-d2c.py (uses monk1.tab)

    import orange

    data = orange.ExampleTable("monk1")

    e = data.domain["e"]

    e1 = orange.FloatVariable("e=1")
    e1.getValueFrom = orange.ClassifierFromVar(whichVar = e)
    e1.getValueFrom.transformer = orange.Discrete2Continuous()
    e1.getValueFrom.transformer.value = int(orange.Value(e, "1"))

We first construct a new continuous attribute e1 and set its getValueFrom to a newly constructed classifier that extracts the value of e from any example it is given. Then we tell the classifier to transform the extracted value using a Discrete2Continuous transformation. The transformation's value is set to the index of e's value "1"; one way to do this is to construct a Value of attribute e and cast it to an integer (if this is unclear, you can simply use the idiom as it is).

To demonstrate the use of the various flags, we construct two more attributes in a similar manner. Both are based on e and check whether e's value is "1", except that the transformation of the new attribute e10 is not zero-based, and the transformation of e01 is additionally inverted:

part of transformvalues-d2c.py

    (...)
    e10.getValueFrom.transformer.zeroBased = False
    (...)
    e01.getValueFrom.transformer.zeroBased = False
    e01.getValueFrom.transformer.invert = True

Finally, we shall construct a new domain that will only have the original e and its transformations, and the class. We shall convert the entire table to that domain and print out the first ten examples.

part of transformvalues-d2c.py

    newDomain = orange.Domain([e, e1, e10, e01], data.domain.classVar)
    newData = orange.ExampleTable(newDomain, data)
    for ex in newData[:10]:
        print ex

Here's the script's output.

    ['1', 1.000, 1.000, -1.000, '1']
    ['1', 1.000, 1.000, -1.000, '1']
    ['2', 0.000, -1.000, 1.000, '1']
    ['2', 0.000, -1.000, 1.000, '1']
    ['3', 0.000, -1.000, 1.000, '1']
    ['3', 0.000, -1.000, 1.000, '1']
    ['4', 0.000, -1.000, 1.000, '1']
    ['4', 0.000, -1.000, 1.000, '1']
    ['1', 1.000, 1.000, -1.000, '1']
    ['1', 1.000, 1.000, -1.000, '1']

The second and the third attribute differ in that where the second has zeros, the third has -1's. The last attribute (before the class) is the reversed version of the third.

You can, of course, "divide" a single attribute into a number of continuous attributes. The original attribute e has four possible values; let's create four new attributes, each corresponding to one of e's values.
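
The effect of the three settings can be mimicked in a few lines of plain Python. This is a sketch of the rules described above, not Orange's code: the function name and the use of None for an undefined value are assumptions, and so is the guess that the values "1" to "4" of e have indices 0 to 3.

```python
def discrete_to_continuous(index, value, zero_based=True, invert=False):
    """Return 1.0 if index == value, else the "low" value (0.0 or -1.0).

    index is the integer index of the discrete value, or None if undefined.
    """
    low = 0.0 if zero_based else -1.0
    if index is None:
        if zero_based:
            raise ValueError("undefined value cannot be transformed")
        return 0.0
    hit = (index == value)
    if invert:
        hit = not hit
    return 1.0 if hit else low


# One line per value of e: index, e1, e10, e01.
for e_index in range(4):
    e1 = discrete_to_continuous(e_index, 0)
    e10 = discrete_to_continuous(e_index, 0, zero_based=False)
    e01 = discrete_to_continuous(e_index, 0, zero_based=False, invert=True)
    print(e_index, e1, e10, e01)
```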

part of transformvalues-d2c.py (uses monk1.tab)

    attributes = [e]
    for v in e.values:
        newattr = orange.FloatVariable("e=%s" % v)
        newattr.getValueFrom = orange.ClassifierFromVar(whichVar = e)
        newattr.getValueFrom.transformer = orange.Discrete2Continuous()
        newattr.getValueFrom.transformer.value = int(orange.Value(e, v))
        attributes.append(newattr)

The output of this script is

    ['1', 1.000, 0.000, 0.000, 0.000, '1']
    ['1', 1.000, 0.000, 0.000, 0.000, '1']
    ['2', 0.000, 1.000, 0.000, 0.000, '1']
    ['2', 0.000, 1.000, 0.000, 0.000, '1']
    ['3', 0.000, 0.000, 1.000, 0.000, '1']
    ['3', 0.000, 0.000, 1.000, 0.000, '1']
    ['4', 0.000, 0.000, 0.000, 1.000, '1']
    ['4', 0.000, 0.000, 0.000, 1.000, '1']
    ['1', 1.000, 0.000, 0.000, 0.000, '1']
    ['1', 1.000, 0.000, 0.000, 0.000, '1']

NormalizeContinuous

The NormalizeContinuous transformer takes a continuous value and keeps it continuous, but subtracts the average and divides the difference by the span: v' = (v - average) / span

Attributes

average
The value that is subtracted from the original.
span
The divisor.

The following script "normalizes" all attributes in the Iris dataset by subtracting the average value and dividing by half of the deviation.

part of transformvalues-nc.py (uses iris.tab)

    import orange

    data = orange.ExampleTable("iris")

    domstat = orange.DomainBasicAttrStat(data)
    newattrs = []
    for attr in data.domain.attributes:
        attr_c = orange.FloatVariable(attr.name+"_n")
        attr_c.getValueFrom = orange.ClassifierFromVar(whichVar = attr)
        transformer = orange.NormalizeContinuous()
        attr_c.getValueFrom.transformer = transformer
        transformer.average = domstat[attr].avg
        transformer.span = domstat[attr].dev/2
        newattrs.append(attr_c)

    newDomain = orange.Domain(data.domain.attributes + newattrs, data.domain.classVar)
    newData = orange.ExampleTable(newDomain, data)
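
To see what the formula does to actual numbers, here is a small plain-Python sketch that applies v' = (v - average) / span with the same choice of span as in the script above (half of the deviation). The sample values are made up, and using the population standard deviation for the deviation is an assumption; the text does not say which estimator Orange uses.

```python
import statistics


def normalize_continuous(v, average, span):
    """Apply v' = (v - average) / span."""
    return (v - average) / span


values = [4.0, 6.0, 8.0, 10.0]           # made-up sample values
average = statistics.mean(values)        # 7.0
deviation = statistics.pstdev(values)    # population standard deviation (assumed)
span = deviation / 2                     # same choice as in the script above

normalized = [normalize_continuous(v, average, span) for v in values]
print(normalized)
```

The normalized values are centered around zero; how far they spread depends only on the chosen span.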

MapIntValue

MapIntValue is a discrete-to-discrete transformer that changes values according to the given mapping. MapIntValue is used for binarization in decision trees.

Attributes

mapping
Mapping that determines the new value: v' = mapping[v]. Undefined values remain undefined. Mapping is indexed by integer indices and contains integer indices of values.

The following script maps the values of 'age' in the lenses dataset to a binary attribute: 'young' stays 'young', while 'pre-presbyopic' and 'presbyopic' both become 'old'.

part of transformvalues-miv.py (uses lenses.tab)

    age = data.domain["age"]

    age_b = orange.EnumVariable("age_c", values = ['young', 'old'])
    age_b.getValueFrom = orange.ClassifierFromVar(whichVar = age)
    age_b.getValueFrom.transformer = orange.MapIntValue()
    age_b.getValueFrom.transformer.mapping = [0, 1, 1]

The mapping says that the 0th value of age maps to the 0th value of age_b, while the 1st and 2nd values map to the 1st value of age_b.
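
The rule v' = mapping[v] can be tried out in plain Python. The sketch below is only an illustration: the function name map_int_value is made up, and None stands in for an undefined value.

```python
def map_int_value(index, mapping):
    """Return mapping[index]; an undefined value (None) stays undefined."""
    if index is None:
        return None
    return mapping[index]


age_values = ["young", "pre-presbyopic", "presbyopic"]
age_b_values = ["young", "old"]
mapping = [0, 1, 1]

for i, name in enumerate(age_values):
    print(name, "->", age_b_values[map_int_value(i, mapping)])
```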

Transforming Domains and Datasets

In the example on the use of NormalizeContinuous we have already seen how to transform all attributes of a dataset and prepare the corresponding new dataset. This operation is rather common, so it makes sense to have a few classes for accomplishing this task. Such a class is inevitably less flexible than per-attribute transformations, since no specific options can be set for individual attributes. For instance, DomainContinuizer, which is introduced below, can be told how to treat multinomial attributes, but the same treatment then applies to all such attributes. If some of your attributes need specific treatment, you will have to program the individual treatments yourself, in a manner similar to what we showed while introducing NormalizeContinuous.

DomainContinuizer

DomainContinuizer is a class that, given a domain or a set of examples, returns a new domain containing only continuous attributes. If examples are given, the original continuous attributes can be normalized, while for discrete attributes it is possible to use the most frequent value as the base. The attributes are treated according to their type, as controlled by the flags described below.

The fate of the class attribute is determined separately.

Attributes

zeroBased
This flag corresponds to the zeroBased flag of class Discrete2Continuous and determines the value used as the "low" value of the attribute. When binary attributes are transformed into continuous ones, or when a multivalued attribute is transformed into multiple attributes, the transformed attribute can either have values 0.0 and 1.0 (default, zeroBased=True) or -1.0 and 1.0. In the following text, we will assume that zeroBased is True and use 0.0.
multinomialTreatment
Decides the treatment of multinomial attributes. Let N be the number of the attribute's values.
DomainContinuizer.LowestIsBase
The attribute is replaced by N-1 attributes. If the attribute has the lowest value (0), all N-1 attributes are zero. Otherwise, the attribute corresponding to the actual value (the first of the new attributes corresponds to value 1, the second to value 2, and so on) will be 1.0 and the others will be 0.0. For attributes that have baseValue set, the specified value is used as the base instead of the lowest one.
DomainContinuizer.FrequentIsBase
The attribute is treated in the same fashion as above, except that the most frequent value, rather than the lowest, is used as the base. If several values share the first place, the lowest of them is used. For this option to work, the continuized domain needs to be constructed from a dataset, not a domain (which doesn't give information on value frequencies). Again, if the attribute has baseValue set, the specified value is used instead of the most frequent one.
DomainContinuizer.NValues
The attribute is replaced by N attributes, except for binary attributes, which are still replaced by a single attribute. If you plan to use the newly constructed domain in statistical modelling, make sure that the method is immune to dependent attributes.
DomainContinuizer.Ignore
Multivalued attributes are omitted.
DomainContinuizer.ReportError
If a multivalued attribute is encountered, an error is raised.
DomainContinuizer.AsOrdinal
Multivalued attributes are treated as ordinal, i.e. replaced by a continuous attribute with the values' index (see Ordinal2Continuous).
DomainContinuizer.AsNormalizedOrdinal
As above, except that the resulting continuous value will be from range 0 to 1.
normalizeContinuous
If True (default is False), continuous attributes are "normalized": the average value is subtracted from them and the difference is divided by the deviation. This is only possible when the continuizer is given the data, not only the domain.
classTreatment
Determines what happens with the class attribute if it is discrete.
DomainContinuizer.Ignore
Class attribute is copied as is. Note that this is different from the meaning of this value at multinomialTreatment where it denotes omitting the attribute.
DomainContinuizer.AsOrdinal, DomainContinuizer.AsNormalizedOrdinal
If class is multinomial, it is treated as ordinal, in the same manner as described above. Binary classes are transformed to 0.0/1.0 attributes.
It is not possible to normalize the continuous class with DomainContinuizer.

Let us first examine the effect of multinomialTreatment on attributes from dataset "bridges". To be able to follow the transformations, we shall first print out a description of domain and the 15th example in the dataset.

part of transformvalues-domain.py (uses bridges.tab)

    def printExample(ex):
        for val in ex:
            print "%20s: %s" % (val.variable.name, val)

    data = orange.ExampleTable("bridges")

    for attr in data.domain:
        if attr.varType == orange.VarTypes.Continuous:
            print "%20s: continuous" % attr.name
        else:
            print "%20s: %s" % (attr.name, attr.values)

    print
    print "Original 15th example:"
    printExample(data[15])

We'll show the output in a moment. Let us now use the lowest values as the bases and continuize the attributes.

part of transformvalues-domain.py

    continuizer = orange.DomainContinuizer()
    continuizer.multinomialTreatment = continuizer.LowestIsBase
    domain0 = continuizer(data)
    data0 = data.translate(domain0)
    printExample(data0[15])

Here's what we get; next to the continuized example, we've added the original example and the domain description, so that we can see what happens.

    Continuized               Original          Domain
    RIVER=A: 0.000            RIVER: M          RIVER: <M, A, O, Y>
    RIVER=O: 0.000
    RIVER=Y: 0.000
    ERECTED: 1863             ERECTED: 1863     ERECTED: continuous
    PURPOSE=AQUEDUCT: 0.000   PURPOSE: RR       PURPOSE: <HIGHWAY, AQUEDUCT, RR, WALK>
    PURPOSE=RR: 1.000
    PURPOSE=WALK: 0.000
    LENGTH: 1000              LENGTH: 1000      LENGTH: continuous
    LANES: 2                  LANES: 2          LANES: continuous
    CLEAR-G=G: 0.000          CLEAR-G: N        CLEAR-G: <N, G>
    T-OR-D=DECK: 0.000        T-OR-D: THROUGH   T-OR-D: <THROUGH, DECK>
    MATERIAL=IRON: 1.000      MATERIAL: IRON    MATERIAL: <WOOD, IRON, STEEL>
    MATERIAL=STEEL: 0.000
    SPAN=MEDIUM: 1.000        SPAN: MEDIUM      SPAN: <SHORT, MEDIUM, LONG>
    SPAN=LONG: 0.000
    REL-L=S-F: ?              REL-L: ?          REL-L: <S, S-F, F>
    REL-L=F: ?
    TYPE=SUSPEN: 0.000        TYPE: SIMPLE-T    TYPE: <WOOD, SUSPEN, SIMPLE-T, ARCH, CANTILEV, NIL, CONT-T>
    TYPE=SIMPLE-T: 1.000
    TYPE=ARCH: 0.000
    TYPE=CANTILEV: 0.000
    TYPE=NIL: 0.000
    TYPE=CONT-T: 0.000

The first attribute, the four-valued River, is replaced by three attributes corresponding to values "A", "O" and "Y". For the 15th example, River is "M", so all three attributes are 0.0. The continuous year is left intact. Of the three attributes that describe the purpose of the bridge, "PURPOSE=RR" is 1.0 since this is a railroad bridge. The value of the three-valued "REL-L" is undefined in the original example, so the corresponding two attributes in the new domain are undefined as well...
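
The same encoding can be reproduced by hand. The sketch below is plain Python rather than Orange code; the function name indicators is hypothetical, None stands for an undefined value, and zeroBased is assumed to be True.

```python
def indicators(index, n_values, base=0):
    """Encode a value index as n_values - 1 indicator values.

    The base value gets no indicator of its own, so it is encoded as all
    zeros. An undefined value (None) makes every indicator undefined.
    """
    others = [i for i in range(n_values) if i != base]
    if index is None:
        return [None] * len(others)
    return [1.0 if index == i else 0.0 for i in others]


river = ["M", "A", "O", "Y"]
purpose = ["HIGHWAY", "AQUEDUCT", "RR", "WALK"]

print(indicators(river.index("M"), len(river)))       # RIVER=A, =O, =Y
print(indicators(purpose.index("RR"), len(purpose)))  # PURPOSE=AQUEDUCT, =RR, =WALK
print(indicators(None, 3))                            # REL-L is undefined
```

Passing a different base index gives the encoding used when a value other than the lowest one serves as the base.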

In the next test, we replaced continuizer.LowestIsBase by continuizer.FrequentIsBase, instructing Orange to use the most frequent values for base values.

    Continuized               Original          Domain
    RIVER=M: 1.000            RIVER: M          RIVER: <M, A, O, Y>
    RIVER=O: 0.000
    RIVER=Y: 0.000
    ERECTED: 1863             ERECTED: 1863     ERECTED: continuous
    PURPOSE=AQUEDUCT: 0.000   PURPOSE: RR       PURPOSE: <HIGHWAY, AQUEDUCT, RR, WALK>
    PURPOSE=RR: 1.000
    PURPOSE=WALK: 0.000
    LENGTH: 1000              LENGTH: 1000      LENGTH: continuous
    LANES: 2                  LANES: 2          LANES: continuous
    CLEAR-G=N: 1.000          CLEAR-G: N        CLEAR-G: <N, G>
    T-OR-D=DECK: 0.000        T-OR-D: THROUGH   T-OR-D: <THROUGH, DECK>
    MATERIAL=WOOD: 0.000      MATERIAL: IRON    MATERIAL: <WOOD, IRON, STEEL>
    MATERIAL=IRON: 1.000
    SPAN=SHORT: 0.000         SPAN: MEDIUM      SPAN: <SHORT, MEDIUM, LONG>
    SPAN=LONG: 0.000
    REL-L=S: ?                REL-L: ?          REL-L: <S, S-F, F>
    REL-L=S-F: ?
    TYPE=WOOD: 0.000          TYPE: SIMPLE-T    TYPE: <WOOD, SUSPEN, SIMPLE-T, ARCH, CANTILEV, NIL, CONT-T>
    TYPE=SUSPEN: 0.000
    TYPE=ARCH: 0.000
    TYPE=CANTILEV: 0.000
    TYPE=NIL: 0.000
    TYPE=CONT-T: 0.000

Comparing the outputs, we notice that for the first attribute, "A" is chosen as the base value instead of "M", so the three new attributes tell whether the bridge is over "M", "O" or "Y". As for Purpose, nothing changes since highway bridges are the most frequent. The base value also changes for the binary Clear-G, since G is more frequent than N...

The next alternative is continuizer.NValues, which turns N-valued attributes into N attributes, except for N==2, where we still get a single binary attribute, using the lowest value for the base.
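
How such a base value might be picked is sketched below in plain Python. This is an illustration of the stated rule (most frequent value, ties going to the lowest one), not Orange's code; the function name is made up, the data are invented, and leaving undefined values (None) out of the count is an assumption.

```python
from collections import Counter


def most_frequent_base(indices, n_values):
    """Pick the index of the most frequent value; ties go to the lowest index.

    Undefined values (None) are not counted.
    """
    counts = Counter(i for i in indices if i is not None)
    return max(range(n_values), key=lambda i: (counts[i], -i))


# Made-up column of value indices for a four-valued attribute.
column = [1, 0, 1, 2, None, 1, 0]
print(most_frequent_base(column, 4))

# Tie between indices 0 and 2: the lower one is chosen.
print(most_frequent_base([2, 0, 2, 0], 3))
```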

    Continuized               Original          Domain
    RIVER=M: 1.000            RIVER: M          RIVER: <M, A, O, Y>
    RIVER=A: 0.000
    RIVER=O: 0.000
    RIVER=Y: 0.000
    ERECTED: 1863             ERECTED: 1863     ERECTED: continuous
    PURPOSE=HIGHWAY: 0.000    PURPOSE: RR       PURPOSE: <HIGHWAY, AQUEDUCT, RR, WALK>
    PURPOSE=AQUEDUCT: 0.000
    PURPOSE=RR: 1.000
    PURPOSE=WALK: 0.000
    LENGTH: 1000              LENGTH: 1000      LENGTH: continuous
    LANES: 2                  LANES: 2          LANES: continuous
    CLEAR-G=G: 0.000          CLEAR-G: N        CLEAR-G: <N, G>
    T-OR-D=DECK: 0.000        T-OR-D: THROUGH   T-OR-D: <THROUGH, DECK>
    MATERIAL=WOOD: 0.000      MATERIAL: IRON    MATERIAL: <WOOD, IRON, STEEL>
    MATERIAL=IRON: 1.000
    MATERIAL=STEEL: 0.000
    SPAN=SHORT: 0.000         SPAN: MEDIUM      SPAN: <SHORT, MEDIUM, LONG>
    SPAN=MEDIUM: 1.000
    SPAN=LONG: 0.000
    REL-L=S: ?                REL-L: ?          REL-L: <S, S-F, F>
    REL-L=S-F: ?
    REL-L=F: ?
    TYPE=WOOD: 0.000          TYPE: SIMPLE-T    TYPE: <WOOD, SUSPEN, SIMPLE-T, ARCH, CANTILEV, NIL, CONT-T>
    TYPE=SUSPEN: 0.000
    TYPE=SIMPLE-T: 1.000
    TYPE=ARCH: 0.000
    TYPE=CANTILEV: 0.000
    TYPE=NIL: 0.000
    TYPE=CONT-T: 0.000

The least exciting case is continuizer.Ignore, which omits the multivalued attributes and reduces the attribute set to the continuous attributes and the continuized binary ones.

    Continuized               Original          Domain
                              RIVER: M          RIVER: <M, A, O, Y>
    ERECTED: 1863             ERECTED: 1863     ERECTED: continuous
                              PURPOSE: RR       PURPOSE: <HIGHWAY, AQUEDUCT, RR, WALK>
    LENGTH: 1000              LENGTH: 1000      LENGTH: continuous
    LANES: 2                  LANES: 2          LANES: continuous
    CLEAR-G=G: 0.000          CLEAR-G: N        CLEAR-G: <N, G>
    T-OR-D=DECK: 0.000        T-OR-D: THROUGH   T-OR-D: <THROUGH, DECK>
                              MATERIAL: IRON    MATERIAL: <WOOD, IRON, STEEL>
                              SPAN: MEDIUM      SPAN: <SHORT, MEDIUM, LONG>
                              REL-L: ?          REL-L: <S, S-F, F>
                              TYPE: SIMPLE-T    TYPE: <WOOD, SUSPEN, SIMPLE-T, ARCH, CANTILEV, NIL, CONT-T>

The last two variations retain the number of attributes, but turn them into continuous ones. continuizer.AsOrdinal looks like this.

    Continuized               Original          Domain
    C_RIVER: 0.000            RIVER: M          RIVER: <M, A, O, Y>
    ERECTED: 1863             ERECTED: 1863     ERECTED: continuous
    C_PURPOSE: 2.000          PURPOSE: RR       PURPOSE: <HIGHWAY, AQUEDUCT, RR, WALK>
    LENGTH: 1000              LENGTH: 1000      LENGTH: continuous
    LANES: 2                  LANES: 2          LANES: continuous
    C_CLEAR-G: 0.000          CLEAR-G: N        CLEAR-G: <N, G>
    C_T-OR-D: 0.000           T-OR-D: THROUGH   T-OR-D: <THROUGH, DECK>
    C_MATERIAL: 1.000         MATERIAL: IRON    MATERIAL: <WOOD, IRON, STEEL>
    C_SPAN: 1.000             SPAN: MEDIUM      SPAN: <SHORT, MEDIUM, LONG>
    C_REL-L: ?                REL-L: ?          REL-L: <S, S-F, F>
    C_TYPE: 2.000             TYPE: SIMPLE-T    TYPE: <WOOD, SUSPEN, SIMPLE-T, ARCH, CANTILEV, NIL, CONT-T>

For instance, the value of C_PURPOSE is 2.000 since the original value, RR, is the value of Purpose with index 2 (counting from 0). Finally, continuizer.AsNormalizedOrdinal normalizes the new continuous attributes to the range 0.0 - 1.0.

    Continuized               Original          Domain
    C_RIVER: 0.000            RIVER: M          RIVER: <M, A, O, Y>
    ERECTED: 1863             ERECTED: 1863     ERECTED: continuous
    C_PURPOSE: 0.667          PURPOSE: RR       PURPOSE: <HIGHWAY, AQUEDUCT, RR, WALK>
    LENGTH: 1000              LENGTH: 1000      LENGTH: continuous
    LANES: 2                  LANES: 2          LANES: continuous
    C_CLEAR-G: 0.000          CLEAR-G: N        CLEAR-G: <N, G>
    C_T-OR-D: 0.000           T-OR-D: THROUGH   T-OR-D: <THROUGH, DECK>
    C_MATERIAL: 0.500         MATERIAL: IRON    MATERIAL: <WOOD, IRON, STEEL>
    C_SPAN: 0.500             SPAN: MEDIUM      SPAN: <SHORT, MEDIUM, LONG>
    C_REL-L: ?                REL-L: ?          REL-L: <S, S-F, F>
    C_TYPE: 0.333             TYPE: SIMPLE-T    TYPE: <WOOD, SUSPEN, SIMPLE-T, ARCH, CANTILEV, NIL, CONT-T>

Values of Purpose now transform to 0.000, 0.333, 0.667 and 1.000; for railroad bridges, the corresponding value is 0.667.