Attribute Types Defined in Python

Note: this page includes some advanced technical details. We recommend that you read it and skip the parts you don't understand. If things don't work as expected later, read it again...

Warning: at the time of writing (Aug 24 2004), this feature is relatively untested, but we will use it in our own work as a kind of beta testing. Please report any bugs (or remind us to remove this notice eventually :)

Besides the usual discrete and continuous attributes, which are used by learning algorithms, and strings and distributions, which are here for convenience, Orange also supports arbitrary attribute types defined in Python, that is, attributes whose descriptors are Python classes derived from PythonVariable (which is itself derived from Variable).

Such attributes cannot be used by Orange's learning methods, since most learning algorithms only handle discrete and continuous attributes (with many of them covering only one of the two types). You can, however, use attributes defined in Python in your own learning algorithms. Another use for such attributes is describing the examples: by using Python-defined attributes as meta attributes, you can attach arbitrary descriptors to examples. These descriptors won't be used while learning, but can be useful when presenting the examples, or in auxiliary processes such as selecting subsets of examples. Finally, Python-defined attributes can store data that is converted to ordinary (discrete and continuous) attributes when needed. If a Python-defined value is a list with the dates of a patient's visits to the doctor, it can be used for constructing continuous attributes that tell the number of visits, the longest span between two consecutive visits or the time between the first and the second visit.
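As a rough illustration of the last point, the sketch below derives those three numbers from a plain Python list of dates. It uses only the standard library; the helper name visit_features and the sample dates are invented for this example and are not part of Orange.

```python
from datetime import date


def visit_features(visits):
    """Summarize a list of visit dates as three numbers (spans in days)."""
    ordered = sorted(visits)
    # Gaps between each pair of consecutive visits, in days.
    gaps = [(later - earlier).days
            for earlier, later in zip(ordered, ordered[1:])]
    return {
        "n_visits": len(ordered),
        "longest_gap": max(gaps) if gaps else None,
        "first_to_second": gaps[0] if gaps else None,
    }


# Hypothetical data: one patient's visits, stored as an ordinary Python list.
visits = [date(2004, 1, 10), date(2004, 1, 24), date(2004, 3, 15)]
features = visit_features(visits)
print(features)
```

Each of the three numbers could then serve as the value of an ordinary continuous attribute.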

Python attributes can be constructed in a script or loaded from the old-style tab-delimited file, as described below. No other file format can accommodate these attributes.

Attributes

usePickle
Affects the way the data is saved and loaded. See the section on loading/saving values. Default is false (using __str__ is preferred over pickling).
useSomeValue
Tells what kind of data the overloaded methods will get and return. If true (default), the methods deal with pure Python objects; if false, the methods get and should return objects of type Value, and the corresponding Python objects can be stored in the field svalue. The following documentation is written as if useSomeValue is set. If it is not (which you will seldom need), you need to modify str2val, val2str and similar functions accordingly.

Associated with PythonVariable is a type PythonValue. PythonValue is a class derived from SomeValue (therefore a sibling of StringValue and Distribution) and stores a Python object. You will most often do without explicitly using PythonValue, since Orange will usually do the conversion for you, except in the cases where this could lead to ambiguity and hard-to-find errors in your scripts. Read on, and you shall see where and why.

Simple attribute values in Python

Say we have some data loaded and would like to add an attribute with some Python values to the examples. The easiest way to do this is to attach a meta attribute, like this.

part of pythonvariable1.py

    import orange

    data = orange.ExampleTable("lenses")
    newattr = orange.PythonVariable("foo")
    data.domain.addmeta(orange.newmetaid(), newattr)
    data[0]["foo"] = ("a", "tuple")
    data[1]["foo"] = "a string"
    data[2]["foo"] = orange
    data[3]["foo"] = data

The example is certainly weird and senseless, but it shows that the value of such an attribute can be just anything, from tuples and strings to arbitrary Python objects, such as modules and even the example table itself.

If you now check the value of data[1]["foo"], you will discover that it is not a string but an orange.Value. Sure, examples store values, not just anything you throw into them. Orange did the conversion automatically at the above assignments. The actual value can be read through the Value's field value. Therefore, data[1]["foo"].value will return the string "a string". And if you, for any perverse reason, want to use the Bayesian learner through the module and data stored in the attributes, you would write data[2]["foo"].value.BayesLearner(data[3]["foo"].value) (which is, of course, equivalent to orange.BayesLearner(data)).

There is a subtlety in using the value field: assigning, say, data[1]["foo"].value = 15 won't work as intended. See the beginning of the documentation on Value for an explanation.

Like any attribute, PythonVariable can compute its values from the values of other attributes, as described in the documentation on the Variable's method getValueFrom. Let us show how this is done with another not necessarily useful example: we shall construct an attribute whose value is a list of indices representing the values of the example's other attributes.

part of pythonvariable1.py

    def extolist(ex, wh=0):
        return orange.PythonValue(map(int, ex))

    listvar = orange.PythonVariable("a_list")
    listvar.getValueFrom = extolist

    newdomain = orange.Domain(data.domain.attributes + [listvar], data.domain.classVar)
    newdata = orange.ExampleTable(newdomain, data)

The first few examples in the newdata look like this.

    ['young', 'myope', 'no', 'reduced', '[0, 0, 0, 0, 0]', 'none']
    ['young', 'myope', 'no', 'normal', '[0, 0, 0, 1, 1]', 'soft']
    ['young', 'myope', 'yes', 'reduced', '[0, 0, 1, 0, 0]', 'none']
    ['young', 'myope', 'yes', 'normal', '[0, 0, 1, 1, 2]', 'hard']

Each element of the list corresponds to an index of the attribute value.
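To make the correspondence concrete, here is a small stand-alone sketch that reproduces the index lists without Orange. The attribute names and the order of values in VALUES are assumptions made for this illustration; only the positions of the values that occur in the four examples above matter here.

```python
# ASSUMED attribute names and value orders; an index is simply the position
# of a value within its attribute's list of values.
VALUES = {
    "age": ["young", "pre-presbyopic", "presbyopic"],
    "prescription": ["myope", "hypermetrope"],
    "astigmatic": ["no", "yes"],
    "tear_rate": ["reduced", "normal"],
    "lenses": ["none", "soft", "hard"],
}
ORDER = ["age", "prescription", "astigmatic", "tear_rate", "lenses"]


def to_indices(row):
    """Replace each symbolic value with its index within its attribute."""
    return [VALUES[name].index(value) for name, value in zip(ORDER, row)]


rows = [
    ["young", "myope", "no", "reduced", "none"],
    ["young", "myope", "no", "normal", "soft"],
    ["young", "myope", "yes", "reduced", "none"],
    ["young", "myope", "yes", "normal", "hard"],
]
index_lists = [to_indices(row) for row in rows]
for indices in index_lists:
    print(indices)
```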

Note that the function extolist, which we used as a classifier to put in listvar's getValueFrom, explicitly constructs a PythonValue. Couldn't we just write return map(int, ex), and let Orange treat this as a value? Well, it's time to tell the story behind PythonValue.

As you've probably read in the documentation on Value, a Value can store an integer used as an index of a discrete attribute value, a floating-point value of a continuous attribute or a value derived from SomeValue. The latter is stored in the value's field svalue (or so it seems from Python; the actual C++ field is named differently). The field value is a kind of synonym for all three: it can return an integer, a float or a SomeValue, depending upon the attribute (value) type.

In the first example, we set the value of data[i]["foo"]. Orange knows that the corresponding attribute (data.domain["foo"]) is of type PythonVariable and converts the passed value (a tuple, string, module, object) accordingly. If data.domain["foo"] were a discrete attribute (EnumVariable), it would accept the value "string" (if "string" were among the attribute's possible values) and raise a type error in other cases.

No such check can be done in the function extolist. Classifiers are expected to return values, and Orange would be all too happy to convert a list returned by map(int, ex) to a Value if only it knew how. But it has no idea which type of attribute this value is supposed to belong to. If it is a value of a PythonVariable, that's all right, but if it's a discrete attribute, we'd have to raise an exception. Orange could, in principle, observe the value's type, conclude that this cannot be anything other than a PythonVariable and return a PythonValue, but this would be dangerous: any time you misconstructed a value, Orange would silently convert it to a PythonValue, which would cause trouble God knows where.

There is however a workaround. You can do this as follows.

    def extolist(ex, wh=0):
        return map(int, ex)

    listvar = orange.PythonVariable("a_list")
    listvar.getValueFrom = extolist
    listvar.getValueFrom.classVar = listvar

Orange now knows that the classifier returns values of attribute listvar, which is of type PythonVariable, so it can convert map(int, ex) into a value. (OK, could you write extolist.classVar = listvar? See documentation on deriving classes from Orange classes for an explanation why not. And, again, if you don't understand something here on this page, just skip it.)

Storing/Reading from Files and Deriving new attribute types

Storing Python values to files and reading them, and deriving new attribute types from PythonVariable, are two closely related topics. The basic job of attribute descriptors, that is, instances of classes derived from Variable, is to convert values to and from a string representation, so that they can be saved to and loaded from text-based files (in whatever format and with whichever delimiters), and printed and set by the user.

All attribute descriptors define two methods: str2val, which gets a string and returns a Value, and val2str, which does the opposite. You don't need to know about these two methods for other attribute types (and don't even have direct access to them), but you indirectly use them all the time. If you inquire about the value of data[0]["age"] and see it's "young", or if you set it to "presbyopic", this goes through data.domain["age"]'s val2str and str2val, respectively.

If you want to define a special syntax for your Python-based attribute, you will need to derive a new Python class from PythonVariable and define the two functions.

part of pythonvariable2a.py

    import orange, time

    class DateVariable(orange.PythonVariable):
        def str2val(self, str):
            return time.strptime(str, "%b %d %Y")

        def val2str(self, val):
            return time.strftime("%b %d %Y (%a)", val)

Here we defined an attribute to represent a date. We used Python's module time whose functions strptime and strftime convert a date, represented as a string in a given format to an instance of time.struct_time, used for representing dates, and back. The string formats for str2val and val2str do not need to match. See this.

    >>> birth = DateVariable("birth")
    >>> val = birth("Aug 19 2003")
    >>> print val
    Aug 19 2003 (Tue)

When giving a value, we specify a month (a three-letter abbreviation), a day of month and a year. When the value is printed, a weekday is added.
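The two conversions can be tried without Orange. The sketch below calls the same two functions from the module time directly, as plain functions rather than as methods of a descriptor; it assumes the default C locale, since %b and %a are locale-dependent.

```python
import time


def str2val(text):
    # Parse a string such as "Aug 19 2003" into a time.struct_time.
    return time.strptime(text, "%b %d %Y")


def val2str(value):
    # Format the date and append the abbreviated weekday.
    return time.strftime("%b %d %Y (%a)", value)


val = str2val("Aug 19 2003")
printed = val2str(val)
print(printed)
```

Because the formats differ, the printed form cannot be fed back to str2val: time.strptime raises ValueError when text is left over after the format has been matched.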

Special values are treated separately: empty strings, question marks and tildes are converted to values without calling str2val, and special values are converted to strings without calling val2str. However, str2val can still return a special value if the string denotes one in some special syntax. To do this, it should return orange.PythonValueSpecial(type), where type is orange.ValueTypes.DC (which equals 1 and means don't care), orange.ValueTypes.DK (2, don't know) or any other non-zero integer (which will denote a special value of another type you need).

Let us construct an example table that will include a new attribute: we shall load the lenses data set, add the new attribute and set its value for the first example.

part of pythonvariable2a.py (uses lenses)

    data = orange.ExampleTable("lenses")

    newdomain = orange.Domain(data.domain.attributes + [birth], data.domain.classVar)
    newdata = orange.ExampleTable(newdomain, data)

    newdata[0]["birth"] = "Aug 19 2003"
    print newdata[0]

You can also save the newdata to a tab-delimited file (other formats do not support Python-based attributes).

If val2str is not defined, Orange will "print" the value to a string. The alternative to defining the DateVariable's val2str is defining a special Python class that will represent a date and overload its method __str__, like in the following example.

part of pythonvariable2b.py

    class DateValue(orange.SomeValue):
        def __init__(self, date):
            self.date = date

        def __str__(self):
            return time.strftime("%b %d %Y (%a)", self.date)

    class DateVariable(orange.PythonVariable):
        def str2val(self, str):
            return DateValue(time.strptime(str, "%b %d %Y"))

You may sometimes want to use a different string representation for saving and loading from files. This will be useful when the object is rather complex, so you would need a simpler (yet possibly inaccurate) form for printing the value and a more complex form for storing it. Also, it may be sometimes inconvenient or even impossible to parse the human-readable strings. Finally, we would even have problems saving the above attribute since str2val and val2str use different date formats.

To define a different representation for saving values to files, you need to define the methods filestr2val and val2filestr. They are similar to str2val and val2str, except that they get an additional argument: the example that is being read or written. In the former case, the example may be half constructed: the line in a file is always interpreted from left to right, so some values are already set while others are random (you may notice that they actually are not, but refrain from using them to avoid incompatibilities with future versions of Orange).

For our DateVariable, the two additional functions could, for instance, look as follows.

part of pythonvariable2c.py

    def filestr2val(self, str, example):
        if str == "unknown":
            return orange.PythonValueSpecial(orange.ValueTypes.DK)
        return DateValue(time.strptime(str, "%m/%d/%Y"))

    def val2filestr(self, val, example):
        return time.strftime("%m/%d/%Y", val)

We have added a new representation for unknown values: the string "unknown" is translated to DK. Just for fun, we use a different date format: month (given numerically), day and year, separated by slashes.
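A stand-alone sketch of such a file representation follows. The functions are plain functions rather than methods, the example argument is left out, and DK is a hypothetical marker standing in for the special value that Orange would construct. Unlike the print format, the file format reads back exactly what was written.

```python
import time

DK = object()  # hypothetical stand-in for Orange's "don't know" special value


def filestr2val(text):
    # "unknown" is this format's own spelling of a missing date.
    if text == "unknown":
        return DK
    return time.strptime(text, "%m/%d/%Y")


def val2filestr(value):
    # Special values never reach this function; see the writing order below.
    return time.strftime("%m/%d/%Y", value)


stored = val2filestr(filestr2val("08/19/2003"))
print(stored)
```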

PythonVariable has a flag usePickle. If it is set and val2filestr is undefined, Orange will pickle values when saving to a file. To accommodate the file's limitations, newline characters in the pickled string are replaced with the two-character sequence "\n" (if you attempt to manually unpickle the strings you find in files, you'll need to convert this back). See the Python documentation on the module pickle for details on pickling; basically, Orange will use the function pickle.dumps, which can convert practically any Python object to a string (a concept which is also known as serialization (in Java) or marshalling).
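The following sketch shows one way to squeeze a pickled object into a single line of text and to get it back. It is an illustration, not Orange's code: the helper names to_field and from_field are invented, and, as an extra assumption that goes beyond what is described above, backslashes are escaped as well so that the conversion back is unambiguous.

```python
import pickle


def to_field(obj):
    """Pickle obj (text protocol 0) into a single line of text."""
    text = pickle.dumps(obj, protocol=0).decode("latin-1")
    # Escape backslashes first so that escaped newlines stay unambiguous.
    return text.replace("\\", "\\\\").replace("\n", "\\n")


def from_field(field):
    """Undo to_field: restore the newlines, then unpickle."""
    chars = []
    i = 0
    while i < len(field):
        if field[i] == "\\" and i + 1 < len(field):
            following = field[i + 1]
            chars.append("\n" if following == "n" else following)
            i += 2
        else:
            chars.append(field[i])
            i += 1
    return pickle.loads("".join(chars).encode("latin-1"))


value = {"visits": ["Aug 19 2003", "Jan 12 1998"], "count": 2}
field = to_field(value)
restored = from_field(field)
print(field)
```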

Finally, here's how loading and saving from files goes. Converting a value read from the file goes like this:

  1. If value is an empty string or a question mark, it's don't know. If it's a tilde, it's don't care. You cannot override that.
  2. If filestr2val is defined, it is called. An error is reported if the call fails.
  3. If usePickle is not set and str2val is defined, str2val is called. An error is reported if this fails.
  4. If usePickle is set, Orange attempts to unpickle the string. If this fails, Orange continues with the next step.
  5. As a final attempt, Orange will treat the string as a Python expression. The scope (local and global variables) will be the same as at the call of ExampleTable(filename). The example that is being constructed (the same object as the last argument of filestr2val) will be present as a local variable __fileExample.
  6. If none of this succeeds, Orange reports an error.
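The reading order above can be sketched as a single function. This is a simplified model, not Orange's implementation: the markers DK and DC, the function read_field and the explicit scope argument are all hypothetical, and no example is passed to filestr2val.

```python
import pickle

DK, DC = "DK", "DC"  # hypothetical markers for the two special values


def read_field(text, var, scope=None):
    """Convert one field read from a file, following the steps listed above."""
    # 1. Special values are recognized first and cannot be overridden.
    if text in ("", "?"):
        return DK
    if text == "~":
        return DC
    # 2. A file-specific parser takes precedence; its errors propagate.
    if hasattr(var, "filestr2val"):
        return var.filestr2val(text, None)  # no example in this sketch
    use_pickle = getattr(var, "usePickle", False)
    # 3. Without pickling, the ordinary parser is used; its errors propagate.
    if not use_pickle and hasattr(var, "str2val"):
        return var.str2val(text)
    # 4. With pickling, try to unpickle; on failure, fall through.
    if use_pickle:
        try:
            return pickle.loads(text.replace("\\n", "\n").encode("latin-1"))
        except Exception:
            pass
    # 5. and 6. Last resort: evaluate the text as a Python expression in the
    # given scope; if that fails too, the exception is the reported error.
    return eval(text, dict(scope or {}))


class PlainVariable:
    """Defines nothing, so every ordinary field is evaluated as an expression."""


class PickledVariable:
    usePickle = True


print(read_field("[0, a, 1]", PlainVariable(), {"a": 12}))
print(read_field("a*5", PickledVariable(), {"a": 12}))
```

The second call shows the fall-through from step 4 to step 5: the text is not a valid pickle, so it is evaluated as an expression instead.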

Writing to a file goes as follows.

  1. Special values are represented with question marks and tildes.
  2. If val2filestr is defined, it's used.
  3. If usePickle is set, the value is pickled.
  4. val2str is called if defined.
  5. The value is printed. This can never fail, but usually won't give useful results if you haven't redefined the value's __str__, as shown above.
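The writing order can be modelled in the same spirit. Again, this is only a sketch under assumptions: write_field and its special argument are invented names, and no example is passed to val2filestr.

```python
import pickle


def write_field(value, var, special=None):
    """Convert one value to the text stored in a file, in the order above."""
    # 1. Special values get fixed symbols.
    if special == "DK":
        return "?"
    if special == "DC":
        return "~"
    # 2. A file-specific formatter takes precedence.
    if hasattr(var, "val2filestr"):
        return var.val2filestr(value, None)  # no example in this sketch
    # 3. Pickling comes next, with newlines replaced so the text fits one line.
    if getattr(var, "usePickle", False):
        text = pickle.dumps(value, protocol=0).decode("latin-1")
        return text.replace("\n", "\\n")
    # 4. Then the ordinary formatter.
    if hasattr(var, "val2str"):
        return var.val2str(value)
    # 5. Finally, the value is simply printed to a string.
    return str(value)


class BracketVariable:
    def val2str(self, value):
        return "<%s>" % value


print(write_field(5, BracketVariable()))
print(write_field([0, 1], object()))
```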

Although it may seem complicated, this order is natural and will work seamlessly - if you just redefine what you think sensible, Orange will probably work it out fine. Only if it doesn't, check the above steps to determine what went wrong.

Other Methods You Can Redefine

Attribute descriptors derived from Variable may support methods for returning a sequence of values, random values and the number of different values. You don't need to provide those methods. If you want to, these are the methods you need to define.

Overloadable methods of PythonVariable

firstValue(self)
Returns the first value of the attribute.
nextValue(self, value)
Returns the next value after value.
randomValue(self, int)
Returns a random value. It is desirable that the method uses the given integer as an argument for constructing the value, i.e. for initializing the random number generator or through some hashing scheme.
__len__(self)
Returns the number of different values.
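A minimal sketch of the four methods follows, written as a stand-alone class so that it runs without Orange; a real descriptor would derive from orange.PythonVariable. The class WeekdayVariable is invented for the illustration, and returning None after the last value is an assumption, since the behavior at the end of the sequence is not specified above.

```python
import random


class WeekdayVariable:
    """Hypothetical attribute whose values are the seven weekdays."""

    values = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

    def firstValue(self):
        return self.values[0]

    def nextValue(self, value):
        # ASSUMPTION: None signals that there are no more values.
        position = self.values.index(value) + 1
        return self.values[position] if position < len(self.values) else None

    def randomValue(self, seed):
        # Seeding a private generator with the given integer makes the
        # result reproducible for the same argument.
        return random.Random(seed).choice(self.values)

    def __len__(self):
        return len(self.values)


var = WeekdayVariable()
walked = []
value = var.firstValue()
while value is not None:
    walked.append(value)
    value = var.nextValue(value)
print(walked, len(var))
```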

You can also redefine two of PythonValue's methods. Besides __str__, which we examined above, you can redefine __cmp__(self, other), which should return a negative integer if self is smaller than other, zero if they are equal and a positive integer if self is greater. The meaning of "smaller" and "greater" depends upon the type of the attribute. If you leave the function undefined, Orange will use Python's built-in comparison function, which is great, since all decent Python objects support sensible comparisons. For instance, let us continue with the example in which we first defined DateVariable (but hadn't yet defined DateValue).

part of pythonvariable2a.py

    newdata[0]["birth"] = "Aug 19 2003"
    newdata[1]["birth"] = "Jan 12 1998"
    newdata[2]["birth"] = "Sep 1 1995"
    newdata[3]["birth"] = "May 25 2001"
    newdata.sort("birth")

    print "\nSorted data"
    for i in newdata:
        print i

This sorts the examples according to the date of birth (it might well be that people born in 2003 do not wear contact lenses at the time of writing this documentation, but we expect Orange to be around for a while :).

This won't work with DateValue as we defined it (in the second example, where we showed how to redefine __str__). To fix it, we need to define a method __cmp__ for DateValue. The easiest way is to pass the work to Python, like this.

part of pythonvariable2b.py

    def __cmp__(self, other):
        return cmp(self.date, other.date)
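The code on this page is written for Python 2. Python 3 has neither __cmp__ nor cmp, so the sketch below expresses the same delegation with rich comparison methods. It is a stand-alone illustration: DateValue here is a plain class rather than one derived from orange.SomeValue, and it says nothing about how Orange itself would call it.

```python
import time
from functools import total_ordering


@total_ordering
class DateValue:
    """Stand-alone stand-in for the DateValue class defined above."""

    def __init__(self, date):
        self.date = date

    def __str__(self):
        return time.strftime("%b %d %Y (%a)", self.date)

    # A time.struct_time compares like a tuple (year, month, day, ...),
    # so the comparison is passed on to the stored date.
    def __eq__(self, other):
        return tuple(self.date) == tuple(other.date)

    def __lt__(self, other):
        return tuple(self.date) < tuple(other.date)


births = ["Aug 19 2003", "Jan 12 1998", "Sep 1 1995", "May 25 2001"]
values = [DateValue(time.strptime(birth, "%b %d %Y")) for birth in births]
years = [value.date.tm_year for value in sorted(values)]
print(years)
```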

Tab-delimited format extensions

The tab-delimited format is the only format that supports Python-based attribute values. (Some others, especially Excel, may follow soon.) The attribute type (the second row) can be given in three ways.

The latter trick, putting arbitrary Python expressions into the file, can sometimes come in handy for specifying the values as well. Here's an example of a file with Python expressions used for specifying values.

pythonvariable.tab

    tear_rate   foo                     lenses
    discrete    python                  none soft hard
                                        class
    reduced     [0, 0, 0, 0, 0]         none
    normal      A(3.14)                 soft
    reduced     math.sqrt(4+a)          none
    normal      a*5                     hard
    reduced     [0, 1, 0, 0, 0]         none
    normal      perfectSquares(100)     soft
    reduced     [0, a, 1, 0, 0]         none

Loading this file requires defining A (this shall be some class), importing math so that we can compute math.sqrt, defining a to be some number, and defining perfectSquares(n) to be a function (which will, in our case, return a list of perfect squares up to n).

part of pythonvariable2d.py

    import orange, math

    def perfectSquares(x):
        return filter(lambda x: math.floor(math.sqrt(x)) == math.sqrt(x), range(x+1))

    class A:
        def __init__(self, x):
            self.x = x

        def __str__(self):
            return "value: %s" % self.x

    a = 12

    data = orange.ExampleTable("pythonvariable.tab")
    for i in data:
        print i

And here's the output:

    ['reduced', '[0, 0, 0, 0, 0]', 'none']
    ['normal', 'value: 3.14', 'soft']
    ['reduced', '4.0', 'none']
    ['normal', '60', 'hard']
    ['reduced', '[0, 1, 0, 0, 0]', 'none']
    ['normal', '[0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100]', 'soft']
    ['reduced', '[0, 12, 1, 0, 0]', 'none']
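The values in the middle column can be reproduced without Orange by evaluating each expression in a scope that holds the same four names. The sketch below does just that in Python 3, where perfectSquares is written as a list comprehension because filter no longer returns a list. The explicit scope dictionary is an assumption of the sketch; Orange takes the scope from the place where ExampleTable is called.

```python
import math


def perfectSquares(n):
    # Numbers up to n whose square root is a whole number.
    return [x for x in range(n + 1)
            if math.floor(math.sqrt(x)) == math.sqrt(x)]


class A:
    def __init__(self, x):
        self.x = x

    def __str__(self):
        return "value: %s" % self.x


a = 12

# The names the expressions in the file refer to.
scope = {"math": math, "A": A, "a": a, "perfectSquares": perfectSquares}

# The contents of the column foo, one string per example.
cells = [
    "[0, 0, 0, 0, 0]",
    "A(3.14)",
    "math.sqrt(4+a)",
    "a*5",
    "[0, 1, 0, 0, 0]",
    "perfectSquares(100)",
    "[0, a, 1, 0, 0]",
]
printed = [str(eval(cell, scope)) for cell in cells]
for text in printed:
    print(text)
```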