Note: this page includes some advanced technical details. The recommended approach is to read it and ignore the parts you don't understand. If things later don't work as expected, read it again.
Warning: at the time of writing this (Aug 24 2004), this stuff is relatively untested, but we will use it in our own work as a kind of beta-testing. Please report any bugs (or remind us to remove this notice eventually :)
Besides the usual discrete and continuous attributes, which are used by learning algorithms, and strings and distributions, which are here for convenience, Orange also supports arbitrary attribute types defined in Python, that is, attributes whose descriptors are Python classes derived from PythonVariable (which is itself derived from Variable).
Such attributes cannot be used by Orange's learning methods, since most learning algorithms only handle discrete and continuous attributes (with many of them covering only one of the two types). You can, however, use attributes defined in Python in your own learning algorithms. Another use for such attributes is describing the examples: by using Python-defined attributes as meta attributes, you can attach arbitrary descriptors to examples. These descriptors won't be used while learning, but can be useful when presenting the examples, or by any auxiliary processes, such as example subset selection. Finally, Python-defined attributes can store data that is converted to ordinary (discrete and continuous) attributes when needed. For instance, if a Python-defined value is a list with the dates of a patient's visits to the doctor, it can be used for constructing a continuous attribute that tells the number of visits, the longest span between two consecutive visits or the time between the first and the second visit.
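The conversion from a list of visit dates to ordinary numbers can be sketched in plain Python, independently of Orange. The function name visit_features, the dictionary keys and the sample dates are assumptions made for this illustration only; in Orange, each number would become the value of a separate continuous attribute.

```python
from datetime import date

def visit_features(visits):
    """Summarise a list of visit dates as three plain numbers."""
    ordered = sorted(visits)
    # Gaps, in days, between each pair of consecutive visits.
    gaps = [(later - earlier).days
            for earlier, later in zip(ordered, ordered[1:])]
    return {
        "visits": len(ordered),
        "longest_gap_days": max(gaps) if gaps else None,
        "first_to_second_days": gaps[0] if gaps else None,
    }

# Made-up visit dates, used only for illustration.
visits = [date(2004, 1, 5), date(2004, 1, 19), date(2004, 3, 1)]
features = visit_features(visits)
```

With fewer than two visits there are no gaps, so the two span values are left as None; an Orange attribute would probably report an unknown value in that case.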
Python attributes can be constructed in a script or loaded from the old-style tab-delimited file, as described below. No other file format can accommodate these attributes.
Attributes
__str__ is preferred over pickling).
Value, and the corresponding Python objects can be stored into the field svalue. The following documentation is written as if useSomeValue is set. If it's not (which you'll seldom need), you need to modify the str2val, val2str and similar functions accordingly.
Associated with PythonVariable is a type PythonValue. PythonValue is a class derived from SomeValue (therefore a sibling of StringValue and Distribution) and stores a Python object. You will most often do without explicitly using PythonValue, since Orange will usually do the conversion for you, except in the cases where this could lead to ambiguity and hard-to-find errors in your scripts. Read on, and you shall see where and why.
Say we have some data loaded and would like to add an attribute with some Python values to the examples. The easiest way to do this is to attach a meta attribute, like this.
part of pythonvariable1.py
The example is certainly weird and senseless, but it shows that the value of such an attribute can be just anything, from tuples and strings to arbitrary Python objects, such as modules and even the example table itself.
If you now check the value of data[1]["foo"] you will discover that it is not a string but an orange.Value. Sure, examples store values, not just anything you throw into them. Orange did the conversion automatically at the above assignments. The actual value can be read through the Value's field value. Therefore, data[1]["foo"].value will return the string "a string". And if you, for any perverse reason, want to use the Bayesian learner through the module and data stored in the attributes, you would write data[2]["foo"].value.BayesLearner(data[3]["foo"].value) (which is, of course, equivalent to orange.BayesLearner(data)).
There is a subtlety in using the value field: assigning, say, data[1]["foo"].value = 15 won't work as intended - see the beginning of the documentation on Value for an explanation.
Like any attribute, PythonVariable can compute its values from values of other attributes, as described in the documentation on the Variable's method getValueFrom. Let us show how this is done on another not necessarily useful example: we shall construct an attribute whose value will be a list of indices, representing the values of the example's other attributes.
part of pythonvariable1.py
The first few examples in newdata look like this.
Each element of the list corresponds to an index of the attribute value.
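What "an index of the attribute value" means can be shown without Orange. The sketch below assumes two attributes with made-up value lists (the names and the order of values are illustrative, not read from any data file) and maps an example to the positions of its values in those lists, which is the kind of list the new attribute holds.

```python
# Hypothetical value lists; in Orange they would come from the attribute descriptors.
attribute_values = [
    ("age", ["young", "pre-presbyopic", "presbyopic"]),
    ("tear_rate", ["reduced", "normal"]),
]

def example_to_indices(example):
    """For each attribute, return the position of the example's value
    in that attribute's list of possible values."""
    return [values.index(example[name]) for name, values in attribute_values]

indices = example_to_indices({"age": "presbyopic", "tear_rate": "reduced"})
```

Here "presbyopic" is the third value of age and "reduced" the first value of tear_rate, so the resulting list is [2, 0].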
Note that the function extolist, which we used as a classifier to put in listvar's getValueFrom, explicitly constructs a PythonValue. Couldn't we just write return map(int, ex), and let Orange treat this as a value? Well, it's time to describe the story behind PythonValue.
As you've probably read in the documentation on Value, Value can store an integer used as an index of a discrete attribute value, a floating-point value of a continuous attribute or a value derived from SomeValue. The latter is stored in the value's field svalue (or so it seems from Python; the actual C++ field is named differently). Field value is a kind of synonym for all three - it can return an integer, a float or a SomeValue, depending upon the attribute (value) type.
In the first example, we have set the value of data[i]["foo"]. Orange knows that the corresponding attribute (data.domain["foo"]) is of type PythonVariable and converts the passed value (a tuple, string, module, object) accordingly. If data.domain["foo"] were a discrete attribute (EnumVariable), it would accept the value "string" (if "string" was among the attribute's possible values) and raise a type error in other cases.
No such check can be done in function extolist. Classifiers are expected to return values and Orange would be all too happy to convert a list returned by map(int, ex) to a Value if it only knew how. But it has no idea what type of attribute this value is supposed to belong to. If this is a value of a PythonVariable, it's alright, but if it's a discrete attribute, we'd have to raise an exception. Orange could, in principle, observe the value's type, conclude that this cannot be anything else than a PythonVariable and return a PythonValue, but this would be dangerous: any time you misconstructed a value, Orange would silently convert it to a PythonValue, which would cause trouble God knows where.
There is however a workaround. You can do this as follows.
Orange now knows that the classifier returns values of attribute listvar, which is of type PythonVariable, so it can convert map(int, ex) into a value. (OK, could you write extolist.classVar = listvar? See documentation on deriving classes from Orange classes for an explanation why not. And, again, if you don't understand something here on this page, just skip it.)
Storing Python values to files and reading them, and deriving new attribute types from PythonVariable are two closely related topics. The basic job of attribute descriptors, that is, instances of classes derived from Variable, is to convert the values to and from a string representation, so that they can be saved to and loaded from text-based files (in whatever format and with whichever delimiters), and printed and set by the user.
All attribute descriptors define two methods: str2val, which takes a string and returns a Value, and val2str, which does the opposite. You don't need to know about these two methods for other attribute types (and even have no direct access to them), but you indirectly use them all the time. If you inquire about the value of data[0]["age"] and see it's "young" or if you set it to "presbyopic", this goes through data.domain["age"]'s val2str and str2val, respectively.
If you want to define a special syntax for your Python-based attribute, you will need to derive a new Python class from PythonVariable and define the two functions.
part of pythonvariable2a.py
Here we defined an attribute to represent a date. We used Python's module time, whose functions strptime and strftime convert a date, represented as a string in a given format, to an instance of time.struct_time, used for representing dates, and back. The string formats for str2val and val2str do not need to match. See this.
When giving a value, we specify a month (a three-letter abbreviation), a day of month and a year. When the value is printed, a weekday is added.
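The two conversions can be tried out on their own, without deriving anything from PythonVariable. In this sketch the functions are plain module-level functions rather than methods, and the exact format strings are assumptions chosen to match the description (abbreviated month, day and year on input; weekday added on output). Month and weekday abbreviations depend on the locale; an English locale is assumed.

```python
import time

def str2val(text):
    # Input format: abbreviated month, day of month, year, e.g. "Aug 19 2003".
    return time.strptime(text, "%b %d %Y")

def val2str(value):
    # The output format additionally shows the abbreviated weekday.
    return time.strftime("%a %b %d, %Y", value)

born = str2val("Aug 19 2003")   # a time.struct_time
printed = val2str(born)         # the same date, rendered in the output format
```

Because the two formats differ, a string produced by val2str cannot be fed back to str2val, which matters once values are saved to files.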
Special values are treated separately: empty strings, question marks and tildes are converted to values without calling str2val, and special values are converted to strings without val2str. However, str2val can still return a special value if the string denotes one in some special syntax. To do this, it should return PythonValueSpecial(type), where type is orange.ValueTypes.DC (which equals 1 and means don't care), orange.ValueTypes.DK (2, don't know) or any other non-zero integer (which will denote a special value of other types you need).
Let us construct an example table that will include a new attribute: we shall load the lenses data set, add the new attribute and set its value for the first example.
part of pythonvariable2a.py (uses lenses)
You can also save newdata to a tab-delimited file (other formats do not support Python-based attributes).
If val2str is not defined, Orange will "print" the value to a string. The alternative to defining the DateVariable's val2str is defining a special Python class that will represent a date and overload its method __str__, like in the following example.
part of pythonvariable2b.py
You may sometimes want to use a different string representation for saving to and loading from files. This is useful when the object is rather complex, so you need a simpler (yet possibly inaccurate) form for printing the value and a more complex form for storing it. Also, it may sometimes be inconvenient or even impossible to parse the human-readable strings. Finally, we would even have problems saving the above attribute since str2val and val2str use different date formats.
To define a different representation for saving values to files, you need to define methods filestr2val and val2filestr. They are similar to str2val and val2str, except that they get an additional argument: the example that is being read or written. In the former case, the example may be half constructed: the line in a file is always interpreted from left to right, so some values are already set while others are random (you may notice they are actually not, but refrain from using them to avoid incompatibilities with future versions of Orange).
For our DateVariable, the two additional functions could, for instance, look as follows.
part of pythonvariable2c.py
We have added a new representation for unknown values: the string "unknown" is translated to DK. Just for fun, we use a different date format - month (given numerically), day and year, separated by slashes.
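The file representation just described can be sketched as two plain functions. This is not the original listing: SpecialValue is a hypothetical stand-in for Orange's special values, the constant DK merely repeats the numeric code 2 quoted earlier, and the example argument is accepted but ignored.

```python
import time

class SpecialValue:
    """Hypothetical stand-in for a special (unknown) value; not an Orange class."""
    def __init__(self, kind):
        self.kind = kind

DK = 2  # "don't know", the numeric code quoted in the text

def filestr2val(text, example):
    # `example` is the (possibly half-built) example being read; unused here.
    if text == "unknown":
        return SpecialValue(DK)
    return time.strptime(text, "%m/%d/%Y")

def val2filestr(value, example):
    # `example` is the example being written; unused here.
    if isinstance(value, SpecialValue):
        return "unknown"
    return time.strftime("%m/%d/%Y", value)

stored = val2filestr(filestr2val("08/19/2003", None), None)
```

Since both functions use the same format, a value written to a file is read back unchanged, which is the property the printing pair lacked.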
PythonVariable has a flag usePickle. If it is set and val2filestr is undefined, Orange will pickle values when saving to a file. To accommodate the file's limitations, newlines in the pickled string are changed to "\n" (if you attempt to manually unpickle the strings you find in files, you'll need to convert this back). See the Python documentation on module "pickle" for details on pickling; basically, Orange will use the pickle.dumps function, which can convert practically any Python object to a string (a concept which is also known as serialization (in Java) or marshalling).
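A minimal sketch of squeezing a pickled value into a single line and getting it back follows. It is written for present-day Python 3 and is not Orange's code: the function names are made up, and the extra escaping of backslashes is an assumption added here so that the newline replacement can be undone unambiguously; the text above only states that newlines become "\n".

```python
import pickle
import re

def to_file_field(value):
    # Protocol 0 is the text-based pickle format; its output contains newlines.
    text = pickle.dumps(value, protocol=0).decode("latin-1")
    # Escape backslashes first so the newline escape can be reversed safely.
    return text.replace("\\", "\\\\").replace("\n", "\\n")

def from_file_field(field):
    # Undo both escapes in a single left-to-right pass.
    text = re.sub(r"\\(.)",
                  lambda m: "\n" if m.group(1) == "n" else m.group(1),
                  field)
    return pickle.loads(text.encode("latin-1"))

value = {"note": "two\nlines", "path": "C:\\new", "visits": [1, 2.5, None]}
field = to_file_field(value)
restored = from_file_field(field)
```

The field contains no real newline characters, so it fits into one cell of a tab-delimited file, and unpickling it returns an object equal to the original.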
Finally, here's how loading and saving from files goes. Converting a value read from the file goes like this:
If filestr2val is defined, it's called. An error is reported if it fails.
If usePickle is not set and str2val is defined, str2val is called. An error is reported if this fails.
If usePickle is set, Python attempts to unpickle the string. If this fails, Orange continues with the next step.
ExampleTable(filename). The example that is being constructed (the same object as the last argument of filestr2val) will be present as a local variable __fileExample).
Writing to a file goes as follows.
If val2filestr is defined, it's used.
If usePickle is set, the value is pickled.
val2str is called if defined.
__str__, as shown above.
Although it may seem complicated, this order is natural and will work seamlessly - if you just redefine what you think sensible, Orange will probably work it out fine. Only if it doesn't, check the above steps to determine what went wrong.
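The order used when writing can be expressed as a short chain of checks. The sketch below only mirrors the listed steps; the function value_to_file_string and the three small descriptor classes are hypothetical and work on ordinary Python objects, not on Orange descriptors.

```python
import pickle

def value_to_file_string(descriptor, value, example=None):
    """Choose a file representation in the order listed above."""
    if hasattr(descriptor, "val2filestr"):
        return descriptor.val2filestr(value, example)
    if getattr(descriptor, "usePickle", False):
        pickled = pickle.dumps(value, protocol=0).decode("latin-1")
        return pickled.replace("\n", "\\n")
    if hasattr(descriptor, "val2str"):
        return descriptor.val2str(value)
    return str(value)  # falls back to the value's own __str__

class Plain:                 # hypothetical descriptor with nothing redefined
    usePickle = False

class WithVal2Str(Plain):    # hypothetical descriptor defining only val2str
    def val2str(self, value):
        return "<%s>" % value

class Pickling(WithVal2Str): # pickling takes precedence over val2str
    usePickle = True

as_plain = value_to_file_string(Plain(), 15)
as_val2str = value_to_file_string(WithVal2Str(), 15)
as_pickle = value_to_file_string(Pickling(), 15)
```

Each descriptor triggers a different branch, which makes it easy to see which step produced a given string when something looks wrong in a saved file.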
Attribute descriptors derived from Variable may support methods for returning a sequence of values, random values and the number of different values. You don't need to provide those methods. If you want, here are the methods you will need to define.
Overloadable methods of PythonVariable
value.
You can also redefine two of PythonValue's methods. Besides __str__, which we examined above, you can also redefine __cmp__(self, other), which should return a negative integer if self is smaller than other, zero if they are equal and a positive integer if self is greater. The meaning of "smaller" and "greater" depends upon the type of the attribute. If you leave the function undefined, Orange will use Python's built-in comparison function, which is great, since all decent Python objects support sensible comparisons. For instance, let us continue with the example in which we first defined DateVariable (but haven't defined DateValue).
part of pythonvariable2a.py
This sorts the examples according to the date of birth (it might well be that people born in 2003 do not wear contact lenses at the time of writing this documentation, but we expect Orange to be around for a while :).
This won't work with PythonValue as we defined it (the second example, where we showed how to redefine __str__). To fix it, we need to define a method __cmp__ for PythonValue. The easiest way is to pass the work to Python, like this.
part of pythonvariable2b.py
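The following is not the listing from pythonvariable2b.py but a stand-alone sketch of the same idea. The __cmp__ method belongs to Python 2, for which this page was written; Python 3 no longer calls it, so the sketch expresses the same negative/zero/positive contract as a plain function and hands it to sorted through functools.cmp_to_key. The class DateValue here is a hypothetical plain Python class, not derived from PythonValue, and the date formats are assumptions.

```python
import time
from functools import cmp_to_key

class DateValue:
    """Hypothetical value class wrapping a time.struct_time."""
    def __init__(self, text):
        self.date = time.strptime(text, "%b %d %Y")

    def __str__(self):
        return time.strftime("%a %b %d, %Y", self.date)

def compare(a, b):
    # Negative, zero or positive; the work is passed to Python's own
    # ordering of the wrapped struct_time objects.
    return (a.date > b.date) - (a.date < b.date)

values = [DateValue("Aug 19 2003"), DateValue("Jan 2 1970"), DateValue("Mar 5 1999")]
ordered = sorted(values, key=cmp_to_key(compare))
```

Comparing the wrapped struct_time objects works because they order like tuples, with the year first, then the month, day and so on.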
Tab-delimited format is the only format that supports Python-based attribute values. (Some others, especially Excel, may follow soon.) The attribute type (the second row) can be given in three ways.
python, a descriptor of type PythonVariable will be constructed. Orange will try to unpickle the attribute values and treat them as Python expressions if this fails. This way you can give strings, lists and tuples, and as you will see soon, also more complex types.
python:AttributeDescriptorType, the attribute of the corresponding type will be constructed. Type may be PythonVariable (in which case writing only python would have the same effect), it can be DateVariable or even orange.EnumVariable (having the same effect as writing d or discrete). The descriptor type must be defined in the Python scope from which the example loading was called. In addition, you must give the full name of the type. If DateVariable is defined in a module orngDates and you imported it using import orngDates (and not from orngDates import *), the corresponding type definition would be python:orngDates.DateVariable.
python:expression, where expression is any Python expression whose evaluation results in an attribute descriptor. You will most often use it to call the descriptor's constructor, for instance python:DateVariable(arg1, arg2, arg3=15) or python:orange.FloatVariable(numberOfDecimals=5). You can, however, use arbitrary expressions here. The type can be defined as python:evar, and you would need to execute something like evar = orange.EnumVariable() prior to loading the data. Or, you can call a function that returns an attribute descriptor, or select an attribute descriptor from a list...
The latter trick of using arbitrary expressions can sometimes come in handy for specifying the values as well. Here's an example of a file with Python expressions used for specifying values.
Loading this file requires defining A (this shall be some class), importing math so we can compute math.sqrt, defining a to be some number, and defining perfectSquare(n) to be a function (which will, in our case, return a list of perfect squares up to n).
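How such expressions turn into values can be imitated with Python's eval and an explicit scope. Everything specific in this sketch is an assumption: the body of class A, the number chosen for a, the reading of perfectSquare as "all squares not exceeding n", and the four sample expressions, which are illustrative and not the contents of the file described above. Whether Orange itself calls eval in this way is not stated on this page.

```python
import math

class A:
    """Placeholder class; the text only says that A is some class."""
    def __init__(self, x):
        self.x = x

a = 12  # "some number"

def perfectSquare(n):
    # All perfect squares not exceeding n (one reading of "up to n").
    return [i * i for i in range(1, math.isqrt(n) + 1)]

# The names the expressions are allowed to see.
scope = {"A": A, "math": math, "a": a, "perfectSquare": perfectSquare}

# Illustrative cell contents, not taken from the original file.
cells = ["math.sqrt(16)", "perfectSquare(20)", "a + 1", "A(3).x"]
results = [eval(cell, scope) for cell in cells]
```

An expression that refers to a name missing from the scope raises NameError, which corresponds to the requirement that everything used in the file be defined before loading.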
part of pythonvariable2d.py
And here's the output: