Attribute Descriptors

Attribute descriptors are stored in objects derived from type orange.Variable. Their role is to identify the attributes. Two attributes in Orange are the same if they have the same descriptor, not the same name. Besides, descriptors store symbolic names for attributes and their symbolic values. Another important feature of orange.Variable is that it can define a method by which an attribute value is computed from other attributes; this is used in, for instance, discretization.

Variables can be constructed the usual way, through constructors, or by calling functions orange.Variable.getExisting or orange.Variable.make. These functions search through the existing variables to find one with the same name, type and, for discrete attributes, values. If they succeed, the existing descriptor (an instance of Variable) is returned. If none is found, orange.Variable.getExisting returns None, while orange.Variable.make creates a new variable. By using these two functions, same-named attributes can be the same attributes. This is needed for loading the data, while typical user-written scripts seldom require it, as they can store and reuse descriptors themselves. The functions are described later on.
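
The identity rule can be illustrated without Orange at all. The following is a minimal sketch in plain Python; the class ToyDescriptor and the helper same_attribute are hypothetical stand-ins introduced only for illustration and are not part of Orange.

```python
class ToyDescriptor(object):
    """Hypothetical stand-in for an attribute descriptor (not an Orange class)."""

    def __init__(self, name):
        self.name = name


def same_attribute(d1, d2):
    # Attributes are the same only if they share one descriptor object;
    # the names are not consulted at all.
    return d1 is d2


age_train = ToyDescriptor("age")
age_test = ToyDescriptor("age")   # same name, but a separate descriptor
age_alias = age_train             # a second reference to the first descriptor

print(same_attribute(age_train, age_test))   # False
print(same_attribute(age_train, age_alias))  # True
```

A script that keeps a reference such as age_alias never needs a name-based lookup, which is why the search functions matter mostly when data is loaded.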

Variable

orange.Variable is a base class for attribute descriptors.

Attributes

name
Each attribute has a name. An empty string is a perfectly legal name that can and should be used for temporary or very internal attributes. Two attributes can have the same name: Orange does not distinguish attributes by name except when communicating with the user (when the user wants to see the value of attribute 'age', the name is obviously used) or when loading the data (see the explanation in Supported File Formats). However, if two attributes with the same name appear in the same domain and indexing by name is used, the results of user queries are unpredictable. In general, avoid giving the same name to different attributes.
varType
varType is an integer describing the attribute type. As for orange.Value's varType, it can be orange.VarTypes.Discrete (1), orange.VarTypes.Continuous (2) or orange.VarTypes.Other.
getValueFrom
When an attribute is derived from other attributes, e.g. through discretization, binarization or some form of constructive induction, getValueFrom points to a "function" that computes the value of the attribute from values of other attributes. The function is actually an orange.Classifier: its input is an orange.Example whose values are used to compute the value of the derived attribute, and its result is the computed value. A great thing about this is that it usually happens behind your back. Moreover, you should never call getValueFrom directly; call the method computeValue instead, which guards against deadlocks caused by circular calls.

Although getValueFrom is always of type orange.Classifier, you can set it to an ordinary Python function or callable class. Orange will automatically wrap it into an orange.Classifier, as described in Subtyping Orange classes in Python.

See the corresponding example below.

ordered
A flag telling whether the attribute values are ordered. At the moment, no method actually treats ordinal attributes differently than nominal, so this flag is reserved for future use.
distributed
A flag that tells whether the values of this attribute are distributions. As for flag ordered, no methods treat such attributes in any special manner, so the flag is again reserved for future use.
sourceVariable
Another attribute for potential use in future: if getValueFrom computes the attribute value from a single attribute, this attribute can be (but is not necessarily) stored in sourceVariable. As this is only used in a rather obscure place you won't run into, there's no harm in not ever setting sourceVariable.
randomGenerator
Local random number generator used by method randomvalue.
defaultMetaId
A proposed meta id to be used with that variable. By default it is set to 0; when the attribute is first registered with any domain. It does not mean that the attribute should always have this same meta id. defaultMetaId is, for instance, used by the data loader for tab-delimited file format, or by function newmetaid, if the variable is passed as an argument.

Methods

<constructors>
Constructors for classes derived from orange.Variable (which is abstract itself) can be given the usual keyword arguments. Besides, the attribute name can be given directly. That is, an attribute descriptor for continuous attribute "age" can be constructed by calling orange.FloatVariable("age") or, equivalently, by orange.FloatVariable(name="age").
<call>
Calling a descriptor can be used to convert symbolic, integer or any other applicable native Python types into orange.Value objects for this attribute. Calling var(val) is equivalent to orange.Value(var, val); see construction of values.
<iteration>
Attribute descriptors can be used in for loops. So for val in var would iterate through all values of attribute var, when possible.
randomvalue()
randomvalue returns a random value for the attribute, when possible. This function uses randomGenerator; if none has been assigned yet, a new one is constructed with the initial seed 0 and stored for future use.
computeValue(example)
Calls getValueFrom through a mechanism that prevents deadlocks by circular calls.
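
How such a guard might work can be sketched in plain Python. This is only one possible mechanism under stated assumptions, not Orange's actual implementation; the class ToyVariable, its attribute _computing and the snake_case method names are all hypothetical.

```python
class ToyVariable(object):
    """Hypothetical descriptor with a guarded compute_value (not Orange code)."""

    def __init__(self, name, get_value_from=None):
        self.name = name
        self.get_value_from = get_value_from
        self._computing = False   # True while a computation is in progress

    def compute_value(self, example):
        if self.get_value_from is None:
            raise ValueError("%s has no get_value_from" % self.name)
        if self._computing:
            # We were asked for our own value while already computing it.
            raise RuntimeError("circular computation of %s" % self.name)
        self._computing = True
        try:
            return self.get_value_from(example)
        finally:
            self._computing = False


# 'double' is computed from 'x'; 'loop' (wrongly) depends on itself.
double = ToyVariable("double", lambda ex: 2 * ex["x"])
loop = ToyVariable("loop")
loop.get_value_from = lambda ex: loop.compute_value(ex)

print(double.compute_value({"x": 21}))   # 42
try:
    loop.compute_value({"x": 1})
except RuntimeError as err:
    print(err)                           # circular computation of loop
```

Calling get_value_from directly would bypass the flag, which mirrors the advice to always go through computeValue.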

EnumVariable

EnumVariable is a descriptor for nominal and ordinal attributes. It defines two additional attributes, values and baseValue, and one additional method, addValue. Iterating and returning random values is supported.

Attributes

values
A list with symbolic names for the attribute's values. Values for attributes of this type are stored as integers referring to this list. Therefore, modifying this list instantly changes the names of values of examples, as they are printed out or referred to by the user. The size of the list is also used to indicate the number of possible values for this attribute; changing the size, especially shrinking the list, can have disastrous effects and is therefore not recommended. Also, do not add values to the list by calling its append or extend method: call EnumVariable.addValue described below.

It is also assumed that values is always defined (but can be empty), so you should never set values to None.

baseValue
Sets the base value for the attribute. This can be, for instance, a "normal" value, such as "no complications" as opposed to abnormal "low blood pressure" and "excessive bleeding". The base value can be (and is) used by certain statistics and, potentially, learning algorithms. baseValue is an integer that is to be interpreted as an index to values. The absence of a base value ("sex" can be either "female" or "male", without an obvious base value) is indicated by -1.

Methods

addValue(string)
Adds a value to values. Always call this function instead of appending to values.
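
The relation between stored integers, values and baseValue can be sketched in plain Python. The class ToyEnum below is a hypothetical stand-in, not an Orange class; in this toy, add_value simply appends, whereas in Orange addValue is the supported way to extend the list.

```python
class ToyEnum(object):
    """Hypothetical nominal descriptor; data holds indices into values."""

    def __init__(self, name, values):
        self.name = name
        self.values = list(values)
        self.base_value = -1      # -1 means "no base value"

    def add_value(self, symbol):
        # Single entry point for growing the list of values.
        self.values.append(symbol)

    def to_string(self, index):
        return self.values[index]


status = ToyEnum("status", ["no complications", "low blood pressure"])
status.base_value = 0             # index of the "normal" value
stored = [0, 1, 0]                # what three examples would hold

print([status.to_string(i) for i in stored])
status.values[1] = "hypotension"  # renaming changes how stored data prints
print([status.to_string(i) for i in stored])
status.add_value("excessive bleeding")
print(len(status.values))         # 3
```

Because the data holds only indices, shrinking the list would leave stored indices pointing past its end, which is the kind of damage the warning above refers to.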

FloatVariable

FloatVariable is a descriptor for continuous attributes.

startValue, endValue, stepValue
The range of the attribute, used for returning random values and for iteration. You can leave the three values at defaults (-1, which is interpreted as undefined), if you don't need randoms and iterations. (I can't recall ever using them...)
numberOfDecimals
The number of decimals used when the value is printed, converted to a string or saved to a file.
scientificFormat
If True, the value is printed in scientific format whenever it would have more than 5 digits. In this case, numberOfDecimals is ignored.
adjustDecimals
Tells Orange to monitor the number of decimals when the value is converted from a string (either by setting the attribute values, e.g. example[0]="3.14" or when reading from file). The value of 0 means that the number of decimals should not be adjusted, while 1 and 2 mean that adjustments are on, with 2 denoting that no values have been converted yet.

By default, adjustment of number of decimals goes as follows. If the attribute was constructed when examples were read from a file, it will be printed with the same number of decimals as the largest number of decimals encountered in the file. If scientific notation occurs in the file, scientificFormat will be set to True and scientific format will be used for values too large or too small.

If the attribute is created in a script, it will have, by default, three decimal places. This can be changed either by setting the attribute value from a string (e.g. example[0]="3.14", but not example[0]=3.14) or by manually setting numberOfDecimals (e.g. attr.numberOfDecimals=1).
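
The "largest number of decimals encountered" rule can be sketched in plain Python. The helpers count_decimals and decimals_for_column are hypothetical and only approximate the behavior described above; in particular, they do not handle scientific notation.

```python
def count_decimals(text):
    """Number of digits after the decimal point in a plain decimal string."""
    text = text.strip()
    if "." not in text:
        return 0
    return len(text.split(".", 1)[1])


def decimals_for_column(strings):
    # Use the largest number of decimals seen in the column (0 if empty).
    return max([count_decimals(s) for s in strings] or [0])


column = ["3.14", "2.5", "10", "0.125"]
n = decimals_for_column(column)
print(n)                                          # 3
print(["%.*f" % (n, float(s)) for s in column])
```

Every value is then printed with n decimals, so "2.5" appears as 2.500 in this toy column.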

StringVariable

StringVariable describes attributes that contain strings. No method can use them for learning; some will complain and others will silently ignore them when they encounter them. They can, however, be useful for meta-attributes; if examples in a dataset have unique ids, the most efficient way to retain them is to read them as meta-attributes. In general, never use discrete attributes with many (say, more than 50) values. Such attributes are probably not of any use for learning and should be stored as string attributes.

There's a short and simple example which makes use of StringVariable near the end of the page about Domain.

When converting strings into values and back, empty strings are treated differently than usual. For other types, an empty string can be used as a synonym for the question mark ("don't know"), while StringVariable will take an empty string as an empty string -- that is, except when loading from or saving to a file. Empty strings in files are interpreted as "don't know". You can, however, enclose the string in double quotes; these get removed when the string is loaded. Therefore, to give an empty string, put it in double quotes, "".
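
The file-loading rule for string attributes can be sketched in plain Python. The function read_string_field and the UNKNOWN marker are hypothetical; the sketch covers only the three cases described above and is not Orange's loader.

```python
UNKNOWN = None   # stands for "don't know" in this sketch


def read_string_field(field):
    """Interpret one raw field of a file for a string attribute."""
    if field == "":
        return UNKNOWN                 # empty field: value not known
    if len(field) >= 2 and field[0] == '"' and field[-1] == '"':
        return field[1:-1]             # strip the enclosing double quotes
    return field


print(read_string_field(""))           # None
print(repr(read_string_field('""')))   # ''
print(read_string_field('"John"'))     # John
print(read_string_field("Mary"))       # Mary
```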

PythonVariable

PythonVariable is a base class for descriptors defined in Python. Fully functional in itself, PythonVariable can already be used as a descriptor for attributes that contain arbitrary Python values. Since this is an advanced topic, PythonVariables are described on a separate page.

Using getValueFrom

Monk 1 is a well-known dataset with target concept y := a==b or e==1. It does not hurt, and can even help, to replace the four-valued attribute e with a binary attribute having values 1 and not 1. The new attribute shall be computed from the old one on the fly.

part of variable.py (uses monk1.tab)

    import orange

    data = orange.ExampleTable("monk1")
    e2 = orange.EnumVariable("e2", values=["not 1", "1"])

    def checkE(example, returnWhat):
        if example["e"] == "1":
            return orange.Value(e2, "1")
        else:
            return orange.Value(e2, "not 1")

    e2.getValueFrom = checkE

Our new attribute is named e2; we define it by descriptor of type orange.EnumVariable, with appropriate name and values not 1 and 1 (we chose this order so that the not 1's index is 0, which can be, if needed, interpreted as false).

checkE is a function that is passed an example and another argument we don't care about. If the example's attribute e equals 1, the function returns value 1, otherwise it returns not 1. Both are returned as values of attribute e2, not as plain strings. Finally, we tell e2 to use checkE to compute its value when needed, by assigning checkE to getValueFrom.

In most circumstances, value of e2 can be computed on the fly - we can pretend that the attribute exists in the data, although it doesn't (but can be computed from it). For instance, we can observe the conditional distribution of classes with regard to e2.

    >>> dist = orange.Distribution(e2, data)
    >>> print dist
    <324.000, 108.000>
    >>>
    >>> cont = orange.ContingencyAttrClass(e2, data)
    >>> print "Class distribution when e=1:", cont["1"]
    Class distribution when e=1: <0.000, 108.000>
    >>> print "Class distribution when e<>1:", cont["not 1"]
    Class distribution when e<>1: <216.000, 108.000>

orange.Distribution is called to compute the distribution for e2 in data. When it notices that data.domain does not contain e2, it checks whether e2's getValueFrom is defined and, seeing that it is, utilizes it to get e2's values.

We describe technical details to make you aware that automatic recomputation requires some effort on the side of orange.ContingencyAttrClass. There are methods which will not do that for you, because it would be either too complex or too time consuming. An example of such a situation is constructive induction by function decomposition; making incompatibility matrices with attributes computed on the fly would be slow and impractical, so attempting it would yield an error. In such cases, you can simply convert the entire example table to a new domain that also includes the new attribute.

part of variable.py (uses monk1.tab)

    newDomain = orange.Domain([data.domain["a"], data.domain["b"], e2, data.domain.classVar])
    newData = orange.ExampleTable(newDomain, data)

Automatic computation is useful when the data is split into training and testing examples. Training examples can be modified by adding, removing and transforming attributes (in a typical setup, continuous attributes are discretized prior to learning, therefore the original attributes are replaced by new attributes), while testing examples are left as they are. When they are classified, the classifier automatically converts the testing examples into the new domain, which includes recomputation of transformed attributes. With our toy script, we can split the data, use it for learning and then test the classification of unmodified test examples.

variable2.py (uses monk1.tab)

    import orange, orngTree

    data = orange.ExampleTable("monk1")

    indices = orange.MakeRandomIndices2(data, p0=0.7)
    trainData = data.select(indices, 0)
    testData = data.select(indices, 1)

    e2 = orange.EnumVariable("e2", values=["not 1", "1"])
    e2.getValueFrom = lambda example, returnWhat: orange.Value(e2, example["e"]=="1")

    newDomain = orange.Domain([data.domain["a"], data.domain["b"], e2, data.domain.classVar])
    newTrain = orange.ExampleTable(newDomain, trainData)

    tree = orange.TreeLearner(newTrain)

    orngTree.printTxt(tree)

    for ex in testData[:10]:
        print ex.getclass(), tree(ex)

First, note that we have rewritten the above example, replacing the checkE function with a simpler lambda function, which exploits the fact that Python's false and true equal 0 and 1. We have split the data into trainData and testData, with 70% and 30% of examples, respectively. After constructing a new domain, we only translate the training examples and induce a decision tree. The printout shows that it first splits the examples by the attribute e2 and then, if e2 is not 1, it (implicitly) checks the equality of a and b. In the for loop, examples from testData, which does not have attribute e2, are correctly classified. The way this is done is the same for all classifiers: the classifier stores the domain description for the learning examples (or, to be more precise, a domain in which the model is described). Prior to classification, examples from other domains are converted to the stored domain. In our case, examples from testData are converted to newDomain, and the given lambda function is used to compute the value of e2 from e.
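
The conversion step can be sketched in plain Python. This is a toy model under stated assumptions, not Orange's conversion code: examples are plain lists, domains are lists of descriptors, and the names ToyAttr and convert are hypothetical. Unlike Orange's getValueFrom, the toy function receives the source domain as its second argument.

```python
class ToyAttr(object):
    """Hypothetical descriptor; compared by identity, as in Orange."""

    def __init__(self, name, get_value_from=None):
        self.name = name
        self.get_value_from = get_value_from


def convert(example, source_domain, target_domain):
    """Translate a list of values from one list of descriptors to another."""
    converted = []
    for attr in target_domain:
        if attr in source_domain:
            # Same descriptor in both domains: copy the value.
            converted.append(example[source_domain.index(attr)])
        elif attr.get_value_from is not None:
            # Not present, but computable from the source example.
            converted.append(attr.get_value_from(example, source_domain))
        else:
            converted.append(None)   # unknown value
    return converted


a, b, e, y = [ToyAttr(n) for n in "abey"]
old_domain = [a, b, e, y]
e2 = ToyAttr("e2", lambda ex, dom: "1" if ex[dom.index(e)] == "1" else "not 1")
new_domain = [a, b, e2, y]

print(convert(["1", "2", "1", "1"], old_domain, new_domain))
print(convert(["1", "2", "3", "0"], old_domain, new_domain))
```

A classifier that remembers new_domain could call convert on every incoming example before classifying it, which is the behavior described above.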

What to do if an attribute can be computed from different domains, using different procedures? Can there be more than one function to be tried? Why is there only one getValueFrom, not a list of them? Although we are pretty advanced Orange users, we never ran into a situation where we needed this (obviously; if we needed it, we'd have done something about it :). If you, however, need to specify more than one function for attribute value computation, you can define a Python class that stores a list of functions and calls them in an appropriate manner. Then give an object of this class to getValueFrom. And tell us about your case, and we shall rethink our position.
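
Such a class could look like the following sketch. It is written in plain Python with examples represented as dictionaries; the class name FirstApplicable and the two lambda functions are hypothetical. The sketch assumes that a function fails with KeyError when the example lacks the attribute it needs, which is what dictionaries do; what an orange.Example raises in that case is not covered here.

```python
class FirstApplicable(object):
    """Tries several value-computing functions until one succeeds."""

    def __init__(self, functions):
        self.functions = list(functions)

    def __call__(self, example, returnWhat=None):
        for function in self.functions:
            try:
                return function(example, returnWhat)
            except KeyError:
                continue          # this function needs attributes we lack
        raise ValueError("no function can compute the value")


# Two ways of computing the same flag from differently shaped examples.
from_e = lambda example, returnWhat: example["e"] == "1"
from_e_num = lambda example, returnWhat: example["e_num"] == 1

compute_e2 = FirstApplicable([from_e, from_e_num])
print(compute_e2({"e": "1"}))       # True
print(compute_e2({"e_num": 3}))     # False
```

An instance such as compute_e2 is callable with the same two arguments as checkE, so it could be assigned to getValueFrom in the same way.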

Advanced: Reuse of Descriptors

There are situations when the attribute descriptor may need to be reused, yet the reference to it is not available. Typically, the user loads some training examples, trains a classifier and then loads a separate test set. For the classifier to recognize the attributes in the second data set, the descriptors, not just the names, need to be the same. This problem was first solved by requiring the user to explicitly provide the "original" Domain, which mystified too many, so later on Orange used domain depots where it looked for suitable domains to reuse without any user intervention. This worked - with a few nasty exceptions - until Orange started to (tend to) support pickling: as unpickling always created new attributes, unpickled classifiers (or data or any other object storing references to descriptors) were useless.

Orange now maintains a list of all existing Variables and can check it before constructing new variables. This is done while loading the data, will be used for unpickling and can be explicitly used by the user. Creating variables directly, with constructors (EnumVariable() etc) always constructs brand new variables.

The search is based on four arguments: the attribute's name, type, ordered values and unordered values. As for the latter two, the values can be explicitly ordered by the user, e.g. in the second line of the tab-delimited file, for instance to order sizes as small-medium-big.

The search for existing variables can end with one of the following statuses. (Note: Use symbolic constants, not integer numbers given in parentheses; we may introduce a new status, ExtraValues between OK and MissingValues. You can, however, count on the order of statuses to stay the same.)

  • orange.Variable.MakeStatus.NotFound (4): the attribute with that name and type does not exist
  • orange.Variable.MakeStatus.Incompatible (3): there are one or more attributes with matching name and type, but their list of values is incompatible with the prescribed ordered values. For example, if the existing variable already has values ["a", "b"] and the new one wants ["b", "a"], this is no go. The existing list can, however, be extended by the new values, so searching for ["a", "b", "c"] would succeed. So will the search for ["a"], since the extra existing value does not matter. The formal rule is thus that the values are compatible if existing_values[:len(ordered_values)] == ordered_values[:len(existing_values)].
  • orange.Variable.MakeStatus.NoRecognizedValues (2): there is a matching attribute, yet it has none of the values that the new attribute will have (this is obviously possible only if the new attribute has no prescribed ordered values). For instance, we search for an attribute "sex" with values "male" and "female", while there is an attribute of the same name with values "M" and "F" (or, well, "no" and "yes" :). Reuse of this attribute is possible, though this should probably be a new attribute since it obviously comes from a different data set. If we do decide for reuse, the old attribute will get some unneeded new values and the new one will inherit some from the old.
  • orange.Variable.MakeStatus.MissingValues (1): there is a matching attribute with some of the values that the new one requires, but some values are missing. This situation is neither uncommon nor suspicious: in case of separate training and testing data sets there may be attribute values which occur in one set but not in the other.
  • orange.Variable.MakeStatus.OK (0): there is an attribute which contains all the prescribed values in the correct order. The existing attribute may have some extra values, though.
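
The formal compatibility rule given for the Incompatible status can be checked directly in plain Python. The function name ordered_values_compatible is hypothetical; the expression inside it is the rule quoted in the list above.

```python
def ordered_values_compatible(existing_values, ordered_values):
    """The prefix rule: the shorter list must be a prefix of the longer one."""
    existing_values = list(existing_values)
    ordered_values = list(ordered_values)
    return (existing_values[:len(ordered_values)]
            == ordered_values[:len(existing_values)])


existing = ["a", "b"]
print(ordered_values_compatible(existing, ["b", "a"]))       # False
print(ordered_values_compatible(existing, ["a", "b", "c"]))  # True
print(ordered_values_compatible(existing, ["a"]))            # True
print(ordered_values_compatible(existing, []))               # True
```

The last call shows that an empty list of ordered values is compatible with anything, which is why the remaining statuses depend on the unordered values as well.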

Continuous attributes can obviously have only two statuses, NotFound or OK.

When loading the data using orange.ExampleTable, Orange takes the safest approach and, by default, reuses everything that is compatible, that is, up to and including NoRecognizedValues. Unintended reuse would be obvious from the attribute having too many values, which the user can notice and fix. More on that in the page on loading data.

There are two functions for reusing the attributes instead of creating new ones.

Variable.make(name, type[, ordered-values, unordered-values, createNewOn])

The type should be one of the types in orange.VarTypes, e.g., orange.VarTypes.Discrete. Values can be given with any iterable type (list, set...). The optional createNewOn specifies the status at which a new attribute is created. The status must be at most Incompatible since incompatible (or non-existing) attributes cannot be reused. If it is set lower, for instance to MissingValues, a new attribute will be created even if there exists an attribute which only misses some values. If set to OK, the function will always create a new attribute.

The function returns a tuple containing an attribute descriptor and the status of the best matching attribute. So, if createNewOn was set to MissingValues, and there exists an attribute whose status is, say, NoRecognizedValues, a new attribute would be created, while the second element of the tuple would contain NoRecognizedValues. If, on the other hand, there exists an attribute which is perfectly OK, its descriptor is returned and the returned status is OK. The function returns no explicit indicator of whether the returned descriptor is reused or not. This can, however, be read from the status code: if it is smaller than the specified createNewOn, the attribute is reused; otherwise we got a new descriptor.

The exception to the rule is when createNewOn is OK. In this case, the function does not search through the existing attributes and cannot know the statuses, so the returned status in this case is always OK.
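
The rules for make can be put together in a toy model. The sketch below is plain Python and is only our reading of the description above, not Orange's implementation: ToyVar, ToyRegistry, status_of and the snake_case argument names are hypothetical, the attribute type is ignored, and among several candidates the first one with the lowest status is picked.

```python
NOT_FOUND, INCOMPATIBLE, NO_RECOGNIZED, MISSING, OK = 4, 3, 2, 1, 0


class ToyVar(object):
    def __init__(self, name, values):
        self.name, self.values = name, list(values)


def status_of(var, ordered, unordered):
    # Prefix rule for the prescribed ordered values.
    if var.values[:len(ordered)] != ordered[:len(var.values)]:
        return INCOMPATIBLE
    wanted = ordered + [v for v in unordered if v not in ordered]
    found = [v for v in wanted if v in var.values]
    if len(found) == len(wanted):
        return OK
    return MISSING if found else NO_RECOGNIZED


class ToyRegistry(object):
    def __init__(self):
        self.variables = []

    def make(self, name, ordered=None, unordered=None,
             create_new_on=INCOMPATIBLE):
        ordered, unordered = list(ordered or []), list(unordered or [])
        wanted = ordered + [v for v in unordered if v not in ordered]
        best, status = None, NOT_FOUND
        if create_new_on == OK:
            status = OK               # no search is done at all
        else:
            for var in self.variables:
                if var.name == name:
                    s = status_of(var, ordered, unordered)
                    if s < status:
                        best, status = var, s
        if best is not None and status < create_new_on:
            # Reuse: the existing descriptor gains the values it lacked.
            best.values.extend(v for v in wanted if v not in best.values)
            return best, status
        var = ToyVar(name, wanted)
        self.variables.append(var)
        return var, status


registry = ToyRegistry()
v1, s1 = registry.make("a", ["a", "b"])
v2, s2 = registry.make("a", ["a"], ["c"])
v4, s4 = registry.make("a", ["b"])
print((s1, s2, v2 is v1, v1.values))
print((s4, v4 is v1, v4.values))
```

In every call the descriptor is reused exactly when the returned status is smaller than create_new_on, as stated above.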

Variable.getExisting(name, type[, ordered-values, unordered-values, createNewOn])
This function is essentially the same as make except that it does not construct a new attribute but returns None instead.

Here are a few examples for Variable.make; getExisting works similarly. These examples give the shown results if executed only once (in a Python session) and in this order.

part of variableReuse.py

    >>> v1, s = orange.Variable.make("a", orange.VarTypes.Discrete, ["a", "b"])
    >>> print s, v1.values
    4 <a, b>

No surprises here: a new variable is created and the status is NotFound.

    >>> v2, s = orange.Variable.make("a", orange.VarTypes.Discrete, ["a"], ["c"])
    >>> print s, v2 is v1, v1.values
    1 True <a, b, c>

The status is 1 (MissingValues), yet the variable is reused (v2 is v1 is True). v1 gets a new value, c, which was given as an unordered value. It does not matter that the new variable does not need value b.

    >>> v3, s = orange.Variable.make("a", orange.VarTypes.Discrete, ["a", "b", "c", "d"])
    >>> print s, v3 is v1, v1.values
    1 True <a, b, c, d>

This is similar to the previous case, except that the new value, d, is given among the ordered values.

    >>> v4, s = orange.Variable.make("a", orange.VarTypes.Discrete, ["b"])
    >>> print s, v4 is v1, v1.values, v4.values
    3 False <a, b, c, d> <b>

The new attribute needs to have b as the first value, so it is incompatible with the existing attribute. The status is thus 3 (Incompatible), the two attributes are not equal and have different lists of values.

    >>> v5, s = orange.Variable.make("a", orange.VarTypes.Discrete, None, ["c", "a"])
    >>> print s, v5 is v1, v1.values, v5.values
    0 True <a, b, c, d> <a, b, c, d>

The new attribute has values c and a, but does not mind about the order, so the existing attribute is OK.

    >>> v6, s = orange.Variable.make("a", orange.VarTypes.Discrete, None, ["e"])
    >>> print s, v6 is v1, v1.values, v6.values
    2 True <a, b, c, d, e> <a, b, c, d, e>

The new attribute has different values than the existing one (status is 2, NoRecognizedValues), but the existing one is reused nevertheless. Note that we gave e in the list of unordered values. If it were among the ordered values, the reuse would fail.

    >>> v7, s = orange.Variable.make("a", orange.VarTypes.Discrete, None, ["f"], orange.Variable.MakeStatus.NoRecognizedValues)
    >>> print s, v7 is v1, v1.values, v7.values
    2 False <a, b, c, d, e> <f>

This is the same as before, except that we prohibited reuse when there are no recognized values. Hence a new attribute is created, though the returned status is the same as before.

    >>> v8, s = orange.Variable.make("a", orange.VarTypes.Discrete, ["a", "b", "c", "d", "e"], None, orange.Variable.MakeStatus.OK)
    >>> print s, v8 is v1, v1.values, v8.values
    0 False <a, b, c, d, e> <a, b, c, d, e>

Finally, this is a perfect match, but any reuse is prohibited, so a new attribute is created.