Attribute descriptors are stored in objects derived from type orange.Variable. Their role is to identify the attributes: two attributes in Orange are the same if they have the same descriptor, not merely the same name. Descriptors also store the symbolic names of attributes and of their values. Another important feature of orange.Variable is that it can define a method by which the attribute's value is computed from other attributes; this is used, for instance, in discretization.
Variables can be constructed the usual way, through constructors, or by calling the functions orange.Variable.getExisting or orange.Variable.make. These functions search through the existing variables to find one with the same name, type and, for discrete attributes, values. If they succeed, the existing descriptor (an instance of Variable) is returned. If none is found, orange.Variable.getExisting returns None, while orange.Variable.make creates a new variable. With these two functions, same-named attributes can be the same attributes. This is needed when loading data; typical user-written scripts seldom require it, as they can store and reuse descriptors themselves. The functions are described later on.
orange.Variable is a base class for attribute descriptors.
Attributes
varType is an integer describing the attribute type. As for orange.Value's varType, it can be orange.VarTypes.Discrete (1), orange.VarTypes.Continuous (2) or orange.VarTypes.Other.
getValueFrom points to a "function" that computes the value of the attribute from the values of other attributes. The function is actually an orange.Classifier: its input is an orange.Example whose values are used to compute the value of the derived attribute, and its result is the computed value. This usually happens behind your back. Moreover, you should never call getValueFrom directly; call the method computeValue instead, which guards against deadlocks.
Although getValueFrom is always of type orange.Classifier, you can set it to an ordinary Python function or callable class. Orange will automatically wrap it into an orange.Classifier, as described in Subtyping Orange classes in Python. See the corresponding example below.
As with ordered, no methods treat such attributes in any special manner, so the flag is again reserved for future use.
If getValueFrom computes the attribute value from a single attribute, this attribute can be (but is not necessarily) stored in sourceVariable. As this is only used in a rather obscure place you won't run into, there's no harm in never setting sourceVariable.
randomGenerator is the random number generator used by the method randomvalue.
defaultMetaId is, for instance, used by the data loader for the tab-delimited file format, or by the function newmetaid, if the variable is passed as an argument.
Methods
Constructors of classes derived from orange.Variable (which is itself abstract) can be given the usual keyword arguments. In addition, the attribute name can be given directly. That is, a descriptor for the continuous attribute "age" can be constructed by calling orange.FloatVariable("age") or, equivalently, orange.FloatVariable(name="age").
A descriptor can be called to construct orange.Value objects for this attribute. Calling var(val) is equivalent to orange.Value(var, val); see construction of values.
Descriptors can be iterated over in for loops. So for val in var iterates through all values of attribute var, when possible.
randomvalue returns a random value for the attribute, when possible. This function uses randomGenerator; if none has been assigned yet, a new one is constructed with the initial seed 0 and stored for future use.
computeValue calls getValueFrom through a mechanism that prevents deadlocks caused by circular calls.
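The text does not say how computeValue prevents circular calls. One common way to get this behaviour is a re-entrancy flag, sketched below in plain Python. DerivedAttribute, get_value_from and compute_value are hypothetical stand-ins for illustration, not the Orange classes or methods.

```python
class DerivedAttribute:
    """Toy stand-in for a descriptor with a computed value (not orange.Variable)."""

    def __init__(self, name, get_value_from=None):
        self.name = name
        self.get_value_from = get_value_from  # callable(example) -> value
        self._computing = False               # re-entrancy guard (an assumption)

    def compute_value(self, example):
        if self.get_value_from is None:
            raise ValueError("%s has no get_value_from" % self.name)
        if self._computing:
            # We were asked for our own value while still computing it.
            raise RuntimeError("circular computation of %s" % self.name)
        self._computing = True
        try:
            return self.get_value_from(example)
        finally:
            self._computing = False


# Two attributes that (wrongly) depend on each other.
a = DerivedAttribute("a")
b = DerivedAttribute("b", lambda ex: a.compute_value(ex))
a.get_value_from = lambda ex: b.compute_value(ex)

try:
    a.compute_value({})
    outcome = "finished"
except RuntimeError as err:
    outcome = str(err)
```

Calling the raw function directly would recurse until the interpreter gives up; going through the guarded method turns the cycle into an immediate, readable error.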
orange.EnumVariable is a descriptor for nominal and ordinal attributes. It defines two additional attributes, values and baseValue, and no additional methods. Iterating and returning random values is supported.
Attributes
values is a list of the attribute's symbolic values. Values of examples are stored as integers referring to this list. Therefore, modifying this list instantly changes the names of values of examples, as they are printed out or referred to by the user. The size of the list is also used to indicate the number of possible values for this attribute; changing the size, especially shrinking the list, can have disastrous effects and is therefore not recommended. Also, do not add values to the list by calling its append or extend method: call EnumVariable.addValues, described below. It is also assumed that values is always defined (but can be empty), so you should never set values to None.
baseValue is an integer that is to be interpreted as an index into values. The absence of a base value ("sex" can be either "female" or "male", without an obvious base value) is indicated by -1.
Methods
addValues adds new values to values.
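To see why editing the list of values changes how existing examples print, consider this plain-Python sketch. NominalDescriptor and its methods are hypothetical names for illustration; the only thing taken from the text is that examples store integer indices into the descriptor's list.

```python
class NominalDescriptor:
    """Toy model of a discrete descriptor: values are looked up by index."""

    def __init__(self, name, values):
        self.name = name
        self.values = list(values)

    def add_value(self, value):
        # Append only if new, and return the index that examples would store.
        if value not in self.values:
            self.values.append(value)
        return self.values.index(value)

    def to_string(self, index):
        return self.values[index]


size = NominalDescriptor("size", ["small", "medium", "big"])
stored = [0, 2, 1, 2]                      # what the examples actually hold

before = [size.to_string(i) for i in stored]
size.values[2] = "large"                   # rename a value in the descriptor
after = [size.to_string(i) for i in stored]

new_index = size.add_value("huge")         # growing the list is harmless
```

The stored integers never change, only their interpretation does. Shrinking the list would leave index 2 pointing at nothing, which is the kind of damage the text warns about.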
orange.FloatVariable is a descriptor for continuous attributes.
The values that control random value generation and iteration can be left at -1, which is interpreted as undefined, if you don't need randoms and iterations. (I can't recall ever using them...)
If scientificFormat is True, the value is printed in scientific format whenever it would have more than 5 digits. In this case, numberOfDecimals is ignored.
A further setting tells whether the number of decimals is adjusted when a value is converted from a string (e.g. example[0]="3.14" or when reading from file). The value of 0 means that the number of decimals should not be adjusted, while 1 and 2 mean that adjustments are on, with 2 denoting that no values have been converted yet.
By default, adjustment of the number of decimals goes as follows. If the attribute was constructed when examples were read from a file, it will be printed with the largest number of decimals encountered in the file. If scientific notation occurs in the file, scientificFormat will be set to True and scientific format will be used for values that are too large or too small.
If the attribute is created in a script, it will have, by default, three decimal places. This can be changed either by setting the attribute value from a string (e.g. example[0]="3.14", but not example[0]=3.14) or by manually setting numberOfDecimals (e.g. attr.numberOfDecimals=1).
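The adjustment rule for values read from strings can be illustrated in plain Python. This is only a sketch of the described behaviour, namely printing with the largest number of decimals seen so far. The helpers decimals_in and format_value are hypothetical, and scientific notation is deliberately left out.

```python
def decimals_in(text):
    """Count digits after the decimal point in a plain decimal string."""
    text = text.strip()
    if "e" in text.lower():
        return 0  # scientific notation is not handled in this sketch
    _, _, fraction = text.partition(".")
    return len(fraction)


def format_value(value, number_of_decimals=3):
    """Print a float with a fixed number of decimals (three by default)."""
    return "%.*f" % (number_of_decimals, value)


# Strings as they might appear in one column of a data file.
column = ["3.14", "2.5", "10", "0.125"]

# The widest value in the column decides how all of them are printed.
number_of_decimals = max(decimals_in(s) for s in column)
printed = [format_value(float(s), number_of_decimals) for s in column]
```

Assigning a float such as 3.14 carries no information about how many decimals were written, which is consistent with the text's remark that only string assignment triggers the adjustment.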
orange.StringVariable describes attributes that contain strings. No method can use them for learning; some will complain and others will silently ignore them when they encounter them. They can, however, be useful as meta-attributes; if examples in a dataset have unique IDs, the most efficient way to retain them is to read them as meta-attributes. In general, never use discrete attributes with many (say, more than 50) values. Such attributes are probably of no use for learning and should be stored as string attributes.
There's a short and simple example that uses StringVariable near the end of the page about Domain.
When converting strings into values and back, empty strings are treated differently than usual. For other types, an empty string can be used as a synonym for the question mark ("don't know"), while StringVariable will take an empty string as an empty string, except when loading from or saving to a file. Empty strings in files are interpreted as "don't know". You can, however, enclose the string in double quotes; these are removed when the string is loaded. Therefore, to give an empty string, put it in double quotes, "".
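The file-loading rule for string fields can be written down as a small function. This is a sketch of the rule as stated above, not Orange's loader. read_string_field is a hypothetical name, and None stands in for the "don't know" value.

```python
def read_string_field(field):
    """Interpret one string field from a data file.

    Returns None for "don't know", otherwise the string value.
    """
    if field == "" or field == "?":
        return None                      # empty or "?" means unknown
    if len(field) >= 2 and field[0] == '"' and field[-1] == '"':
        return field[1:-1]               # surrounding quotes are removed
    return field


unknown = read_string_field("")
empty = read_string_field('""')
quoted = read_string_field('"John Smith"')
plain = read_string_field("John")
```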
orange.PythonVariable is a base class for descriptors defined in Python. Fully functional in itself, PythonVariable can already be used as a descriptor for attributes that contain arbitrary Python values. Since this is an advanced topic, PythonVariables are described on a separate page.
Monk 1 is a well-known dataset with the target concept y := a==b or e==1. It does not hurt, and can even help, if we replace the four-valued attribute e with a binary attribute having values 1 and not 1. The new attribute shall be computed from the old one on the fly.
part of variable.py (uses monk1.tab)
Our new attribute is named e2; we define it by a descriptor of type orange.EnumVariable, with an appropriate name and values not 1 and 1 (we chose this order so that the index of not 1 is 0, which can be, if needed, interpreted as false).
checkE is a function that is passed an example and another argument we don't care about. If the example's attribute e equals 1, the function returns value 1, otherwise it returns not 1. Both are returned as values of attribute e2, not as plain strings. Finally, we tell e2 to use checkE to compute its value when needed, by assigning checkE to getValueFrom.
In most circumstances, the value of e2 can be computed on the fly: we can pretend that the attribute exists in data, although it doesn't (but can be computed from it). For instance, we can observe the conditional distribution of classes with regard to e2.
orange.Distribution is called to compute the distribution for e2 in data. When it notices that data.domain does not contain e2, it checks whether e2's getValueFrom is defined and, seeing that it is, uses it to get e2's values.
We describe the technical details to make you aware that automatic recomputation requires some effort on the side of orange.ContingencyAttrClass. There are methods that will not do that for you, because it would be too complex or too time-consuming. An example of such a situation is constructive induction by function decomposition; making incompatibility matrices with attributes computed on the fly would be slow and impractical, so attempting it yields an error. In such cases, you can simply convert the entire example table to a new domain that also includes the new attribute.
part of variable.py (uses monk1.tab)
Automatic computation is useful when the data is split into training and testing examples. Training examples can be modified by adding, removing and transforming attributes (in a typical setup, continuous attributes are discretized prior to learning, so the original attributes are replaced by new ones), while testing examples are left as they are. When they are classified, the classifier automatically converts the testing examples into the new domain, which includes recomputation of transformed attributes. With our toy script, we can split the data, use it for learning and then test the classification of unmodified test examples.
variable2.py (uses monk1.tab)
First, note that we have rewritten the above example, replacing the checkE function with a simpler lambda function, which exploits the fact that Python's False and True equal 0 and 1.
We have split data into trainData and testData, with 70% and 30% of examples, respectively. After constructing a new domain, we only translate the training examples and induce a decision tree. The printout shows that it first splits the examples by the attribute e2 and then, if e2 is not 1, it (implicitly) checks the equality of a and b. In the for loop, examples from testData, which does not have attribute e2, are correctly classified.
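The trick behind the lambda rewrite, that a Python comparison can be used directly as a list index, is easy to check without Orange. In the sketch below, plain dictionaries stand in for examples and plain strings for values; check_e and check_e_short are illustrative names, not the functions from variable2.py.

```python
values = ["not 1", "1"]      # index 0 -> "not 1", index 1 -> "1"


def check_e(example):
    """Long form: an explicit if/else."""
    if example["e"] == "1":
        return "1"
    else:
        return "not 1"


# Short form: the comparison yields False or True, which index as 0 or 1.
check_e_short = lambda example: values[example["e"] == "1"]

examples = [{"e": str(k)} for k in (1, 2, 3, 4)]
long_form = [check_e(ex) for ex in examples]
short_form = [check_e_short(ex) for ex in examples]
```

This only works because not 1 was deliberately placed at index 0 when the values were listed.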
The way this is done is the same for all classifiers: a classifier stores the domain description for the learning examples (or, to be more precise, a domain in which the model is described). Prior to classification, examples from other domains are converted to the stored domain. In our case, examples from testData are converted to newDomain, and the given lambda function is used to compute the value of e2 from e.
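The conversion step can be pictured with a few lines of plain Python. This is a simplified model, not Orange's implementation: examples are dictionaries, a domain is a list of attribute names, and convert and derivations are hypothetical names introduced for the sketch.

```python
def convert(example, domain, derivations):
    """Translate an example into `domain`, computing what is missing.

    `derivations` maps an attribute name to a function of the original
    example, playing the role that getValueFrom plays in the text.
    """
    converted = {}
    for name in domain:
        if name in example:
            converted[name] = example[name]             # copy as is
        elif name in derivations:
            converted[name] = derivations[name](example)  # compute on the fly
        else:
            converted[name] = None                      # unknown value
    return converted


# The classifier's stored domain has e2 instead of e.
new_domain = ["a", "b", "e2", "y"]
derivations = {"e2": lambda ex: ["not 1", "1"][ex["e"] == "1"]}

# A test example in the original domain, without e2.
test_example = {"a": "1", "b": "2", "e": "1", "y": "1"}
converted = convert(test_example, new_domain, derivations)
```

A classifier that always runs such a conversion before predicting never needs the caller to prepare test data by hand.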
What to do if an attribute can be computed from different domains, using different procedures? Can there be more than one function to be tried? Why is there only one getValueFrom, not a list of them? Although we are pretty advanced Orange users, we have never run into a situation where we needed this (obviously; if we had needed it, we'd have done something about it :). If you do need to specify more than one function for computing an attribute's value, you can define a Python class that stores a list of functions and calls them in an appropriate manner. Then give an object of this class to getValueFrom. And tell us about your case, and we shall rethink our position.
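A class of the kind suggested here might look like the following sketch. What "appropriate manner" means is left open by the text; this version simply tries each function in turn and takes the first one that can find its source attribute. FirstApplicable and the attribute e_is_one are hypothetical, and dictionaries stand in for examples.

```python
class FirstApplicable:
    """Callable holding several value-computing functions, tried in order."""

    def __init__(self, functions):
        self.functions = list(functions)

    def __call__(self, example, *ignored):
        # Extra arguments are accepted and ignored, like checkE's second one.
        for function in self.functions:
            try:
                return function(example)
            except KeyError:
                continue          # source attribute absent, try the next one
        return None               # no function applied: value is unknown


# Two ways of obtaining e2, depending on what the example contains.
from_e = lambda ex: ["not 1", "1"][ex["e"] == "1"]
from_flag = lambda ex: ["not 1", "1"][bool(ex["e_is_one"])]

compute_e2 = FirstApplicable([from_e, from_flag])

via_e = compute_e2({"e": "3"})
via_flag = compute_e2({"e_is_one": True})
neither = compute_e2({})
```

An instance of such a class is an ordinary callable, so it could be assigned wherever a single function is expected.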
There are situations when the attribute descriptor may need to be reused, yet the reference to it is not available. Typically, the user loads some training examples, trains a classifier and then loads a separate test set. For the classifier to recognize the attributes in the second data set, the descriptors, not just the names, need to be the same. This problem was first solved by requiring the user to explicitly provide the "original" Domain, which mystified too many, so later on Orange used domain depots where it looked for suitable domains to reuse without any user intervention. This worked, with a few nasty exceptions, until Orange started to (tend to) support pickling: as unpickling always created new attributes, unpickled classifiers (or data or any other object storing references to descriptors) were useless.
Orange now maintains a list of all existing Variables and can check it before constructing new variables. This is done while loading the data, will be used for unpickling and can be explicitly used by the user. Creating variables directly, with constructors (EnumVariable() etc.), always constructs brand new variables.
The search is based on four arguments: the attribute's name, type, ordered values and unordered values. As for the latter two, the values can be explicitly ordered by the user, e.g. in the second line of the tab-delimited file, for instance to order sizes as small-medium-big.
The search for existing variables can end with one of the following statuses. (Note: Use the symbolic constants, not the integers given in parentheses; we may introduce a new status, ExtraValues, between OK and MissingValues. You can, however, count on the order of statuses to stay the same.)
orange.Variable.MakeStatus.NotFound (4): an attribute with that name and type does not exist.
orange.Variable.MakeStatus.Incompatible (3): there are one or more attributes with matching name and type, but their lists of values are incompatible with the prescribed ordered values. For example, if the existing variable already has values ["a", "b"] and the new one wants ["b", "a"], this is a no-go. The existing list can, however, be extended by the new values, so searching for ["a", "b", "c"] would succeed. So would the search for ["a"], since the extra existing value does not matter. The formal rule is thus that the values are compatible if existing_values[:len(ordered_values)] == ordered_values[:len(existing_values)].
orange.Variable.MakeStatus.NoRecognizedValues (2): there is a matching attribute, yet it has none of the values that the new attribute will have (this is obviously possible only if the new attribute has no prescribed ordered values). For instance, we search for an attribute "sex" with values "male" and "female", while there is an attribute of the same name with values "M" and "F" (or, well, "no" and "yes" :). Reuse of this attribute is possible, though this should probably be a new attribute since it obviously comes from a different data set. If we do decide on reuse, the old attribute will get some unneeded new values and the new one will inherit some from the old.
orange.Variable.MakeStatus.MissingValues (1): there is a matching attribute with some of the values that the new one requires, but some values are missing. This situation is neither uncommon nor suspicious: with separate training and testing data sets there may be attribute values which occur in one set but not in the other.
orange.Variable.MakeStatus.OK (0): there is an attribute which contains all the prescribed values in the correct order. The existing attribute may have some extra values, though.
Continuous attributes can obviously have only two statuses, NotFound or OK.
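The formal compatibility rule is short enough to run as is. The sketch below wraps it in a function and adds a classification into the remaining statuses that follows the descriptions above. The function names and the uppercase constants are made up for the example, and match_status is one reading of the prose, not Orange's actual search code.

```python
# Integer codes as listed in the text (real scripts should use the
# symbolic constants in orange.Variable.MakeStatus instead).
NOT_FOUND, INCOMPATIBLE, NO_RECOGNIZED_VALUES, MISSING_VALUES, OK = 4, 3, 2, 1, 0


def compatible(existing_values, ordered_values):
    """The rule from the text: one list must be a prefix of the other."""
    existing, ordered = list(existing_values), list(ordered_values)
    return existing[:len(ordered)] == ordered[:len(existing)]


def match_status(existing_values, ordered_values, unordered_values=()):
    """Classify how well one existing discrete attribute matches a request."""
    if not compatible(existing_values, ordered_values):
        return INCOMPATIBLE
    wanted = list(ordered_values)
    wanted += [v for v in unordered_values if v not in wanted]
    recognized = [v for v in wanted if v in existing_values]
    if len(recognized) == len(wanted):
        return OK                      # extra existing values do not matter
    if recognized:
        return MISSING_VALUES          # some, but not all, values are known
    return NO_RECOGNIZED_VALUES


swapped = compatible(["a", "b"], ["b", "a"])
extended = compatible(["a", "b"], ["a", "b", "c"])
shorter = compatible(["a", "b"], ["a"])
```

NotFound does not appear in match_status because it describes the case where there is no candidate attribute to compare with at all.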
When loading the data using orange.ExampleTable, Orange takes the safest approach and, by default, reuses everything that is compatible, that is, up to and including NoRecognizedValues. Unintended reuse would be obvious from the attribute having too many values, which the user can notice and fix. More on that in the page on loading data.
There are two functions for reusing the attributes instead of creating new ones.
The type should be one of the types in orange.VarTypes, e.g., orange.VarTypes.Discrete. Values can be given with any iterable type (list, set...). The optional createNewOn specifies the status at which a new attribute is created. The status must be at most Incompatible, since incompatible (or non-existing) attributes cannot be reused. If it is set lower, for instance to MissingValues, a new attribute will be created even if there exists an attribute that is only missing some values. If set to OK, the function will always create a new attribute.
The function returns a tuple containing an attribute descriptor and the status of the best matching attribute. So, if createNewOn was set to MissingValues, and there exists an attribute whose status is, say, NoRecognizedValues, a new attribute would be created, while the second element of the tuple would contain NoRecognizedValues. If, on the other hand, there exists an attribute which is perfectly OK, its descriptor is returned and the returned status is OK. The function returns no indicator of whether the returned descriptor is reused or not. This can, however, be read from the status code: if it is smaller than the specified createNewOn, the attribute is reused, otherwise we got a new descriptor.
The exception to the rule is when createNewOn is OK. In this case, the function does not search through the existing attributes and cannot know the statuses, so the returned status is always OK.
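The reuse rule reduces to one comparison plus the special case for OK. The sketch below encodes just that decision. make_decision is a hypothetical helper, and defaulting the threshold to Incompatible is an assumption inferred from the examples that follow, not something the text states outright.

```python
# Integer codes as listed in the text; symbolic constants are preferred
# in real scripts.
NOT_FOUND, INCOMPATIBLE, NO_RECOGNIZED_VALUES, MISSING_VALUES, OK = 4, 3, 2, 1, 0


def make_decision(best_status, create_new_on=INCOMPATIBLE):
    """Return (action, returned_status) for a search that ended in best_status."""
    if create_new_on > INCOMPATIBLE:
        raise ValueError("createNewOn must be at most Incompatible")
    if create_new_on == OK:
        # No search is performed at all, so the status cannot be known.
        return ("new", OK)
    if best_status < create_new_on:
        return ("reused", best_status)
    return ("new", best_status)


default_missing = make_decision(MISSING_VALUES)
default_not_found = make_decision(NOT_FOUND)
strict = make_decision(NO_RECOGNIZED_VALUES, create_new_on=MISSING_VALUES)
never_reuse = make_decision(OK, create_new_on=OK)
```

Because the returned status is that of the best match even when a new descriptor is created, a caller has to compare it with the threshold, as the text describes, to learn what actually happened.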
getExisting works like make, except that it does not construct a new attribute but returns None instead.
Here are a few examples for Variable.make; getExisting works similarly. These examples give the shown results if executed only once (in a Python session) and in this order.
part of variableReuse.py
No surprises here: a new variable is created and the status is NotFound.
The status is 1 (MissingValues), yet the variable is reused (v2 is v1 is True). v1 gets a new value, c, which was given as an unordered value. It does not matter that the new variable does not need value b.
This is similar to the previous case, except that the new value, d, is not among the ordered values.
The new attribute needs to have b as the first value, so it is incompatible with the existing attribute. The status is thus 3 (Incompatible), the two attributes are not equal and have different lists of values.
The new attribute has values c and a, but does not care about the order, so the existing attribute is OK.
The new attribute has different values than the existing one (status is 2, NoRecognizedValues), but the existing one is reused nevertheless. Note that we gave e in the list of unordered values. If it were among the ordered ones, the reuse would fail.
This is the same as before, except that we prohibited reuse when there are no recognized values. Hence a new attribute is created, though the returned status is the same as before.
Finally, this is a perfect match, but any reuse is prohibited, so a new attribute is created.