Domain Descriptors

Domain descriptor (orange.Domain) serves three purposes.

Attributes

orange.Domain has a few publicly accessible fields. A domain descriptor is referenced by many objects, and modifying it is so unsafe that even the underlying C++ code only does so on freshly created domains.

attributes (read-only)
A list of domain's attributes, not including the class attribute.
variables (read-only)
A list of domain's attributes, including the class attribute.
classVar
Domain's class attribute, or None if domain is classless. In the latter case, attributes and variables are equal.
version (read-only)
An integer value that changes whenever the domain is modified. As noted above, this rarely happens. The number can also be used as a unique domain identifier; two different domains have different versions.
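
The relation between the three fields can be sketched in plain Python. The class below is a hypothetical stand-in (FieldsSketch is not part of Orange and does not import it); it assumes the class attribute is simply the last variable and, for simplicity, exposes all three fields as read-only properties.

```python
# Hypothetical stand-in for illustration; this is NOT orange.Domain.
class FieldsSketch(object):
    def __init__(self, variables, has_class=True):
        # A tuple, so the variable list cannot be modified in place.
        self._variables = tuple(variables)
        self._has_class = bool(has_class) and len(self._variables) > 0

    @property
    def variables(self):
        # All attributes, including the class attribute (if any).
        return list(self._variables)

    @property
    def classVar(self):
        # The last variable is the class, or None for a classless domain.
        return self._variables[-1] if self._has_class else None

    @property
    def attributes(self):
        # Ordinary attributes only: everything except the class attribute.
        if self._has_class:
            return list(self._variables[:-1])
        return list(self._variables)


with_class = FieldsSketch(["a", "b", "c"])
classless = FieldsSketch(["a", "b", "c"], has_class=False)
```

For the classless sketch, attributes and variables hold the same items, as described for classVar above.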

Construction

There are numerous ways to construct an orange.Domain. Let a, b and c be three attribute descriptors (for instance, created by a, b, c = [orange.EnumVariable(x) for x in ["a", "b", "c"]]).

orange.Domain(list-of-attributes)
This is the simplest and the most useful constructor. The new domain will contain the listed attributes, given as objects derived from orange.Variable. The last attribute in the list will be the class attribute.

>>> d = orange.Domain([a, b, c])
>>> print d.attributes
<EnumVariable 'a', EnumVariable 'b'>
>>> print d.classVar
EnumVariable 'c'

orange.Domain(list-of-attributes, class-attribute)
This is similar to the above, except that the class attribute is given separately. The class attribute must be an attribute descriptor (orange.Variable), not a string with an attribute name.

>>> d = orange.Domain([a, b], c)
>>> print d.attributes
<EnumVariable 'a', EnumVariable 'b'>
>>> print d.classVar
EnumVariable 'c'

orange.Domain(list-of-attributes, flag)
Another variation of the first case. The flag is interpreted as a boolean value; if true, the domain will have a class attribute (the last attribute in the list). If false, there is no class attribute. This is therefore a way to construct a classless domain.

>>> d = orange.Domain([a, b, c], 0)
>>> print d.attributes
<EnumVariable 'a', EnumVariable 'b', EnumVariable 'c'>
>>> print d.classVar
None

If we replace 0 by 1, we get the same results as in the first two cases.

orange.Domain(list-of-attributes, source)
In this form, the list of attributes can also include attribute names, not only attribute descriptors. The second argument, source, can be an existing domain or a list of attribute descriptors.

>>> d1 = orange.Domain([a, b])
>>> d2 = orange.Domain(["a", b, c], d1)

When constructing d2, we want it to include "a", b and c. To resolve the string "a", Orange checks the given source, domain d1, to see whether it contains an attribute with that name. As it does, the descriptor (a) is added to d2. The other two attributes, b and c, are added without checking d1.

orange.Domain(list-of-attributes, flag, source)
This is similar to the case above, except that the additional flag tells whether there is a class attribute or not.

>>> d2 = orange.Domain(["a", b, c], 0, [a, b])

Here, d2 includes all three attributes but no class attribute. Note also that we have used a list of attribute descriptors as the source instead of an existing domain, as we did in the previous example.
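
The way names are resolved against a source can be sketched in plain Python. Var and resolve below are hypothetical helpers, not Orange functions, and the code does not import orange. Raising KeyError for a name that the source does not contain is an assumption of the sketch, not documented behaviour.

```python
# Hypothetical helpers for illustration; not part of the orange module.
class Var(object):
    """Minimal stand-in for an attribute descriptor with a name."""
    def __init__(self, name):
        self.name = name


def resolve(items, source):
    """Replace names in `items` with descriptors found in `source`.

    Descriptors are passed through unchanged; only strings are looked up.
    """
    by_name = dict((var.name, var) for var in source)
    resolved = []
    for item in items:
        if isinstance(item, str):
            if item not in by_name:
                # Assumption: an unknown name is treated as an error.
                raise KeyError("no attribute named %r in source" % item)
            resolved.append(by_name[item])
        else:
            resolved.append(item)
    return resolved


a, b, c = [Var(x) for x in ["a", "b", "c"]]
d1 = [a, b]                      # plays the role of the source domain
d2 = resolve(["a", b, c], d1)    # "a" is looked up, b and c are used as given
```

Note that c ends up in d2 even though the source does not contain it, since descriptors are never checked against the source.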

orange.Domain(domain)
This is the cloning constructor - the new domain has the same attributes and class attribute as the domain passed as an argument, but is a different domain. Since domains are immutable, this constructor is not of much use.

orange.Domain(domain, class-attribute)
This one is more useful: the new domain has the same attributes as the old one, except that the class attribute is changed. The old class attribute becomes an ordinary attribute and the attribute specified by the second argument becomes the class attribute. The class attribute can be specified either by name (if so, it must be the name of one of the attributes that exist in the original domain) or by descriptor.

>>> d1 = orange.Domain([a, b, c])
>>> d2 = orange.Domain(d1, a)
>>> print d2.attributes
<EnumVariable 'b', EnumVariable 'c'>
>>> print d2.classVar
EnumVariable 'a'

In this example, we started with a domain d1 with attributes a and b, and c was the class. In the second domain, d2, c becomes an ordinary attribute and a is the class attribute. This constructor can also be used to add classes to classless domains.

>>> d1 = orange.Domain([a, b], 0)
>>> d2 = orange.Domain(d1, c)

Here, d1 is a classless domain and in d2 we added the attribute c as the class attribute. If c had existed before (as an ordinary attribute), it would, naturally, be removed from the list of ordinary attributes.

orange.Domain(domain, flag)
Finally, this constructor can be used to remove the class attribute. As before, flag tells whether the new domain should have a class attribute or not. However, the flag has an effect only if the original domain has a class attribute; if it does and flag is 0, the class attribute is moved to the ordinary attributes. In all other cases, the new domain is simply a clone of the original domain.
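
The two constructors that start from an existing domain can be sketched as functions on (attributes, class) pairs. with_class and without_class are hypothetical names, plain strings stand in for descriptors, and nothing here calls Orange.

```python
def with_class(attributes, class_var, new_class):
    """Mimic the (domain, class-attribute) form on plain lists."""
    variables = list(attributes)
    if class_var is not None:
        variables.append(class_var)       # old class becomes ordinary
    # The new class must not stay among the ordinary attributes.
    variables = [v for v in variables if v != new_class]
    return variables, new_class


def without_class(attributes, class_var, flag):
    """Mimic the (domain, flag) form: a false flag drops the class."""
    if class_var is not None and not flag:
        return list(attributes) + [class_var], None
    return list(attributes), class_var    # otherwise a plain copy


changed = with_class(["a", "b"], "c", "a")     # class moves from c to a
added = with_class(["a", "b"], None, "c")      # classless domain gets a class
removed = without_class(["a", "b"], "c", 0)    # class becomes ordinary
cloned = without_class(["a", "b"], None, 1)    # nothing to do: a copy
```

The first call reproduces the d2 example above: the ordinary attributes are b and c, and a is the class.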

Checking attribute types

There are three convenient functions for checking whether the domain contains any discrete, continuous or other-type attributes.

hasDiscreteAttributes([includeClass])
hasContinuousAttributes([includeClass])
hasOtherAttributes([includeClass])
The boolean argument tells whether the function should also check the class attribute or not. Default value is True.
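
The effect of includeClass can be illustrated with a small plain-Python sketch. The type tags and the has_type function are hypothetical; in Orange the type is a property of the attribute descriptor, and the three methods take no type argument.

```python
# Hypothetical type tags; real descriptors are orange.Variable subclasses.
DISCRETE, CONTINUOUS, OTHER = "discrete", "continuous", "other"


def has_type(attribute_types, class_type, wanted, include_class=True):
    """Tell whether any attribute (optionally the class too) is `wanted`."""
    types = list(attribute_types)
    if include_class and class_type is not None:
        types.append(class_type)
    return wanted in types


# Two continuous attributes and a discrete class attribute.
attribute_types = [CONTINUOUS, CONTINUOUS]
class_type = DISCRETE

with_class = has_type(attribute_types, class_type, DISCRETE)
without_class = has_type(attribute_types, class_type, DISCRETE,
                         include_class=False)
```

Here the only discrete variable is the class, so the answer depends on whether the class attribute is included.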

Conversion of examples

Examples can be converted from one domain to another by calling the domain descriptor.

domain2.py (uses monk1.tab)

>>> import orange
>>> data = orange.ExampleTable("monk1")
>>> d2 = orange.Domain(["a", "b", "e", "y"], data.domain)
>>>
>>> example = data[55]
>>> print example
['1', '2', '1', '1', '4', '2', '0']
>>> example2 = d2(example)
>>> print example2
['1', '2', '4', '0']

You will probably convert examples this way when writing your own classifiers in Python. Existing classifiers do exactly the same. orange.BayesClassifier, for instance, stores the domain of the learning examples and calls it to convert the examples to be classified.

An equivalent way of converting examples is to construct a new example, passing the new domain to the constructor.

>>> example2 = orange.Example(d2, example)

Example tables can be converted in a similar manner.

>>> data2 = orange.ExampleTable(d2, data)
>>> print data2[55]
['1', '2', '4', '0']
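
For attributes that exist in both domains, conversion amounts to picking values by attribute. A minimal plain-Python sketch of that idea follows; convert is a hypothetical function, attribute names stand in for descriptors, and the sketch covers only this simple lookup case.

```python
def convert(values, source_names, target_names):
    """Pick the values of `target_names` out of an example from the source."""
    position = dict((name, i) for i, name in enumerate(source_names))
    return [values[position[name]] for name in target_names]


# The Monk 1 example used above: attributes a to f and the class y.
monk_names = ["a", "b", "c", "d", "e", "f", "y"]
example = ["1", "2", "1", "1", "4", "2", "0"]
example2 = convert(example, monk_names, ["a", "b", "e", "y"])
```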

Meta Attributes

Meta-values are additional values that can be attached to examples and can have any meaning you want. It is not necessary that all examples in an example table (or even all examples from some domain) have a certain meta-value. See documentation on Example for a more thorough description of meta-values.

Meta attributes that appear in examples can, but need not, be known to the domain descriptor. Even if they are known, there are no obligations either way: the domain does not need to know about any meta values that are attached to examples, and examples do not need to have (all) meta values that are "registered" in the corresponding domain.

Why register meta attributes with the domain?

  • If the domain knows about a meta attribute, example indexing can be made smarter. While values of unregistered meta attributes can be obtained only through indices (e.g. example[id], where id needs to be an integer), values of registered meta attributes are also accessible through string or variable descriptor indices (example["age"]).
  • When printing out an example, the symbolic values of discrete attributes can only be printed if the attribute is registered. Also, if the attribute is registered, the printed out example will show a (more informative) attribute's name instead of a meta-id.
  • Registering an attribute provides a way to attach a descriptor to a meta-id. See how the basket file format uses this feature.
  • When saving examples to a file, only the values of registered meta attributes are saved (and even this only in tab-delimited and related formats since traditional file formats like C4.5's have no meta attributes).
  • When a new example is constructed, it is automatically assigned the meta attributes listed in the domain; their values are, of course, set to unknown.

For the last two points - saving to a file and construction of new examples - there is an additional flag, added for several practical reasons: a meta attribute can be marked as "optional". Such meta attributes are not saved and not added to newly constructed examples. This functionality is used in, for instance, the above-mentioned basket format, where new meta attributes are created while loading the file and we certainly don't want a new example to contain all words from the past examples.

There is another distinction between the optional and non-optional meta attributes: the latter are expected to be present in all examples of that domain. Saving to files, for one, expects them and will fail if a non-optional meta value is missing. Optional attributes may be missing. These rules are, however, mostly not strictly enforced, so adhering to them is rather up to your choice.

In general, register the meta attributes which are permanent and have a certain meaning. If you can't name it, you possibly don't want to register it. An animal name in the zoo data set or a patient ID in a typical medical data set are good examples of non-optional registered attributes. Word counts in basket format are optional registered attributes. The temporary example weights in bagging or example weights in certain configurations of tree induction or rule learning should be left unregistered.
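
The rules for optional and non-optional meta attributes can be summarized in a plain-Python sketch. The two functions and the dictionary layout below are hypothetical and only model the rules as stated in this section; Orange itself, as noted above, does not enforce them strictly.

```python
UNKNOWN = "?"


def new_example_metas(registered):
    """Meta values for a freshly built example.

    `registered` maps meta id -> (name, optional flag). Only non-optional
    meta attributes are attached, each set to unknown.
    """
    return dict((mid, UNKNOWN)
                for mid, (name, optional) in registered.items()
                if not optional)


def metas_to_save(registered, example_metas):
    """Select the meta values that would be written to a file."""
    saved = {}
    for mid, (name, optional) in registered.items():
        if optional:
            continue                      # optional metas are not saved
        if mid not in example_metas:
            raise ValueError("missing non-optional meta %r" % name)
        saved[name] = example_metas[mid]
    return saved


# A non-optional "name" and an optional "word_count" (both made up).
registered = {-1: ("name", 0), -2: ("word_count", 1)}
fresh = new_example_metas(registered)
saved = metas_to_save(registered, {-1: "Joe", -2: 3.0, -9: 1.0})
```

The value stored under the unregistered id -9 is neither saved nor attached to new examples, which is the behaviour described for temporary weights.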

Since meta attributes do not have a great impact, they can be added and removed even after the domain is constructed and examples of that domain already exist.

We shall first provide a few examples; a detailed description of the methods follows later. For instance, if data contains the Monk 1 data set, we can add a new continuous attribute named "misses" with the following code.

domain2.py (uses monk1.tab)

>>> misses = orange.FloatVariable("misses")
>>> id = orange.newmetaid()
>>> data.domain.addmeta(id, misses)
>>> print data[55]
['1', '2', '1', '1', '4', '2', '0']

Note that nothing changed in the example. No attributes are added. (This is only natural; the domain descriptor has no idea about which objects refer to it.) As already mentioned, registering meta attributes enables addressing by indexing, either by attribute name or by its descriptor. For instance, to set the attribute to 0 for all examples in the table, you could proceed with

domain2.py (uses monk1.tab)

>>> for example in data:
...     example[misses] = 0

An alternative is referring by name.

>>> for example in data:
...     example["misses"] = 0

Both alternatives are more elegant than the one to which you would have to resort if the meta attribute was not made known to the domain:

>>> for example in data:
...     example.setmeta(id, 0)

Registering the meta attribute also enhances printouts. When an example is printed, meta-values for registered attributes are shown as "name:value" pairs, while for unregistered ones you will get id's instead of names.

As you can learn by reading documentation on ExampleTable, the best way to add a meta attribute to a whole example table is by calling

data.addMetaAttribute("misses", 0)

This again works only if "misses" is registered in data.domain (which we did by calling data.domain.addmeta(id, misses)). If it's not, you should use an id instead of the string "misses" and change 0 to 0.0 to prevent the value from being interpreted as a discrete value (if there is no descriptor, Orange has no idea what you mean by 0 - a discrete index or a continuous value):

>>> data.addMetaAttribute(id, 0.0)

In massive testing of different models, you could count the number of times that each example was misclassified by calling classifiers in the following loop.

domain2.py (uses monk1.tab)

>>> for example in data:
...     if example.getclass() != classifier(example):
...         example[misses] += 1

The other effect of registering meta attributes is that they appear in converted examples. That is, whenever an example is converted to a certain domain, the example will have all the meta attributes that are declared in that domain. If the meta attributes occur in the original domain of the example or can be computed from the attributes in the original domain, they will have appropriate values. If not, their values will be unknown (DK).

domain = data.domain
d2 = orange.Domain(["a", "b", "e", "y"], domain)
for attr in ["c", "d", "f"]:
    d2.addmeta(orange.newmetaid(), domain[attr])
d2.addmeta(orange.newmetaid(), orange.EnumVariable("X"))
data2 = orange.ExampleTable(d2, data)

Domain d2 is constructed to have only the attributes a, b, e and the class attribute, while the other three attributes are added as meta attributes, along with a mysterious additional attribute X.

>>> print data[55]
['1', '2', '1', '1', '4', '2', '0'], {"misses":0.000000}
>>> print data2[55]
['1', '2', '4', '0'], {"c":'1', "d":'1', "f":'2', "X":'?'}

After conversion, the three attributes are moved to meta attributes and the new attribute appears as unknown.
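
How meta attributes declared in the target domain get their values can be sketched in plain Python. convert_with_metas is a hypothetical function and names stand in for descriptors. The sketch only looks values up in the source example and does not model meta attributes computed from other attributes.

```python
UNKNOWN = "?"


def convert_with_metas(values, source_names, target_names, target_metas):
    """Return (values, metas) of an example converted to a target domain."""
    lookup = dict(zip(source_names, values))
    new_values = [lookup[name] for name in target_names]
    # Metas found in the source get their value; the rest become unknown.
    new_metas = dict((name, lookup.get(name, UNKNOWN))
                     for name in target_metas)
    return new_values, new_metas


source_names = ["a", "b", "c", "d", "e", "f", "y"]
example = ["1", "2", "1", "1", "4", "2", "0"]
values, metas = convert_with_metas(
    example, source_names, ["a", "b", "e", "y"], ["c", "d", "f", "X"])
```

The result mirrors the printout of data2[55] above: c, d and f carry their original values and X is unknown.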

addmeta(id, descriptor[, optional])
You have seen this function in action already. The id is a negative integer, which you get from orange.newmetaid(). (You can cheat by just giving any negative integer, say -42, when you would only like to quickly try something. You shouldn't do that in final code, since calling orange.newmetaid ensures that id's are unique. We did so here to ensure that the code's printout is always the same.) The descriptor should be an attribute descriptor derived from orange.Variable, such as EnumVariable, FloatVariable or StringVariable.

>>> d2.addmeta(-42, orange.StringVariable("name"))
>>> data2[55]["name"] = "Joe"
>>> print data2[55]
['1', '2', '4', '0'], {"c":'1', "d":'1', "f":'2', "X":'?', "name":'Joe'}

The optional third argument tells whether the meta attribute is optional (when the value of the argument is non-zero) or not (when it is zero). Different non-zero values can be used for different kinds of optional attributes, if needed. These values are application dependent and Orange offers no corresponding registration facilities. If omitted, the attribute is not optional.

addmetas(dict[, optional])
This method is similar to addmeta described above, except that it can be used to add multiple meta attributes at once. The dictionary it accepts as an argument is in the same form as the one returned by getmetas(). Therefore, to add all meta attributes from domain to newdomain, use the following statement.

newdomain.addmetas(domain.getmetas())

The optional second argument tells whether the attributes need to be added as optional or non-optional. Default is the latter.

removemeta(meta-attribute | list-of-attributes)
Removes meta attributes. You can give a single attribute or a list of them. Attributes can be described by descriptors, names or id's (or a mix of these). Removing meta attributes from the domain descriptor has no effect on examples.
hasmeta(name | descriptor | id)
Tells whether the domain has the given meta attribute.
metaid(name | descriptor | id)
With this function, you can retrieve a lost id of a meta attribute. You will use this function to get id's of meta attributes that are loaded from tab-delimited files, where you will know their names but not the id's.

>>> d2.metaid("name")
-42

getmeta(name | id)
Given a name or an id of an attribute, this function returns its descriptor. For instance, we did not keep a reference to the string variable when we registered it as a meta attribute above; to get its descriptor, we'd call the function like this.

>>> d2.getmeta("name")
StringVariable 'name'

getmetas([optional])
Returns the meta attributes as a dictionary where id's are keys and descriptors are values. If the argument optional is given, the function returns only the optional attributes with the same value of the flag or, if the argument is zero, only the non-optional meta attributes.
isOptionalMeta(name | descriptor | id)
The function returns True if the meta attribute is optional and False if it's not.
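
The bookkeeping behind these methods can be pictured as a dictionary keyed by meta id. The class below is a hypothetical plain-Python stand-in (MetaRegistry is not an Orange class); it stores names instead of descriptors and accepts only the argument forms shown in its signatures.

```python
class MetaRegistry(object):
    """Hypothetical stand-in for the meta-attribute part of a domain."""

    def __init__(self):
        self._metas = {}                  # id -> (name, optional flag)

    def addmeta(self, mid, name, optional=0):
        self._metas[mid] = (name, optional)

    def metaid(self, name):
        for mid, (known, _) in self._metas.items():
            if known == name:
                return mid
        raise KeyError(name)

    def hasmeta(self, name_or_id):
        if name_or_id in self._metas:
            return True
        return any(name == name_or_id for name, _ in self._metas.values())

    def getmetas(self, optional=None):
        # No argument: all metas; otherwise only those with that flag value.
        return dict((mid, name) for mid, (name, flag) in self._metas.items()
                    if optional is None or flag == optional)

    def isOptionalMeta(self, name):
        return self._metas[self.metaid(name)][1] != 0


reg = MetaRegistry()
reg.addmeta(-42, "name")          # non-optional
reg.addmeta(-43, "words", 1)      # optional, kind 1
reg.addmeta(-44, "tags", 2)       # optional, kind 2
```

With this registry, getmetas(0) returns only the non-optional "name", while getmetas(2) returns only the attribute registered with flag 2.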

Domains as lists

To a certain extent, domains behave like lists. The length of a domain is the number of its attributes, including the class attribute. Iterating through a domain goes through the attributes and the class attribute, but not through meta attributes. You can get slices, but cannot set them (since domains are immutable). Domains can be indexed by integer indices, attribute names or descriptors. Domain has a method index(var) that returns the index of an attribute specified by a descriptor or a name (or an index, if you believe this makes sense).

>>> print d2
[a, b, e, y], {-4:c, -5:d, -6:f, -7:X}
>>> d2[1]
EnumVariable 'b'
>>> d2["e"]
EnumVariable 'e'
>>> d2["d"]
EnumVariable 'd'
>>> d2[-4]
EnumVariable 'c'
>>> for attr in d2:
...     print attr.name,
...
a b e y
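
The list-like behaviour shown in the session above can be imitated in plain Python. ListLikeDomain is a hypothetical stand-in in which names play the role of descriptors; slicing is left out of the sketch.

```python
class ListLikeDomain(object):
    """Hypothetical list-like stand-in; names play the role of descriptors."""

    def __init__(self, variables, metas):
        self.variables = list(variables)  # attributes plus class attribute
        self.metas = dict(metas)          # negative id -> meta attribute

    def __len__(self):
        return len(self.variables)        # meta attributes are not counted

    def __iter__(self):
        return iter(self.variables)       # meta attributes are skipped

    def __getitem__(self, key):
        if isinstance(key, int):
            # Negative integers are meta ids, not positions from the end.
            return self.metas[key] if key < 0 else self.variables[key]
        if key in self.variables or key in self.metas.values():
            return key                    # lookup by name
        raise KeyError(key)

    def index(self, name):
        return self.variables.index(name)


d2 = ListLikeDomain(["a", "b", "e", "y"],
                    {-4: "c", -5: "d", -6: "f", -7: "X"})
names = [name for name in d2]
```

As in the session above, a negative integer index selects a meta attribute by id instead of counting from the end of the list.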