ExampleTable

Examples are usually stored in a table called orange.ExampleTable. In Python you will perceive it as a list, and this is what it basically is: an ordered sequence of examples that supports the usual Python list operations, including the more advanced ones such as slicing and sorting.

This is a bit advanced, but we should write it here so nobody overlooks it: if data is an instance of ExampleTable, data[0] is not a reference to the first element but the first element itself. About the only case in which this matters is when we try to swap two elements, either with data[0], data[1] = data[1], data[0] or through an intermediate variable: it won't work. For the same reason, random.shuffle doesn't work on ExampleTable (as it doesn't on numpy, by the way). Use ExampleTable's own shuffle method instead.

ExampleTable is derived from a more general abstract class ExampleGenerator.

Attributes

domain
All examples in a table belong to the same domain - the one that is given in this field.
ownsExamples
Tells whether the ExampleTable contains copies of examples (true) or only references to examples owned by another table (stored in field lock). Example tables with references to examples are useful for sampling examples from ExampleTable without copying examples.
lock
The true owner of examples, if this table contains only references. (The main purpose of this field is to lock a reference to the owner, so that it doesn't die before the example table that references its examples).
version
An integer that is increased whenever the ExampleTable is changed. This is not foolproof, since ExampleTable cannot detect when individual examples are changed. It will, however, catch any additions to and removals from the table.
randomGenerator
Random generator that is used by method randomexample. If the method is called and randomGenerator is None, a new generator is constructed with random seed 0 and stored here for subsequent use. If you would like to get different random examples each time your script is run, seed the generator with a random number from Python.
attributeLoadStatus, metaAttributeLoadStatus
A list and a dictionary describing how the attributes were created. They exist only for example tables which were loaded from files. A detailed description is available on the page about loading the data.

Construction and Saving

ExampleTable can be constructed by reading from file, packing existing examples or creating an empty table. To save the data, see the documentation on file formats.

ExampleTable(filename[, createNewOn])
This constructor reads from files. If filename includes an extension, it must be the extension of one of the known file formats. If just a stem is given (such as "monk1", without ".tab", ".names" or whatever), the current directory is searched for any file with the given stem and one of the known extensions (see the page on file formats). If the file is not found in the current directory, Orange will also search the directories specified in the environment variable ORANGE_DATA_PATH.

ExampleTable(domain)
This constructor creates an empty ExampleTable for the given domain. As an exercise, we shall construct a domain for the common version of the Monk datasets; attribute names will be a, b, c, d, e, and f, and their values will be taken from 1, 2, 3, and 4. Attribute e is four-valued, a, b and d are three-valued, and c and f are binary.

part of exampletable1.py

import orange, random

classattr = orange.EnumVariable("y", values = ["0", "1"])

card = [3, 3, 2, 3, 4, 2]
values = ["1", "2", "3", "4"]
attributes = [orange.EnumVariable(chr(97+i), values = values[:card[i]])
              for i in range(6)]
domain = orange.Domain(attributes + [classattr])
data = orange.ExampleTable(domain)

Attributes are defined in a list comprehension where i goes from 0 to 5 (for six attributes). The attribute name is chr(97+i), which gives the letters from a to f, and the attribute's values are a slice of the list values: exactly as many values as card specifies for that attribute. If you don't understand this, don't worry; just pretend that all attributes are defined as simply as the class attribute.
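To see what the comprehension produces without Orange at hand, the following plain-Python sketch builds the same names and value lists as ordinary tuples. The name attribute_specs is made up for this illustration; the tuples only mimic the arguments that the script passes to orange.EnumVariable and are not Orange objects.

```python
# Plain-Python illustration; no Orange objects are created here.
card = [3, 3, 2, 3, 4, 2]            # number of values for a, b, c, d, e, f
values = ["1", "2", "3", "4"]

# (name, list of values) pairs, mirroring the arguments given to EnumVariable.
# attribute_specs is a hypothetical name used only in this sketch.
attribute_specs = [(chr(97 + i), values[:card[i]]) for i in range(6)]

for name, vals in attribute_specs:
    print("%s: %s" % (name, ", ".join(vals)))
```

Each line of the printout pairs an attribute name with the values it may take, which makes it easy to check the cardinalities against card.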

ExampleTable(examples[, references])
This puts the given examples into a new ExampleTable. Examples can be given either with ExampleGenerator, such as ExampleTable, or as an ordinary Python list containing examples (as objects of type Example).

If the optional second argument is true, the new ExampleTable will only store references to examples. In this case, the first argument must be ExampleTable, not a list.

ExampleTable(domain, examples)
This constructor converts examples into the given domain and stores them in the new table. Examples can again be given as an ExampleGenerator, as a Python list containing examples (either objects of type Example or Python lists), or as a Numeric array, if your Orange build supports it.

If you have examples stored in a list of lists, for instance

loe = [
    ["3", "1", "1", "2", "1", "1", "1"],
    ["3", "1", "1", "2", "2", "1", "0"],
    ["3", "3", "1", "2", "2", "1", "1"]]

you can convert it into an ExampleTable by

data = orange.ExampleTable(domain, loe)

Instead of strings (i.e., symbolic values) you can use value indices in loe when you find that more appropriate:

loe = [
    [2, 0, 0, 1, 0, 0, 1],
    [2, 0, 0, 1, 1, 0, 0],
    [2, 2, 0, 1, 1, 0, 1]]
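The correspondence between the two forms of loe can be checked in plain Python. This is only a sketch of the mapping, assuming the Monk domain constructed earlier (values "1" to "4" cut to each attribute's cardinality, and class values "0" and "1"); the names value_lists, loe_symbolic and loe_indices are made up for the illustration.

```python
# Plain-Python illustration of the symbol-to-index correspondence.
card = [3, 3, 2, 3, 4, 2]
values = ["1", "2", "3", "4"]

# Value lists for attributes a..f, followed by the class attribute y.
value_lists = [values[:c] for c in card] + [["0", "1"]]

loe_symbolic = [
    ["3", "1", "1", "2", "1", "1", "1"],
    ["3", "1", "1", "2", "2", "1", "0"],
    ["3", "3", "1", "2", "2", "1", "1"],
]

# An index is the position of the symbol in the attribute's value list.
loe_indices = [
    [vals.index(symbol) for symbol, vals in zip(row, value_lists)]
    for row in loe_symbolic
]
print(loe_indices)
```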

The other way of putting such examples into an ExampleTable is by method extend.

Finally, here's an example that puts the contents of a Numeric array into an ExampleTable.

import Numeric
d = orange.Domain([orange.FloatVariable('a%i' % x) for x in range(5)])
a = Numeric.array([[1, 2, 3, 4, 5], [5, 4, 3, 2, 1]])
t = orange.ExampleTable(d, a)

For this example, we first constructed a domain with attributes a0, a1, a2, a3 and a4. We then put together a simple Numeric array with five columns and put it into a table with that domain.

ExampleTable(list-of-tables)
"Horizontally" merges multiple tables into a single table. All the tables must be of the same length, since new examples are combined from the examples at the same position in the given tables. Domains are combined so that each (ordinary) attribute appears only once in the resulting table. The class attribute is the last class attribute in the list of tables; for instance, if three tables are merged but the last one is class-less, the class attribute for the new table will come from the second table. Meta attributes for the new domain are merged based on ids: if the same attribute appears under two ids, it will be added twice. Conversely, if the same id is used for two different attributes in two example tables, this is an error. As examples are merged, Orange checks that the attributes (ordinary or meta) that appear in different tables have either the same value or undefined values.

Note that this is not the SQL's join operator as it doesn't try to match any keys between the tables.

For a trivial example, we shall merge two tables stored in the following tab-delimited files.

merge1.tab

a1    a2    m1    m2
f     f     f     f
            meta  meta
1     2     3     4
5     6     7     8
9     10    11    12

merge2.tab

a1    a3    m1    m3
f     f     f     f
            meta  meta
1     2.5   3     4.5
5     6.5   7     8.5
9     10.5  11    12.5

The two tables can be loaded, merged and printed out by the following script.

exampletable_merge.py (uses merge1.tab, merge2.tab)

import orange

data1 = orange.ExampleTable("merge1")
data2 = orange.ExampleTable("merge2", use = data1.domain)

merged = orange.ExampleTable([data1, data2])

print
print "Domain 1: ", data1.domain
print "Domain 2: ", data2.domain
print "Merged: ", merged.domain
print
for i in range(len(data1)):
    print " %s\n + %s\n-> %s\n" % (data1[i], data2[i], merged[i])

First, note the use = data1.domain, which ensures that while loading the second table, the attributes from the first are reused if they have the same name and type. Without that, the attribute a1 from the first table and the attribute a1 from the second table would be two different attributes, and the merged table would have two attributes named a1 instead of the single one we want. The same goes for the meta-attribute m1, which will also have the same id in both tables. (For this reason, it is important to pass the entire domain, i.e. data1.domain, not a list of attributes such as data1.domain.variables or, obviously doing it wrong on purpose, data1.domain.variables + data1.domain.getmetas().values().)

Merging succeeds since the values of a1 and m1 are the same for all matching examples from data1 and data2, and the printout is as anticipated.

Domain 1: [a1, a2], {-2:m1, -3:m2}
Domain 2: [a1, a3], {-2:m1, -4:m3}
Merged: [a1, a2, a3], {-2:m1, -3:m2, -4:m3}

 [1, 2], {"m1":3, "m2":4}
 + [1, 2.5], {"m1":3, "m3":4.5}
-> [1, 2, 2.5], {"m1":3, "m2":4, "m3":4.5}

 [5, 6], {"m1":7, "m2":8}
 + [5, 6.5], {"m1":7, "m3":8.5}
-> [5, 6, 6.5], {"m1":7, "m2":8, "m3":8.5}

 [9, 10], {"m1":11, "m2":12}
 + [9, 10.5], {"m1":11, "m3":12.5}
-> [9, 10, 10.5], {"m1":11, "m2":12, "m3":12.5}
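The merging rule itself (rows are paired by position, shared attributes must agree or be undefined) can be modelled in a few lines of plain Python. This is a sketch of the documented behaviour, not Orange's implementation: rows are represented as dictionaries, undefined values as None, and the functions merge_rows and merge_tables are hypothetical names. Meta ids and the choice of class attribute are left out.

```python
# Plain-Python model of horizontal merging; not Orange's implementation.
def merge_rows(*rows):
    """Combine dict-based rows; shared keys must agree or be undefined (None)."""
    merged = {}
    for row in rows:
        for name, value in row.items():
            known = merged.get(name)
            if known is not None and value is not None and known != value:
                raise ValueError("conflicting values for %r: %r and %r"
                                 % (name, known, value))
            if known is None:
                merged[name] = value
    return merged


def merge_tables(*tables):
    """Merge tables of equal length row by row; no key matching is done."""
    if len(set(len(table) for table in tables)) > 1:
        raise ValueError("all tables must have the same length")
    return [merge_rows(*rows) for rows in zip(*tables)]


# The contents of merge1.tab and merge2.tab, written out as dictionaries.
table1 = [{"a1": 1, "a2": 2, "m1": 3, "m2": 4},
          {"a1": 5, "a2": 6, "m1": 7, "m2": 8},
          {"a1": 9, "a2": 10, "m1": 11, "m2": 12}]
table2 = [{"a1": 1, "a3": 2.5, "m1": 3, "m3": 4.5},
          {"a1": 5, "a3": 6.5, "m1": 7, "m3": 8.5},
          {"a1": 9, "a3": 10.5, "m1": 11, "m3": 12.5}]

merged = merge_tables(table1, table2)
print(merged[0])
```

Changing a single value of a1 or m1 in table2 makes merge_tables raise an error, which mirrors the consistency check described above.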

Standard list-like functions

ExampleTable supports most standard Python list operations. All the basic operations (getting, setting and removing examples and slices) are supported.

<items>
When retrieving items (Examples) from the table, you get references to examples, not copies. So, when you write ex = data[0] and then modify ex, you will actually change the first example in the data. If the table contains references to examples, it can only contain references to examples in a single table, so when you assign items, e.g. by data[10]=example, the example must come from the right table.

When setting items, you can present examples as objects of type Example or as ordinary lists, for instance, data[0] = ["1", "1", 1, "1", "1", "1", "1"]. This form can, of course, only be used with tables that own their examples.

<slices>
Slices function as expected: data[:10], for instance, gives the first ten examples from data. These examples are not returned in an ExampleTable but in an ordinary Python list containing references to examples in the table. For instance, to do something with the first n examples, you can use a loop like this.

for example in data[:n]:
    do_something(example)

As with ordinary lists, this is somewhat slower than

for i in range(n):
    do_something(data[i])

but you probably won't notice the difference except in really large tables.

If the table contains references to examples, similar restrictions as for assigning items apply.

<logical tests>
As with ordinary lists, a table is "false" when empty. Thus, an empty table can be rejected by the following if statement.

if not data:
    raise ValueError("I would really need some examples before I proceed, sir")
append(example)
Appends a single example to the table. We have already shown how to construct a domain description and an empty example table for Monk 1 dataset. Let us now add a few examples representing the Monk 1 concept (y := (a==b) or (e==1)).

part of exampletable1.py

data = orange.ExampleTable(domain)
for i in range(100):
    ex = [random.randint(0, c-1) for c in card]
    ex.append(ex[0]==ex[1] or ex[4]==0)
    data.append(ex)

For each example, we prepare a list of six random value indices, each ranging from 0 to the attribute's cardinality minus one (randint(0, c-1) returns a random value between 0 and c-1, inclusive). To this we append the class value, computed according to Monk 1's concept; since the list holds value indices, the condition e==1 becomes ex[4]==0. The constructed list is appended to the table.

Restrictions apply for tables that contain references to examples.

extend(examples)
Appends a list of examples (given as a generator or a Python list) to the table. This function has the same effect as calling append for each example in the list.

Restrictions apply for tables that contain references to examples.

native([nativity])
Converts the ExampleTable into an ordinary Python list. If nativity is 2 (default), the list contains objects of type Example (references to examples in the table). If it is 1, the examples themselves are replaced by lists of objects of type Value (so the ExampleTable is translated to a list of lists of Value). If nativity is 0, even values are represented as native Python objects: strings and numbers.

Selection, Filtering, Translation

ExampleTable offers several methods for selection and translation of examples (some of them are actually inherited from a more general class ExampleGenerator). For easier illustration, we shall prepare an example table with 10 examples, described by a single numerical attribute having values from 0 to 9 (effectively enumerating the examples).

part of exampletable2.py

import orange

domain = orange.Domain([orange.FloatVariable()])
data = orange.ExampleTable(domain)
for i in range(10):
    data.append([i])
select(list[, int][, negate=0]) (inherited from ExampleGenerator)
Method select returns a subset of examples. The argument is a list of integers of the same length as the example table. select picks the examples for which the corresponding list element is equal to the second (optional) argument. If the latter is omitted, examples for which the corresponding element is non-zero are selected. An additional keyword argument negate=1 reverses the selection.

Note: select used to have many other functions, which are now deprecated and only kept for compatibility. We shall not document them, except for one that may cause unexpected behaviour. Say we have a data set that does not contain exactly three examples (it can have more or fewer). Calling select([0, 1, 5]) will return a table containing only the first, second and sixth attribute. In other words, if you use select as described above (and below) but give it a list of the wrong size, the call will be interpreted as if you wanted to change the domain. Don't purposely call select to change the domain.

The most natural use of this method is for dividing examples into folds. For this, we first prepare a list of fold indices using an appropriate descendant of MakeRandomIndices; MakeRandomIndicesCV, for instance, will prepare indices for cross-validation (see documentation on random indices). Then we feed the indices to select, as shown in the example below.

part of exampletable2.py

cv_indices = orange.MakeRandomIndicesCV(data, 4)
print "Indices: ", cv_indices, "\n"
for fold in range(4):
    train = data.select(cv_indices, fold, negate = 1)
    test = data.select(cv_indices, fold)
    print "Fold %d: train " % fold,
    for ex in train:
        print ex,
    print
    print " : test ",
    for ex in test:
        print ex,
    print

The printout begins with:

Indices: <1, 0, 2, 2, 0, 1, 0, 3, 1, 3>

Fold 0: train [0.000000] [2.000000] [3.000000] [5.000000] [7.000000] [8.000000] [9.000000]
 : test [1.000000] [4.000000] [6.000000]

For the first fold (0), the positions of zeros determine the examples that are selected for testing: these are the examples at positions 1, 4 and 6 (don't forget that indices in Python start with zero).

Another form of calling function select is by giving a list of integers that are interpreted as boolean values.

part of exampletable2.py

>>> t = data.select([1, 1, 0, 0, 0, 0, 0, 0, 0, 1])
>>> for ex in t:
...     print ex
[0.000000]
[1.000000]
[9.000000]

This form can also be given negate as a keyword argument to reverse the selection.

For compatibility reasons, select method still has some additional functionality which has been moved to methods filter and translate.
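The selection rule described above is simple enough to model in plain Python. The sketch below is not Orange's code: select here is a stand-alone function operating on a list, and, unlike the real method, it raises an error for an index list of the wrong length instead of reinterpreting the call. The fold indices are copied from the printout above.

```python
# Plain-Python model of select's documented behaviour; not Orange's code.
def select(rows, indices, value=None, negate=False):
    """Return rows whose index entry equals value (or is non-zero if value is None)."""
    if len(rows) != len(indices):
        raise ValueError("indices must be as long as the table")
    picked = []
    for row, index in zip(rows, indices):
        accepted = (index != 0) if value is None else (index == value)
        if accepted != bool(negate):
            picked.append(row)
    return picked


rows = list(range(10))                       # stands in for the ten examples
cv_indices = [1, 0, 2, 2, 0, 1, 0, 3, 1, 3]  # fold indices from the printout

test = select(rows, cv_indices, 0)
train = select(rows, cv_indices, 0, negate=True)
flagged = select(rows, [1, 1, 0, 0, 0, 0, 0, 0, 0, 1])

print("test:    %s" % test)
print("train:   %s" % train)
print("flagged: %s" % flagged)
```

The test and train lists together always cover the whole table, since negate simply flips the acceptance of each example.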

selectref(list[, int][, negate=0])
This function is the same as above, except that the new table contains references to examples in the original table instead of copies. This function is especially useful for sampling: the above scripts would be much faster (on large ExampleTables, naturally) if they called selectref instead of select.

selectlist(list[, int][, negate=0])
This form stores references to the selected examples in an ordinary Python list. It is thus equivalent to calling selectref and then native.
selectbool(list[, int][, negate=0])
Similar to the select function above, except that instead of examples (in whichever form) it returns a list of bools of the same length as the number of examples, denoting the accepted examples.
getitems(indices) (inherited from ExampleGenerator)
Argument indices gives a list of indices of examples to be selected. Selected examples are returned in an example table. For instance, calling data.getitems([0, 1, 9]) gives the same result as the above data.select([1, 1, 0, 0, 0, 0, 0, 0, 0, 1]): a (new) ExampleTable with examples data[0], data[1] and data[9]. Calling data.getitems(range(10)) has a similar effect to data[:10], except that the former returns an example table and the latter an ordinary list.
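The equivalence between an index list and a 0/1 selection list can be illustrated in plain Python. This is only a model working on an ordinary list; getitems here is a stand-alone function, and mask_from_indices is a hypothetical helper that does not exist in Orange.

```python
# Plain-Python model; getitems only mirrors the method described above.
def getitems(rows, indices):
    """Pick rows by position."""
    return [rows[i] for i in indices]


def mask_from_indices(n, indices):
    """Build the 0/1 list that select would need for the same subset."""
    wanted = set(indices)
    return [1 if i in wanted else 0 for i in range(n)]


rows = [float(i) for i in range(10)]         # stands in for the ten examples
by_index = getitems(rows, [0, 1, 9])
mask = mask_from_indices(len(rows), [0, 1, 9])
by_mask = [row for row, keep in zip(rows, mask) if keep]

print(mask)
print(by_index == by_mask)
```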

getitemsref(indices)
Similar to getitems, except that the resulting table contains references to examples instead of copies.

filter(conditions)
Selects examples according to the given conditions. These can be given in the form of keyword arguments or a dictionary; with the latter, an additional keyword argument negate can be given to reverse the selection. The result is a new ExampleTable.

For instance, young patients in the lenses dataset can be selected by

young = data.filter(age="young")

More than one value can be allowed and more than one attribute checked. To select all patients with age "young" or "presbyopic" who are astigmatic, use

young = data.filter(age=["young", "presbyopic"], astigmatic="yes")

If you need the reverse selection, you cannot simply add negate=1 as in the select method, since this would be interpreted as just another attribute (negate) whose value needs to be 1 (e.g. values[1], see documentation on Variable). For negation, you should use a somewhat less readable way of passing arguments to filter: pack them into a dictionary. For instance, to select examples that are not both young and astigmatic, use

young = data.filter({"age": "young", "astigmatic": "yes"}, negate=1)

Note that this selects patients who are young but not astigmatic, those who are astigmatic but not young, and those who are neither. In essence, the conjunction of conditions is computed first and the result is negated if negate is 1. If you need more flexible selection (e.g. disjunction instead of conjunction), see documentation on preprocessors.
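The way negate interacts with several conditions can be checked with a small plain-Python model. This is a sketch of the semantics described above, not Orange's filter: rows are dictionaries, the functions matches and filter_rows are hypothetical names, and the four patients are made up rather than taken from the lenses dataset.

```python
# Plain-Python model of filter's documented semantics; not Orange's code.
def matches(row, conditions):
    """True if every condition holds; a list of values means 'any of these'."""
    for name, allowed in conditions.items():
        if not isinstance(allowed, list):
            allowed = [allowed]
        if row[name] not in allowed:
            return False
    return True


def filter_rows(rows, conditions, negate=False):
    """Keep rows matching the conjunction, or its negation if negate is set."""
    return [row for row in rows if matches(row, conditions) != bool(negate)]


# Made-up patients covering all four combinations of the two attributes.
patients = [
    {"age": "young", "astigmatic": "yes"},
    {"age": "young", "astigmatic": "no"},
    {"age": "presbyopic", "astigmatic": "yes"},
    {"age": "presbyopic", "astigmatic": "no"},
]

conditions = {"age": "young", "astigmatic": "yes"}
selected = filter_rows(patients, conditions)
rejected = filter_rows(patients, conditions, negate=True)
print("%d selected, %d rejected" % (len(selected), len(rejected)))
```

Only the first patient satisfies both conditions, so the negated call returns the other three, including the one who is neither young nor astigmatic.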

Continuous attribute values are specified by pairs of values. In dataset "bridges", bridges with lengths between 1000 and 2000 (inclusive) are selected by

mid = data.filter(LENGTH=(1000, 2000))

Bridges that are shorter or longer than that are selected by inverting the range.

mid = data.filter(LENGTH=(2000, 1000))
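A pair of bounds and its inverted form can be modelled as follows. This is a plain-Python sketch, not Orange's code: in_range is a hypothetical helper, the lengths are made up, and the treatment of the boundary values in the inverted case (1000 and 2000 are excluded) is an assumption, since the text above only states that the ordinary range is inclusive.

```python
# Plain-Python model of a continuous condition. Boundary handling of the
# inverted range is an assumption, not taken from Orange's documentation.
def in_range(value, bounds):
    low, high = bounds
    if low <= high:
        return low <= value <= high        # ordinary range, inclusive
    return value < high or value > low     # inverted range: outside [high, low]


lengths = [800, 1000, 1500, 2000, 2400]    # made-up bridge lengths
mid = [x for x in lengths if in_range(x, (1000, 2000))]
outside = [x for x in lengths if in_range(x, (2000, 1000))]

print("mid:     %s" % mid)
print("outside: %s" % outside)
```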
filter(filt)
Filters examples through a given filter filt of type orange.Filter.

filterref, filterlist
Both forms of filter also have variants that return tables and lists of references to examples, analogous to methods selectref and selectlist.
filterbool
Returns a list of bools denoting which examples are accepted and which are not.
translate(domain), translate(attributes[, keepMetas])
Returns a new example table in which examples belong to the specified domain or are described by the given set of attributes. If the additional argument keepMetas is 1, the new domain will also include all meta attributes from the original domain.

Other methods

checksum()
Computes a CRC32 of the example table. The sum is computed only over discrete and continuous attributes, not over strings and other types of attributes. Meta attributes are also ignored. With those exceptions, if two tables have the same CRC, you can be pretty sure that they are the same.
hasMissingValues()
Returns true if the table contains any undefined values, either in attributes or the class. Meta-attributes are not checked.
hasMissingClasses()
Returns true if any example's class is undefined. The function throws an exception if the data is class-less.
randomexample()
Returns a random example from the table.

import orange

data = orange.ExampleTable("lenses")

print "Random random"
for i in range(5):
    print data.randomexample()

To select random examples, ExampleTable uses a random number generator stored in the field randomGenerator. If it has none, a new one is constructed and initialized with random seed 0. As a consequence, such a script will always select the same examples. If you don't want this, create another random generator and use a random number from Python to initialize it.

import random
data.randomGenerator = orange.RandomGenerator(random.randint(0, 10000))

Since Orange calls constructors when an object of incorrect type is assigned to a built-in attribute, this can be written in a shorter form as

import random
data.randomGenerator = random.randint(0, 10000)
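The effect of a fixed seed can be demonstrated with Python's own generator. This is an analogy only: random.Random stands in for orange.RandomGenerator, the list rows stands in for a table, and five_picks is a hypothetical helper; the actual examples Orange would return are not reproduced here.

```python
import random

rows = list(range(24))                 # stands in for the examples in a table


def five_picks(seed):
    """Draw five items using a generator initialized with the given seed."""
    generator = random.Random(seed)    # analogous to a seeded RandomGenerator
    return [generator.choice(rows) for _ in range(5)]


# The same seed gives the same sequence of picks on every run.
same_a = five_picks(0)
same_b = five_picks(0)
print(same_a == same_b)

# Seeding from an unpredictable number, as suggested above, varies the picks.
fresh_seed = random.randint(0, 10000)
print(five_picks(fresh_seed))
```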
removeDuplicates([weightID])
Replaces duplicated examples with single copies. If weightID is given, a meta-value is added to each example to contain the sum of weights of all examples merged into a particular example.
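The merging of duplicates with weight summing can be sketched in plain Python. This models the description above and is not Orange's implementation: remove_duplicates is a stand-alone function, weights are passed as a separate list rather than stored as meta-values, and keeping the first-seen order of examples is an assumption.

```python
# Plain-Python model of duplicate removal with weight summing; not Orange's code.
def remove_duplicates(rows, weights=None):
    """Return (unique_rows, summed_weights), keeping first-seen order."""
    if weights is None:
        weights = [1.0] * len(rows)    # every example counts once by default
    totals = {}
    order = []
    for row, weight in zip(rows, weights):
        key = tuple(row)
        if key not in totals:
            totals[key] = 0.0
            order.append(key)
        totals[key] += weight
    return [list(key) for key in order], [totals[key] for key in order]


# Made-up examples; the first one appears three times.
rows = [["young", "yes"], ["old", "no"], ["young", "yes"], ["young", "yes"]]
unique, summed = remove_duplicates(rows, [1.0, 2.0, 0.5, 1.5])

print(unique)
print(summed)
```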

sort([attributes])
Sorts examples by attribute values. The order of attributes can be given, beginning with the most important attribute. Note that discrete values are not ordered by their symbolic names (such as "young", "psby"...) but by the order in which they are listed in the attribute's values (see documentation on Variable).

Examples in dataset "bridges" can be sorted by lengths and years they were erected by data.sort("LENGTH", "ERECTED").
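The difference between ordering by value position and ordering by name can be seen in a plain-Python model. This is a sketch, not Orange's sort: the value lists in value_order are made up for the illustration, sort_rows is a hypothetical function, and it returns a new list where the method described above sorts the table itself.

```python
# Plain-Python model: discrete values are compared by their position in the
# attribute's value list, not alphabetically. The value lists are made up.
value_order = {
    "age": ["young", "pre-presbyopic", "presbyopic"],
    "astigmatic": ["no", "yes"],
}


def sort_rows(rows, *attributes):
    """Sort by the given attributes, most important first."""
    def key(row):
        return tuple(value_order[name].index(row[name]) for name in attributes)
    return sorted(rows, key=key)


rows = [
    {"age": "presbyopic", "astigmatic": "no"},
    {"age": "young", "astigmatic": "yes"},
    {"age": "pre-presbyopic", "astigmatic": "yes"},
    {"age": "young", "astigmatic": "no"},
]
ordered = sort_rows(rows, "age", "astigmatic")
print([row["age"] for row in ordered])
```

An alphabetical sort would put "young" last; here it comes first because it is the first entry in the value list.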

shuffle()
Randomly shuffles the examples in the table. This function should always be used instead of random.shuffle, which does not work for ExampleTable.
changeDomain(domain)
Changes the table's domain, converting all examples in place. Returns None. This function is not available for tables that contain references to examples.

Meta-values

Adding a meta-value to all examples in a table is a very common operation, deserving specialized functions. There are two, one for adding and the other for removing a meta-value.

addMetaAttribute(id[, value])
Adds a meta-value to all examples in the table. id can be an integer returned by orange.newmetaid(), or a string or an attribute description if the meta-attribute is registered in the table's domain. If value is given, it must be something convertible to a Value. If a corresponding meta-attribute is registered with the domain, value can be symbolic. Otherwise, it must be an index (into values), a continuous number or an object of type Value. value is an optional argument; the default is 1.0 (to be useful as a neutral weight for examples).
removeMetaAttribute(id)
Removes a meta-attribute. Again, id can be an integer or a string or an attribute description registered with the domain.