Filters

Filters are objects for selecting examples. They are related to preprocessors but more limited: a filter can accept or reject examples, but it cannot modify them. A further restriction is that a filter sees only individual examples, never the entire dataset. This matters when selecting examples at random (see Filter_random).

General behavior

All filters have two attributes.

Attributes

negate
Inverts the filter's decisions.
domain
Domain to which examples are converted prior to checking (except for Filter_random, which ignores this field).

Besides the constructor, filters provide the call operator and a method that returns a list denoting which examples match the filter criterion.

Methods

__call__(example)
Checks whether the example matches the filter's criterion and returns either True or False.
__call__(examples)
When given an entire example table, returns the examples that match the criterion (as an ExampleTable).
selectionVector(examples)
Returns a list of bools of the same length as examples, denoting which examples are accepted. Equivalent to [filter(ex) for ex in examples].

An alternative way to apply a filter to a table of examples is to call ExampleTable.filter.
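
The calling conventions described above can be mimicked in plain Python. The following is a minimal sketch and not Orange code: SketchFilter, its accepts predicate and the list-of-dicts "table" are hypothetical stand-ins, and the conversion of examples to domain that real filters perform is left out.

```python
# Pure-Python stand-in for the filter interface; not part of Orange.
class SketchFilter:
    def __init__(self, accepts, negate=False):
        self.accepts = accepts   # predicate that sees one example at a time
        self.negate = negate     # invert every decision when True

    def __call__(self, arg):
        # A list plays the role of an example table: return the matching examples.
        if isinstance(arg, list):
            return [ex for ex in arg if self(ex)]
        # Otherwise arg is a single example: return True or False.
        return bool(self.accepts(arg)) != self.negate

    def selection_vector(self, examples):
        # One boolean per example, in the original order.
        return [self(ex) for ex in examples]


examples = [{"age": "young"}, {"age": "presbyopic"}, {"age": "young"}]
is_young = SketchFilter(lambda ex: ex["age"] == "young")
not_young = SketchFilter(lambda ex: ex["age"] == "young", negate=True)

print(is_young(examples[0]))                # True
print(is_young.selection_vector(examples))  # [True, False, True]
print(not_young(examples))                  # [{'age': 'presbyopic'}]
```

The same object answers both kinds of call, which is why a filter can be applied to one example or to a whole table without any extra method.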

Random filter

Filter_random accepts an example with a given probability.

Attributes

prob
Probability for accepting an example.
randomGenerator
A random number generator that is used for making selections. If it is not set before the filter is used for the first time, a new generator is constructed and stored here for future use.

The inherited attribute domain is ignored.

part of filter.py

>>> randomfilter = orange.Filter_random(prob = 0.7, randomGenerator = 24)
>>> for i in range(10):
...     print randomfilter(example),
1 1 0 1 0 0 0 1 1 1

For this script, example should be some learning example; you can load any data and set example = data[0]. The script's result will always be the same. Although the probability of selecting an example is set to 0.7, the filter accepted the example six times out of ten. Since the filter sees only individual examples, it cannot hit the requested proportion exactly; if you need to select exactly 70% of the examples in a dataset, use random indices instead.

Setting the random generator ensures that the filter always selects the same examples, regardless of how many times you run the script or what you do (in Orange) before you run it. randomGenerator=24 is a shortcut for randomGenerator = orange.RandomGenerator(24) or randomGenerator = orange.RandomGenerator(initseed=24).

To select a subset of examples instead of calling the filter for each individual example, call the filter with the whole table:

data70 = randomfilter(data)
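
The difference between per-example acceptance and an exact 70% selection can be illustrated without Orange. This is a rough sketch under stated assumptions: it uses Python's standard random module rather than orange.RandomGenerator, and bernoulli_select and exact_select are hypothetical helper names, so the drawn numbers will not match Orange's.

```python
import random

def bernoulli_select(items, prob, seed):
    # Each item is judged on its own, as a per-example filter must do.
    rng = random.Random(seed)
    return [x for x in items if rng.random() < prob]

def exact_select(items, prob, seed):
    # Seeing the whole dataset allows picking exactly round(prob * n) items.
    rng = random.Random(seed)
    k = int(round(prob * len(items)))
    chosen = set(rng.sample(range(len(items)), k))
    return [x for i, x in enumerate(items) if i in chosen]

items = list(range(100))
per_example = bernoulli_select(items, 0.7, seed=24)
exact = exact_select(items, 0.7, seed=24)

print(len(per_example))  # near 70, but not guaranteed to equal it
print(len(exact))        # always 70
```

Fixing the seed makes both selections repeatable, which mirrors the effect of setting randomGenerator on the filter.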

Filtering examples with(out) unknown values

Filter_isDefined selects examples for which all attribute values are defined (known). By default, the filter checks all attributes; you can modify the list check to select the attributes to be checked. This filter never checks meta-attributes. (There is an obsolete filter, Filter_hasSpecial, which does the opposite: it selects examples with at least one unknown value in any of the attributes, including the class attribute. Filter_hasSpecial always checks all attributes except meta-attributes.) Filter_hasClassValue selects examples with a defined class value. You can use negate to invert the selection, as shown in the script below.

Attributes

check
A list of boolean elements specifying which attributes to check. Each element corresponds to an attribute in the domain. By default, check is None, meaning that all attributes are checked. The list is initialized to a list of trues when the filter's domain is set unless the list already exists. You can also set check manually, even without setting the domain. The list can be indexed by ordinary integers (e.g., check[0]); if domain is set, you can also address the list by attribute names or descriptors.

As for all Orange objects, it is not recommended to modify the domain after it has been set once, unless you know exactly what you are doing. In this particular case, changing the domain would disrupt the correspondence between the domain attributes and the check list, causing unpredictable behaviour.

part of filter.py (uses lenses.tab)

import orange

data = orange.ExampleTable("lenses")

# Take the first five examples and introduce two unknown values.
data2 = data[:5]
data2[0][0] = "?"
data2[1].setclass("?")

print "First five examples"
for ex in data2:
    print ex

print "\nExamples without unknown values"
f = orange.Filter_isDefined(domain = data.domain)
for ex in f(data2):
    print ex

print "\nExamples without unknown values, ignoring 'age'"
f.check["age"] = 0
for ex in f(data2):
    print ex

print "\nExamples with unknown values (ignoring age)"
for ex in f(data2, negate=1):
    print ex

print "\nExamples with defined class"
for ex in orange.Filter_hasClassValue(data2):
    print ex

print "\nExamples with undefined class"
for ex in orange.Filter_hasClassValue(data2, negate=1):
    print ex
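
The role of the check list can also be shown with plain Python lists. The sketch below is only an illustration, not Orange's implementation: is_defined is a hypothetical helper, unknown values are represented by None, and the rows are made up rather than read from lenses.tab.

```python
# Unknown values are represented by None in this sketch.
def is_defined(example, check=None):
    # check is a list of booleans, one per attribute; None means "check all".
    if check is None:
        check = [True] * len(example)
    return all(value is not None
               for value, checked in zip(example, check) if checked)

rows = [
    [None, "myope", "no"],     # first attribute unknown
    ["young", "myope", None],  # last attribute unknown
    ["young", "myope", "no"],  # fully defined
]

all_checked = [is_defined(r) for r in rows]
first_ignored = [is_defined(r, [False, True, True]) for r in rows]
negated = [not is_defined(r) for r in rows]

print(all_checked)    # [False, False, True]
print(first_ignored)  # [True, False, True]
print(negated)        # [True, True, False]
```

Switching off one element of check only relaxes the test for that attribute; unknown values elsewhere still cause rejection.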

Filtering examples with(out) a meta value

Filter_hasMeta filters out the examples that don't have (or that do have, when negated) a meta attribute with the given id.

Attributes

id
The id of the meta attribute we look for.

This filter is especially useful with examples from the basket format and their optional meta attributes. If they come, for instance, from a text mining domain, we can use it to get the documents that contain a certain word.

part of filterm.py (uses inquisition.basket)

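import orange

data = orange.ExampleTable("inquisition.basket")

# Calling the filter class with data returns the matching examples directly.
haveSurprise = orange.Filter_hasMeta(data, id = data.domain.index("surprise"))
for ex in haveSurprise:
    print ex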

This example, which will print out all instances that contain the word "surprise", gets the id of the meta attribute from the domain by searching for the attribute named "surprise". This meta attribute is optional and does not necessarily appear in all examples. To fully understand how this particular example works, you should be familiar with optional meta attributes and the basket file format.

This filter can of course also be used in other situations involving meta values that appear only in some examples. The corresponding attributes do not need to be registered in the domain.
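
The idea of selecting by the presence of an optional meta value can be sketched with dictionaries. Everything here is a hypothetical stand-in: the word_ids table, the negative ids, the documents and the has_meta helper are invented for illustration and do not come from inquisition.basket.

```python
# Each document stores only the words it contains, keyed by a made-up meta id.
word_ids = {"surprise": -2, "fear": -3, "efficiency": -4}

documents = [
    {-2: 1.0, -3: 1.0},   # contains "surprise" and "fear"
    {-4: 1.0},            # contains only "efficiency"
    {-2: 1.0, -4: 1.0},   # contains "surprise" and "efficiency"
]

def has_meta(example, meta_id, negate=False):
    # Accept the example when the meta value is present (or absent, if negated).
    return (meta_id in example) != negate

surprise_id = word_ids["surprise"]
with_surprise = [d for d in documents if has_meta(d, surprise_id)]
without_surprise = [d for d in documents if has_meta(d, surprise_id, negate=True)]

print(len(with_surprise), len(without_surprise))  # 2 1
```

Only the presence of the id matters in this sketch; the stored value is never inspected.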

Filtering by attribute values

Fast filter for single values

Filter_sameValue is a fast filter for selecting examples with a particular value of some attribute.

Attributes

position
Position of the attribute in the domain.
value
Attribute's value

If domain is not set, make sure that examples are from the right domain so that position applies to the attribute you want.

part of filter.py (uses lenses.tab)

filteryoung = orange.Filter_sameValue()
age = data.domain["age"]
filteryoung.value = orange.Value(age, "young")
filteryoung.position = data.domain.attributes.index(age)
print "\nYoung examples"
for ex in filteryoung(data):
    print ex

This script selects examples with age="young" from the lenses dataset. Setting position is somewhat tricky: data.domain.attributes behaves as a list and provides the method index, which we can use to retrieve the position of the attribute age. The attribute age is also needed to construct a Value.

As you can see, this filter is dirty but quick.
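
Why position needs care can be seen in a small plain-Python sketch. The attribute names, the rows and the same_value helper are illustrative assumptions, not taken from Orange or from lenses.tab.

```python
attributes = ["age", "prescription", "astigmatic", "tear_rate"]  # illustrative names
rows = [
    ["young", "myope", "no", "reduced"],
    ["presbyopic", "myope", "yes", "normal"],
    ["young", "hypermetrope", "yes", "normal"],
]

position = attributes.index("age")   # list.index gives the attribute's position

def same_value(example, position, value):
    # Only the value at the given position is inspected; names are never consulted.
    return example[position] == value

young = [r for r in rows if same_value(r, position, "young")]

# The same position applied to examples with a different attribute order
# silently tests the wrong attribute.
reordered = [r[::-1] for r in rows]
wrong = [r for r in reordered if same_value(r, position, "young")]

print(position, len(young), len(wrong))  # 0 2 0
```

The second selection finds nothing because position 0 no longer holds age, which is the situation the note about unset domain warns against.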

Simple filter for continuous attributes

The ValueFilter class provides different ways of filtering values of continuous attributes: ValueFilter.Equal, ValueFilter.Less, ValueFilter.LessEqual, ValueFilter.Greater, ValueFilter.GreaterEqual, ValueFilter.Between, ValueFilter.Outside.

The following excerpt uses two different filters: ValueFilter.GreaterEqual, which needs only one parameter, and ValueFilter.Between, which is defined by two parameters.

part of filterv.py (uses iris.tab)

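import orange

data = orange.ExampleTable("iris")

fcont = orange.Filter_values(domain = data.domain)

# One reference value: keep examples whose first attribute is >= 7.6.
fcont[0] = (orange.ValueFilter.GreaterEqual, 7.6)
print "\n\nThe first attribute is greater than or equal to 7.6"
for ex in fcont(data):
    print ex

# Two reference values: keep examples whose first attribute lies in the interval.
fcont[0] = (orange.ValueFilter.Between, 4.6, 5.0)
print "\n\nThe first attribute is between 4.6 and 5.0"
for ex in fcont(data):
    print ex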
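
The seven operators can be thought of as predicates taking one or two reference values. The following sketch is an illustration only: the OPERATORS table and value_filter helper are hypothetical, the measurements are made up, and treating the bounds of Between as inclusive is an assumption of this sketch rather than documented Orange behaviour.

```python
import operator

# Hypothetical mapping from operator names to predicates on a single number.
# Inclusive Between bounds are an assumption of this sketch.
OPERATORS = {
    "Equal": operator.eq,
    "Less": operator.lt,
    "LessEqual": operator.le,
    "Greater": operator.gt,
    "GreaterEqual": operator.ge,
    "Between": lambda x, lo, hi: lo <= x <= hi,
    "Outside": lambda x, lo, hi: x < lo or x > hi,
}

def value_filter(values, name, *refs):
    # One reference value for comparisons, two for Between and Outside.
    test = OPERATORS[name]
    return [x for x in values if test(x, *refs)]

sepal_lengths = [4.4, 4.8, 5.0, 6.3, 7.6, 7.9]  # made-up measurements

print(value_filter(sepal_lengths, "GreaterEqual", 7.6))  # [7.6, 7.9]
print(value_filter(sepal_lengths, "Between", 4.6, 5.0))  # [4.8, 5.0]
print(value_filter(sepal_lengths, "Outside", 4.6, 5.0))  # [4.4, 6.3, 7.6, 7.9]
```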

Filter for multiple values and attributes

Filter_values performs a similar function to Filter_sameValue, but can handle conjunctions and disjunctions of more complex conditions.

Attributes

conditions
A list of type ValueFilterList that contains conditions.
conjunction
Decides whether the filter computes the conjunction or the disjunction of the conditions. If true, an example is accepted if no values are rejected. If false, an example is accepted if at least one value is accepted.

Elements of list conditions must be objects of type ValueFilter_discrete for discrete and ValueFilter_continuous for continuous attributes; both are derived from ValueFilter.

Both have fields position denoting the position of the checked attribute (just as in Filter_sameValue) and acceptSpecial that determines whether undefined values are accepted (1), rejected (0) or simply ignored (-1, default).

ValueFilter_discrete has field values of type ValueList that contains objects of type Value that represent the acceptable values.

ValueFilter_continuous has fields min and max that define an interval, and a field outside that tells whether values outside or inside the interval are accepted. The default is false (inside).

part of filter.py (uses lenses.tab)

fya = orange.Filter_values()
fya.domain = data.domain
age, astigm = data.domain["age"], data.domain["astigmatic"]

fya.conditions.append(orange.ValueFilter_discrete(
    position = data.domain.attributes.index(age),
    values = [orange.Value(age, "young"), orange.Value(age, "presbyopic")])
)
fya.conditions.append(orange.ValueFilter_discrete(
    position = data.domain.attributes.index(astigm),
    values = [orange.Value(astigm, "yes")])
)

for ex in fya(data):
    print ex

This script selects examples whose age is "young" or "presbyopic" and which are astigmatic. Unknown values are ignored (if the value for one of the two attributes is missing, only the other is checked; if both are missing, the example is accepted).

The script first constructs the filter and assigns a domain. Then it appends both conditions to the filter's conditions field. Both are of type orange.ValueFilter_discrete, since the two attributes are discrete. The position of each attribute is obtained in the same way as for Filter_sameValue, described above.

The list of conditions can also be given to the filter constructor. The following filter will accept examples whose age is "young" or "presbyopic" or who are astigmatic (conjunction = 0). In contrast to the filter above, unknown age is not acceptable (but examples with unknown age can still be accepted if they are astigmatic). Meanwhile, examples with unknown astigmatism are always accepted.

part of filter.py (uses lenses.tab)

fya = orange.Filter_values(
    domain = data.domain,
    conditions = [
        orange.ValueFilter_discrete(
            position = data.domain.attributes.index(age),
            values = [orange.Value(age, "young"), orange.Value(age, "presbyopic")],
            acceptSpecial = 0
        ),
        orange.ValueFilter_discrete(
            position = data.domain.attributes.index(astigm),
            values = [orange.Value(astigm, "yes")],
            acceptSpecial = 1
        )
    ],
    conjunction = 0
)
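
How conjunction and acceptSpecial interact is easier to follow in a small model. The sketch below is a plain-Python reading of the rules stated in this section, not Orange's implementation: check_discrete and filter_values are hypothetical names, examples are two-element lists [age, astigmatic] with None for unknown values, and rejecting a disjunction in which every condition is ignored is an assumption of this sketch.

```python
def check_discrete(example, position, values, accept_special=-1):
    # Returns True (accepted), False (rejected) or None (ignored).
    value = example[position]
    if value is None:                     # unknown value
        if accept_special == -1:
            return None                   # condition is skipped
        return accept_special == 1
    return value in values

def filter_values(example, conditions, conjunction=True):
    results = [check_discrete(example, *cond) for cond in conditions]
    if conjunction:
        # Accepted if no condition rejects; ignored conditions do not count.
        return all(r is not False for r in results)
    # Accepted if at least one condition accepts.
    return any(r is True for r in results)

# (position, acceptable values[, accept_special]) for age and astigmatic
conj = [(0, {"young", "presbyopic"}), (1, {"yes"})]
disj = [(0, {"young", "presbyopic"}, 0), (1, {"yes"}, 1)]

print(filter_values(["young", "yes"], conj))                  # True
print(filter_values([None, None], conj))                      # True: both ignored
print(filter_values([None, "no"], disj, conjunction=False))   # False: unknown age rejected
print(filter_values([None, "yes"], disj, conjunction=False))  # True: astigmatic
print(filter_values(["pre-presbyopic", None], disj, conjunction=False))  # True
```

The last three calls reproduce the behaviour described for the second filter: unknown age alone is not enough, while unknown astigmatism always is.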

If you don't find this filter attractive, use Preprocessor_take instead, which is less flexible but more intelligent and friendly.