A substantial part of Orange's functionality is data manipulation: constructing data sets, selecting instances and attributes, different filtering techniques... While all operations on attributes and instances can be done on your data set in your favorite spreadsheet program, this may not be very convenient, as you may not want to jump from one environment to another, and it can even be prohibitive if data manipulation is part of your learning or testing scheme. In this section of the tutorial we therefore present some of the very basic data manipulation techniques incorporated in Orange, which may in turn be sufficient for those who want to implement their own feature or instance selection techniques, or even do something like constructive induction.
part of domain1.py (uses imports-85.tab)
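The scripts referenced in this lesson are not reproduced in this text. As a minimal sketch of what a script like domain1.py might contain, assuming only the basic ExampleTable and Domain interfaces of the orange module:

    import orange

    data = orange.ExampleTable("imports-85")
    domain = data.domain
    print "%d attributes, class variable: %s" % \
        (len(domain.attributes), domain.classVar.name)
    for a in domain.attributes[:5]:
        print "  %s" % a.name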
The script prints out a report of the following form:
Every example set in Orange has its domain. Say a variable data stores our data set; the domain of this data set can then be accessed through data.domain. Inclusion and exclusion of attributes can be managed through domains: we can use one domain to construct another, and then use Orange's select function to construct a data set from the original instances given the new domain. There is also a more straightforward way to select attributes, through direct use of orange.select.
Here is an example. We again use the imports-85 data set, and construct different data sets that include the first five attributes (newData1), attributes given in a list and specified by their names (newData2), and attributes given in a list and specified through Orange's objects called Variable (newData3):
domain2.py (uses imports-85.tab)
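A sketch of what domain2.py might look like; the attribute names are illustrative picks from imports-85, and the name-plus-source-domain form of orange.Domain is an assumption based on the description above:

    import orange

    data = orange.ExampleTable("imports-85")

    # newData1: the first five attributes, no class variable ("0")
    newData1 = data.select(orange.Domain(data.domain.attributes[:5], 0))

    # newData2: attributes listed by name, resolved in the source domain
    newData2 = data.select(orange.Domain(["make", "fuel-type", "price"],
                                         data.domain))

    # newData3: attributes listed as Variable objects, again without a class
    attrs = [data.domain["horsepower"], data.domain["price"]]
    newData3 = data.select(orange.Domain(attrs, 0))

    # newData4: the same list, but the last attribute becomes the class
    newData4 = data.select(orange.Domain(attrs))

    for d in (newData1, newData2, newData3, newData4):
        print [a.name for a in d.domain.attributes], \
            "class:", d.domain.classVar and d.domain.classVar.name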
The last two examples (construction of newData3 and newData4) show a very important distinction in crafting domains: in Orange, domains may or may not include the class variable. For classification and regression tasks you will obviously need class labels; for something like association rules, you won't. In the script above, this distinction was made by passing "0" as the last argument to orange.Domain: this "0" says no, please do not construct a class variable, as by default orange.Domain would take the class variable to be the last one in the list of attributes. The output produced by the above script is therefore:
orange.Domain is a rather powerful constructor of domains, and its complete description is beyond this tutorial. But to illustrate a little more, here is another example; run it yourself to see what happens.
domain3.py (uses glass.tab)
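Again as a sketch only (the attribute names assume Orange's glass.tab), a script in the spirit of domain3.py could try out several forms of the constructor:

    import orange

    domain = orange.ExampleTable("glass").domain
    trials = [orange.Domain(domain.attributes[:3]),       # last attribute becomes the class
              orange.Domain(domain.attributes[:3], 0),    # no class variable at all
              orange.Domain(["Na", "Mg", "Al"], domain),  # attributes given by name
              orange.Domain([domain["K"], domain["Ca"]], domain.classVar)]
    for d in trials:
        print [a.name for a in d.attributes], \
            "class:", d.classVar and d.classVar.name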
Remember that all this script does is construct domains. You would still need to use orange.select to obtain the actual example sets.
Instance selection may be based on the values of attributes, or we may simply select instances according to their index. There are also a number of filters that can help with instance selection; of these, we mention here only Filter_sameValues.
First, filtering by index. Again we will use the select function, this time giving it a vector of integers, based on which select decides whether or not to include an instance. By default, select includes instances with a corresponding non-zero element in this vector, but a specific value for which the corresponding instances should be included may also be given. Notice that through this mechanism you may craft your selection vector any way you want, and thus (if needed) implement some complex instance selection mechanism. Here, though, is a much simpler example:
domain7.py (uses glass.tab)
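A sketch of such a script, under the assumption that select accepts an optional second argument naming the value to match in the selection vector:

    import orange

    data = orange.ExampleTable("glass")

    # mark every tenth instance with 1, all others with 0
    vector = [0] * len(data)
    for i in range(0, len(data), 10):
        vector[i] = 1

    subset = data.select(vector)      # non-zero elements are selected
    rest = data.select(vector, 0)     # explicitly ask for the 0-marked ones
    print "%d + %d = %d instances" % (len(subset), len(rest), len(data))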
And here is its output:
The above should not be anything new to the careful reader of this tutorial: we already used instance selection in the chapter on performance evaluation of classifiers, where we also learned how to use MakeRandomIndices2 and MakeRandomIndicesCV to craft the selection vectors.
Next, and for something new in this tutorial, Orange's select also allows selecting instances based on their attribute values. This is best illustrated through an example, so here it goes:
domain4.py (uses adult_sample.tab)
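A sketch of the idea behind domain4.py; the attribute and value names (sex, education, the ">50K" class label) are assumptions about adult_sample.tab:

    import orange

    data = orange.ExampleTable("adult_sample")

    data1 = data.select(sex="Male")
    data2 = data.select(sex="Male", education="Masters")

    def report(name, d):
        high = len([e for e in d if e.getclass() == ">50K"])
        print "%s: %d instances, %4.2f with >$50K" % \
            (name, len(d), float(high) / len(d))

    report("data ", data)
    report("data1", data1)
    report("data2", data2)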
We have used instances from the adult data set, which, for an individual described through a set of attributes, states whether his or her yearly earnings were above $50,000. Notice that we selected instances based on gender (data1), and on gender and education (data2), and, just to show how different the resulting data sets are, reported the number of instances and the relative frequency of cases with higher earnings. Notice also that when more than one attribute-value pair is given to select, the conjunction of the conditions is used. The output of the above script is:
Could we request examples for which either of the conditions holds? Or those for which neither of them does? Or... Yes, but not with select. For this we need to use a more versatile filter called Preprocessor_take. Here is an example of how it is used.
part of domain5.py (uses adult_sample.tab)
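A sketch of how Preprocessor_take might be used, assuming (as described in the surrounding text) that the filter carries domain, values, conjunction and negate fields, and that values maps variable descriptors to the required values:

    import orange

    data = orange.ExampleTable("adult_sample")

    filter = orange.Preprocessor_take()
    filter.domain = data.domain       # set the domain before the values
    filter.values = {data.domain["sex"]: "Male",
                     data.domain["education"]: "Masters"}

    filter.conjunction = 1            # AND: behaves just like select
    dataAnd = filter(data)
    filter.conjunction = 0            # OR: either condition suffices
    dataOr = filter(data)
    print "AND: %d, OR: %d instances" % (len(dataAnd), len(dataOr))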
The results are reported as:
Notice that with conjunction=1 the resulting data set is just like the one constructed with select. Not only that: select and Preprocessor_take both actually work by constructing a Filter_sameValues object, using it, and discarding it afterwards. What we gain by using Preprocessor_take instead of select is access to the field conjunction: if it is set to 0, the conditions are treated in disjunction (OR) instead of in conjunction (AND). And there is also Preprocessor_take.negate, which reverses the selection. When constructing the filter, it is essential to set the domain before specifying the values.
Selection methods can also deal with continuous attributes and values. Constraints with respect to the values of a continuous attribute are specified as intervals (lower limit, upper limit). Limits are inclusive: for a limit (30, 40) and attribute age, an instance is selected if its age is higher than or equal to 30 and lower than or equal to 40. If the limits are reversed, e.g. (40, 30), examples with values outside the interval are selected instead; that is, an instance is selected if its age is lower than or equal to 30 or higher than or equal to 40.
part of domain6.py (uses adult_sample.tab)
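A sketch along the same lines, with intervals for the (assumed) continuous attribute age:

    import orange

    data = orange.ExampleTable("adult_sample")

    filter = orange.Preprocessor_take()
    filter.domain = data.domain
    filter.values = {data.domain["age"]: (30, 40)}   # 30 <= age <= 40
    thirties = filter(data)

    filter.values = {data.domain["age"]: (40, 30)}   # reversed: outside [30, 40]
    others = filter(data)
    print "in thirties: %d, outside: %d" % (len(thirties), len(others))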
Running this script shows that it pays to be in your thirties (good for the authors of this text, at the time of writing):
Early in our tutorial we learned that if data is a variable that stores our data set, instances can be accessed simply by indexing the data: data[5] is the sixth instance (indices start with 0). Attributes can be accessed through their index (data[5][3]: the fourth attribute of the sixth instance), their name (data[5]["age"]: attribute age of the sixth instance), or a variable descriptor (a = data.domain.attributes[5]; print data[5][a]: attribute a of the sixth instance).
At this point it should be obvious that attribute values can be used in any (appropriate) Python expression, and you may also set the values of attributes, as in data[5]["fuel-type"] = "gas". Orange will report an error if an assignment uses a value outside the variable's scope.
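Put together, and using the imports-85 data set from earlier in this lesson (the attribute fuel-type and its value "gas" come from the text above):

    import orange

    data = orange.ExampleTable("imports-85")
    a = data.domain.attributes[3]
    print data[5][3], data[5][a]      # the fourth attribute of the sixth
                                      # instance, by index and by descriptor
    data[5]["fuel-type"] = "gas"      # a value outside the variable's
                                      # scope would raise an error here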
Who needs these? Isn't real data populated with noise and missing values anyway? Well, in machine learning you may sometimes want to add these to see how robust your learning algorithms are, particularly if you deal with artificial data sets that do not include noise and want to make them more "realistic". Like it or not, here is how this may be done.
First, we will add class noise to the data set and, to make things interesting, use this data set with some learner and observe if and how the accuracy of the learner is affected by the noise. To add class noise, we will use Preprocessor_class_noise, with an attribute that tells in what proportion of instances the class is set to an arbitrary value. Notice that the probabilities of class values used by Preprocessor_class_noise are uniform.
domain8.py (uses adult_sample.tab)
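A sketch of the experiment; the name of the class-noise preprocessor is taken from the text above (in some Orange versions it appears as orange.Preprocessor_addClassNoise), and its proportion attribute is an assumption:

    import orange, orngTest, orngStat

    data = orange.ExampleTable("adult_sample")

    noise = orange.Preprocessor_class_noise()
    noise.proportion = 0.3            # assumed name: share of instances whose
                                      # class is replaced by an arbitrary value
    noisy = noise(data)

    learner = orange.BayesLearner()
    for name, d in (("original", data), ("noisy", noisy)):
        res = orngTest.crossValidation([learner], d, folds=10)
        print "%s: CA = %5.3f" % (name, orngStat.CA(res)[0])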
Obviously, we expect that with added noise the performance of any classifier will degrade. This is indeed so for our example and the naive Bayes learner:
We can also add noise to attributes. Here, we should distinguish between continuous and discrete attributes.
In machine learning and data mining, you may often encounter situations where you wish to add an extra attribute that is constructed from some existing subset of attributes. You may do that "manually" (you know exactly from which attributes you will derive the new one, and you know the function as well), or in some automatic way through, say, constructive induction.
To introduce this subject we will be very unambitious here and just show how to deal with the first, manual, case. Here are two examples. In the first, we add two attributes to the well-known iris data set; the two represent approximations of petal and sepal area, respectively, and are derived from petal and sepal length and width. The attributes are declared first: in our case we use orange.FloatVariable, which returns an object that stores our variable and its properties. One important property, getValueFrom, tells how this variable is computed. All that we need to do next is construct a new domain that includes the new variables; from then on, every time the new variables are accessed, Orange will know how to compute them.
domain11.py (uses iris.tab)
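A sketch of domain11.py, assuming the attribute names used in Orange's iris.tab (petal length, petal width, sepal length, sepal width):

    import orange

    data = orange.ExampleTable("iris")

    # declare the new attributes and tell Orange how to compute them
    pa = orange.FloatVariable("petal area")
    pa.getValueFrom = lambda e, w: orange.Value(
        pa, float(e["petal length"]) * float(e["petal width"]))
    sa = orange.FloatVariable("sepal area")
    sa.getValueFrom = lambda e, w: orange.Value(
        sa, float(e["sepal length"]) * float(e["sepal width"]))

    # a new domain with the derived attributes appended
    newDomain = orange.Domain(list(data.domain.attributes) + [pa, sa],
                              data.domain.classVar)
    newData = data.select(newDomain)
    for i in (0, 50, 100, 149):       # four nicely spread-out instances
        print newData[i]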
The script takes care to nicely print out four instances from the new data set; here is its output:
The story is slightly different with nominal attributes, where apart from the name we need to declare the set of values as well. Everything else is quite similar. Below is an example that adds a new attribute to the car data set (see more on the car data set web page):
domain12.py (uses car.tab)
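A sketch with an illustrative derived attribute (the safety attribute and its value "low" are assumptions about car.tab; the attribute actually added in domain12.py may differ):

    import orange

    data = orange.ExampleTable("car")

    # a nominal attribute must declare its set of values up front
    safe = orange.EnumVariable("reasonable safety", values=["no", "yes"])
    safe.getValueFrom = lambda e, w: orange.Value(
        safe, ["no", "yes"][e["safety"] != "low"])

    newDomain = orange.Domain(list(data.domain.attributes) + [safe],
                              data.domain.classVar)
    newData = data.select(newDomain)
    for i in range(0, 25, 5):         # five selected instances
        print newData[i]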
The output of this code (we intentionally printed out five selected data instances) is:
In machine learning, we usually alter the data domain to achieve better predictive accuracy, or to introduce attributes that are better understood by domain experts. We tested the first hypothesis on our data set, constructing classification trees from the original and the new data set, respectively. The results of running the following script are not striking (in terms of accuracy boost), but it still gives an example of how to run exactly the same cross-validation on two data sets with the same number of instances.
domain13.py (uses iris.tab)
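A sketch of the comparison; the trick is to build the cross-validation indices once and reuse them for both data sets, which orngTest.testWithIndices allows:

    import orange, orngTree, orngTest, orngStat

    data = orange.ExampleTable("iris")

    # extend the domain with the two area attributes, as in domain11.py
    pa = orange.FloatVariable("petal area")
    pa.getValueFrom = lambda e, w: orange.Value(
        pa, float(e["petal length"]) * float(e["petal width"]))
    sa = orange.FloatVariable("sepal area")
    sa.getValueFrom = lambda e, w: orange.Value(
        sa, float(e["sepal length"]) * float(e["sepal width"]))
    newData = data.select(orange.Domain(
        list(data.domain.attributes) + [pa, sa], data.domain.classVar))

    indices = orange.MakeRandomIndicesCV(data, 10)   # the same folds for both
    learner = orngTree.TreeLearner()
    for name, d in (("original", data), ("extended", newData)):
        res = orngTest.testWithIndices([learner], d, indices)
        print "%s: CA = %5.3f" % (name, orngStat.CA(res)[0])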
Here, we construct a simple feature subset selection algorithm that uses a wrapper approach (see Kohavi R, John G: The Wrapper Approach, in Feature Extraction, Construction and Selection: A Data Mining Perspective, edited by Huan Liu and Hiroshi Motoda) and a hill-climbing search strategy. The wrapper approach estimates the quality of a given feature set by running the selected learning algorithm on it. We start with the empty feature set and incrementally add features from the original data set, doing so only if the classification accuracy increases; hence we stop when adding any single feature no longer yields a gain in performance. For evaluation, we use cross-validation. [What Kohavi and John describe in their wrapper approach is a little more complex: it uses best-first search and some smarter evaluation. From the script presented here to their algorithm is not far, and you are encouraged to build such a wrapper if you need one, or as an exercise.]
domain9.py (uses voting.tab)
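A sketch of the hill-climbing wrapper described above (the original domain9.py may differ in details such as tie-breaking and reporting):

    import orange, orngTest, orngStat

    def WrapperFSS(data, learner, verbose=0, folds=10):
        classVar = data.domain.classVar
        chosen = []                                 # current feature set
        candidates = list(data.domain.attributes)   # not yet chosen

        def score(attributes):
            d = data.select(orange.Domain(attributes, classVar))
            res = orngTest.crossValidation([learner], d, folds=folds)
            return orngStat.CA(res)[0]

        best = score(chosen)                        # start with the empty set
        while candidates:
            scored = [(score(chosen + [a]), a) for a in candidates]
            if verbose == 2:
                for s, a in scored:
                    print "  %s: %5.3f" % (a.name, s)
            s, a = max(scored)
            if s <= best:                           # no single feature helps
                break
            best = s
            chosen.append(a)
            candidates.remove(a)
            if verbose:
                print "added %s (CA = %5.3f)" % (a.name, best)
        return chosen

    data = orange.ExampleTable("voting")
    import orngTree
    features = WrapperFSS(data, orngTree.TreeLearner(), verbose=1)
    print "selected:", [a.name for a in features]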
For wrapper feature subset selection we have defined a function WrapperFSS, which takes the data and the learner, and can optionally be asked to report on the progress of the search (verbose=1). Cross-validation uses ten folds by default, but you may change this through a parameter of WrapperFSS. Here is the result of a single run of the script, where we used a classification tree as the learner:
Notice that only a single attribute was selected (party is the class). You may explore the behavior of the algorithm in more detail, and see why this happens, by calling the feature subset selection with verbose=2. You may also replace the tree learner with some other algorithm. We did this and used naive Bayes (learner = orange.BayesLearner()), and got the following:
The selected set of features includes physician-fee-freeze, but in addition also two other attributes.
[One thing about naive Bayes is that it will report a bunch of warnings of the type
This is because, at the start of the feature subset selection, a set with no attributes other than the class was given to the learner. These warnings are OK and can come in handy elsewhere; if you really do not like them here, add the following to the code:
An issue, of course, is whether this feature subset selection by wrapping helps us build a better classifier. To test this, we will construct a WrappedFSSLearner, which takes some learning algorithm and a data set, does feature subset selection to determine the appropriate set of features, and crafts a classifier from the data restricted to this set of features. As we did before in this tutorial, we will construct WrappedFSSLearner so that it can be used by other Orange modules.
part of domain10.py (uses voting.tab)
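A sketch of the wrapping, following the learner/classifier pattern used earlier in this tutorial, and assuming the WrapperFSS function from domain9.py is defined in the same script:

    import orange

    def WrappedFSSLearner(learner, examples=None, **kwds):
        l = apply(WrappedFSSLearner_Class, (learner,), kwds)
        if examples:
            return l(examples)
        return l

    class WrappedFSSLearner_Class:
        def __init__(self, learner, folds=10, name='learner w/ FSS'):
            self.learner, self.folds, self.name = learner, folds, name
        def __call__(self, data, weight=None):
            # select the features, then train on the reduced data set
            features = WrapperFSS(data, self.learner, folds=self.folds)
            domain = orange.Domain(features, data.domain.classVar)
            classifier = self.learner(data.select(domain))
            return Classifier(classifier=classifier, name=self.name)

    class Classifier:
        def __init__(self, **kwds):
            self.__dict__.update(kwds)
        def __call__(self, example, resultType=orange.GetValue):
            return self.classifier(example, resultType)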
The wrapped learner uses WrapperFSS, which is exactly the same function that we developed in our previous script. The objects introduced in the script above mainly take care of the attributes; the code that really does something is in the __call__ method of WrappedFSSLearner_Class. Running this script with a classification tree, we get the same single-attribute set with physician-fee-freeze for all ten folds, and a minimal gain in accuracy. Something similar happens for naive Bayes, except that the attributes included in the data set are [physician-fee-freeze, synfuels-corporation-cutback, adoption-of-the-budget-resolution], and the reported statistics are noticeably higher than for naive Bayes without feature subset selection:
This concludes the lesson on basic data set manipulation, which started with descriptions of some really elementary operations and finished with a not-so-very-basic algorithm. Still, if you are inspired to do feature subset selection, you may take the code for the wrapper approach demonstrated at the end of this lesson and extend it in any way you like; what you may find is that writing Orange scripts like this is easy and can be quite a joy.