orngTest is an Orange module for testing learning algorithms. It includes functions for data sampling and splitting, and for testing learners. It implements cross-validation, leave-one-out, random sampling and learning curves. All functions return results in the same form: an instance of ExperimentResults, described at the end of this page, or, in the case of learning curves, a list of ExperimentResults. These objects can be passed to the statistical functions for model evaluation (classification accuracy, Brier score, ROC analysis...) available in module orngStat.
Your scripts will thus basically conduct experiments using functions in orngTest, covered on this page, and then evaluate the results with functions in orngStat. For those interested in writing their own statistical measures of the quality of models, descriptions of TestedExample and ExperimentResults are available at the end of this page.
An important change over previous versions of Orange: Orange has been "de-randomized". Running the same script twice will generally give the same results, unless special care is taken to randomize it. This is in contrast to previous versions, where special care was needed to make experiments repeatable. See the arguments randseed and randomGenerator for an explanation.
Example scripts in this section suppose that the data is loaded and a list of learning algorithms is prepared (part of test.py, which uses voting.tab). After testing is done, classification accuracies can be computed and printed by a small helper function that uses the list names constructed alongside the learners.
Many functions in this module use a set of common arguments, which we define here.

Learners can be either pure Orange objects (such as orange.BayesLearner) or classes or functions written in pure Python (anything that can be called with the same arguments and results as Orange's classifiers and performs a similar function).

Examples are given as an ExampleTable (some functions need an undivided set of examples while others need examples that are already split into two sets). If examples are weighted, pass them as a tuple (examples, weightID). Weights are respected by learning and testing, but not by sampling: when selecting 10% of examples, this means 10% by number, not by weights. There is also no guarantee that the sums of example weights will be (at least roughly) equal across the folds of cross-validation.

Stratification defaults to orange.StratifiedIfPossible, which stratifies selections if the class attribute is discrete and has no unknown values.

Random selection of examples is controlled by a random seed (randseed) or a random generator (randomGenerator). If both are omitted, a random seed of 0 is used and the same test will always select the same examples from the example set. There are various slightly different ways to randomize it.
One option is to set randomGenerator to orange.globalRandom. The function's selection will then depend upon Orange's global random generator, which is reset (with random seed 0) when Orange is imported. The script's output will therefore depend upon what you did after Orange was first imported in the current Python session.
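For instance, a sketch of this approach, reusing data, learners and printResults from above and using proportionTest (described below), under the assumption that it accepts the keyword randomGenerator:

    # selection depends on Orange's global random generator; output changes
    # depending on what was run earlier in this Python session
    res = orngTest.proportionTest(learners, data, 0.7, 100,
                                  randomGenerator=orange.globalRandom)
    printResults(res)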
Another option is to construct a new orange.RandomGenerator and use it in various places and times. The code below, for instance, will produce different results in each iteration, but overall the same results each time it's run.
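The original snippet is not preserved here; a sketch of what it might look like, with an arbitrary seed and assuming the randomGenerator keyword:

    # a private random generator, reused across calls
    myRandom = orange.RandomGenerator(42)
    for i in range(3):
        # a different selection in each iteration, but the same three
        # selections every time the script is run
        res = orngTest.proportionTest(learners, data, 0.7, 100,
                                      randomGenerator=myRandom)
        printResults(res)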
The third option is to set the random seed (argument randseed) to a random number provided by Python. Python has a global random generator that is reset when Python is loaded, using the current system time as a seed. With this, results will in general be different each time the script is run.
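A sketch of this approach, assuming proportionTest accepts the keyword randseed:

    import random
    # a different seed, and hence a different selection, on each run
    res = orngTest.proportionTest(learners, data, 0.7, 100,
                                  randseed=random.randint(0, 100))
    printResults(res)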
Preprocessors are given as a list of tuples (c, preprocessor), where c determines whether the preprocessor will be applied to the learning set ("L"), to the test set ("T") or to both ("B"). The latter is applied first, while the example set is still undivided; the "L" and "T" preprocessors are applied to the separated subsets. Preprocessing of testing examples is allowed only in experimental procedures that do not report the TestedExample's in the same order as the examples in the original set. The second item in the tuple, preprocessor, can be either a pure Orange or a pure Python preprocessor, that is, any function or callable class that accepts a table of examples and a weight, and returns a preprocessed table and weight.

This example demonstrates the devastating effect of 100% class noise on learning.
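The original snippet is not preserved; a sketch of such an experiment, assuming the preprocessor list is passed through the keyword pps and using orange.Preprocessor_addClassNoise:

    # scramble the classes of the learning examples only ("L")
    classNoise = orange.Preprocessor_addClassNoise(proportion=1.0)
    res = orngTest.proportionTest(learners, data, 0.7, 100,
                                  pps=[("L", classNoise)])
    printResults(res)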
Induced classifiers can be stored as well: with the flag storeClassifiers set, the testing procedure stores them in the ExperimentResults' field classifiers. The script below makes 100 repetitions of the 70:30 test and stores the classifiers it induces. After this, res.classifiers is a list of 100 items, and each item is a list with three classifiers (one per learner).
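A sketch of such a script, using proportionTest (described below) with storeClassifiers given as a keyword:

    res = orngTest.proportionTest(learners, data, 0.7, 100, storeClassifiers=1)
    # 100 repetitions, each with one classifier per learner
    print("%i repetitions, %i classifiers each"
          % (len(res.classifiers), len(res.classifiers[0])))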
Progress reports can be requested with the keyword argument verbose=1.

proportionTest puts the given proportion (learnProp) of examples in the learning set and the rest in the testing set. The test is repeated for a given number of times (10 by default). Division is stratified by default. The function also accepts keyword arguments for randomization and for storing classifiers.

One hundred repetitions of the so-called 70:30 test, in which 70% of the examples are used for training and 30% for testing, is done by a call along the lines of the sketch below.
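Assuming the positional signature proportionTest(learners, examples, learnProp, times):

    res = orngTest.proportionTest(learners, data, 0.7, 100)
    printResults(res)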
Note that Python allows naming the arguments; instead of "100" you can write "times=100" to increase clarity. (This is different for keyword arguments such as storeClassifiers, randseed or verbose, which must always be given with a name, as shown in the examples above.)
crossValidation performs a cross-validation with the given number of folds (10 by default) and is actually written as a single call to testWithIndices.
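For orientation, a typical call might look like the sketch below, assuming the folds keyword:

    res = orngTest.crossValidation(learners, data, folds=10)
    printResults(res)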
testWithIndices takes care that the TestedExamples are in the same order as the corresponding examples in the original set. Preprocessing of testing examples is thus not allowed. The computed results can be saved to files, or loaded from them, if you add the keyword argument cache=1. In this case, you also have to specify the random seed which was used to compute the indices (argument indicesrandseed); if you don't, there will be no caching. You can request progress reports with the keyword argument verbose=1.
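A sketch of calling testWithIndices directly, with fold indices (one 0-based fold index per example) prepared by orange.MakeRandomIndicesCV:

    indices = orange.MakeRandomIndicesCV(data, 10)
    res = orngTest.testWithIndices(learners, data, indices)
    printResults(res)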
Functions that compute learning curves do not return a single ExperimentResults but a list of them, one for each proportion. The basic variant prepares a random generator and example selectors (cv and pick, see below) and calls learningCurve.
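In orngTest this wrapper is called learningCurveN; assuming that name and the keywords folds and proportions, a sketch of its use and of printing the accuracies for each proportion could be:

    proportions = orange.frange(0.2)   # [0.2, 0.4, 0.6, 0.8, 1.0]
    results = orngTest.learningCurveN(learners, data, folds=10,
                                      proportions=proportions)
    for p, res in zip(proportions, results):
        print("%4.2f" % p)
        printResults(res)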
Arguments cv and pick give the methods for preparing indices for cross-validation and random selection of learning examples. If they are not given, orange.MakeRandomIndicesCV and orange.MakeRandomIndices2 are used; both will be stratified and the cross-validation will be 10-fold. Proportions is a list of proportions of learning examples.

The function can save time by loading existing experimental data for any tests that were already conducted and saved. Also, the computed results are stored for later use. You can enable this by adding the keyword argument cache=1. Another keyword deals with progress reports: if you add verbose=1, the function will print the proportion and the fold number.
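A sketch with explicitly constructed selectors, assuming that MakeRandomIndicesCV and MakeRandomIndices2 accept their parameters as constructor keywords and that learningCurve takes them through cv and pick:

    cv = orange.MakeRandomIndicesCV(folds=10)
    pick = orange.MakeRandomIndices2()
    results = orngTest.learningCurve(learners, data, cv=cv, pick=pick,
                                     proportions=orange.frange(0.2))
    for p, res in zip(orange.frange(0.2), results):
        print("%4.2f" % p)
        printResults(res)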
learningCurveWithTestData trains the learners on the given proportions of a separate learning set and tests them on the given testset. The whole test is repeated for the given number of times for each proportion. The result is a list of ExperimentResults, one for each proportion.

In the following script, examples are pre-divided into a training and a testing set. Learning curves are computed in which 20, 40, 60, 80 and 100 percent of the examples in the former set are used for learning, and the latter set is used for testing. Random selection of the given proportion of the learning set is repeated five times.
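A sketch of such a script; the 70:30 pre-division of the data is illustrative:

    indices = orange.MakeRandomIndices2(data, 0.7)
    train = data.select(indices, 0)
    test = data.select(indices, 1)

    proportions = orange.frange(0.2)   # 20, 40, 60, 80 and 100 percent
    results = orngTest.learningCurveWithTestData(
        learners, train, test, times=5, proportions=proportions)
    for p, res in zip(proportions, results):
        print("%4.2f" % p)
        printResults(res)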
learnAndTestOnTestData induces classifiers from the training examples and tests them on the supplied test examples. You can pass an already initialized ExperimentResults (argument results) and an iteration number (iterationNumber); the results of the test will then be appended with the given iteration number. This is needed because learnAndTestOnTestData gets called by other functions, like proportionTest and learningCurveWithTestData. If you omit the parameters, a new ExperimentResults will be created.
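Reusing train and test from the previous sketch, and assuming the positional signature learnAndTestOnTestData(learners, learnset, testset):

    res = orngTest.learnAndTestOnTestData(learners, train, test)
    printResults(res)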
As with learnAndTestOnTestData
, you can pass an already initialized ExperimentResults
(argument results
) and an iteration number to the function. In this case, results of the test will be appended with the given iteration number.
You can likewise pass an ExperimentResults and an iteration number, as in learnAndTestOnTestData (which actually calls testOnData). If you don't, a new ExperimentResults will be created.

Knowing the classes TestedExample, which stores the results of testing a single test example, and ExperimentResults, which stores a list of TestedExamples along with some other data on the experimental procedures and classifiers used, is important if you would like to write your own measures of model quality, compatible with the sampling infrastructure provided by Orange. If not, you can skip the remainder of this page.
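For illustration, a per-learner classification accuracy can be computed directly from these structures with a sketch like the one below; it assumes the TestedExamples are available in the attribute results described below, and it ignores example weights.

    def myCA(res):
        # fraction of correctly classified examples, one value per learner
        correct = [0.0] * res.numberOfLearners
        for tex in res.results:
            for i in range(res.numberOfLearners):
                # compare predicted and actual class as indices
                if int(tex.classes[i]) == int(tex.actualClass):
                    correct[i] += 1
        return [c / len(res.results) for c in correct]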
TestedExample stores the predictions made for a single test example.

Attributes

classes and probabilities: the predictions, one entry per classifier; classes holds the predicted class (a Value) and probabilities the predicted class distribution.

iterationNumber: the iteration (for instance, the fold or the repetition) in which the TestedExample was created/tested.

actualClass: the correct class of the example.

weight: the example's weight; it is present even if the examples were not weighted, in which case it equals 1.0.

Methods

The methods of TestedExample append or set the result (predicted class and probabilities) of a single classifier; they are called by the testing procedures.

ExperimentResults holds the results of the entire experiment.

Attributes

results: a list of TestedExample's, one for each example in the dataset.

classifiers: the induced classifiers, stored only if the experiment was run with storeClassifiers = 1. Each element corresponds to one iteration and is a list of classifiers, one per learner.

numberOfIterations: the number of iterations (for instance, folds or repetitions). Each TestedExample's attribute iterationNumber should be in range [0, numberOfIterations-1].

numberOfLearners: the number of learners. The lengths of the lists classes and probabilities in each TestedExample should equal numberOfLearners.

weights: a flag telling whether the results are weighted. If false, weights are still present in TestedExamples, but they are all 1.0. Clear this flag if your experimental procedure ran on weighted testing examples but you would like to ignore the weights in statistics.

Methods

If the weights flag is set when the object is constructed, the TestedExamples will be weighted.

The methods for saving the results to files and for loading them back take two arguments: lrn is a list of learners and filename is a template for the filename. The attribute loaded is initialized so that it contains 1's for the learners whose data was loaded and 0's for the learners which still need to be tested. The loading function returns 1 if all the files were found and loaded, and 0 otherwise.
The data is saved in a separate file for each classifier. The file is a binary pickle file containing a list of tuples
((x.actualClass, x.iterationNumber), (x.classes[i], x.probabilities[i]))
where x is a TestedExample
and i
is the index of a learner.
The file resides in the directory ./cache. Its name consists of a template, given by the caller. The filename should contain a %s, which is replaced by the name, shortDescription, description, func_doc or func_name attribute of the learner, whichever is found first in that order (this is extracted by orngMisc.getobjectname). If a learner has none of these attributes, its class name is used.
The filename should include enough data to make sure that it indeed
contains the right experimental results. The function
learningCurve
, for example, forms the name of the file
from a string "{learningCurve}
", the proportion of
learning examples, random seeds for cross-validation and learning set
selection, a list of preprocessors' names and a checksum for
examples. Of course you can outsmart this, but it should suffice in
most cases.
Two further methods manipulate results per learner. One removes the results of the i-th learner. The other adds the results of the index-th learner from another ExperimentResults, or uses them to replace the results of the learner with the index replace if replace is a valid index; it assumes that the added results came from an evaluation on the same data set using the same testing technique (the same number of iterations).