Example sampling is one of the basic procedures in machine learning. If nothing else, everybody needs to split a dataset into training and testing examples.
It is easy to select a subset of examples in Orange. The key idea is the use of indices: first construct a list of indices, one corresponding to each example. Then select examples by index: say, take all examples with index 3, or all examples with an index other than 3. This covers many typical setups, such as 70-30 splits or cross-validation.
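The idea of selecting by index can be illustrated without Orange at all. The following is a minimal sketch in plain Python, not Orange's implementation; the helper select_by_index and its negate parameter are hypothetical names chosen for this illustration.

```python
def select_by_index(examples, indices, value, negate=False):
    """Return the examples whose index equals `value` (or differs from it if `negate`)."""
    if len(examples) != len(indices):
        raise ValueError("need exactly one index per example")
    return [ex for ex, idx in zip(examples, indices) if (idx == value) != negate]


examples = ["a", "b", "c", "d", "e", "f"]
indices = [0, 3, 1, 3, 2, 0]   # one index per example

with_3 = select_by_index(examples, indices, 3)                  # index equal to 3
without_3 = select_by_index(examples, indices, 3, negate=True)  # index other than 3
print(with_3)      # ['b', 'd']
print(without_3)   # ['a', 'c', 'e', 'f']
```

The same list of indices can be reused for any number of selections, which is what makes it convenient for train/test splits and for cross-validation folds.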
Orange provides methods for making such selections, such as ExampleTable's select method. It also provides classes for constructing indices for different kinds of splits. For instance, for the most commonly used sampling method, cross-validation, Orange's class MakeRandomIndicesCV prepares a list of indices that assigns a fold to each example.
Classes that construct such indices are derived from the abstract base class MakeRandomIndices. Three classes are provided. MakeRandomIndices2 constructs a list of 0's and 1's in a prescribed proportion; it can be used, for instance, for 70-30 divisions into training and testing examples. The more general MakeRandomIndicesN constructs a list of indices from 0 to N-1 in given proportions. Finally, the most often used, MakeRandomIndicesCV, prepares indices for cross-validation.
Important change: random indices are more deterministic than in versions of Orange prior to September 2003. See the examples in the section about MakeRandomIndices2 for details.
Attributes
stratified
Defines whether the indices are stratified. Possible values are MakeRandomIndices.Stratified, MakeRandomIndices.NotStratified and MakeRandomIndices.StratifiedIfPossible. In the latter case, which is also the default, Orange will try to construct stratified indices, but will fall back to non-stratified ones if anything goes wrong. For stratified indices, it needs to see the example table (see the calling operator below), and the class should be discrete and have no unknown values.
randseed, randomGenerator
These two attributes determine how MakeRandomIndices generates random numbers.
If randomGenerator (of type orange.RandomGenerator) is set, it is used. The same random generator can be shared between different objects; this can be useful when constructing an experiment that depends on a single random seed. If you use this, MakeRandomIndices will return a different set of indices each time it is called, even with the same arguments.
If randomGenerator is not given but randseed is (positive values denote a defined randseed), the value is used to initialize a new, temporary local random generator. This way, the indices generator will always give the same indices for the same data.
If neither is given, a new generator is constructed on each call (unlike some other classes, such as Variable, Distribution and ExampleTable, which store such generators for future use; the generator constructed by MakeRandomIndices is disposed of after use) and initialized with random seed 0. This thus has the same effect as setting randseed to 0.
The example for MakeRandomIndices2 shows the difference between those options.
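The three rules for choosing the random source can be mimicked in plain Python. The sketch below is an illustration only: the class IndexMaker and its attributes rand_seed and random_generator are hypothetical stand-ins, Python's random.Random takes the place of orange.RandomGenerator, and the actual indices will not match those produced by Orange.

```python
import random


class IndexMaker:
    """Hypothetical stand-in that mimics the three random-source rules."""

    def __init__(self, p0=0.5, rand_seed=-1, random_generator=None):
        self.p0 = p0
        self.rand_seed = rand_seed
        self.random_generator = random_generator

    def _rng(self):
        if self.random_generator is not None:
            # Rule 1: use the given (possibly shared) generator; its state advances.
            return self.random_generator
        # Rule 2: a positive seed initializes a fresh, temporary generator.
        # Rule 3: with neither set, behave as if the seed were 0.
        seed = self.rand_seed if self.rand_seed > 0 else 0
        return random.Random(seed)

    def __call__(self, n):
        rng = self._rng()
        zeros = int(round(self.p0 * n))   # rounding is an assumption of this sketch
        indices = [0] * zeros + [1] * (n - zeros)
        rng.shuffle(indices)
        return indices


seeded = IndexMaker(rand_seed=42)
same_a, same_b = seeded(24), seeded(24)    # fresh generator per call: identical lists

shared = IndexMaker(random_generator=random.Random(42))
first, second = shared(24), shared(24)     # one generator whose state carries over

print(same_a == same_b)   # True
print(first == same_a)    # True: the first draw starts from the same seed
```

Here second is drawn from the generator's advanced state, so it will in general differ from first, while rerunning the whole script reproduces both lists.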
MakeRandomIndices can be called to return a list of indices. The argument can be either the desired length of the list (presumably corresponding to the length of some list of examples) or a set of examples, given as an ExampleTable or a plain Python list. In the former case, the indices cannot correspond to a stratified division; if stratified is set to MakeRandomIndices.Stratified, an exception is raised.
MakeRandomIndices2 prepares a list of 0's and 1's.
Attributes
p0
The proportion or the number of 0's. If p0 is less than 1, it is a proportion. For instance, if p0 is 0.2, 20% of the indices will be 0's and 80% will be 1's. If p0 is 1 or more, it gives the exact number of 0's. For instance, with p0 of 10, you will get a list with ten 0's, and the rest of the list will be 1's.
Say that you have loaded the lenses domain into data. We'll split it into two datasets, the first containing only 6 examples and the other containing the rest.
part of randomindices2.py (uses lenses.tab)
No surprises here. Let's now see what's with those random seeds and generators. First, we shall simply construct and print five lists of random indices.
part of randomindices2.py (uses lenses.tab)
We ran it five times and got the same result each time (this would not be so in older versions of Orange!). Unless there's something wrong with your port of Orange, you've got the same indices as above.
part of randomindices2.py (uses lenses.tab)
Now we have constructed a private random generator for random indices and got five different lists. If you run the whole script again, though, you'll get the same five sets, since the generator will be constructed again and will start generating numbers from the beginning. Again, you should get the same indices on any operating system.
part of randomindices2.py (uses lenses.tab)
Here we set the random seed and removed the random generator (otherwise the seed would have no effect, as the generator has priority). Each time we run the indices generator, it constructs a private random generator and initializes it with the given seed, and consequently always returns the same indices. As you have probably noticed, these indices are the same as those generated one example earlier, due to the same random seed.
Let's play with p0. There are 24 examples in the dataset. Setting p0 to 0.25 instead of 6 shouldn't alter the indices. Let's check it.
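The arithmetic behind this claim can be checked independently of Orange. The sketch below is plain Python; the function zeros_from_p0 is a hypothetical helper, and rounding to the nearest integer is an assumption, since the text does not say how Orange rounds a fractional count.

```python
def zeros_from_p0(p0, n_examples):
    """Number of 0's implied by p0: a proportion if below 1, an exact count otherwise."""
    if p0 < 1:
        return int(round(p0 * n_examples))   # assumed rounding rule
    return int(p0)


print(zeros_from_p0(0.25, 24))   # 6
print(zeros_from_p0(6, 24))      # 6
```

Both settings ask for six 0's among 24 indices, which is why they describe the same split.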
part of randomindices2.py (uses lenses.tab)
Finally, let's observe the effects of stratified. By default, indices are stratified if possible; in our case it is possible, so they are.
part of randomindices2.py (uses lenses.tab)
We explicitly requested stratification and got the same indices as before. That's OK. We also printed out the distribution for the whole dataset and for the selected dataset (as we gave no second parameter, the examples with non-zero indices got selected). They are not the same, but they are pretty close. MakeRandomIndices2 did what it could. Now let's try without stratification. The script is much the same, except for changing Stratified to NotStratified:
part of randomindices2.py (uses lenses.tab)
Different indices and ... just look at the distribution. It could be worse but, well, NotStratified doesn't mean that Orange will make an effort to get uneven distributions. It just won't mind them.
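To see why stratification keeps the class distributions close, here is a minimal sketch in plain Python. It is not Orange's algorithm: the function split_indices, its parameters and the made-up label list are all assumptions of this illustration, and the per-class rounding it uses may shift the total number of 0's slightly on other data.

```python
import random
from collections import Counter, defaultdict


def split_indices(labels, p0, stratified, seed=0):
    """Return one 0/1 index per label, with roughly a fraction p0 of 0's."""
    rng = random.Random(seed)
    indices = [1] * len(labels)
    if stratified:
        # Group positions by class, then take a share p0 of each class separately.
        groups = defaultdict(list)
        for pos, label in enumerate(labels):
            groups[label].append(pos)
        groups = list(groups.values())
    else:
        # Ignore the classes: one group holding every position.
        groups = [list(range(len(labels)))]
    for positions in groups:
        rng.shuffle(positions)
        for pos in positions[:int(round(p0 * len(positions)))]:
            indices[pos] = 0
    return indices


labels = ["a"] * 16 + ["b"] * 8            # made-up class labels
strat = split_indices(labels, 0.25, stratified=True)
plain = split_indices(labels, 0.25, stratified=False)

strat_dist = Counter(l for l, i in zip(labels, strat) if i == 0)
plain_dist = Counter(l for l, i in zip(labels, plain) if i == 0)
print(sorted(strat_dist.items()))   # [('a', 4), ('b', 2)]
print(sum(plain_dist.values()))     # 6
```

The stratified split always takes a quarter of each class, while the non-stratified one only guarantees the total of six; how those six are spread over the classes is left to chance.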
For a final test, you can set the class of one of the examples to unknown and rerun the last script, setting stratified once to Stratified and once to StratifiedIfPossible. In the first case you'll get an error, and in the second you'll get non-stratified indices.
MakeRandomIndicesN is a straightforward generalization of MakeRandomIndices2, so there's not much to be said about it.
Attributes
p
Defines the proportions of the indices. If p has a length of 3, the returned list will have four different indices; the first three will have the probabilities defined in p, while the last will have a probability of (1 - sum of elements of p).
MakeRandomIndicesN does not support stratification (yet); setting stratified to Stratified will yield an error.
Let us construct a list of indices that assigns half of the examples to the first set and a quarter each to the second and third.
part of randomindicesn.py (uses lenses.tab)
Count them and you'll see there are 12 zeros, 6 ones and 6 twos out of 24.
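The counts can be reproduced with a short plain-Python sketch. It only illustrates how a list of proportions with an implicit last share turns into indices; the function indices_n, its seed parameter and the rounding rule are assumptions of this sketch, not Orange's implementation.

```python
import random
from collections import Counter


def indices_n(n_examples, p, seed=0):
    """Indices 0..len(p); the last index takes the leftover share 1 - sum(p)."""
    counts = [int(round(share * n_examples)) for share in p]
    leftover = n_examples - sum(counts)
    if leftover < 0:
        raise ValueError("proportions in p sum to more than 1")
    counts.append(leftover)
    # Emit each index as many times as its count, then shuffle the order.
    indices = [idx for idx, count in enumerate(counts) for _ in range(count)]
    random.Random(seed).shuffle(indices)
    return indices


ind = indices_n(24, [0.5, 0.25])
print(sorted(Counter(ind).items()))   # [(0, 12), (1, 6), (2, 6)]
```

Only two proportions are given, yet three indices appear: the third receives the remaining quarter.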
MakeRandomIndicesCV computes indices for cross-validation.
Attributes
folds
The number of folds.
The object constructs a list of indices between 0 and folds-1 (inclusive), with an equal number of each (if the number of examples is not divisible by folds, the last folds will have one example less).
As an exercise, we shall first prepare indices for an ordinary ten-fold cross-validation.
part of randomindicescv.py (uses lenses.tab)
Since the examples don't divide evenly into ten folds, the first four folds have one example more: there are three 0's, 1's, 2's and 3's, but only two 4's, 5's, and so on.
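The fold sizes follow from simple arithmetic, which the plain-Python sketch below reproduces. The function cv_indices and its seed parameter are hypothetical, the sketch ignores stratification entirely, and the order of the indices will not match what Orange prints.

```python
import random
from collections import Counter


def cv_indices(n_examples, folds, seed=0):
    """Assign fold numbers 0..folds-1 as evenly as possible, in random order."""
    # Cycling through the folds gives the first (n_examples % folds) folds one extra.
    indices = [i % folds for i in range(n_examples)]
    random.Random(seed).shuffle(indices)
    return indices


fold_sizes = Counter(cv_indices(24, 10))
print([fold_sizes[f] for f in range(10)])   # [3, 3, 3, 3, 2, 2, 2, 2, 2, 2]
```

With 24 examples and ten folds, 24 = 2 * 10 + 4, so four folds get a third example and the remaining six get two.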
For a more even division, we shall have Orange prepare indices for 10 examples for 5-fold cross-validation. Instead of giving the examples, as usual, we shall pass only their number. This, of course, prevents stratification.
part of randomindicescv.py (uses lenses.tab)