Classes for Random Sampling

Example sampling is one of the basic procedures in machine learning. If nothing else, everybody needs to split a dataset into training and testing examples.

It is easy to select a subset of examples in Orange. The key idea is the use of indices: first construct a list of indices, one corresponding to each example. Then select examples by index: say, take all examples with index 3, or all examples with an index other than 3. This covers many typical setups, such as 70-30 splits or cross-validation.
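The idea of selecting by indices can be sketched in plain Python, without Orange. The helper name select_by_index and the toy data below are made up for illustration; in Orange itself the selection is done by the methods described in the following paragraphs.

```python
def select_by_index(examples, indices, value, negate=False):
    """Keep examples whose index equals `value` (or differs from it if negate is set)."""
    if len(examples) != len(indices):
        raise ValueError("need exactly one index per example")
    return [ex for ex, idx in zip(examples, indices) if (idx == value) != negate]


# Hypothetical toy data: six "examples" and one index for each of them.
examples = ["a", "b", "c", "d", "e", "f"]
indices = [0, 3, 1, 3, 2, 0]

with_3 = select_by_index(examples, indices, 3)                   # ['b', 'd']
without_3 = select_by_index(examples, indices, 3, negate=True)   # ['a', 'c', 'e', 'f']
```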

Orange provides methods for making such selections, such as ExampleTable's select method. And, of course, it provides methods for constructing indices for different kinds of splits. For instance, for the most commonly used sampling method, cross-validation, Orange's class MakeRandomIndicesCV prepares a list of indices that assign a fold to each example.

Classes that construct such indices are derived from a basic abstract class, MakeRandomIndices. There are three different classes provided. MakeRandomIndices2 constructs a list of 0's and 1's in a prescribed proportion; it can be used, for instance, for 70-30 divisions into training and testing examples. A more general MakeRandomIndicesN constructs a list of indices from 0 to N-1 in given proportions. Finally, the most often used MakeRandomIndicesCV prepares indices for cross-validation.

Important change: random indices are more deterministic than in versions of Orange prior to September 2003. See examples in the section about MakeRandomIndices2 for details.


MakeRandomIndices

Attributes

stratified
Defines whether the division should be stratified, that is, whether all subsets should have approximately equal class distributions. Possible values are MakeRandomIndices.Stratified, MakeRandomIndices.NotStratified and MakeRandomIndices.StratifiedIfPossible. In the last case, which is also the default, Orange will try to construct stratified indices, but fall back to non-stratified if anything goes wrong. For stratified indices, it needs to see the example table (see the calling operator below), and the class should be discrete and have no unknown values.
randseed, randomGenerator
These two fields deal with the way MakeRandomIndices generates random numbers.
  • If randomGenerator (of type orange.RandomGenerator) is set, it is used. The same random generator can be shared between different objects; this can be useful when constructing an experiment that depends on a single random seed. If you use this, MakeRandomIndices will return a different set of indices each time it's called, even with the same arguments.
  • If randomGenerator is not given, but randseed is (positive values denote a defined randseed), the value is used to initialize a new, temporary local random generator. This way, the indices generator will always give the same indices for the same data.
  • If neither is defined, a new random generator is constructed each time the object is called and initialized with random seed 0, which has the same effect as setting randseed to 0. Note that this is unlike some other classes, such as Variable, Distribution and ExampleTable, which store such generators for future use; the generator constructed by MakeRandomIndices is disposed of after use.
The example for MakeRandomIndices2 shows the difference between those options.
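The three rules can be mimicked with Python's standard random module. This is only a sketch of the described behaviour, not Orange code: the class IndexMakerSketch and its attribute names are hypothetical, and since Python's generator differs from Orange's, the lists it produces will not match those printed in this document.

```python
import random


class IndexMakerSketch(object):
    """Hypothetical stand-in that mimics the seeding rules above; not Orange code."""

    def __init__(self, zeros, randseed=0, random_generator=None):
        self.zeros = zeros
        self.randseed = randseed
        self.random_generator = random_generator

    def __call__(self, n_examples):
        if self.random_generator is not None:
            # A shared generator has priority and keeps its state between calls.
            rng = self.random_generator
        else:
            # Otherwise a temporary generator is built from randseed on every call.
            rng = random.Random(self.randseed)
        indices = [0] * self.zeros + [1] * (n_examples - self.zeros)
        rng.shuffle(indices)
        return indices


maker = IndexMakerSketch(zeros=6)
default_a, default_b = maker(24), maker(24)   # seed 0 both times: identical lists

maker.random_generator = random.Random(42)
shared_first = maker(24)
shared_second = maker(24)   # drawn from the advanced state, so it generally differs

maker.random_generator = None
maker.randseed = 42
seeded_a, seeded_b = maker(24), maker(24)     # fresh generator with seed 42 each time
```

Note that seeded_a equals shared_first: both come from a generator freshly seeded with 42.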

MakeRandomIndices can be called to return a list of indices. The argument can be either the desired length of the list (presumably corresponding to the length of some list of examples) or a set of examples, given as an ExampleTable or a plain Python list. In the former case, the indices cannot correspond to a stratified division, since no class values are available; if stratified is set to MakeRandomIndices.Stratified, an exception is raised.

MakeRandomIndices2

This object prepares a list of 0's and 1's.

Attributes

p0
The proportion or a number of 0's. If p0 is less than 1, it's a proportion. For instance, if p0 is 0.2, 20% of indices will be 0's and 80% will be 1's. If p0 is 1 or more, it gives the exact number of 0's. For instance, with p0 of 10, you will get a list with 10 0's and the rest of the list will be 1's.
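The rule for interpreting p0 can be written down as a small plain-Python function. The name zeros_count is made up, and the rounding used when p0 is a proportion is an assumption; the text above does not say how Orange rounds.

```python
def zeros_count(p0, n_examples):
    """Number of 0's among n_examples indices, following the rule for p0 described above."""
    if p0 < 1:
        # p0 is a proportion; rounding to the nearest integer is an assumption.
        return int(round(p0 * n_examples))
    # p0 of 1 or more is the exact number of 0's.
    return int(p0)


count_from_proportion = zeros_count(0.25, 24)   # 6, since 0.25 * 24 = 6
count_from_number = zeros_count(6, 24)          # 6
```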

Say that you have loaded the lenses domain into data. We'll split it into two datasets, the first containing only 6 examples and the other containing the rest.

part of randomindices2.py (uses lenses.tab)

>>> indices2 = orange.MakeRandomIndices2(p0=6)
>>> ind = indices2(data)
>>> print ind
<1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1>
>>> data0 = data.select(ind, 0)
>>> data1 = data.select(ind, 1)
>>> print len(data0), len(data1)
6 18

No surprises here. Let's now see what's with those random seeds and generators. First, we shall simply construct and print five lists of random indices.

part of randomindices2.py (uses lenses.tab)

>>> indices2 = orange.MakeRandomIndices2(p0=6)
>>> for i in range(5):
...     print indices2(data)
<1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1>
<1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1>
<1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1>
<1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1>
<1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1>

We ran it five times and got the same result each time (this would not be so in older versions of Orange!). Unless there's something wrong with your port of Orange, you should get the same indices as above.

part of randomindices2.py (uses lenses.tab)

>>> indices2.randomGenerator = orange.RandomGenerator(42)
>>> for i in range(5):
...     print indices2(data)
<0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0>
<0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0>
<0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1>
<1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1>
<1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0>

Now we constructed a private random generator for random indices and got five different lists. If you run the whole script again, though, you'll get the same five lists, since the generator will be constructed again and start generating numbers from the beginning. Again, you should get the same indices on any operating system.

part of randomindices2.py (uses lenses.tab)

>>> indices2.randseed = 42
>>> indices2.randomGenerator = None
>>> for i in range(5):
...     print indices2(data)
<0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0>
<0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0>
<0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0>
<0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0>
<0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0>

Here we set the random seed and removed the random generator (otherwise the seed would have no effect, as the generator has priority). Each time we run the indices generator, it constructs a private random generator and initializes it with the given seed, and consequently always returns the same indices. As you have probably noticed, these indices are the same as the first list generated in the previous example, due to the same random seed.

Let's play with p0. There are 24 examples in the dataset, so setting p0 to 0.25 instead of 6 shouldn't alter the indices (0.25 × 24 = 6). Let's check it.

part of randomindices2.py (uses lenses.tab)

>>> indices2.p0 = 0.25
>>> print indices2(data)
<0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0>

Finally, let's observe the effects of stratified. By default, indices are stratified if it's possible and, in our case, it is and they are.

part of randomindices2.py (uses lenses.tab)

>>> indices2.stratified = indices2.Stratified
>>> ind = indices2(data)
>>> print ind
<0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0>
>>> data2 = data.select(ind)
>>> od = orange.getClassDistribution(data)
>>> sd = orange.getClassDistribution(data2)
>>> od.normalize()
>>> sd.normalize()
>>> print od
<0.625, 0.208, 0.167>
>>> print sd
<0.611, 0.222, 0.167>

We explicitly requested stratification and got the same indices as before. That's OK. We also printed out the class distribution for the whole dataset and for the selected dataset (as we gave no second parameter, the examples with non-zero indices got selected). They are not the same, but they are pretty close. MakeRandomIndices2 did what it could. Now let's try without stratification. The script is much the same, except that Stratified is changed to NotStratified:

part of randomindices2.py (uses lenses.tab)

>>> indices2.stratified = indices2.NotStratified
>>> ind = indices2(data)
>>> print ind
<0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1>
>>> data2 = data.select(ind)
>>> sd = orange.getClassDistribution(data2)
>>> sd.normalize()
>>> print od
<0.625, 0.208, 0.167>
>>> print sd
<0.556, 0.278, 0.167>

Different indices and ... just look at the distribution. Could be worse but, well, NotStratified doesn't mean that Orange will make an effort to get uneven distributions. It just won't mind about them.
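To see why a stratified split keeps the class proportions close to the original, here is a minimal plain-Python sketch that picks the 0's separately within each class. It is not Orange's algorithm, and the function name stratified_indices2 is made up. The class counts 15, 5 and 4 are inferred from the printed distribution <0.625, 0.208, 0.167> over 24 examples, and the labels "a", "b" and "c" are placeholders. Rounding within each class is an assumption and need not always give exactly the requested total.

```python
import random
from collections import Counter, defaultdict


def stratified_indices2(classes, p0, seed=0):
    """Give index 0 to roughly a fraction p0 of each class and index 1 to the rest."""
    rng = random.Random(seed)
    positions_by_class = defaultdict(list)
    for position, label in enumerate(classes):
        positions_by_class[label].append(position)

    indices = [1] * len(classes)
    for positions in positions_by_class.values():
        rng.shuffle(positions)
        n_zeros = int(round(p0 * len(positions)))  # per-class rounding is an assumption
        for position in positions[:n_zeros]:
            indices[position] = 0
    return indices


classes = ["a"] * 15 + ["b"] * 5 + ["c"] * 4   # counts inferred from the distribution
ind = stratified_indices2(classes, 0.25)
selected = Counter(label for label, idx in zip(classes, ind) if idx != 0)
# selected holds 11 "a", 4 "b" and 3 "c": 11/18 = 0.611, 4/18 = 0.222, 3/18 = 0.167
```

With these counts the sketch gives the same proportions as the stratified run above, although which examples get selected depends on the generator.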

For a final test, you can set the class of one of the examples to unknown and rerun the last script, setting stratified once to Stratified and once to StratifiedIfPossible. In the first case you'll get an error, and in the second you'll get non-stratified indices.

MakeRandomIndicesN

MakeRandomIndicesN is a straightforward generalization of MakeRandomIndices2, so there's not much to be said about it.

Attributes

p
A list of proportions of examples that go to each fold. If p has a length of 3, the returned list will have four different indices: the first three will have probabilities as defined in p, while the last will have a probability of (1 - sum of elements of p).
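How p translates into subset sizes can be sketched in plain Python. The function name subset_sizes is made up, and rounding each share to the nearest integer is an assumption about how proportions become counts.

```python
def subset_sizes(p, n_examples):
    """Sizes of the len(p) + 1 subsets for the proportions given in p."""
    if sum(p) > 1:
        raise ValueError("proportions in p must not sum to more than 1")
    # Rounding each share to the nearest integer is an assumption.
    sizes = [int(round(share * n_examples)) for share in p]
    # The implicit last subset gets whatever is left: 1 - sum(p) of the examples.
    sizes.append(n_examples - sum(sizes))
    return sizes


sizes = subset_sizes([0.5, 0.25], 24)   # [12, 6, 6]
```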

MakeRandomIndicesN does not support stratification (yet); setting stratified to Stratified will yield an error.

Let us construct a list of indices that would assign half of the examples to the first set and a quarter each to the second and third.

part of randomindicesn.py (uses lenses.tab)

>>> indicesn = orange.MakeRandomIndicesN(p=[0.5, 0.25])
>>> ind = indicesn(data)
>>> print ind
<1, 2, 1, 0, 2, 2, 1, 0, 2, 0, 2, 1, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 1, 0>

Count them and you'll see there are 12 zeros, 6 ones and 6 twos out of 24.
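If you'd rather not count by hand, the tally can be checked with Python's collections.Counter. The list below is copied from the output printed above; nothing here needs Orange.

```python
from collections import Counter

# Indices copied from the printed output above.
ind = [1, 2, 1, 0, 2, 2, 1, 0, 2, 0, 2, 1, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 1, 0]

counts = Counter(ind)
tally = sorted(counts.items())   # [(0, 12), (1, 6), (2, 6)]
```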

MakeRandomIndicesCV

MakeRandomIndicesCV computes indices for cross-validation.

Attributes

folds
Number of folds. Default is 10.

The object constructs a list of indices between 0 and folds-1 (inclusive), with an equal number of each (if the number of examples is not divisible by folds, the last folds will have one example less).
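The fold sizes follow from integer division, as this plain-Python sketch shows. The function name fold_sizes is made up; the code only computes how many examples land in each fold, not which ones.

```python
def fold_sizes(n_examples, folds=10):
    """Number of examples in each fold; the last folds get one less if needed."""
    base, extra = divmod(n_examples, folds)
    # The first `extra` folds take one example more than the remaining ones.
    return [base + 1 if fold < extra else base for fold in range(folds)]


sizes_24 = fold_sizes(24)            # [3, 3, 3, 3, 2, 2, 2, 2, 2, 2]
sizes_10 = fold_sizes(10, folds=5)   # [2, 2, 2, 2, 2]
```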

As an exercise, we shall first prepare indices for an ordinary ten-fold cross-validation.

part of randomindicescv.py (uses lenses.tab)

>>> print orange.MakeRandomIndicesCV(data)
<0, 8, 1, 3, 6, 9, 4, 2, 4, 6, 7, 1, 9, 7, 2, 3, 0, 5, 8, 0, 1, 5, 3, 2>

Since examples don't divide evenly into ten folds, the first four folds have one example more - there are three 0's, 1's, 2's and 3's, but only two 4's, 5's...

For a more even division, let's have Orange prepare indices for 10 examples and 5-fold cross-validation. Instead of giving the examples, as usual, we shall only pass the number of them. This, of course, prevents stratification.

part of randomindicescv.py (uses lenses.tab)

>>> print orange.MakeRandomIndicesCV(10, folds=5)
<2, 1, 3, 3, 0, 2, 0, 4, 1, 4>