orngEnsemble: Orange Bagging and Boosting Module

Module orngEnsemble implements Breiman's bagging and Random Forest, and Freund and Schapire's boosting algorithms.

BaggedLearner

BaggedLearner takes a learner and returns a bagged learner, which is essentially a wrapper around the learner passed as an argument. If examples are passed as well, BaggedLearner returns a bagged classifier instead. Both learner and classifier then behave just like any other learner and classifier in Orange.
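For illustration, both ways of using it might look like this (a minimal sketch; the tree base learner and the iris data are our choices for the example, not requirements of the module):

import orange, orngTree, orngEnsemble

tree = orngTree.TreeLearner()
data = orange.ExampleTable("iris.tab")

# without examples: a learner, to be trained later
baggedLearner = orngEnsemble.BaggedLearner(tree, t=10)
baggedClassifier = baggedLearner(data)

# with examples: a BaggedClassifier is returned directly
baggedClassifier = orngEnsemble.BaggedLearner(tree, examples=data)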

Attributes

learner
A learner to be bagged.
examples
If examples are passed to BaggedLearner, it returns a BaggedClassifier, that is, it creates t classifiers using the learner and bootstrap samples of the examples, as appropriate for bagging (default: None).
t
Number of bagged classifiers, that is, classifiers created when examples are passed to the bagged learner (default: 10).
name
The name of the learner (default: Bagging).

Bagging, in essence, takes training data and a learner, and builds t classifiers, each time presenting the learner with a bootstrap sample from the training data. When given a test example, the classifiers vote on its class, and the bagged classifier returns the class with the highest number of votes. As implemented in Orange, when class probabilities are requested, these are proportional to the number of votes for each class.
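The procedure can be sketched in a few lines of plain Python (hypothetical helper functions for illustration only, not the module's actual implementation):

import random, orange

def bagClassifiers(learner, data, t=10):
    # build t classifiers, each from a bootstrap sample of the data
    n = len(data)
    classifiers = []
    for i in range(t):
        sample = orange.ExampleTable(data.domain,
            [data[random.randrange(n)] for j in range(n)])
        classifiers.append(learner(sample))
    return classifiers

def majorityVote(classifiers, example):
    # each classifier casts one vote; the class with most votes wins
    votes = {}
    for c in classifiers:
        prediction = str(c(example))
        votes[prediction] = votes.get(prediction, 0) + 1
    return max(votes, key=votes.get)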

Example

See BoostedLearner example.

BoostedLearner

Instead of drawing a series of bootstrap samples from the training set, boosting maintains a weight for each instance. When a classifier is trained on the training set, the weights of misclassified instances are increased. Just like in the bagged learner, the class is decided by voting of classifiers, but in boosting the votes are weighted by the accuracy each classifier obtained on the training set.

BoostedLearner is an implementation of AdaBoost.M1 (Freund and Schapire, 1996). From the user's viewpoint, the use of BoostedLearner is similar to that of BaggedLearner. The learner passed as an argument needs to be able to deal with example weights.
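At the core of each boosting round is the weight update; a simplified sketch (with hypothetical names, not the module's actual code) could read:

def boostingRound(classifier, examples, weights):
    # epsilon: weighted error of this round's classifier
    # (edge cases, e.g. epsilon == 0 or epsilon >= 0.5, are ignored here)
    epsilon = sum(w for ex, w in zip(examples, weights)
                  if classifier(ex) != ex.getclass())
    beta = epsilon / (1.0 - epsilon)  # < 1 when better than chance
    # down-weight correctly classified examples and renormalize, which
    # effectively increases the weights of misclassified ones
    newWeights = [w * beta if classifier(ex) == ex.getclass() else w
                  for ex, w in zip(examples, weights)]
    total = sum(newWeights)
    newWeights = [w / total for w in newWeights]
    # this classifier's vote is later weighted by log(1 / beta)
    return newWeights, beta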

Attributes

learner
A learner to be boosted.
examples
If examples are passed to BoostedLearner, it returns a BoostedClassifier, that is, it creates t classifiers using the learner and a weighted example set, as appropriate for AdaBoost.M1 (default: None).
t
Number of boosted classifiers created from the example set (default: 10).
name
The name of the learner (default: AdaBoost.M1).

Example

Let us try boosting and bagging on the lymphography data set, using TreeLearner with post-pruning as a base learner. For testing, we use 10-fold cross validation and observe classification accuracy.

ensemble.py (uses lymphography.tab)

import orange, orngEnsemble, orngTree
import orngTest, orngStat

tree = orngTree.TreeLearner(mForPruning=2, name="tree")
bs = orngEnsemble.BoostedLearner(tree, name="boosted tree")
bg = orngEnsemble.BaggedLearner(tree, name="bagged tree")

data = orange.ExampleTable("lymphography.tab")

learners = [tree, bs, bg]
results = orngTest.crossValidation(learners, data)
print "Classification Accuracy:"
for i in range(len(learners)):
    print "%15s: %5.3f" % (learners[i].name, orngStat.CA(results)[i])

Running this script, we may get something like:

Classification Accuracy:
           tree: 0.769
   boosted tree: 0.782
    bagged tree: 0.783

RandomForestLearner

Just like in bagging, classifiers in random forests are trained on bootstrap samples of the training data. Here, the classifiers are trees, but to increase randomness they are built so that at each node the best attribute is chosen from a random subset of the attributes in the training set. We closely follow the original algorithm (Breiman, 2001) both in implementation and parameter defaults.

The learner is encapsulated in the class RandomForestLearner.

Attributes

examples
If these are passed, the call returns a RandomForestClassifier, that is, it creates the required set of decision trees which, when presented with an example, vote for the predicted class.
trees
Number of trees in the forest (default: 100).
learner
Although not required, one can use this argument to pass one's own tree induction algorithm. If none is passed, RandomForestLearner will use Orange's tree induction algorithm such that nodes with fewer than 5 examples will not be considered for (further) splitting (default: None).
attributes
Number of attributes in a randomly drawn subset from which the best attribute is chosen when splitting a node during tree growing (default: None; if left so, it is set to the square root of the number of attributes in the training set when the data is presented to the learner).
rand
Random generator used in bootstrap sampling. If none is passed, Python's Random from the random library is used, with the seed initialized to 0.
callback
A function to be called after every iteration of classifier induction. It is called with a single argument (from 0.0 to 1.0) that estimates learning progress (for an example of use, see the sketch after this list).
name
The name of the learner (default: Random Forest).
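For instance, the callback can be used to report progress while the forest is grown (a minimal sketch; the reporting function and the bupa data are our own choices):

import orange, orngEnsemble

def reportProgress(p):
    # p runs from 0.0 to 1.0 as trees are induced
    print "%d%% done" % int(100 * p)

data = orange.ExampleTable("bupa.tab")
forestLearner = orngEnsemble.RandomForestLearner(trees=50, callback=reportProgress)
forest = forestLearner(data)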

A note on voting. The random forest classifier uses the decision trees induced from bootstrapped training sets to vote on the class of the presented example; the most frequent vote is returned. However, in our implementation, if class probabilities are requested from the classifier, it returns the averaged probabilities from each of the trees.
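The averaging can be sketched as follows (an illustration of the idea only; it assumes the trees are reachable through the classifier's classifiers attribute, as used in ensemble3.py below):

import orange

def averagedProbabilities(trees, example):
    # average the class probability distributions over all trees
    nClasses = len(example.domain.classVar.values)
    avg = [0.0] * nClasses
    for tree in trees:
        dist = tree(example, orange.GetProbabilities)
        for i in range(nClasses):
            avg[i] += dist[i] / len(trees)
    return avg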

Examples

The following script assembles a random forest learner and compares it to a tree learner on the liver disorders (bupa) data set.

ensemble2.py (uses bupa.tab)

import orange, orngTree, orngEnsemble
import orngTest, orngStat

data = orange.ExampleTable('bupa.tab')
forest = orngEnsemble.RandomForestLearner(trees=50, name="forest")
tree = orngTree.TreeLearner(minExamples=2, mForPruning=2, \
                            sameMajorityPruning=True, name='tree')
learners = [tree, forest]

results = orngTest.crossValidation(learners, data, folds=10)
print "Learner  CA    Brier  AUC"
for i in range(len(learners)):
    print "%-8s %5.3f %5.3f %5.3f" % (learners[i].name, \
        orngStat.CA(results)[i], orngStat.BrierScore(results)[i],
        orngStat.AUC(results)[i])

Notice that our forest contains 50 trees. Learners are compared through 10-fold cross validation, and results are reported as classification accuracy, Brier score and area under the ROC curve:

Learner  CA    Brier  AUC
tree     0.664 0.673 0.653
forest   0.710 0.373 0.777

Perhaps the sole purpose of the following example is to show how to access the individual classifiers once they are assembled into the forest, and how to assemble a tree learner for use in random forests. The tree induction uses an attribute subset split constructor, borrowed from orngEnsemble, which is set up to select the best attribute for decision nodes from three randomly chosen attributes.

ensemble3.py (uses bupa.tab)

import orange, orngTree, orngEnsemble

data = orange.ExampleTable('bupa.tab')

tree = orngTree.TreeLearner(storeNodeClassifier=0, storeContingencies=0, \
    storeDistributions=1, minExamples=5).instance()
gini = orange.MeasureAttribute_gini()
tree.split.discreteSplitConstructor.measure = \
    tree.split.continuousSplitConstructor.measure = gini
tree.maxDepth = 5
tree.split = orngEnsemble.SplitConstructor_AttributeSubset(tree.split, 3)

forestLearner = orngEnsemble.RandomForestLearner(learner=tree, trees=50)
forest = forestLearner(data)

for c in forest.classifiers:
    print orngTree.countNodes(c),
print

Running the above code reports the sizes (numbers of nodes) of the trees in the constructed random forest.

MeasureAttribute_randomForests

L. Breiman (2001) suggested the possibility of using random forests as a non-myopic measure of attribute importance.

Assessing the relevance of attributes with random forests is based on the idea that randomly changing the value of an important attribute greatly affects an example's classification, while changing the value of an unimportant attribute doesn't affect it much. The implemented algorithm accumulates attribute scores over a given number of trees. The importance of an attribute for a single tree is computed as the number of correctly classified out-of-bag (OOB) examples minus the number of correctly classified OOB examples when the attribute's values are randomly shuffled. The accumulated attribute scores are divided by the number of trees used and multiplied by 100 before they are returned.
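For a single tree, the score could be sketched like this (an illustration with hypothetical names, not the module's code; oobExamples stands for the examples left out of the tree's bootstrap sample):

import random, orange

def treeImportance(tree, oobExamples, attribute):
    # correctly classified OOB examples, original attribute values
    correct = sum(1 for ex in oobExamples if tree(ex) == ex.getclass())
    # randomly shuffle the attribute's values among the OOB examples
    shuffled = [ex[attribute] for ex in oobExamples]
    random.shuffle(shuffled)
    shuffledCorrect = 0
    for ex, value in zip(oobExamples, shuffled):
        copy = orange.Example(ex)  # work on a copy of the example
        copy[attribute] = value
        if tree(copy) == ex.getclass():
            shuffledCorrect += 1
    return correct - shuffledCorrect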

Attributes

trees
Number of trees in the forest (default: 100).
learner
Although not required, one can use this argument to pass one's own tree induction algorithm. If none is passed, MeasureAttribute_randomForests will use Orange's tree induction algorithm such that nodes with fewer than 5 examples will not be considered for (further) splitting (default: None).
attributes
Number of attributes in a randomly drawn subset from which the best attribute is chosen when splitting a node during tree growing (default: None; if left so, it is set to the square root of the number of attributes in the example set).
rand
Random generator used in bootstrap sampling. If none is passed, Python's Random from the random library is used, with the seed initialized to 0.

Computation of attribute importance with random forests is rather slow, and importances for all attributes need to be computed simultaneously. Since we normally compute attribute importance with random forests for all attributes in the dataset, MeasureAttribute_randomForests caches the results. When it is called to compute the quality of a certain attribute, it computes the qualities of all attributes in the dataset. When called again, it uses the stored results if the domain is still the same and the example table has not changed (this is done by checking the example table's version and is not foolproof; it won't detect changes to the values of existing examples, but will notice adding and removing of examples; see the page on ExampleTable for details).

Caching will only have an effect if you use the same instance for all attributes in the domain.
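A sketch of what this means in practice (the iris data is our choice for illustration):

import orange, orngEnsemble

data = orange.ExampleTable("iris.tab")
measure = orngEnsemble.MeasureAttribute_randomForests(trees=100)

# the first call grows the forest and caches scores for all attributes
imp0 = measure(0, data)
# further calls on the same, unchanged data are answered from the cache
imp1 = measure(1, data)

# a fresh instance would grow a new forest from scratch
another = orngEnsemble.MeasureAttribute_randomForests(trees=100)
impAgain = another(0, data)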

Example

The following script demonstrates measuring attribute importance with random forests.

ensemble4.py (uses iris.tab)

import orange, orngEnsemble, random

data = orange.ExampleTable("iris.tab")
measure = orngEnsemble.MeasureAttribute_randomForests(trees=100)

# call by attribute index
imp0 = measure(0, data)
# call by orange.Variable
imp1 = measure(data.domain.attributes[1], data)
print "first: %0.2f, second: %0.2f\n" % (imp0, imp1)

print "different random seed"
measure = orngEnsemble.MeasureAttribute_randomForests(trees=100, \
    rand=random.Random(10))
imp0 = measure(0, data)
imp1 = measure(data.domain.attributes[1], data)
print "first: %0.2f, second: %0.2f\n" % (imp0, imp1)

print "All importances:"
imps = measure.importances(data)
for i, imp in enumerate(imps):
    print "%15s: %6.2f" % (data.domain.attributes[i].name, imp)

Corresponding output:

first: 0.32, second: 0.04

different random seed
first: 0.33, second: 0.14

All importances:
   sepal length:   0.33
    sepal width:   0.14
   petal length:  15.16
    petal width:  48.59

References

L Breiman. Bagging Predictors. Technical Report No. 421, University of California, Berkeley, 1994.

Y Freund, RE Schapire. Experiments with a New Boosting Algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference (ICML'96), 1996.

JR Quinlan. Boosting, bagging, and C4.5. In Proc. of 13th National Conference on Artificial Intelligence (AAAI'96), pp. 725-730, 1996.

L Breiman. Random Forests. Machine Learning, 45, 5-32, 2001.

M Robnik-Sikonja. Improving Random Forests. In Proc. of European Conference on Machine Learning (ECML 2004), pp. 359-370, 2004.