Module orngEnsemble implements Breiman's bagging and Random Forest, and Freund and Schapire's boosting algorithms.
BaggedLearner

BaggedLearner takes a learner and returns a bagged learner, which is essentially a wrapper around the learner passed as an argument. If examples are passed in arguments, BaggedLearner returns a bagged classifier instead. Both the learner and the classifier then behave just like any other learner and classifier in Orange.
Attributes
examples: if examples are passed to BaggedLearner, this returns a BaggedClassifier, that is, creates t classifiers using the learner and a subset of examples, as appropriate for bagging (default: None).

Bagging, in essence, takes training data and a learner, and builds t classifiers, each time presenting the learner with a bootstrap sample from the training data. When given a test example, the classifiers vote on the class, and the bagged classifier returns the class with the highest number of votes. As implemented in Orange, when class probabilities are requested, these are proportional to the number of votes for a particular class.
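For illustration, a bagged tree learner could be constructed and used like any other Orange learner; a minimal sketch (the pruning parameter and the number of replicates are illustrative):

    import orange, orngTree, orngEnsemble

    data = orange.ExampleTable("iris")

    # Wrap a tree learner into a bagged learner with t=10 bootstrap replicates.
    tree = orngTree.TreeLearner(mForPruning=2)
    bagged_learner = orngEnsemble.BaggedLearner(tree, t=10)

    # Calling the bagged learner with examples returns a bagged classifier.
    bagged_classifier = bagged_learner(data)
    print(bagged_classifier(data[0]))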
BoostedLearner

Instead of drawing a series of bootstrap samples from the training set, boosting maintains a weight for each instance. When a classifier is trained on the training set, the weights of misclassified instances are increased. Just as in bagging, the class is decided by a vote of the classifiers, but in boosting the votes are weighted by the accuracy obtained on the training set.
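The weight bookkeeping of AdaBoost.M1 can be sketched as follows. This is only an illustration of the procedure described above, not the module's implementation; the train function is a hypothetical stand-in for a weight-aware learner, and examples is assumed to be a list of (features, class) pairs:

    import math

    # Hypothetical weight-aware learner: predicts the class with the largest
    # total weight, which is enough to illustrate the weight updates below.
    def train(examples, weights):
        totals = {}
        for (x, y), w in zip(examples, weights):
            totals[y] = totals.get(y, 0.0) + w
        majority = max(totals, key=totals.get)
        return lambda x: majority

    def ada_boost_m1(examples, t=10):
        n = len(examples)
        weights = [1.0 / n] * n
        ensemble = []                          # (classifier, vote weight) pairs
        for _ in range(t):
            classifier = train(examples, weights)
            # Weighted training error of the new classifier.
            err = sum(w for (x, y), w in zip(examples, weights) if classifier(x) != y)
            if err == 0.0 or err >= 0.5:
                break
            beta = err / (1.0 - err)
            # Shrink the weights of correctly classified instances and renormalize,
            # which relatively increases the weights of the misclassified ones.
            weights = [w * (beta if classifier(x) == y else 1.0)
                       for (x, y), w in zip(examples, weights)]
            total = sum(weights)
            weights = [w / total for w in weights]
            # More accurate classifiers get a larger say in the final vote.
            ensemble.append((classifier, math.log(1.0 / beta)))
        return ensemble

    def predict(ensemble, x):
        votes = {}
        for classifier, vote_weight in ensemble:
            y = classifier(x)
            votes[y] = votes.get(y, 0.0) + vote_weight
        return max(votes, key=votes.get)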
The interface of BoostedLearner is similar to that of BaggedLearner. The learner passed as an argument needs to be able to deal with example weights.
Attributes
examples: if examples are passed to BoostedLearner, this returns a BoostedClassifier, that is, creates t classifiers using the learner and a subset of examples, as appropriate for AdaBoost.M1 (default: None).

Let us try boosting and bagging on the Iris data set, using TreeLearner with post-pruning as the base learner. For testing, we use 10-fold cross validation and observe classification accuracy.
ensemble.py (uses iris.tab)
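A sketch of such a script, assuming the usual orngTest and orngStat helpers for cross validation and scoring (the actual ensemble.py may differ in details):

    import orange, orngTree, orngEnsemble, orngTest, orngStat

    # Base learner: a classification tree with post-pruning (m-error estimate).
    tree = orngTree.TreeLearner(mForPruning=2, name="tree")

    # Wrap the same base learner into bagged and boosted ensembles.
    learners = [tree,
                orngEnsemble.BaggedLearner(tree, t=10, name="bagging"),
                orngEnsemble.BoostedLearner(tree, t=10, name="boosting")]

    data = orange.ExampleTable("iris")
    results = orngTest.crossValidation(learners, data, folds=10)
    for i in range(len(learners)):
        print("%-10s %.3f" % (learners[i].name, orngStat.CA(results)[i]))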
Running this script, we may get something like:
RandomForestLearner

Just like in bagging, the classifiers in a random forest are trained on bootstrap samples of the training data. Here the classifiers are trees, but to increase randomness they are built in such a way that at each node the best attribute is chosen from a random subset of the attributes. The implementation closely follows the original algorithm (Breiman, 2001), both in procedure and in parameter defaults.
The learner is encapsulated in the class RandomForestLearner.
Attributes
examples: if examples are passed to RandomForestLearner, this returns a RandomForestClassifier, that is, creates the required set of decision trees, which, when presented with an example, vote for the predicted class.
learner: the learner used to grow the individual trees; if None, RandomForestLearner will use Orange's tree induction algorithm such that in induction, nodes with fewer than 5 examples will not be considered for (further) splitting (default: None).

A note on voting. The random forest classifier uses the decision trees induced from bootstrap samples of the training set to vote on the class of a presented example, and the most frequent vote is returned. However, in our implementation, if class probabilities are requested from the classifier, it returns the probabilities averaged over all of the trees.
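For example, once a forest is built, both the voted class and the averaged probabilities can be obtained in the usual Orange way; a brief sketch (the number of trees is illustrative):

    import orange, orngEnsemble

    data = orange.ExampleTable("iris")
    forest_learner = orngEnsemble.RandomForestLearner(trees=20)
    forest = forest_learner(data)

    # The probabilities are the average of the per-tree class probabilities.
    value, probabilities = forest(data[0], orange.GetBoth)
    print("%s %s" % (value, probabilities))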
The following script assembles a random forest learner and compares it to a tree learner on a liver disorder (bupa) data set.
ensemble2.py (uses bupa.tab)
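A sketch of what such a comparison might look like, again assuming the orngTest and orngStat helpers (the actual ensemble2.py may differ in details):

    import orange, orngTree, orngEnsemble, orngTest, orngStat

    data = orange.ExampleTable("bupa")
    forest = orngEnsemble.RandomForestLearner(trees=50, name="forest")
    tree = orngTree.TreeLearner(mForPruning=2, name="tree")

    learners = [tree, forest]
    results = orngTest.crossValidation(learners, data, folds=10)

    print("Learner   CA     Brier  AUC")
    for i in range(len(learners)):
        print("%-8s  %5.3f  %5.3f  %5.3f" % (learners[i].name,
                                             orngStat.CA(results)[i],
                                             orngStat.BrierScore(results)[i],
                                             orngStat.AUC(results)[i]))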
Notice that our forest contains 50 trees. The learners are compared through 10-fold cross validation, and the results are reported as classification accuracy, Brier score and area under the ROC curve:
Perhaps the sole purpose of the following example is to show how to access the individual classifiers once they are assembled into the forest, and to show how to assemble a tree learner to be used in random forests. The tree induction uses an attribute subset split constructor, which we have borrowed from orngEnsemble and from which we have requested that the best attribute for decision nodes be selected from three randomly chosen attributes.
ensemble3.py (uses bupa.tab)
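One possible shape for such a script, under the assumption that the subset split constructor is exposed as orngEnsemble.SplitConstructor_AttributeSubset and that the forest classifier keeps its trees in a classifiers attribute (the actual ensemble3.py may differ in details):

    import orange, orngTree, orngEnsemble

    data = orange.ExampleTable("bupa")

    # Build a plain Orange tree learner and replace its split constructor with
    # the attribute subset variant: at each node the best attribute is picked
    # from 3 randomly chosen candidates.
    tree = orngTree.TreeLearner(minExamples=5).instance()
    tree.split = orngEnsemble.SplitConstructor_AttributeSubset(tree.split, 3)

    forest_learner = orngEnsemble.RandomForestLearner(learner=tree, trees=50)
    forest = forest_learner(data)

    # Access the individual trees of the forest and report their sizes.
    for classifier in forest.classifiers:
        print(orngTree.countNodes(classifier))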
Running the above code reports the sizes (number of nodes) of the trees in the constructed random forest.
MeasureAttribute_randomForests

L. Breiman (2001) suggested the possibility of using random forests as a non-myopic measure of attribute importance.

Assessing the relevance of attributes with random forests is based on the idea that randomly changing the value of an important attribute greatly affects an example's classification, while changing the value of an unimportant attribute doesn't affect it much. The implemented algorithm accumulates attribute scores over a given number of trees. The importance of an attribute for a single tree is computed as the number of correctly classified out-of-bag (OOB) examples minus the number of correctly classified OOB examples when the attribute's values are randomly shuffled. The accumulated attribute scores are divided by the number of trees used and multiplied by 100 before they are returned.
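The way the per-tree scores are combined can be spelled out as a short illustration (the per-tree counts below are made-up numbers, not output of the module):

    # Correctly classified OOB examples per tree, without and with shuffling
    # the values of the attribute under consideration (hypothetical counts).
    correct = [31, 28, 30]
    correct_shuffled = [22, 20, 25]

    n_trees = len(correct)
    importance = 100.0 * sum(c - cs for c, cs in zip(correct, correct_shuffled)) / n_trees
    print(importance)   # larger values indicate a more important attribute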
Attributes
learner: the learner used to grow the trees; if None, MeasureAttribute_randomForests will use Orange's tree induction algorithm such that in induction, nodes with fewer than 5 examples will not be considered for (further) splitting (default: None).

Computation of attribute importance with random forests is rather slow. Also, importances for all attributes need to be considered simultaneously. Since we normally compute attribute importance with random forests for all attributes in the dataset, MeasureAttribute_randomForests caches the results. When it is called to compute the quality of a certain attribute, it computes the qualities for all attributes in the dataset. When called again, it uses the stored results if the domain is still the same and the example table has not changed (this is done by checking the example table's version and is not foolproof; it won't detect if you change values of existing examples, but will notice adding and removing examples; see the page on ExampleTable for details).
Caching will only have an effect if you use the same instance for all attributes in the domain.
The following script demonstrates measuring attribute importance with random forests.
ensemble4.py (uses iris.tab)
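A sketch of such a script, assuming the measure is called like other Orange attribute quality measures, with an attribute (or its index) and the data (the actual ensemble4.py may differ in details):

    import orange, orngEnsemble

    data = orange.ExampleTable("iris")

    # Use a single measure instance for all attributes so that caching takes effect.
    measure = orngEnsemble.MeasureAttribute_randomForests(trees=100)

    for attribute in data.domain.attributes:
        print("%-20s %.2f" % (attribute.name, measure(attribute, data)))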
Corresponding output:
References

L Breiman. Bagging Predictors. Technical report No. 421, University of California, Berkeley, 1994.
Y Freund, RE Schapire. Experiments with a New Boosting Algorithm. Machine Learning: Proceedings of the Thirteenth International Conference (ICML'96), 1996.
JR Quinlan. Boosting, bagging, and C4.5. In Proc. of the 13th National Conference on Artificial Intelligence (AAAI'96), pp. 725-730, 1996.
L Breiman. Random Forests. Machine Learning, 45, 5-32, 2001.
M Robnik-Sikonja. Improving Random Forests. In Proc. of the European Conference on Machine Learning (ECML 2004), pp. 359-370, 2004.