Module orngFSS implements several functions that support or may help design feature subset selection for classification problems. The guiding idea is that some machine learning methods may perform better if they learn only from a selected subset of "best" features. orngFSS mostly implements filter approaches, i.e., approaches where attribute scores are estimated prior to modelling, that is, without knowing which machine learning method will be used to construct a predictive model.
The module provides the following functions:

attMeasure(data[, measure])
Assesses the quality of the attributes using the given measure and returns a list of couples (attribute name, score). measure is of type orange.MeasureAttribute and defaults to orange.MeasureAttribute_relief(k=20, m=50).

bestNAtts(scores, N)
Returns the first N attributes from the list returned by attMeasure.

selectBestNAtts(data, scores, N)
Constructs and returns a new data set that includes the class and only the N best attributes from the list scores.

selectAttsAboveThresh(data, scores[, threshold])
Constructs and returns a new data set that includes the class and the attributes from the list returned by attMeasure that have the score above or equal to a specified threshold. data is used to pass an original data set. Parameter threshold is optional and defaults to 0.0.

filterRelieff(data[, measure[, margin]])
Iteratively removes the worst-scored attribute and re-estimates the scores until all remaining attributes score above margin. measure defaults to orange.MeasureAttribute_relief(k=20, m=50), and margin defaults to 0.0. Notice that this filter procedure was originally designed for measures such as Relief, which are context dependent, i.e., removal of an attribute may change the scores of the other remaining attributes; hence the need to re-estimate the scores every time an attribute is removed.

The module also defines the following classes:

FilterAttsAboveThresh([data][, measure][, threshold])
This is a wrapper around the function selectAttsAboveThresh. It allows one to create an object which stores the filter's parameters and can later be called with the data to return a data set that includes only the selected attributes. measure is a function that returns a list of couples (attribute name, score), and it defaults to orange.MeasureAttribute_relief(k=20, m=50). The default threshold is 0.0. An example of how to use this class is given below.
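A minimal sketch (the voting data set and the chosen threshold are merely illustrative; the keyword argument to the constructor is an assumption based on the description above):

import orange, orngFSS

data = orange.ExampleTable("voting")

# build the filter; measure defaults to orange.MeasureAttribute_relief(k=20, m=50),
# and the threshold used here is arbitrary
f = orngFSS.FilterAttsAboveThresh(threshold=0.05)

# calling the filter with the data returns the reduced data set
ndata = f(data)
print "%d of %d attributes retained" % \
      (len(ndata.domain.attributes), len(data.domain.attributes))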
FilterBestNAtts([data][, measure][, n])
Similarly to FilterAttsAboveThresh, this is a wrapper around the function selectBestNAtts. The measure and the number of attributes to retain are optional (the latter defaults to 5).

FilterRelief([data][, measure][, margin])
Similarly to FilterBestNAtts, this is a wrapper around the function filterRelieff. measure and margin are optional attributes, where measure defaults to orange.MeasureAttribute_relief(k=20, m=50) and margin to 0.0.

Let us start with a simple script that reads the data and uses orngFSS.attMeasure to derive attribute scores; the same scoring is then used to report on (only) the three best-scored attributes.
fss1.py (uses voting.tab)
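A sketch of what such a script could look like (the actual fss1.py may differ in details):

import orange, orngFSS

data = orange.ExampleTable("voting")

# derive attribute scores with the default measure (Relief)
ma = orngFSS.attMeasure(data)

# report on the three best-scored attributes
print "Best three attributes:"
for m in ma[:3]:
    print "%5.3f %s" % (m[1], m[0])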
The script should output something like:
The following script reports on gain ratio and Relief attribute scores. Notice that for our data set the two rankings match rather well!
fss2.py (uses voting.tab)
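A sketch that would produce such a report, assuming orange.MeasureAttribute_gainRatio for the gain ratio score (the actual fss2.py may differ):

import orange, orngFSS

data = orange.ExampleTable("voting")

# score the attributes twice: with gain ratio and with Relief
gr = orngFSS.attMeasure(data, orange.MeasureAttribute_gainRatio())
rl = dict(orngFSS.attMeasure(data))  # Relief is the default measure

# print both scores side by side for the five best attributes by gain ratio
print "%-30s %9s %9s" % ("attribute", "gainRatio", "relief")
for name, score in gr[:5]:
    print "%-30s %9.4f %9.4f" % (name, score, rl[name])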
Attribute scoring has at least two potential uses. One is informative (or descriptive): the data analyst can use attribute scoring to find the "good" attributes and those that are irrelevant for a given classification task. The other is to improve the performance of machine learning by learning only from a data set that includes the most informative features. This so-called filter approach can boost the learner's predictive accuracy, speed up induction, and simplify the resulting models.
Following is a script that defines a new classifier that is based on naive Bayes and, prior to learning, selects the five best attributes from the data set. The new classifier is wrapped in a special class (see the Building your own learner lesson in Orange for Beginners). The script compares this filtered learner with naive Bayes that uses the complete set of attributes.
fss3.py (uses voting.tab)
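A simplified sketch of such a learner, assuming that scoring with attMeasure and then calling selectBestNAtts before naive Bayes captures the idea (the actual fss3.py may be structured differently):

import orange, orngFSS

class BayesFSS:
    """Naive Bayes preceded by attribute subset selection (a sketch)."""
    def __init__(self, N=5, name='naive Bayes with FSS'):
        self.N = N
        self.name = name
    def __call__(self, data, weight=None):
        # score the attributes and keep only the N best ones
        ma = orngFSS.attMeasure(data)
        filtered = orngFSS.selectBestNAtts(data, ma, self.N)
        # learning from the reduced data set returns a classifier
        return orange.BayesLearner(filtered)

data = orange.ExampleTable("voting")
classifier = BayesFSS()(data)
print classifier(data[0])  # classify the first example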
Interestingly, and somewhat expectedly, feature subset selection helps. This is the output that we get:
Although perhaps educational, everything above can also be done by wrapping the learner using FilteredLearner, thus creating an object that is assembled from a data filter and a base learner. When given data, this learner uses the attribute filter to construct a new data set and the base learner to construct a corresponding classifier. Attribute filters should be of a type like orngFSS.FilterAttsAboveThresh or orngFSS.FilterBestNAtts, that is, objects that can be initialized with their arguments and later presented with data, returning a new, reduced data set.
The following code fragment essentially replaces the bulk of the code from the previous example and compares the naive Bayesian classifier to the same classifier when only the single most important attribute is used:
from fss4.py (uses voting.tab)
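The fragment presumably resembles the sketch below (the keyword arguments passed to FilteredLearner and FilterBestNAtts are assumptions based on the description above):

import orange, orngFSS, orngEval

data = orange.ExampleTable("voting")

nb = orange.BayesLearner(name='bayes')
# wrap naive Bayes so that only the single best attribute is used
fl = orngFSS.FilteredLearner(nb, filter=orngFSS.FilterBestNAtts(n=1),
                             name='filtered')

# compare the two learners with cross validation; how the accuracies
# are read out of results is left out of this sketch
learners = [nb, fl]
results = orngEval.CrossValidation(learners, data)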
Now, let's decide to retain three attributes (change the code in fss4.py accordingly!) and observe how many times each attribute was used. Remember, 10-fold cross-validation constructs ten instances of each classifier, and each time FilteredLearner is run a different set of attributes may be selected. orngEval.CrossValidation stores the classifiers in the results variable, and FilteredLearner returns a classifier that can tell which attributes it used (how convenient!), so the code to do all this is quite short:
from fss4.py (uses voting.tab)
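A sketch of that code, assuming the results object from the fragment above keeps the classifiers (results.classifiers) and that the filtered classifier exposes the attributes it used through an atts() method:

# count how many times each attribute was selected in the ten folds;
# results.classifiers[i][1] is assumed to hold fold i's classifier
# produced by the filtered learner (the second learner in the list)
attsUsed = {}
for i in range(10):
    for a in results.classifiers[i][1].atts():
        attsUsed[a.name] = attsUsed.get(a.name, 0) + 1
for name in attsUsed.keys():
    print "%2d x %s" % (attsUsed[name], name)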
Running fss4.py with three attributes selected each time a learner is run gives the following result:
Experiment yourself to see which attribute is most frequently selected over the ten cross-validation runs when only a single attribute is retained for the classifier!
K. Kira and L. Rendell. A practical approach to feature selection. In D. Sleeman and P. Edwards, editors, Proc. 9th Int'l Conf. on Machine Learning, pages 249-256, Aberdeen, 1992. Morgan Kaufmann.
I. Kononenko. Estimating attributes: analysis and extensions of RELIEF. In F. Bergadano and L. De Raedt, editors, Proc. European Conf. on Machine Learning (ECML-94), pages 171-182. Springer-Verlag, 1994.
R. Kohavi and G. John. Wrappers for feature subset selection. Artificial Intelligence, 97(1-2):273-324, 1997.