Prev: Categorization, Next: Ensemble Techniques, Up: Other Techniques for Orange Scripting
While the Orange core provides mechanisms for estimating the relevance of attributes that describe classified instances, the module called orngFSS provides functions and wrappers that simplify feature subset selection. For instance, the following code loads the data, sets up a filter that uses the Relief measure to estimate the relevance of attributes and removes attributes with relevance lower than 0.01, and in this way constructs a new data set.
fss6.py (uses adult_sample.tab)
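The listing itself is not reproduced here; the sketch below shows roughly what such a script could look like with the old Orange scripting interface. The names orngFSS.attMeasure and orngFSS.FilterRelief come from the text; the exact keyword for the 0.01 threshold (margin below) and the layout of the list returned by attMeasure are assumptions.

import orange, orngFSS

def report_relevance(data):
    # estimate the relevance of each attribute (Relief is the default measure);
    # assumption: attMeasure returns (attribute name, relevance) pairs
    for name, relevance in orngFSS.attMeasure(data):
        print("%5.3f %s" % (relevance, name))

data = orange.ExampleTable("adult_sample")
print("Before feature subset selection (%d attributes):" %
      len(data.domain.attributes))
report_relevance(data)

# keep only the attributes whose Relief relevance is at least 0.01
fss_filter = orngFSS.FilterRelief(margin=0.01)   # keyword name is an assumption
new_data = fss_filter(data)
print("After feature subset selection (%d attributes):" %
      len(new_data.domain.attributes))
report_relevance(new_data)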
Notice that we have also defined a function report_relevance that takes the data, computes the relevance of the attributes (by calling orngFSS.attMeasure), and reports the computed relevances. Notice that (by chance!) both orngFSS.attMeasure and orngFSS.FilterRelief use the same measure to estimate attributes, so this code would be cleaner if a single attribute-measuring object were set up first and passed to both orngFSS.FilterRelief and report_relevance (we leave this to you as an exercise). The output of the above script is:
Out of 14 attributes, 5 were considered to be most relevant. We can now check whether this helps a classifier achieve better performance. We will use 10-fold cross validation for the comparison. To do this properly, feature subset selection has to be repeated every time new learning data is presented, so we need to construct a learner that performs feature subset selection up-front, i.e., before it actually learns. For the learner, we will use naive Bayes with categorization (a particular wrapper from orngDisc). The code is quite short, since we will also use a wrapper called FilteredLearner from the orngFSS module:
an excerpt from fss7.py (uses adult_sample.tab)
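The excerpt is not shown here; a minimal sketch of the idea follows. It assumes that the orngDisc wrapper is called DiscretizedLearner, that FilteredLearner takes the base learner plus a filter keyword, and that FilterRelief is reused with the same 0.01 threshold as above; these details go beyond what the text itself specifies.

import orange, orngDisc, orngFSS, orngTest, orngStat

data = orange.ExampleTable("adult_sample")

# naive Bayes wrapped so that attributes are categorized before learning
bayes = orngDisc.DiscretizedLearner(orange.BayesLearner(), name="disc bayes")

# the same learner, but with feature subset selection performed up-front
fss = orngFSS.FilterRelief(margin=0.01)
fss_bayes = orngFSS.FilteredLearner(bayes, filter=fss, name="bayes & fss")

learners = [bayes, fss_bayes]
results = orngTest.crossValidation(learners, data, folds=10,
                                   storeClassifiers=1)

# classification accuracy of both learners
for learner, ca in zip(learners, orngStat.CA(results)):
    print("%-12s %5.3f" % (learner.name, ca))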
Below is the result. In terms of classification accuracy, feature subset selection did not help. But the rightmost column shows the number of features used by each classifier (averaged across the ten folds of cross validation), and it is quite surprising that, on average, only about two features were sufficient.
another excerpt from fss7.py (uses adult_sample.tab)
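A sketch of how the stored classifiers could be inspected is given below. It assumes that crossValidation was run with storeClassifiers=1 as above, that results.classifiers is indexed as [fold][learner], and that the classifier built by FilteredLearner exposes the attributes it actually used through its domain; all of these details are assumptions and may differ from the actual excerpt.

# count in how many of the ten folds each attribute was used by the
# fss-wrapped learner (index 1 in the learners list above)
usage = {}
for fold_classifiers in results.classifiers:
    classifier = fold_classifiers[1]
    for att in classifier.domain.attributes:
        usage[att.name] = usage.get(att.name, 0) + 1

print("Number of times features were used in cross validation:")
for name, count in sorted(usage.items(), key=lambda item: -item[1]):
    print("%2d x %s" % (count, name))

# average number of attributes used per constructed classifier
print("On average %.1f attributes were used per classifier" %
      (sum(usage.values()) / float(len(results.classifiers))))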
The following part of the output shows the attribute usage. Quite interestingly, only four attributes were used in the constructed classifiers, and only one of them (A9) appears in all ten classifiers built by cross validation.
There are more examples of feature subset selection in the documentation of the orngFSS module.