orngVizRank: Orange VizRank module

The orngVizRank module implements the VizRank algorithm (Leban et al, 2004; Leban et al, 2005), which ranks the possible data projections generated by two visualization methods: scatterplot and radviz. For a given class-labeled data set, VizRank creates the possible data projections and assigns each of them a score of interestingness. The score is based on how well the classes are separated in the projection: if the classes are well separated, the projection gets a high score; otherwise the score is correspondingly lower. After the evaluation it is sensible to focus on the top-ranked projections, which provide the greatest insight into how to separate the classes.
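To make the scoring idea concrete, here is a minimal sketch; it is not the orngVizRank implementation. It assumes that class separation is measured by the leave-one-out accuracy of a k-nearest-neighbour classifier on the two plotted attributes (the kNN method is mentioned among the settings later in this document). The function names knn_accuracy and rank_scatterplots, the toy data and k=3 are illustrative assumptions.

```python
from itertools import combinations
from collections import Counter


def knn_accuracy(points, labels, k):
    """Leave-one-out k-NN accuracy (in percent) for points in a 2-D projection."""
    correct = 0
    for i, (xi, yi) in enumerate(points):
        # Squared Euclidean distance from example i to every other example.
        dists = sorted(
            ((xj - xi) ** 2 + (yj - yi) ** 2, labels[j])
            for j, (xj, yj) in enumerate(points)
            if j != i
        )
        # Majority vote among the k nearest neighbours.
        votes = Counter(label for _, label in dists[:k])
        if votes.most_common(1)[0][0] == labels[i]:
            correct += 1
    return 100.0 * correct / len(points)


def rank_scatterplots(rows, labels, names, k=3):
    """Score every attribute pair; return (score, pair) tuples, best first."""
    results = []
    for a, b in combinations(range(len(names)), 2):
        points = [(row[a], row[b]) for row in rows]
        results.append((knn_accuracy(points, labels, k), (names[a], names[b])))
    results.sort(key=lambda r: r[0], reverse=True)
    return results


# Toy data (made up): "x" and "y" separate the classes, "noise" does not.
rows = [
    (0.0, 0.1, 5.0), (0.1, 0.0, 1.0), (0.2, 0.2, 3.0), (0.1, 0.2, 4.0),
    (1.0, 1.1, 4.5), (1.1, 1.0, 1.5), (1.2, 1.2, 3.5), (1.1, 1.2, 2.5),
]
labels = ["a", "a", "a", "a", "b", "b", "b", "b"]

ranking = rank_scatterplots(rows, labels, ["x", "y", "noise"])
best_score, best_pair = ranking[0]
print(best_score, best_pair)
```

On this toy data the pair of the two informative attributes ends up at the top of the ranking, which is the behaviour described above: well-separated classes yield a high score.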

The rest of this document refers to two visualization methods: scatterplot and radviz. While the scatterplot is well known, radviz is less familiar; interested readers should see (Hoffman, 1997).


VizRank in Orange

The easiest way to use VizRank in Orange is through Orange widgets. Widgets like Scatterplot, Radviz and Polyviz (found in the Visualize tab in Orange Canvas) contain a "VizRank" button, which opens the VizRank dialog where you can change the settings and find interesting data projections.

A more advanced user, however, will perhaps also want to use VizRank in scripts. These users will use the orngVizRank module.

The rest of this document covers only the use of VizRank in scripts. For those who use VizRank in Orange widgets, we have provided extensive tooltips that should clarify the meaning of the different settings.

Creating a VizRank instance

First, let's look at a very simple example of how to use VizRank in scripts:

>>> import orange
>>> data = orange.ExampleTable("wine.tab")
>>> from orngVizRank import *
>>> vizrank = VizRank(SCATTERPLOT)    # options are: SCATTERPLOT, RADVIZ or LINEAR_PROJECTION
>>> vizrank.setData(data)             # set the data set
>>> vizrank.evaluateProjections()     # evaluate possible projections
>>> print vizrank.results[0]
(86.88861657813024, (86.88861657813024, [87.603105074268271, 82.08174408531525, 93.120556697249413], [59, 71, 48]), 178, ['A7', 'A10'], 5, {})

In this example we created a VizRank instance, evaluated scatterplot projections of the UCI wine data set and printed information about the best-ranked projection. The best projection scored 86.89 (on a scale from 0 to 100) and shows attributes 'A7' and 'A10'. The result list holds further information for each projection, but it is not relevant for a casual user.
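For readers who do want to look inside a result, the sketch below unpacks the tuple printed in the example above, copied verbatim. Only the positions of the score and of the attribute list follow from the text; reading the two inner lists as per-class values and per-class example counts is an assumption, supported by the fact that the counts sum to the 178 examples of the wine data.

```python
# The tuple printed by "print vizrank.results[0]" in the example above.
result = (86.88861657813024,
          (86.88861657813024,
           [87.603105074268271, 82.08174408531525, 93.120556697249413],
           [59, 71, 48]),
          178, ['A7', 'A10'], 5, {})

score = result[0]        # overall projection score (0-100)
details = result[1]      # assumed: (score, per-class values, per-class counts)
n_examples = result[2]   # number of examples in the data set
attributes = result[3]   # attributes shown in the projection

per_class_values, class_counts = details[1], details[2]

# For this result the overall score coincides with the class-size-weighted
# mean of the three per-class values.
weighted_mean = sum(v * n for v, n in zip(per_class_values, class_counts)) / n_examples

summary = "%.2f %s" % (score, ", ".join(attributes))
print(summary)
```

The remaining two elements of the tuple (5 and the empty dictionary) are not explained in this document, so the sketch leaves them untouched.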

Below is a list of settings and methods that can be used to modify VizRank's behaviour.

kValue
the number of examples used in predicting the class value. By default it is set to N/c, where N is the number of examples in the data set and c is the number of class values.
percentDataUsed
when handling large data sets, the kNN method might take a lot of time to evaluate each projection. We can still get a good estimate of projection interestingness if we consider only a subset of examples. You can specify a value between 0 and 100. Default: 100
qualityMeasure
there are different measures of prediction success that one can use to evaluate a classifier. You can use classification accuracy (CLASS_ACCURACY), average probability of correct classification (AVERAGE_CORRECT) or Brier score (BRIER_SCORE). Default: AVERAGE_CORRECT
testingMethod
the way in which the accuracy of the classifier is computed. You can use leave-one-out (LEAVE_ONE_OUT), 10-fold cross validation (TEN_FOLD_CROSS_VALIDATION) or testing on the learning set (TEST_ON_LEARNING_SET). Default: TEN_FOLD_CROSS_VALIDATION
attrCont
the measure used to rank continuous attributes. Attributes are ranked, and projections with top-ranked attributes are evaluated first. Possible options are ReliefF (CONT_MEAS_RELIEFF), Signal to Noise (CONT_MEAS_S2N), a modification of the Signal to Noise measure (CONT_MEAS_S2NMIX) or no measure (CONT_MEAS_NONE). Default: CONT_MEAS_RELIEFF
attrDisc
the measure used to rank discrete attributes. Attributes are ranked, and projections with top-ranked attributes are evaluated first. Possible options are ReliefF (DISC_MEAS_RELIEFF), Gain ratio (DISC_MEAS_GAIN), Gini index (DISC_MEAS_GINI) or no measure (DISC_MEAS_NONE). Default: DISC_MEAS_RELIEFF
useGammaDistribution
this parameter determines the order in which the heuristic selects the attributes that are then evaluated by VizRank. If the value is 0, the heuristic starts with the top-ranked attributes (as ranked by the measures specified in attrCont and attrDisc) and, once all their combinations have been tested, progresses to worse-ranked attributes. If the value is 1, the heuristic still ranks the attributes first but then selects them randomly according to a gamma distribution. This way the better-ranked attributes are still selected more often, but they are sometimes tested in combination with poorly ranked attributes, which can in the end produce a high-ranked projection. In domains with a larger set of attributes (>20) it is advisable to use the gamma distribution; otherwise we never get to evaluate projections with poorly ranked attributes. Default: 0
useExampleWeighting
if the class distribution is very uneven, example weighting can be used. Default: 0
evaluationTime
time in minutes that we want to spend evaluating projections. Since the number of possible projections can be large, this lets us stop the evaluation before all projections have been evaluated. Because of the search heuristic (attrCont and attrDisc), the projections with the highest scores will most likely be found at the beginning of the evaluation. Default: 2
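The three options of qualityMeasure can be illustrated with their usual textbook definitions. The sketch below is an assumption-laden illustration, not VizRank's code: the function names are made up, the predicted class probabilities are invented, and the way VizRank scales these values into a 0-100 score is not covered here.

```python
def predicted_class(p):
    """Index of the most probable class."""
    return max(range(len(p)), key=lambda c: p[c])


def classification_accuracy(probs, actual):
    """Fraction of examples whose most probable class is the true class."""
    hits = sum(1 for p, y in zip(probs, actual) if predicted_class(p) == y)
    return hits / len(actual)


def average_correct(probs, actual):
    """Mean probability assigned to the true class."""
    return sum(p[y] for p, y in zip(probs, actual)) / len(actual)


def brier_score(probs, actual):
    """Mean squared distance between the predicted probabilities and the
    0/1 indicator vector of the true class (lower is better)."""
    total = 0.0
    for p, y in zip(probs, actual):
        total += sum((pc - (1.0 if c == y else 0.0)) ** 2 for c, pc in enumerate(p))
    return total / len(actual)


# Invented class probabilities for four examples and three classes.
probs = [[0.75, 0.25, 0.0], [0.0, 1.0, 0.0], [0.25, 0.25, 0.5], [0.75, 0.25, 0.0]]
actual = [0, 1, 2, 1]  # true class indices

acc = classification_accuracy(probs, actual)
avg = average_correct(probs, actual)
brier = brier_score(probs, actual)
print(acc, avg, brier)
```

Unlike classification accuracy, the other two measures take the predicted probabilities into account, so a confident wrong prediction is penalized more than a hesitant one.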

Radviz specific settings:

optimizationType
for a description see attributeCount below. Possible values are EXACT_NUMBER_OF_ATTRS and MAXIMUM_NUMBER_OF_ATTRS. Default: MAXIMUM_NUMBER_OF_ATTRS
attributeCount
maximum number of attributes in a projection that we will consider. If optimizationType == MAXIMUM_NUMBER_OF_ATTRS then we will consider projections that have between 3 and attributeCount attributes. If optimizationType == EXACT_NUMBER_OF_ATTRS then we will consider only projections that have exactly attributeCount attributes. Default: 4
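The effect of the two optimizationType values on the number of candidate attribute sets can be sketched as follows. This is an illustration only: the constants are given placeholder values, candidate_attribute_sets is a hypothetical helper, and the attribute names are 13 stand-ins matching the number of attributes in the UCI wine data. In radviz the order of attributes around the circle also changes the picture, so each attribute set can correspond to several projections; the sketch counts sets only.

```python
from itertools import combinations

# Placeholder values for this sketch; the real constants come from orngVizRank.
EXACT_NUMBER_OF_ATTRS, MAXIMUM_NUMBER_OF_ATTRS = 0, 1


def candidate_attribute_sets(attributes, attribute_count, optimization_type):
    """Yield the attribute subsets considered under the given settings."""
    if optimization_type == EXACT_NUMBER_OF_ATTRS:
        sizes = [attribute_count]                 # exactly attributeCount attributes
    else:
        sizes = range(3, attribute_count + 1)     # between 3 and attributeCount
    for size in sizes:
        for subset in combinations(attributes, size):
            yield subset


attributes = ["A%d" % i for i in range(1, 14)]    # 13 stand-in attribute names

n_exact = sum(1 for _ in candidate_attribute_sets(attributes, 4, EXACT_NUMBER_OF_ATTRS))
n_max = sum(1 for _ in candidate_attribute_sets(attributes, 4, MAXIMUM_NUMBER_OF_ATTRS))
print(n_exact, n_max)
```

The counts grow quickly with attributeCount, which is one reason why the time limit and the attribute-ranking heuristic described above matter.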

Methods:

setData(data)
set the example table to evaluate
evaluateProjections()
start evaluating projections. If some projections are still unevaluated after evaluationTime minutes, the evaluation stops automatically.
save(filename)
save the list of evaluated projections
load(filename)
load a file with evaluated projections
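The interplay between the heuristic ordering and the evaluationTime limit can be pictured as a loop that scores candidates in order until the time budget is used up. The following is a generic sketch under that assumption, not the body of evaluateProjections(); evaluate_with_budget, the projection names and their scores are all hypothetical, and a fake clock is injected so that the demonstration is deterministic.

```python
import time


def evaluate_with_budget(candidates, score_fn, budget_minutes, clock=time.monotonic):
    """Score candidates in the given order until the list or the time runs out."""
    deadline = clock() + budget_minutes * 60.0
    results = []
    for candidate in candidates:
        if clock() >= deadline:
            break                      # budget exhausted: keep what we have
        results.append((score_fn(candidate), candidate))
    results.sort(key=lambda r: r[0], reverse=True)
    return results


# Made-up scores; the list is ordered as a ranking heuristic might order it.
scores = {"proj_a": 80.0, "proj_b": 86.9, "proj_c": 40.0, "proj_d": 35.0}
order = ["proj_a", "proj_b", "proj_c", "proj_d"]

# Fake clock that advances 50 seconds on every call.
ticks = iter(range(0, 1000, 50))
results = evaluate_with_budget(order, scores.get, 2, clock=lambda: next(ticks))
print(results)
```

With a 2-minute budget and 50 simulated seconds per check, only the first two candidates are scored, which shows why it pays to evaluate the most promising projections first.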

VizRank as a learner

VizRank can also be used as a learning method. You can construct a learner by creating an instance of the VizRankLearner class.

learner = VizRankLearner(SCATTERPLOT)

VizRankLearner accepts up to three parameters. The first is the visualization method to use (SCATTERPLOT or RADVIZ). The second is an instance of the VizRank class; if it is not given, a new instance is created. The third is a graph instance (an orngScaleScatterPlotData or orngScaleRadvizData instance); if it is not specified, a new instance is created.

To change VizRank's settings, simply access them through the learner.VizRank instance (e.g. learner.VizRank.kValue = 10).

The learner instance can be used like any other learner. Given the examples, it returns a classifier of type VizRankClassifier, which can be used like any other classifier:

classifier = learner(data)

When classifying, the VizRank classifier uses the evaluated projections to predict the class of a new example. The evaluated projections serve as arguments for the class values. Arguments have different values (weights), and the example is assigned to the class with the highest sum of argument values.
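The summing of argument values can be sketched in a few lines. This is only an illustration of the voting rule described above: classify_by_arguments is a hypothetical helper, the weights are invented, and how VizRank derives an argument's weight from a projection is not specified in this document.

```python
from collections import defaultdict


def classify_by_arguments(arguments):
    """arguments: list of (class_value, weight) pairs, one per projection.
    Returns the class with the highest total weight and all the totals."""
    totals = defaultdict(float)
    for class_value, weight in arguments:
        totals[class_value] += weight
    best = max(totals, key=totals.get)
    return best, dict(totals)


# Three invented arguments, as with argumentCount = 3.
arguments = [("Iris-setosa", 0.5), ("Iris-versicolor", 0.75), ("Iris-setosa", 0.5)]
predicted, totals = classify_by_arguments(arguments)
print(predicted, totals)
```

Here the single strongest argument favours Iris-versicolor, yet the example is assigned to Iris-setosa because the sum of its two arguments is higher.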

VizRank's settings that are relevant when using VizRank as a classifier:

argumentCount
number of arguments (projections) used when predicting the class value

A simple example:

>>> import orange
>>> from orngVizRank import *
>>> data = orange.ExampleTable("iris.tab")
>>> learner = VizRankLearner(SCATTERPLOT)
>>> learner.VizRank.argumentCount = 3
>>> classifier = learner(data)
>>> for i in range(5): print classifier(data[i]), data[i].getclass()
(<orange.Value 'iris'='Iris-setosa'>, <1.000, 0.000, 0.000>) Iris-setosa
(<orange.Value 'iris'='Iris-setosa'>, <1.000, 0.000, 0.000>) Iris-setosa
(<orange.Value 'iris'='Iris-setosa'>, <1.000, 0.000, 0.000>) Iris-setosa
(<orange.Value 'iris'='Iris-setosa'>, <1.000, 0.000, 0.000>) Iris-setosa
(<orange.Value 'iris'='Iris-setosa'>, <1.000, 0.000, 0.000>) Iris-setosa

References

Leban, G., Bratko, I., Petrovic, U., Curk, T., Zupan, B. VizRank: finding informative data projections in functional genomics by machine learning. Bioinformatics 21, 413-414 (2005).

Leban, G., Mramor, M., Bratko, I., Zupan, B. Simple and Effective Visual Models for Gene Expression Cancer Diagnostics. KDD-2005, 167-177 (Chicago, 2005).

Hoffman, P. E., Grinstein, G. G., Marx, K., Grosse, I., Stanley, E. DNA Visual and Analytic Data Mining. IEEE Visualization 1997 1, 437-441 (1997).