orngOutlier: module for detecting outliers

This page describes a class for detecting outliers.

The class first calculates, for each example, its average distance to the other examples in the given data. It then converts these average distances to Z-scores. A Z-score above zero denotes an example that is farther from the other examples than average; the higher the score, the more likely the example is an outlier.

Outlier detection can be performed directly on examples or on a precomputed distance matrix. The number of nearest neighbours used for averaging distances can also be set. The default is 0, which means that all other examples are used when calculating average distances.
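The two steps described above can be sketched in plain Python, independently of the module. This is only an illustration, not the module's own code: the function names average_distances and z_scores are hypothetical, the distance matrix is a plain list of lists, and the population standard deviation is assumed (the page does not say which form the module uses).

```python
import math

def average_distances(dist, knn=0):
    """Average distance from each item to the others.

    dist is a full symmetric matrix (list of lists) with zeros on the diagonal.
    knn=0 averages over all other items; knn=k averages over the k nearest ones.
    """
    averages = []
    for i, row in enumerate(dist):
        # Distances to every other item, nearest first.
        others = sorted(d for j, d in enumerate(row) if j != i)
        if knn > 0:
            others = others[:knn]
        averages.append(sum(others) / float(len(others)))
    return averages

def z_scores(values):
    """Standardize values as (x - mean) / std (population form, an assumption).

    Not handled here: all values equal, which gives a zero standard deviation.
    """
    n = float(len(values))
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

# Four points on a line at positions 0, 1, 2 and 10; the last one is isolated.
points = [0.0, 1.0, 2.0, 10.0]
dist = [[abs(a - b) for b in points] for a in points]
z_all = z_scores(average_distances(dist))          # all other items averaged
z_knn = z_scores(average_distances(dist, knn=1))   # nearest neighbour only
```

In both variants only the isolated point receives a positive score. Restricting the average to a few nearest neighbours makes the score depend on how isolated a point is locally rather than on its position relative to the whole dataset.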


OutlierDetection

Methods

setExamples(examples, distance)
Sets the examples on which outlier detection will be performed. The distance argument is an object capable of calculating the distance between examples. If omitted, Manhattan distance is used.
setDistanceMatrix(orange.SymMatrix)
Sets the distance matrix on which the outlier detection will be performed.
setKNN(neighbours)
Sets the number of nearest neighbours considered when determining outliers.
distanceMatrix()
Returns the distance matrix of the dataset.
zValues()
Returns a list of Z-scores of the average distances from each example to the others. The N-th number in the list is the Z-score of the N-th example.

Examples

The following example prints a list of Z-scores of the examples in the bridges dataset.

outlier1.py (uses bridges.tab)

import orange, orngOutlier

data = orange.ExampleTable("bridges")
outlierDet = orngOutlier.OutlierDetection()
outlierDet.setExamples(data)
print outlierDet.zValues()

The following example prints the 5 examples with the highest Z-scores. Euclidean distance is used as the distance measure, and the average distance is calculated over the 3 nearest neighbours.

outlier2.py (uses bridges.tab)

import orange, orngOutlier

data = orange.ExampleTable("bridges")
outlierDet = orngOutlier.OutlierDetection()
outlierDet.setExamples(data, orange.ExamplesDistanceConstructor_Euclidean(data))
outlierDet.setKNN(3)
zValues = outlierDet.zValues()

# Sorted copy of the Z-scores; the sixth-highest value serves as the threshold.
sortedZ = sorted(zValues)
for i, el in enumerate(zValues):
    if el > sortedZ[-6]:
        print data[i], "Z-score: %5.3f" % el