Basic Statistics for Continuous Attributes

Orange contains two simple classes for computing basic statistics of continuous attributes, such as the minimal and maximal value or the average: BasicAttrStat holds the statistics for a single attribute, and DomainBasicAttrStat holds the statistics for all attributes in the domain.


BasicAttrStat

Attributes

variable
The descriptor for the attribute to which the data applies.
min, max
Minimal and maximal attribute values encountered in the data.
avg, dev
Average value and deviation.
n
Number of examples for which the value was defined (and used in the statistics). If examples were weighted, n is the sum of weights of those examples.
sum, sum2
Weighted sum of values and weighted sum of squared values of this attribute.
holdRecomputation
If true, the average and deviation are not recomputed on every call to add; recomputation is postponed until recompute is called.

Methods

add(value[, weight])
Adds a value to the statistics. Both arguments should be numbers; weight is optional, default is 1.0.
recompute()
Recomputes the average and deviation.

You most probably won't construct the class yourself, but instead call DomainBasicAttrStat to compute statistics for all continuous attributes in the dataset.

Nevertheless, here's how the class works. Values are fed into add; this is usually done by DomainBasicAttrStat, but you can also traverse the examples and feed the values from Python if you want to. For each value, add checks and, if necessary, adjusts min and max, then adds the weighted value to sum and its weighted square to sum2. The weight is added to n. If holdRecomputation is false, add also recomputes the average and the deviation; if it is true, this is postponed until recompute is called. Postponing makes sense when using the class from C++. From Python, the recomputation takes far less time than the interpreter overhead of the call itself, so you can leave recomputation enabled (holdRecomputation false).
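
The bookkeeping described above can be sketched in plain Python. This is not Orange's implementation: the class name RunningStat is made up for illustration, and the deviation formula (square root of the weighted mean of squares minus the squared mean) is an assumption, since the exact formula is not stated here.

```python
import math


class RunningStat(object):
    """Hypothetical stand-in for BasicAttrStat (illustration only)."""

    def __init__(self, holdRecomputation=False):
        self.min = None
        self.max = None
        self.n = 0.0     # sum of weights
        self.sum = 0.0   # weighted sum of values
        self.sum2 = 0.0  # weighted sum of squared values
        self.avg = 0.0
        self.dev = 0.0
        self.holdRecomputation = holdRecomputation

    def add(self, value, weight=1.0):
        # Adjust the extremes if necessary.
        if self.min is None or value < self.min:
            self.min = value
        if self.max is None or value > self.max:
            self.max = value
        # Accumulate the weighted sums; the weight itself goes into n.
        self.sum += weight * value
        self.sum2 += weight * value * value
        self.n += weight
        if not self.holdRecomputation:
            self.recompute()

    def recompute(self):
        if self.n > 0:
            self.avg = self.sum / self.n
            # Assumption: deviation as sqrt(E[x^2] - E[x]^2).
            variance = self.sum2 / self.n - self.avg ** 2
            self.dev = math.sqrt(max(variance, 0.0))


# Feed weighted values with recomputation postponed.
stat = RunningStat(holdRecomputation=True)
for value, weight in [(2.0, 1.0), (4.0, 1.0), (6.0, 2.0)]:
    stat.add(value, weight)
stat.recompute()  # avg and dev are refreshed only here; avg is 4.5
```

Note that nothing but a handful of running totals is stored, which is why the values can be fed one at a time.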

You can see that the statistics do not include the median or, more generally, any quantiles. That's because the class only collects statistics that can be computed on the fly, without remembering the data. If you need quantiles, you will need to construct a ContDistribution.
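
To see why the median cannot be recovered from the running totals, here is a small plain-Python illustration. It does not use Orange or ContDistribution; the helper names summary and median are hypothetical. The two samples agree on everything BasicAttrStat accumulates, yet their medians differ.

```python
def summary(values):
    """The quantities accumulated on the fly: min, max, n, sum, sum2."""
    return (min(values), max(values), len(values),
            sum(values), sum(v * v for v in values))


def median(values):
    """Median of a non-empty sequence; needs all values at once."""
    ordered = sorted(values)
    middle = len(ordered) // 2
    if len(ordered) % 2:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2.0


a = [0.0, 1.0, 5.0, 6.0, 8.0]
b = [0.0, 2.0, 3.0, 7.0, 8.0]

same_totals = summary(a) == summary(b)  # True: both give (0.0, 8.0, 5, 20.0, 126.0)
median_a = median(a)                    # 5.0
median_b = median(b)                    # 3.0
```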

DomainBasicAttrStat

DomainBasicAttrStat behaves as a list of BasicAttrStat except for a few details.

Methods

<constructor>
The constructor expects an example generator; if examples are weighted, the second (otherwise optional) argument should give the id of the meta-attribute with weights.

part of basicattrstat.py (uses iris.tab)

import orange

data = orange.ExampleTable("iris")
bas = orange.DomainBasicAttrStat(data)

print "%20s %5s %5s %5s" % ("attribute", "min", "max", "avg")
for a in bas:
    if a:
        print "%20s %5.3f %5.3f %5.3f" % (
            a.variable.name, a.min, a.max, a.avg)

This will print

           attribute   min   max   avg
        sepal length 4.300 7.900 5.843
         sepal width 2.000 4.400 3.054
        petal length 1.000 6.900 3.759
         petal width 0.100 2.500 1.199

purge()
Noticed the "if a" in the script? It is needed because discrete attributes, for which these statistics cannot be computed, are represented by None. The purge method gets rid of them by removing the None entries from the list.

<list-like operations>
DomainBasicAttrStat behaves like an ordinary list, except that its elements can also be indexed by attribute descriptors or attribute names.

>>> print bas["sepal length"].avg
5.84333467484
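
If it helps to picture the list-like behaviour, the following plain-Python sketch mimics it. It is only an illustration, not how Orange implements DomainBasicAttrStat: the names StatList and Stat are hypothetical, lookup is shown by name only (not by descriptor), and the averages are the rounded values printed above.

```python
from collections import namedtuple

# Hypothetical record standing in for a BasicAttrStat.
Stat = namedtuple("Stat", ["name", "avg"])


class StatList(list):
    """A list whose elements can also be looked up by attribute name."""

    def __getitem__(self, key):
        if isinstance(key, str):
            # Search by name, skipping the None placeholders.
            for stat in self:
                if stat is not None and stat.name == key:
                    return stat
            raise KeyError(key)
        return list.__getitem__(self, key)

    def purge(self):
        # Remove the None entries in place.
        self[:] = [stat for stat in self if stat is not None]


stats = StatList([
    Stat("sepal length", 5.843),
    None,  # placeholder for an attribute without statistics
    Stat("petal length", 3.759),
])

by_name = stats["petal length"].avg  # 3.759
by_index = stats[0].name             # "sepal length"
stats.purge()                        # two elements remain
```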

If you need more statistics, see information on distributions.