Distributions

Objects derived from Distribution are used throughout Orange to store various distributions. These often, but not necessarily, describe the distribution of an attribute's values over some dataset. You will most often encounter two classes derived from Distribution: DiscDistribution stores discrete distributions and ContDistribution stores continuous ones. To some extent, both resemble dictionaries, with attribute values as keys and the number of examples with a particular value as elements.

General Distributions

Class Distribution contains the methods common to the different types of distributions. In addition, its constructor can be used to construct objects of type DiscDistribution and ContDistribution (class Distribution itself is abstract, so no instances of that class can actually exist).

part of distributions.py (uses adult_sample.tab)

>>> import orange
>>> data = orange.ExampleTable("adult_sample")
>>> disc = orange.Distribution("workclass", data)
>>> print disc
<685.000, 72.000, 28.000, 29.000, 59.000, 43.000, 2.000>
>>> print type(disc)
<type 'DiscDistribution'>

This simple script prints out the distribution of the attribute "workclass" on the dataset "adult_sample". The resulting distribution is of type DiscDistribution since the attribute is discrete. The printed numbers are counts of examples that have a particular attribute value.

part of distributions.py (uses adult_sample.tab)

>>> workclass = data.domain["workclass"]
>>> for i in range(len(workclass.values)):
...     print "%20s: %5.3f" % (workclass.values[i], disc[i])
...
             Private: 685.000
    Self-emp-not-inc: 72.000
        Self-emp-inc: 28.000
         Federal-gov: 29.000
           Local-gov: 59.000
           State-gov: 43.000
         Without-pay: 2.000
        Never-worked: 0.000

Enough introduction. Here are Distribution's attributes and methods.

Attributes

variable
Descriptor of the attribute to which the distribution applies. Can be left empty, when not applicable.
unknowns
Number of examples for which the attribute value was unknown. This field is not changed at normalization (see below).
abs
Sum of all elements in the distribution.
cases
(Weighted) number of examples, on which the distribution is computed, not including the examples on which the observed attribute had unknown value. This equals abs as long as the distribution is not normalized.
normalized
If true, the distribution is normalized, i.e., the distribution sums to 1.
supportsDiscrete, supportsContinuous
Tell whether the distribution supports the protocol for working with discrete/continuous values. This is a rather internal detail; still, you can use these flags to check whether the distribution is discrete or continuous.
randomGenerator
A random generator needed for method random().

Methods

orange.Distribution(attribute[, examples[, weightID]])
Constructs either DiscDistribution or ContDistribution, depending on the attribute type. If attribute is the only argument, it must be an attribute descriptor (see Variable). In that case, an empty distribution is constructed. If examples are given as well, the attribute's distribution is computed, as seen in the above example. In that case, attribute can also be given by name or its position in the domain. If examples are weighted, the id of meta-attribute with weights is passed as the third argument (default is 0, no weights).

If attribute is given by descriptor, it doesn't need to exist in the domain, but it must be computable from given examples. This way, it is possible to obtain distributions for attributes constructed by constructive induction or for discretized attributes, without translating the entire dataset. There's an example for this in documentation on attribute descriptors.

<standard dictionary operations>
For getting elements of discrete distributions, indices of type Value, integers and symbolic names (if variable is defined) can be used. For continuous distributions, use a Value or a continuous number (e.g. cont[3.14]).

To get the number of examples with workclass="Private", you can use any of the three forms below:

print "Private: ", disc["Private"]
print "Private: ", disc[0]
print "Private: ", disc[orange.Value(workclass, "Private")]

Elements cannot be removed from distributions.

Length of distribution equals the number of possible values for discrete distributions (if variable is set), the value with the highest index encountered (if distribution is discrete and variable is not set) or the number of different values encountered (for continuous distributions).

keys(), values(), items()
Return a list of values, a list of example counts and a list of (value, frequency) pairs, respectively. For instance, the distribution in the last example of section "General Distributions" could be printed out by
for val, num in disc.items():
    print "%20s: %5.3f" % (val, num)
native()
Converts the distribution into a list (for discrete distributions) or a dictionary (for continuous distributions).
add(value[, weight])
Adds a value to the distribution - as if an example with weight weight (default is 1.0) was added. value can be orange.Value, an index (for discrete distributions), continuous number (for continuous distributions) or symbolic value, if variable is set.
normalize()
Divides all elements of the distribution by their sum (abs), sets normalized to true and abs to 1.0. Fields cases and unknowns are unchanged.
modus()
Returns the most common value of the attribute. If there is more than one such value, one is chosen at random (but always the same for particular distribution). More explanation on that is available on page about randomness in Orange.
random()
Returns a random value, where the probabilities of values are as given by the distribution. For continuous distributions, the returned value will always be one of the values that occur in the distribution (i.e., one of the values returned by keys()), not an arbitrary continuous value from the distribution's range.

This method uses distribution's randomGenerator. If none has been constructed and/or assigned yet, one is constructed and stored for further use.
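The interplay of abs, cases, unknowns and normalized described above can be illustrated with a toy model in plain Python. This is only a sketch under stated assumptions and not Orange's implementation: the class ToyDiscDistribution, its names list and the use of None for an unknown value are all hypothetical, chosen to mirror the attribute and method descriptions in this section.

```python
class ToyDiscDistribution(object):
    """Plain-Python toy model; NOT Orange's implementation."""

    def __init__(self, names):
        self.names = list(names)            # symbolic names of the values
        self.counts = [0.0] * len(names)    # one element per possible value
        self.unknowns = 0.0                 # weight of examples with unknown value
        self.abs = 0.0                      # sum of all elements
        self.cases = 0.0                    # weighted number of known-valued examples
        self.normalized = False

    def _index(self, key):
        # Accept either an integer index or a symbolic name.
        return self.names.index(key) if isinstance(key, str) else key

    def __getitem__(self, key):
        return self.counts[self._index(key)]

    def add(self, value, weight=1.0):
        if value is None:                   # assumption: None stands for "unknown"
            self.unknowns += weight
            return
        self.counts[self._index(value)] += weight
        self.abs += weight
        self.cases += weight

    def normalize(self):
        if self.abs == 0:                   # toy choice: leave an empty distribution alone
            return
        self.counts = [c / self.abs for c in self.counts]
        self.abs = 1.0                      # cases and unknowns stay unchanged
        self.normalized = True


toy = ToyDiscDistribution(["Private", "Local-gov", "State-gov"])
for value in ["Private", "Private", "Local-gov", None, 0]:
    toy.add(value)
toy.add("State-gov", weight=2.0)

before = (toy.abs, toy.cases, toy.unknowns)
toy.normalize()
```

Before normalization abs and cases agree; afterwards abs is 1.0 while cases still holds the weighted number of examples, which is the distinction the attribute list draws.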

Discrete distributions

Discrete distributions can be constructed directly.

Methods

DiscDistribution(attribute)
Constructor stores the attribute descriptor (which must be of a discrete attribute) to variable and allocates a list of appropriate size for the distribution.
DiscDistribution(list)
This form of constructor initializes a list, but leaves the variable at None. You can use such a distribution for random number generation.
disc = orange.DiscDistribution([0.5, 0.3, 0.2])
for i in range(20):
    print disc.random(),

This will print out approximately ten 0's, six 1's and four 2's. To name the values, you can assign an attribute descriptor.

v = orange.EnumVariable(values = ["red", "green", "blue"])
disc.variable = v
DiscDistribution(distribution)
A copy constructor, which initializes a new distribution as a copy of an existing one.
DiscDistribution()
A constructor that creates a distribution and leaves all fields blank, 0 and None.

Besides those constructors, there are no other specific operations for discrete distributions.
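Drawing random values from a list of probabilities, as in the DiscDistribution([0.5, 0.3, 0.2]) example above, can be mimicked in plain Python. The sketch below is a common cumulative-sum technique and only an assumption about how such sampling could work; it does not claim to reproduce Orange's randomGenerator. The function name draw_index and the seed are hypothetical.

```python
import random


def draw_index(weights, rng):
    """Return index i with probability weights[i] / sum(weights)."""
    threshold = rng.random() * sum(weights)
    cumulative = 0.0
    for index, weight in enumerate(weights):
        cumulative += weight
        if threshold < cumulative:
            return index
    return len(weights) - 1      # guard against floating-point round-off


rng = random.Random(42)          # fixed seed, only to make the run repeatable
draws = [draw_index([0.5, 0.3, 0.2], rng) for _ in range(10000)]
frequencies = [draws.count(i) / 10000.0 for i in range(3)]
```

Keeping a dedicated generator object, rather than the module-level functions, parallels the idea of a distribution owning its own random generator.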

Continuous distributions

A continuous distribution (ContDistribution) offers constructors similar to those of discrete distributions, except that instead of a list it expects a dictionary, such as the one returned by native(). It also has some specific methods.

Methods

ContDistribution(attribute)
Constructor that stores the attribute descriptor (which must be of a continuous attribute) to variable.
ContDistribution(dictionary)
Initializes the distribution with the values from the dictionary. All keys and values must be numbers.
ContDistribution(distribution)
A copy constructor that initializes the distribution as a copy of the existing distribution.
ContDistribution()
Constructor that leaves everything blank, 0 and None.
average()
Returns the average value.
var(), dev(), error()
Return variance, deviation and standard error of the distribution, respectively.
percentile(p)
Returns the p-th percentile of the distribution, i.e., a value x such that p percent of the attribute's values are smaller than x. p must be a value between 0 and 100. For instance, if dage is a continuous distribution, quartiles can be printed by
print "Quartiles: %5.3f - %5.3f - %5.3f" % \
    (dage.percentile(25), dage.percentile(50), dage.percentile(75))
density(x)
Returns probability density at x. If value is not present, it is interpolated.
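The statistics listed above can be sketched for a continuous distribution stored as a dictionary that maps values to counts, such as the one returned by native(). This is a plain-Python illustration, not Orange's code. Two conventions here are assumptions: the variance divides by the total weight (the population form), and percentile returns the smallest stored value whose cumulative weight reaches p percent, without interpolation. Orange may use different conventions.

```python
import math


def average(dist):
    """Weighted mean of a {value: count} dictionary."""
    total = sum(dist.values())
    return sum(x * w for x, w in dist.items()) / total


def variance(dist):
    """Weighted variance; assumption: divides by the total weight."""
    mean = average(dist)
    total = sum(dist.values())
    return sum(w * (x - mean) ** 2 for x, w in dist.items()) / total


def percentile(dist, p):
    """Smallest stored value whose cumulative weight reaches p percent."""
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    target = sum(dist.values()) * p / 100.0
    cumulative = 0.0
    for x in sorted(dist):
        cumulative += dist[x]
        if cumulative >= target:
            return x
    return max(dist)             # reached only through round-off


sample = {1.0: 1, 2.0: 1, 3.0: 1, 4.0: 1}    # made-up data
deviation = math.sqrt(variance(sample))
quartiles = [percentile(sample, p) for p in (25, 50, 75)]
```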

Gaussian distribution

Represents Gaussian distribution.

Attributes

mean, sigma
Parameters of the distribution.
abs
This field represents the number of "examples" for discrete and continuous distributions. In case of Gaussian distribution, this is the integral under density function; in effect, the normal Gaussian density function is multiplied by abs.

Methods

GaussianDistribution([mean, sigma])
Constructs Gaussian distribution. Default mean and sigma are 0.0 and 1.0 (normalized distribution), and abs is set to 1.0.
GaussianDistribution(distribution)
Constructs a Gaussian distribution by approximating another distribution. The given distribution must support the continuous protocol (i.e., it must be able to provide average and deviation). In other words, the distribution must be either a ContDistribution, in which case its average and deviation become mean and sigma of the new distribution, or a GaussianDistribution, which is simply copied. abs is set to distribution.abs.
average()
Returns mean.
dev(), error()
Both return sigma.
var()
Returns sigma^2 (sigma squared).
density(x)
Returns density at point x (Gaussian function multiplied by abs).
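The density described above, the normal Gaussian density multiplied by abs, can be written out directly. The function below is a standalone sketch of that formula; the name gaussian_density and the parameter abs_ (renamed to avoid shadowing Python's built-in abs) are hypothetical and not part of Orange.

```python
import math


def gaussian_density(x, mean=0.0, sigma=1.0, abs_=1.0):
    """Normal density at x, scaled so that its integral equals abs_."""
    norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return abs_ * norm * math.exp(-((x - mean) ** 2) / (2.0 * sigma ** 2))


peak = gaussian_density(0.0)                     # default mean 0.0, sigma 1.0
scaled_peak = gaussian_density(0.0, abs_=150.0)  # as if fitted to 150 examples

# Crude numerical check that the area under the scaled curve is close to abs_.
step = 0.01
area = sum(gaussian_density(-8.0 + i * step, abs_=150.0) * step
           for i in range(1601))
```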

Computing class distributions

Class distributions can be computed by calling orange.Distribution(data.domain.classVar, data, weightID) (weightID can be left out if examples are not weighted). Since this is a frequent operation, a shortcut is provided.

orange.getClassDistribution(examples[, weightID]) computes distribution of class values for the given data set. Result is of type DiscDistribution or ContDistribution.
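What a weighted class distribution amounts to can be shown with a small plain-Python sketch. It is not Orange code: examples are modeled as dictionaries, the keys "class" and "w" are made up, and a weight_key of None plays the role that a weightID of 0 (no weights) plays in Orange.

```python
def class_distribution(examples, class_values, weight_key=None):
    """Sum example weights per class value (weight 1.0 when unweighted)."""
    counts = [0.0] * len(class_values)
    for example in examples:
        weight = example[weight_key] if weight_key is not None else 1.0
        counts[class_values.index(example["class"])] += weight
    return counts


examples = [
    {"class": "yes", "w": 2.0},
    {"class": "no", "w": 0.5},
    {"class": "yes", "w": 1.0},
]
unweighted = class_distribution(examples, ["no", "yes"])
weighted = class_distribution(examples, ["no", "yes"], weight_key="w")
```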

Computing distributions for all attributes

Orange can compute distributions for all attributes in a single iteration over examples and store them in an object of type DomainDistributions. Its constructor accepts examples and, optionally, the ID of a meta-attribute with weights. The resulting object is list-like, except that attribute descriptors and names, not only integers, can be used for indexing.

The script below computes distributions for all attributes in the data and prints out distributions for discrete and averages for continuous attributes.

part of distributions.py (uses adult_sample.tab)

dist = orange.DomainDistributions(data)
for d in dist:
    if d.variable.varType == orange.VarTypes.Discrete:
        print "%30s: %s" % (d.variable.name, d)
    else:
        print "%30s: avg. %5.3f" % (d.variable.name, d.average())

To get the distribution for, say, attribute "age", you can either use its index in the domain or its name or descriptor.

dist_age = dist["age"]
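The idea of collecting all attribute distributions in one pass, and indexing the result by position or by name, can be sketched in plain Python. The class ToyDomainDistributions below is hypothetical and far simpler than Orange's DomainDistributions: it ignores weights, unknown values and attribute descriptors, and stores each distribution as a plain dictionary of counts.

```python
class ToyDomainDistributions(object):
    """List-like container of per-attribute counts; NOT Orange's class."""

    def __init__(self, attribute_names, rows):
        self.names = list(attribute_names)
        self.distributions = [{} for _ in self.names]
        for row in rows:                         # single pass over the data
            for dist, value in zip(self.distributions, row):
                dist[value] = dist.get(value, 0.0) + 1.0

    def __len__(self):
        return len(self.distributions)

    def __getitem__(self, key):
        # Accept an integer position or an attribute name.
        if isinstance(key, str):
            key = self.names.index(key)
        return self.distributions[key]


rows = [(39, "State-gov"), (50, "Private"), (39, "Private")]   # made-up data
toy_dist = ToyDomainDistributions(["age", "workclass"], rows)
toy_age = toy_dist["age"]
```

Because the counts for every attribute are updated inside the same loop over rows, the data is traversed only once, which is the point of computing all distributions together.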