Contingency Matrix

A contingency matrix contains conditional distributions. Contingency matrices work for both discrete and continuous attributes; the examples on this page are mostly limited to discrete attributes, but the analogous operations can be done with continuous values.

part of contingency1.py (uses monk1.tab)

    >>> import orange
    >>> data = orange.ExampleTable("monk1")
    >>> cont = orange.ContingencyAttrClass("e", data)
    >>> for val, dist in cont.items():
    ...     print val, dist
    1 <0.000, 108.000>
    2 <72.000, 36.000>
    3 <72.000, 36.000>
    4 <72.000, 36.000>

As this simple example shows, a contingency is similar to a dictionary (or a list; it is a bit ambiguous), where attribute values serve as keys and class distributions are the dictionary values. The attribute e is here called the outer attribute, and the class is the inner. That's not the only possible configuration of a contingency matrix: the class can also be the outer attribute, or there can be no class at all, in which case the matrix shows the distributions of one attribute's values given the value of another.

There is a hierarchy of classes with contingencies:

Contingency
    ContingencyClass
        ContingencyClassAttr
        ContingencyAttrClass
    ContingencyAttrAttr

The base object is Contingency. Derived from it is ContingencyClass in which one of the attributes is class attribute; ContingencyClass is a base for two classes, ContingencyAttrClass and ContingencyClassAttr, the former having class as the inner and the latter as the outer attribute. Class ContingencyAttrAttr is derived directly from Contingency and represents contingency matrices in which none of the attributes is the class attribute.

The most commonly used of the above classes is ContingencyAttrClass, which represents conditional probabilities of classes given the attribute value.

General Contingency Matrix

Here's what all contingency matrices share in common.

Attributes

outerVariable
The outer attribute descriptor. In the above case, it is e.
innerVariable
The inner attribute descriptor. In the above case, it is the class attribute.
outerDistribution
The distribution of the outer attribute's values, that is, the sums of rows. In the above case, the distribution of e is <108.000, 108.000, 108.000, 108.000>.
innerDistribution
The distribution of the inner attribute. In the above case, it is the class distribution, which is <216.000, 216.000>.
innerDistributionUnknown
The distribution of the inner attribute for the examples where the outer attribute was unknown. This is the difference between the innerDistribution and the sum of all distributions in the matrix.
varType
The varType for the outer attribute (discrete, continuous...); varType equals outerVariable.varType and outerDistribution.varType.
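
The relationships between these fields can be checked by hand. The following is a minimal plain-Python sketch that does not use Orange: the dictionary counts simply copies the numbers printed at the top of the page, and all names in it are illustrative rather than part of the library.

```python
# Rows of the contingency for attribute e in monk1, copied from the
# output at the top of the page: outer value -> [class 0, class 1].
counts = {
    "1": [0.0, 108.0],
    "2": [72.0, 36.0],
    "3": [72.0, 36.0],
    "4": [72.0, 36.0],
}

# Outer distribution: one entry per outer value, the sum of its row.
outer_distribution = {value: sum(row) for value, row in counts.items()}

# Element-wise sum of all rows. When no outer value is unknown, this
# equals the inner distribution.
inner_distribution = [sum(column) for column in zip(*counts.values())]

for value in sorted(outer_distribution):
    print("%s: %.3f" % (value, outer_distribution[value]))
print(inner_distribution)
```

Since the rows here add up to the inner distribution given above, the difference that innerDistributionUnknown describes is zero for this attribute.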

Methods

<standard list/dictionary operations>
A contingency matrix is a cross between a dictionary and a list. It supports the standard dictionary methods keys, values and items.

    >>> print cont.keys()
    ['1', '2', '3', '4']
    >>> print cont.values()
    [<0.000, 108.000>, <72.000, 36.000>, <72.000, 36.000>, <72.000, 36.000>]
    >>> print cont.items()
    [('1', <0.000, 108.000>), ('2', <72.000, 36.000>), ('3', <72.000, 36.000>), ('4', <72.000, 36.000>)]

Although keys returned by the above functions are strings, you can index the contingency with anything that converts into values of the outer attribute - strings, numbers or instances of Value.

    >>> print cont[0]
    <0.000, 108.000>
    >>> print cont["1"]
    <0.000, 108.000>
    >>> print cont[orange.Value(data.domain["e"], "1")]
    <0.000, 108.000>

Naturally, the length of Contingency equals the number of values of the outer attribute. The only weird thing is that iterating through contingency (by using a for loop, for instance) doesn't return keys, as with dictionaries, but dictionary values.

    >>> for i in cont:
    ...     print i
    <0.000, 108.000>
    <72.000, 36.000>
    <72.000, 36.000>
    <72.000, 36.000>

If cont behaved like a normal dictionary, the above script would print out the keys, that is, strings from '1' to '4'.
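
To make the dictionary/list hybrid concrete, here is a small plain-Python sketch of a container with the same outward behaviour. ToyContingency is a hypothetical stand-in written for this page, not Orange's implementation, and its rows are plain lists instead of Distribution objects.

```python
class ToyContingency(object):
    """Hypothetical stand-in: keyed like a dict, iterated like a list."""

    def __init__(self, keys, rows):
        self._keys = list(keys)
        self._rows = list(rows)

    def keys(self):
        return list(self._keys)

    def values(self):
        return list(self._rows)

    def items(self):
        return list(zip(self._keys, self._rows))

    def __len__(self):
        return len(self._keys)

    def __getitem__(self, index):
        # Accept either a position (int) or a key (str).
        if isinstance(index, int):
            return self._rows[index]
        return self._rows[self._keys.index(index)]

    def __iter__(self):
        # Unlike a dict, iteration yields the stored rows, not the keys.
        return iter(self._rows)


toy = ToyContingency(["1", "2", "3", "4"],
                     [[0, 108], [72, 36], [72, 36], [72, 36]])
iterated = [row for row in toy]
print(toy[0] == toy["1"])
print(iterated == toy.values())
```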

add(outer_value, inner_value[, weight])
Adds an element to the contingency matrix.
normalize()
Normalizes all distributions (rows) in the contingency to sum to 1. It doesn't change the innerDistribution or outerDistribution.

    >>> cont.normalize()
    >>> for val, dist in cont.items():
    ...     print val, dist
    1 <0.000, 1.000>
    2 <0.667, 0.333>
    3 <0.667, 0.333>
    4 <0.667, 0.333>
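
What normalization does to each row can be sketched in a few lines of plain Python. The function normalize_rows is a hypothetical helper, not an Orange call, and leaving an all-zero row untouched is an assumption made here only to avoid dividing by zero.

```python
def normalize_rows(counts):
    """Return a copy of counts in which every non-empty row sums to 1."""
    normalized = {}
    for value, row in counts.items():
        total = float(sum(row))
        # Assumption: an all-zero row is copied unchanged.
        normalized[value] = [c / total for c in row] if total else list(row)
    return normalized


# Counts copied from the example at the top of the page.
counts = {"1": [0, 108], "2": [72, 36], "3": [72, 36], "4": [72, 36]}
probs = normalize_rows(counts)
for value in sorted(probs):
    print("%s <%s>" % (value, ", ".join("%.3f" % p for p in probs[value])))
```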

Contingency

The base class is, for a change, not abstract. Its constructor expects two attribute descriptors, the first one for the outer and the second for the inner attribute. It initializes empty distributions and it's up to you to fill them. This is, for instance, how to manually reproduce the results of the script at the top of the page.

part of contingency2.py (uses monk1.tab)

    import orange

    data = orange.ExampleTable("monk1")
    cont = orange.Contingency(data.domain["e"], data.domain.classVar)
    for ex in data:
        cont[ex["e"]][ex.getclass()] += 1

    print "Contingency items:"
    for val, dist in cont.items():
        print val, dist
    print

The "reproduction" is not perfect. We didn't care about unknown values and haven't computed innerDistribution and outerDistribution. The better way to do it is by using the method add, so that the loop becomes:

    for ex in data:
        cont.add(ex["e"], ex.getclass())

It's not only simpler, but also correctly handles unknown values and updates innerDistribution and outerDistribution.
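
The extra bookkeeping that add performs can be pictured with the following plain-Python sketch. ToyCounts is hypothetical and only mirrors the fields described on this page; in particular, skipping examples whose inner value is unknown is an assumption of the sketch, not documented behaviour, and None merely stands in for an unknown value.

```python
class ToyCounts(object):
    """Hypothetical bookkeeping sketch; None stands for an unknown value."""

    def __init__(self, outer_values, inner_values):
        outer_values, inner_values = list(outer_values), list(inner_values)
        self.rows = {o: dict.fromkeys(inner_values, 0.0) for o in outer_values}
        self.outer_distribution = dict.fromkeys(outer_values, 0.0)
        self.inner_distribution = dict.fromkeys(inner_values, 0.0)
        self.inner_distribution_unknown = dict.fromkeys(inner_values, 0.0)

    def add(self, outer, inner, weight=1.0):
        if inner is None:
            return  # assumption: unknown inner values are skipped
        self.inner_distribution[inner] += weight
        if outer is None:
            # Counted in the inner distribution but in no row of the matrix.
            self.inner_distribution_unknown[inner] += weight
        else:
            self.outer_distribution[outer] += weight
            self.rows[outer][inner] += weight


toy = ToyCounts(["1", "2"], ["0", "1"])
toy.add("1", "1")
toy.add("2", "0", weight=2.0)
toy.add(None, "1")  # outer value unknown
print(toy.inner_distribution_unknown)
```

Incrementing a cell directly, as in the first loop, would touch only rows and leave the three distributions at zero.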

ContingencyClass

ContingencyClass is an abstract base class for contingency matrices that contain the class attribute, either as the inner or the outer attribute. It offers a function that makes filling the contingency clearer.

After reading through the rest of this page you might ask yourself why we need to separate the classes ContingencyAttrClass, ContingencyClassAttr and ContingencyAttrAttr, given that the underlying matrix is the same. This is to avoid confusion about what is in the inner and the outer variable. Contingency matrices are most often used to compute conditional probabilities of classes or attributes. By separating the classes and giving each of them specialized methods for the probabilities it is best suited to compute, the user (i.e., you or the method that gets passed the matrix) is relieved from checking what kind of matrix it got, that is, where the class is and where the attribute is.

Attributes

classVar (read only)
The class attribute descriptor. This is always equal either to innerVariable or outerVariable.
variable (read only)
The descriptor of the (non-class) attribute. This is always equal either to innerVariable or outerVariable.

Methods

add_attrclass(attribute_value, class_value[, weight])
Adds an element to the contingency. The difference between this and Contingency.add is that the attribute value is always the first argument and the class value the second, regardless of whether the attribute is actually the outer variable or the inner.
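
The argument-order convention can be illustrated with a plain-Python sketch. ToyClassContingency and its class_is_outer flag are hypothetical names chosen for this illustration; the sketch only shows how a fixed (attribute, class) order might be mapped onto (outer, inner) and is not how Orange implements it.

```python
class ToyClassContingency(object):
    """Hypothetical sketch of routing add_attrclass to add."""

    def __init__(self, class_is_outer):
        self.class_is_outer = class_is_outer
        self.cells = {}  # (outer_value, inner_value) -> accumulated weight

    def add(self, outer_value, inner_value, weight=1.0):
        key = (outer_value, inner_value)
        self.cells[key] = self.cells.get(key, 0.0) + weight

    def add_attrclass(self, attribute_value, class_value, weight=1.0):
        # Attribute first, class second, whichever of them is the outer one.
        if self.class_is_outer:
            self.add(class_value, attribute_value, weight)
        else:
            self.add(attribute_value, class_value, weight)


attr_class = ToyClassContingency(class_is_outer=False)
class_attr = ToyClassContingency(class_is_outer=True)
for matrix in (attr_class, class_attr):
    matrix.add_attrclass("e=2", "y=0")  # identical call for both kinds

print(attr_class.cells)
print(class_attr.cells)
```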

ContingencyAttrClass

ContingencyAttrClass is derived from ContingencyClass. Here, attribute is the outer variable (hence variable=outerVariable) and class is the inner (classVar=innerVariable), so this form of contingency matrix is suitable for computing the conditional probabilities of classes given a value of an attribute.

Calling add_attrclass(v, c) is here equivalent to calling add(v, c). In addition to this, the class supports computation of contingency from examples, as you have already seen in the example at the top of this page.

Methods

ContingencyAttrClass(attribute, class_attribute)
The inherited constructor, which does exactly the same as Contingency's constructor.
ContingencyAttrClass(attribute, examples[, weightID])
Constructor that constructs the contingency and computes the data from the given examples. If these are weighted, the meta attribute with example weights can be specified.
p_class(attribute_value)
Returns the distribution of classes given the attribute_value. If the matrix is normalized, this is equivalent to returning self[attribute_value]. Result is returned as a normalized Distribution.
p_class(attribute_value, class_value)
Returns the conditional probability of class_value given the attribute_value. If the matrix is normalized, this is equivalent to returning self[attribute_value][class_value].

Don't confuse the order of arguments: the attribute value is the first, the class value is the second, just as in add_attrclass. Although counterintuitive in this instance (since the returned value represents the conditional probability P(class_value|attribute_value)), this order is uniform for all (applicable) methods of classes derived from ContingencyClass.

You have seen this form of matrix used already at the top of the page. We shall only explore the new stuff we've learned about it.

part of contingency3.py (uses monk1.tab)

    import orange

    data = orange.ExampleTable("monk1")
    cont = orange.ContingencyAttrClass("e", data)

    print "Inner variable: ", cont.innerVariable.name
    print "Outer variable: ", cont.outerVariable.name
    print
    print "Class variable: ", cont.classVar.name
    print "Attribute: ", cont.variable.name
    print

    print "Distributions:"
    for val in cont.variable:
        print " p(.|%s) = %s" % (val.native(), cont.p_class(val))
    print

    firstclass = orange.Value(cont.classVar, 1)
    firstnative = firstclass.native()
    print "Probabilities of class '%s'" % firstnative
    for val in cont.variable:
        print " p(%s|%s) = %5.3f" % (firstnative, val.native(), cont.p_class(val, firstclass))

The inner and the outer variable and their relations to the class and the attribute are as expected.

    Inner variable: y
    Outer variable: e

    Class variable: y
    Attribute: e

Distributions are normalized and probabilities are elements from the normalized distributions. Knowing that the target concept is y := (e=1) or (a=b), distributions are as expected: when e equals 1, class 1 has a 100% probability, while for the rest, the probability is one third, which agrees with the probability that two independent three-valued attributes have the same value.

    Distributions:
     p(.|1) = <0.000, 1.000>
     p(.|2) = <0.667, 0.333>
     p(.|3) = <0.667, 0.333>
     p(.|4) = <0.667, 0.333>

    Probabilities of class '1'
     p(1|1) = 1.000
     p(1|2) = 0.333
     p(1|3) = 0.333
     p(1|4) = 0.333
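
These figures can be reproduced without Orange by enumerating the target concept. The sketch below assumes the usual Monk's problem domain (a, b and d with three values, c and f with two, e with four) and that the data set contains every combination exactly once; both are assumptions of the sketch, although they agree with the counts of 108 per value of e printed at the top of the page.

```python
from itertools import product

# Assumed number of values per attribute in the Monk 1 domain.
sizes = {"a": 3, "b": 3, "c": 2, "d": 3, "e": 4, "f": 2}
names = sorted(sizes)

# counts[e] = [number of examples with y=0, number with y=1]
counts = {e: [0, 0] for e in range(1, sizes["e"] + 1)}
for combo in product(*(range(1, sizes[n] + 1) for n in names)):
    ex = dict(zip(names, combo))
    y = int(ex["e"] == 1 or ex["a"] == ex["b"])  # the target concept
    counts[ex["e"]][y] += 1

for e in sorted(counts):
    total = float(sum(counts[e]))
    print("p(1|%d) = %5.3f" % (e, counts[e][1] / total))
```

For e different from 1 the class is 1 exactly when a equals b, which happens in 3 of the 9 combinations of a and b, hence the one third.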

Manual computation using add_attrclass is exactly the same as computation using add.

    cont = orange.ContingencyAttrClass(data.domain["e"], data.domain.classVar)
    for ex in data:
        cont.add_attrclass(ex["e"], ex.getclass())

ContingencyClassAttr

ContingencyClassAttr is similar to ContingencyAttrClass except that here the class attribute is the outer and the attribute the inner variable. As a consequence, this form of contingency matrix is suitable for computing conditional probabilities of attribute values given class values. Constructor and add_attrclass nevertheless get the arguments in the same order as for ContingencyAttrClass, that is, attribute first, class second.

Methods

ContingencyClassAttr(attribute, class_attribute)
The inherited constructor is exactly the same as Contingency's constructor, except that the argument order is reversed (in Contingency, the outer attribute is given first, while here the first argument, attribute, is the inner attribute).
ContingencyClassAttr(attribute, examples[, weightID])
Constructs the contingency and computes the data from the given examples. If these are weighted, the meta attribute with example weights can be specified.
p_attr(class_value)
Returns the distribution of attribute values given the class_value. If the matrix is normalized, this is equivalent to returning self[class_value]. Result is returned as a normalized Distribution.
p_attr(attribute_value, class_value)
Returns the conditional probability of attribute_value given the class_value. If the matrix is normalized, this is equivalent to returning self[class_value][attribute_value].

As you can see, the class is rather similar to ContingencyAttrClass, except that it has p_attr instead of p_class. If you, for instance, take the above script and replace the class name, the first few print statements output

part of the output from contingency4.py (uses monk1.tab)

    Inner variable: e
    Outer variable: y

    Class variable: y
    Attribute: e

This is exactly the reverse of the printout from ContingencyAttrClass. To print out the distributions, the only difference now is that you need to iterate through values of the class attribute and call p_attr. For instance,

part of contingency4.py (uses monk1.tab)

    for val in cont.classVar:
        print " p(.|%s) = %s" % (val.native(), cont.p_attr(val))

will print

    p(.|0) = <0.000, 0.333, 0.333, 0.333>
    p(.|1) = <0.500, 0.167, 0.167, 0.167>

If the class value is '0', then attribute e cannot be '1' (the first value), but can be anything else, with equal probabilities of 0.333. If the class value is '1', e is '1' in exactly half of the examples (work out why this is so); in the remaining cases, e is again distributed uniformly.
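
One way to check these numbers is to normalize the columns, rather than the rows, of the counts printed at the top of the page. This is a plain-Python sketch with illustrative names; it does not call Orange.

```python
# Counts for attribute e copied from the first example on this page:
# value of e -> [count of class 0, count of class 1].
counts = {"1": [0, 108], "2": [72, 36], "3": [72, 36], "4": [72, 36]}
values = sorted(counts)

# For each class, take its column and divide by the column total.
p_attr = {}
for class_index in (0, 1):
    column = [counts[v][class_index] for v in values]
    total = float(sum(column))
    p_attr[class_index] = [c / total for c in column]

for class_index in (0, 1):
    formatted = ", ".join("%.3f" % p for p in p_attr[class_index])
    print("p(.|%d) = <%s>" % (class_index, formatted))
```

Class 1 has 216 examples, 108 of which have e equal to 1, which gives the one half; the other three values of e share the remaining 108 examples equally.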

ContingencyAttrAttr

ContingencyAttrAttr stores contingency matrices in which none of the attributes is the class attribute. This is rather similar to Contingency, except that it has an additional constructor and method for getting the conditional probabilities.

Methods

ContingencyAttrAttr(outer_variable, inner_variable)
This constructor is exactly the same as that of Contingency.
ContingencyAttrAttr(outer_variable, inner_variable, examples[, weightID])
Computes the contingency from the given examples.
p_attr(outer_value)
Returns the probability distribution of the inner variable given the value of the outer variable (outer_value).
p_attr(outer_value, inner_value)
Returns the conditional probability of the inner_value given the outer_value.

In the following example, we shall use the ContingencyAttrAttr on dataset "bridges" to determine which material is used for bridges of different lengths.

part of contingency5.py (uses bridges.tab)

    import orange

    data = orange.ExampleTable("bridges")
    cont = orange.ContingencyAttrAttr("SPAN", "MATERIAL", data)
    cont.normalize()

    for val in cont.outerVariable:
        print "%s:" % val.native()
        for inval, p in cont[val].items():
            if p:
                print " %s (%i%%)" % (inval, int(100*p+0.5))
        print

The output tells us that short bridges are mostly wooden or iron, and that the longer ones (and most of the medium-sized ones) are made of steel.

    SHORT:
     WOOD (56%)
     IRON (44%)

    MEDIUM:
     WOOD (9%)
     IRON (11%)
     STEEL (79%)

    LONG:
     STEEL (100%)

Like all other contingency matrices, this one can also be computed "manually".

part of contingency5.py (uses bridges.tab)

    cont = orange.ContingencyAttrAttr(data.domain["SPAN"], data.domain["MATERIAL"])
    for ex in data:
        cont.add(ex["SPAN"], ex["MATERIAL"])

Contingencies with Continuous Values

What happens if one or both attributes are continuous? First, contingencies can be built for such attributes as well. Just imagine a contingency as a dictionary with attribute values as keys and objects of type Distribution as values.

If the outer attribute is continuous, you can use either its values or ordinary floating point numbers for indexing. The index must be one of the values that exist in the contingency matrix.

The following script will query for a distribution in between the first two keys, which triggers an exception.

part of contingency6.py (uses iris.tab)

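    import orange

    data = orange.ExampleTable("iris")
    cont = orange.ContingencyAttrClass(0, data)

    # The midpoint between two existing keys is not itself a key,
    # so the lookup cont[midkey] raises an exception.
    midkey = (cont.keys()[0] + cont.keys()[1])/2.0
    print "cont[%5.3f] = %s" % (midkey, cont[midkey])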

If you still find such contingencies useful, you need to be careful about what you pass as indices. Always use the values from keys() directly, instead of manually entering the keys' values you see printed. If, for instance, you print out the first key, see that it is 4.500 and then request cont[4.500], this can give an index error due to rounding.
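
The same pitfall exists with floating point keys in an ordinary Python dictionary, which makes it easy to demonstrate without Orange. In the sketch below the dictionaries and their contents are made up for illustration, and KeyError merely plays the role of whatever exception the contingency raises.

```python
# A made-up table keyed by floats, standing in for a contingency
# with a continuous outer attribute.
toy = {4.3: [1, 0, 0], 4.4: [3, 0, 0]}
keys = sorted(toy)

# The midpoint between two existing keys is not a key.
midkey = (keys[0] + keys[1]) / 2.0
try:
    toy[midkey]
    found_mid = True
except KeyError:
    found_mid = False
print(found_mid)

# Rounding: a key that prints as 0.300 need not equal the literal 0.3.
computed = 0.1 + 0.2
rounded_table = {computed: "row"}
print("%.3f" % computed)
print(0.3 in rounded_table)

# Keys taken directly from the table always work.
print(all(k in toy for k in keys))
```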

Contingencies with continuous inner attributes are more useful. First, indexing by discrete values is easier than indexing by continuous ones. Second, class Distribution covers both discrete and continuous distributions, so even the methods p_class and p_attr will work, except that they return not the probability but the density (interpolated, if necessary). See the page about Distribution for more information.

For instance, if you build a ContingencyClassAttr on the iris dataset, you can enquire about the probability of the sepal length 5.5.

part of contingency7.py (uses iris.tab)

    import orange

    data = orange.ExampleTable("iris")
    cont = orange.ContingencyClassAttr("sepal length", data)

    for val in cont.classVar:
        print " p(%s|%s) = %5.3f" % (5.5, val.native(), cont.p_attr(5.5, val))

The script's output is

    p(5.5|Iris-setosa) = 2.000
    p(5.5|Iris-versicolor) = 5.000
    p(5.5|Iris-virginica) = 1.000

These numbers represent the number of examples with a sepal length of 5.5 in each class. If the matrix were normalized, the numbers would be divided by the total number of examples in classes setosa, versicolor and virginica, respectively.
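
As a worked version of that division, the sketch below takes the three counts from the output above and assumes the standard iris data set with 50 examples per class and no unknown sepal lengths; it is plain arithmetic and does not call Orange.

```python
# Counts of examples with sepal length 5.5, copied from the output above.
counts_at_5_5 = {
    "Iris-setosa": 2.0,
    "Iris-versicolor": 5.0,
    "Iris-virginica": 1.0,
}
examples_per_class = 50.0  # assumption: the standard iris class sizes

normalized = {name: count / examples_per_class
              for name, count in counts_at_5_5.items()}
for name in sorted(normalized):
    print(" p(5.5|%s) = %5.3f" % (name, normalized[name]))
```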

Computing Contingencies for All Attributes

Computing contingency matrices requires iteration through examples. We often need to compute ContingencyAttrClass or ContingencyClassAttr for all attributes in the dataset and it is obvious that this will be faster if we do it for all attributes at once. That's taken care of by class DomainContingency.

DomainContingency is basically a list of contingencies, either of type ContingencyAttrClass or ContingencyClassAttr, with two additional fields and a constructor that computes the contingencies.

Attributes

classIsOuter (read only)
Tells whether the class is the outer or the inner attribute. Effectively, this tells whether the elements of the list are ContingencyAttrClass or ContingencyClassAttr.
classes
Contains the distribution of class values on the entire dataset.

Methods

DomainContingency(examples[, weightID][, classIsOuter=0|1])
Constructor needs to be given a list of examples. It constructs a list of contingencies; if classIsOuter is 0 (default), these will be ContingencyAttrClass, if 1, ContingencyClassAttr are used. It then iterates through examples and computes the contingencies.
list-like operations
The only real difference between DomainContingency and an ordinary Python list (except for the additional methods and fields, of course) is that its elements can be indexed not only by numbers, but also by attribute names and descriptors, as shown in the example below.
normalize
Calls normalize for each contingency.

The following script will print the contingencies for attributes "a", "b" and "e" for the dataset Monk 1.

part of contingency8.py (uses monk1.tab)

    import orange

    data = orange.ExampleTable("monk1")
    dc = orange.DomainContingency(data)
    print dc["a"]
    print dc["b"]
    print dc["e"]

The contingencies in the DomainContingency dc are of type ContingencyAttrClass and tell us conditional distributions of classes, given the value of the attribute. To compute the distribution of attribute values given the class, one needs to get a list of ContingencyClassAttr.

part of contingency8.py (uses monk1.tab)

    dc = orange.DomainContingency(data, classIsOuter=1)
    print dc["a"]
    print dc["b"]
    print dc["e"]

Note that classIsOuter cannot be given as a positional argument; it needs to be passed by keyword.
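
The idea of computing all contingencies in a single pass, and of indexing the result by position or by attribute name, can be sketched in plain Python. ToyDomainContingency, its class_is_outer argument and the three data rows are all made up for this illustration; the real class stores contingency objects rather than dictionaries of counts.

```python
class ToyDomainContingency(object):
    """Hypothetical sketch: one table of counts per attribute."""

    def __init__(self, attribute_names, rows, class_is_outer=False):
        self.names = list(attribute_names)
        self.class_is_outer = class_is_outer
        self.tables = [{} for _ in self.names]
        self.classes = {}  # class distribution over the whole data
        for values, cls in rows:  # a single pass over the examples
            self.classes[cls] = self.classes.get(cls, 0) + 1
            for table, value in zip(self.tables, values):
                key = (cls, value) if class_is_outer else (value, cls)
                table[key] = table.get(key, 0) + 1

    def __len__(self):
        return len(self.tables)

    def __getitem__(self, index):
        # Accept a position or an attribute name.
        if isinstance(index, str):
            index = self.names.index(index)
        return self.tables[index]


# Made-up rows: ((value of a, value of b), class value).
rows = [(("1", "1"), "1"), (("1", "2"), "0"), (("2", "2"), "1")]
dc = ToyDomainContingency(["a", "b"], rows)
dc_outer = ToyDomainContingency(["a", "b"], rows, class_is_outer=True)
print(dc["a"] == dc[0])
print(sorted(dc_outer["a"].items()))
```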