Clustering

Core Orange only provides hierarchical clustering, limited to single, complete and average linkage. Other clustering methods and associated utility functions are provided in the orngClustering module. See also Orange extensions by Aleks Jakulin for k-medoid and fuzzy clustering.

Hierarchical Clustering

Classes

The method for hierarchical clustering, encapsulated in the class HierarchicalClustering, works on a distance matrix stored as SymMatrix. The method runs in approximately O(n²) time (with a worst case of O(n³)). For orientation, clustering ten thousand elements should take roughly 15 seconds on a 2 GHz computer. The algorithm can either make a copy of the distance matrix and work on it, or work on the original distance matrix, destroying it in the process. The latter is useful when clustering a larger number of objects. Since the distance matrix stores (n+1)(n+2)/2 floats (about 2 MB for 1000 objects and 200 MB for 10000, assuming a float takes 4 bytes), by copying it we would quickly run out of physical memory. Using virtual memory is not an option since the matrix is accessed in a random manner.

The distance matrix should contain no negative elements. This limitation is due to implementation details of the algorithm (it is not absolutely necessary and can be lifted in future versions if often requested; it only helps the algorithm run a bit faster). The elements on the diagonal (representing an element's distance from itself) are ignored.

The distance matrix can have an attribute objects describing the objects we are clustering (this is available only in Python). This can be any sequence of the same length as the matrix - an ExampleTable, a list of examples, a list of attributes (if you're clustering attributes), or even a string of the correct length. This attribute is not used in clustering but is only passed on to the clusters' attribute mapping (see below), which will hold a reference to it (if you modify the list, the changes will affect the clusters as well).

Attributes of HierarchicalClustering

linkage
Specifies the linkage method, which can be either
  1. HierarchicalClustering.Single (default), where the distance between groups is defined as the distance between the closest pair of objects, one from each group,
  2. HierarchicalClustering.Average, where the distance between two clusters is defined as the average of distances between all pairs of objects, where each pair is made up of one object from each group,
  3. HierarchicalClustering.Complete, where the distance between groups is defined as the distance between the most distant pair of objects, one from each group (complete linkage is also called farthest neighbor), or
  4. HierarchicalClustering.Ward, which uses Ward's distance.
overwriteMatrix
If true (default is false), the algorithm will work on the original distance matrix, destroying it in the process. The benefit is that it will need much less memory (not much more than what is needed to store the tree of clusters).
progressCallback
A callback function (None by default). It can be any function or callable class in Python that accepts a single float as an argument. The function is only called if the number of objects being clustered is at least 1000. It will be called 101 times, and the argument will give the proportion of the work done. The time intervals between the calls won't be equal (sorry about that...) since the clustering proceeds faster as the number of clusters decreases.

The HierarchicalClustering is called with a distance matrix as an argument. It returns an instance of HierarchicalCluster representing the root of the hierarchy.
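For illustration, here is a minimal sketch of setting the attributes above and running the clustering. It assumes a SymMatrix named matrix has already been constructed; the callback name report is our own choice.

import orange

def report(progress):
    # progress is a float between 0 and 1; the callback is only
    # invoked when at least 1000 objects are being clustered
    print "%3.0f%% done" % (100 * progress)

clustering = orange.HierarchicalClustering()
clustering.linkage = orange.HierarchicalClustering.Average
clustering.overwriteMatrix = 1
clustering.progressCallback = report
root = clustering(matrix)   # matrix is a SymMatrix built elsewhere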

Attributes of HierarchicalCluster

branches
A list of subclusters; None if the node is a leaf (a single element). The list can contain more than two subclusters. (HierarchicalClustering never produces such clusters; this is here for any future extensions.)
left, right
Left and right subclusters. defined only if there are up to two branches - that is, always, if HierarchicalClustering was used for constructing the cluster..
height
The distance between the subclusters.
mapping, first, last
mapping is a list of indices to the distance matrix. It is the same for all clusters in the hierarchy - it simply represents the indices ordered according to the clustering. first and last are indices into the elements of mapping that belong to that cluster. (Seems weird, but is trivial - wait for the examples. On the other hand, you probably won't need to understand this anyway.)

If the distance matrix had an attribute objects defined, it is copied to mapping.objects.

Methods

__len__
Asking for the length of the cluster gives the number of objects belonging to it. This equals last - first.
<indexing>
By indexing the cluster we address its elements; these are either indices or objects (you'll understand this after seeing the examples). For instance, cluster[2] gives the third element of the cluster, and list(cluster) will return the cluster elements as a list. The cluster elements are read-only. To actually modify them, you'll have to go through mapping, as described below. This is intentionally complicated to discourage naive users from modifying what they do not understand.
swap()
Swaps the left and the right subcluster; obviously this will report an error when the cluster has more than two subclusters. This function changes the mapping and first and last of all clusters below this one and thus needs O(len(cluster)) time.
permute(permutation)
Permutes the subclusters. Permutation gives the order in which the subclusters will be arranged. As for swap, this function changes the mapping and first and last of all clusters below this one.

Example 1: Toy matrix

Let us construct a simple distance matrix and run clustering on it.

part of hclust_art.py

import orange

m = [[],
     [ 3],
     [ 2,  4],
     [17,  5,  4],
     [ 2,  8,  3,  8],
     [ 7,  5, 10, 11,  2],
     [ 8,  4,  1,  5, 11, 13],
     [ 4,  7, 12,  8, 10,  1,  5],
     [13,  9, 14, 15,  7,  8,  4,  6],
     [12, 10, 11, 15,  2,  5,  7,  3,  1]]
matrix = orange.SymMatrix(m)
root = orange.HierarchicalClustering(matrix,
    linkage=orange.HierarchicalClustering.Average)

root is the root of the cluster hierarchy. We can print it out using a simple recursive function.

part of hclust_art.py

def printClustering(cluster):
    if cluster.branches:
        return "(%s%s)" % (printClustering(cluster.left), printClustering(cluster.right))
    else:
        return `cluster[0]`

The output is not exactly nice, but it will have to do. Our clustering, printed by calling printClustering(root), looks like this: (((04)((57)(89)))((1(26))3)). The elements are separated into two groups, the first containing elements 0, 4, 5, 7, 8, 9, and the second 1, 2, 6, 3. The distance between them equals root.height, 9.0 in our case. The first cluster is further divided into 0 and 4 in one subcluster, and 5, 7, 8, 9 in the other...
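To make the correspondence explicit, here is how the same session would continue; the values below simply restate root.height and the two halves of the printout above.

>>> print root.height
9.0
>>> print printClustering(root.left)
((04)((57)(89)))
>>> print printClustering(root.right)
((1(26))3)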

It is easy to print out the cluster's objects. Here's what's in the left subcluster of root.

>>> for el in root.left:
...     print el,
0 4 5 7 8 9

Everything that can be iterated over can also be cast into a list or tuple. Let us demonstrate this by writing a better function for printing out the clustering (which will also come in handy for something else in a while). The one above assumed that each leaf contains a single object. This is not necessarily so; instead of printing out the first (and supposedly only) element of the cluster, cluster[0], we shall print it out as a tuple.

part of hclust_art.py

def printClustering2(cluster):
    if cluster.branches:
        return "(%s%s)" % (printClustering2(cluster.left), printClustering2(cluster.right))
    else:
        return str(tuple(cluster))

The distance matrix could have been given a list of objects. We could, for instance, put

matrix.objects = ["Ann", "Bob", "Curt", "Danny", "Eve", "Fred", "Greg", "Hue", "Ivy", "Jon"]

above the call to HierarchicalClustering. (This code will actually trigger a warning; to avoid it, use matrix.setattr("objects", ["Ann", "Bob".... Why this is needed is explained in the page on Orange peculiarities.) If we've forgotten to store the objects into the matrix prior to clustering, nothing is lost. We can add them to the clustering later, by

root.mapping.objects = ["Ann", "Bob", "Curt", "Danny", "Eve", "Fred", "Greg", "Hue", "Ivy", "Jon"]

So, what do these "objects" do? Call printClustering(root) again and you'll see. Or, let us print out the elements of the first left cluster, as we did before.

>>> for el in root.left:
...     print el,
Ann Eve Fred Hue Ivy Jon

If objects are given, the cluster's elements, as obtained by indexing (e.g. root.left[2]) or by iteration, as in the above case, won't be indices but the elements we clustered. If we put an ExampleTable into objects, root.left[-1] will be the last example of the first left cluster.

Now for swapping and permutations.

>>> printClustering(root)
((('Ann''Eve')(('Fred''Hue')('Ivy''Jon')))(('Bob'('Curt''Greg'))'Danny'))
>>> root.left.swap()
>>> printClustering(root)
(((('Fred''Hue')('Ivy''Jon'))('Ann''Eve'))(('Bob'('Curt''Greg'))'Danny'))
>>> root.permute([1, 0])
>>> printClustering(root)
((('Bob'('Curt''Greg'))'Danny')((('Fred''Hue')('Ivy''Jon'))('Ann''Eve')))

Calling root.left.swap reversed the order of subclusters of root.left, and root.permute([1, 0]) (which is equivalent to root.swap - there aren't many possible permutations of two elements) reversed the order of root.left and root.right.

Let us now write a function for cluster pruning.

part of hclust_art.py

def prune(cluster, togo):
    if cluster.branches:
        if togo < 0:
            cluster.branches = None
        else:
            for branch in cluster.branches:
                prune(branch, togo - cluster.height)

We shall use printClustering2 here, since we can have multiple elements in a leaf of the clustering hierarchy.

>>> prune(root, 9)
>>> print printClustering2(root)
((('Bob', 'Curt', 'Greg')('Danny',))(('Fred', 'Hue', 'Ivy', 'Jon')('Ann', 'Eve')))

We've ended up with four clusters. Need a list of clusters? Here's the function.

part of hclust_art.py

def listOfClusters0(cluster, alist):
    if not cluster.branches:
        alist.append(list(cluster))
    else:
        for branch in cluster.branches:
            listOfClusters0(branch, alist)

def listOfClusters(root):
    l = []
    listOfClusters0(root, l)
    return l

The function returns a list of lists, in our case [['Bob', 'Curt', 'Greg'], ['Danny'], ['Fred', 'Hue', 'Ivy', 'Jon'], ['Ann', 'Eve']]. If there were no objects, the list would contain indices instead of names.

Example 2: Clustering of examples

The most common things to cluster are certainly examples. To show how this is done, we shall now load the Iris data set, initialize a distance matrix with the distances measured by ExamplesDistance_Euclidean, and cluster it with average linkage. Since we don't need the matrix afterwards, we shall let the clustering overwrite it (not that this is needed for a data set as small as Iris).

part of hclust.py (uses iris.tab)

import orange

data = orange.ExampleTable("iris")
matrix = orange.SymMatrix(len(data))
matrix.setattr("objects", data)
distance = orange.ExamplesDistanceConstructor_Euclidean(data)
for i1, ex1 in enumerate(data):
    for i2 in range(i1+1, len(data)):
        matrix[i1, i2] = distance(ex1, data[i2])

clustering = orange.HierarchicalClustering()
clustering.linkage = clustering.Average
clustering.overwriteMatrix = 1
root = clustering(matrix)

Note that we haven't forgotten to set the matrix.objects. We did it through matrix.setattr to avoid the warning. Let us now prune the clustering using the function we've written above, and print out the clusters.

part of hclust.py (uses iris.tab)

prune(root, 1.4)
for n, cluster in enumerate(listOfClusters(root)):
    print "\n\n*** Cluster %i ***\n" % n
    for ex in cluster:
        print ex

Since the printout is pretty long, it might be more informative to just print out the class distributions for each cluster.

part of hclust.py (uses iris.tab)

for cluster in listOfClusters(root):
    dist = orange.getClassDistribution(cluster)
    for e, d in enumerate(dist):
        print "%s: %3.0f " % (data.domain.classVar.values[e], d),
    print

Here's what it shows.

Iris-setosa:  49  Iris-versicolor:   0  Iris-virginica:   0
Iris-setosa:   1  Iris-versicolor:   0  Iris-virginica:   0
Iris-setosa:   0  Iris-versicolor:  50  Iris-virginica:  17
Iris-setosa:   0  Iris-versicolor:   0  Iris-virginica:  33

Note something else: listOfClusters does not return a list of ExampleTables, but a list of lists of examples. Therefore, in the above script, cluster is a list of examples, not an ExampleTable, but it gets converted to one automatically when the function (here getClassDistribution) is called. Most Orange functions will do this for you automatically. You can, for instance, call a learning algorithm, passing a cluster as an argument. It won't mind. If you, however, want to have a list of tables, you can easily convert the list by

tables = [orange.ExampleTable(cluster) for cluster in listOfClusters(root)]
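To illustrate the remark about learning algorithms above, here is a minimal sketch; BayesLearner is just one arbitrary choice of learner, and the snippet assumes the pruned iris clustering from this example.

learner = orange.BayesLearner()
classifier = learner(listOfClusters(root)[0])   # a plain list of examples is accepted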

Finally, if you are dealing with examples, you may want to take the function listOfClusters and replace alist.append(list(cluster)) with alist.append(orange.ExampleTable(cluster)). This function is less general; it will fail if the objects are not of type Example. However, instead of a list of lists, it will return a list of example tables.
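A sketch of such a variant, with the names listOfExampleTables0 and listOfExampleTables chosen here only for illustration:

def listOfExampleTables0(cluster, alist):
    if not cluster.branches:
        # store each leaf cluster as an ExampleTable instead of a plain list
        alist.append(orange.ExampleTable(cluster))
    else:
        for branch in cluster.branches:
            listOfExampleTables0(branch, alist)

def listOfExampleTables(root):
    l = []
    listOfExampleTables0(root, l)
    return l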

How is the data in HierarchicalCluster really stored?

To demonstrate how the data in clusters is stored, we shall continue with the clustering we got in the first example.

>>> del root.mapping.objects
>>> print printClustering(root)
(((1(26))3)(((57)(89))(04)))
>>> print root.mapping
<1, 2, 6, 3, 5, 7, 8, 9, 0, 4>
>>> print root.left.first
0
>>> print root.left.last
4
>>> print root.left.mapping[root.left.first:root.left.last]
<1, 2, 6, 3>
>>> print root.left.left.first
0
>>> print root.left.left.last
3

We removed objects just to see more clearly what is going on. mapping is an ordered list of indices to the rows/columns of the distance matrix (and, at the same time, indices into objects, if they exist). Each cluster's fields first and last are indices into mapping, so the cluster's elements are actually cluster.mapping[cluster.first:cluster.last]. cluster[i] therefore returns cluster.mapping[cluster.first+i] or, if objects are specified, cluster.objects[cluster.mapping[cluster.first+i]]. Space consumption is minimal since all clusters share the same mapping and objects.

Subclusters are ordered so that cluster.left.last always equals cluster.right.first or, in general, cluster.branches[i].last equals cluster.branches[i+1].first.
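Both relations are easy to check; the following lines are our own continuation of the session above.

>>> print root.left[1], root.mapping[root.left.first + 1]
2 2
>>> print root.left.last, root.right.first
4 4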

Swapping and permutation do three things: change the order of elements in branches, permute the corresponding regions in mapping and adjust the first and last for all the clusters below. For the latter, when subclusters of cluster are permuted, the entire subtree starting at cluster.branches[i] is moved by the same offset.

The hierarchy of objects that represent a clustering is open; everything is accessible from Python. You can write your own clustering algorithms that build this same structure, or you can use Orange's clustering and then do with the structure anything you want - for instance, prune it, as we have shown earlier. However, it is easy to do things wrong: shuffle the mapping, for instance, and forget to adjust the first and last pointers. Orange does some checking for internal consistency, but you are surely smarter and can find a way to crash it. For instance, just create a cycle in the structure, call swap for some cluster above the cycle and you're there. But don't blame it on me then.