Core Orange only provides the method for hierarchical clustering, encapsulated in the class HierarchicalClustering, which works on a distance matrix stored as a SymMatrix. The method runs in approximately O(n²) time (with the worst case O(n³)). For orientation, clustering ten thousand elements should take roughly 15 seconds on a 2 GHz computer. The algorithm can either make a copy of the distance matrix and work on that, or work on the original distance matrix, destroying it in the process. The latter is useful for clustering a larger number of objects. Since the distance matrix stores (n+1)(n+2)/2 floats (about 2 MB for 1000 objects and 200 MB for 10000, assuming a float takes 4 bytes), copying it would quickly exhaust physical memory. Using virtual memory is not an option since the matrix is accessed in a random manner.
The distance matrix should contain no negative elements. This limitation is due to implementation details of the algorithm (it is not strictly necessary and can be lifted in future versions if often requested; it only helps the algorithm run a bit faster). The elements on the diagonal (representing an element's distance from itself) are ignored.
The distance matrix can have the attribute objects describing the objects we are clustering (this is available only in Python). This can be any sequence of the same length as the matrix: an ExampleTable, a list of examples, a list of attributes (if you're clustering attributes), or even a string of the correct length. This attribute is not used in clustering but is only passed to the clusters' attribute mapping (see below), which will hold a reference to it (if you modify the list, the changes will affect the clusters as well).
Attributes of HierarchicalClustering

linkage specifies how the distance between two groups is measured. It can be set to:
- HierarchicalClustering.Single (default), where the distance between groups is defined as the distance between the closest pair of objects, one from each group,
- HierarchicalClustering.Average, where the distance between groups is defined as the average distance between pairs of objects, one from each group,
- HierarchicalClustering.Complete, where the distance between groups is defined as the distance between the most distant pair of objects, one from each group. Complete linkage is also called farthest neighbor,
- HierarchicalClustering.Ward, which uses Ward's distance.

progressCallback is a callback function (None by default). It can be any function or callable class in Python which accepts a single float as an argument. The function only gets called if the number of objects being clustered is at least 1000. It will be called 101 times, and the argument will give the proportion of the work done. The time intervals between the function calls won't be equal (sorry about that...) since the clustering proceeds faster as the number of clusters decreases.

HierarchicalClustering is called with a distance matrix as an argument. It returns an instance of HierarchicalCluster representing the root of the hierarchy.
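For illustration, here is a minimal sketch of such a call. The linkage and progressCallback keyword arguments correspond to the attributes described above; the callback name and the matrix variable are placeholders for this sketch.

    import orange

    def report_progress(fraction):
        # Receives the proportion of work done so far; only called when
        # clustering at least 1000 objects, roughly 101 times in total.
        print "%3.0f%% done" % (100 * fraction)

    # 'matrix' is assumed to be an orange.SymMatrix of pairwise distances
    # (the examples below show how to construct one).
    root = orange.HierarchicalClustering(matrix,
        linkage=orange.HierarchicalClustering.Average,
        progressCallback=report_progress)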
Attributes of HierarchicalCluster

branches is a list of subclusters, or None if the node is a leaf (a single element). The list can contain more than two subclusters. (HierarchicalClustering never produces such clusters; this is here for any future extensions.)

left and right are the first and the last branch; height is the distance between the subclusters at the point when HierarchicalClustering was used for constructing the cluster.

mapping is a list of indices to the distance matrix. It is the same for all clusters in the hierarchy; it simply represents the indices ordered according to the clustering. first and last are indices into mapping, delimiting the elements of mapping that belong to that cluster. (Seems weird, but it is trivial; wait for the examples. On the other hand, you probably won't need to understand this anyway.)

If the distance matrix had an attribute objects defined, it is copied to mapping.
Methods

The number of elements in the cluster, len(cluster), equals last-first. cluster[2] gives the third element of the cluster, and list(cluster) will return the cluster elements as a list. The cluster elements are read-only. To actually modify them, you'll have to go through mapping, as described below. This is intentionally complicated to discourage a naive user from doing what he does not understand.

swap reverses the order of the two subclusters. It changes the mapping and the first and last of all clusters below this one and thus needs O(len(cluster)) time.

permute(permutation) reorders the subclusters according to the given permutation. Like swap, this function changes the mapping and first and last of all clusters below this one.

Let us construct a simple distance matrix and run clustering on it.
part of hclust_art.py
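A sketch of what this part of the script might look like. The distances below are illustrative, given as the lower triangle of a symmetric matrix, so the output may differ from the clustering quoted in the text.

    import orange

    # Made-up distances between ten objects, given as the lower triangle
    # of a symmetric matrix; all distances are non-negative, as required.
    m = [[],
         [3],
         [2, 4],
         [17, 5, 4],
         [2, 8, 3, 8],
         [7, 5, 10, 11, 2],
         [8, 4, 1, 5, 11, 13],
         [4, 7, 12, 8, 10, 1, 5],
         [13, 9, 14, 15, 7, 8, 4, 6],
         [12, 10, 11, 15, 2, 5, 7, 3, 1]]
    matrix = orange.SymMatrix(m)

    # Single linkage is used by default; set the linkage attribute to change it.
    root = orange.HierarchicalClustering(matrix)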
root is the root of the cluster hierarchy. We can print it out using a simple recursive function.
part of hclust_art.py
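A sketch of such a function, assuming each internal node has exactly two branches (which is what HierarchicalClustering produces):

    def printClustering(cluster):
        # Leaves print their single element; internal nodes wrap their two
        # subclusters in parentheses, yielding strings like (((04)...).
        if cluster.branches:
            return "(%s%s)" % (printClustering(cluster.left),
                               printClustering(cluster.right))
        else:
            return str(cluster[0])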
The output is not exactly nice, but it will have to do. Our clustering, printed by calling printClustering(root), looks like this: (((04)((57)(89)))((1(26))3)). The elements are separated into two groups, the first containing elements 0, 4, 5, 7, 8, 9, and the second 1, 2, 6, 3. The distance between the two groups equals root.height, 9.0 in our case. The first group is further divided into 0 and 4 in one subcluster, and 5, 7, 8, 9 in the other...
It is easy to print out the cluster's objects. Here's what's in the left subcluster of root
.
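For instance (a sketch; without objects attached, the elements are indices into the distance matrix):

    for el in root.left:
        print el,     # prints the indices of the elements in root's left subcluster
    print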
Everything that can be iterated over can also be cast into a list or tuple. Let us demonstrate this by writing a better function for printing out the clustering (which will also come in handy for something else in a while). The one above assumed that each leaf contains a single object. This is not necessarily so; instead of printing out the first (and supposedly the only) element of the cluster, cluster[0], we shall print it out as a tuple.
part of hclust_art.py
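A sketch of the improved function; the only change is that a leaf is printed as a tuple of all its elements:

    def printClustering2(cluster):
        if cluster.branches:
            return "(%s%s)" % (printClustering2(cluster.left),
                               printClustering2(cluster.right))
        else:
            return str(tuple(cluster))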
The distance matrix could have been given a list of objects. We could, for instance, assign a list of names to matrix.objects before calling HierarchicalClustering. (This code will actually trigger a warning; to avoid it, use matrix.setattr("objects", ["Ann", "Bob"... . Why this is needed is explained in the page on Orange peculiarities.) If we've forgotten to store the objects into the matrix prior to clustering, nothing is lost. We can add them to the clustering later, as sketched below.
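A sketch of both ways of doing this. The names are the ones that appear in the pruning example below; setting them on root.mapping after clustering is an assumption about where the shared reference is kept.

    names = ["Ann", "Bob", "Curt", "Danny", "Eve",
             "Fred", "Greg", "Hue", "Ivy", "Jon"]

    # Before clustering: a plain "matrix.objects = names" would work but
    # triggers the warning mentioned above, so setattr is used instead.
    matrix.setattr("objects", names)

    # After clustering (if we forgot to set them on the matrix), the names can
    # presumably be attached through the root cluster's shared mapping:
    # root.mapping.setattr("objects", names)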
So, what do these "objects" do? Call printClustering(root)
again and you'll see. Or, let us print out the elements of the first left cluster, as we did before.
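The same loop as before (a sketch); with objects attached, iteration yields the names instead of the indices:

    for el in root.left:
        print el,     # now prints names rather than matrix indices
    print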
If objects are given, the cluster's elements, as obtained by indexing (e.g. root.left[2]) or by iteration, as in the above case, won't be indices but the elements we clustered. If we put an ExampleTable into objects, root.left[-1] will be the last example of root's left cluster.
Now for swapping and permutations.
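A sketch of the calls, printing the clustering before and after each one so the reordering is visible:

    print printClustering(root)
    root.left.swap()              # reverse the two subclusters of root.left
    print printClustering(root)
    root.permute([1, 0])          # equivalent to root.swap(): exchange left and right
    print printClustering(root)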
Calling root.left.swap reverses the order of subclusters of root.left, and root.permute([1, 0]) (which is equivalent to root.swap; there aren't many possible permutations of two elements) reverses the order of root.left and root.right.
Let us write a function for cluster pruning.
part of hclust_art.py
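A sketch of one possible pruning function: it collapses every subcluster that was formed below a given height into a leaf, so a leaf can afterwards contain several elements. The exact criterion used in the original script may differ.

    def prune(cluster, height):
        # Drop the branches of any cluster merged below the threshold; its
        # elements remain accessible through mapping, first and last.
        if cluster.branches:
            if cluster.height < height:
                cluster.branches = None
            else:
                for branch in cluster.branches:
                    prune(branch, height)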
We shall use printClustering2
here, since we can have multiple elements in a leaf of the clustering hierarchy.
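For instance (the threshold is purely illustrative):

    prune(root, 5)
    print printClustering2(root)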
We've ended up with four clusters. Need a list of clusters? Here's the function.
part of hclust_art.py
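A sketch consistent with the description in the text; note the alist.append(list(cluster)) line, which is referred to again below. The helper name listOfClusters0 is just for this sketch.

    def listOfClusters0(cluster, alist):
        # Every leaf of the (possibly pruned) hierarchy contributes one cluster.
        if not cluster.branches:
            alist.append(list(cluster))
        else:
            for branch in cluster.branches:
                listOfClusters0(branch, alist)

    def listOfClusters(root):
        alist = []
        listOfClusters0(root, alist)
        return alist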
The function returns a list of lists, in our case [['Bob', 'Curt', 'Greg'], ['Danny'], ['Fred', 'Hue', 'Ivy', 'Jon'], ['Ann', 'Eve']]. If there were no objects, the list would contain indices instead of names.
The most common things to cluster are certainly examples. To show how this is done, we shall now load the Iris data set, initialize a distance matrix with the distances measured by ExamplesDistance_Euclidean and cluster it with average linkage. Since we won't need the matrix afterwards, we shall let the clustering overwrite it (not that this is needed for such a small data set as Iris).
part of hclust.py (uses iris.tab)
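A sketch of such a script. The distance is built through ExamplesDistanceConstructor_Euclidean, whose result is the ExamplesDistance_Euclidean mentioned above; overwriteMatrix is assumed to be the flag that lets the clustering work on (and destroy) the matrix itself.

    import orange

    data = orange.ExampleTable("iris")

    # Fill a symmetric matrix with pairwise Euclidean distances between examples.
    distance = orange.ExamplesDistanceConstructor_Euclidean(data)
    matrix = orange.SymMatrix(len(data))
    matrix.setattr("objects", data)
    for i in range(len(data)):
        for j in range(i):
            matrix[i, j] = distance(data[i], data[j])

    root = orange.HierarchicalClustering(matrix,
        linkage=orange.HierarchicalClustering.Average,
        overwriteMatrix=1)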
Note that we haven't forgotten to set matrix.objects. We did it through matrix.setattr to avoid the warning. Let us now prune the clustering using the function we've written above, and print out the clusters.
part of hclust.py (uses iris.tab)
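A sketch, reusing the prune and listOfClusters functions defined above; the pruning threshold is an arbitrary illustrative value.

    prune(root, 1.4)
    for n, cluster in enumerate(listOfClusters(root)):
        print "\n\nCluster %i\n" % n
        for ex in cluster:
            print ex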
Since the printout is pretty long, it might be more informative to just print out the class distributions for each cluster.
part of hclust.py (uses iris.tab)
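A sketch; orange.getClassDistribution is assumed to accept the plain list of examples (as explained below, such lists are converted to example tables automatically):

    for n, cluster in enumerate(listOfClusters(root)):
        print "Cluster %i: %s" % (n, orange.getClassDistribution(cluster))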
The output shows the distribution of iris classes within each of the clusters.
Note something else: listOfClusters does not return a list of ExampleTables, but a list of lists of examples. Therefore, in the above script, cluster is a list of examples, not an ExampleTable, but it gets converted automatically when the function is called. Most Orange functions will do this for you. You can, for instance, call a learning algorithm, passing a cluster as an argument; it won't mind. If you, however, want to have a list of tables, you can easily convert the list as sketched below.
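For instance, with a list comprehension over the clusters:

    tables = [orange.ExampleTable(cluster) for cluster in listOfClusters(root)]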
Finally, if you are dealing with examples, you may want to take the function listOfClusters and replace alist.append(list(cluster)) with alist.append(orange.ExampleTable(cluster)). This function is less general; it will fail if the objects are not of type Example. However, instead of a list of lists, it will return a list of example tables.
To demonstrate how the data in clusters is stored, we shall continue with the clustering we got in the first example.
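A sketch of peeking at these fields; the exact numbers depend on the data and on any swap or permute calls made above, and the clustering is assumed to have been rebuilt without objects attached, as noted next.

    print root.mapping                        # shared, ordered list of matrix indices
    print root.left.first, root.left.last     # slice of mapping covered by root.left
    print root.right.first, root.right.last   # ... and by root.right
    print list(root.left)                     # the same elements, obtained by iteration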
We removed objects just to see more clearly what is going on. mapping is an ordered list of indices to the rows/columns of the distance matrix (and, at the same time, indices into objects, if they exist). Each cluster's fields first and last are indices into mapping, so the cluster's elements are actually cluster.mapping[cluster.first:cluster.last]. cluster[i] therefore returns cluster.mapping[cluster.first+i] or, if objects are specified, cluster.objects[cluster.mapping[cluster.first+i]]. Space consumption is minimal since all clusters share the same mapping and objects.
Subclusters are ordered so that cluster.left.last
always equals cluster.right.first
or, in general, cluster.branches[i].last
equals cluster.branches[i+1].first
.
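This invariant can be checked directly on the example clustering (a sketch):

    assert root.left.last == root.right.first
    for i in range(len(root.branches) - 1):
        assert root.branches[i].last == root.branches[i + 1].first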
Swapping and permutation do three things: change the order of elements in branches
, permute the corresponding regions in mapping
and adjust the first
and last
for all the clusters below. For the latter, when subclusters of cluster
are permuted, the entire subtree starting at cluster.branches[i]
is moved by the same offset.
The hierarchy of objects that represents a clustering is open; everything is accessible from Python. You can write your own clustering algorithms that build this same structure, or you can use Orange's clustering and then do with the structure whatever you want. For instance, prune it, as we have shown earlier. However, it is easy to do things wrong: shuffle the mapping, for instance, and forget to adjust the first and last pointers. Orange does some checking of internal consistency, but you are surely smarter and can find a way to crash it. For instance, just create a cycle in the structure, call swap for some cluster above the cycle and you're there. But don't blame it on me then.