orngClustering: Partitional and Agglomerative Clustering

The module implements k-means partitional clustering and provides a wrapper around Orange's implementation of agglomerative hierarchical clustering. It also implements a number of useful functions associated with these two clustering methods, such as leaf ordering of the dendrogram and dendrogram plotting.

KMeans

Class KMeans provides an implementation of the standard k-means clustering algorithm:

  1. Choose the number of clusters, k.
  2. Choose a set of k initial centroids.
  3. Assign each instance in the data set to the closest centroid.
  4. For each cluster, compute a new centroid as the center of the data instances assigned to it.
  5. Repeat the previous two steps, until some convergence criterion is met (e.g., the cluster assignment has not changed).

The main advantages of the algorithm are its simplicity and low memory requirements. The principal disadvantage is the dependence of the results on the selection of the initial set of centroids.
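The iteration in steps 3 to 5 can be sketched in a few lines of plain Python. The sketch below is for illustration only and is not the module's implementation; it assumes the data is given as a list of numeric feature vectors, and kmeans_sketch is a hypothetical helper name:

import random

def kmeans_sketch(points, k, max_iters=100):
    """Illustrative k-means on a list of numeric feature vectors (not the module's code)."""
    centroids = [list(c) for c in random.sample(points, k)]  # step 2: random initial centroids
    clusters = None
    for _ in range(max_iters):                               # step 5: iterate until convergence
        # step 3: assign each instance to the closest centroid (squared Euclidean distance)
        new_clusters = [min(range(k),
                            key=lambda i: sum((a - b) ** 2 for a, b in zip(x, centroids[i])))
                        for x in points]
        if new_clusters == clusters:                         # stop: no instance changed its cluster
            break
        clusters = new_clusters
        # step 4: recompute each centroid as the mean of its assigned instances
        for i in range(k):
            members = [x for x, c in zip(points, clusters) if c == i] or [centroids[i]]
            centroids[i] = [sum(vals) / float(len(members)) for vals in zip(*members)]
    return clusters, centroids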

Methods

__init__(data=None, centroids=3, maxiters=None, minscorechange=None, stopchanges=0, nstart=1, initialization=kmeans_init_random, distance=orange.ExamplesDistanceConstructor_Euclidean, scoring=score_distance_to_centroids, inner_callback = None, outer_callback = None, initialize_only = False)
data is an Orange ExampleTable object that stores the data instances to be clustered. If data is not None, clustering is executed immediately after the initialization of the clustering parameters, unless initialize_only is set to True. centroids either specifies the number of clusters or provides a list of examples that will serve as clustering centroids. The clustering stops when one of the following conditions is met: the number of clustering iterations exceeds maxiters, the number of instances changing the cluster equals stopchanges, or the score associated with the current clustering improves by less than minscorechange of the score from the previous iteration. If minscorechange is not set, the score is not computed between iterations. The user can also provide an example distance constructor, which, given the data set, returns a function that measures the distance between two data instances (see Distances between Examples). initialization is a function that selects the centroids given the table of data instances, k and an example distance function; the module includes implementations of several different approaches. scoring is a function that takes a clustering object as an argument and returns a score for the clustering. It can be used, for instance, in a procedure that repeats the clustering nstart times and returns the clustering with the lowest score. The two callbacks are invoked either after every clustering iteration (inner_callback) or after every clustering restart (when nstart is greater than 1, outer_callback).
runone()
Runs one clustering iteration, starting with the re-computation of centroids, followed by the computation of data membership (associating data instances with their nearest centroid).
run()
Runs clustering until the convergence conditions are met. If nstart is greater than one, nstart runs of the clustering algorithm will be executed, returning the clustering with the best (lowest) score.
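For example, the clustering can be set up without being executed and then run explicitly. A minimal sketch, assuming iris.tab; minscorechange=0 is set only so that km.score is available afterwards, as in the callback example further below:

import orange
import orngClustering

data = orange.ExampleTable("iris")
# prepare the clustering without running it, then execute it explicitly
km = orngClustering.KMeans(data, centroids=3, minscorechange=0, initialize_only=True)
km.run()
print "converged after %d iterations, score %.4f" % (km.iteration, km.score)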

Attributes

k
Number of clusters.
centroids
Current set of centroids.
score
Current clustering score (the value accessed as km.score in the callback examples below).
iteration
Current clustering iteration.
clusters
A list of cluster indices. The i-th element is the index of the centroid associated with the i-th data instance from the input data set.

The following code runs k-means clustering on Iris data set and prints out the cluster indexes for the last 10 data instances:

part of kmeans-run.py (uses iris.tab)

data = orange.ExampleTable("iris") km = orngClustering.KMeans(data, 3) print km.clusters[-10:]

The output of this code is:

[1, 1, 2, 1, 1, 1, 2, 1, 1, 2]

Invoking a callback function may be useful when tracing the progress of the clustering. Below is code that uses inner_callback to report the number of instances that have changed their cluster and the current clustering score. For the score to be computed at each iteration, we have to set minscorechange, but we can leave it at 0 (or even set it to a negative value, which allows the score to deteriorate by some amount).

part of kmeans-run-callback.py (uses iris.tab)

import orange
import orngClustering

def callback(km):
    print "Iteration: %d, changes: %d, score: %.4f" % (km.iteration, km.nchanges, km.score)

data = orange.ExampleTable("iris")
km = orngClustering.KMeans(data, 3, minscorechange=0, inner_callback=callback)

The convergence on Iris data set is fast:

Iteration: 1, changes: 150, score: 10.9555
Iteration: 2, changes: 12, score: 10.3867
Iteration: 3, changes: 2, score: 10.2034
Iteration: 4, changes: 2, score: 10.0699
Iteration: 5, changes: 2, score: 9.9542
Iteration: 6, changes: 1, score: 9.9168
Iteration: 7, changes: 2, score: 9.8624
Iteration: 8, changes: 0, score: 9.8624

The callback above is used to report progress, but it may as well call a function that plots a selected two-attribute projection of the data with the corresponding centroids at a given step of the clustering. This is exactly what the following script does:

part of kmeans-trace.py (uses iris.tab)

import random
import orange
import orngClustering
import pylab

def plot_scatter(data, km, attx, atty, filename="kmeans-scatter", title=None):
    """plot a data scatter plot with the positions of the centroids"""
    pylab.rcParams.update({'font.size': 8, 'figure.figsize': [4, 3]})
    x = [float(d[attx]) for d in data]
    y = [float(d[atty]) for d in data]
    colors = ["c", "w", "b"]
    cs = "".join([colors[c] for c in km.clusters])
    pylab.scatter(x, y, c=cs, s=10)
    xc = [float(d[attx]) for d in km.centroids]
    yc = [float(d[atty]) for d in km.centroids]
    pylab.scatter(xc, yc, marker="x", c="k", s=200)
    pylab.xlabel(attx)
    pylab.ylabel(atty)
    if title:
        pylab.title(title)
    pylab.savefig("%s-%03d.png" % (filename, km.iteration))
    pylab.close()

def in_callback(km):
    print "Iteration: %d, changes: %d, score: %8.6f" % (km.iteration, km.nchanges, km.score)
    plot_scatter(data, km, "petal width", "petal length", title="Iteration %d" % km.iteration)

data = orange.ExampleTable("iris")
random.seed(42)
km = orngClustering.KMeans(data, 3, minscorechange=0, maxiters=10, inner_callback=in_callback)

Only the first four scatterplots are shown below. The colors of the data instances indicate cluster membership. Notice that since the Iris data set includes four attributes, the closest centroid in a particular two-dimensional projection is not necessarily the centroid of the cluster that the data point belongs to.

k-Means Utility Functions

kmeans_init_random(data, k, _)
A function that can be used for the initialization of k-means clustering; it returns k randomly chosen data instances from the data set. This type of initialization is also known as Forgy's initialization (Forgy, 1965; He et al., 2004).
kmeans_init_diversity(data, k, distfun)
Another function that can be used for the initialization of k-means clustering. Given the data set, the number of clusters and a distance function, it returns a set of centroids where the first one is the data point farthest away from the center of the data, and each subsequent centroid is the data point whose minimal distance to the previously selected centroids is maximal. This type of initialization is almost identical to the initialization proposed by Katsavounidis et al. (1994), with the only difference being in the selection of the first centroid (they use the data instance with the biggest norm).
KMeans_init_hierarchicalClustering(n=100)
Is actually a class that returns a clustering initialization function which, given the data, k, and a distance function, samples n data instances, performs hierarchical clustering on them, uses it to infer k clusters, and computes a list of cluster-based data centers.
data_center(data)
Returns the center of the instances in the data set (the average across data instances for continuous attributes, the most frequent value for discrete attributes).
score_distance_to_centroids(kmeans)
Returns an average distance of data instances to their associated centroids. kmeans is a k-means clustering object.
score_silhouette(kmeans, index=None)
Returns an average silhouette score of data instances. If index is specified it instead returns just the silhouette score of that particular data instance. kmeans is a k-means clustering object.
score_fastsilhouette(kmeans, index=None)
Same as score_silhouette, but computes an approximation and is faster.
plot_silhouette(kmeans, filename='tmp.png', fast=False)
Saves a silhouette plot to filename, showing the distributions of silhouette scores in clusters. kmeans is a k-means clustering object. If fast is True, score_fastsilhouette is used to compute the scores instead of score_silhouette.
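Since scoring accepts any function that maps a clustering object to a number, and run() keeps the restart with the lowest score, a custom score can be plugged in. Below is a minimal sketch (assuming iris.tab) that negates the silhouette score so that restarts with higher silhouette win; negative_silhouette is a hypothetical helper defined only for this example:

import orange
import orngClustering

def negative_silhouette(km):
    # KMeans treats lower scores as better, so negate the silhouette
    return -orngClustering.score_silhouette(km)

data = orange.ExampleTable("iris")
km = orngClustering.KMeans(data, 3, nstart=10, scoring=negative_silhouette)
print orngClustering.score_silhouette(km)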

Typically, the choice of seeds has a large impact on k-means clustering, with better initialization methods yielding clusterings that converge faster and to better centroids. The following code compares three different initialization methods (random, diversity-based and hierarchical clustering-based) in terms of how fast they converge:

part of kmeans-cmp-init.py (uses iris.tab, housing.tab, vehicle.tab)

import orange
import orngClustering
import random

data_names = ["iris", "housing", "vehicle"]
data_sets = [orange.ExampleTable(name) for name in data_names]

print "%10s %3s %3s %3s" % ("", "Rnd", "Div", "HC")
for data, name in zip(data_sets, data_names):
    random.seed(42)
    km_random = orngClustering.KMeans(data, centroids=3)
    km_diversity = orngClustering.KMeans(data, centroids=3,
        initialization=orngClustering.kmeans_init_diversity)
    km_hc = orngClustering.KMeans(data, centroids=3,
        initialization=orngClustering.KMeans_init_hierarchicalClustering(n=100))
    print "%10s %3d %3d %3d" % (name, km_random.iteration, km_diversity.iteration, km_hc.iteration)

The results show that diversity-based and clustering-based initialization make k-means converge faster than random selection of seeds (as expected):

           Rnd Div  HC
      iris  12   3   4
   housing  14   6   4
   vehicle  11   4   3

The following code computes the silhouette score for clusterings with k from 2 to 7, and at the end plots a silhouette plot for k=3.

kmeans-silhouette.py (uses iris.tab)

import orange
import orngClustering

data = orange.ExampleTable("iris")
for k in range(2, 8):
    km = orngClustering.KMeans(data, k, initialization=orngClustering.kmeans_init_diversity)
    score = orngClustering.score_silhouette(km)
    print k, score

km = orngClustering.KMeans(data, 3, initialization=orngClustering.kmeans_init_diversity)
orngClustering.plot_silhouette(km, "kmeans-silhouette.png")

The analysis suggests that the clustering with k=2 is preferred, as it yields the maximal silhouette coefficient:

2 0.629467553352
3 0.504318855054
4 0.407259377854
5 0.358628975081
6 0.353228492088
7 0.366357876944
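If the choice of k is to be automated, the loop above only needs to keep the scores and pick the k with the highest silhouette; a minimal sketch under the same assumptions:

import orange
import orngClustering

data = orange.ExampleTable("iris")
scores = {}
for k in range(2, 8):
    km = orngClustering.KMeans(data, k, initialization=orngClustering.kmeans_init_diversity)
    scores[k] = orngClustering.score_silhouette(km)
best_k = max(scores, key=scores.get)  # k with the highest average silhouette
print "best k: %d (silhouette %.3f)" % (best_k, scores[best_k])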

Silhouette plot for k=3 is given below:

Hierarchical Clustering

hierarchicalClustering(data, distanceConstructor=orange.ExamplesDistanceConstructor_Euclidean, linkage=orange.HierarchicalClustering.Average, order=False, progressCallback=None)
Returns an object (the root node of the clustering tree) describing a hierarchical clustering of the given data set. This is a wrapper function around the HierarchicalClustering class implemented in Orange. hierarchicalClustering uses the distanceConstructor function (see Distances between Examples) to construct a distance matrix, which is then passed to Orange's hierarchical clustering algorithm, along with the chosen linkage method. Ordering of leaves can be requested (order=True); if so, the leaves are ordered using the orderLeaves function (see below).
orderLeaves(root, distanceMatrix)
Given a hierarchical clustering object (the root node of the tree) and a distance matrix, the function uses the fast optimal leaf ordering by Bar-Joseph et al. (2001) to impose an order on the branches of the dendrogram so that the distance between neighboring leaves is minimized.
hierarchicalClustering_topClusters(root, k)
Returns k topmost clusters (top k nodes of the clustering tree) from hierarchical clustering.
hierarhicalClustering_topClustersMembership(root, k)
Returns a list of cluster indices indicating, for each data instance, which of the top k clusters it belongs to.
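For example, the clustering tree can be built with a non-default linkage method and with ordered leaves, and then cut into a fixed number of clusters. A minimal sketch, assuming iris.tab; the use of orange.HierarchicalClustering.Complete and of len() on cluster nodes to obtain their size are assumptions of this sketch:

import orange
import orngClustering

data = orange.ExampleTable("iris")
# complete-linkage clustering with optimally ordered leaves
root = orngClustering.hierarchicalClustering(data,
    linkage=orange.HierarchicalClustering.Complete, order=True)
top = orngClustering.hierarchicalClustering_topClusters(root, 3)
print [len(c) for c in top]  # number of data instances in each of the three top clusters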

Using hierarchicalClustering, scripts need a single line of code to invoke the clustering and obtain the object with the result. This is demonstrated in the following script, which takes the Iris data set, performs hierarchical clustering, and then plots the data in a two-attribute projection, coloring the points that represent data instances according to their cluster membership.

part of hclust-iris.py (uses iris.tab)

import orange
import orngClustering
import pylab

def plot_scatter(data, cls, attx, atty, filename="hclust-scatter", title=None):
    """plot a data scatter plot, with the points colored by cluster membership"""
    pylab.rcParams.update({'font.size': 8, 'figure.figsize': [4, 3]})
    x = [float(d[attx]) for d in data]
    y = [float(d[atty]) for d in data]
    colors = ["c", "w", "b"]
    cs = "".join([colors[c] for c in cls])
    pylab.scatter(x, y, c=cs, s=10)
    pylab.xlabel(attx)
    pylab.ylabel(atty)
    if title:
        pylab.title(title)
    pylab.savefig("%s.png" % filename)
    pylab.close()

data = orange.ExampleTable("iris")
root = orngClustering.hierarchicalClustering(data)
n = 3
cls = orngClustering.hierarhicalClustering_topClustersMembership(root, n)
plot_scatter(data, cls, "sepal width", "sepal length",
    title="Hierarchical clustering (%d clusters)" % n)

The output of the script is the following plot:

DendrogramPlot

Class DendrogramPlot implements the visualization of the clustering tree (called a dendrogram) and the corresponding visualization of the attribute heat map.

Methods

__init__(tree, attr_tree=None, labels=None, data=None, width=None, height=None, tree_height=None, text_width=None, heatmap_width=None, spacing=2, cluster_colors={}, color_palette=ColorPalette([(255, 0, 0), (0, 255, 0)]), maxv=None, minv=None, renderer=EPSRenderer)
tree is an Orange hierarchical clustering tree object (a root node); attr_tree is an optional attribute clustering root node. If data (an Orange ExampleTable) is given, then the dendrogram will include a heat map with a color-based presentation of attribute values. The length of the data set should match the number of leaves in the hierarchical clustering tree. The arguments that follow define the height and width of the plot areas. Branches are plotted in black, but may be colored to visually expose various clusters. The coloring is specified with cluster_colors, a dictionary with cluster instances as keys and (r, g, b) tuples as values. color_palette specifies the palette to use for the heat map, and minv and maxv the minimum and maximum data values to plot.
set_matrix_color_schema(color_palette, minv, maxv)
Sets the heat map color schema. color_palette can be an instance of the ColorPalette class or a list of (r, g, b) tuples. minv and maxv specify the cutoff values for the heat map (values below and above the interval will be painted with color_palette.underflow and color_palette.overflow, respectively).
plot(filename="graph.eps")
Plots the dendrogram and saves it to the output file.
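The class can also be used directly, as sketched below on a sample of the Iris data set; the color values, cutoffs and output file name are arbitrary choices for illustration. In most cases the dendrogram_draw convenience function described next is simpler:

import orange
import orngClustering

data = orange.ExampleTable("iris")
sample = data.selectref(orange.MakeRandomIndices2(data, 20), 0)
root = orngClustering.hierarchicalClustering(sample)

# build the plot object, adjust the heat map color schema, then render to EPS
d = orngClustering.DendrogramPlot(root, data=sample,
    labels=[str(e.getclass()) for e in sample])
d.set_matrix_color_schema([(0, 0, 255), (255, 255, 255), (255, 0, 0)], minv=0.0, maxv=8.0)
d.plot("hclust-class-dendrogram.eps")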

Additionally, a module-level convenience function dendrogram_draw is provided to streamline the drawing process.

dendrogram_draw(filename, tree, attr_tree=None, labels=None, data=None, width=None, height=None, tree_height=None, text_width=None, heatmap_width=None, spacing=2, cluster_colors={}, color_palette=ColorPalette([(255, 0, 0), (0, 255, 0)]), maxv=None, minv=None)
Draws the dendrogram to filename (supported formats: PNG, EPS and SVG).

To illustrate dendrogram plotting, the following script applies dendrogram_draw to a subset of 20 instances from the Iris data set. The values of the class variable are used to label the leaves (and, of course, are not used for the clustering; only the non-class attributes are used to compute the instance distance matrix).

part of hclust-dendrogram.py (uses iris.tab)

data = orange.ExampleTable("iris") sample = data.selectref(orange.MakeRandomIndices2(data, 20), 0) root = orngClustering.hierarchicalClustering(sample) orngClustering.dendrogram_draw("hclust-dendrogram.png", root, data=sample, labels=[str(d.getclass()) for d in sample])

The resulting dendrogram is shown below.

The following script is similar to the one above, but this time we have (1) distinctively colored the three topmost dendrogram branches, (2) used a custom color schema for the representation of attribute values (spanning green - black - red, with custom gamma, minv and maxv), and (3) included only two attributes in the heat map presentation (note: clustering is still done on all of the data set's attributes).

part of hclust-colored-dendrogram.py (uses iris.tab)

data = orange.ExampleTable("iris") sample = data.selectref(orange.MakeRandomIndices2(data, 20), 0) root = orngClustering.hierarchicalClustering(sample) reduced = orange.ExampleTable(orange.Domain(sample.domain[:2], False), sample) my_colors = [(255,0,0), (0,255,0), (0,0,255)] cls = orngClustering.hierarchicalClustering_topClusters(root, 3) colors = dict([(cl, col) for cl, col in zip(cls, my_colors)]) orngClustering.dendrogram_draw("hclust-colored-dendrogram.png", root, data = reduced, labels=[str(d.getclass()) for d in sample], cluster_colors=colors, color_palette=[(0, 255, 0), (0, 0, 0), (255, 0, 0)], gamma=0.5, minv=2.0, maxv=7.0)

Our "colored" dendrogram is now saved as shown in the figure below:

References

Forgy E (1965) Cluster analysis of multivariate data: Efficiency versus interpretability of classification. Biometrics 21(3): 768-769.

He J, Lan M, Tan C-L, Sung S-Y, Low H-B (2004) Initialization of cluster refinement algorithms: A review and comparative study. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), pages 297-302, Budapest, Hungary.

Katsavounidis I, Jay C, Zhang Z (1994) A new initialization technique for generalized Lloyd iteration. IEEE Signal Processing Letters 1(10): 144-146.

Bar-Joseph Z, Gifford DK, Jaakkola TS (2001) Fast optimal leaf ordering for hierarchical clustering. Bioinformatics 17(Suppl. 1): S22-S29.