The module implements k-means partitional clustering and provides a wrapper around Orange's implementation of agglomerative hierarchical clustering. It also implements a number of useful functions associated with these two clustering methods, such as leaf ordering of the dendrogram and dendrogram plotting.
Class KMeans
The KMeans class provides an implementation of the standard k-means clustering algorithm: starting from an initial set of k centroids, the algorithm repeatedly assigns each data instance to its nearest centroid and recomputes the centroids until a stopping criterion is met.
The main advantages of the algorithm are its simplicity and low memory requirements. The principal disadvantage is that the results depend on the selection of the initial set of centroids.
Methods
data is an Orange ExampleTable object that stores the data instances to be clustered. If data is not None, clustering is executed immediately after the initialization of the clustering parameters, unless initialize_only is set to True. centroids either specifies the number of clusters or provides a list of examples that will serve as clustering centroids.

The clustering stops when one of the following conditions is met: the number of clustering iterations exceeds maxiters, the number of instances changing cluster equals stopchanges, or the score of the current clustering improves by less than minscorechange of the score from the previous iteration. If minscorechange is not set, the score is not computed between iterations.

The user can also provide an example distance constructor which, given the data set, returns a function that measures the distance between two data instances (see Distances between Examples). A function that selects centroids, given the table of data instances, k, and an example distance function, is provided by initialization; the module includes implementations of several different approaches. scoring is a function that takes a clustering object as an argument and returns a score for the clustering. It could be used, for instance, in a procedure that repeats the clustering nstart times and returns the clustering with the lowest score. The two callbacks are invoked either after every clustering iteration (inner_callback) or after every clustering restart (outer_callback, used when nstart is greater than 1). If nstart is greater than one, nstart runs of the clustering algorithm are executed and the clustering with the best (lowest) score is returned.

Attributes
The following code runs k-means clustering on Iris data set and prints out the cluster indexes for the last 10 data instances:
part of kmeans-run.py (uses iris.tab)
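The original listing is not included here; a minimal sketch of what such a script could look like, assuming the module is imported as orngClustering and the clustering object stores per-instance cluster indexes in a clusters attribute:

```python
# Hedged sketch (not the original kmeans-run.py): cluster the Iris data into
# three clusters and print the cluster indexes of the last 10 instances.
import orange
import orngClustering

data = orange.ExampleTable("iris")
km = orngClustering.KMeans(data, centroids=3)
print(km.clusters[-10:])  # .clusters is an assumed attribute
```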
The output of this code is:
Invoking a callback function may be useful for tracing the progress of the clustering. Below is code that uses an inner_callback to report the number of instances that have changed cluster and the clustering score. For the score to be computed at each iteration we have to set minscorechange, but we can leave it at 0 (or even set it to a negative value, which allows the score to deteriorate by some amount).
part of kmeans-run-callback.py (uses iris.tab)
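The listing is not reproduced here; a sketch under the assumption that the clustering object passed to the callback exposes iteration, nchanges and score attributes:

```python
# Hedged sketch (not the original kmeans-run-callback.py): report progress
# after each iteration via inner_callback.
import orange
import orngClustering

def report(km):
    # .iteration, .nchanges and .score are assumed attribute names
    print("Iteration: %d, changes: %d, score: %.4f"
          % (km.iteration, km.nchanges, km.score))

data = orange.ExampleTable("iris")
km = orngClustering.KMeans(data, centroids=3, minscorechange=0,
                           inner_callback=report)
```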
The convergence on the Iris data set is fast:
The callback above is used to report on the progress, but it could just as well call a function that plots a selected data projection with the corresponding centroids at a given step of the clustering. This is exactly what we did with the following script:
part of kmeans-trace.py (uses iris.tab)
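The script is not shown above; a rough sketch of such a tracing callback, assuming the clustering object exposes data, clusters, centroids and iteration attributes, and using matplotlib's pylab interface for plotting:

```python
# Hedged sketch (not the original kmeans-trace.py): save a scatterplot of a
# two-attribute projection, colored by cluster membership, after each iteration.
import orange
import orngClustering
import pylab

def plot_scatter(km, attx=2, atty=3):
    pylab.clf()
    colors = ["red", "green", "blue"]
    for instance, cluster in zip(km.data, km.clusters):
        pylab.plot(float(instance[attx]), float(instance[atty]),
                   "o", color=colors[cluster % len(colors)])
    for centroid in km.centroids:  # mark the current centroids
        pylab.plot(float(centroid[attx]), float(centroid[atty]), "k+", markersize=12)
    pylab.savefig("kmeans-scatter-%03d.png" % km.iteration)

data = orange.ExampleTable("iris")
km = orngClustering.KMeans(data, centroids=3, minscorechange=0,
                           inner_callback=plot_scatter)
```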
Only the first four scatterplots are shown below. Colors of the data instances indicate cluster membership. Notice that since the Iris data set includes four attributes, the closest centroid in a particular two-dimensional projection is not necessarily also the centroid of the cluster that the data point belongs to.
[Figure: scatterplots from the first four iterations of k-means clustering on the Iris data set]
The module also provides functions for centroid initialization and for scoring of the resulting clustering. The random initialization function returns k data instances from the data; this type of initialization is also known as Forgy's initialization (Forgy, 1965; He et al., 2004). The hierarchical clustering-based initialization takes a sample of n data instances, performs hierarchical clustering, uses it to infer k clusters, and computes a list of cluster-based data centers.

score_silhouette returns the average silhouette score of the data instances; if index is specified, it instead returns just the silhouette score of that particular data instance (kmeans is a k-means clustering object). A silhouette plot can also be saved to filename, showing the distributions of silhouette scores in the clusters; if fast is True, score_fastsilhouette is used to compute the scores instead of score_silhouette.

Typically, the choice of seeds has a large impact on k-means clustering: better initialization methods yield a clustering that converges faster and finds more optimal centroids. The following code compares three different initialization methods (random, diversity-based, and hierarchical clustering-based) in terms of how fast they converge:
part of kmeans-cmp-init.py (uses iris.tab, housing.tab, vehicle.tab)
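The original listing is not reproduced here; below is a rough sketch of such a comparison. The initialization function names (kmeans_init_random, kmeans_init_diversity, KMeans_init_hierarchicalClustering) and the iteration attribute are assumptions about the module's API, not confirmed by the text above.

```python
# Hedged sketch: compare how many iterations k-means needs with different
# centroid initialization methods on three data sets.
import orange
import orngClustering

initializers = [
    ("random", orngClustering.kmeans_init_random),                        # assumed name
    ("diversity", orngClustering.kmeans_init_diversity),                  # assumed name
    ("hclust", orngClustering.KMeans_init_hierarchicalClustering(n=100)), # assumed name
]

print("%10s %10s %10s %10s" % (("",) + tuple(name for name, _ in initializers)))
for table_name in ["iris", "housing", "vehicle"]:
    data = orange.ExampleTable(table_name)
    iterations = []
    for _, init in initializers:
        km = orngClustering.KMeans(data, centroids=3, minscorechange=0,
                                   initialization=init)
        iterations.append(km.iteration)  # iterations to converge (assumed attribute)
    print("%10s %10d %10d %10d" % ((table_name,) + tuple(iterations)))
```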
The results show that diversity-based and clustering-based initialization make k-means converge faster than random selection of seeds (as expected):
The following code computes the silhouette score for clusterings with k from 2 to 7, and at the end plots a silhouette plot for k=3.
kmeans-silhouette.py (uses iris.tab)
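The script itself is not included above; a minimal sketch, assuming KMeans exposes a score attribute and that the silhouette-plotting helper is named plot_silhouette:

```python
# Hedged sketch: silhouette scores for k = 2..7 and a silhouette plot for k = 3.
import orange
import orngClustering

data = orange.ExampleTable("iris")
for k in range(2, 8):
    km = orngClustering.KMeans(data, centroids=k,
                               scoring=orngClustering.score_silhouette)
    print("%d: %.4f" % (k, km.score))  # .score is an assumed attribute

km3 = orngClustering.KMeans(data, centroids=3)
orngClustering.plot_silhouette(km3, "silhouette-k3.png")  # assumed helper name
```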
The analysis suggests that clustering with k=2 is preferred, as it yields the maximal silhouette coefficient:
The silhouette plot for k=3 is given below:
Hierarchical clustering

The hierarchicalClustering function uses a distanceConstructor function (see Distances between Examples) to construct a distance matrix, which is then passed to Orange's hierarchical clustering algorithm, along with a particular linkage method. Ordering of leaves can be requested (order=True); if so, the leaves are ordered using the orderLeaves function (see below).

Using hierarchicalClustering, scripts need a single line of code to invoke the clustering and obtain the object with the result. This is demonstrated in the following script, which considers the Iris data set, performs hierarchical clustering, and then plots the data in a two-attribute projection, coloring the points that represent data instances according to cluster membership.
part of hclust-iris.py (uses iris.tab)
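The script is not included above; a sketch of what it might look like, assuming that iterating over a hierarchical cluster node yields the indices of the data instances it contains:

```python
# Hedged sketch (not the original hclust-iris.py): cluster Iris and plot a
# two-attribute projection colored by membership in the two topmost branches.
import orange
import orngClustering
import pylab

data = orange.ExampleTable("iris")
root = orngClustering.hierarchicalClustering(data)

membership = [0] * len(data)
for index in root.right:      # instances in the right topmost branch (assumed iteration)
    membership[index] = 1

colors = ["blue", "green"]
for instance, cluster in zip(data, membership):
    pylab.plot(float(instance["sepal length"]), float(instance["petal length"]),
               "o", color=colors[cluster])
pylab.savefig("hclust-iris.png")
```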
The output of the script is the following plot:
Class DendrogramPlot
The DendrogramPlot class implements visualization of the clustering tree (a dendrogram) and a corresponding heat map of attribute values.
Methods
tree is an Orange hierarchical clustering tree object (the root node), and attr_tree is an optional attribute clustering root node. If data (an Orange ExampleTable) is given, the dendrogram includes a heat map with a color-based presentation of attribute values; the length of the data set should match the number of leaves in the hierarchical clustering tree. Further arguments define the height and width of the plot areas. Branches are plotted in black, but may be colored to visually expose various clusters; the coloring is specified with cluster_colors, a dictionary with cluster instances as keys and (r, g, b) tuples as values.

color_palette specifies the palette to use for the heat map, and minv and maxv the minimum and maximum data values to plot. color_palette can be an instance of the ColorPalette class or a list of (r, g, b) tuples; minv and maxv specify the cutoff values for the heat map (values below and above this interval are painted with color_palette.underflow and color_palette.overflow, respectively).
- plot(filename="graph.eps"): plots the dendrogram and saves it to the output file.
Additionally, a module-level convenience function dendrogram_draw is provided to streamline the drawing process.
To illustrate the use of the dendrogram plotting class, the following script uses it on a subset of 20 instances from the Iris data set. Values of the class variable are used to label the leaves (the class is, of course, not used for the clustering; only the non-class attributes are used to compute the instance distance matrix).
part of hclust-dendrogram.py (uses iris.tab)
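The script itself is not included above; a rough sketch of what it could look like, assuming dendrogram_draw takes the output file name, the clustering root, and optional data and labels keyword arguments:

```python
# Hedged sketch (not the original hclust-dendrogram.py): cluster a ~20-instance
# subset of Iris and draw a dendrogram with class values as leaf labels.
import orange
import orngClustering

data = orange.ExampleTable("iris")
# take roughly every 8th instance to get about 20 examples
sample = orange.ExampleTable(data.domain, [data[i] for i in range(0, len(data), 8)])
root = orngClustering.hierarchicalClustering(sample, order=True)
labels = [str(instance.getclass()) for instance in sample]
# argument order and keyword names are assumptions
orngClustering.dendrogram_draw("hclust-dendrogram.png", root,
                               data=sample, labels=labels)
```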
The resulting dendrogram is shown below.
The following script is similar to the one above, but this time we have 1) distinctively colored the three topmost dendrogram branches, 2) used a custom color schema for the representation of attribute values (spanning red - black - green, with custom gamma, minv and maxv settings), and 3) included only two attributes in the heat map presentation (note: the clustering is still done on all of the data set's attributes).
part of hclust-colored-dendrogram.py (uses iris.tab)
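The original listing is not shown above; a hypothetical sketch is given below. The keyword names follow the DendrogramPlot description above, but the exact constructor signature, the heat-map attribute selection, and the cutoff values are assumptions.

```python
# Hedged sketch (not the original hclust-colored-dendrogram.py).
import orange
import orngClustering

data = orange.ExampleTable("iris")
sample = orange.ExampleTable(data.domain, [data[i] for i in range(0, len(data), 8)])
root = orngClustering.hierarchicalClustering(sample, order=True)

# distinct colors for the three topmost branches (assumes the right branch splits further)
cluster_colors = {root.left: (0, 255, 0),
                  root.right.left: (255, 0, 0),
                  root.right.right: (0, 0, 255)}

# red - black - green palette for the heat map; only two attributes shown
palette = [(255, 0, 0), (0, 0, 0), (0, 255, 0)]
heat_data = sample.select(["petal length", "petal width"])  # assumed selection call

plot = orngClustering.DendrogramPlot(root, data=heat_data,
                                     cluster_colors=cluster_colors,
                                     color_palette=palette, minv=0.0, maxv=8.0)
plot.plot(filename="hclust-colored-dendrogram.png")
```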
Our "colored" dendrogram is now saved as shown in the figure below:
References

Forgy E (1965) Cluster analysis of multivariate data: Efficiency versus interpretability of classification. Biometrics 21(3): 768-769.
He J, Lan M, Tan C-L , Sung S-Y, Low H-B (2004) Initialization of cluster refinement algorithms: A review and comparative study. In Proceedings of International Joint Conference on Neural Networks (IJCNN), pages 297-302, Budapest, Hungary.
Katsavounidis I, Kuo C-CJ, Zhang Z (1994) A new initialization technique for generalized Lloyd iteration. IEEE Signal Processing Letters 1(10): 144-146.
Bar-Joseph Z, Gifford DK, Jaakkola TS (2001) Fast optimal leaf ordering for hierarchical clustering. Bioinformatics 17(Suppl. 1): S22-S29.