Home > Hadrian > Basic clustering models

Basic clustering models

Before you begin…

Download and install Titus. This article was tested with Titus 0.7.1; newer versions should work with no modification. Python >= 2.6 and < 3.0 is required.

Launch a Python prompt and import the following:

Python 2.7.6
Type "help", "copyright", "credits" or "license" for more information.
>>> import random
>>> import math
>>> import titus.prettypfa
>>> from titus.genpy import PFAEngine

The basic form

To use the functions in model.cluster.*, declare your clusters as records with a “center” field. (The centers don’t have to be arrays of numbers: they can be arrays of any type X as long as you have a metric function that can find distances on X.)

pfaDocument = titus.prettypfa.jsonNode('''
  Cluster = record(Cluster,
                   center: array(double),
                   population: int)
input: array(double)
output: Cluster
  clusters(array(Cluster)) = []
  model.cluster.closest(input, clusters, metric.simpleEuclidean)

This scoring engine returns the whole cluster object for each match. You could consider adding an “id” (string) field and return the id of the closest cluster.

Producing a model

Since titus.producer.kmeans already has an implementation of k-means clustering, let’s build [hierarchical clusters])(https://en.wikipedia.org/wiki/Hierarchical_clustering) (complete linkage).

def randomPoint():
    return [random.gauss(0, 1), random.gauss(0, 1), random.gauss(0, 1)]

def pointDistance(xpoint, ypoint):
    return math.sqrt(sum((xi - yi)**2 for xi, yi in zip(xpoint, ypoint)))

def setDistance(xset, yset):
    maxDistance = None
    for xpoint in xset:
        for ypoint in yset:
            d = pointDistance(xpoint, ypoint)
            if maxDistance is None or d > maxDistance:
                maxDistance = d
    return maxDistance

dataset = [randomPoint() for x in xrange(100)]

clusters = [[x] for x in dataset]

while len(clusters) > 1:
    distance, besti, bestj = None, None, None
    for i in xrange(len(clusters)):
        for j in xrange(i + 1, len(clusters)):
            d = setDistance(clusters[i], clusters[j])
            if distance is None or d < distance:
                distance = d
                besti = i
                bestj = j
    if distance > DISTANCE_CUTOFF:
    clusters.append(clusters[besti] + clusters[bestj])
    del clusters[bestj]
    del clusters[besti]

Insert the model into PFA

The clusters as I’ve defined them aren’t in the right format for my Cluster type above, so I’ll have to convert them before inserting them.

def makeCluster(list):
    center = [0.0] * len(list[0])
    for vector in list:
        for i in xrange(len(center)):
            center[i] += vector[i] / len(list)
    return {"center": center, "population": len(list)}

pfaDocument["cells"]["clusters"]["init"] = map(makeCluster, clusters)

Test it!

engine, = PFAEngine.fromJson(pfaDocument)

for point in dataset:
    print "{0:70s} -> {1}".format(point, engine.action(point))

Streaming clusters

PFA is ordinarily used for making predictions from a fixed model because it describes a single-pass procedure. (The logic for multiple iterations on a dataset would be contained in the program that runs the PFA, not the PFA itself.) Thus, the usual mode is to perform k-means or hierarchical clustering on a training dataset (which requires iterations) and then apply the resulting model to classify new data, as shown in the sections above.

However, it is also possible to update clusters on the fly on a sliding window. The model.cluster.* library contains methods for seeding and updating clusters. The approach is to apply one k-means iteration to a window of data every time the window increments. Most of the data in that window is unchanged from one step to the next, so this procedure is similar to iteration, apart from the old data points that fall off one end of the window and new data points that are added to the other.

The YAML file below describes a PFA engine based on this technique. (The subset of YAML used is equivalent to JSON, but easier to read, and it allows comments.)

# Input: get a 2D vector.
input: {type: array, items: double}

# Output: emit the current state of the clusters.
  type: array
    type: record
    name: Cluster
      - {name: center, type: {type: array, items: double}}

# Keep track of the dataset (window of size 100) and the clusters (initially
# empty).
    type: {type: array, items: {type: array, items: double}}
    init: []
    type: {type: array, items: Cluster}
    init: []

# For each input vector, apply this action...
method: emit
  # First, put the datum into the circular buffer (maintained by the a.cycle
  # function).
  - cell: window
      params: [{x: {type: array, items: {type: array, items: double}}}]
      ret: {type: array, items: {type: array, items: double}}
      do: {a.cycle: [x, input, 100]}

  # Next, update the clusters if the window has more than 100 entries.
  - if: {">=": [{a.len: {cell: window}}, 100]}
        cell: clusters
          params: [{oldClusters: {type: array, items: Cluster}}]
          ret: {type: array, items: Cluster}
            if: {"==": [{a.len: oldClusters}, 0]}

            # If there are no clusters yet, randomly select three data points
            # to use as seeds.
                - {cell: window}
                - 3
                - params: [{i: int}, {vec: {type: array, items: double}}]
                  ret: Cluster
                    new: {center: vec}
                    type: Cluster

            # If clusters already exist, apply one k-means iteration every time
            # the window slides one place. Also, include the old centers in the
            # mean with as much weight as the new window (100.0).
                - {cell: window}
                - oldClusters
                - {fcnref: metric.simpleEuclidean}
                - params:
                    - {data: {type: array, items: {type: array, items: double}}}
                    - {cluster: Cluster}
                  ret: Cluster
                  do: {model.cluster.updateMean: [data, cluster, 100.0]}

The plot below shows what this looks like. Black points are data in the windows, which clearly belong to three clusters, and after a burn-in period (100 events), cluster centers (red cross-hairs) are randomly seeded and then updated. The true centers of the clusters are intentionally changed during the run, to show how the scoring engine reacts.

If the animation has stopped, reload this page or open the image in a new tab.