Michael Data

In unsupervised learning and exploratory data analysis, the goal is typically to produce a simplified representation of a data set for visualization or other interpretation purposes.

The number of clusters is frequently a parameter that must be specified ahead of time. This is a significant drawback, since the "right" number of clusters is usually not known in advance.

Data Format

Data matrix

  • N rows (one per data point)
  • d columns (one per feature)

Distance matrix

  • N rows
  • N columns (pairwise distances between points)
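A distance matrix can be derived from a data matrix by computing all pairwise distances. A minimal sketch in plain Python, using a hypothetical toy data matrix with N = 4 points in d = 2 dimensions:

```python
import math

# Hypothetical toy data matrix: N = 4 points, d = 2 dimensions.
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]

def distance_matrix(X):
    """Build the N x N matrix of pairwise Euclidean distances."""
    n = len(X)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(X[i], X[j])  # Euclidean distance
            D[i][j] = D[j][i] = d      # the matrix is symmetric
    return D

D = distance_matrix(X)
```

Note that the distance matrix is symmetric with a zero diagonal, so only N(N-1)/2 distances actually need to be computed.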

Hierarchical Clustering

Produces a tree of nested clusters. Works from a distance matrix.

The primary advantage is that this allows for visualization of high-dimensional data.

Disadvantage: Computationally expensive. Building the distance matrix alone requires $O(n^2)$ distance comparisons, each costing $O(d)$ for $d$-dimensional data, and naive agglomerative algorithms are $O(n^3)$ overall.

Visualized via a dendrogram: a tree diagram in which the height of each merge reflects the distance at which the clusters were joined.


Bottom-up (agglomerative) approach.

  • For $i = 1$ to $n$, let $c_i = \{x^{(i)}\}$, i.e. start with singleton clusters.
  • Use a distance measure to find the two nearest clusters and merge them; repeat until the desired number of clusters (or a single root cluster) remains.
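The bottom-up loop above can be sketched in plain Python; this version uses the single-link (minimum pairwise) distance between clusters, and the toy data is hypothetical:

```python
import math

def single_link(ci, cj, X):
    """Distance between two clusters = min pairwise point distance."""
    return min(math.dist(X[a], X[b]) for a in ci for b in cj)

def agglomerate(X, k):
    """Start with singletons; merge the two nearest clusters until k remain."""
    clusters = [[i] for i in range(len(X))]  # c_i = {x(i)}
    while len(clusters) > k:
        # Find the closest pair of clusters under the linkage measure.
        i, j = min(
            ((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
            key=lambda p: single_link(clusters[p[0]], clusters[p[1]], X),
        )
        clusters[i] += clusters.pop(j)  # merge cluster j into cluster i
    return clusters

# Two well-separated pairs of 1-D points.
print(agglomerate([[0.0], [0.1], [5.0], [5.1]], 2))  # → [[0, 1], [2, 3]]
```

Running the loop to completion (k = 1) and recording each merge and its distance yields exactly the information a dendrogram displays.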


Top-down (divisive) approach: start with all points in one cluster and recursively split. Rarely used.

Spectral Clustering

a.k.a. Graph-based clustering. a.k.a. min-cut methods.
Given a graph of data points, find the min-cut of the graph to produce two sub-graphs.
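A standard way to approximate the min-cut is via the graph Laplacian: the eigenvector of its second-smallest eigenvalue (the Fiedler vector) is split by sign to produce the two sub-graphs. A minimal numpy sketch, with a hypothetical 4-node weighted adjacency matrix:

```python
import numpy as np

# Hypothetical adjacency matrix: nodes 0-1 and 2-3 are strongly
# connected pairs (weight 2), joined by one weak edge 1-2 (weight 1).
W = np.array([
    [0, 2, 0, 0],
    [2, 0, 1, 0],
    [0, 1, 0, 2],
    [0, 0, 2, 0],
], dtype=float)

# Unnormalized graph Laplacian: L = D - W.
L = np.diag(W.sum(axis=1)) - W

# Eigenvector of the second-smallest eigenvalue (Fiedler vector);
# splitting its entries by sign approximates the min-cut.
eigvals, eigvecs = np.linalg.eigh(L)
fiedler = eigvecs[:, 1]
partition = fiedler > 0
```

For this graph the cut falls on the single weak edge, so the partition separates {0, 1} from {2, 3} (the overall sign of an eigenvector is arbitrary, so which side is labeled `True` may vary).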

Distance Measures

Care is required to properly interpret variables as coordinates in a Euclidean space. Should the variables be normalized (whitened), or should some be weighted more heavily than others? Domain expertise may be needed to answer these questions.

Distance measures are closely related to similarity measures: a decreasing transform of a similarity measure (e.g. its inverse) usually yields a distance measure, and vice versa.

The following measures are used for hierarchical clustering methods.

Single-link, a.k.a. nearest-neighbor measure.

$D(C_i,C_j) = \min\{d(x,y) \mid x \in C_i , y \in C_j \}$

Tends to produce long chains.

Complete-link, a.k.a. furthest-neighbor measure.

$D(C_i,C_j) = \max\{d(x,y) \mid x \in C_i , y \in C_j \}$

Tends to produce round shapes.

Average-link: a common compromise between single-link and complete-link.

$D(C_i,C_j) = \text{avg}\{d(x,y) \mid x \in C_i , y \in C_j \}$

Centroid measure: requires a meaningful spatial representation of the data so that cluster centroids can be computed.

$D(C_i,C_j) = d(c_i,c_j) $ where $c_i$ and $c_j$ are the centroids of their respective clusters.
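The four cluster-distance measures above can be sketched in plain Python; the two example clusters are hypothetical:

```python
import math

def pairwise(ci, cj):
    """All pairwise point distances between two clusters."""
    return [math.dist(x, y) for x in ci for y in cj]

def single_link(ci, cj):     # nearest neighbor: min pairwise distance
    return min(pairwise(ci, cj))

def complete_link(ci, cj):   # furthest neighbor: max pairwise distance
    return max(pairwise(ci, cj))

def average_link(ci, cj):    # mean of all pairwise distances
    d = pairwise(ci, cj)
    return sum(d) / len(d)

def centroid(c):
    """Component-wise mean of a cluster's points."""
    return [sum(col) / len(c) for col in zip(*c)]

def centroid_link(ci, cj):   # distance between cluster centroids
    return math.dist(centroid(ci), centroid(cj))

# Two hypothetical clusters of 2-D points, 3 units apart horizontally.
Ci = [[0.0, 0.0], [0.0, 2.0]]
Cj = [[3.0, 0.0], [3.0, 2.0]]
```

On this example the measures already disagree: single-link and centroid-link give 3.0, while complete-link gives $\sqrt{13} \approx 3.61$, which is why the choice of measure changes the shape of the resulting clusters.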