# Michael's Wiki

Typically in unsupervised learning and exploratory data analysis, you want to produce a simplified representation of a data set for visualization or other interpretation purposes.

The number of clusters is frequently a parameter that must be specified ahead of time. This is a significant limitation, since the "right" number of clusters is rarely known in advance.

Data matrix

• N rows
• d columns

Distance matrix

• N rows
• N columns
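The distance matrix is derived from the data matrix: an $N \times d$ data matrix yields an $N \times N$ matrix of pairwise distances. A minimal sketch with numpy (the toy data here is made up for illustration):

```python
import numpy as np

# Toy data matrix: N = 4 points in d = 2 dimensions.
X = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0],
              [5.0, 5.0]])

# Pairwise Euclidean distances via broadcasting:
# diff has shape (N, N, d); D has shape (N, N).
diff = X[:, None, :] - X[None, :, :]
D = np.sqrt((diff ** 2).sum(axis=-1))

assert D.shape == (4, 4)
assert np.allclose(np.diag(D), 0.0)   # zero self-distance
assert np.allclose(D, D.T)            # symmetric
```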

#### Hierarchical Clustering

Produces a tree of nested clusters. Works from a distance matrix.

The primary advantage is that this allows for visualization of high-dimensional data.

Disadvantage: computationally expensive. Computing the pairwise distances alone takes $O(n^2 \cdot d)$ work ($O(n^2)$ pairs, each an $O(d)$ comparison), and further distance comparisons are needed after each merge produces a new set of clusters.

Visualized via a dendrogram.

### Agglomerative

Bottom-up approach.

• For $i = 1$ to $n$, let $C_i = \{x^{(i)}\}$, i.e. start with singleton clusters.
• Use a distance measure to choose the 2 nearest clusters, and merge them; repeat until the desired number of clusters remains.
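The steps above can be sketched as a naive $O(n^3)$ loop (the function name, toy data, and single/complete linkage choice are illustrative, not a reference implementation):

```python
import numpy as np

def agglomerate(X, k, linkage="single"):
    """Naive agglomerative clustering down to k clusters.

    Starts with singletons and repeatedly merges the two nearest
    clusters under the chosen linkage. O(n^3) -- fine for a demo.
    """
    # Pairwise Euclidean distance matrix.
    diff = X[:, None, :] - X[None, :, :]
    d = np.sqrt((diff ** 2).sum(-1))

    clusters = [[i] for i in range(len(X))]   # start with singletons
    while len(clusters) > k:
        best = (np.inf, None, None)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # All cross-cluster distances between clusters a and b.
                pair = d[np.ix_(clusters[a], clusters[b])]
                dist = pair.min() if linkage == "single" else pair.max()
                if dist < best[0]:
                    best = (dist, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)        # merge the two nearest
    return clusters

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
print(agglomerate(X, 2))  # recovers the two obvious pairs
```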

### Divisive

Top-down approach. Rarely used.

#### Spectral Clustering

A.k.a. graph-based clustering, a.k.a. min-cut methods. Given a graph of data points, find the min-cut of the graph to produce two sub-graphs.
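A minimal sketch of the min-cut idea using the graph Laplacian: the sign of the eigenvector for the second-smallest eigenvalue (the Fiedler vector) approximates the min-cut partition. The graph, weights, and node count here are made up for illustration:

```python
import numpy as np

# Tiny weighted graph: two triangles (nodes 0-2 and 3-5)
# joined by a single weak edge.
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    W[i, j] = W[j, i] = 1.0
W[2, 3] = W[3, 2] = 0.1  # the edge a min-cut should sever

# Unnormalized graph Laplacian L = D - W.
L = np.diag(W.sum(axis=1)) - W

# eigh returns eigenvalues in ascending order; column 1 is the
# Fiedler vector, whose sign pattern splits the graph in two.
eigvals, eigvecs = np.linalg.eigh(L)
fiedler = eigvecs[:, 1]
labels = (fiedler > 0).astype(int)
print(labels)  # nodes 0-2 on one side of the cut, 3-5 on the other
```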

##### Distance Measures

Care is required to properly interpret variables as coordinates in Euclidean space. Should the variables be normalized (whitened), or should some carry more weight than others? Domain expertise may be needed to answer these questions.

Distance measures are closely related to similarity measures. You can usually transform a similarity measure into a distance measure, or vice versa.
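One common transformation (the cosine-similarity matrix below is made up for illustration) is $d = 1 - s$ for similarities bounded above by 1:

```python
import numpy as np

# A hypothetical cosine-similarity matrix (1 = identical direction).
S = np.array([[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.2],
              [0.1, 0.2, 1.0]])

# Convert similarity to distance: high similarity -> small distance.
D = 1.0 - S

assert np.allclose(np.diag(D), 0.0)  # each point is distance 0 from itself
```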

The following measures are used for hierarchical clustering methods.

Single linkage, a.k.a. the nearest-neighbor measure:

$D(C_i,C_j) = \min\{d(x,y) \mid x \in C_i , y \in C_j \}$

Tends to produce long chains.

Complete linkage, a.k.a. the furthest-neighbor measure:

$D(C_i,C_j) = \max\{d(x,y) \mid x \in C_i , y \in C_j \}$

Tends to produce compact, roughly round clusters.

Average linkage:

$D(C_i,C_j) = \operatorname{avg}\{d(x,y) \mid x \in C_i , y \in C_j \}$

Centroid method:

$D(C_i,C_j) = d(c_i,c_j)$, where $c_i$ and $c_j$ are the centroids of their respective clusters.
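The four measures above can be compared directly on a small example (the two clusters here are made up for illustration):

```python
import numpy as np

# Two hypothetical clusters of 2-D points.
C_i = np.array([[0.0, 0.0], [1.0, 0.0]])
C_j = np.array([[4.0, 0.0], [6.0, 0.0]])

# All cross-cluster pairwise distances d(x, y), x in C_i, y in C_j.
d = np.sqrt(((C_i[:, None, :] - C_j[None, :, :]) ** 2).sum(-1))

single   = d.min()                                    # nearest neighbor
complete = d.max()                                    # furthest neighbor
average  = d.mean()                                   # average linkage
centroid = np.linalg.norm(C_i.mean(0) - C_j.mean(0))  # centroid method

print(single, complete, average, centroid)  # 3.0 6.0 4.5 4.5
```

Note that the four criteria genuinely disagree: single linkage sees only the closest pair, complete linkage only the furthest, which is why they favor chains and round clusters respectively.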