**Hierarchical Clustering**

- Produce a nested sequence of clusters.
- One approach: recursive application of a partitional clustering algorithm.

**Types of hierarchical clustering**

Agglomerative (bottom-up) clustering builds the dendrogram (tree) from the bottom level, and

- merges the most similar (or nearest) pair of clusters
- stops when all the data points are merged into a single cluster (i.e., the root cluster).

Divisive (top-down) clustering starts with all data points in a single root cluster, and

- splits the root into a set of child clusters; each child cluster is recursively divided further
- stops when only singleton clusters of individual data points remain, i.e., each cluster contains only a single point.

**Dendrogram: Hierarchical Clustering**

Dendrogram

- Given an input set S, the nodes of the tree represent subsets of S.
- Features of the tree:
  - The root is the whole input set S.
  - The leaves are the individual elements of S.
  - Each internal node is the union of its children.


→ Each level of the tree represents a partition of the input data into several (nested) clusters or groups.

→ The tree may be cut at any level: each connected component of the cut then forms a cluster.
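As a sketch of cutting the tree at a level, assuming SciPy is available: `linkage` builds the dendrogram bottom-up and `fcluster` cuts it into a requested number of clusters (the sample points are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# four points forming two well-separated groups
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])

Z = linkage(X, method="single")                  # build the dendrogram
labels = fcluster(Z, t=2, criterion="maxclust")  # cut it into 2 clusters
```

Cutting at a different level (e.g., `t=3`) yields a finer partition from the same tree, without re-running the clustering.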

**Hierarchical Agglomerative Clustering**

- Initially, each data point forms its own cluster.
- Compute the distance matrix between the clusters.
- Repeat:
  - Merge the closest pair of clusters.
  - Update the distance matrix.
- Until only a single cluster remains.
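The loop above can be sketched naively in pure Python, using single-link as the cluster distance (the helper names are illustrative; distances are recomputed from cluster contents rather than kept in an explicit matrix):

```python
import math

def single_link(ci, cj):
    # minimum pairwise distance between the two clusters
    return min(math.dist(p, q) for p in ci for q in cj)

def hac(points):
    # each data point starts as its own cluster
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        # find the closest pair of clusters
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: single_link(clusters[ab[0]], clusters[ab[1]]),
        )
        # merge them and record the step
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges
```

Each recorded merge corresponds to one internal node of the dendrogram, from the leaves up to the root.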

**Different definitions of the distance lead to different algorithms.**

**Initialization**

- Each individual point is taken as a cluster.
- Construct the distance/proximity matrix.

**Intermediate State**

- After some merging steps, we are left with a number of intermediate clusters.

**After Merging**

- Update the distance matrix

**Closest Pair**

- There are a few ways to measure the distance between two clusters:
- Single-link → distance between the most similar (closest) pair of points
- Complete-link → distance between the least similar (furthest) pair of points
- Centroid → distance between the clusters' centroids (centers of gravity)
- Average-link → average similarity (e.g., cosine) over all pairs of elements
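These linkage measures can be written as small distance functions; a minimal sketch using Euclidean distance (function names are illustrative, and average-link is shown with plain distances rather than cosine similarity):

```python
import math

def single_link(ci, cj):
    # distance between the closest pair of points
    return min(math.dist(p, q) for p in ci for q in cj)

def complete_link(ci, cj):
    # distance between the furthest pair of points
    return max(math.dist(p, q) for p in ci for q in cj)

def average_link(ci, cj):
    # mean distance over all cross-cluster pairs
    return sum(math.dist(p, q) for p in ci for q in cj) / (len(ci) * len(cj))

def centroid_link(ci, cj):
    # distance between the clusters' centers of gravity
    c1 = [sum(x) / len(ci) for x in zip(*ci)]
    c2 = [sum(x) / len(cj) for x in zip(*cj)]
    return math.dist(c1, c2)
```

For any pair of clusters, single-link ≤ average-link ≤ complete-link, since they take the minimum, mean, and maximum of the same set of pairwise distances.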

**Distance between two clusters**

- Single-link distance between clusters Ci and Cj is the minimum distance between any object in Ci and any object in Cj

**Single-link clustering : example**

- Determined by one pair of points, i.e., by one link in the proximity graph.

**Complete link method**

- The distance between two clusters is the distance between the two furthest data points in the two clusters.

- Makes "tighter," more spherical clusters, which are typically preferable.
- It is sensitive to outliers, because an outlier lies far from everything else and therefore dominates the maximum distance.

**Computational Complexity**

- In the first iteration, all HAC methods need to compute the similarity of all pairs of the N initial instances, which is O(N²).
- In each of the subsequent N−2 merging iterations, compute the distance between the most recently created cluster and all other existing clusters.
- To maintain an overall O(N²) performance, computing the similarity to each other cluster must be done in constant time.
- Often O(N³) if done naively, or O(N² log N) if done more cleverly.
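The constant-time update is easiest to see for single-link: the distance from the merged cluster to any other cluster is just the minimum of the two old distances (a special case of the Lance-Williams update). A sketch, where the dictionary-based distance matrix is an assumption for illustration:

```python
def merge_update(dist, i, j):
    """Merge cluster j into cluster i.

    dist maps frozenset({a, b}) -> distance between clusters a and b.
    Each remaining distance is updated in O(1), so one merge costs O(N)
    and the full sequence of N-1 merges stays O(N^2) overall.
    """
    others = {c for pair in dist for c in pair if c not in (i, j)}
    for k in others:
        # single-link: distance to the merged cluster is the smaller
        # of the distances to its two parts
        dist[frozenset((i, k))] = min(dist[frozenset((i, k))],
                                      dist[frozenset((j, k))])
        del dist[frozenset((j, k))]
    del dist[frozenset((i, j))]
```

The naive O(N³) cost comes from rescanning the whole matrix for the closest pair at every iteration; keeping per-cluster nearest neighbors or a priority queue avoids that rescan.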
