Agglomerative Hierarchical Clustering (AHC)

Labels / Tags

  • Clustering
  • Hierarchical

Principle

Agglomerative Hierarchical Clustering is an algorithm working on any metric space as long as a metric associated to input data type is given. It consist to gather closest cluster at each iteration to form a hierarchy. At first iteration each data point is considered as a cluster, then they are iteratively fusionned depending on the given proximity definition between two clusters until it remains a single cluster containing all data observations.

Scalability

Computational complexity is in $O(n^3)$ for computation time and $O(n^2)$ for memory space.

Where :

  • n is the number of data points

Input

  • A collection of data observations of common type which can be any representation R type.

Parameters

  • A metric on R.
  • Linkage nature. Many have been proposed, most knonw are average ref_to_and for others to, single, complete, and ward (HAVE TO BE IMPLEMENTED) linkages. (In future it could be defined manually instea of using defined ones)

Ouput

TreeModel ref_to is a tree data structure where each node represent a cluster and each leaf is a data observation. The TreeModel is not a HardClustering itself, it only represent the computed hierarchy between data observations. But it contains many views, each view containing every leaf is a HardClustering then a TreeModel can be seen as a set of various HardClustering of the input data.

Extract a HardClustering from a TreeModel is called a cut and it exists different cutting strategies ref_to_etat_art_cut.

Acutally HephIA propose only one cut but other will be added.

Predictor

Associated visualization

  • Dendogram (hierarchy).
  • LOOK for MORE

Practical strategies

It can be recommanded to start AHC using the ward linkage but depending how you desire to define distance between clusters other options are available.

AHC is often applied on a collection of prototypes describing a clustering. This list is often small and will allow to run AHC very quickly giving then a hierarchy between prototypes and thus between clusters they represent.

Business case

Usage

Due to its complexity, agglomerative hierarchical clustering cannot handle dataset of high cardinality.

A standard application of agglomerative hierarchical clustering is combined with reduced dataset resulting from algorithms returning prototypes. Then a hierarchy can be made for a large dataset by applying a regular clustering algorithm and then apply agglomerative clustering algorithm on previously obtained prototypes.