Agglomerative Hierarchical Clustering (AHC)
Labels / Tags
- Clustering
- Hierarchical
Principle
Agglomerative Hierarchical Clustering is an algorithm working on any metric space as long as a metric associated to input data type is given. It consist to gather closest cluster at each iteration to form a hierarchy. At first iteration each data point is considered as a cluster, then they are iteratively fusionned depending on the given proximity definition between two clusters until it remains a single cluster containing all data observations.
Scalability
Computational complexity is in $O(n^3)$ for computation time and $O(n^2)$ for memory space.
Where :
- n is the number of data points
Input
- A collection of data observations of common type which can be any representation
Rtype.
Parameters
- A metric on
R. - Linkage nature. Many have been proposed, most knonw are
averageref_to_and for others to,single,complete, andward(HAVE TO BE IMPLEMENTED) linkages. (In future it could be defined manually instea of using defined ones)
Ouput
TreeModel ref_to is a tree data structure where each node represent a cluster and each leaf is a data observation. The TreeModel is not a HardClustering itself, it only represent the computed hierarchy between data observations. But it contains many views, each view containing every leaf is a HardClustering then a TreeModel can be seen as a set of various HardClustering of the input data.
Extract a HardClustering from a TreeModel is called a cut and it exists different cutting strategies ref_to_etat_art_cut.
Acutally HephIA propose only one cut but other will be added.
Predictor
Associated visualization
- Dendogram (hierarchy).
- LOOK for MORE
Practical strategies
It can be recommanded to start AHC using the ward linkage but depending how you desire to define distance between clusters other options are available.
Recommended association
AHC is often applied on a collection of prototypes describing a clustering. This list is often small and will allow to run AHC very quickly giving then a hierarchy between prototypes and thus between clusters they represent.
Business case
Usage
Due to its complexity, agglomerative hierarchical clustering cannot handle dataset of high cardinality.
Recommended association
A standard application of agglomerative hierarchical clustering is combined with reduced dataset resulting from algorithms returning prototypes. Then a hierarchy can be made for a large dataset by applying a regular clustering algorithm and then apply agglomerative clustering algorithm on previously obtained prototypes.