Canopy

Labels / Tags

  • Clustering

Principle

The canopy clustering algorithm is an unsupervised pre-clustering algorithm introduced by Andrew McCallum, Kamal Nigam and Lyle Ungar in 2000 [1]( McCallum, A.; Nigam, K.; and Ungar L.H. (2000) “Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching”, Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, 169-178 doi:10.1145/347090.347123). It generate a hard clustering with associate prototypes of each cluster.

Its principle consist to iterate over randomnly taken prototypes drawn from non-clusterized data and apply two succesive filtering using distance between prototypes and non-clusterized data and hyperparameters $t_1$ and $t_2$. The number of clusters emerges from the processing (instead of being set-up as an hyper-parameter).

Scalability

Computational complexity is in $O(n)$.

Where :

  • $n$ is the number of data observation.

Input

A collection of any data type who has an associated dissimilarity measure on it. Usually it will be numerical, binary, mixed, or multivariate time series data but it is possible to build customizable metric for any given data type.

Parameters

  • $t_1$: the loose distance
  • $t_2$ : the tight distance
  • Dissimilarity measure : metric

Requirements

  • $t_1$ > $t_2$

Output

Result is a CanopyClusters which is a particular form of Clusters where each Cluster also has an associate prototype.

Predictor

Canopy algorithm output beeing a Clusters it already contains affectation for each observation because they are gather into their respective cluster.

Let recall than a Clusters and a HardClustering are siblings, when the first is a collection of Cluster with their data points, the second is a collection of all data point and their respective ClusterId.

Some predictor can directly or through a conversion be used with CanopyClusters model such has ClosestPrototypePredictor or KNNHardClusteringPredictor.

Associated visualization

  • Clusters like.
  • HardClustering like.
  • Prototypes like.

Canopy can be used as a preprocessing step for $K$-Centers like algorithm by giving its output prototypes for initialization.

Business case

Usage

Initiate K-Means like.