DCDPM


labels/tags

Principle

The canopy clustering algorithm is an unsupervised pre-clustering algorithm introduced by Andrew McCallum, Kamal Nigam and Lyle Ungar in 2000 [1] (McCallum, A.; Nigam, K.; and Ungar, L. H. (2000) “Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching”, Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, 169-178, doi:10.1145/347090.347123). It generates a hard clustering together with the associated prototype of each cluster.

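Its principle is to iterate over prototypes drawn at random from the non-clustered data and to apply two successive filters, based on the distance between the prototype and the non-clustered data and on the hyperparameters $t_1$ and $t_2$. Points closer to the prototype than the loose distance $t_1$ are grouped with it, and points closer than the tight distance $t_2$ are removed from the pool of candidate prototypes. The number of clusters emerges from the processing (instead of being set up as a hyperparameter).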
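The iteration described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name `canopy_clustering`, its signature, and the `seed` argument are assumptions made for the example. To obtain a hard clustering, the sketch also assumes that a point keeps the label of the first prototype whose loose distance covers it, unless it is later drawn as a prototype itself.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")


def canopy_clustering(
    data: Sequence[T],
    distance: Callable[[T, T], float],
    t1: float,
    t2: float,
    seed: Optional[int] = None,
) -> Tuple[List[int], List[T]]:
    """Return (labels, prototypes); labels[i] is the cluster index of data[i]."""
    if not t1 > t2:
        raise ValueError("t1 (loose distance) must be greater than t2 (tight distance)")
    rng = random.Random(seed)
    labels = [-1] * len(data)           # -1 marks a point not yet assigned
    prototypes: List[T] = []
    remaining = list(range(len(data)))  # indices still eligible as prototypes

    while remaining:
        # Draw a prototype at random from the non-clustered data.
        proto = remaining[rng.randrange(len(remaining))]
        cluster_id = len(prototypes)
        prototypes.append(data[proto])
        labels[proto] = cluster_id

        next_remaining = []
        for i in remaining:
            if i == proto:
                continue
            d = distance(data[proto], data[i])
            # Loose filter (assumption of this sketch): the first prototype
            # that covers an unassigned point gives it its label.
            if d < t1 and labels[i] == -1:
                labels[i] = cluster_id
            # Tight filter: only points at least t2 away stay in the pool.
            if d >= t2:
                next_remaining.append(i)
        remaining = next_remaining

    return labels, prototypes
```

For instance, `canopy_clustering([0.0, 0.1, 0.2, 10.0, 10.1, 10.2], lambda a, b: abs(a - b), t1=1.0, t2=0.5)` separates the two groups of values whichever prototypes are drawn, because every point lies within $t_2$ of the other points of its group and far beyond $t_1$ from the other group.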

Scalability

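Complexity is in O(n), where n is the number of data points, for a bounded number of clusters: each prototype requires one pass over the data that is still non-clustered.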
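One way to read this bound is to count distance computations. The symbols $k$ (the number of prototypes eventually drawn) and $n_j$ (the number of points still non-clustered when prototype $j$ is drawn) are introduced here for illustration only. Assuming each prototype is compared once with every remaining point:

$$
\sum_{j=1}^{k} n_j \le k \, n
$$

so the cost grows linearly with $n$ as long as $k$ stays bounded, and in practice $n_j$ shrinks at each iteration because the tight filter removes points from the pool.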

Input

A collection of any data type that has an associated distance metric.
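As an illustration of what "an associated distance metric" can look like for different data types, here are two small examples. The function names are chosen for this sketch and are not part of any particular library.

```python
import math
from typing import Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two numeric vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hamming(a: str, b: str) -> int:
    """Hamming distance: number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(ca != cb for ca, cb in zip(a, b))
```

Any function of this shape, taking two items and returning a non-negative number, can play the role of the distance for the corresponding data type.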

Parameters

  • t_1: the loose distance
  • t_2: the tight distance
  • distance: the distance metric used to compare data points

Requirements

  • t_1 > t_2
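A minimal sketch of how the parameters and the requirement above could be bundled and checked in one place. The class name `CanopyParams` and its field names are hypothetical, chosen only for this example.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class CanopyParams:
    """Hypothetical container for the three parameters listed above."""
    t1: float                               # loose distance
    t2: float                               # tight distance
    distance: Callable[[Any, Any], float]   # metric on the input data type

    def __post_init__(self) -> None:
        # Enforce the requirement t_1 > t_2 at construction time.
        if not self.t1 > self.t2:
            raise ValueError("t1 must be strictly greater than t2")
```

Rejecting invalid thresholds at construction time keeps the check out of the clustering loop itself.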

Output format

HardClustering and the associated Prototypes, which are points from the dataset.

Associated visualization

Same as for HardClustering.

Same as for Prototypes.

Business case

Usage

Initialization of K-Means-like algorithms.
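A minimal sketch of this usage, assuming numeric tuples as data and prototypes already obtained from a canopy pass. The function `kmeans_from_prototypes` is hypothetical and implements a plain Lloyd-style iteration, not any specific library's K-Means: the prototypes serve as initial centroids, so the number of clusters does not need to be chosen separately.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_euclidean(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_from_prototypes(
    data: Sequence[Point],
    prototypes: Sequence[Point],
    n_iter: int = 10,
) -> Tuple[List[int], List[Point]]:
    """Run n_iter Lloyd iterations starting from the given prototypes."""
    centroids: List[Point] = [tuple(p) for p in prototypes]
    labels: List[int] = [0] * len(data)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda c: squared_euclidean(p, centroids[c]))
            for p in data
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, label in zip(data, labels) if label == c]
            if members:  # keep the previous centroid if the cluster is empty
                centroids[c] = tuple(sum(coord) / len(members) for coord in zip(*members))
    return labels, centroids
```

Starting from prototypes that are already spread across the data is one way to avoid the poor local optima that a purely random initialization can produce.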