DCDPM
Labels/tags
- Tagged according to tag list
- clustering
Principle
The canopy clustering algorithm is an unsupervised pre-clustering algorithm introduced by Andrew McCallum, Kamal Nigam and Lyle Ungar in 2000 [1] (McCallum, A.; Nigam, K.; and Ungar, L.H. (2000) "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, 169-178, doi:10.1145/347090.347123). It generates a hard clustering together with the associated prototype of each cluster.
Its principle is to iterate over prototypes drawn at random from the data not yet clustered and to apply two successive filters, based on the distance between the prototype and the remaining data and on the hyperparameters $t_1$ and $t_2$. The number of clusters emerges from the processing (instead of being set as a hyperparameter).
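The loop described above can be sketched as follows. This is a minimal illustration, not the reference implementation: the function name `canopy_clustering`, its signature and the `seed` argument are hypothetical, and the final step (assigning each point to the closest prototype among the canopies that contain it) is one assumed way of turning overlapping canopies into a hard clustering.

```python
import math
import random
from typing import Callable, Dict, List, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")


def canopy_clustering(
    points: Sequence[T],
    distance: Callable[[T, T], float],
    t1: float,
    t2: float,
    seed: Optional[int] = None,
) -> Tuple[List[int], List[int]]:
    """Return (labels, prototypes); prototypes are indices into `points`."""
    if not t1 > t2:
        raise ValueError("t1 (loose distance) must be greater than t2 (tight distance)")
    rng = random.Random(seed)
    remaining = list(range(len(points)))  # indices of data not yet clustered
    prototypes: List[int] = []
    canopies: List[Dict[int, float]] = []  # per canopy: member index -> distance to prototype

    while remaining:
        proto = rng.choice(remaining)  # prototype drawn at random from the remaining data
        dists = {i: distance(points[proto], points[i]) for i in remaining}
        # First filter: everything within the loose distance t1 joins the canopy.
        canopies.append({i: d for i, d in dists.items() if i == proto or d <= t1})
        prototypes.append(proto)
        # Second filter: everything within the tight distance t2 leaves the pool.
        remaining = [i for i in remaining if i != proto and dists[i] > t2]

    # Assumed hard-assignment step: keep the closest prototype for each point.
    labels = [-1] * len(points)
    best = [math.inf] * len(points)
    for k, members in enumerate(canopies):
        for i, d in members.items():
            if d < best[i]:
                best[i], labels[i] = d, k
    return labels, prototypes


# Usage on 1-D data with the absolute difference as metric.
data = [0.0, 0.1, 0.2, 5.0, 5.1, 9.0]
labels, prototypes = canopy_clustering(data, lambda a, b: abs(a - b), t1=1.0, t2=0.5, seed=0)
```

On this toy input three clusters emerge whatever the random draws, because the three groups of values are farther apart than $t_1$ and each group is narrower than $t_2$.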
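Scalability
Complexity is in O(n) distance computations per prototype, i.e. O(n k) overall, where k is the number of clusters that emerge.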
Input
A collection of any data type that has an associated distance metric.
Parameters
- $t_1$: the loose distance
- $t_2$: the tight distance
- distance: the metric
Requirements
- $t_1 > t_2$
Output format
HardClustering and the associated Prototypes, which are points from the dataset.
Associated visualization
HardClustering like.
Prototypes like.
Business case
Usage
Initialization of K-Means-like algorithms.
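As an illustration of this usage, the sketch below seeds a plain Lloyd-style K-Means with canopy prototypes. Both helpers (`canopy_prototypes`, `kmeans`) are hypothetical names written for this example, the metric is assumed to be Euclidean (`math.dist`), and only $t_2$ appears because, in this sketch, prototype selection depends on the tight distance alone.

```python
import math
import random
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, ...]


def canopy_prototypes(points: Sequence[Point], t2: float, seed: Optional[int] = None) -> List[Point]:
    """Draw random prototypes, discarding every point within t2 of each one."""
    rng = random.Random(seed)
    remaining = list(points)
    prototypes: List[Point] = []
    while remaining:
        proto = rng.choice(remaining)
        prototypes.append(proto)
        remaining = [p for p in remaining if p is not proto and math.dist(p, proto) > t2]
    return prototypes


def kmeans(points: Sequence[Point], centroids: Sequence[Point], n_iter: int = 10) -> List[Point]:
    """Plain Lloyd iterations starting from the given centroids."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        groups: List[List[Point]] = [[] for _ in centroids]
        for p in points:
            k = min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            groups[k].append(p)
        # Update step: centroid becomes the mean of its group (kept as-is if empty).
        centroids = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else old
            for g, old in zip(groups, centroids)
        ]
    return centroids


points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0), (5.2, 4.9)]
init = canopy_prototypes(points, t2=1.0, seed=0)  # the number of centroids emerges here
centroids = kmeans(points, init)
```

The benefit of this design is that the number of centroids does not have to be chosen by hand: it is whatever the canopy pass produces for the chosen tight distance.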