K-Medoids

labels/tags

Principle

K-Means clustering algorithm is an unsupervised algorithm technique working on continuous data (R^d). It generate a list of $K$ prototypes by iterating over input dataset, each time a new partitioning is generated associating points to their closest prototype. Then each prototype is updated by computing the mean of each cluster. Iterative process can be stopped in various way such as setting a maximal number of iteration, wait until prototypes shift is under a threshold, or by looking for WSSE variation.

K-Means algorithm aim is to minimize following cost function :

C = argmin_S{\sum_{k = 1}^K{\sum_{x \in S_k}{D(x, p_k)}}}

Where :

  • S = S_1, S_2, …, S_K is the set composed with K disjoint subset of input dataset.
  • D(., .) is the continuous distance over data representations, usually the euclidean distance.
  • p_k is the prototype of each cluster, here it is the mean.

Initial prototypes can be given or automatically set, for example they can be chosed randomly within input dataset or through others technique la K-Means++ ref.

Scalability

Complexity is in O(n.K.iter.d). Where :

  • n is the number of data points
  • K is the number of prototypes
  • iter is the number of iterations
  • d is the dimensionality of the data

Input

  • A collection of continuous vector (R^d).

Parameters

  • A continous metric, by default Euclidean.
  • Stopping criteria. Many strategies have been developped. Actually A, B, C are available.
  • K prototypes intialization.
  • (Optionally) list of K prototypes.
  • Maximum number of iteration.

Ouput format

A model containing the K prototypes.

Associated visualization

Prototypes like.

Business case

Usage