K-Means

Tags

Principle

The K-Means clustering algorithm is an unsupervised learning technique working on continuous data (R^d). It generates a list of $K$ prototypes by iterating over the input dataset: at each iteration a new partitioning is produced by associating each point with its closest prototype, then each prototype is updated to the mean of its cluster. The iterative process can be stopped in various ways, such as setting a maximal number of iterations, waiting until the prototype shift falls under a threshold, or monitoring the WSSE variation.

The K-Means algorithm aims to minimize the following cost function:

C = \arg\min_{S} \sum_{k=1}^{K} \sum_{x \in S_k} D(x, p_k)

Where:

  • S = S_1, S_2, …, S_K is the set of K disjoint subsets of the input dataset.
  • D(., .) is a continuous distance over data representations, usually the Euclidean distance.
  • p_k is the prototype of cluster k; here it is the cluster mean.
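As a concrete reading of the formula, the double sum can be evaluated for a fixed partition. A minimal sketch, assuming squared Euclidean distance as D and points represented as tuples (the names `cost`, `clusters`, and `prototypes` are illustrative, not library API):

```python
# Sketch: evaluating the K-Means cost for a fixed partition.
def cost(clusters, prototypes):
    # clusters: list of K lists of points; prototypes: list of K points
    total = 0.0
    for S_k, p_k in zip(clusters, prototypes):
        for x in S_k:
            # squared Euclidean distance D(x, p_k)
            total += sum((xi - pi) ** 2 for xi, pi in zip(x, p_k))
    return total

clusters = [[(0.0, 0.0), (0.0, 2.0)], [(4.0, 4.0)]]
prototypes = [(0.0, 1.0), (4.0, 4.0)]
cost(clusters, prototypes)  # 2.0: each point of the first cluster contributes 1
```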

Initial prototypes can be given or set automatically; for example, they can be chosen randomly within the input dataset or through other techniques such as K-Means++ (ref_to_ADD).
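A minimal sketch of the random initialization mentioned above, sampling $K$ distinct points from the input dataset (K-Means++ is not shown; `init_prototypes` is an illustrative name, not library API):

```python
import random

def init_prototypes(points, k, seed=None):
    # pick K distinct data points as starting prototypes
    return random.Random(seed).sample(points, k)

data = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
protos = init_prototypes(data, k=2, seed=0)
# each starting prototype is one of the original observations
```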

Scalability

The computational complexity is $O(n \cdot K \cdot iter \cdot d)$.

Where:

  • n is the number of data points.
  • K is the number of prototypes.
  • iter is the number of iterations.
  • d is the dimensionality of the data.
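For intuition, the bound counts one distance evaluation per (point, prototype, iteration) triple, each costing O(d). A back-of-the-envelope count, with sizes chosen here purely for illustration:

```python
# Hypothetical sizes, only to show how the O(n·K·iter·d) bound scales.
n, K, iters, d = 10_000, 5, 20, 3

distance_evals = n * K * iters   # one distance per (point, prototype, iteration)
scalar_ops = distance_evals * d  # each Euclidean distance touches d coordinates
# distance_evals == 1_000_000, scalar_ops == 3_000_000
```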

Input

A collection of numerical vectors (R^d), where $d$ is the dimensionality of the data.

Parameters

  • metric : A dissimilarity measure on numerical vectors, by default the Euclidean distance.
  • stop_criter : Stopping criterion. Many strategies have been developed; currently A, B, C are available.
  • k : The number $K$ of prototypes to initialize.
  • proto_set : The set of $K$ prototypes to start with (optional).
  • max_iter : Maximum number of iterations.

Output

The KMeansModel containing the $K$ learned prototypes (the final cluster means) as numerical vectors.

K-Means default predictor

The K-Means algorithm returns a KMeansModel as described above. A HardClustering can then be obtained from new input numerical vectors; these are often the input data on which the fit was run, but they can also be any other collection of values of the same type.

It is built from a KMeansModel and allows predicting cluster membership for queries of type numerical vector.

Query results are the ClusterIds of the closest KMeansModel prototypes. If a single query is submitted, the output is the ClusterId associated with the query; recall that a ClusterId is an integer value identifying cluster membership. If a collection of queries is predicted, the ClusterIds of all queries are gathered in a HardClustering model.

K-MeansPredictor inherits from ClosestPrototypePredictor and follows the same logic in a more specific way: here both queries and prototypes share the numerical vector type.
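The closest-prototype rule described above can be sketched as follows (a minimal illustration; `predict` and `predict_all` are hypothetical names, not the library API):

```python
import math

def predict(query, prototypes):
    # the predicted ClusterId is the index of the nearest prototype
    dists = [math.dist(query, p) for p in prototypes]
    return dists.index(min(dists))

def predict_all(queries, prototypes):
    # hard clustering: one ClusterId per query
    return [predict(q, prototypes) for q in queries]

prototypes = [(0.0, 0.0), (10.0, 10.0)]
predict((1.0, 1.0), prototypes)                     # 0
predict_all([(1.0, 1.0), (9.0, 9.0)], prototypes)   # [0, 1]
```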

Scalability

When the complexity of the dissimilarity measure is considered negligible, the computational complexity of a single K-MeansPredictor query is $O(K)$.

Where:

  • K is the number of prototypes of the KMeansModel.

If there are $n$ queries, the complexity is $O(K \cdot n)$. Considering that in most cases $K$ is negligible compared to $n$, a collection of queries is executed in linear time.

Be sure to know the computational complexity of the dissimilarity measure in use.

Input

Single query

A numerical vector value.

Multiple queries, i.e. a collection of data observations

A collection of numerical vector values.

Parameters

1 : KMeansModel

The KMeansModel contains the list of prototypes returned by the K-Means algorithm.

Output

Single query

Returns the ClusterId of the closest KMeansModel prototype.

Multiple queries, i.e. a collection of queries of numerical vectors

Returns the HardClustering associated with the input data.

Associated visualization

A KMeansModel can be summarized as a collection, usually of moderate size, of virtual observations in R^d, which are the cluster means obtained at the last iteration. Any visualization working on $R^d$ data points can then be applied.

Available visualizations:

  • Radar Plot
  • Scatter Plot, if the data is 2D or 3D.

Practical strategies

Many strategies have developed around the $K$-Means algorithm and how to get the best out of it.

Find the best K value

TODO : Describe evolution of an index with $K$ parameter.
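One common approach is to run the algorithm for several values of $K$ and watch how an index such as the WSSE evolves: the curve decreases with $K$, and a $K$ after which the drop flattens (the "elbow") is a reasonable candidate. A minimal 1-D sketch, with a toy K-Means written here only for illustration:

```python
import random

def kmeans_1d(points, k, iters=50, seed=0):
    # toy 1-D K-Means, only to produce prototypes for several K values
    protos = random.Random(seed).sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in points:
            clusters[min(range(k), key=lambda j: (x - protos[j]) ** 2)].append(x)
        protos = [sum(c) / len(c) if c else protos[i]
                  for i, c in enumerate(clusters)]
    return protos

def wsse(points, protos):
    # within-cluster sum of squared errors to the closest prototype
    return sum(min((x - p) ** 2 for p in protos) for x in points)

data = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 9.0, 9.1]
curve = {k: wsse(data, kmeans_1d(data, k)) for k in range(1, 5)}
# inspect the curve: the K where the WSSE drop flattens is a good candidate
```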

Choose the right stopping criteria

How to stop the algorithm at the right moment is an important factor: it means balancing between reducing the execution time to an acceptable level and getting accurate enough results.
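A minimal sketch of the prototype-shift criterion mentioned in the Principle section, assuming a tolerance `tol` chosen by the user (stop when no prototype moved more than `tol` between two iterations, in addition to a max_iter safety bound):

```python
import math

def converged(old_protos, new_protos, tol=1e-4):
    # True when every prototype moved by at most tol since the last iteration
    return all(math.dist(p, q) <= tol for p, q in zip(old_protos, new_protos))

converged([(0.0, 0.0)], [(0.0, 0.00005)])  # True: shift below tolerance
converged([(0.0, 0.0)], [(1.0, 0.0)])      # False: prototype still moving
```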

The K-Means algorithm is often used to initialize more sophisticated algorithms such as the Gaussian Mixture clustering algorithm.

Business case

K-Means is one of the oldest clustering algorithms and is widely used in many cases requiring to deal with numerical data, when the user has an idea of $K$, i.e. the number of clusters.

Usage

# python code

# Example of running K-Means on a 1-D toy dataset (a minimal sketch;
# the library API described above is not shown here).
import random

def kmeans(points, k, max_iter=100, seed=0):
    # initialize prototypes with K random data points
    protos = random.Random(seed).sample(points, k)
    for _ in range(max_iter):
        # assignment step: each point goes to its closest prototype
        clusters = [[] for _ in range(k)]
        for x in points:
            clusters[min(range(k), key=lambda j: (x - protos[j]) ** 2)].append(x)
        # update step: move each prototype to the mean of its cluster
        new_protos = [sum(c) / len(c) if c else protos[i]
                      for i, c in enumerate(clusters)]
        if new_protos == protos:  # prototypes stopped moving
            break
        protos = new_protos
    return protos

data = [0, 1, 2, 3]
model = kmeans(data, k=2)  # the learned prototypes (cluster means)