Gaussian Mixture Models

Labels / Tags

Clustering
Soft
Numerical Vector

Principle

A Gaussian mixture model is a probabilistic model which makes the assumption than every data point are generated from a finite mixture of Gaussian distributions with unknown parameters.

It can be seen as a generalization of the $K$-Means algorithm and as this last, it has the $K$ parameter defining how many clusters the user look for.

The aim of Gaussian mixture model is to estimate its unknown parameters, they are usually symbolized by $\theta$. A Gaussian distribution has two parameters, the mean $\mu$ and the covariance $\Sigma$. This algorithm will estimates parameters for the asked number of cluster $K$, knowing that each of them are defines as a Gaussian distribution with its own pair of parameters ${\mu, \Sigma}$.

Scalability

Computational complexity is in $O(n.k.iter.d^3)$. Where :

$n$ is the number of data values.
$K$ is the number of prototypes, i.e. the list of $\mu$.
iter is the number of iterations.
$d$ is the dimensionality of the data values.

The memory complexity is often negligeable when numerical vectors dimensionality has resonable size, if your data space has huge dimensionality it can be problematic because during covariances matrices must be computed within the Gaussian density function and theses last are square matrices of order $d$, thus they carry $n^2$ values. If $d \gg n$ the memory complexity will be $O(d^2)$.

Pro

Cons

Input

A collection of numerical vectors ($R^d$).

Parameters

1 : $K$, the number of cluster to look for.

$K$ is an upper bound for the number of clusters. The algorithm will start to look for $K$ clusters but on some case the final number of cluster can be smaller.

2 : $maxIter$, the maximum number of allowed iterations.

It is one of the stopping criteria. Its objective is to prevent to many iterations to run.

3 : metric, a numerical vector dissimalirity measure.

By default Euclidean distance is used.

4 : stoppingCriteria

Stopping criteria. Many strategies have been developped. Actually A, B, C are available.

5 (Optional) : gmModel, another Gaussian Mixture Model to start with.

To iterate this algorithm must have a list of defined clusters defining by their Gaussian Mixture distribution. By default it is generated using the $K$-Means++ initialization to get the $K$ prototypes $\mu$ and then associate $\Sigma$ values are computed to build the list of initial Gaussian distributions.

Ouput

The returned GaussianMixtureModel contains the list of estimated paramters $\theta$, i.e. the list of clusters which are defined by their gaussian distribution within is stored the mean $\mu$ and the covariance $\Sigma$.

Predictor

GaussianMixtureModels algoritm output model have two default associate predictors which follow the same principle. The difference lies at the end of the process, for a single query, one will returns the probability to belong in each cluster whereas the second will returns the ClusterId of the cluster maxmizing probability belonging.

Hard predictor

GaussianMixtureHardPredictor takes only the model as parameter.

Single query

It returns the ClusterId of the cluster maximizing the probability to be in.

Multiple queries

It returns the HardClustering associates to input collections of numerical vectors.

Soft predictor

GaussianMixtureSoftPredictor takes only the model as parameter.

Single query

It returns the list of probabilities to fall in each cluster.

Multiple queries

It returns the CategoricalClustering associates to input collections of numerical vectors.

Associated visualization

Gaussian Mixture Models.
List of 2/3D prototypes.

Practical strategies

As mentioned in Parameters section, a already computed GaussianMixtureModel can be given rather than use the default algorithm initialization. It can be a model from another GaussianMixtureModel algorithm run on similar data, but it also can be a $K$-Means model, even if this last contains only the $K$ means $\mu$, covariances $\Sigma$ will be computed in initialization step of the algorithm before starting the iterative process.