Silhouette

Labels / Tags

  • Internal Quality index
  • NumericalEvaluation

Principle

Silhouette is a numerical index that measures the quality of a HardClustering. For each point, it compares the average dissimilarity to the other points of its own cluster (cohesion) with the average dissimilarity to the points of the nearest neighboring cluster (separation). The silhouette plot shows how close each point of one cluster is to the points of the neighboring clusters, and thus gives a visual way to assess parameters such as the number of clusters. The index takes values in $[-1, 1]$.

  • 1: Means clusters are well separated and clearly distinguished.
  • 0: Means clusters overlap or touch, i.e. the distance between clusters is not significant.
  • -1: Means points are assigned to the wrong clusters.

$a(i) = \frac{1}{card(C_I) - 1}\sum_{j \in C_I, i \neq j}D(i, j)$

$b(i) = \min_{J \neq I}\left(\frac{1}{card(C_J)}\sum_{j \in C_J}D(i, j)\right)$

$s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}$ if $card(C_I) > 1$, and $s(i) = 0$ otherwise

$S = \frac{1}{K}\sum_{k=1}^{K}\frac{1}{card(C_k)}\sum_{i \in C_k}s(i)$

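Where :

  • $C_I$ is the cluster containing point $i$, and $card(C_I)$ is its number of points
  • $D(., .)$ is the dissimilarity measure between two points
  • $K$ is the number of clusters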
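
The formulas above can be sketched in a few lines of Python. This is a minimal illustration and not the library's implementation: the function name `silhouette`, its argument names and the default Euclidean metric are assumptions made for the example, and a denominator of 0 is mapped to a score of 0 by convention.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, available since Python 3.8


def silhouette(points, labels, metric=dist):
    """Cluster-averaged silhouette S of a hard clustering.

    points : sequence of objects of some type R
    labels : labels[i] is the cluster id of points[i]
    metric : callable (R, R) -> float, plays the role of D(., .)
    """
    clusters = defaultdict(list)
    for idx, label in enumerate(labels):
        clusters[label].append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    def mean_dist(i, members):
        return sum(metric(points[i], points[j]) for j in members) / len(members)

    s = {}
    for label, members in clusters.items():
        for i in members:
            if len(members) == 1:
                s[i] = 0.0  # s(i) = 0 for a singleton cluster
                continue
            # a(i): mean dissimilarity to the other points of the same cluster
            a = mean_dist(i, [j for j in members if j != i])
            # b(i): smallest mean dissimilarity to any other cluster
            b = min(mean_dist(i, m) for l, m in clusters.items() if l != label)
            denom = max(a, b)
            s[i] = (b - a) / denom if denom > 0 else 0.0

    # S: mean of s(i) inside each cluster, then mean over the K clusters
    return sum(sum(s[i] for i in m) / len(m) for m in clusters.values()) / len(clusters)


points = [(0.0,), (1.0,), (10.0,), (11.0,)]
print(silhouette(points, [0, 0, 1, 1]))  # about 0.90: two compact, distant groups
print(silhouette(points, [0, 1, 0, 1]))  # about -0.45: points mixed across groups
```

The two nested loops over points are what make the cost quadratic in the number of points.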

Scalability

Computational complexity is $O(n^2)$ in the number $n$ of points, since every pair of points is compared.

Input

A collection of values of the same type, from the usual ones to any other, as long as an associated dissimilarity measure exists.

Currently supported types are :

  • Numerical vector
  • Binary vector
  • Mixed vector
  • Monovariate time series
  • Multivariate time series

Parameters

1 : metric : a dissimilarity measure for the given input type.

Let R be the type of the data values. Then the given metric parameter must be of type :

  • $D : (R, R) \Rightarrow$ Numerical value
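
As an illustration of this signature, here are two possible metrics written in Python. The alias `Dissimilarity` and the function names are hypothetical and only serve the example; any callable taking two values of type R and returning a number would fit.

```python
from math import sqrt
from typing import Callable, Sequence, TypeVar

R = TypeVar("R")
# Hypothetical alias mirroring D : (R, R) => Numerical value
Dissimilarity = Callable[[R, R], float]


def euclidean(x: Sequence[float], y: Sequence[float]) -> float:
    """Euclidean distance, a possible metric for numerical vectors."""
    return sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def hamming(x: Sequence[int], y: Sequence[int]) -> float:
    """Hamming distance (number of differing positions), a possible metric for binary vectors."""
    return float(sum(xi != yi for xi, yi in zip(x, y)))


print(euclidean((0.0, 0.0), (3.0, 4.0)))  # 5.0
print(hamming((1, 0, 1), (1, 1, 0)))      # 2.0
```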

Output format

A score as a real value.

Associated visualization

Every visualization dealing with a collection of numerical values :

  • Bar chart / Histogram

Practical strategy

Like many other indices, the Silhouette index is often used to fine-tune the hyperparameters of a pipeline returning a HardClustering or a Clusters. A well-known example is the estimation of the best value of the hyperparameter $K$ for $K$-Means, $K$-Modes, …
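
The selection of $K$ can be sketched as follows: run the clustering for each candidate value, score each result with the Silhouette index, and keep the value with the highest score. Everything in this Python example is an assumption made for illustration: `kmeans_1d` is a toy one-dimensional K-Means with a deterministic initialization, `silhouette` and `best_k` are hypothetical helper names, and the data set is made up.

```python
from collections import defaultdict


def silhouette(values, labels, metric):
    """Cluster-averaged silhouette, following the formulas of this page."""
    clusters = defaultdict(list)
    for i, label in enumerate(labels):
        clusters[label].append(i)
    if len(clusters) < 2:
        raise ValueError("at least two non-empty clusters are required")

    def mean_d(i, members):
        return sum(metric(values[i], values[j]) for j in members) / len(members)

    per_cluster = []
    for label, members in clusters.items():
        scores = []
        for i in members:
            if len(members) == 1:
                scores.append(0.0)  # singleton cluster
                continue
            a = mean_d(i, [j for j in members if j != i])
            b = min(mean_d(i, m) for l, m in clusters.items() if l != label)
            scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
        per_cluster.append(sum(scores) / len(scores))
    return sum(per_cluster) / len(per_cluster)


def kmeans_1d(values, k, iterations=20):
    """Toy Lloyd-style K-Means on 1-D data (illustration only)."""
    ordered = sorted(values)
    # Deterministic start: centers spread evenly over the sorted values
    centers = [ordered[(2 * c + 1) * len(ordered) // (2 * k)] for c in range(k)]
    labels = []
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        for c in range(k):
            members = [v for v, l in zip(values, labels) if l == c]
            if members:  # keep the previous center if a cluster is empty
                centers[c] = sum(members) / len(members)
    return labels


def best_k(values, candidates, metric):
    """Return the candidate K with the highest silhouette, plus all scores."""
    scores = {k: silhouette(values, kmeans_1d(values, k), metric) for k in candidates}
    return max(scores, key=scores.get), scores


values = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9, 9.0, 9.2, 8.8]  # three visible groups
k, scores = best_k(values, range(2, 5), lambda x, y: abs(x - y))
print(k)  # 3 for this toy data set
```

Since the index costs $O(n^2)$ per evaluation, scanning many candidate values on a large data set can become expensive; scoring a sample of the points is a common workaround.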

Business case

Usage