Silhouette
Labels / Tags
- Internal Quality index
- NumericalEvaluation
Principle
Silhouette is a numerical index used to assess the quality of a HardClustering. For each point, it compares the average distance to the other points of its own cluster with the average distance to the points of the nearest neighboring cluster.
The silhouette plot displays a measure of how close each point in one cluster is to the points in the neighboring clusters, and thus provides a visual way to assess parameters such as the number of clusters. This measure ranges over [-1, 1]:
- 1: the clusters are well separated from each other and clearly distinguished.
- 0: the clusters are indistinguishable, that is, the distance between clusters is not significant.
- -1: points have been assigned to the wrong clusters.
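$a(i) = \frac{1}{card(C_I) - 1}\sum_{j \in C_I, j \neq i}D(i, j)$
$b(i) = \min_{J \neq I}\left(\frac{1}{card(C_J)}\sum_{j \in C_J}D(i, j)\right)$
$s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}$ if $card(C_I) > 1$, and $s(i) = 0$ otherwise
$S = \frac{1}{K}\sum_{k=1}^{K}{\frac{1}{card(C_k)}\sum_{i \in C_k}{s(i)}}$
Where :
- $C_I$ is the cluster containing point $i$, $C_J$ is any other cluster, and $card(C)$ is the number of points in cluster $C$
- $D(., .)$ is the dissimilarity measure between two points
- $K$ is the number of clusters and $C_k$ is the $k$-th cluster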
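The formulas above can be sketched in a few lines of Python. This is only an illustrative sketch, not the library's implementation: the function name `silhouette`, its argument names, and the choice of the Euclidean distance (`math.dist`) as the default $D$ are assumptions made for the example.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, used here as an example dissimilarity D


def silhouette(points, labels, metric=dist):
    """Return (S, s): the global index S and the list of per-point values s(i)."""
    # Group point indices by cluster label.
    clusters = defaultdict(list)
    for idx, label in enumerate(labels):
        clusters[label].append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    s = [0.0] * len(points)
    for i, own in enumerate(labels):
        members = clusters[own]
        if len(members) == 1:
            continue  # s(i) = 0 by convention for singleton clusters
        # a(i): mean dissimilarity to the other points of the same cluster
        a = sum(metric(points[i], points[j]) for j in members if j != i) / (len(members) - 1)
        # b(i): smallest mean dissimilarity to the points of another cluster
        b = min(
            sum(metric(points[i], points[j]) for j in other) / len(other)
            for label, other in clusters.items()
            if label != own
        )
        denom = max(a, b)
        s[i] = (b - a) / denom if denom > 0 else 0.0

    # S: mean over clusters of the per-cluster mean of s(i)
    S = sum(sum(s[i] for i in m) / len(m) for m in clusters.values()) / len(clusters)
    return S, s


# Two tight, well-separated groups: S is close to 1 (roughly 0.90 here).
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
S_good, _ = silhouette(points, [0, 0, 1, 1])
# The same points with labels that mix the two groups: S becomes negative.
S_bad, _ = silhouette(points, [0, 1, 0, 1])
print(S_good, S_bad)
```

The double loop over pairs of points makes the quadratic cost of the index visible.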
Scalability
Computational complexity is in $O(n^2)$, where $n$ is the number of points, since the dissimilarity is evaluated for every pair of points.
Input
A collection of values of the same type, from the usual ones to any other, as long as an associated dissimilarity measure exists.
Currently supported types are :
- Numerical vector
- Binary vector
- Mixed vector
- Monovariate time series
- Multivariate time series
Parameters
1 : metric : a dissimilarity measure for the given input type.
Let R be the type of the data values. The given metric parameter must then be of type :
- $D$ : (R, R) $=>$ Numerical value
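As an illustration of the expected shape of the metric parameter, here are two functions that take two values of the same type R and return a number. The name `hamming` is hypothetical and chosen for this example; `math.dist` is the standard-library Euclidean distance. Neither is claimed to be the measure shipped with the library.

```python
from math import dist  # (numerical vector, numerical vector) => float


def hamming(u, v):
    """(binary vector, binary vector) => number of positions where they differ."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(x != y for x, y in zip(u, v))


# R = binary vector
d_binary = hamming((1, 0, 1, 1), (1, 1, 0, 1))
# R = numerical vector
d_numeric = dist((0.0, 0.0), (3.0, 4.0))
print(d_binary, d_numeric)
```

Any function with this two-argument shape can play the role of $D$, which is what lets the index apply to types other than numerical vectors.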
Output format
A score as a real value.
Associated visualization
Any visualization dealing with a collection of numerical values :
- Bar chart / Histogram
Practical strategy
The Silhouette index, like many other indices, is often used to fine-tune the hyperparameters of a pipeline returning a HardClustering or a Clusters. A well-known example is estimating the best value of the $K$ hyperparameter for $K$-Means, $K$-Modes, …
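A minimal sketch of this strategy: run the clustering for several candidate values of $K$ and keep the one with the highest silhouette. Everything here is an assumption made for illustration: `toy_kmeans` is a deliberately naive stand-in for a real $K$-Means, `silhouette_index` and `best_k` are hypothetical names, and the data set is a toy example of three well-separated groups.

```python
from collections import defaultdict
from math import dist


def silhouette_index(points, labels, metric=dist):
    """Global silhouette S: mean over clusters of the per-cluster mean of s(i)."""
    clusters = defaultdict(list)
    for idx, label in enumerate(labels):
        clusters[label].append(idx)

    def mean_to(i, members):
        return sum(metric(points[i], points[j]) for j in members) / len(members)

    per_cluster = []
    for label, members in clusters.items():
        total = 0.0
        for i in members:
            if len(members) == 1:
                continue  # s(i) = 0 for singleton clusters
            a = mean_to(i, [j for j in members if j != i])
            b = min(mean_to(i, other) for other_label, other in clusters.items()
                    if other_label != label)
            total += (b - a) / max(a, b) if max(a, b) > 0 else 0.0
        per_cluster.append(total / len(members))
    return sum(per_cluster) / len(per_cluster)


def toy_kmeans(points, k, iterations=20):
    """Naive Lloyd iterations with a deterministic start (illustration only)."""
    ordered = sorted(points)
    # Initial centers: k points spread evenly over the sorted data.
    centers = [ordered[(2 * c + 1) * len(ordered) // (2 * k)] for c in range(k)]
    labels = []
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the previous center if a cluster becomes empty
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def best_k(points, candidates):
    """Return (k with the highest silhouette, all scores by k)."""
    scores = {}
    for k in candidates:
        labels = toy_kmeans(points, k)
        if len(set(labels)) >= 2:  # the index needs at least two clusters
            scores[k] = silhouette_index(points, labels)
    return max(scores, key=scores.get), scores


# Three well-separated groups of three points each.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
        (20.0, 0.0), (20.0, 1.0), (21.0, 0.0)]
chosen_k, scores = best_k(data, range(2, 6))
print(chosen_k, scores)
```

On this toy data the highest score is reached for $K = 3$. Note that the score averages per-cluster means, as in the definition of $S$ above, so a small cluster weighs as much as a large one.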
Business case
Usage