KNN HardClustering Predictor

Principle

This predictor is quiet similar to KNNCategoricalClusteringPredictor ref_to on its process, i.e for a single query it will compute the $K$NN for each value from the input HardClustering given a dissimilarity measure on values’type, it will be symbolized by R in this document.

The result for a single query will be its associate ClusterId, it will be compute from the list of $K$NN. Informations of their cluster belonging will be sum up then a majority vote will be applied choosing the most represented ClusterId.

As each clustering predictor $K$NNHardClusteringPredictor is able to predict either a Distribution or a plain Clustering if the query is a single value or a collection of them.

Reminders

The problem of the K Nearest Neighbors often abbreviated K-NN consists for a given set $S$ of data to retrieve the K neirest neighbors for a particular query.

Scalability

When dissimilarity measure complexity is considered negligeable the computational complexity of a $K$NN query is in O(K.n) where :

K is the number of neirest neighbors to look for.
n is the number of data observations.

The majority vote step is considered as negligeable.

Have in mind that this is for a single query, if you apply this predictor with a search HardClustering of size n with the number of nearest neighbors hyperparameter $K$ on a collection (NB : should Instance be prefered) of queries of size $m$ then computational complexity becomes $O(K.n.m)$ which if $n=m$ is equivalent to $O(n^2)$.

Be sure to know the computational complexity of used dissimilarity measure.

Input

Single query

The input’s type must be the same as the values contained in the HardClustering.

A value of any type R.

Multiple queries, i.e collection of data observations

A collection of values of type R.

Parameters

1 : A HardClustering with observation representation of type `R`

It is the collection of HardClusteringEntity including data observation representation of type R on which $K$NN will be applied.

2 : A dissimilarity measure on data representation `R`

It is the dissimilarity measure over queries and values the HardClustering which share a same type R.

D : (R, R) => Numerical value / Double

3 : $K$, the number of neirest neighbors

It is the integer value used in the $K$NN algorithm.

Output

Single query

Returns the ClusterId fof the cluster which have more of its points into the $K$NN bucket.

Multiple queries, i.e. a collection of queries

Returns a HardClustering.

Associated visualizations

HardClustering