KNN HardClustering Predictor
Labels / Tags
- Predictor
- AnyType
- Numerical vector / R^d
- Binary vector / {0, 1}^d
- Mixed vector / …
- Monovariate time series
- Multivariate time series
- …
Principle
This predictor is quiet similar to KNNCategoricalClusteringPredictor ref_to on its process, i.e for a single query it will compute the $K$NN for each value from the input HardClustering given a dissimilarity measure on values’type, it will be symbolized by R in this document.
The result for a single query will be its associate ClusterId, it will be compute from the list of $K$NN. Informations of their cluster belonging will be sum up then a majority vote will be applied choosing the most represented ClusterId.
As each clustering predictor $K$NNHardClusteringPredictor is able to predict either a Distribution or a plain Clustering if the query is a single value or a collection of them.
Reminders
The problem of the K Nearest Neighbors often abbreviated K-NN consists for a given set $S$ of data to retrieve the K neirest neighbors for a particular query.
Scalability
When dissimilarity measure complexity is considered negligeable the computational complexity of a $K$NN query is in O(K.n) where :
- K is the number of neirest neighbors to look for.
- n is the number of data observations.
The majority vote step is considered as negligeable.
Have in mind that this is for a single query, if you apply this predictor with a search HardClustering of size n with the number of nearest neighbors hyperparameter $K$ on a collection (NB : should Instance be prefered) of queries of size $m$ then computational complexity becomes $O(K.n.m)$ which if $n=m$ is equivalent to $O(n^2)$.
Be sure to know the computational complexity of used dissimilarity measure.
Input
Single query
The input’s type must be the same as the values contained in the HardClustering.
A value of any type R.
Multiple queries, i.e collection of data observations
A collection of values of type R.
Parameters
1 : A HardClustering with observation representation of type R
It is the collection of HardClusteringEntity including data observation representation of type R on which $K$NN will be applied.
2 : A dissimilarity measure on data representation R
It is the dissimilarity measure over queries and values the HardClustering which share a same type R.
- D : (
R,R) => Numerical value / Double
3 : $K$, the number of neirest neighbors
It is the integer value used in the $K$NN algorithm.
Output
Single query
Returns the ClusterId fof the cluster which have more of its points into the $K$NN bucket.
Multiple queries, i.e. a collection of queries
Returns a HardClustering.
Associated visualizations
- HardClustering