DBSCAN
Labels / Tags
- Clustering
Principle
DBSCAN is a clustering algorithm that works on any space, as long as a metric associated with the input data type is given.
Density-based clustering has the ability to discover arbitrary-shape clusters and to handle noise. In density-based clustering methods, clusters are formed from dense areas that are separated by sparse areas. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is one of the most well-known density-based clustering algorithms. The density of each observation is defined by the number of observations close to that particular point, called the point's neighborhood. A dense neighborhood is defined by two user-specified parameters: the radius ($\epsilon$) of the neighborhood (the $\epsilon$-neighborhood), and the minimum number of objects in the neighborhood ($MinPts$).
DBSCAN starts by randomly selecting a point and checking whether the $\epsilon$-neighborhood of the point contains at least $MinPts$ observations. If not, the point is considered a noise point; otherwise it is considered a core point and a new cluster is created. DBSCAN then iteratively adds the data points that do not yet belong to any cluster and are directly density-reachable from the core points of the new cluster. When the new cluster can no longer be expanded, it is complete. To find the next cluster, DBSCAN randomly selects an unvisited data point, and the clustering process continues until all points have been visited and no new observation can be added to any cluster.
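The expansion loop above can be sketched as follows. This is a minimal, illustrative Python implementation of the classic DBSCAN procedure, not HephIA's implementation; the function and variable names are made up for the example, and the metric is any callable taking two observations.

```python
import math

def region_query(points, metric, i, eps):
    """Indices of all points within eps of points[i] (its eps-neighborhood)."""
    return [j for j in range(len(points)) if metric(points[i], points[j]) <= eps]

def dbscan(points, metric, eps, min_pts):
    """Return one cluster label per point; -1 marks noise."""
    NOISE, UNVISITED = -1, None
    labels = [UNVISITED] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, metric, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE           # may later be absorbed as a border point
            continue
        labels[i] = cluster_id          # i is a core point: start a new cluster
        seeds, k = list(neighbors), 0
        while k < len(seeds):           # expand the cluster through core points
            j = seeds[k]
            k += 1
            if labels[j] == NOISE:      # border point reached from a core point
                labels[j] = cluster_id
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, metric, j, eps)
            if len(j_neighbors) >= min_pts:  # j is itself a core point
                seeds.extend(j_neighbors)
        cluster_id += 1
    return labels

data = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.0), (10.0, 10.0), (10.1, 10.0), (50.0, 50.0)]
print(dbscan(data, math.dist, eps=1.0, min_pts=2))  # -> [0, 0, 0, 1, 1, -1]
```

The two tight groups become clusters 0 and 1, while the isolated point is labeled noise.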
It can be considered a close cousin of the $\epsilon$-Proximity algorithm, except that here we look for density rather than mere proximity.
Scalability
Computational complexity is in $O(n^2)$,
where:
- $n$ is the number of data points.
Memory complexity of DBScan can be either $O(n^2)$ or $O(n)$ depending on the implementation. Indeed, the DBScan algorithm has to compute the distance between each pair of data observations; gathering them in what we call a DistanceMatrix ref_to_DistanceMatrix_type has the advantage of giving fast access to the distance for any given pair of points, at the cost of a significant $O(n^2)$ memory footprint. We can then distinguish two scenarios, each with its pros and cons:
- The distance matrix ref_to_DistanceMatrix between every pair of data observations is used, which implies a quadratic memory cost $O(n^2)$. It can either be given as an input argument, and thus inevitably held in memory, or, if not, it can be computed a single time at the beginning of the algorithm.
- Distances are computed on the fly, requiring little memory per distance computed. The cost of this solution is that it requires computing the distance between the same pair of points multiple times, which slows down the overall computation.
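The two scenarios above can be contrasted with a small, illustrative Python sketch (the function names are made up for the example; neither is HephIA's API):

```python
import math

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
metric = math.dist
n = len(points)

# Scenario 1: precompute the full distance matrix once -- O(n^2) memory,
# but every later lookup is a constant-time array access.
dist_matrix = [[metric(points[i], points[j]) for j in range(n)] for i in range(n)]

def neighbors_from_matrix(i, eps):
    return [j for j in range(n) if dist_matrix[i][j] <= eps]

# Scenario 2: compute distances on the fly -- O(1) extra memory,
# but the same pair may be recomputed across repeated queries.
def neighbors_on_the_fly(i, eps):
    return [j for j in range(n) if metric(points[i], points[j]) <= eps]

print(neighbors_from_matrix(0, 1.5))   # -> [0, 1]
print(neighbors_on_the_fly(0, 1.5))    # -> [0, 1] (same result, different cost)
```

Both return identical neighborhoods; the choice is purely a memory/time trade-off.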
DBScan can be parallelized for better performance, although its complexity remains quadratic.
Input
- A collection of data representations (R) of any nature.
Parameters
- A metric on R.
- $\epsilon$: the radius of the hyperball in which to look for neighbors.
- $MinPts$: the minimal number of points that must be present in the hyperball for points to be considered part of the same cluster.
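To make the roles of $\epsilon$ and $MinPts$ concrete, here is a small illustrative Python check of the core-point condition (the helper name `is_core_point` is invented for the example):

```python
import math

def is_core_point(points, metric, i, eps, min_pts):
    """A point is 'core' iff its eps-neighborhood (itself included) holds >= min_pts points."""
    neighborhood = [j for j in range(len(points)) if metric(points[i], points[j]) <= eps]
    return len(neighborhood) >= min_pts

data = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (5.0, 5.0)]
print(is_core_point(data, math.dist, 0, eps=1.0, min_pts=3))  # True: dense region
print(is_core_point(data, math.dist, 3, eps=1.0, min_pts=3))  # False: isolated point
```

Because the check relies only on the metric, the same logic applies to any data type for which a distance function is supplied.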
Output
As with the $\epsilon$-Proximity ref_todo_ algorithm, DBScan returns a Clusters ref_to_Clusters_type model, i.e. a collection of Cluster ref_to_Cluster_type. For those unfamiliar with HephIA concepts, the Clusters model has a sibling, the HardClustering model. Both hold the cluster affectation of each data observation and differ only in their data structure.
Predictor
The DBScan algorithm's output being a Clusters model, it already contains the affectation of each observation, since observations are gathered into their respective clusters.
Some predictors can be used with a Clusters model, directly or through a conversion, such as ClosestPrototypePredictor or KNNHardClusteringPredictor.
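As an illustration of how a closest-prototype predictor typically behaves (this is a hedged sketch of the general idea, not the actual ClosestPrototypePredictor implementation), one prototype per cluster, here the centroid, is derived from the Clusters output, and a new observation is assigned to the cluster with the nearest prototype:

```python
import math

# Hypothetical Clusters output: each cluster is a list of observations.
clusters = [
    [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3)],
    [(9.8, 10.0), (10.2, 9.9)],
]

def centroid(cluster):
    """Coordinate-wise mean of a cluster's observations."""
    dim = len(cluster[0])
    return tuple(sum(p[d] for p in cluster) / len(cluster) for d in range(dim))

prototypes = [centroid(c) for c in clusters]

def predict(observation):
    """Assign a new observation to the cluster whose prototype is nearest."""
    dists = [math.dist(observation, proto) for proto in prototypes]
    return dists.index(min(dists))

print(predict((0.5, 0.5)))    # -> 0
print(predict((11.0, 10.0)))  # -> 1
```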
Associated visualization
- Clusters-like.
- HardClustering-like.
- Prototypes-like.
Practical strategies
Recommended association
Business case
Usage