Patchwork

Labels / Tags

  • Clustering
  • Grid-based
  • Density

Principle

PatchWork is a clustering algorithm mixing both grid-based and density clustering algorithm. One strengh of this algorithm is to have a linear complexity (and near linear horizontal scalability, find precise meaning).

It consist to affect Cell Key value to each data observation and compute density of each cell. Once this step is achieved, output PatchWorkClusters are generate by aggregating neighbouring cells regarding their densities. A cluster is then represented by a set of cells.

Scalability

Computational complexity is in O(n).

Where :

  • n is the number of data points.

Input

We suppose a numerical vector of dimensionality d.

A collection of numerical vector.

Parameters

1 : A continous metric, by default Euclidean.

D : (R^d, R^d) => numerical value, where d is the dimensionality of numerical vectors.

2 : epsilon, the size of the cell on each of its $d$ dimensions.

3 : minPts, the threshold to specify the minimum number of data points needed in a cell to consider it as part of a cluster.

4 : ratio, the parameter to control the density threshold used to expand existing clusters.

5 : minCell : The threshold to filter out clusters with too few cells. This parameters allows users to keep only spatially large clusters rather that dense ones only.

Ouput

The PatchworkModel contains :

  • The cardinality of each cell.
  • The list of generated PatchWorkClusters
    • A PatchworkCluster is a list containing PatchWorkCell keys.

Predictor

PatchWorkPredictor is the default predictor that can be applied with the PatchWorkModel.

  • Single data : ClusterId (Int).
  • Collection of data : HardClustering.

Associated visualization

  • Cluster cardinality visualization.
  • Explore cells density. (We don’t know the arrangement of these last)

Practical strategies

PatchWork was essentially tested on $R^2$ dataset which allows best performances, indeed the more numerical dimension defined the data, the more cells could exist which require a careful setting to return decent results.

Business case

Due to its original approach PatchWork has a better scalability than other clustering algorithm such as K-Means. It could be a fair solution if you have to deal with tremendous amount of data with which other algorithm could be too long.

Usage