Overview

The ambition of the HephIA product is to allow any user to easily build nested unsupervised data pipelines through our Python API, while being able to choose the execution environment on which existing and new pipelines run. All generated results and inputs will be stored and accessible through our Python API (and later a UI). They can be reused in other processing pipelines, or visualized whenever their type matches one of our visualizations.

Model

The concept of Model is at the core of the HephIA Scala Engine; almost everything can be considered a model, such as:

  • Input datasets, which are converted to regular Scala Engine implementations such as:

    • Instance
    • Extension
    • HardClustering
    • CategoricalClustering
  • Algorithms, which always return Models, such as the results of:

  • Clustering algorithm.

  • Dimension reduction algorithm.

  • Outlier detection algorithm.

  • Preprocessing algorithm.

  • Evaluation algorithm.

Briefly, a Model is a wrapper around what we define as a Memory, which can be anything from a numerical value (e.g. DoubleEvaluation) to a collection of data points (e.g. Instance), by way of any primitive type or complex class.
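To make the Model/Memory relationship concrete, here is a minimal Python sketch (the actual engine is written in Scala; the class layout below is illustrative, and only the names Model, DoubleEvaluation and Instance come from the text):

```python
class Model:
    """A Model simply wraps a Memory, whatever its type."""
    def __init__(self, memory):
        self.memory = memory


class DoubleEvaluation(Model):
    """Memory is a single numerical value."""
    def __init__(self, value):
        super().__init__(value)


class Instance(Model):
    """Memory is a collection of data points."""
    def __init__(self, points):
        super().__init__(list(points))


score = DoubleEvaluation(0.87)
dataset = Instance([[1.0, 2.0], [3.0, 4.0]])
```

Whatever the concrete model, user code can always reach the wrapped value through its memory.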


An Algorithm will always yield a Model; it is just data, which can take various forms. HephIA splits models into two main categories that can be combined. A GlobalModel has information (features) shared by the whole model, such as PatchWorkModel, which has a global field gathering the cardinality of each cell ref_todo. On the other hand, the concept of TraversableModel represents a structure of data points that can be accessed in various ways; HephIA currently handles tree and collection structures. Coming back to PatchWorkModel, it also has a TraversableModel part consisting of the collection of computed clusters, each containing the list of cells defining its area of space. We use the word cluster here in the PatchWorkModel sense, not as a collection of data points sharing values over features. PatchWorkModel is thus an example of a combined model. As said above, other Models can be TraversableModel descendants only; $K$-Means is an example, because it only contains the collection of KMeansClusters, each composed of the prototype (mean) of the cluster, its ClusterId (an Int describing which cluster it belongs to), and finally the cardinality of the cluster computed at the last iteration.

(Graph structures will be supported later.)
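The $K$-Means case above can be sketched as a purely traversable model. This is an illustrative Python sketch (the real classes are Scala; the field names prototype, ClusterId and cardinality come from the text, the rest is hypothetical):

```python
class TraversableModel:
    """A model whose memory is a traversable structure of elements."""
    def __init__(self, elements):
        self.memory = list(elements)

    def __iter__(self):
        return iter(self.memory)


class KMeansCluster:
    """prototype (mean), cluster_id (Int) and cardinality, as described above."""
    def __init__(self, prototype, cluster_id, cardinality):
        self.prototype = prototype
        self.cluster_id = cluster_id
        self.cardinality = cardinality


class KMeansModel(TraversableModel):
    """Purely traversable: its memory is only the collection of clusters."""


model = KMeansModel([
    KMeansCluster(prototype=[0.1, 0.2], cluster_id=0, cardinality=42),
    KMeansCluster(prototype=[5.0, 4.8], cluster_id=1, cardinality=58),
])
total_points = sum(c.cardinality for c in model)
```

A combined model such as PatchWorkModel would, on top of this traversable part, also expose global fields shared by the whole model.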

Traversable models

As briefly explained above, a TraversableModel describes a model whose memory is a data structure that can be traversed. The data observations composing it are accessed in ways specific to each model, so the time complexity of accessing one or many elements differs from one structure to another.

Many basic HephIA types inherit from TraversableModel such as :

  • Instance
  • TreeModel
  • Extension
  • Clusters
  • Clustering
  • HardClustering

And also algorithm results:
  • KMeansModel
  • KModesModel
  • DBScanModel

We will dive progressively into these essential models in the rest of this document.

As data point containers, TraversableModels know the type of the data they carry through a generic type parameter. You will often see this kind of bracket ([]) throughout the library.

TraversableModel[E]

The E is a generic type parameter; here it is unbounded, which means it can be instantiated with any type.
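In Python terms, an unbounded E corresponds to an unconstrained type variable. This is only an analogy sketch; the engine expresses this with Scala generics:

```python
from typing import Generic, Iterator, TypeVar

E = TypeVar("E")  # unbounded: E may be instantiated with any type


class TraversableModel(Generic[E]):
    def __init__(self, elements: list[E]):
        self.memory = elements

    def __iter__(self) -> Iterator[E]:
        return iter(self.memory)


ints = TraversableModel([1, 2, 3])    # behaves as a TraversableModel[int]
words = TraversableModel(["a", "b"])  # behaves as a TraversableModel[str]
```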

Instance models

Instance is a TraversableModel specific to data collections. It is itself divided into many descendants:

  • VectorInstance
  • ParVectorInstance
  • ListInstance
  • ArrayInstance
  • ParArrayInstance
  • RDDInstance

Like every model, all these models have a memory, which here is the collection itself. We easily deduce that an ArrayInstance has the Array data structure as its memory, and that an RDDInstance contains an RDD.

As a generic description, the Instance model is designed to handle many collection types, including those exposed above, and new ones if needed.
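The idea that each Instance variant simply stores its backing collection as memory can be sketched as follows (illustrative Python; Python's array stands in for Scala's Array, and the RDD-backed variant is omitted since it would require Spark):

```python
import array


class Instance:
    """Every Instance's memory is the backing collection itself."""
    def __init__(self, memory):
        self.memory = memory


class ListInstance(Instance):
    def __init__(self, points):
        super().__init__(list(points))


class ArrayInstance(Instance):
    """Backed by a compact numeric array (stand-in for a Scala Array)."""
    def __init__(self, points):
        super().__init__(array.array("d", points))


xs = ListInstance([1.0, 2.0, 3.0])
ys = ArrayInstance([1.0, 2.0, 3.0])
```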

There is one axis separating two important sibling concepts: mutability and immutability. Hence there are two direct descendants of Instance:

  • ImmutableInstance
  • MutableInstance

Every Instance implements one of these two traits.

Extension

Tree models

A TreeModel possesses a memory which is a tree data structure. Contrary to Instance, TreeModels add a notion of hierarchy between the data they carry.
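The hierarchy notion can be sketched like this (illustrative Python; the node layout and the traversal method name are hypothetical, not the engine's API):

```python
class TreeNode:
    def __init__(self, value, children=()):
        self.value = value
        self.children = list(children)


class TreeModel:
    """Memory is the tree itself; traversal exposes the hierarchy."""
    def __init__(self, root):
        self.memory = root

    def depth_first(self):
        # Pre-order walk: a parent is always visited before its children.
        stack = [self.memory]
        while stack:
            node = stack.pop()
            yield node.value
            stack.extend(reversed(node.children))


tree = TreeModel(TreeNode("root", [TreeNode("left"), TreeNode("right")]))
```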

Clustering models

Within the HephIA product, the concept of Clustering is at every corner, so it is important to understand it as we have defined it. Clustering represents the common ancestor of every other clustering type. It is a collection of data observations, each wrapped in a ClusteringEntity and sharing at least three features:

  • Id : The identifier of the representation / data observation.
  • rep : The representation of the data, for example a numerical vector, a time series, … A representation R can be any type.
  • dist : The generative distribution of the ClusterId (Int).

As seen previously, basic Clustering models are just descendants of TraversableModel with a specific entity type E which is a ClusteringEntity. The currently provided Clustering types mostly differ in the distribution they define.
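Assuming a field layout that matches the three features listed above, a ClusteringEntity could be sketched as follows (illustrative Python; only the names id, rep and dist come from the text):

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class ClusteringEntity:
    id: int    # identifier of the representation / data observation
    rep: Any   # representation R: any type (numerical vector, time series, ...)
    dist: Any  # generative distribution of the ClusterId (Int)


entity = ClusteringEntity(id=0, rep=[0.5, 1.5], dist={0: 0.9, 1: 0.1})
```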

CategoricalClustering models

A categorical distribution describes the different outcomes of a random variable: result values fall into one of $K$ possible categories, each of which has its own probability, and these probabilities sum up to 1.

A CategoricalClustering is then a specific Clustering where the data observations' distribution is a categorical one.
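In the simplest encoding, such a distribution for one observation is just $K$ probabilities summing to 1 (the values below are illustrative, not taken from the engine):

```python
# A categorical distribution over K possible categories: one probability
# per category, all summing to 1.
dist = [0.7, 0.2, 0.1]  # K = 3

assert abs(sum(dist) - 1.0) < 1e-9

# The most probable category for this observation:
best = max(range(len(dist)), key=lambda k: dist[k])
```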

HardClustering models

A HardClustering is a clustering where each data point is assigned to a specific cluster through the ClusterId it carries within its dist field; this is still a distribution, called constant. There is one and only one cluster membership per point.
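A constant distribution puts all probability mass on a single ClusterId, which is what makes the membership "hard". A minimal sketch, with a hypothetical class name and method:

```python
class ConstantDistribution:
    """All probability mass on a single ClusterId: hard membership."""
    def __init__(self, cluster_id):
        self.cluster_id = cluster_id

    def probability(self, k):
        # 1 for the assigned cluster, 0 everywhere else.
        return 1.0 if k == self.cluster_id else 0.0


d = ConstantDistribution(2)
```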

Dimension reduction models

Outlier detection models

Preprocessing models

Due to the variety of preprocessing models (gradient ascent, for example), little can be said about them at this common level; systematic patterns across preprocessing algorithms will be documented here as they are identified.

Evaluation models

EvaluationModel is a direct descendant of Model, and the goal of an evaluation is to be compared with other evaluations of the same type in order to get meaningful insights. Hence an EvaluationModel is a Model with an Ordering on its memory (the evaluation descriptor).
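The "Model plus an Ordering on its memory" idea can be sketched in Python with comparison operators standing in for Scala's Ordering (illustrative; only the name DoubleEvaluation comes from the text):

```python
import functools


@functools.total_ordering
class DoubleEvaluation:
    """A Model whose memory carries an ordering, so that evaluations of
    the same type can be meaningfully compared."""
    def __init__(self, memory):
        self.memory = memory

    def __eq__(self, other):
        return self.memory == other.memory

    def __lt__(self, other):
        return self.memory < other.memory


runs = [DoubleEvaluation(0.61), DoubleEvaluation(0.74), DoubleEvaluation(0.58)]
best = max(runs)  # comparison only makes sense between same-typed evaluations
```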

# Sketch of the intended Python API: an algorithm consumes data and
# yields a model, which is itself reusable data.
data = [0, 1, 2, 3]

def kmeans(input_data):
    # Placeholder body: a real implementation would cluster input_data.
    return [v + 1 for v in input_data]

model = kmeans(data)  # the resulting model can feed further pipeline steps