Workflow Manager to Scala Engine
The Workflow Manager to Scala Engine payload is currently sent as a single JSON and contains four keys:

- data: The payload sent by the user, which follows the client to workflow manager templates.
- path_jar: A string giving the explicit path to the jar run by the Scala Engine.
- db_context: The DBContext JSON, which for the moment can only be a MongoContext.
- processing_id: The unique processing identifier generated by the workflow manager to identify this processing.

Here is a JSON example:
```json
{
  "data": {
    "processingKeyword": "distinct_processing_name_to_leverage_exec",
    "customer": "username",
    "name": "user defined name of the representation",
    "creationTS": 17267393322,
    "latestUpdateTS": 17267393344,
    "status": 1,
    "processingContext": {
      "processingName": "user define name, ex SOM",
      "editionContext": "user",
      "callingContext": "hephIA-solution",
      "view": {
        "id": "637ce534dd85c10875c4fe26",
        "name": "view_11-22-2022_15:05:24"
      },
      "dataset": {
        "name": "my_dataset",
        "collection": "datasets"
      },
      "project": {
        "id": 2,
        "name": "SOM"
      }
    }
  },
  "path_jar": "/path/to/have/access/to/engine.jar",
  "db_context": {
    "database": "dev",
    "username": "cloudOps",
    "password": "O;mrspaq1i8",
    "host": "mongo.dev.hephia-product.com",
    "port": 27
  },
  "processing_id": "62bda1197750ec088f984461"
}
```
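
On the Scala Engine side, a payload like this can be decoded directly into case classes. The snippet below is a minimal, hypothetical sketch assuming circe for JSON decoding; the class and field names simply mirror the four keys above and are not the engine's actual types.

```scala
import io.circe.Json
import io.circe.generic.auto._
import io.circe.parser.decode

// Hypothetical sketch (assuming circe): the field names mirror the payload keys,
// snake_case included, so automatic derivation can decode the raw JSON string.
final case class DbContext(
  database: String,
  username: String,
  password: String,
  host: String,
  port: Int
)

final case class EnginePayload(
  data: Json,           // kept as raw JSON since its schema depends on the processing
  path_jar: String,
  db_context: DbContext,
  processing_id: String
)

object EnginePayloadDecoder {
  def parse(raw: String): Either[io.circe.Error, EnginePayload] =
    decode[EnginePayload](raw)
}
```

The `data` field is kept as raw JSON here because its exact schema depends on which processing is being launched.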
Workflow Manager to Scala Engine payload mandatory keys
The keys processingKeyword, customer, name, startTS, latestUpdateTS, status, processingContext, processingId, processingName, callingContext, editionContext, dataset, name, project, id, and name are unchanged and follow the classic mandatory payload; new keys can be added for specific needs and are described below:

- dataLocations: A dictionary whose keys are named according to the nature of the data and whose values are the dataLocationIds.
- hyperParameters: A process-specific dictionary.
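
As an illustration, these optional keys could be decoded alongside the mandatory ones as in the hedged sketch below (again assuming circe; ProcessingData and the example values are purely illustrative, not documented names).

```scala
import io.circe.Json
import io.circe.generic.auto._
import io.circe.parser.decode

// Hypothetical sketch of a processing-specific "data" block carrying the two
// optional keys described above; every name except dataLocations and
// hyperParameters is illustrative, not a documented key.
final case class ProcessingData(
  processingKeyword: String,
  dataLocations: Option[Map[String, String]], // data nature -> dataLocationId
  hyperParameters: Option[Json]               // free-form, process-specific dictionary
)

object ProcessingDataExample extends App {
  val example =
    """{
      |  "processingKeyword": "distinct_processing_name_to_leverage_exec",
      |  "dataLocations": { "learningData": "some-data-location-id" },
      |  "hyperParameters": { "epochs": 10 }
      |}""".stripMargin

  // Prints Right(ProcessingData(...)) when the block matches the expected shape.
  println(decode[ProcessingData](example))
}
```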
(workflow_manager_to_scala_engine_section_home_payload_list)=
Payload list
This section gathers every payload exchanged between the workflow manager and the Scala Engine:
- Load CSV: Load a CSV file and store it properly (the Spark sketch after this list illustrates these steps).
  - It takes the given numerical feature names and keeps them as processing data.
  - It takes the given domain data feature names and keeps them as domain data.
  - It adds a column named ObservationId by default, indexed from $0$ to $N - 1$.
  - It saves the above columns in a parquet file on GCS.
- SOM and HardClustering: Run SOM on the numerical features and predict the corresponding HardClustering on the learning data (see the BMU sketch after this list).
  - Learn a SOM model.
  - Predict the HardClustering with the output model and the learning data.
  - Save the SOM model and the HardClustering in Mongo (later the HardClustering must be saved in parquet).
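
The following is a minimal sketch of the Load CSV steps, assuming the engine runs on Spark; LoadCsvSketch and its parameters are illustrative names, not the engine's actual API.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, monotonically_increasing_id, row_number}

// Minimal Spark sketch of the Load CSV steps (assumed implementation, not the
// engine's actual code): keep the requested columns, add an ObservationId column
// indexed from 0 to N - 1, and write the result as parquet (e.g. to a gs:// path).
object LoadCsvSketch {
  def run(
      spark: SparkSession,
      csvPath: String,
      numericalFeatures: Seq[String],
      domainDataFeatures: Seq[String],
      parquetPath: String
  ): Unit = {
    val raw  = spark.read.option("header", "true").option("inferSchema", "true").csv(csvPath)
    val kept = raw.select((numericalFeatures ++ domainDataFeatures).map(col): _*)
    // row_number() is 1-based, so subtract 1 to index observations from 0 to N - 1.
    val withId = kept.withColumn(
      "ObservationId",
      row_number().over(Window.orderBy(monotonically_increasing_id())) - 1
    )
    withId.write.mode("overwrite").parquet(parquetPath)
  }
}
```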
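
For the HardClustering prediction step, the sketch below shows the usual best matching unit (BMU) assignment against a learned SOM codebook; it is an assumed, library-free illustration rather than the engine's implementation.

```scala
// Minimal sketch (assumed, not the engine's SOM implementation) of the
// HardClustering prediction step: each observation is assigned to the index of
// its best matching unit (nearest prototype) in the learned SOM codebook.
object HardClusteringSketch {

  private def squaredDistance(a: Array[Double], b: Array[Double]): Double =
    a.iterator.zip(b.iterator).map { case (x, y) => (x - y) * (x - y) }.sum

  /** Returns, for every observation, the index of the closest codebook vector. */
  def predict(codebook: IndexedSeq[Array[Double]], observations: Seq[Array[Double]]): Seq[Int] =
    observations.map { obs =>
      codebook.indices.minBy(i => squaredDistance(codebook(i), obs))
    }
}
```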