Workflow Manager to Scala Engine
The Workflow Manager to Scala Engine payload is currently sent as a single JSON and contains four keys:

- data: The payload sent by the user, which follows the client to workflow manager templates.
- path_jar: A string giving the explicit path to the jar run by the Scala Engine.
- db_context: The DBContext JSON, which for the moment can only be a MongoContext.
- processing_id: The unique processing identifier generated by the workflow manager to identify this processing.

Here is a JSON example:
```json
{
  "data": {
    "processingKeyword": "distinct_processing_name_to_leverage_exec",
    "customer": "username",
    "name": "user defined name of the representation",
    "creationTS": 17267393322,
    "latestUpdateTS": 17267393344,
    "status": 1,
    "processingContext": {
      "processingName": "user define name, ex SOM",
      "editionContext": "user",
      "callingContext": "hephIA-solution",
      "view": {
        "id": "637ce534dd85c10875c4fe26",
        "name": "view_11-22-2022_15:05:24"
      },
      "dataset": {
        "name": "my_dataset",
        "collection": "datasets"
      },
      "project": {
        "id": 2,
        "name": "SOM"
      }
    }
  },
  "path_jar": "/path/to/have/access/to/engine.jar",
  "db_context": {
    "database": "dev",
    "username": "cloudOps",
    "password": "O;mrspaq1i8",
    "host": "mongo.dev.hephia-product.com",
    "port": 27
  },
  "processing_id": "62bda1197750ec088f984461"
}
```
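
On the Scala Engine side, a payload like this can be decoded directly into case classes. The snippet below is a minimal, hypothetical sketch assuming circe for JSON decoding; the class and field names simply mirror the four keys above and are not the engine's actual types.

```scala
import io.circe.Json
import io.circe.generic.auto._
import io.circe.parser.decode

// Hypothetical sketch (assuming circe): the field names mirror the payload keys,
// snake_case included, so automatic derivation can decode the raw JSON string.
final case class DbContext(
  database: String,
  username: String,
  password: String,
  host: String,
  port: Int
)

final case class EnginePayload(
  data: Json,           // kept as raw JSON since its schema depends on the processing
  path_jar: String,
  db_context: DbContext,
  processing_id: String
)

object EnginePayloadDecoder {
  def parse(raw: String): Either[io.circe.Error, EnginePayload] =
    decode[EnginePayload](raw)
}
```

The `data` field is kept as raw JSON here because its exact schema depends on which processing is being launched.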
Workflow Manager to Scala Engine payload mandatory keys
The keys processingKeyword, customer, name, startTS, latestUpdateTS, status, processingContext, processingId, processingName, callingContext, editionContext, dataset, name, project, id, and name are unchanged and follow the classic mandatory payload; new keys can be added for specific needs and are described below:

- dataLocations: A dictionary whose keys are named according to the nature of the data and whose values are the dataLocationIds.
- hyperParameters: A process-specific dictionary.
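
As an illustration, these optional keys could be decoded alongside the mandatory ones as in the hedged sketch below (again assuming circe; ProcessingData and the example values are purely illustrative, not documented names).

```scala
import io.circe.Json
import io.circe.generic.auto._
import io.circe.parser.decode

// Hypothetical sketch of a processing-specific "data" block carrying the two
// optional keys described above; every name except dataLocations and
// hyperParameters is illustrative, not a documented key.
final case class ProcessingData(
  processingKeyword: String,
  dataLocations: Option[Map[String, String]], // data nature -> dataLocationId
  hyperParameters: Option[Json]               // free-form, process-specific dictionary
)

object ProcessingDataExample extends App {
  val example =
    """{
      |  "processingKeyword": "distinct_processing_name_to_leverage_exec",
      |  "dataLocations": { "learningData": "some-data-location-id" },
      |  "hyperParameters": { "epochs": 10 }
      |}""".stripMargin

  // Prints Right(ProcessingData(...)) when the block matches the expected shape.
  println(decode[ProcessingData](example))
}
```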
(workflow_manager_to_scala_engine_section_home_payload_list)=
Payload list
This section gathers every payload exchanged between the workflow manager and the Scala Engine:
- Load CSV: Load a CSV file and store it properly (the Spark sketch after this list illustrates these steps).
  - It takes the given numerical feature names and keeps them as processing data.
  - It takes the given domain data feature names and keeps them as domain data.
  - It adds a column named ObservationId by default, indexed from $0$ to $N - 1$.
  - It saves the above columns in a parquet file on GCS.
- SOM and HardClustering: Run SOM on the numerical features and predict the corresponding HardClustering on the learning data (see the BMU sketch after this list).
  - Learn a SOM model.
  - Predict the HardClustering with the output model and the learning data.
  - Save the SOM model and the HardClustering in Mongo (later the HardClustering must be saved in parquet).
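
The following is a minimal sketch of the Load CSV steps, assuming the engine runs on Spark; LoadCsvSketch and its parameters are illustrative names, not the engine's actual API.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, monotonically_increasing_id, row_number}

// Minimal Spark sketch of the Load CSV steps (assumed implementation, not the
// engine's actual code): keep the requested columns, add an ObservationId column
// indexed from 0 to N - 1, and write the result as parquet (e.g. to a gs:// path).
object LoadCsvSketch {
  def run(
      spark: SparkSession,
      csvPath: String,
      numericalFeatures: Seq[String],
      domainDataFeatures: Seq[String],
      parquetPath: String
  ): Unit = {
    val raw  = spark.read.option("header", "true").option("inferSchema", "true").csv(csvPath)
    val kept = raw.select((numericalFeatures ++ domainDataFeatures).map(col): _*)
    // row_number() is 1-based, so subtract 1 to index observations from 0 to N - 1.
    val withId = kept.withColumn(
      "ObservationId",
      row_number().over(Window.orderBy(monotonically_increasing_id())) - 1
    )
    withId.write.mode("overwrite").parquet(parquetPath)
  }
}
```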
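
For the HardClustering prediction step, the sketch below shows the usual best matching unit (BMU) assignment against a learned SOM codebook; it is an assumed, library-free illustration rather than the engine's implementation.

```scala
// Minimal sketch (assumed, not the engine's SOM implementation) of the
// HardClustering prediction step: each observation is assigned to the index of
// its best matching unit (nearest prototype) in the learned SOM codebook.
object HardClusteringSketch {

  private def squaredDistance(a: Array[Double], b: Array[Double]): Double =
    a.iterator.zip(b.iterator).map { case (x, y) => (x - y) * (x - y) }.sum

  /** Returns, for every observation, the index of the closest codebook vector. */
  def predict(codebook: IndexedSeq[Array[Double]], observations: Seq[Array[Double]]): Seq[Int] =
    observations.map { obs =>
      codebook.indices.minBy(i => squaredDistance(codebook(i), obs))
    }
}
```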