Normalization

Representation

The keys _id, name, stepId, domainInformation, dataset, project, processingInfo, creationTS and latestUpdateTS are unchanged and follow the classic representation. Only the dataSpecification keys change; they are described below:

  • dataSpecification :
    • keyword: Its value is “numericalData”. See the mandatory keys.
    • valueType: The type of the processing data values; it therefore depends on them. See the mandatory keys.
    • meaning: Its value is “Vector of normalized numerical features”. See the mandatory keys.
    • view: See the mandatory keys.
    • dataLocationId: See the mandatory keys.
    • normalizationStrategies: A list of dictionaries, each containing 3 sub-keys.
      • columnName: The name of the column.
      • preprocessingMethod: The name of the method, such as fstand, flog, fcos, fyeoJonson, fdivide, fminMax or fpercentile.
      • parameters: A list of dictionaries, each containing 2 sub-keys.
        • name: The name of the parameter.
        • value: The value of the parameter.
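
As an illustration of the normalizationStrategies structure, here is a minimal Python sketch that models one entry with dataclasses and converts it to a plain dictionary. The class names Parameter and NormalizationStrategy are hypothetical and are not part of the specification; only the field names (columnName, preprocessingMethod, parameters, name, value) come from the key list above. The example values are taken from the JSON template in this section.

```python
from dataclasses import dataclass, field, asdict
from typing import Any, List


@dataclass
class Parameter:
    # One parameter of a preprocessing method (hypothetical helper class).
    name: str
    value: Any


@dataclass
class NormalizationStrategy:
    # One entry of normalizationStrategies (hypothetical helper class).
    columnName: str
    preprocessingMethod: str
    parameters: List[Parameter] = field(default_factory=list)


# Build the entry for the column "cp_a_1_bar" as it appears in the template.
strategy = NormalizationStrategy(
    columnName="cp_a_1_bar",
    preprocessingMethod="fstand",
    parameters=[Parameter(name="A", value=0)],
)

# asdict() converts nested dataclasses into plain dicts and lists,
# which can then be serialized with json.dumps().
strategy_dict = asdict(strategy)
```

Keeping the field names identical to the JSON keys means no renaming step is needed before serialization.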

Currently, only numerical vectors can be saved during raw data loading, so there is a single template at this moment.

JSON template for the numerical vector processing data representation:

{
  "dataSpecification": {
    "keyword": "numericalData",
    "valueType": {
      "dataType": "numerical",
      "structureType": "vector"
    },
    "meaning": "Vector of normalized numerical features",
    "view": {
      "id": "637ce534dd85c10875c4fe26",
      "name": "view_11-22-2022_15:05:24"
    },
    "dataLocationId": "637ced8192c8c3638d51588e",
    "normalizationStrategies": [
      {
        "columnName": "cp_a_1_bar",
        "preprocessingMethod": "fstand",
        "parameters": [
          {
            "name": "A",
            "value": 0
          }
        ]
      },
      {
        "columnName": "cp_a_2_bar",
        "preprocessingMethod": "fstand",
        "parameters": [
          {
            "name": "Vide",
            "value": 0
          }
        ]
      }
    ]
  }
}
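
A document shaped like the template above could be checked with a small structural validator. The sketch below is only an assumption about how such a check might look: the function name check_normalization_spec is hypothetical, and it assumes that every dataSpecification key listed in this section is required. It checks key presence and container types only, not the values themselves.

```python
REQUIRED_SPEC_KEYS = {
    "keyword", "valueType", "meaning", "view",
    "dataLocationId", "normalizationStrategies",
}
REQUIRED_STRATEGY_KEYS = {"columnName", "preprocessingMethod", "parameters"}
REQUIRED_PARAMETER_KEYS = {"name", "value"}


def check_normalization_spec(document):
    """Return a list of problems found; an empty list means none were found."""
    spec = document.get("dataSpecification")
    if not isinstance(spec, dict):
        return ["dataSpecification is missing or is not an object"]

    problems = []
    missing = REQUIRED_SPEC_KEYS - spec.keys()
    if missing:
        problems.append(f"dataSpecification lacks keys: {sorted(missing)}")
    if spec.get("keyword") != "numericalData":
        problems.append("keyword is expected to be 'numericalData'")

    strategies = spec.get("normalizationStrategies")
    if not isinstance(strategies, list):
        problems.append("normalizationStrategies is expected to be a list")
        return problems

    for i, strategy in enumerate(strategies):
        if not isinstance(strategy, dict):
            problems.append(f"strategy {i} is not an object")
            continue
        missing = REQUIRED_STRATEGY_KEYS - strategy.keys()
        if missing:
            problems.append(f"strategy {i} lacks keys: {sorted(missing)}")
        parameters = strategy.get("parameters", [])
        if not isinstance(parameters, list):
            problems.append(f"strategy {i}: parameters is expected to be a list")
            continue
        for j, parameter in enumerate(parameters):
            # Each parameter must be an object carrying both 'name' and 'value'.
            if (not isinstance(parameter, dict)
                    or REQUIRED_PARAMETER_KEYS - parameter.keys()):
                problems.append(
                    f"strategy {i}, parameter {j} needs 'name' and 'value'"
                )
    return problems


# Shortened example built from the values in the template.
example = {
    "dataSpecification": {
        "keyword": "numericalData",
        "valueType": {"dataType": "numerical", "structureType": "vector"},
        "meaning": "Vector of normalized numerical features",
        "view": {"id": "637ce534dd85c10875c4fe26",
                 "name": "view_11-22-2022_15:05:24"},
        "dataLocationId": "637ced8192c8c3638d51588e",
        "normalizationStrategies": [
            {"columnName": "cp_a_1_bar", "preprocessingMethod": "fstand",
             "parameters": [{"name": "A", "value": 0}]},
        ],
    }
}
print(check_normalization_spec(example))
```

Returning a list of messages rather than raising on the first error lets a caller report every structural problem in one pass.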

Observation

  • Save in Parquet.