datalocation template

DataLocation

DataLocation class describe all necessary information on which system data is store (currently Mongo/Parquet/GCS) and how to reach it with various strategy which may include path, database, collection.

It becomes storage specific with two version for parquet, mongo, and CSV data.

Common mandatory keys description :

  • _id : MongoId.
  • name : User defined name for this DataLocation, if not specify it will be named with the key name of its associated representation followed by “_dataLocation” (ex: “myRepresentation_dataLocation”).
  • databaseEngine : Used database engine.
  • environment : Enumerator describing used environment, it takes one of following values : dev, preprod, prod.
  • representationId : MongoId of associated representation.
  • value : It describe how access to related data and depends on databaseEngine value.

(datalocation_parquet)=

Parquet data storage

value key description :

  • value :
    • location : parquet folder.
    • restrictedToColumns : Array of String with column name which are extracted. If every column are taken the array is empty.

DataLocation JSON template :

{
  "_id": "62bc108b8c51f362811989c8",
  "name": "processingData_TS",
  "databaseEngine": "parquet",
  "environment": "dev",
  "representationId": "62bc108b8c51f362811989c6",
  "value": {
    "location": "gs://hephia-database-dev/parquet/industry/processingData",
    "restrictedToColumns": [
      "col1",
      "col2"
    ]
  }
}

(datalocation_csv_gcs)=

CSV in GCS data storage

value key description :

  • value :
    • location : parquet folder.
    • restrictedToColumns : Array of String with column name which are extracted. If every column are taken the array is empty.

Update CSV file JSON template :

{
  "_id": "62bc108b8c51f362811989c8",
  "name": "numericalData_TS",
  "databaseEngine": "gcs",
  "environment": "dev",
  "representationId": "62bc108b8c51f362811989c6",
  "value": {
    "location": "gs://hephia-database-dev/parquet/clientName/customFile.csv",
    "restrictedToColumns": [
      "col1",
      "col2"
    ]
  }
}

(datalocation_mongo)=

Mongo data storage

value key description :

  • It is a dictionary containing two keys :
    • database : Mongo database where is store collection to search in. It will often be the same as environment key value.
    • collection : name of the collection in mongoDB.
{
  "_id": "62bc108b8c51f362811989c8",
  "databaseEngine": "mongo",
  "environment": "dev",
  "representationId": "62bc108b8c51f362811989c6",
  "value": {
    "database": "dev",
    "collection": "representations_som_demo_7"
  }
}