
02 Concepts

Benny Austin edited this page Jan 17, 2025 · 1 revision

Azure Modern Data Platform

A typical Azure Modern Data Platform implementation, shown below, is used to explain the basic concepts of the ELT Framework.

[Figure: Azure Modern Data Platform]

Definition and Instance concepts

Definition: In the ELT framework, a definition refers to a one-time metadata configuration, such as IngestDefinition for ingestion, L1TransformDefinition for Level 1 transformation, and L2TransformDefinition for Level 2 transformation.

Instance: Each execution of a definition generates an instance (e.g., IngestInstance, L1TransformInstance, L2TransformInstance) used for tracking, auditing, data lineage, and re-runs.
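The relationship between a definition and its instances can be sketched as follows. This is a minimal illustration only: the class names reuse the terms above, but the fields (`definition_id`, `instance_id`, `started_at`, `status`) and the `start_run` helper are hypothetical and are not taken from the framework's actual metadata schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from itertools import count

# Hypothetical in-memory ID generator; the real framework presumably
# stores these records in a metadata database.
_instance_ids = count(1)


@dataclass(frozen=True)
class IngestDefinition:
    """One-time metadata configuration for a single stream."""
    definition_id: int
    source_system: str
    stream: str


@dataclass
class IngestInstance:
    """One record per execution of a definition."""
    instance_id: int
    definition_id: int
    started_at: datetime
    status: str = "Running"


def start_run(definition: IngestDefinition) -> IngestInstance:
    """Create a new instance record for one execution of the definition."""
    return IngestInstance(
        instance_id=next(_instance_ids),
        definition_id=definition.definition_id,
        started_at=datetime.now(timezone.utc),
    )


definition = IngestDefinition(definition_id=1, source_system="ERP", stream="Customer")
first_run = start_run(definition)
second_run = start_run(definition)
```

Both runs point back to the same definition but carry their own instance IDs, which is what makes per-run tracking, auditing, and re-runs possible.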

Ingest Definition

IngestDefinition configures data loading from source systems into the Raw/Landing zone of a data lake. Sources can be cloud-based, on-premises, or manually uploaded (e.g., third-party data) and include databases, flat files, REST APIs, XML/FetchXML, JSON, and other batch data sources. Data first lands in the Raw zone, such as Azure Data Lake Storage Gen2 (ADLS), in its original granularity and format. This zone serves as cost-effective storage that lets processing pipelines read the data without going back to the source system, which is useful for re-runs, for decommissioned sources, and for separating transformations from ingestion. Folders in the Raw zone are typically partitioned by Source/Entity/Year/Month/Day. File formats from APIs and external files are preserved (e.g., JSON, XML), while data from databases is stored in Parquet format when possible. Raw data can then be transformed for downstream use, including Machine Learning (ML) workloads.
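As an illustration of the Source/Entity/Year/Month/Day partitioning, a folder path could be derived as in the sketch below. The function name `raw_folder` and the zero-padded month and day are assumptions made for this example; the framework may format its paths differently.

```python
from datetime import date


def raw_folder(source: str, entity: str, run_date: date) -> str:
    """Build a Raw zone folder path partitioned by Source/Entity/Year/Month/Day."""
    # Zero-padded month and day are an assumption of this sketch.
    return f"{source}/{entity}/{run_date:%Y/%m/%d}"


print(raw_folder("ERP", "Customer", date(2025, 1, 17)))
# ERP/Customer/2025/01/17
```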

[Figure: Ingest Definition]

Source System

Within the context of the ELT Framework, Source System refers to any input data source, e.g. ERP, Fleet Management, Historian, or Enrolment System.

Stream

Within the context of the ELT Framework, Stream refers to an entity within the source system, e.g. a table/view, a REST API endpoint, or a flat file.

Ingest Instance

Each execution of the ingestion pipeline using the Ingest Definition creates an Ingest Instance record. The Ingest Instance record has the following data points:

  • Date/Number range of Ingestion
  • Reload flag
  • Ingestion status
  • Destination details of Raw file (container, folder and file)
  • Duration for which the pipeline ran
  • Audit data points
  • Lineage data points
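The data points above could be modelled roughly as in the sketch below. Every field name here is hypothetical and simply mirrors the list; lineage data points are omitted because the page does not describe their shape.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class IngestInstanceRecord:
    # All field names are hypothetical; they mirror the data points listed above.
    range_start: str                      # date/number range of ingestion (lower bound)
    range_end: str                        # date/number range of ingestion (upper bound)
    reload: bool                          # reload flag
    status: str                           # ingestion status
    container: str                        # destination details of the raw file
    folder: str
    file: str
    started_at: datetime                  # audit data point
    ended_at: Optional[datetime] = None   # audit data point, set when the run ends

    @property
    def duration(self) -> Optional[timedelta]:
        """Duration for which the pipeline ran, once it has finished."""
        if self.ended_at is None:
            return None
        return self.ended_at - self.started_at


record = IngestInstanceRecord(
    range_start="2025-01-16",
    range_end="2025-01-17",
    reload=False,
    status="Succeeded",
    container="raw",
    folder="ERP/Customer/2025/01/17",
    file="Customer.parquet",
    started_at=datetime(2025, 1, 17, 2, 0, 0),
    ended_at=datetime(2025, 1, 17, 2, 5, 0),
)
print(record.duration)
# 0:05:00
```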

L1 Transform Definition

Once ingested, data is available in the Raw zone of the data lake. From there, it can be transformed using compute resources such as Spark notebooks, with the results stored in the Trusted zone. The L1TransformDefinition specifies the input path in the Raw zone, the destination path in the Trusted zone, the destination in the DW (if applicable), and the transformation notebook to use. The Trusted/Structured zone holds enriched data from the Raw zone at the same granularity, stored in Parquet format. This layer can also be a Delta Lake. Examples of data enrichment include:

  • De-duplication
  • Removing leading/trailing spaces from strings
  • Merging/upserting existing data with newer versions
  • Converting UTC dates to local time zones
  • Standardizing timestamp formats (e.g., converting Julian dates)
  • Flattening JSON and XML files
  • Adding headers to files without headers
  • Translating column names to English (e.g., SAP columns in German)
  • Removing system columns from source data
  • Implementing SCD patterns
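Two of the enrichments listed above, trimming leading/trailing spaces and keeping the newest version of each record, can be sketched in plain Python. In practice this logic would more likely live in a Spark notebook; the `enrich` function, the `id` key column, and the `modified` version column are all assumptions made for this example.

```python
def enrich(rows, key="id", version="modified"):
    """Trim string values, then keep only the newest row per key."""
    latest = {}
    for row in rows:
        # Remove leading/trailing spaces from every string value.
        cleaned = {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}
        current = latest.get(cleaned[key])
        # Keep the row with the highest version value; exact duplicates are dropped.
        if current is None or cleaned[version] > current[version]:
            latest[cleaned[key]] = cleaned
    return list(latest.values())


raw_rows = [
    {"id": 1, "name": "  Alice ", "modified": "2025-01-01"},
    {"id": 1, "name": "Alice B", "modified": "2025-01-10"},
    {"id": 2, "name": "Bob  ", "modified": "2025-01-05"},
    {"id": 2, "name": "Bob  ", "modified": "2025-01-05"},
]
trusted_rows = enrich(raw_rows)
```

The granularity is unchanged: there is still one row per source record, only cleaned and de-duplicated. Comparing `modified` as a string works here only because the sample values are ISO-formatted dates.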

[Figure: L1 Transform Definition]

L1 Transform Instance

Each execution of the L1 Transformation pipeline using the L1TransformDefinition creates an L1 Transform Instance record. The L1 Transform Instance record has the following data points:

  • Reload flag
  • Transformation status
  • Details of Raw and Transformed file (container, folder and file)
  • DW Table or Delta Lake Table where the transformed output is available
  • Duration for which the pipeline ran
  • Audit data points
  • Lineage data points

L2 Transform Definition

Level 2 transformation is where the granularity of the data changes through the application of specific business rules. An L2 transformation can take its source from the Raw zone, the Trusted zone, or a table in the DW. Typical transformations in this layer are:

  • Aggregation
  • Pivot/Un-Pivot
  • Redaction
  • Consolidation
  • Data mash-up from different source systems
  • Snapshots
  • Post processing
  • Fact and dimension tables
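As a small illustration of how aggregation changes granularity, the sketch below rolls daily rows up to monthly totals. The function name `monthly_totals` and the `(date, amount)` row shape are hypothetical; a real L2 transformation would typically run in a notebook or in the DW.

```python
from collections import defaultdict


def monthly_totals(daily_rows):
    """Aggregate (date, amount) rows to one total per YYYY-MM month."""
    totals = defaultdict(float)
    for day, amount in daily_rows:
        totals[day[:7]] += amount  # "2025-01-17" -> "2025-01"
    return dict(totals)


daily = [("2025-01-16", 10.0), ("2025-01-17", 5.0), ("2025-02-01", 2.5)]
print(monthly_totals(daily))
# {'2025-01': 15.0, '2025-02': 2.5}
```

Three daily rows become two monthly rows, so the output is no longer at the granularity of the source.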

[Figure: L2 Transform Definition]

L2 Transform Instance

Each execution of the L2 Transformation pipeline using the L2TransformDefinition creates an L2 Transform Instance record. The L2 Transform Instance record has the following data points:

  • Reload flag
  • Transformation status
  • Details of Raw and Transformed file (container, folder and file)
  • DW Table or Delta Lake Table where the transformed output is available
  • Duration for which the pipeline ran
  • Audit data points
  • Lineage data points