02 Concepts
A typical Azure Modern Data Platform implementation is shown below and will be used to explain the basic concepts of the ELT Framework.
Definition: In the ELT framework, a definition refers to a one-time metadata configuration, such as IngestDefinition for ingestion, L1TransformDefinition for Level 1 transformation, and L2TransformDefinition for Level 2 transformation.
Instance: Each execution of a definition generates an instance (e.g., IngestInstance, L1TransformInstance, L2TransformInstance) used for tracking, auditing, data lineage, and re-runs.
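To make the relationship between a definition and its instances concrete, the sketch below models it in Python. The class and field names are illustrative assumptions, not the framework's actual metadata schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from uuid import uuid4

@dataclass
class IngestDefinition:
    """One-time metadata configuration for ingesting a stream from a source system (illustrative fields)."""
    definition_id: int
    source_system: str        # e.g. "ERP"
    stream: str               # e.g. the "SalesOrders" table
    raw_container: str        # Raw zone destination container

@dataclass
class IngestInstance:
    """Created for every execution of a definition; used for tracking, auditing,
    data lineage, and re-runs (illustrative fields)."""
    definition_id: int
    instance_id: str = field(default_factory=lambda: str(uuid4()))
    run_start: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    status: str = "InProgress"   # updated to Succeeded/Failed when the run ends
```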
IngestDefinition configures data loading from various source systems into the Raw/Landing zone of a data lake. Sources can be cloud-based, on-premises, or manually uploaded (e.g., third-party data) and include databases, flat files, REST APIs, XML/FetchXML, JSON, and other batch data sources. Data first lands in the Raw zone, such as Azure Data Lake Storage Gen2 (ADLS Gen2), maintaining its original granularity and format. This zone serves as cost-effective storage that lets processing pipelines work without directly accessing the source system, which is useful for re-runs, decommissioned sources, and separating transformations from ingestion. Folders in the Raw zone are typically partitioned by Source/Entity/Year/Month/Day. File formats from APIs and external files are preserved (e.g., JSON, XML), while data from databases is stored in Parquet format when possible. Raw data can then be transformed for downstream use, including Machine Learning (ML) workloads.
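As a rough illustration of the Source/Entity/Year/Month/Day partitioning and format rules described above, a small helper might build Raw zone paths like this (the function and its format logic are assumptions for illustration, not part of the framework):

```python
from datetime import date

def raw_zone_path(source: str, entity: str, run_date: date, source_type: str) -> str:
    """Build a Raw zone path partitioned by Source/Entity/Year/Month/Day.

    Database extracts are stored as Parquet where possible; API and file
    sources keep their original format (JSON, XML, CSV, ...).
    """
    extension = "parquet" if source_type == "database" else "json"
    return (
        f"{source}/{entity}/"
        f"{run_date.year:04d}/{run_date.month:02d}/{run_date.day:02d}/"
        f"{entity}.{extension}"
    )

# Example: a database table ingested on 2024-03-15
print(raw_zone_path("ERP", "SalesOrders", date(2024, 3, 15), "database"))
# -> ERP/SalesOrders/2024/03/15/SalesOrders.parquet
```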
Within the context of the ELT Framework, a Source System refers to any input data source, e.g., ERP, Fleet Management, Historian, Enrolment System, etc.
Within the context of the ELT Framework, a Stream refers to an entity within the source system, e.g., a table/view, a REST API endpoint, or a flat file within the source system.
An execution of the Ingestion Pipeline using the IngestDefinition will create an Ingest Instance record. The Ingest Instance record will have the following data points (see the sketch after this list):
- Date/Number range of Ingestion
- Reload flag
- Ingestion status
- Destination details of Raw file (container, folder and file)
- Duration for which the pipeline ran
- Audit data points
- Lineage data points
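A sketch of what such an Ingest Instance record might contain, using a plain dictionary with assumed field names (the framework's actual column names may differ):

```python
# Illustrative Ingest Instance record; all field names are assumptions.
ingest_instance = {
    "ingest_definition_id": 42,
    "watermark_start": "2024-03-14T00:00:00Z",   # date/number range of ingestion
    "watermark_end": "2024-03-15T00:00:00Z",
    "reload_flag": False,
    "status": "Succeeded",                        # ingestion status
    "raw_container": "raw",                       # destination details of the Raw file
    "raw_folder": "ERP/SalesOrders/2024/03/15",
    "raw_file": "SalesOrders.parquet",
    "duration_seconds": 187,                      # duration for which the pipeline ran
    "audit": {"created_by": "adf-pipeline", "created_on": "2024-03-15T01:03:07Z"},
    "lineage": {"source_system": "ERP", "stream": "SalesOrders"},
}
```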
Once ingested, data is available in the Raw zone of the data lake. From there, it can be transformed using compute resources such as Spark notebooks, with results stored in the Trusted zone. The L1TransformDefinition specifies the input path in the Raw zone, the destination path in the Trusted zone, the destination in the DW (if applicable), and the transformation notebook used. The Trusted/Structured zone enriches data from the Raw zone, maintaining the same granularity and storing it in Parquet format. This layer can also be implemented as Delta Lake. Examples of data enrichment include (see the sketch after this list):
- De-duplication
- Removing leading/trailing spaces from strings
- Merging/upserting existing data with newer versions
- Converting UTC dates to local time zones
- Standardizing timestamp formats (e.g., Julian dates)
- Flattening JSON and XML files
- Adding headers to files without headers
- Translating column names to English (e.g., SAP columns in German)
- Removing system columns from source data
- Implementing SCD patterns
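A minimal sketch of an L1 transformation notebook covering a few of the enrichments above, assuming PySpark; the paths and column names are illustrative and would normally come from the L1TransformDefinition metadata:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("l1-transform-sketch").getOrCreate()

# Illustrative paths; in the framework these come from the L1TransformDefinition.
raw_path = "abfss://raw@<storage-account>.dfs.core.windows.net/ERP/SalesOrders/2024/03/15/"
trusted_path = "abfss://trusted@<storage-account>.dfs.core.windows.net/ERP/SalesOrders/"

df = spark.read.parquet(raw_path)

enriched = (
    df.dropDuplicates()                                                  # de-duplication
      .withColumn("customer_name", F.trim(F.col("customer_name")))       # strip leading/trailing spaces
      .withColumn("order_ts_local",
                  F.from_utc_timestamp(F.col("order_ts_utc"),
                                       "Australia/Sydney"))              # UTC -> local time zone
)

# Same granularity as the Raw zone, stored as Parquet (or Delta) in the Trusted zone.
enriched.write.mode("overwrite").parquet(trusted_path)
```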
An execution of the L1 Transformation Pipeline using the L1TransformDefinition will create an L1 Transform Instance record. The L1 Transform Instance record will have the following data points:
- Reload flag
- Transformation status
- Details of Raw and Transformed file (container, folder and file)
- DW Table or Delta Lake Table where the transformed output is available
- Duration for which the pipeline ran
- Audit data points
- Lineage data points
Level 2 transformation is where the granularity of data changes through the application of specific business rules. L2 Transformations can be defined to use the Raw Zone, the Trusted Zone, or a table in the DW as their source. Typical transformations in this layer are (see the sketch after this list):
- Aggregation
- Pivot/Un-Pivot
- Redaction
- Consolidation
- Data mash-up from different source systems
- Snapshots
- Post processing
- Fact and Dimension table loads
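As an example of a granularity-changing Level 2 transformation, the sketch below aggregates a Trusted zone dataset into a daily summary using PySpark; the paths and column names are illustrative assumptions:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("l2-transform-sketch").getOrCreate()

# The source can be the Raw zone, the Trusted zone, or a DW table; here, the Trusted zone.
orders = spark.read.parquet(
    "abfss://trusted@<storage-account>.dfs.core.windows.net/ERP/SalesOrders/"
)

# Aggregation changes the granularity from one row per order
# to one row per customer per day.
daily_sales = (
    orders.groupBy("customer_id",
                   F.to_date(F.col("order_ts_local")).alias("order_date"))
          .agg(F.sum("order_amount").alias("total_amount"),
               F.count("*").alias("order_count"))
)

daily_sales.write.mode("overwrite").parquet(
    "abfss://curated@<storage-account>.dfs.core.windows.net/Sales/DailySalesByCustomer/"
)
```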
An execution of the L2 Transformation Pipeline using the L2TransformDefinition will create an L2 Transform Instance record. The L2 Transform Instance record will have the following data points:
- Reload flag
- Transformation status
- Details of Raw and Transformed file (container, folder and file)
- DW Table or Delta Lake Table where the transformed output is available
- Duration for which the pipeline ran
- Audit data points
- Lineage data points