This project showcases an end-to-end data ingestion framework using Azure Data Factory (ADF) to extract data from multiple structured sources, transform it into JSON format, and store it securely in Azure Data Lake Storage Gen2 (ADLS Gen2).
The sources include:
- 💾 Microsoft SQL Server
- 🧮 Azure SQL Database
- 📁 Azure Blob Storage (CSV files)
A control table drives the selection of source systems, enabling flexible and scalable orchestration through a master pipeline. Audit logging is handled via stored procedures in SQL Server, and lifecycle management rules archive files 30 days after their last modification, ensuring efficient storage governance.
Designed to eliminate hardcoded and duplicated pipelines, the project demonstrates best practices in modern data engineering, with reusability, automation, and observability at its core.
Assume an organization receives data from multiple upstream systems, including:
- Relational databases: MS SQL Server and Azure SQL Database
- Flat files: Structured CSV files delivered to Azure Blob Storage
Each relational database source system contains multiple databases, schemas, and tables, and each CSV file is delivered to a designated folder in Blob Storage. The downstream analytics team requested that all data be delivered to Azure Data Lake Storage Gen2 (ADLS Gen2) in a single, flexible, unified format (JSON) rather than in its native form (CSV or relational tables).
The goals are to:
- Enable a consistent and schema-flexible data model
- Avoid pipeline duplication and manual hardcoding for every new source or table
- Ensure auditability, traceability, and compliance by logging pipeline runs and metadata
The primary reason for converting to JSON was to support downstream systems that prefer or require data in a semi-structured format, such as data lakes, reporting tools, or NoSQL platforms. Additionally, JSON makes the data portable and suitable for long-term storage or archiving, especially for regulatory or compliance requirements.
- Azure Data Factory (ADF)
- Azure Data Lake Storage Gen2 (ADLS Gen2)
- Azure Blob Storage
- Azure SQL Database
- Microsoft SQL Server
- T-SQL Stored Procedures
- Self-Hosted Integration Runtime
- JSON File Format
- Lifecycle Management Rules
- Azure Linked Service
- Azure Key Vault
- Parameterized Ingestion Pipelines: Generic pipelines for each source type (SQL Server, Azure SQL DB, Blob Storage) that dynamically pull data using metadata (a T-SQL sketch of this pattern follows this list).
- Master Pipeline: Queries the control table and triggers the appropriate source pipeline based on its metadata.
- Control Table: Stores source system metadata such as table name, schema, folder paths, and active status.
- Audit Logging: A stored procedure logs metadata such as record counts, execution duration, pipeline name, and status into an audit table.
- Lifecycle Management: Automatically archives files older than 30 days in ADLS using a lifecycle policy.
- Self-Hosted IR: Used to securely access the on-premises SQL Server database from ADF.
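To make the metadata-driven pattern concrete, here is a minimal T-SQL sketch of how a single parameterized pipeline can serve many tables: the source query is assembled at runtime from control-table values. In the actual pipelines this is expressed as ADF dynamic content rather than T-SQL, and the schema and table names below are purely illustrative.

```sql
-- Illustrative only: in ADF these values come from pipeline parameters
-- fed by the control table (e.g. dynamic content over the lookup output).
DECLARE @SchemaName SYSNAME = N'dbo';        -- hypothetical value
DECLARE @TableName  SYSNAME = N'Customers';  -- hypothetical value

-- Build the per-table source query from metadata and execute it.
DECLARE @Sql NVARCHAR(MAX) =
    N'SELECT * FROM ' + QUOTENAME(@SchemaName) + N'.' + QUOTENAME(@TableName) + N';';

EXEC sys.sp_executesql @Sql;
```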
- The master pipeline queries `src_control_table` and filters for active sources (see the lookup sketch after this list).
- Based on the source system, it triggers the corresponding source pipeline:
  - SQL Server
  - Azure SQL DB
  - Blob Storage
- Each pipeline:
  - Fetches data
  - Converts it to JSON
  - Loads it to the `downstream` container and `Archive` folder in ADLS Gen2
- Data is available for 30 days, after which lifecycle rules move it to the `archive` tier.
- Execution metadata is logged into the audit tables.
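For orientation, this is roughly the lookup the master pipeline could run against `src_control_table`. The column names are assumptions based on the metadata described below; the authoritative definition lives in `sql/src_control_table.sql`.

```sql
-- Sketch of the master pipeline's Lookup query; column names are assumed.
SELECT  Source_Type,
        Database_Name,
        Schema_Name,
        Table_Name,
        Folder_Path
FROM    dbo.src_control_table
WHERE   Is_Active = 1;   -- only active sources are orchestrated
```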
The control table (`src_control_table`) stores metadata about each source system, as sketched below:
- Source type, database, schema, table, folder paths, and active status.
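A minimal sketch of what such a control table could look like; column names and types are assumptions, and the actual definition is in `sql/src_control_table.sql`.

```sql
-- Illustrative shape of the control table (assumed columns, not the repo schema).
CREATE TABLE dbo.src_control_table (
    Source_Id     INT IDENTITY(1,1) PRIMARY KEY,
    Source_Type   VARCHAR(50)   NOT NULL,  -- e.g. 'SQLServer', 'AzureSQL', 'Blob'
    Database_Name VARCHAR(128)  NULL,      -- relational sources only
    Schema_Name   VARCHAR(128)  NULL,
    Table_Name    VARCHAR(128)  NULL,
    Folder_Path   VARCHAR(500)  NULL,      -- folder path for file-based sources
    Is_Active     BIT           NOT NULL DEFAULT 1
);
```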
The audit tables store execution logs, including:
- Run ID, records read/written, execution status, timestamps, errors, etc.
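A hedged sketch of the audit table and the logging procedure that an ADF Stored Procedure activity would call at the end of each run. The real objects are `sql/Audit_Table.sql` and `sql/sp_Audit_Table.sql`; the columns and parameters shown here are assumptions based on the fields listed above.

```sql
-- Hypothetical audit table; column names are assumed.
CREATE TABLE dbo.Audit_Table (
    Audit_Id        INT IDENTITY(1,1) PRIMARY KEY,
    Run_Id          VARCHAR(100)  NOT NULL,
    Pipeline_Name   VARCHAR(200)  NOT NULL,
    Records_Read    INT           NULL,
    Records_Written INT           NULL,
    Status          VARCHAR(50)   NOT NULL,
    Error_Message   NVARCHAR(MAX) NULL,
    Start_Time      DATETIME2     NULL,
    End_Time        DATETIME2     NULL
);
GO

-- Hypothetical logging procedure invoked from the pipelines after each copy.
CREATE PROCEDURE dbo.sp_Audit_Table
    @Run_Id          VARCHAR(100),
    @Pipeline_Name   VARCHAR(200),
    @Records_Read    INT,
    @Records_Written INT,
    @Status          VARCHAR(50),
    @Error_Message   NVARCHAR(MAX) = NULL,
    @Start_Time      DATETIME2,
    @End_Time        DATETIME2
AS
BEGIN
    SET NOCOUNT ON;
    INSERT INTO dbo.Audit_Table
        (Run_Id, Pipeline_Name, Records_Read, Records_Written,
         Status, Error_Message, Start_Time, End_Time)
    VALUES
        (@Run_Id, @Pipeline_Name, @Records_Read, @Records_Written,
         @Status, @Error_Message, @Start_Time, @End_Time);
END
```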
| 📁 Folder | 📄 Files |
|---|---|
| `Pipelines` | `pl_master_extract`, `pl_onprem_sqlserver_extract`, `pl_azure_sql_extract`, `pl_azure_adlsgen2_extract` |
| `Pipeline_Runs` | `Adf_Pipeline_Activity_Runs.xlsx` |
| `DataSets` | `ds_adlsgen2_json_param`, `ds_adlsgen2_json_param2`, `ds_adlsgen2_upstream`, `ds_azure_sql_param`, `ds_onprem_sqlserver_param` |
| `Integration_Runtimes` | `SHIR-01.json` |
| `sql` | `src_control_table.sql`, `sp_Audit_Table.sql`, `sp_Azure_Audit_Table.sql`, `Audit_Table.sql`, `Azure_Audit_Table.sql` |
| `Control_Tables` | `Source_Control_Table.csv` |
| `Audit_Tables` | `Blob_Audit_Table.csv`, `Database_Audit_Table.csv` |
| `Images` | `Framework_Architecture_Diagram.png`, `Pipeline_Flow_Diagram.png` |
| `docs` | `README.md` |
✅ Dynamic: Ingestion logic is metadata-driven, making it easy to onboard new sources.
✅ Maintainable: No duplication of pipelines for each schema in a database.
✅ Auditable: Track pipeline runs, record counts, errors.
✅ Scalable: Easy to extend with new source types or transformations.
✅ Enterprise-Ready: Built-in archival, monitoring, and logging mechanisms.
Feel free to connect with me on LinkedIn or send me an Email if you’d like to know more or discuss how I built this project.

