
DLT-META

Documentation | Release Notes | Examples




Project Overview

DLT-META is a metadata-driven framework designed to work with Lakeflow Declarative Pipelines. This framework enables the automation of bronze and silver data pipelines by leveraging metadata recorded in an onboarding JSON file. This file, known as the Dataflowspec, serves as the data flow specification, detailing the source and target metadata required for the pipelines.

In practice, a single generic pipeline reads the Dataflowspec and uses it to orchestrate and run the necessary data processing workloads. This approach streamlines the development and management of data pipelines, allowing for a more efficient and scalable data processing workflow.
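For orientation, the sketch below shows roughly what a single onboarding entry could look like, written out as a shell heredoc. The field names are illustrative only and are not the authoritative Dataflowspec schema; refer to the Documentation link above for the full onboarding file reference.

    # Illustrative sketch only: field names approximate the onboarding JSON
    # and are not the authoritative Dataflowspec schema.
    cat > onboarding_example.json <<'EOF'
    [
      {
        "data_flow_id": "100",
        "data_flow_group": "A1",
        "source_format": "cloudFiles",
        "source_details": {"source_path_dev": "tests/resources/data/customers"},
        "bronze_table": "customers",
        "silver_table": "customers_clean"
      }
    ]
    EOF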

Components:

Metadata Interface

Generic Lakeflow Declarative Pipeline

  • Applies appropriate readers based on input metadata
  • Applies data quality rules using Lakeflow Declarative Pipeline expectations
  • Applies CDC apply changes if specified in metadata
  • Builds the Lakeflow Declarative Pipeline graph based on input/output metadata
  • Launches the Lakeflow Declarative Pipeline

High-Level Process Flow:

DLT-META High-Level Process Flow

Steps

DLT-META Stages

DLT-META Lakeflow Declarative Pipeline Feature Support

| Features | DLT-META Support |
|----------|------------------|
| Input data sources | Autoloader, Delta, Eventhub, Kafka, Snapshot |
| Medallion architecture layers | Bronze, Silver |
| Custom transformations | Bronze, Silver layers accept custom functions |
| Data Quality Expectations support | Bronze, Silver layers |
| Quarantine table support | Bronze layer |
| create_auto_cdc_flow API support | Bronze, Silver layers |
| create_auto_cdc_from_snapshot_flow API support | Bronze layer |
| append_flow API support | Bronze layer |
| Liquid clustering support | Bronze, Bronze Quarantine, Silver tables |
| DLT-META CLI | databricks labs dlt-meta onboard, databricks labs dlt-meta deploy |
| Bronze and Silver pipeline chaining | Deploy dlt-meta pipeline with the layer=bronze_silver option using default publishing mode |
| create_sink API support | Supported formats: external Delta table, Kafka; Bronze, Silver layers |
| Databricks Asset Bundles | Supported |
| DLT-META UI | Uses Databricks Lakehouse DLT-META App |

Getting Started

Refer to the Getting Started guide.

The Databricks Labs DLT-META CLI lets you run onboard and deploy from an interactive Python terminal.

Pre-requisites:

  • Python 3.8.0+

  • Databricks CLI v0.213 or later. See instructions

  • Install Databricks CLI on macOS: macos_install_databricks

  • Install Databricks CLI on Windows: windows_install_databricks.png

Once you install the Databricks CLI, authenticate your current machine to a Databricks Workspace:

databricks auth login --host WORKSPACE_HOST
To enable debug logs, add the `--debug` flag to any command.
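For example, with a placeholder workspace host and debug logging turned on (substitute your own workspace URL):

    # The host below is a placeholder; replace it with your workspace URL.
    databricks auth login --host https://<your-workspace>.cloud.databricks.com --debug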

Installing dlt-meta:

  • Install dlt-meta via Databricks CLI:
    databricks labs install dlt-meta
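  • Optionally, confirm the installation by listing the Labs projects currently installed through the CLI:

    # Shows Databricks Labs projects installed via the CLI; dlt-meta should appear here
    databricks labs installed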

Onboard using dlt-meta CLI:

If you want to run the existing demo files, follow these steps before running the onboard command:

  1. Clone dlt-meta:

    git clone https://github.com/databrickslabs/dlt-meta.git
    
  2. Navigate to project directory:

    cd dlt-meta
    
  3. Create Python virtual environment:

    python -m venv .venv
    
  4. Activate virtual environment:

    source .venv/bin/activate
    
  5. Install required packages:

    # Core requirements
    pip install "PyYAML>=6.0" setuptools databricks-sdk
    
    # Development requirements
    pip install delta-spark==3.0.0 pyspark==3.5.5 "pytest>=7.0.0" "coverage>=7.0.0"
    
    # Integration test requirements
    pip install "typer[all]==0.6.1"
    
  6. Set environment variables:

    dlt_meta_home=$(pwd)
    export PYTHONPATH=$dlt_meta_home
    

onboardingDLTMeta.gif

  1. Run onboarding command:
    databricks labs dlt-meta onboard
    

The command will prompt you to provide onboarding details. If you have cloned the dlt-meta repository, you can accept the default values, which use the configuration from the demo folder.

onboardingDLTMeta_2.gif

The above onboard CLI command will:

  1. Push code and data to your Databricks workspace
  2. Create an onboarding job
  3. Display a success message: Job created successfully. job_id={job_id}, url=https://{databricks workspace url}/jobs/{job_id}
  4. Open the job URL automatically in your default browser.
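If you prefer to check the job from the terminal instead of the browser, the Databricks CLI can fetch it by ID. The job ID below is a placeholder taken from the success message above:

    # 123456789 is a placeholder; use the job_id printed by the onboard command
    databricks jobs get 123456789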

Deploy using dlt-meta CLI:

  • Once the onboarding job has finished, deploy the Lakeflow Declarative Pipeline using the command below:

       databricks labs dlt-meta deploy
    

The command will prompt you to provide pipeline configuration details.

deployingDLTMeta_bronze_silver.gif

The above deploy CLI command will:

  1. Deploy a Lakeflow Declarative Pipeline to your Databricks workspace with the dlt-meta configuration (layer, group, dataflowSpec table details, etc.)
  2. Display message: dlt-meta pipeline={pipeline_id} created and launched with update_id={pipeline_update_id}, url=https://{databricks workspace url}/#joblist/pipelines/{pipeline_id}
  3. Open the pipeline URL automatically in your default browser.
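Similarly, the deployed pipeline can be inspected from the terminal; the pipeline ID below is a placeholder taken from the message above:

    # The ID is a placeholder; use the pipeline_id printed by the deploy command
    databricks pipelines get <pipeline_id>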

More questions

Refer to the FAQ and DLT-META documentation

Project Support

Please note that all projects released under Databricks Labs are provided for your exploration only, and are not formally supported by Databricks with Service Level Agreements (SLAs). They are provided AS-IS and we do not make any guarantees of any kind. Please do not submit a support ticket relating to any issues arising from the use of these projects.

Any issues discovered through the use of this project should be filed as issues on the GitHub repo.
They will be reviewed as time permits, but there are no formal SLAs for support.