
🌀 Airflow + DBT + PostgreSQL Data Pipeline

This repository implements a modern, modular data pipeline using:

  • Apache Airflow for orchestration

  • DBT for SQL-based transformations

  • PostgreSQL as both the source/target database and metadata store

💡 Ideal for learning, development, and lightweight data integration projects.


📂 Project Structure

.
├── 1_init.sh
├── 2_reset_docker.sh
├── 3_fix_permissions.sh
├── credentials                  # gitignored
│   └── auth.json
├── dags
│   └── dag.py
├── dbt_project
│   ├── dbt_project.yml
│   ├── models
│   │   ├── marts
│   │   │   ├── final_report.sql
│   │   │   └── sales_by_product.sql
│   │   ├── schema.sql
│   │   └── staging
│   │       ├── clients.sql
│   │       ├── orders.sql
│   │       └── products.sql
│   └── profiles.yml
├── docker-compose.yml
├── Dockerfile.airflow
├── files
│   ├── clients.csv
│   ├── final_report.csv
│   ├── orders.csv
│   ├── products.csv
│   └── sales_by_product.csv
├── LICENSE
├── README.md
├── requirements.txt
└── scripts
    ├── extract
    │   └── extract.py
    ├── load
    │   └── load_data.py
    └── utils
        └── utils.py

🚀 Getting Started

Before running the pipeline, make sure to create the following folders in the root directory of the project:

.
├── dags/               # Airflow DAG definitions
├── dbt_project/        # DBT transformations and config
│   └── models/
│       ├── staging/    # Raw → Staging transformations
│       └── marts/      # Staging → Marts (analytics-ready)
├── files/              # CSVs, exports, mock datasets
├── scripts/            # Python utilities for extract/load/validation
│   ├── extract/
│   ├── load/
│   └── utils/
  • dags/: Contains Airflow DAGs to orchestrate the pipeline.
  • dbt_project/: Contains the DBT project with all SQL transformation models.
    • models/
      • staging/: Contains staging models for cleaning and preparing raw data.
      • marts/: Contains data marts for final models ready for analysis and reporting.
  • files/: Stores input/output files such as CSVs.
  • scripts/: Includes helper scripts for data extraction, validation, and loading (a loader sketch follows below).
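
For example, a loader in scripts/load/ can push one of the CSVs from files/ into PostgreSQL. The sketch below is only an illustration, not the project's actual load_data.py: the function name, table name, and default connection values are assumptions.

# Hypothetical CSV -> PostgreSQL loader; the project's real logic lives in
# scripts/load/load_data.py and may look different.
import csv
import os

import psycopg2
from dotenv import load_dotenv

load_dotenv()  # pick up the POSTGRES_* settings from .env


def load_csv(path: str, table: str) -> None:
    """Bulk-load one CSV from files/ into an existing PostgreSQL table via COPY."""
    conn = psycopg2.connect(
        host=os.getenv("POSTGRES_HOST", "localhost"),
        port=os.getenv("POSTGRES_PORT", "5432"),
        dbname=os.getenv("POSTGRES_DB"),
        user=os.getenv("POSTGRES_USER"),
        password=os.getenv("POSTGRES_PASSWORD"),
    )
    try:
        with conn, conn.cursor() as cur, open(path, newline="") as f:
            columns = next(csv.reader(f))  # read the header row to get column names
            cur.copy_expert(
                f"COPY {table} ({', '.join(columns)}) FROM STDIN WITH CSV",
                f,  # file is positioned after the header, so only data rows are copied
            )
    finally:
        conn.close()


if __name__ == "__main__":
    load_csv("files/clients.csv", "clients")  # table name is illustrative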

🛠️ Prerequisites

Make sure you have the following installed:

  • Python 3.10+
  • Docker & Docker Compose
  • DBT
  • Apache Airflow (v2+)
  • PostgreSQL

📦 Installation

# Clone the repository
git clone git@github.com:CamilaJaviera91/gcp-new.git
cd gcp-new

# Create required folders
mkdir -p dags dbt_project/models/{staging,marts} files scripts/{extract,load,utils}

🐳 Docker Setup with Airflow and PostgreSQL

This project uses Docker Compose to orchestrate the following services:

| Service | Description |
|---|---|
| PostgreSQL | Stores raw/transformed data & Airflow metadata |
| Airflow Webserver | UI to manage DAGs |
| Airflow Scheduler | Triggers DAG tasks based on time or sensors |
| Airflow Init | Initializes metadata DB, creates user |

Make sure the folder structure described above exists before launching the containers.


⚙️ .env Configuration

Create a .env file with the following (sample):

# Airflow
AIRFLOW__CORE__EXECUTOR=...
AIRFLOW__CORE__LOAD_EXAMPLES=...
AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=...
AIRFLOW__WEBSERVER__SECRET_KEY=...

# PostgreSQL
POSTGRES_SCHEMA=...
POSTGRES_HOST=...
POSTGRES_PORT=...
POSTGRES_DB=...
POSTGRES_USER=...
POSTGRES_PASSWORD=...

# BigQuery
GOOGLE_CREDENTIALS_PATH=...
BQ_PROJECT_ID=...
BQ_DATASET=...
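
For reference, AIRFLOW__DATABASE__SQL_ALCHEMY_CONN is usually a SQLAlchemy URL built from the same PostgreSQL settings. A minimal sketch, assuming the variable names above and the psycopg2 driver; the helper function is hypothetical, and the host normally matches the PostgreSQL service name in docker-compose.yml:

# Hypothetical helper that assembles the Airflow metadata-DB URL from the
# POSTGRES_* variables defined in .env.
import os

from dotenv import load_dotenv

load_dotenv()


def airflow_sqlalchemy_conn() -> str:
    """Return a postgresql+psycopg2 URL of the form user:password@host:port/db."""
    return (
        f"postgresql+psycopg2://{os.environ['POSTGRES_USER']}"
        f":{os.environ['POSTGRES_PASSWORD']}"
        f"@{os.environ['POSTGRES_HOST']}:{os.environ['POSTGRES_PORT']}"
        f"/{os.environ['POSTGRES_DB']}"
    )


print(airflow_sqlalchemy_conn())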


📦 Python Dependencies

This project uses a requirements.txt file to manage all Python dependencies needed for the data pipeline, including Airflow, DBT, PostgreSQL, testing, and development tools.

🔧 What's Included

| Category | Package(s) | Purpose |
|---|---|---|
| DBT | dbt-core, dbt-postgres, dbt-bigquery | DBT functionality for PostgreSQL and BigQuery |
| Airflow | apache-airflow==2.9.1, apache-airflow-providers-openlineage | Workflow orchestration |
| Database | psycopg2-binary==2.9.9 | PostgreSQL connector used by Airflow and DBT |
| Compatibility | protobuf<5, sqlparse<0.5 | Ensures compatibility with DBT and Airflow |
| Environment Variables | python-dotenv==1.1.0 | Loads .env files for secure and flexible config |
| Synthetic Data | faker==24.9.0 | Generates fake data for testing or mock pipelines |
| Testing | pytest, pytest-mock | Unit testing and mocking for pipeline components |
| Code Quality | black, flake8, isort | Code formatting, linting, and import sorting |
| Data Analysis | numpy, pandas, matplotlib | Analyze, transform, and visualize data in Python |
| Google Sheets Integration | gspread, gspread-dataframe, oauth2client | Interact with Google Sheets via API |
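
As an illustration of the Faker entry above, a short script can generate one of the mock CSVs under files/. This is a sketch only: the column names are assumptions, not the schema produced by the project's scripts/extract/extract.py.

# Hypothetical mock-data generator using faker and pandas (both listed above).
import pandas as pd
from faker import Faker

fake = Faker()
n = 100  # number of fake clients to generate

clients = pd.DataFrame(
    {
        "client_id": range(1, n + 1),
        "name": [fake.name() for _ in range(n)],
        "email": [fake.email() for _ in range(n)],
        "country": [fake.country() for _ in range(n)],
    }
)

# Write to files/, where the rest of the pipeline expects its CSV inputs.
clients.to_csv("files/clients.csv", index=False)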

🛠️ Docker Compose Setup

A sample docker-compose.yml is included in the repo and features:

  • PostgreSQL with persistent volume

  • Airflow Webserver, Scheduler, Init

  • Custom Dockerfile for Airflow + DBT + Python deps

✅ Make sure the volumes: section of each service is properly mapped to the local ./dags, ./scripts, and other project folders.

🌀 Dockerfile.airflow

This file sets up the Airflow environment with Python dependencies and your DBT project.

FROM apache/airflow:2.10.0-python3.11

USER root
RUN apt-get update && apt-get install -y build-essential git

USER airflow

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

⚙️ Helper Scripts

To simplify setup and maintenance, the project includes the following Bash scripts:

| Script | Description |
|---|---|
| 1_init.sh | Initialize Airflow DB, create admin user |
| 2_reset_docker.sh | Reset all containers, volumes, and rebuild |
| 3_fix_permissions.sh | Fix volume permissions (Linux only) |

🌐 Access the Airflow Web UI

Once the containers are up and the initialization step has been completed, you can access the Apache Airflow web interface to monitor, manage, and trigger your DAGs.

🔗 Open in your browser

http://localhost:8080

This URL points to the Airflow webserver running inside the Docker container and exposed on your local machine's port 8080.

🔐 Default login credentials

If you used the initialization script (./1_init.sh), the following admin user was created automatically:

Username: admin  
Password: admin

💡 You can customize these credentials by modifying the airflow users create command inside the airflow-init service or the 1_init.sh script.

🖥️ What you’ll see

After logging in, you’ll be able to:

  • View all DAGs in the dags/ folder
  • Trigger DAGs manually or wait for scheduled runs
  • Monitor task statuses and inspect logs
  • Manage Airflow Connections, Variables, and Pools
  • Access admin configurations and user management

🛠️ Troubleshooting DAGs

If DAGs don't appear:

  • Check that the dags/*.py files define a DAG object (a minimal skeleton is shown below)

  • Use docker compose logs -f airflow-webserver to debug
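
If you need a reference point, the skeleton below is enough for the scheduler to register a DAG. It is not the project's dags/dag.py: the DAG id, schedule, and dbt project path are placeholders.

# Minimal placeholder DAG that the scheduler will discover; not the project's
# actual dags/dag.py. The dag_id, schedule, and paths below are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # One task is enough for the DAG to appear in the web UI.
    run_dbt = BashOperator(
        task_id="dbt_run",
        bash_command="dbt run --project-dir /opt/airflow/dbt_project",  # adjust to your volume mapping
    )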


📈 What’s Next?

This pipeline is ready for:

  • 💡 Building DAGs with Python and Airflow
  • 📤 Exporting data to CSV or Google Sheets (sketched below)
  • 🔗 Connecting to BigQuery
  • 📊 Creating visualizations
  • 🧠 Modeling datasets with DBT and version control
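
As an example of the Google Sheets export mentioned above, the following sketch uses gspread and gspread-dataframe from requirements.txt. The spreadsheet title and the use of the first worksheet are assumptions, and the service-account JSON is taken from GOOGLE_CREDENTIALS_PATH (e.g. credentials/auth.json):

# Hypothetical export of a result CSV to Google Sheets; spreadsheet name and
# worksheet choice are placeholders, not part of the project's scripts.
import os

import gspread
import pandas as pd
from dotenv import load_dotenv
from gspread_dataframe import set_with_dataframe

load_dotenv()

# Authenticate with the service-account JSON referenced in .env
gc = gspread.service_account(filename=os.getenv("GOOGLE_CREDENTIALS_PATH"))

df = pd.read_csv("files/final_report.csv")

sheet = gc.open("final_report")       # spreadsheet must exist and be shared with the service account
set_with_dataframe(sheet.sheet1, df)  # write the DataFrame into the first worksheet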

📬 Feedback or Questions?

Feel free to open an issue or submit a PR!
