This repository implements a modern, modular data pipeline using:
- Apache Airflow for orchestration
- DBT for SQL-based transformations
- PostgreSQL as both the source/target database and metadata store
💡 Ideal for learning, development, and lightweight data integration projects.
```
.
├── 1_init.sh
├── 2_reset_docker.sh
├── 3_fix_permissions.sh
├── credentials            # gitignored
│   └── auth.json
├── dags
│   └── dag.py
├── dbt_project
│   ├── dbt_project.yml
│   ├── models
│   │   ├── marts
│   │   │   ├── final_report.sql
│   │   │   └── sales_by_product.sql
│   │   ├── schema.sql
│   │   └── staging
│   │       ├── clients.sql
│   │       ├── orders.sql
│   │       └── products.sql
│   └── profiles.yml
├── docker-compose.yml
├── Dockerfile.airflow
├── files
│   ├── clients.csv
│   ├── final_report.csv
│   ├── orders.csv
│   ├── products.csv
│   └── sales_by_product.csv
├── LICENSE
├── README.md
├── requirements.txt
└── scripts
    ├── extract
    │   └── extract.py
    ├── load
    │   └── load_data.py
    └── utils
        └── utils.py
```
Before running the pipeline, make sure to create the following folders in the root directory of the project:
```
.
├── dags/              # Airflow DAG definitions
├── dbt_project/       # DBT transformations and config
│   └── models/
│       ├── staging/   # Raw → Staging transformations
│       └── marts/     # Staging → Marts (analytics-ready)
├── files/             # CSVs, exports, mock datasets
└── scripts/           # Python utilities for extract/load/validation
    ├── extract/
    ├── load/
    └── utils/
```
- `dags/`: Contains Airflow DAGs to orchestrate the pipeline (a minimal example is sketched below).
- `dbt_project/`: Contains the DBT project with all SQL transformation models.
  - `models/staging/`: Staging models for cleaning and preparing raw data.
  - `models/marts/`: Data marts with final models ready for analysis and reporting.
- `files/`: Stores input/output files such as CSVs.
- `scripts/`: Includes helper scripts for data extraction, validation, and loading.
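As a point of reference, a minimal `dags/dag.py` could wire the extract, load, and DBT steps together as shown below. This is only a sketch: the task names, container paths (`/opt/airflow/...`), and choice of `BashOperator` are assumptions, not the repository's actual code.

```python
# Minimal sketch of an extract -> load -> dbt run DAG.
# Paths and task names are illustrative assumptions.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="elt_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(
        task_id="extract",
        bash_command="python /opt/airflow/scripts/extract/extract.py",
    )
    load = BashOperator(
        task_id="load",
        bash_command="python /opt/airflow/scripts/load/load_data.py",
    )
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command="cd /opt/airflow/dbt_project && dbt run",
    )

    # Staging and marts models run only after the raw data has landed.
    extract >> load >> dbt_run
```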
Make sure you have the following installed:
- Python 3.10+
- Docker & Docker Compose
- DBT
- Apache Airflow (v2+)
- PostgreSQL
```bash
# Clone the repository
git clone git@github.com:CamilaJaviera91/gcp-new.git
cd gcp-new

# Create required folders
mkdir -p dags dbt_project/models/{staging,marts} files scripts/{extract,load,utils}
```

This project uses Docker Compose to orchestrate the following services:
| Service | Description |
|---|---|
| PostgreSQL | Stores raw/transformed data & Airflow metadata |
| Airflow Webserver | UI to manage DAGs |
| Airflow Scheduler | Triggers DAG tasks based on time or sensors |
| Airflow Init | Initializes metadata DB, creates user |
Make sure the folder structure above exists before launching the containers.
Create a .env file with the following (sample):
```
# Airflow
AIRFLOW__CORE__EXECUTOR=...
AIRFLOW__CORE__LOAD_EXAMPLES=...
AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=...
AIRFLOW__WEBSERVER__SECRET_KEY=...

# PostgreSQL
POSTGRES_SCHEMA=...
POSTGRES_HOST=...
POSTGRES_PORT=...
POSTGRES_DB=...
POSTGRES_USER=...
POSTGRES_PASSWORD=...

# BigQuery
GOOGLE_CREDENTIALS_PATH=...
BQ_PROJECT_ID=...
BQ_DATASET=...
```
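The Python scripts can consume these values with `python-dotenv` (listed in `requirements.txt`). Below is a minimal sketch of opening a PostgreSQL connection from the variables above; the actual helpers in `scripts/utils/utils.py` may be organized differently.

```python
# Sketch: load the .env file and open a PostgreSQL connection
# using the sample variables from the README.
import os

import psycopg2
from dotenv import load_dotenv

load_dotenv()  # reads .env from the project root

conn = psycopg2.connect(
    host=os.getenv("POSTGRES_HOST"),
    port=os.getenv("POSTGRES_PORT"),
    dbname=os.getenv("POSTGRES_DB"),
    user=os.getenv("POSTGRES_USER"),
    password=os.getenv("POSTGRES_PASSWORD"),
)
conn.close()
```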
This project uses a requirements.txt file to manage all Python dependencies needed for the data pipeline, including Airflow, DBT, PostgreSQL, testing, and development tools.
| Category | Package(s) | Purpose |
|---|---|---|
| DBT | `dbt-core`, `dbt-postgres`, `dbt-bigquery` | DBT functionality for PostgreSQL and BigQuery |
| Airflow | `apache-airflow==2.9.1`, `apache-airflow-providers-openlineage` | Workflow orchestration |
| Database | `psycopg2-binary==2.9.9` | PostgreSQL connector used by Airflow and DBT |
| Compatibility | `protobuf<5`, `sqlparse<0.5` | Ensures compatibility with DBT and Airflow |
| Environment Variables | `python-dotenv==1.1.0` | Loads `.env` files for secure and flexible config |
| Synthetic Data | `faker==24.9.0` | Generates fake data for testing or mock pipelines |
| Testing | `pytest`, `pytest-mock` | Unit testing and mocking for pipeline components |
| Code Quality | `black`, `flake8`, `isort` | Code formatting, linting, and import sorting |
| Data Analysis | `numpy`, `pandas`, `matplotlib` | Analyze, transform, and visualize data in Python |
| Google Sheets Integration | `gspread`, `gspread-dataframe`, `oauth2client` | Interact with Google Sheets via its API |
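For example, `faker` and `pandas` can be combined to build the mock CSVs under `files/`. The column names below are illustrative assumptions, not the project's actual schema.

```python
# Sketch: generate a mock files/clients.csv with faker and pandas.
import pandas as pd
from faker import Faker

fake = Faker()
n = 100  # number of mock clients to generate

clients = pd.DataFrame(
    {
        "client_id": range(1, n + 1),
        "name": [fake.name() for _ in range(n)],
        "email": [fake.email() for _ in range(n)],
        "country": [fake.country() for _ in range(n)],
    }
)
clients.to_csv("files/clients.csv", index=False)
```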
A sample `docker-compose.yml` setup is included in the repo and features:

- PostgreSQL with a persistent volume
- Airflow Webserver, Scheduler, and Init services
- A custom Dockerfile for Airflow + DBT + Python dependencies

✅ Make sure the `volumes:` section of each service is properly mapped to `./dags`, `./scripts`, etc.
This file (`Dockerfile.airflow`) builds the Airflow image with the Python dependencies needed by the DAGs and the DBT project.
```dockerfile
FROM apache/airflow:2.10.0-python3.11

# Build tools and git are required to install some Python packages
USER root
RUN apt-get update && apt-get install -y build-essential git

# Install Python dependencies as the airflow user
USER airflow
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
```
To simplify setup and maintenance, the project includes the following Bash scripts:
| Script | Description |
|---|---|
| `1_init.sh` | Initialize Airflow DB, create admin user |
| `2_reset_docker.sh` | Reset all containers, volumes, and rebuild |
| `3_fix_permissions.sh` | Fix volume permissions (Linux only) |
Once the containers are up and the initialization step has been completed, you can access the Apache Airflow web interface to monitor, manage, and trigger your DAGs.
Open http://localhost:8080 in your browser. This URL points to the Airflow webserver running inside the Docker container and exposed on your local machine's port 8080.
If you used the initialization script (./1_init.sh), the following admin user was created automatically:
- Username: `admin`
- Password: `admin`
💡 You can customize these credentials by modifying the `airflow users create` command inside the `airflow-init` service or the `1_init.sh` script.
After logging in, you’ll be able to:
- View all DAGs in the `dags/` folder
- Trigger DAGs manually or wait for scheduled runs
- Monitor task statuses and inspect logs
- Manage Airflow Connections, Variables, and Pools
- Access admin configurations and user management
If DAGs don't appear:
- Check that the `dags/*.py` files define a DAG object
- Run `docker compose logs -f airflow-webserver` to debug
This pipeline is ready for:
- 💡 Building DAGs with Python and Airflow
- 📤 Exporting data to CSV or Google Sheets (see the sketch below)
- 🔗 Connecting to BigQuery
- 📊 Creating visualizations
- 🧠 Modeling datasets with DBT and version control
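As an example of the Sheets export, `gspread` and `gspread-dataframe` (both in `requirements.txt`) can push a result CSV to a spreadsheet. The spreadsheet name below is an illustrative assumption; the service-account file matches the `credentials/auth.json` path from the project tree.

```python
# Sketch: push files/final_report.csv to a Google Sheet.
import gspread
import pandas as pd
from gspread_dataframe import set_with_dataframe

# Authenticate with the service-account JSON kept under credentials/
gc = gspread.service_account(filename="credentials/auth.json")

# "final_report" is an assumed spreadsheet name shared with the service account
worksheet = gc.open("final_report").sheet1
set_with_dataframe(worksheet, pd.read_csv("files/final_report.csv"))
```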
Feel free to open an issue or submit a PR!