Analytics engineering and DataOps
This project demonstrates an end-to-end data pipeline leveraging:
- Python scripts to ingest data from Google Sheets into PostgreSQL.
- Incremental loading from PostgreSQL to BigQuery (Silver dataset).
- dbt Cloud for transformations, testing, and loading into the Gold dataset.
- Terraform for infrastructure provisioning.
- GitHub Actions for CI/CD.
The pipeline optimizes data transfer to BigQuery by loading only records added since the last successful load.
How it works:
- Get last load timestamp: the script queries the target BigQuery table for the latest timestamp in a specified column.
- Extract only new data: PostgreSQL is queried for rows with a timestamp greater than the last load.
- Append to BigQuery: new rows are appended to the existing table without overwriting previously loaded data.
- Fallback to full load: if the target table doesn't exist or is empty, a full load is performed (see the sketch below).
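Below is a minimal Python sketch of that flow, assuming the `google-cloud-bigquery`, `pandas`, `SQLAlchemy`, and `psycopg2` packages; the table names, timestamp column, connection string, and project ID are illustrative placeholders rather than the project's actual configuration.

```python
"""Minimal sketch of the incremental PostgreSQL -> BigQuery load."""
import pandas as pd
from google.cloud import bigquery
from google.api_core.exceptions import NotFound
from sqlalchemy import create_engine, text

bq = bigquery.Client()
pg = create_engine("postgresql+psycopg2://user:password@localhost:5432/market_data")  # placeholder DSN

BQ_TABLE = "my-project.silver.ticker_history"  # placeholder target table
TS_COLUMN = "loaded_at"                        # placeholder timestamp column


def get_last_load_timestamp():
    """Return the latest timestamp in the target table, or None if it is missing or empty."""
    try:
        bq.get_table(BQ_TABLE)
    except NotFound:
        return None
    row = next(iter(bq.query(f"SELECT MAX({TS_COLUMN}) AS ts FROM `{BQ_TABLE}`").result()))
    return row.ts  # None when the table exists but has no rows


def extract_new_rows(last_ts):
    """Pull only rows newer than the last load; fall back to a full extract."""
    if last_ts is None:
        return pd.read_sql(text("SELECT * FROM ticker_history"), pg)
    query = text(f"SELECT * FROM ticker_history WHERE {TS_COLUMN} > :ts")
    return pd.read_sql(query, pg, params={"ts": last_ts})


def append_to_bigquery(df):
    """Append new rows without overwriting previously loaded data."""
    job_config = bigquery.LoadJobConfig(write_disposition=bigquery.WriteDisposition.WRITE_APPEND)
    bq.load_table_from_dataframe(df, BQ_TABLE, job_config=job_config).result()


if __name__ == "__main__":
    last_ts = get_last_load_timestamp()
    new_rows = extract_new_rows(last_ts)
    if not new_rows.empty:
        append_to_bigquery(new_rows)
```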
Why?
- Reduces processing time.
- Minimizes API calls and data transfer.
- Ensures no duplication of previously loaded data.
Google Sheets → PostgreSQL
- Extract data from a ticker list sheet and historical ticker sheets.
- Store in PostgreSQL for quick local analysis.
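A minimal sketch of this ingestion step, assuming `gspread` for the Sheets API and `pandas`/`SQLAlchemy` for the database write; the spreadsheet name, worksheet layout, credentials file, and table names are illustrative guesses, not the project's actual setup.

```python
"""Minimal sketch of the Google Sheets -> PostgreSQL ingestion."""
import gspread
import pandas as pd
from sqlalchemy import create_engine

gc = gspread.service_account(filename="service_account.json")  # placeholder credentials file
pg = create_engine("postgresql+psycopg2://user:password@localhost:5432/market_data")  # placeholder DSN

spreadsheet = gc.open("ticker_data")  # placeholder spreadsheet name

# Load the ticker list worksheet into its own table.
tickers = pd.DataFrame(spreadsheet.worksheet("ticker_list").get_all_records())
tickers.to_sql("ticker_list", pg, if_exists="replace", index=False)

# Load each historical ticker worksheet into a single history table.
history_frames = []
for ws in spreadsheet.worksheets():
    if ws.title == "ticker_list":
        continue
    frame = pd.DataFrame(ws.get_all_records())
    frame["ticker"] = ws.title  # record which sheet each row came from
    history_frames.append(frame)

if history_frames:
    pd.concat(history_frames, ignore_index=True).to_sql(
        "ticker_history", pg, if_exists="replace", index=False
    )
```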
PostgreSQL → BigQuery (Silver)
- Incrementally load tables based on their timestamp column.
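One way to drive this per table is a small configuration mapping that feeds the incremental routine sketched under "How it works"; the table and column names below are hypothetical and only illustrate the shape of such a config.

```python
# Hypothetical per-table configuration; the real project may name its tables
# and timestamp columns differently.
INCREMENTAL_TABLES = {
    "ticker_list": "updated_at",
    "ticker_history": "loaded_at",
}

# Each entry supplies the WHERE-clause column for the incremental extract,
# and the table name doubles as the destination in the Silver dataset.
```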
dbt Cloud → BigQuery (Gold)
- Transform and test data before making it analytics-ready. (Section below is a placeholder)