
Spark to Azure Blob to SQL Server BCP Pipeline

This project demonstrates a complete data pipeline that:

  1. Uses Apache Spark to process sample data
  2. Exports processed data to Azure Blob Storage in BCP-compatible format
  3. Uses BCP utility to bulk import data into SQL Server
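
The blob export in step 2 is handled by spark/spark_to_blob.py (see the project structure below). As a rough sketch, that step could also be run by hand with spark-submit; the connector package and its version are assumptions, and scripts/run_pipeline.sh is the supported entry point.

    # Sketch only: run the Spark-to-blob export step directly.
    # The hadoop-azure connector is assumed to be required for wasbs:// / abfss:// paths;
    # match the version to your Spark/Hadoop install. run_pipeline.sh normally drives this.
    spark-submit \
      --packages org.apache.hadoop:hadoop-azure:3.3.4 \
      spark/spark_to_blob.py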

Project Structure

bcp-investigation/
├── README.md
├── requirements.txt
├── config/
│   ├── spark_config.py
│   └── azure_config.json
├── data/
│   └── sample_data.csv
├── spark/
│   ├── data_processor.py
│   └── spark_to_blob.py
├── sql/
│   ├── create_table.sql
│   └── format_file.fmt
├── bcp/
│   ├── bcp_import.sh
│   └── bcp_import.ps1
├── infra/
│   ├── main.bicep
│   ├── main.parameters.json
│   ├── deploy.sh
│   └── README.md
└── scripts/
    ├── run_pipeline.sh
    └── setup_environment.sh

Prerequisites

  1. Azure Resources:

    • Azure Storage Account
    • Azure SQL Database or SQL Server instance
    • Service Principal (for authentication)
  2. Software Requirements:

    • Python 3.8+
    • Apache Spark 3.x
    • SQL Server BCP utility
    • Azure CLI
  3. Create Configuration / Parameters files:

    cp ./infra/main.parameters.json.copy ./infra/main.parameters.json
    cp ./config/azure_config.json.copy ./config/azure_config.json

  4. Install Azure CLI:

    curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash
  5. Install BCP:

    chmod +x ./scripts/install_bcp.sh
    bash ./scripts/install_bcp.sh
    

Setup Instructions

For quick-start information and details on local configuration, follow the steps below.

  1. Deploy Azure Infrastructure (for more information, see infra/README.md):

    chmod +x infra/deploy.sh
    ./infra/deploy.sh
  2. Setup Local Environment:

    chmod +x scripts/setup_environment.sh
    ./scripts/setup_environment.sh

NOTE: The following updates must be made manually in the Azure Portal (a CLI sketch of the same steps follows this list):

  • Update the SQL Server to enable Entra ID, adding your current user as the admin for SQL.
  • Open the database and run the query in ./sql/create_table.sql to create the table the data will land in.
  • The user account used to run the scripts needs Azure permissions on the blob storage account, specifically the "Storage Blob Data Contributor" role.
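
If you prefer to script these steps rather than click through the portal, an Azure CLI sketch like the one below should work; all angle-bracket values are placeholders, and the exact sqlcmd authentication flags depend on the version installed.

    # Make your user the Microsoft Entra (Azure AD) admin on the logical SQL server
    az sql server ad-admin create \
      --resource-group <resource-group> \
      --server <sql-server-name> \
      --display-name <your-user-principal-name> \
      --object-id <your-object-id>

    # Grant yourself Storage Blob Data Contributor on the storage account
    az role assignment create \
      --assignee <your-upn-or-object-id> \
      --role "Storage Blob Data Contributor" \
      --scope $(az storage account show --name <storage-account> \
                  --resource-group <resource-group> --query id -o tsv)

    # Create the landing table (-G uses Microsoft Entra authentication)
    sqlcmd -S <sql-server-name>.database.windows.net -d <database> -G -i ./sql/create_table.sql
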
  3. Run the Pipeline:

    chmod +x scripts/run_pipeline.sh
    ./scripts/run_pipeline.sh

NOTE: To clean up the log files generated during a run, execute:

    bash ./scripts/clear_log_files.sh

Pipeline Flow

  1. Data Generation: Creates sample sales data
  2. Spark Processing: Transforms and aggregates data
  3. Blob Export: Writes data to Azure Blob Storage in pipe-delimited format
  4. BCP Import: Bulk imports data into SQL Server using BCP utility
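
The BCP step reduces to a single bcp invocation against the exported file. The sketch below shows the general shape of the command bcp_import.sh is expected to run; server, database, table, and file names are placeholders, and the repo's script may authenticate differently (for example with Entra ID). Because a format file is supplied, the pipe delimiter comes from ./sql/format_file.fmt rather than from -t/-r flags.

    # Sketch only: bulk-load the pipe-delimited export into SQL Server.
    bcp dbo.<target-table> in <path-to-exported-file> \
      -S <sql-server-name>.database.windows.net \
      -d <database> \
      -U <sql-user> -P <sql-password> \
      -f ./sql/format_file.fmt \
      -e bcp_errors.log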

Configuration

Edit config/azure_config.json to match your Azure environment:

  • Storage account details
  • SQL Server connection information
  • Authentication credentials
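
The Azure CLI can supply most of these values; the commands below are a sketch, and the exact keys expected by azure_config.json are defined by this repo's scripts, so map the output onto whatever fields that file actually uses.

    # Connection string template for the target database (fill in credentials per your auth method)
    az sql db show-connection-string --client sqlcmd --server <sql-server-name> --name <database>

    # Blob endpoint and, if key-based auth is used, an access key for the storage account
    az storage account show --name <storage-account> --query primaryEndpoints.blob -o tsv
    az storage account keys list --account-name <storage-account> \
      --resource-group <resource-group> --query "[0].value" -o tsv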

Monitoring and Logging

All operations include comprehensive logging:

  • Spark job logs
  • BCP operation logs
  • Error handling and retry logic

Performance Considerations

  • File sizes optimized for BCP performance
  • Partitioning strategy for large datasets
  • Connection pooling and timeout settings
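
On the BCP side, throughput is typically tuned with batch-size and packet-size flags; the values below are illustrative only and are not settings taken from this repo's scripts.

    # -b commits rows in batches; -a raises the network packet size (bytes).
    # Illustrative values only; tune against your own data volumes.
    bcp dbo.<target-table> in <path-to-exported-file> \
      -S <sql-server-name>.database.windows.net -d <database> \
      -U <sql-user> -P <sql-password> -f ./sql/format_file.fmt \
      -b 50000 -a 16384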

