This project demonstrates a complete data pipeline that:
- Uses Apache Spark to process sample data
- Exports processed data to Azure Blob Storage in BCP-compatible format
- Uses the BCP utility to bulk import the data into SQL Server
Project structure:
```
bcp-investigation/
├── README.md
├── requirements.txt
├── config/
│   ├── spark_config.py
│   └── azure_config.json
├── data/
│   └── sample_data.csv
├── spark/
│   ├── data_processor.py
│   └── spark_to_blob.py
├── sql/
│   ├── create_table.sql
│   └── format_file.fmt
├── bcp/
│   ├── bcp_import.sh
│   └── bcp_import.ps1
├── infra/
│   ├── main.bicep
│   ├── main.parameters.json
│   ├── deploy.sh
│   └── README.md
└── scripts/
    ├── run_pipeline.sh
    └── setup_environment.sh
```
Azure Resources:
- Azure Storage Account
- Azure SQL Database or SQL Server instance
- Service Principal (for authentication)
Software Requirements:
- Python 3.8+
- Apache Spark 3.x
- SQL Server BCP utility
- Azure CLI
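To sanity-check these prerequisites, each of the following version checks should succeed once the tools are installed and on your PATH (a quick sketch, not part of the repo's scripts):

```bash
python3 --version        # expect 3.8 or newer
spark-submit --version   # Apache Spark 3.x
bcp -v                   # SQL Server bulk copy utility
az version               # Azure CLI
```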
Create Configuration / Parameters files:
```bash
cp ./infra/main.parameters.json.copy ./infra/main.parameters.json
cp ./config/azure_config.json.copy ./config/azure_config.json
```
Install Azure CLI:
```bash
curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash
```
Install BCP:
```bash
chmod +x ./scripts/install_bcp.sh
bash ./scripts/install_bcp.sh
```
**For quick start information and details on local configuration, see the deployment and setup steps below.**
Deploy Azure Infrastructure (for more information, see infra/README.md):
```bash
chmod +x infra/deploy.sh
./infra/deploy.sh
```
Setup Local Environment:
```bash
chmod +x scripts/setup_environment.sh
./scripts/setup_environment.sh
```
NOTE: The following manual updates are required in the Azure portal:
- Update the SQL Server to enable Microsoft Entra ID authentication, adding your current user as the SQL admin.
- Open the database and run the query in `./sql/create_table.sql` to create the table the data will land in.
- The user account running the scripts needs Azure permissions on the blob storage account, specifically the "Storage Blob Data Contributor" role (see the example command after this list).
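As a hedged sketch of that last role assignment (not part of the repo's scripts; `<storage-account>` and `<resource-group>` are placeholders for your deployed resources), the grant can also be done from the Azure CLI instead of the portal:

```bash
# Grant the signed-in user "Storage Blob Data Contributor" on the storage account.
SCOPE=$(az storage account show \
  --name <storage-account> \
  --resource-group <resource-group> \
  --query id -o tsv)

az role assignment create \
  --assignee "$(az ad signed-in-user show --query id -o tsv)" \
  --role "Storage Blob Data Contributor" \
  --scope "$SCOPE"
```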
Run the Pipeline:
```bash
chmod +x scripts/run_pipeline.sh
./scripts/run_pipeline.sh
```
NOTE: To clean up all log files generated during execution, run `bash ./scripts/clear_log_files.sh`.
The pipeline runs the following stages:
- Data Generation: Creates sample sales data
- Spark Processing: Transforms and aggregates data
- Blob Export: Writes data to Azure Blob Storage in pipe-delimited format
- BCP Import: Bulk imports data into SQL Server using BCP utility
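To make the last two stages concrete, here is a rough manual sketch of what they amount to. The real logic lives in spark/spark_to_blob.py and bcp/bcp_import.sh; every name below (storage account, container, blob path, table, server, credentials) is a placeholder rather than a value from this repo:

```bash
# Download one exported pipe-delimited file from Blob Storage...
az storage blob download \
  --account-name <storage-account> \
  --container-name <container> \
  --name exports/sales/part-00000.csv \
  --file ./sales_export.csv \
  --auth-mode login

# ...then bulk load it into SQL Server with bcp (character mode, "|" field terminator).
bcp <database>.dbo.<table> in ./sales_export.csv \
  -S <server>.database.windows.net -U <user> -P <password> \
  -c -t "|" -r "\n" -e bcp_errors.log
```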
Edit config/azure_config.json to match your Azure environment:
- Storage account details
- SQL Server connection information
- Authentication credentials
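The authoritative key names come from config/azure_config.json.copy in this repo; purely as an illustration of the kind of values involved (the field names below are assumptions, not the file's actual schema):

```json
{
  "storage_account_name": "<storage-account>",
  "container_name": "<container>",
  "sql_server": "<server>.database.windows.net",
  "sql_database": "<database>",
  "sql_username": "<user>",
  "sql_password": "<password>"
}
```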
All operations include comprehensive logging:
- Spark job logs
- BCP operation logs
- Error handling and retry logic
Performance considerations:
- File sizes optimized for BCP performance
- Partitioning strategy for large datasets
- Connection pooling and timeout settings
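On the BCP side, throughput is typically tuned with bcp's batch-size, packet-size, and load-hint flags; the values below are arbitrary starting points for illustration, not measured recommendations for this project:

```bash
# -b  rows committed per batch
# -a  network packet size in bytes
# -h "TABLOCK"  table-level lock, enabling faster minimally logged bulk loads
bcp <database>.dbo.<table> in ./sales_export.csv \
  -S <server>.database.windows.net -U <user> -P <password> \
  -c -t "|" -b 50000 -a 16384 -h "TABLOCK" -e bcp_errors.log
```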