
aws-samples/sample-streaming-pipeline-kafka-flink-firehose-s3-cdk

Streaming Pipeline with Amazon MSK, Amazon Managed Flink, Amazon Kinesis Data Firehose & Amazon S3 - CDK

A comprehensive real-time streaming data pipeline built with AWS CDK that demonstrates stock price processing using Apache Kafka (Amazon MSK Serverless), Amazon Managed Service for Apache Flink, Amazon Kinesis Data Firehose, Amazon S3 storage, and Amazon SageMaker Feature Store integration.

Architecture Overview

Architecture Diagram

This project demonstrates a streaming data pipeline that processes stock market data using a modern serverless architecture. The system follows an event-driven design: Amazon EventBridge triggers a Producer AWS Lambda function every minute, which generates realistic stock price data (ticker symbols such as AMZN, AAPL, MSFT, and GOOGL with random price fluctuations) and publishes it to an Amazon MSK Serverless cluster. Amazon Managed Service for Apache Flink consumes from input_topic (for this demo, the application simply logs each message) and writes the processed results to output_topic.
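
The Producer Lambda's record generation can be sketched as follows. This is a minimal illustration, not the repository's actual handler; the function name, ticker list, and base prices are assumptions, chosen to match the data format shown under "Stock Data Structure":

```python
import json
import random
from datetime import datetime, timezone

# Hypothetical ticker universe and base prices for the demo data.
TICKERS = ["AMZN", "AAPL", "MSFT", "GOOGL"]
BASE_PRICES = {"AMZN": 210.0, "AAPL": 195.0, "MSFT": 430.0, "GOOGL": 175.0}

def generate_stock_record() -> dict:
    """Build one synthetic stock-price record with a small random fluctuation."""
    ticker = random.choice(TICKERS)
    price = round(BASE_PRICES[ticker] * random.uniform(0.99, 1.01), 2)
    return {
        "datetime": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "ticker": ticker,
        "price": price,
    }

# Serialized payload as it would be published to input_topic.
payload = json.dumps(generate_stock_record()).encode("utf-8")
```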

The architecture implements a dual consumption pattern from the output topic to serve both batch analytics and real-time machine learning use cases. Amazon Kinesis Data Firehose provides reliable, buffered delivery of processed data to Amazon S3 for data lake storage and historical analytics, while a dedicated Feature Store AWS Lambda function ingests the same streaming data into Amazon SageMaker Feature Store for real-time ML model serving. The entire system is deployed within a secure Amazon VPC with proper AWS IAM roles, security groups, and encryption, demonstrating AWS best practices for building scalable, fault-tolerant streaming data platforms.
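
The Feature Store Lambda's ingestion step might look roughly like this. It is a sketch under stated assumptions: the feature names mirror the record fields, and the feature group name `stock-feature-group` is a placeholder, not necessarily what the stack deploys:

```python
def to_feature_record(stock: dict) -> list:
    """Convert a processed stock record into SageMaker Feature Store format:
    a list of {FeatureName, ValueAsString} entries."""
    return [
        {"FeatureName": name, "ValueAsString": str(value)}
        for name, value in stock.items()
    ]

def ingest(stock: dict, feature_group_name: str = "stock-feature-group") -> None:
    """Write one record to the online feature store."""
    import boto3  # local import keeps the helper above usable without boto3

    client = boto3.client("sagemaker-featurestore-runtime")
    client.put_record(
        FeatureGroupName=feature_group_name,  # placeholder name
        Record=to_feature_record(stock),
    )
```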

Quick Start

1. Environment Setup

# Clone the repository
git clone <repository-url>
cd streaming-pipeline-kafka-flink-firehose-s3-cdk

# Create and activate Python virtual environment
python3 -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
pip install -r requirements-dev.txt

# Install AWS CDK (if not already installed)
npm install -g aws-cdk

2. Configuration

Important: Update the S3 bucket name in project_config.json:

{
  "flink": {
    "artifact_bucket_name": "your-unique-bucket-name-here"
  }
}

Replace your-unique-bucket-name-here with a globally unique S3 bucket name for storing Flink artifacts. All other configuration values can remain as defaults for testing.
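
Because a name that violates S3's rules fails only at deploy time, a quick local sanity check can save a round trip. A minimal sketch (the helper name is an assumption, and this covers only the basic naming rules, not every S3 restriction):

```python
import re

# 3-63 chars; lowercase letters, digits, dots, hyphens; alphanumeric at both ends.
BUCKET_NAME_RE = re.compile(r"^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$")

def validate_bucket_name(name: str) -> bool:
    """Check the basic S3 bucket naming rules; rejects adjacent dots too."""
    return bool(BUCKET_NAME_RE.match(name)) and ".." not in name

validate_bucket_name("my-flink-artifacts-2024")  # True
validate_bucket_name("Your_Bucket")              # False: uppercase/underscore
```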

3. Deploy Infrastructure

# Bootstrap CDK (first time only)
cdk bootstrap

# Deploy all stacks
cdk deploy --all --require-approval never

# Or deploy stacks individually
cdk deploy MSKServerless
cdk deploy FlinkStack
cdk deploy FirehoseToS3Stack
cdk deploy SMFeatureStore

4. Deploy Flink Application

# Build and deploy Flink JAR to S3
chmod +x build-and-deploy-flink-jar.sh
./build-and-deploy-flink-jar.sh

Stack Details

MSKServerlessStack

  • Amazon MSK Serverless Cluster: Fully managed Kafka service with input_topic and output_topic
  • Amazon VPC & Networking: Isolated network with public/private subnets and flow logs
  • Amazon EC2 Bastion Host: EC2 instance for secure Kafka administration and testing
  • Producer AWS Lambda: Amazon EventBridge-triggered function generating stock data every minute
  • Topic Creation AWS Lambda: Custom resource for automated Kafka topic creation
  • Amazon S3 Buckets: Artifact storage, output storage, and access logs buckets
  • Security Groups: Network access control for all components

FlinkStack

  • Amazon Managed Service for Apache Flink Application: Java application deployed to Amazon Managed Service for Apache Flink
  • Stream Processing: Reads from input_topic, processes stock data, writes to output_topic
  • AWS IAM Roles: Granular permissions for Amazon MSK, Amazon S3, and Amazon CloudWatch access
  • Security Groups: Network isolation for Flink application
  • Amazon CloudWatch Logs: Centralized logging for Flink application monitoring

FirehoseToS3Stack

  • Amazon Kinesis Data Firehose: Consumes from output_topic and delivers to Amazon S3
  • Amazon S3 Storage: Reliable data lake storage with configurable buffering
  • AWS IAM Role: Specific permissions for Amazon MSK consumption and Amazon S3 delivery
  • Amazon CloudWatch Logs: Delivery stream monitoring and error logging

SMFeatureStoreStack

  • Feature Group: Amazon SageMaker Feature Store for stock price features
  • Feature Store AWS Lambda: Amazon MSK consumer that ingests processed data into Feature Store
  • Online/Offline Storage: Real-time serving and batch analytics capabilities
  • AWS KMS Encryption: Feature Store data encryption at rest
  • Amazon S3 Offline Store: Durable storage for feature engineering and model training

Data Flow Details

Stock Data Structure

{
  "datetime": "2025-06-25T18:17:38Z",
  "ticker": "AMZN", 
  "price": 210.25
}

Processing Pipeline

  1. Amazon EventBridge triggers Producer AWS Lambda every minute
  2. Producer AWS Lambda generates 2 random stock records per execution
  3. Amazon Managed Service for Apache Flink Application processes stock data (Java application in flink-java-app/)
  4. Dual Output:
    • Amazon Kinesis Data Firehose → Amazon S3 bucket (JSON format, 60s/1MB buffering)
    • Feature Store AWS Lambda → Amazon SageMaker Feature Store (real-time ML features)
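
On the Firehose branch, each buffered delivery lands in S3 as one object containing many records; Firehose concatenates records without adding delimiters, so readers depend on the producer's framing. A sketch of a downstream reader, assuming newline-delimited JSON (the framing is an assumption, not something this pipeline guarantees):

```python
import json

def parse_delivery_object(body: bytes) -> list:
    """Split a Firehose-delivered S3 object into individual JSON records.
    Assumes each record was written with a trailing newline."""
    return [
        json.loads(line)
        for line in body.decode("utf-8").splitlines()
        if line.strip()
    ]

sample = b'{"ticker": "AMZN", "price": 210.25}\n{"ticker": "MSFT", "price": 430.1}\n'
records = parse_delivery_object(sample)
```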

Troubleshooting

Common Issues

  1. Amazon Managed Service for Apache Flink Application Not Starting

    • Verify Amazon S3 artifact bucket contains kafka-connectors-1.0.jar
    • Check Flink role permissions for Amazon MSK access
    • Review Amazon CloudWatch logs: /aws/kinesis-analytics/demo-flink-app-cdk
  2. Producer AWS Lambda Not Generating Data

    • Check Amazon EventBridge rule is enabled (triggers every minute)
    • Verify Amazon MSK cluster is in ACTIVE state
    • Monitor AWS Lambda logs for authentication issues
  3. Feature Store Ingestion Issues

    • Verify Feature Group is created and ACTIVE
    • Check AWS Lambda VPC configuration and security groups
    • Monitor Amazon SageMaker Feature Store metrics
  4. Amazon Kinesis Data Firehose Delivery Failures

    • Check Amazon S3 bucket permissions
    • Verify Amazon MSK output_topic has data
    • Review Amazon Kinesis Data Firehose CloudWatch logs

Cleanup

# Destroy all stacks
cdk destroy --all

Note: If Elastic Network Interfaces (ENIs) remain after stack destruction, you may need to delete them manually from the AWS Console.
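
Leftover ENIs are typically the detached interfaces that Lambda or MSK created inside the VPC. They can also be cleaned up with boto3 (a sketch; the VPC ID is a placeholder, and only ENIs in the "available" state can be deleted):

```python
def deletable_enis(interfaces: list) -> list:
    """Pick the IDs of detached ('available') ENIs from a
    describe_network_interfaces result."""
    return [
        eni["NetworkInterfaceId"]
        for eni in interfaces
        if eni.get("Status") == "available"
    ]

def delete_leftover_enis(vpc_id: str) -> None:
    """Find and delete detached network interfaces left behind in the VPC."""
    import boto3  # local import keeps the pure helper usable without boto3

    ec2 = boto3.client("ec2")
    resp = ec2.describe_network_interfaces(
        Filters=[{"Name": "vpc-id", "Values": [vpc_id]}]  # placeholder filter
    )
    for eni_id in deletable_enis(resp["NetworkInterfaces"]):
        ec2.delete_network_interface(NetworkInterfaceId=eni_id)
```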

Security

See CONTRIBUTING for more information.

License

This library is licensed under the MIT-0 License. See the LICENSE file.

Built with ❤️ using AWS CDK, Apache Kafka, and Apache Flink
