Optimize Docker Build Performance with Layer Caching and Multi-stage Builds #188

@nishadeborahphilips

Table of Contents

  1. Overview
  2. Architecture
  3. Step-by-Step Implementation
  4. Testing Guide
  5. Troubleshooting

Overview

This guide provides step-by-step instructions to optimize Docker build performance across all Dockerfiles in the MCP Gateway Registry project.

What You're Optimizing

Three Docker build configurations that currently have suboptimal layer caching and build strategies:

  1. Root Dockerfile - Main application container
  2. docker/Dockerfile.registry - Registry service with frontend
  3. metrics-service/Dockerfile - Metrics service

Expected Improvements

  • Build time: Reduce from ~10-15 minutes to ~2-3 minutes (after first build)
  • Rebuild time: When only app code changes, rebuild in <1 minute
  • Image size: Reduce by 30-50% with multi-stage builds
  • CI/CD: Faster deployments, more frequent releases

Architecture

Current Problems

Current Flow (Inefficient):
1. Copy ALL application code
2. Install dependencies
3. Build application

Problem: Any code change invalidates the COPY layer and every layer after it, so dependencies are reinstalled and the application is rebuilt from scratch

Optimized Flow

Optimized Flow:
1. Copy ONLY dependency files (package.json, pyproject.toml)
2. Install dependencies (CACHED if dependencies unchanged)
3. Copy application code
4. Build application (CACHED if source code unchanged)

Benefit: Code changes only rebuild final layers, reusing cached dependencies
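
The optimized flow works because Docker decides whether a COPY layer can be reused from the contents of the copied files, so a layer that copies only the dependency manifests stays valid when application code changes. The sketch below imitates that idea outside Docker. `deps_fingerprint` is a hypothetical helper for illustration only; it is not part of Docker or this project, and it does not reproduce Docker's internal checksum.

```shell
#!/bin/sh
# deps_fingerprint FILE...
# Hypothetical helper: prints one checksum over the given manifest files.
# It only imitates the idea behind the COPY cache key (file contents in,
# fingerprint out); Docker's real checksum is computed differently.
deps_fingerprint() {
    cat "$@" | cksum | cut -d ' ' -f 1
}

# Example usage from the repository root:
#   deps_fingerprint pyproject.toml
#   deps_fingerprint frontend/package.json frontend/package-lock.json
```

Editing a .py file leaves the fingerprint of pyproject.toml untouched, which is the same reason the dependency-install layer remains cached in the optimized flow.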

Multi-stage Build Pattern

Stage 1: Builder (with build tools)
  → Install build dependencies
  → Build application
  → Large image size

Stage 2: Runtime (minimal)
  → Copy built artifacts from Stage 1
  → NO build tools in final image
  → Small, secure image

Step-by-Step Implementation

File Overview

Files that will be modified:

  • /Dockerfile (root)
  • /docker/Dockerfile.registry
  • /metrics-service/Dockerfile
  • No nginx changes needed (.dockerignore already optimal)

Step 1: Optimize Root Dockerfile

File: /Dockerfile
Lines to modify: 23-34

Current Code (BEFORE)

Lines 23-34:

# Set the working directory in the container
WORKDIR /app

# Copy the application code
COPY . /app/

# Copy nginx configurations (both HTTP-only and HTTP+HTTPS versions)
COPY docker/nginx_rev_proxy_http_only.conf /app/docker/nginx_rev_proxy_http_only.conf
COPY docker/nginx_rev_proxy_http_and_https.conf /app/docker/nginx_rev_proxy_http_and_https.conf

# Make the entrypoint script executable
COPY docker/entrypoint.sh /app/docker/entrypoint.sh
RUN chmod +x /app/docker/entrypoint.sh

Problem: Line 26 copies entire application before dependencies are installed, breaking layer cache.

New Code (AFTER)

Replace lines 23-34 with:

# Set the working directory in the container
WORKDIR /app

# Copy dependency files first for layer caching
COPY pyproject.toml ./

# Copy nginx configurations and scripts (rarely change, cache-friendly)
COPY docker/nginx_rev_proxy_http_only.conf /app/docker/nginx_rev_proxy_http_only.conf
COPY docker/nginx_rev_proxy_http_and_https.conf /app/docker/nginx_rev_proxy_http_and_https.conf
COPY docker/entrypoint.sh /app/docker/entrypoint.sh
RUN chmod +x /app/docker/entrypoint.sh

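# Install dependencies only, assuming they are declared under [project] (cached unless pyproject.toml changes)
RUN pip install uv && \
    uv pip install --system -r pyproject.toml

# Copy the application code AFTER dependencies are installed
COPY . /app/

# Install the project itself now that its source is present (an editable install needs the source tree)
RUN uv pip install --system -e .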

Why This Works

  1. Line 26: Copy only pyproject.toml first
  2. Lines 29-32: Copy nginx configs (rarely change)
  3. Lines 35-36: Install dependencies (cached if pyproject.toml unchanged)
  4. Line 39: Copy full app code LAST

Impact: Code changes only rebuild from line 39 onwards, saving 5-10 minutes per build.


Step 2: Optimize Registry Dockerfile (Option A - Quick Fix)

File: /docker/Dockerfile.registry
Lines to modify: 23-63

Current Code (BEFORE)

Lines 23-37 (Frontend build):

WORKDIR /app

# Copy the application code
COPY . /app/

# Copy nginx configurations (both HTTP-only and HTTP+HTTPS versions)
COPY docker/nginx_rev_proxy_http_only.conf /app/docker/nginx_rev_proxy_http_only.conf
COPY docker/nginx_rev_proxy_http_and_https.conf /app/docker/nginx_rev_proxy_http_and_https.conf

# Build React frontend
WORKDIR /app/frontend
COPY frontend/package.json ./
RUN npm install --legacy-peer-deps
COPY frontend/ ./
RUN npm run build

Lines 39-63 (Python setup):

# Return to app directory
WORKDIR /app

# Install uv and setup Python environment
RUN pip install uv && \
    uv venv .venv --python 3.12 && \
    . .venv/bin/activate && \
    uv pip install \
    "fastapi>=0.115.12" \
    # ... (more dependencies)
    "hf_xet>=0.1.0" && \
    uv pip install -e .

Problems:

  • Line 26: Copies entire app before dependencies
  • Line 35: Uses npm install instead of npm ci (slower)
  • Lines 43-63: Python dependencies installed after app copy

New Code (AFTER)

Replace lines 23-63 with:

WORKDIR /app

# Copy nginx configurations first (rarely change)
COPY docker/nginx_rev_proxy_http_only.conf /app/docker/nginx_rev_proxy_http_only.conf
COPY docker/nginx_rev_proxy_http_and_https.conf /app/docker/nginx_rev_proxy_http_and_https.conf

# Build React frontend - optimize layer caching
WORKDIR /app/frontend
COPY frontend/package.json frontend/package-lock.json ./
RUN npm ci --legacy-peer-deps
COPY frontend/ ./
RUN npm run build

# Return to app directory
WORKDIR /app

# Copy Python dependency files first
COPY pyproject.toml ./

# Install uv and Python dependencies (cached unless pyproject.toml changes)
RUN pip install uv && \
    uv venv .venv --python 3.12 && \
    . .venv/bin/activate && \
    uv pip install \
    "fastapi>=0.115.12" \
    "itsdangerous>=2.2.0" \
    "jinja2>=3.1.6" \
    "mcp>=1.6.0" \
    "pydantic>=2.11.3" \
    "httpx>=0.27.0" \
    "python-dotenv>=1.1.0" \
    "python-multipart>=0.0.20" \
    "uvicorn[standard]>=0.34.2" \
    "faiss-cpu>=1.7.4" \
    "sentence-transformers>=2.2.2" \
    "websockets>=15.0.1" \
    "scikit-learn>=1.3.0" \
    "torch>=1.6.0" \
    "huggingface-hub[cli,hf_xet]>=0.31.1" \
    "hf_xet>=0.1.0" && \
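    uv pip install -r pyproject.toml

# Copy the rest of the application code LAST
COPY . /app/

# Install the project in editable mode now that the source is present
RUN . .venv/bin/activate && uv pip install -e .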

Key Changes

  1. Line 31: Copy both package.json AND package-lock.json
  2. Line 32: Changed npm install → npm ci (faster, deterministic)
  3. Line 41: Copy pyproject.toml before pip install
  4. Line 59: Copy full app code LAST

Impact

  • First build: Similar time (~10-15 min)
  • Code-only changes: ~1-2 minutes (90% time savings)
  • Dependency changes: ~5-8 minutes (50% time savings)

Step 3: Optimize Registry Dockerfile (Option B - Multi-stage Build)

File: /docker/Dockerfile.registry
Replace entire file

New Code (FULL FILE REPLACEMENT)

Replace ALL 79 lines with:

# =============================================================================
# Stage 1: Frontend Builder
# =============================================================================
FROM node:20-slim AS frontend-builder

WORKDIR /build

# Copy package files and install dependencies
COPY frontend/package.json frontend/package-lock.json ./
RUN npm ci --legacy-peer-deps

# Copy frontend source and build
COPY frontend/ ./
RUN npm run build

# =============================================================================
# Stage 2: Python Builder
# =============================================================================
FROM python:3.12-slim AS python-builder

WORKDIR /app

# Install build dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    git \
    && rm -rf /var/lib/apt/lists/*

# Copy and install Python dependencies
COPY pyproject.toml ./
RUN pip install uv && \
    uv venv .venv --python 3.12 && \
    . .venv/bin/activate && \
    uv pip install \
    "fastapi>=0.115.12" \
    "itsdangerous>=2.2.0" \
    "jinja2>=3.1.6" \
    "mcp>=1.6.0" \
    "pydantic>=2.11.3" \
    "httpx>=0.27.0" \
    "python-dotenv>=1.1.0" \
    "python-multipart>=0.0.20" \
    "uvicorn[standard]>=0.34.2" \
    "faiss-cpu>=1.7.4" \
    "sentence-transformers>=2.2.2" \
    "websockets>=15.0.1" \
    "scikit-learn>=1.3.0" \
    "torch>=1.6.0" \
    "huggingface-hub[cli,hf_xet]>=0.31.1" \
    "hf_xet>=0.1.0"

# Copy app code and install as editable
COPY . /app/
RUN . .venv/bin/activate && uv pip install -e .

# =============================================================================
# Stage 3: Final Runtime
# =============================================================================
FROM python:3.12-slim

ENV PYTHONUNBUFFERED=1 \
    DEBIAN_FRONTEND=noninteractive

# Install only runtime dependencies (no build tools)
RUN apt-get update && apt-get install -y --no-install-recommends \
    nginx \
    nginx-extras \
    lua-cjson \
    curl \
    procps \
    openssl \
    ca-certificates \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Copy Python virtual environment from builder
COPY --from=python-builder /app/.venv /app/.venv

# Copy frontend build from frontend-builder
COPY --from=frontend-builder /build/build /app/frontend/build

# Copy application code
COPY . /app/

# Copy nginx configurations
COPY docker/nginx_rev_proxy_http_only.conf /app/docker/nginx_rev_proxy_http_only.conf
COPY docker/nginx_rev_proxy_http_and_https.conf /app/docker/nginx_rev_proxy_http_and_https.conf

# Create logs directory
RUN mkdir -p /app/logs

# Expose ports
EXPOSE 80 443 7860

# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=40s --retries=3 \
    CMD curl -f http://localhost:7860/health || exit 1

# Entrypoint
COPY docker/registry-entrypoint.sh /app/registry-entrypoint.sh
RUN chmod +x /app/registry-entrypoint.sh

ENTRYPOINT ["/app/registry-entrypoint.sh"]

Why Multi-stage Is Better

Benefits:

  • Image size: Reduce from ~2GB to ~800MB-1.2GB
  • Security: No build tools in production image
  • Layer caching: Each stage caches independently
  • Parallel builds: With BuildKit, Docker can build independent stages concurrently

Stage 1 (lines 1-14): Builds frontend in isolated Node.js environment
Stage 2 (lines 16-54): Builds Python dependencies in isolated environment
Stage 3 (lines 56-105): Copies artifacts from both stages into minimal runtime image


Step 4: Optimize Metrics Service Dockerfile

File: /metrics-service/Dockerfile
Lines to modify: 10-12

Current Code (BEFORE)

Lines 10-12:

# Install dependencies
COPY metrics-service/pyproject.toml .
RUN pip install uv && uv pip install --system -e .

Problem: Line 12 uses -e . (editable install), but an editable install links to the source tree, and the app code has not been copied into the image at this point.

New Code (AFTER) - Quick Fix

Replace lines 10-12 with:

# Install dependencies - copy pyproject.toml first
COPY metrics-service/pyproject.toml ./
RUN pip install uv && uv pip install --system .

Change: Line 12 now uses . instead of -e . (no longer editable install).

Impact

  • Minimal change, fixes logical issue
  • Already well-structured for layer caching

Step 5: Optimize Metrics Service Dockerfile (Multi-stage)

File: /metrics-service/Dockerfile
Replace entire file (OPTIONAL but recommended)

New Code (FULL FILE REPLACEMENT)

Replace ALL 28 lines with:

# =============================================================================
# Stage 1: Builder
# =============================================================================
FROM python:3.12-slim AS builder

WORKDIR /app

# Install system dependencies needed for building
RUN apt-get update && apt-get install -y \
    curl \
    build-essential \
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies
COPY metrics-service/pyproject.toml ./
RUN pip install uv && uv pip install --system .

# =============================================================================
# Stage 2: Runtime
# =============================================================================
FROM python:3.12-slim

WORKDIR /app

# Install only runtime dependencies
RUN apt-get update && apt-get install -y \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Copy installed packages from builder
COPY --from=builder /usr/local/lib/python3.12/site-packages /usr/local/lib/python3.12/site-packages
COPY --from=builder /usr/local/bin /usr/local/bin

# Copy application code
COPY metrics-service/app/ app/
COPY metrics-service/create_api_key.py ./

# Create data directory
RUN mkdir -p /var/lib/sqlite

# Expose port
EXPOSE 8890

# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
  CMD curl -f http://localhost:8890/health || exit 1

CMD ["python", "-m", "app.main"]

Impact

  • Quick fix: Minimal improvement (already well-structured)
  • Multi-stage: Reduce image size by ~30-40%

Implementation Priority

High Priority - Quick Wins

  1. [ ] Fix /Dockerfile (root) - Replace lines 23-34

    • Effort: 5 minutes
    • Impact: High - used for main application builds
  2. [ ] Fix /docker/Dockerfile.registry - Option A (lines 23-63)

    • Effort: 10 minutes
    • Impact: Very High - most complex, largest time savings
  3. [ ] Change npm install to npm ci - Line 35 in registry Dockerfile

    • Effort: 1 minute (included in Option A)
    • Impact: Medium - 20-30% faster npm installs

Medium Priority

  1. [ ] Implement multi-stage build for /docker/Dockerfile.registry - Option B

    • Effort: 15 minutes
    • Impact: High - significant image size reduction
  2. [ ] Fix /metrics-service/Dockerfile - Lines 10-12

    • Effort: 2 minutes
    • Impact: Low - already well-structured

Low Priority

  1. [ ] Implement multi-stage build for /metrics-service/Dockerfile
    • Effort: 10 minutes
    • Impact: Medium - image size reduction only

Testing Guide

Validation Steps

After implementing changes:

# 1. Build and time the first build
time docker build -f Dockerfile -t test-root-build .

# 2. Make a small code change (e.g., add comment to a .py file)
echo "# test comment" >> registry/main.py

# 3. Rebuild and time it (should be MUCH faster)
time docker build -f Dockerfile -t test-root-build .

# 4. Check image size
docker images test-root-build
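
To compare the first build and the rebuild as numbers rather than reading `time` output by eye, a small wrapper can help. This is a minimal sketch; `elapsed_seconds` is a hypothetical helper name, and it measures whole seconds only.

```shell
#!/bin/sh
# elapsed_seconds COMMAND [ARGS...]
# Runs the command, discards its output, and prints the wall-clock
# time it took in whole seconds.
elapsed_seconds() {
    start=$(date +%s)
    "$@" > /dev/null 2>&1
    end=$(date +%s)
    echo $((end - start))
}

# Example usage (requires Docker, run from the repository root):
#   first=$(elapsed_seconds docker build -f Dockerfile -t test-root-build .)
#   echo "# test comment" >> registry/main.py
#   second=$(elapsed_seconds docker build -f Dockerfile -t test-root-build .)
#   echo "first build: ${first}s, rebuild: ${second}s"
```

Because the build output is discarded, run a plain `docker build` first if you need to see errors.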

Expected Results

Root Dockerfile:

  • First build: ~8-10 minutes
  • Second build (code only): ~30-45 seconds ✅
  • Image size: Similar (no multi-stage yet)

Registry Dockerfile (Option A):

  • First build: ~12-15 minutes
  • Second build (code only): ~60-90 seconds ✅
  • Image size: Similar (~2GB)

Registry Dockerfile (Option B - Multi-stage):

  • First build: ~12-15 minutes
  • Second build (code only): ~60-90 seconds ✅
  • Image size: ~800MB-1.2GB ✅ (40-60% reduction)
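
The 40-60% figure follows directly from the sizes quoted above. As a quick check, assuming 1GB is treated as 1000MB and using a hypothetical helper named `percent_reduction`:

```shell
#!/bin/sh
# percent_reduction BEFORE AFTER
# Prints the integer percentage by which AFTER is smaller than BEFORE.
percent_reduction() {
    echo $(( ($1 - $2) * 100 / $1 ))
}

percent_reduction 2000 800    # ~2GB -> ~800MB, prints 60
percent_reduction 2000 1200   # ~2GB -> ~1.2GB, prints 40
```

You can feed it the sizes reported by `docker images` (converted to MB) to check your own before and after numbers.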

Test Each Dockerfile

# Test root Dockerfile
time docker build -f Dockerfile -t mcp-gateway:optimized .

# Test registry Dockerfile
time docker build -f docker/Dockerfile.registry -t mcp-registry:optimized .

# Test metrics service Dockerfile
time docker build -f metrics-service/Dockerfile -t mcp-metrics:optimized .

# Check all image sizes
docker images | grep mcp-

Integration Testing

# 1. Build all containers
docker-compose build

# 2. Start services
docker-compose up -d

# 3. Verify services are healthy
docker-compose ps

# 4. Check logs for errors
docker-compose logs | grep -i error

# 5. Test API endpoints
curl http://localhost/health
curl http://localhost:8890/health
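
The registry health check above uses a 40 second start period, so the endpoints may not answer immediately after `docker-compose up -d`. A retry wrapper avoids false failures in step 5. This is a sketch; `retry` is a hypothetical helper, and the attempt count and delay in the usage lines are arbitrary choices.

```shell
#!/bin/sh
# retry ATTEMPTS DELAY COMMAND [ARGS...]
# Runs COMMAND until it succeeds, at most ATTEMPTS times,
# sleeping DELAY seconds between tries. Returns 1 if every try fails.
retry() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    while ! "$@"; do
        if [ "$n" -ge "$attempts" ]; then
            return 1
        fi
        n=$((n + 1))
        sleep "$delay"
    done
    return 0
}

# Example usage (requires the services to be running):
#   retry 10 5 curl -fsS http://localhost/health
#   retry 10 5 curl -fsS http://localhost:8890/health
```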

Troubleshooting

Issue 1: Build fails with "pyproject.toml not found"

Symptoms:

COPY pyproject.toml ./
ERROR: failed to compute cache key: "/pyproject.toml" not found

Cause: File path incorrect or .dockerignore is blocking it

Solution:

# Check if pyproject.toml exists
ls -la pyproject.toml

# Check .dockerignore doesn't exclude it
grep pyproject .dockerignore

# Verify build context includes it
docker build -f Dockerfile . --no-cache

Issue 2: Frontend build fails with "package-lock.json not found"

Symptoms:

COPY frontend/package.json frontend/package-lock.json ./
ERROR: failed to compute cache key: "/frontend/package-lock.json" not found

Cause: No package-lock.json in frontend directory (using yarn or pnpm instead of npm)

Solution:

# Check which lock file exists
ls -la frontend/

# If package-lock.json doesn't exist, generate it:
cd frontend
npm install
cd ..

# OR modify the Dockerfile to copy only package.json
# and keep npm install, since npm ci requires a lock file:
COPY frontend/package.json ./
RUN npm install --legacy-peer-deps
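
Before choosing between the two fixes, it helps to know which package manager the frontend actually uses. The sketch below checks for the standard lock file names; `detect_lockfile` is a hypothetical helper, not something shipped with npm or this project.

```shell
#!/bin/sh
# detect_lockfile DIR
# Prints which package manager's lock file is present in DIR:
# "npm", "yarn", "pnpm", or "none".
detect_lockfile() {
    dir=$1
    if [ -f "$dir/package-lock.json" ]; then
        echo npm
    elif [ -f "$dir/yarn.lock" ]; then
        echo yarn
    elif [ -f "$dir/pnpm-lock.yaml" ]; then
        echo pnpm
    else
        echo none
    fi
}

# Example usage from the repository root:
#   detect_lockfile frontend
```

If it prints `npm`, the `npm ci` step should work as written; if it prints `none`, generate the lock file or fall back to `npm install` as described above.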

Issue 3: Multi-stage build larger than expected

Symptoms:
Image size after multi-stage build is not significantly smaller

Cause: Copying unnecessary files or directories in final stage

Solution:

# Check what's being copied in final stage
docker history mcp-registry:optimized

# Use dive tool to inspect layers
docker run --rm -it \
  -v /var/run/docker.sock:/var/run/docker.sock \
  wagoodman/dive mcp-registry:optimized

# Only copy necessary files in the final stage, for example
# registry/ instead of the whole build context (COPY . /app/).
# Dockerfile comments must be on their own line.
COPY registry/ /app/registry/

Issue 4: "npm ci" fails but "npm install" works

Symptoms:

npm ERR! cipm can only install packages when your package.json and package-lock.json
npm ERR! are in sync.

Cause: package-lock.json out of sync with package.json

Solution:

# Regenerate package-lock.json
cd frontend
rm package-lock.json
npm install
git add package-lock.json
git commit -m "Update package-lock.json"
cd ..

# Then rebuild from the repository root
docker build -f docker/Dockerfile.registry -t mcp-registry:optimized .

Issue 5: Layer cache not working as expected

Symptoms:
Docker rebuilds dependencies on every build even though nothing has changed

Cause: COPY command includes files that change frequently

Solution:

# Use --progress=plain to see which steps report CACHED
# (--cache-from names an image to reuse as a cache source)
docker build --progress=plain --cache-from mcp-registry:latest -f docker/Dockerfile.registry .

# Check what files are being copied
# Ensure .dockerignore excludes volatile files:
cat .dockerignore

# Should include:
__pycache__/
*.pyc
.git/
logs/
# node_modules/ matters if the build context contains one from the host
node_modules/
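
Rather than reading .dockerignore by eye, you can check it for the entries listed above. This is a sketch using a hypothetical helper, `missing_ignore_patterns`; it does a literal whole-line comparison, so an equivalent pattern written differently (for example `**/__pycache__`) is still reported as missing.

```shell
#!/bin/sh
# missing_ignore_patterns FILE PATTERN...
# Prints each PATTERN that does not appear as an exact line in FILE.
# Prints nothing when every pattern is present.
missing_ignore_patterns() {
    file=$1
    shift
    for pattern in "$@"; do
        grep -Fxq -- "$pattern" "$file" 2>/dev/null || echo "$pattern"
    done
}

# Example usage from the repository root:
#   missing_ignore_patterns .dockerignore '__pycache__/' '*.pyc' '.git/' 'logs/' 'node_modules/'
```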

Issue 6: Virtual environment not activating in multi-stage

Symptoms:

ModuleNotFoundError: No module named 'fastapi'

Cause: Virtual environment path incorrect or not activated

Solution:

Option A: Copy site-packages directly (recommended):

# In builder stage
RUN pip install uv && uv pip install --system <packages>

# In final stage
COPY --from=builder /usr/local/lib/python3.12/site-packages /usr/local/lib/python3.12/site-packages

Option B: Put the venv on PATH in the final stage:

# In final stage
COPY --from=builder /app/.venv /app/.venv
ENV PATH="/app/.venv/bin:$PATH"

Issue 7: Image builds but fails at runtime

Symptoms:
Container starts but immediately exits with error

Cause: Missing runtime dependencies in final stage

Solution:

# Check container logs
docker logs <container-id>

# Common missing dependencies:
# - curl (for health checks)
# - nginx (for registry)
# - shared libraries

# Add to final stage:
RUN apt-get update && apt-get install -y --no-install-recommends \
    curl \
    nginx \
    && rm -rf /var/lib/apt/lists/*

Validation Checklist

Use this to verify optimizations are working:

Build Performance

  • First build completes successfully
  • Second build (code change only) takes <2 minutes
  • Dependency change rebuilds take ~50% less time
  • No cache warnings in build output

Image Quality

  • Multi-stage images are 30-50% smaller
  • Containers start without errors
  • Health checks pass
  • All services respond to requests

Functionality

  • Registry API responds: curl http://localhost/v0/servers
  • Metrics service responds: curl http://localhost:8890/health
  • Frontend loads correctly
  • Authentication works

Layer Caching

# Verify cache is being used:
docker build -f Dockerfile . 2>&1 | grep "CACHED"

# Should see output like:
# => CACHED [2/8] WORKDIR /app
# => CACHED [3/8] COPY pyproject.toml ./
# => CACHED [4/8] RUN pip install uv && ...
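
If you want a single number to track across builds, the same check can count cached steps instead of listing them. This is a minimal sketch; `count_cached_steps` is a hypothetical helper that simply counts output lines containing CACHED.

```shell
#!/bin/sh
# count_cached_steps
# Reads build output on stdin and prints how many lines report a cached step.
# grep -c exits non-zero when the count is 0, so "|| true" keeps the
# function's exit status at 0 in that case (the count is still printed).
count_cached_steps() {
    grep -c 'CACHED' || true
}

# Example usage (requires Docker):
#   docker build -f Dockerfile . 2>&1 | count_cached_steps
```

After a code-only change, the count should stay close to the number of steps that precede the final COPY of the application code.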

Summary

You've now optimized:

Root Dockerfile - Proper dependency layer caching
Registry Dockerfile - npm ci + layer caching + multi-stage option
Metrics Dockerfile - Fixed editable install + multi-stage option
Build times - Reduced from 10-15min to 1-3min for code changes
Image sizes - Reduced by 30-50% with multi-stage builds

Quick Reference

| Dockerfile | Priority | Effort | Time Savings | Size Reduction |
|---|---|---|---|---|
| /Dockerfile | High | 5 min | 80-90% | 0% |
| /docker/Dockerfile.registry (Option A) | Very High | 10 min | 85-90% | 0% |
| /docker/Dockerfile.registry (Option B) | Medium | 15 min | 85-90% | 40-60% |
| /metrics-service/Dockerfile (Quick) | Medium | 2 min | 10% | 0% |
| /metrics-service/Dockerfile (Multi-stage) | Low | 10 min | 10% | 30-40% |

Next Steps

  1. Start with High Priority quick wins (Steps 1-2)
  2. Test thoroughly after each change
  3. Commit changes incrementally
  4. Implement multi-stage builds when time allows
  5. Update CI/CD pipelines to leverage cache

Good luck with your optimization! 🚀
