
Docker in Docker with llama‐swap guide


Important

🎉 Contributed by @ramblingcoder

This guide explains a deployed setup that...

  • Allows llama-swap to spin up and tear down LLM inference containers
  • Uses network shares for the models and config folders
  • Restarts every llama-swap instance that watches the config file whenever it changes

This setup is a variation of DIND called "Docker out of Docker" applied to a llama-swap environment. "Docker out of Docker" is similar to "Docker in Docker" (aka DIND), but instead of running a nested Docker daemon it reuses the host's Docker daemon through the mounted Docker socket.

More information regarding DIND can be found here: DIND
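
For illustration, a minimal "Docker out of Docker" invocation looks like this: the container has no Docker daemon of its own, and the docker CLI inside it talks to the host's daemon through the mounted socket, so any containers it lists or starts are the host's.

# List the host's containers from inside a container, via the host's Docker socket
docker run --rm -v /var/run/docker.sock:/var/run/docker.sock docker:latest docker ps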

Minimum Requirements

  • A Docker host with Docker and Docker Compose installed
  • Optionally, an NVIDIA GPU with the NVIDIA Container Toolkit for CUDA support
  • A NAS or file server exposing SMB/CIFS shares for the models and config folders

Use Case

I want to be able to manage multiple servers with the following capabilities...

  • Restart llama-swap when config is changed
  • Have a singular copy of models
  • Automatically pull the latest Docker images when starting a model, to reduce manual maintenance effort

Architecture

This is the overall architecture this page will recreate.

(Architecture diagram: homelab drawio)

NAS

Network Shares

There are two network shares that need to be set up.

  1. models - A collection of GGUF files
  2. config - Contains the llama-swap config file and the watch-config.sh script.

These will be configured as network share volumes in docker compose.
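
Before wiring the shares into docker compose, they can be sanity-checked from the Docker host with a quick manual CIFS mount (this assumes cifs-utils is installed; hostname, path, and credentials are placeholders matching the compose example later in this guide).

# Temporarily mount the models share and check its contents
sudo mkdir -p /mnt/models
sudo mount -t cifs //server_hostname/models /mnt/models -o username=USERNAME,password=PASSWORD
ls /mnt/models
sudo umount /mnt/models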

Models

In the models folder, there is a collection of loose GGUF files. This is an example of what my models folder contents look like.

models/
├── unsloth_Qwen2.5-Coder-32B-Instruct-Q4_K_M_20250517.gguf
├── unsloth_Qwen2.5-Coder-7B-Instruct-Q4_K_M_20250517.gguf
└── nomic-ai_nomic-embed-text-v1.5.Q8_0_20250517.gguf
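
If the share still needs to be populated, one option is the Hugging Face CLI. The repository and file names below are illustrative assumptions, not the exact files above; check the model page on Hugging Face for the real names.

# Install the CLI and download a single GGUF into the mounted models share
pip install -U "huggingface_hub[cli]"
huggingface-cli download unsloth/Qwen2.5-Coder-32B-Instruct-GGUF \
  Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf --local-dir /mnt/models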

Config

The config directory should contain two files:

  1. config.yaml - Llama-Swap configuration
  2. watch-config.sh - Script to monitor config changes

config/
├── config.yaml
└── watch-config.sh

config.yaml example

healthCheckTimeout: 1500 # wait 25 minutes before unhealthy
startPort: 65001 # Ports for containers start at 65001

groups:
  "coding":
    swap: false
    exclusive: true
    members:
      - "qwen-coder-32b-v2.5"
      - "qwen-coder-7b-v2.5"
      - "nomic-embed-text-v1.5"
      
models:
  "nomic-embed-text-v1.5":
    cmd: |
      docker run --pull always --name nomicembed
      --init --rm -p ${PORT}:8080 -v llamaswap_models:/models ghcr.io/ggml-org/llama.cpp:server-cuda
      --model /models/nomic-ai_nomic-embed-text-v1.5.Q8_0_20250517.gguf
      --ctx-size 8192 
      --gpu-layers 999
      --no-context-shift
      --embeddings 
      --batch-size 8192 
      --rope-scaling yarn 
      --rope-freq-scale 0.75
      --parallel 1
      --flash-attn
    cmdStop: docker stop nomicembed

  "qwen-coder-32b-v2.5":
    cmd: |
      docker run --gpus=all --pull always --name qwencoder-32bv25
      --init --rm -p ${PORT}:8080 -v llamaswap_models:/models ghcr.io/ggml-org/llama.cpp:server-cuda
      --model /models/unsloth_Qwen2.5-Coder-32B-Instruct-Q4_K_M_20250517.gguf
      --ctx-size 32768 
      --gpu-layers 999
      --no-context-shift
      --jinja 
      --parallel 1
      --flash-attn
      --cache-type-k q8_0
      --cache-type-v q8_0
    cmdStop: docker stop qwencoder-32bv25

  "qwen-coder-7b-v2.5": 
    cmd: |
      docker run --gpus=all --pull always --name qwencoder-7bv25
      --init --rm -p ${PORT}:8080 -v llamaswap_models:/models ghcr.io/ggml-org/llama.cpp:server-cuda
      --model /models/unsloth_Qwen2.5-Coder-7B-Instruct-Q4_K_M_20250517.gguf
      --ctx-size 32768 
      --gpu-layers 999
      --no-context-shift
      --jinja 
      --parallel 1
      --flash-attn
      --cache-type-k q8_0
      --cache-type-v q8_0
    cmdStop: docker stop qwencoder-7bv25

The volume llamaswap_models is created when the docker compose services are created: Compose prefixes volume names with the project name, which defaults to the folder containing the compose file, in this case "llamaswap". Similarly, a volume called llamaswap_config is created, but it isn't directly referenced by the docker commands above.
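
To confirm the exact volume names Compose created (and therefore the names to reference in the docker run commands above), you can list and inspect them:

docker volume ls | grep llamaswap
docker volume inspect llamaswap_models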

watch-config.sh

#!/bin/sh

# File to watch
CONFIG_FILE="/config/config.yaml"

# Initialize the last modified time
LAST_MODIFIED=$(stat -c %Y "$CONFIG_FILE")

while true; do
  # Get the current modified time
  CURRENT_MODIFIED=$(stat -c %Y "$CONFIG_FILE")

  # Check if the file has been modified
  if [ "$CURRENT_MODIFIED" -ne "$LAST_MODIFIED" ]; then
    echo "File has been modified."

    echo "Restarting llamaswap container..."
    docker restart llamaswap
    echo "Restarted llamaswap container..."
    
    # Update the last modified time
    LAST_MODIFIED=$CURRENT_MODIFIED
  fi

  # Wait for a while before checking again
  sleep 5
done

This script restarts the llamaswap container through the host machine's Docker socket. Access to the host's Docker socket lets the script affect any running container on the host, which is a potential security concern, but one I've accepted for my use case.
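
Once the Compose stack described below is running, the watcher can be verified end to end by following its logs and bumping the config file's timestamp (note that this triggers a real restart of the llamaswap container, and assumes the config share is writable):

docker compose logs -f watch-config
# in another terminal: update the file's modified time to trigger a restart
docker compose exec watch-config touch /config/config.yaml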

Docker Server

With Docker installed, and optionally the NVIDIA Container Toolkit for CUDA support (you can verify the driver with nvidia-smi), we can continue with the Docker server setup.
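
A quick way to confirm that containers can actually reach the GPU is the standard CUDA base-image check (the image tag here is just an example):

docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi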

Docker Compose

The docker compose file would be stored in a folder called llamaswap.

llamaswap/
└── docker-compose.yml

The docker compose file would look like this.

docker-compose.yml

services:
  llama-swap:
    container_name: llamaswap
    image: ghcr.io/mostlygeek/llama-swap:cuda
    network_mode: host
    volumes:
      - models:/models
      - config:/config
      - /var/run/docker.sock:/var/run/docker.sock
      - /usr/bin/docker:/usr/bin/docker
    pull_policy: always
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
    entrypoint: /app/llama-swap -config /config/config.yaml --listen :65000

  watch-config:
    image: docker:latest
    volumes:
      - config:/config
      - /var/run/docker.sock:/var/run/docker.sock
    entrypoint: ["sh", "-c", "/config/watch-config.sh"]

volumes:
  models:
    driver: local
    driver_opts:
      type: cifs
      device: "//server_hostname/models"
      o: "username=USERNAME,password=PASSWORD,uid=1000,gid=1000,noserverino"
  config:
    driver: local
    driver_opts:
      type: cifs
      device: "//service_hostname/config"
      o: "username=USERNAME,password=PASSWORD,uid=1000,gid=1000,noserverino"

Docker Compose Explanation

  1. llama-swap Service:

    • Container Name: llamaswap
    • Image: Uses the Docker image ghcr.io/mostlygeek/llama-swap:cuda from the GitHub Container Registry.
    • Network Mode: host mode, which means the service will share the host's network stack.
      • Host networking is used because the LLM containers launched by llama-swap run directly on the host; without it, llama-swap would not have a simple network route to those containers.
    • Volumes: Mounts several volumes:
      • models volume to /models inside the container.
      • config volume to /config inside the container.
      • /var/run/docker.sock to the same path inside the container, allowing the container to interact with the Docker daemon.
      • /usr/bin/docker to the same path inside the container. Prevents us from having to install docker within llama-swap's image.
    • Pull Policy: Always pull the latest image from the repository.
      • Optional. This avoids manually updating the image; the latest version is pulled whenever the service is recreated.
    • Deploy Resources: Allows llama-swap to access the GPUs available on the server.
    • Entrypoint: The container will run the command /app/llama-swap -config /config/config.yaml --listen :65000 upon startup.
      • Port 65000 is used to avoid conflicts with other containers running on the server; the model containers llama-swap launches start at 65001 (startPort), so they don't conflict with llama-swap itself.
  2. watch-config Service:

    • Image: Uses the latest Docker image.
      • Chosen because it is the official, publicly available Docker image with the docker CLI preinstalled.
    • Volumes: Mounts the config volume to /config inside the container and /var/run/docker.sock to the same path inside the container.
      • This gives the script access to the Docker daemon running on the host, so it can restart the llama-swap container when the config changes.
    • Entrypoint: The container will run a shell script located at /config/watch-config.sh upon startup.
  3. Volumes:

    • models: A local volume with CIFS (Common Internet File System) options configured to mount a shared folder from //server_hostname/models.
    • config: Another local volume with CIFS options configured to mount a shared folder from //server_hostname/config.

Launching

Launch a terminal from the llamaswap folder and execute the following once the docker-compose.yml file is ready.

docker compose up -d

This will launch the watch-config container and llamaswap container.

When the llamaswap container is launched it creates an endpoint at http://{server}:65000 where {server} is the docker server's IP or hostname.

The list of configured models will show the two Qwen models and the nomic embed model.
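
They can also be listed from the command line. Since llama-swap exposes an OpenAI-compatible API, a request like the following should return the configured model IDs (replace {server} with your Docker host):

curl http://{server}:65000/v1/models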

When you click on qwen-coder-32b-v2.5, the llama-swap service will execute the docker command associated with that model (the same swap can also be triggered through the API, as shown after the list below).

The following will happen during the setup of the docker process.

  1. Docker will download the ghcr.io/ggml-org/llama.cpp:server-cuda image
  2. Docker will launch the container on a port greater than 65000
    • The llamacpp server container is created on the host and not within the llamaswap container
    • The llamacpp container is created on the host by making use of the docker socket
  3. The model will be read from the /models path which is pointing to a network share
  4. Once healthy, the llama-swap service will route traffic from 65000 to the container's port on the host machine
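
The same swap can be triggered without the UI by sending an OpenAI-style request to the llama-swap endpoint; llama-swap picks the container to start based on the model field (replace {server} with your Docker host):

curl http://{server}:65000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen-coder-32b-v2.5", "messages": [{"role": "user", "content": "Hello"}]}'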

If the llamaswap container is restarted due to a config change, or by simply restarting the container, then all containers launched by the llamaswap container are cleaned up.

Additional Uses for this Configuration

Comparing LLM servers

I've since started incorporating LLM servers other than llamacpp into my config. This turned out to be a great use case for llama-swap with Docker in Docker.

This configuration example is intended for the type of environment set up above.

The Qwen2.5-Coder-32B-Instruct-AWQ folder was created by running git clone https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct-AWQ with Git LFS installed.
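
Assuming git and git-lfs are available on a machine with write access to the models share, that looks like:

# Run from within the models share
git lfs install
git clone https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct-AWQ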

For this config file, the models directory has the following structure.

models/
├── unsloth_Qwen2.5-Coder-32B-Instruct-Q4_K_M_20250517.gguf
└── Qwen2.5-Coder-32B-Instruct-AWQ  #(GIT FOLDER)

config.yaml example

healthCheckTimeout: 1500 # wait 25 minutes before unhealthy
startPort: 65001 # Ports for containers start at 65001

models:
  "qwen-coder-32b-v2.5-llamacpp":
    cmd: |
      docker run --gpus=all --pull always --name qwencoder-32bv25
      --init --rm -p ${PORT}:8080 -v llamaswap_models:/models ghcr.io/ggml-org/llama.cpp:server-cuda
      --model /models/unsloth_Qwen2.5-Coder-32B-Instruct-Q4_K_M_20250517.gguf
      --ctx-size 32768 
      --gpu-layers 999
      --no-context-shift
      --jinja 
      --parallel 1
      --flash-attn
      --cache-type-k q8_0
      --cache-type-v q8_0
    cmdStop: docker stop qwencoder-32bv25

  "qwen-coder-32b-v2.5-aphrodite":
    cmd: |
      docker run --runtime nvidia --gpus=all --pull always --name qwencoder-32bv25-aphrodite
      --init --rm -p ${PORT}:2242 --ipc=host -v llamaswap_models:/models alpindale/aphrodite-openai:latest
      --model /models/Qwen2.5-Coder-32B-Instruct-AWQ
      --tensor-parallel-size 8
    cmdStop: docker stop qwencoder-32bv25-aphrodite

  "qwen-coder-32b-v2.5-vllm":
    cmd: |
      docker run --runtime nvidia --gpus=all --pull always --name qwencoder-32bv25-vllm
      --init --rm -p ${PORT}:8000 --ipc=host -v llamaswap_models:/models -e VLLM_USE_FLASHINFER_SAMPLER=0 vllm/vllm-openai:latest
      --model /models/Qwen2.5-Coder-32B-Instruct-AWQ
      --tensor-parallel-size 8 --max-model-len 32768 --quantization awq_marlin --gpu-memory-utilization .75
    cmdStop: docker stop qwencoder-32bv25-vllm

I can then run experiments against the three different LLM servers to see which one I prefer, all through the server's port 65000 endpoint.
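
For example, the same prompt can be sent to each backend by changing only the model name in the request (again, {server} is the Docker host):

for m in qwen-coder-32b-v2.5-llamacpp qwen-coder-32b-v2.5-aphrodite qwen-coder-32b-v2.5-vllm; do
  curl -s http://{server}:65000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"$m\", \"messages\": [{\"role\": \"user\", \"content\": \"Write hello world in Python.\"}]}"
done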
