Docker in Docker with llama-swap guide
Important
🎉 Contributed by @ramblingcoder
This guide is an explanation of a deployed setup that...
- Allows llama-swap to spin up and down LLM inference containers
- Uses shared network storage for the models and config
- Restarts llama-swap instances when the shared config file changes
This setup is a variation of DIND called "Docker out of Docker" applied to a llama-swap environment. "Docker out of Docker" is similar to "Docker in Docker" (aka DIND) but uses the host's Docker daemon instead of running a nested one.
More information regarding DIND can be found here: DIND
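"Docker out of Docker" boils down to bind-mounting the host's Docker socket into a container: anything inside that container then talks to the host daemon rather than a nested one. A minimal sketch (using the official docker image, which ships the docker CLI):

# Any container that can see /var/run/docker.sock controls the HOST's Docker daemon
docker run --rm \
  -v /var/run/docker.sock:/var/run/docker.sock \
  docker:latest \
  docker ps   # lists the containers running on the host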
- Docker
- nvidia-smi (optional, but used in this example)
I want to be able to manage multiple servers with the following capabilities...
- Restart llama-swap when the config is changed
- Have a single copy of the models
- Automatically pull the latest Docker images when starting a model, reducing manual maintenance effort
This is the overall architecture this page will recreate.
There are two network shares that need to be set up.
- models - A collection of GGUF files
- config - Contains the llama-swap config file and the watch-config.sh script.
These will be configured as network share volumes in docker compose.
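If you want to verify the CIFS shares before wiring them into Compose, the same driver options can be exercised directly with docker volume create. This is a hedged sketch; server_hostname, USERNAME, and PASSWORD are the placeholders used later in the compose file.

# Create a throwaway CIFS-backed volume with the same options used in docker-compose.yml
docker volume create --driver local \
  --opt type=cifs \
  --opt device=//server_hostname/models \
  --opt o=username=USERNAME,password=PASSWORD,uid=1000,gid=1000,noserverino \
  models_test

# Mount it in a temporary container and list the GGUF files
docker run --rm -v models_test:/models alpine ls -lh /models

# Clean up the test volume
docker volume rm models_test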
In the models folder, there is a collection of loose GGUF files. This is an example of what my models folder contents look like.
models/
├── unsloth_Qwen2.5-Coder-32B-Instruct-Q4_K_M_20250517.gguf
├── unsloth_Qwen2.5-Coder-7B-Instruct-Q4_K_M_20250517.gguf
└── nomic-ai_nomic-embed-text-v1.5.Q8_0_20250517.gguf
The config directory should contain two files:
- config.yaml - Llama-Swap configuration
- watch-config.sh - Script to monitor config changes
config/
├── config.yaml
└── watch-config.sh
config.yaml example
healthCheckTimeout: 1500 # wait 25 minutes before unhealthy
startPort: 65001 # Ports for containers start at 65001

groups:
  "coding":
    swap: false
    exclusive: true
    members:
      - "qwen-coder-32b-v2.5"
      - "qwen-coder-7b-v2.5"
      - "nomic-embed-text-v1.5"

models:
  "nomic-embed-text-v1.5":
    cmd: |
      docker run --pull always --name nomicembed
      --init --rm -p ${PORT}:8080 -v llamaswap_models:/models ghcr.io/ggml-org/llama.cpp:server-cuda
      --model /models/nomic-ai_nomic-embed-text-v1.5.Q8_0_20250517.gguf
      --ctx-size 8192
      --gpu-layers 999
      --no-context-shift
      --embeddings
      --batch-size 8192
      --rope-scaling yarn
      --rope-freq-scale 0.75
      --parallel 1
      --flash-attn
    cmdStop: docker stop nomicembed

  "qwen-coder-32b-v2.5":
    cmd: |
      docker run --gpus=all --pull always --name qwencoder-32bv25
      --init --rm -p ${PORT}:8080 -v llamaswap_models:/models ghcr.io/ggml-org/llama.cpp:server-cuda
      --model /models/unsloth_Qwen2.5-Coder-32B-Instruct-Q4_K_M_20250517.gguf
      --ctx-size 32768
      --gpu-layers 999
      --no-context-shift
      --jinja
      --parallel 1
      --flash-attn
      --cache-type-k q8_0
      --cache-type-v q8_0
    cmdStop: docker stop qwencoder-32bv25

  "qwen-coder-7b-v2.5":
    cmd: |
      docker run --gpus=all --pull always --name qwencoder-7bv25
      --init --rm -p ${PORT}:8080 -v llamaswap_models:/models ghcr.io/ggml-org/llama.cpp:server-cuda
      --model /models/unsloth_Qwen2.5-Coder-7B-Instruct-Q4_K_M_20250517.gguf
      --ctx-size 32768
      --gpu-layers 999
      --no-context-shift
      --jinja
      --parallel 1
      --flash-attn
      --cache-type-k q8_0
      --cache-type-v q8_0
    cmdStop: docker stop qwencoder-7bv25
The volume llamaswap_models is created when the docker compose services are created; the name is prefixed with llamaswap because the docker compose file lives in a folder called "llamaswap" (the default compose project name). Similarly, a llamaswap_config volume is created, but it isn't directly referenced by the docker commands above.
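Since the docker run commands in config.yaml reference llamaswap_models by name, it's worth confirming that compose created the volume with exactly that name (a different folder name or a COMPOSE_PROJECT_NAME override changes the prefix, and docker run would then quietly create a new, empty local volume). Once the services described below are running:

# List the volumes created by the compose project
docker volume ls --filter name=llamaswap

# Confirm the models volume is the CIFS-backed one (driver options should show type=cifs)
docker volume inspect llamaswap_models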
watch-config.sh
#!/bin/sh

# File to watch
CONFIG_FILE="/config/config.yaml"

# Initialize the last modified time
LAST_MODIFIED=$(stat -c %Y "$CONFIG_FILE")

while true; do
    # Get the current modified time
    CURRENT_MODIFIED=$(stat -c %Y "$CONFIG_FILE")

    # Check if the file has been modified
    if [ "$CURRENT_MODIFIED" -ne "$LAST_MODIFIED" ]; then
        echo "File has been modified."
        echo "Restarting llamaswap container..."
        docker restart llamaswap
        echo "Restarted llamaswap container..."

        # Update the last modified time
        LAST_MODIFIED=$CURRENT_MODIFIED
    fi

    # Wait for a while before checking again
    sleep 5
done
This script is able to restart the llamaswap container by using the host machine's docker socket. Accessing the docker socket of the host machine allows the script to affect any running container on the host. This is a potential security concern but one I've accepted for my use case.
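Once the services described below are running, the watcher can be exercised by bumping the config file's modification time; touch is enough, because the script only compares mtimes. A minimal sketch (the local mount path of the config share is hypothetical):

# Bump the config's mtime on the share (or actually edit config.yaml)
touch /mnt/config/config.yaml   # hypothetical path where the config share is mounted

# Within ~5 seconds the watcher should restart llama-swap; confirm the fresh start time
docker ps --filter name=llamaswap --format '{{.Names}} {{.Status}}'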
With Docker installed, and optionally nvidia-smi available for CUDA support, we can continue with the Docker server setup.
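It's also worth confirming up front that Docker can see the GPUs, since the llama.cpp containers launched later rely on --gpus=all. A hedged check (the CUDA image tag is only an example):

# Host-level checks
docker --version
nvidia-smi

# Confirm containers can reach the GPUs through the NVIDIA container runtime
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi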
The docker compose file is stored in a folder called llamaswap.
llamaswap/
└── docker-compose.yml
The docker compose file would look like this.
docker-compose.yml
services:
  llama-swap:
    container_name: llamaswap
    image: ghcr.io/mostlygeek/llama-swap:cuda
    network_mode: host
    volumes:
      - models:/models
      - config:/config
      - /var/run/docker.sock:/var/run/docker.sock
      - /usr/bin/docker:/usr/bin/docker
    pull_policy: always
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
    entrypoint: /app/llama-swap -config /config/config.yaml --listen :65000

  watch-config:
    image: docker:latest
    volumes:
      - config:/config
      - /var/run/docker.sock:/var/run/docker.sock
    entrypoint: ["sh", "-c", "/config/watch-config.sh"]

volumes:
  models:
    driver: local
    driver_opts:
      type: cifs
      device: "//server_hostname/models"
      o: "username=USERNAME,password=PASSWORD,uid=1000,gid=1000,noserverino"
  config:
    driver: local
    driver_opts:
      type: cifs
      device: "//service_hostname/config"
      o: "username=USERNAME,password=PASSWORD,uid=1000,gid=1000,noserverino"
- llama-swap Service:
  - Container Name: llamaswap
  - Image: Uses the Docker image ghcr.io/mostlygeek/llama-swap:cuda from the GitHub Container Registry.
  - Network Mode: host, which means the service shares the host's network stack.
    - Host networking is used because the containers launched by llama-swap are created on the host, and without it llama-swap would not have a simple route to the LLM containers.
  - Volumes: Mounts several volumes:
    - models volume to /models inside the container.
    - config volume to /config inside the container.
    - /var/run/docker.sock to the same path inside the container, allowing the container to interact with the host's Docker daemon.
    - /usr/bin/docker to the same path inside the container. Prevents us from having to install docker within llama-swap's image.
  - Pull Policy: Always pull the latest image from the repository.
    - Optional. Done to avoid having to manually update the service and let it auto-update itself.
  - Deploy Resources: Allows llama-swap to access the GPUs available on the server.
  - Entrypoint: The container runs /app/llama-swap -config /config/config.yaml --listen :65000 on startup.
    - Port 65000 is used to avoid conflicts with other containers running on the server and with the model containers llama-swap launches itself (which start at 65001).
- watch-config Service:
  - Image: Uses the latest Docker image (docker:latest).
    - Used simply because it is the official, publicly available image with the docker CLI preinstalled.
  - Volumes: Mounts the config volume to /config inside the container and /var/run/docker.sock to the same path inside the container.
    - This gives access to the Docker daemon running on the host, allowing the script to restart the llama-swap container when the config changes.
  - Entrypoint: The container runs the shell script at /config/watch-config.sh on startup.
- Volumes:
  - models: A local volume with CIFS (Common Internet File System) options configured to mount a shared folder from //server_hostname/models.
  - config: Another local volume with CIFS options configured to mount a shared folder from //service_hostname/config.
Launch a terminal from the llamaswap folder and execute the following once the docker-compose.yml file is ready.
docker compose up -d
This will launch the watch-config container and the llamaswap container.
When the llamaswap container is launched it creates an endpoint at http://{server}:65000, where {server} is the docker server's IP or hostname.
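A quick way to confirm the proxy is up from another machine is to list the models through llama-swap's OpenAI-compatible API (route assumed here); this should return the configured model IDs without loading anything:

# Replace {server} with the docker host's IP or hostname
curl http://{server}:65000/v1/models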
The configured models will list the two Qwen models and the nomic embed model.
When you click on qwen-coder-32b-v2.5, the llama-swap service will execute the docker command associated with that model.
The following will happen during the setup of the docker process.
- Docker will download the ghcr.io/ggml-org/llama.cpp:server-cuda image
- Docker will launch the container on a port greater than 65000
- The llamacpp server container is created on the host and not within the llamaswap container
  - The llamacpp container is created on the host by making use of the docker socket
- The model will be read from the /models path, which points to the network share
- Once healthy, the llama-swap service will route traffic from 65000 to the container's port on the host machine
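Because the llama.cpp container is created on the host rather than inside llamaswap, the whole sequence can be followed with ordinary host-side Docker commands; a hedged sketch using the container name from config.yaml:

# The model container appears on the host, not nested inside llamaswap
docker ps --filter name=qwencoder-32bv25

# Follow the llama.cpp server logs while the model loads from the share
docker logs -f qwencoder-32bv25

# llama-swap's own logs show the swap and health-check activity
docker logs -f llamaswap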
If the llamaswap container is restarted due to a config change, or by simply restarting the container, then all containers launched by the llamaswap container are cleaned up.
I've since started incorporating LLM servers other than llamacpp into my config. This turned out to be a great use case for llama-swap with Docker in Docker.
The following configuration example is intended for the type of environment described above.
The Qwen2.5-Coder-32B-Instruct-AWQ folder was created by running git clone https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct-AWQ with Git LFS.
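For reference, a hedged sketch of producing that folder inside the models share; Git LFS must be installed so the weight files are actually downloaded rather than left as pointer files:

# Run from within the models share
git lfs install
git clone https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct-AWQ

# Sanity check: weight files should be multiple GB, not tiny LFS pointers
ls -lh Qwen2.5-Coder-32B-Instruct-AWQ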
The folder structure of the models directory for this config file is the following.
models/
├── unsloth_Qwen2.5-Coder-32B-Instruct-Q4_K_M_20250517.gguf
└── Qwen2.5-Coder-32B-Instruct-AWQ #(GIT FOLDER)
config.yaml example
healthCheckTimeout: 1500 # wait 25 minutes before unhealthy
startPort: 65001 # Ports for containers start at 65001

models:
  "qwen-coder-32b-v2.5-llamacpp":
    cmd: |
      docker run --gpus=all --pull always --name qwencoder-32bv25
      --init --rm -p ${PORT}:8080 -v llamaswap_models:/models ghcr.io/ggml-org/llama.cpp:server-cuda
      --model /models/unsloth_Qwen2.5-Coder-32B-Instruct-Q4_K_M_20250517.gguf
      --ctx-size 32768
      --gpu-layers 999
      --no-context-shift
      --jinja
      --parallel 1
      --flash-attn
      --cache-type-k q8_0
      --cache-type-v q8_0
    cmdStop: docker stop qwencoder-32bv25

  "qwen-coder-32b-v2.5-aphrodite":
    cmd: |
      docker run --runtime nvidia --gpus=all --pull always --name qwencoder-32bv25-aphrodite
      --init --rm -p ${PORT}:2242 --ipc=host -v llamaswap_models:/models alpindale/aphrodite-openai:latest
      --model /models/Qwen2.5-Coder-32B-Instruct-AWQ
      --tensor-parallel-size 8
    cmdStop: docker stop qwencoder-32bv25-aphrodite

  "qwen-coder-32b-v2.5-vllm":
    cmd: |
      docker run --runtime nvidia --gpus=all --pull always --name qwencoder-32bv25-vllm
      --init --rm -p ${PORT}:8000 --ipc=host -v llamaswap_models:/models -e VLLM_USE_FLASHINFER_SAMPLER=0 vllm/vllm-openai:latest
      --model /models/Qwen2.5-Coder-32B-Instruct-AWQ
      --tensor-parallel-size 8 --max-model-len 32768 --quantization awq_marlin --gpu-memory-utilization .75
    cmdStop: docker stop qwencoder-32bv25-vllm
I can then run experiments with the three different LLM servers to see which one I prefer by using the server's 65000 endpoint.
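For example, the same prompt can be sent to each backend just by changing the model name, since llama-swap handles the container swap behind its OpenAI-compatible endpoint (chat completions route assumed; replace {server} with the docker host):

# Send the same request to each backend; llama-swap swaps containers as needed
for MODEL in qwen-coder-32b-v2.5-llamacpp qwen-coder-32b-v2.5-aphrodite qwen-coder-32b-v2.5-vllm; do
  curl -s http://{server}:65000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"$MODEL\", \"messages\": [{\"role\": \"user\", \"content\": \"Write a hello world in Go.\"}]}"
done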