| 1 | +# 1p1d |
| 2 | + |
| 3 | +## Overview |
| 4 | +This example shows how to run unified-cache-management (UCM) with disaggregated prefill on a single node, using the NFS connector and a 1 prefiller + 1 decoder (1p1d) setup. |
| 5 | + |
| 6 | +## Prerequisites |
| 7 | +- UCM: installed as described in the Installation documentation. |
| 8 | +- Hardware: At least 2 GPUs |
| 9 | + |
| 10 | +## Start disaggregated service |
| 11 | +The commands below use Qwen2.5-7B-Instruct, stored at /home/models/Qwen2.5-7B-Instruct, as the example model. |
| 12 | + |
| 13 | +### Run prefill server |
| 14 | +Prefiller Launch Command: |
| 15 | +```bash |
| 16 | +export PYTHONHASHSEED=123456 |
| 17 | +CUDA_VISIBLE_DEVICES=0 vllm serve /home/models/Qwen2.5-7B-Instruct \ |
| 18 | +--max-model-len 20000 \ |
| 19 | +--tensor-parallel-size 1 \ |
| 20 | +--gpu_memory_utilization 0.87 \ |
| 21 | +--trust-remote-code \ |
| 22 | +--enforce-eager \ |
| 23 | +--no-enable-prefix-caching \ |
| 24 | +--port 7800 \ |
| 25 | +--block-size 128 \ |
| 26 | +--kv-transfer-config \ |
| 27 | +'{ |
| 28 | + "kv_connector": "UnifiedCacheConnectorV1", |
| 29 | + "kv_connector_module_path": "unifiedcache.integration.vllm.uc_connector", |
| 30 | + "kv_role": "kv_producer", |
| 31 | + "kv_connector_extra_config": { |
| 32 | + "ucm_connector_name": "UcmNfsStore", |
| 33 | + "ucm_connector_config": { |
| 34 | + "storage_backends": "/mnt/test1", |
| 35 | + "kv_block_size": 33554432 |
| 36 | + } |
| 37 | + } |
| 38 | +}' |
| 39 | +``` |
| 40 | + |
| 41 | +### Run decode server |
| 42 | +Decoder Launch Command: |
| 43 | +```bash |
| 44 | +export PYTHONHASHSEED=123456 |
| 45 | +CUDA_VISIBLE_DEVICES=1 vllm serve /home/models/Qwen2.5-7B-Instruct \ |
| 46 | +--max-model-len 20000 \ |
| 47 | +--tensor-parallel-size 1 \ |
| 48 | +--gpu_memory_utilization 0.87 \ |
| 49 | +--trust-remote-code \ |
| 50 | +--enforce-eager \ |
| 51 | +--no-enable-prefix-caching \ |
| 52 | +--port 7801 \ |
| 53 | +--block-size 128 \ |
| 54 | +--kv-transfer-config \ |
| 55 | +'{ |
| 56 | + "kv_connector": "UnifiedCacheConnectorV1", |
| 57 | + "kv_connector_module_path": "unifiedcache.integration.vllm.uc_connector", |
| 58 | + "kv_role": "kv_consumer", |
| 59 | + "kv_connector_extra_config": { |
| 60 | + "ucm_connector_name": "UcmNfsStore", |
| 61 | + "ucm_connector_config": { |
| 62 | + "storage_backends": "/mnt/test1", |
| 63 | + "kv_block_size": 33554432 |
| 64 | + } |
| 65 | + } |
| 66 | +}' |
| 67 | +``` |
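The prefiller and decoder commands above differ only in `CUDA_VISIBLE_DEVICES`, `--port`, and `kv_role`. As a minimal sketch (not part of UCM or vLLM), the shared parts could be factored into two shell functions; `kv_transfer_config` and `launch_instance` are hypothetical helper names introduced here, and every flag and JSON value is copied from the commands above.

```shell
# Hypothetical helper: print the --kv-transfer-config JSON for one role.
# All field values are copied from the launch commands above.
kv_transfer_config() {
  local role="$1"   # kv_producer (prefiller) or kv_consumer (decoder)
  cat <<EOF
{
  "kv_connector": "UnifiedCacheConnectorV1",
  "kv_connector_module_path": "unifiedcache.integration.vllm.uc_connector",
  "kv_role": "${role}",
  "kv_connector_extra_config": {
    "ucm_connector_name": "UcmNfsStore",
    "ucm_connector_config": {
      "storage_backends": "/mnt/test1",
      "kv_block_size": 33554432
    }
  }
}
EOF
}

# Hypothetical helper: start one vLLM instance with the shared flags.
# Arguments: <gpu index> <port> <kv_role>
launch_instance() {
  local gpu="$1" port="$2" role="$3"
  export PYTHONHASHSEED=123456
  CUDA_VISIBLE_DEVICES="$gpu" vllm serve /home/models/Qwen2.5-7B-Instruct \
    --max-model-len 20000 \
    --tensor-parallel-size 1 \
    --gpu_memory_utilization 0.87 \
    --trust-remote-code \
    --enforce-eager \
    --no-enable-prefix-caching \
    --port "$port" \
    --block-size 128 \
    --kv-transfer-config "$(kv_transfer_config "$role")"
}
```

With these definitions, `launch_instance 0 7800 kv_producer` would correspond to the prefiller command and `launch_instance 1 7801 kv_consumer` to the decoder command, each run in its own terminal. Generating both JSON blobs from one template keeps `storage_backends` and `kv_block_size` identical on both sides.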
| 68 | +### Run proxy server |
| 69 | +Make sure prefill nodes and decode nodes can connect to each other. |
| 70 | +```bash |
| 71 | +cd /vllm-workspace/unified-cache-management/test/ |
| 72 | +python3 toy_proxy_server.py --host localhost --port 7802 --prefiller-host <prefill-node-ip> --prefiller-port 7800 --decoder-host <decode-node-ip> --decoder-port 7801 |
| 73 | +``` |
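Requests sent through the proxy will fail if the prefiller or decoder has not finished loading the model. The sketch below is one way to wait for both; `wait_until` is a hypothetical helper introduced here, and the usage example assumes the `/health` route of vLLM's OpenAI-compatible server is reachable on the ports used above.

```shell
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# Usage: wait_until <attempts> <delay-seconds> <command...>
wait_until() {
  local attempts="$1" delay="$2"
  shift 2
  local i=1
  while [ "$i" -le "$attempts" ]; do
    # Run the probe command; discard its output, keep only the exit status.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}
```

For example, `wait_until 60 5 curl -sf http://localhost:7800/health && wait_until 60 5 curl -sf http://localhost:7801/health` would poll each server every 5 seconds for up to 60 attempts and return a non-zero status if either never becomes ready.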
| 74 | + |
| 75 | +## Testing and Benchmarking |
| 76 | +### Basic Test |
| 77 | +After all servers are running, you can test the setup with a simple curl command: |
| 78 | +```bash |
| 79 | +curl http://localhost:7802/v1/completions \ |
| 80 | + -H "Content-Type: application/json" \ |
| 81 | + -d '{ |
| 82 | + "model": "/home/models/Qwen2.5-7B-Instruct", |
| 83 | + "prompt": "What date is today?", |
| 84 | + "max_tokens": 20, |
| 85 | + "temperature": 0 |
| 86 | + }' |
| 87 | +``` |
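The curl command above prints the full JSON response. As a small sketch for checking only the generated text, the response can be piped through a filter; `completion_text` is a hypothetical helper introduced here, and it assumes the response follows the OpenAI completions schema, where the generated text is at `choices[0].text`.

```shell
# Hypothetical helper: read a /v1/completions response on stdin and
# print the text of the first choice.
completion_text() {
  python3 -c 'import json, sys; print(json.load(sys.stdin)["choices"][0]["text"])'
}
```

To use it, add `-s` to the curl command above and append `| completion_text`. If the proxy returns an error body without a `choices` field, the filter exits with a Python traceback, which makes a failed request easy to spot.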
| 88 | +### Benchmark Test |
| 89 | +Use the benchmark scripts provided by vLLM. |
| 90 | +```bash |
| 91 | +cd /vllm-workspace/vllm/benchmarks |
| 92 | +python3 benchmark_serving.py \ |
| 93 | + --backend vllm \ |
| 94 | + --dataset-name random \ |
| 95 | + --random-input-len 4096 \ |
| 96 | + --random-output-len 100 \ |
| 97 | + --num-prompts 10 \ |
| 98 | + --ignore-eos \ |
| 99 | + --model /home/models/Qwen2.5-7B-Instruct \ |
| 100 | + --tokenizer /home/models/Qwen2.5-7B-Instruct \ |
| 101 | + --host localhost \ |
| 102 | + --port 7802 \ |
| 103 | + --endpoint /v1/completions \ |
| 104 | + --request-rate 1 |
| 105 | +``` |