|
| 1 | +# Mooncake Connector |
| 2 | + |
| 3 | +This document provides a usage example and configuration guide for the **Mooncake Connector**. The connector offloads KV cache from GPU HBM to a Mooncake store on the CPU side, which reduces GPU memory pressure and helps support larger models or batch sizes. |
| 4 | + |
| 5 | +## Performance |
| 6 | + |
| 7 | +| tokens | mooncake-first | mooncake-second | default | |
| 8 | +| ------ | ------------------ | ------------------ | ------------------ | |
| 9 | +| 2k | 1.9231491860002279 | 0.8265988459810615 | 0.5419427898712457 | |
| 10 | +| 4k | 3.9460434830747544 | 1.5273493870627135 | 0.991630249004811 | |
| 11 | +| 8k | 7.577957597002387 | 2.7632693520281464 | 2.0716467570047827 | |
| 12 | +| 16k | 16.823639799049126 | 5.515289016952738 | 4.742832682048902 | |
| 13 | +| 32k | 81.98759594326839 | 14.217441103421152 | 12.310140203218907 | |
| 14 | + |
| 15 | +Performance of the Mooncake connector compared with the default: |
| 16 | +<p align="center"> |
| 17 | + <img alt="UCM" src="../../images/mooncake_performance.png" width="40%"> |
| 18 | +</p> |
| 19 | + |
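To make the table easier to compare, the figures can be expressed relative to the `default` column. The following is a minimal sketch: the numbers are copied from the table above, the source does not state their unit, and the helper name `ratios_to_default` is purely illustrative.

```python
# Values copied from the performance table: (mooncake-first, mooncake-second, default).
# The unit is not stated in the source, so only ratios are computed here.
rows = {
    "2k": (1.9231491860002279, 0.8265988459810615, 0.5419427898712457),
    "4k": (3.9460434830747544, 1.5273493870627135, 0.991630249004811),
    "8k": (7.577957597002387, 2.7632693520281464, 2.0716467570047827),
    "16k": (16.823639799049126, 5.515289016952738, 4.742832682048902),
    "32k": (81.98759594326839, 14.217441103421152, 12.310140203218907),
}


def ratios_to_default(first: float, second: float, default: float) -> tuple:
    """Return (mooncake-first / default, mooncake-second / default)."""
    return first / default, second / default


for tokens, values in rows.items():
    first_ratio, second_ratio = ratios_to_default(*values)
    print(f"{tokens:>3}: mooncake-first {first_ratio:.2f}x, "
          f"mooncake-second {second_ratio:.2f}x of default")
```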
| 20 | +## Features |
| 21 | + |
| 22 | +The Mooncake connector supports the following functionalities: |
| 23 | + |
| 24 | +- `dump`: Offload KV cache blocks from HBM to Mooncake. |
| 25 | +- `load`: Load KV cache blocks from Mooncake back to HBM. |
| 26 | +- `lookup`: Look up KV blocks stored in Mooncake by block hash. |
| 27 | +- `wait`: Ensure that all copy streams between CPU and GPU have completed. |
| 28 | + |
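The four operations listed above can be pictured with a toy in-memory store. This is only an illustrative sketch: `ToyBlockStore` and its method signatures are hypothetical and are not the connector's real API, and the "copies" here are plain synchronous dictionary writes.

```python
from typing import Dict, List


class ToyBlockStore:
    """Hypothetical stand-in that mimics dump/load/lookup/wait; not the real connector."""

    def __init__(self) -> None:
        self._blocks: Dict[str, bytes] = {}

    def dump(self, block_hash: str, data: bytes) -> None:
        # Offload: keep a copy of the block under its hash.
        self._blocks[block_hash] = bytes(data)

    def lookup(self, block_hashes: List[str]) -> List[bool]:
        # Report, per hash, whether a block is already stored.
        return [block_hash in self._blocks for block_hash in block_hashes]

    def load(self, block_hash: str) -> bytes:
        # Fetch a previously dumped block.
        return self._blocks[block_hash]

    def wait(self) -> None:
        # Copies in this toy are synchronous, so there is nothing to wait for.
        return None


store = ToyBlockStore()
store.dump("hash-a", b"kv-block-bytes")
store.wait()
hits = store.lookup(["hash-a", "hash-b"])
restored = store.load("hash-a")
print(hits, restored)
```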
| 29 | +## Configuration |
| 30 | + |
| 31 | +### Start Mooncake Services |
| 32 | + |
| 33 | +1. Follow the [Mooncake official guide](https://github.com/kvcache-ai/Mooncake/blob/v0.3.4/doc/en/build.md) to build Mooncake. |
| 34 | + |
| 35 | +> **[Warning]**: This connector currently supports only Mooncake v0.3.4. Support for newer versions is in progress. |
| 36 | + |
| 37 | +2. Start Mooncake Store Service |
| 38 | + |
| 39 | + Change the IP addresses and ports in the following commands to match your environment. |
| 40 | + |
| 41 | +```bash |
| 42 | +# Unset HTTP proxies |
| 43 | +unset http_proxy https_proxy no_proxy HTTP_PROXY HTTPS_PROXY NO_PROXY |
| 44 | +# Navigate to the metadata server directory (this example uses the HTTP metadata server) |
| 45 | +cd $MOONCAKE_ROOT_DIR/mooncake-transfer-engine/example/http-metadata-server |
| 46 | +# Start Metadata Service |
| 47 | +go run . --addr=0.0.0.0:23790 |
| 48 | +# Start Master Service (in a separate terminal, because the metadata server above keeps running in the foreground) |
| 49 | +mooncake_master --port 50001 |
| 50 | +``` |
| 51 | +- Replace `$MOONCAKE_ROOT_DIR` with your Mooncake source root path. |
| 52 | +- Make sure to unset any HTTP proxies to prevent networking issues. |
| 53 | +- Use ports appropriate for your environment. |
| 54 | + |
| 55 | + |
| 56 | + |
| 57 | +### Required Parameters |
| 58 | + |
| 59 | +To use the Mooncake connector, configure the `ucm_connector_config` dictionary in your model's launch configuration. |
| 60 | + |
| 61 | +- `local_hostname`: |
| 62 | + The IP address of the current node used to communicate with the metadata server. |
| 63 | +- `metadata_server`: |
| 64 | + The metadata server of the Mooncake transfer engine. |
| 65 | +- `master_server_address`: |
| 66 | + The IP address and the port of the master daemon process of MooncakeStore. |
| 67 | +- `protocol` *(optional)*: |
| 68 | + The transport protocol, such as `tcp` or `rdma`. If not provided, it defaults to **tcp**. |
| 69 | +- `device_name` *(optional)*: |
| 70 | + The device used for data transmission. Required when `protocol` is set to `rdma`. To use multiple NIC devices, separate them with commas and no spaces, for example `erdma_0,erdma_1`. |
| 71 | +- `global_segment_size` *(optional)*: |
| 72 | + The size of each global segment in bytes. Defaults to `DEFAULT_GLOBAL_SEGMENT_SIZE = 3355443200` (**3.125 GiB**). |
| 73 | +- `local_buffer_size` *(optional)*: |
| 74 | + The size of the local buffer in bytes. Defaults to `DEFAULT_LOCAL_BUFFER_SIZE = 1073741824` (**1.0 GiB**). |
| 75 | + |
| 76 | + |
| 77 | +### Example |
| 78 | + |
| 79 | +```python |
| 80 | +kv_connector_extra_config={ |
| 81 | + "ucm_connector_name": "UcmMooncakeStore", |
| 82 | + "ucm_connector_config":{ |
| 83 | + "local_hostname": "127.0.0.1", |
| 84 | + "metadata_server": "http://127.0.0.1:23790/metadata", |
| 85 | + "protocol": "tcp", |
| 86 | + "device_name": "", |
| 87 | + "master_server_address": "127.0.0.1:50001" |
| 88 | + } |
| 89 | + } |
| 90 | +``` |
| 91 | + |
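The parameter rules above (three required keys, `protocol` defaulting to `tcp`, `device_name` required for `rdma`, and the two default sizes) can be checked before launching. The following is a minimal sketch: `normalize_connector_config` is a hypothetical helper written for illustration and is not part of the connector, which may validate its configuration differently.

```python
# Defaults as documented in the parameter list above.
DEFAULT_GLOBAL_SEGMENT_SIZE = 3355443200  # 3.125 GiB
DEFAULT_LOCAL_BUFFER_SIZE = 1073741824  # 1.0 GiB
REQUIRED_KEYS = ("local_hostname", "metadata_server", "master_server_address")


def normalize_connector_config(config: dict) -> dict:
    """Hypothetical helper: check required keys and fill in documented defaults."""
    missing = [key for key in REQUIRED_KEYS if not config.get(key)]
    if missing:
        raise ValueError(f"missing required parameters: {missing}")

    normalized = dict(config)
    normalized.setdefault("protocol", "tcp")
    normalized.setdefault("global_segment_size", DEFAULT_GLOBAL_SEGMENT_SIZE)
    normalized.setdefault("local_buffer_size", DEFAULT_LOCAL_BUFFER_SIZE)

    device_name = normalized.get("device_name", "")
    if normalized["protocol"] == "rdma" and not device_name:
        raise ValueError("device_name is required when protocol is 'rdma'")
    if " " in device_name:
        raise ValueError("device_name must be comma-separated without spaces")
    return normalized


example = normalize_connector_config({
    "local_hostname": "127.0.0.1",
    "metadata_server": "http://127.0.0.1:23790/metadata",
    "master_server_address": "127.0.0.1:50001",
})
print(example["protocol"], example["global_segment_size"] / 2**30)
```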
| 92 | +## Launching Inference |
| 93 | + |
| 94 | +### Offline Inference |
| 95 | + |
| 96 | +To start **offline inference** with the Mooncake connector, modify the script `examples/offline_inference.py` to include the `kv_connector_extra_config` for the Mooncake connector: |
| 97 | + |
| 98 | +```python |
| 99 | +# In examples/offline_inference.py |
| 100 | +ktc = KVTransferConfig( |
| 101 | + ... |
| 102 | + kv_connector_extra_config={ |
| 103 | + "ucm_connector_name": "UcmMooncakeStore", |
| 104 | + "ucm_connector_config":{ |
| 105 | + "local_hostname": "127.0.0.1", |
| 106 | + "metadata_server": "http://127.0.0.1:23790/metadata", |
| 107 | + "protocol": "tcp", |
| 108 | + "device_name": "", |
| 109 | + "master_server_address": "127.0.0.1:50001" |
| 110 | + } |
| 111 | + } |
| 112 | +) |
| 113 | +``` |
| 114 | + |
| 115 | +Then run the script as follows: |
| 116 | + |
| 117 | +```bash |
| 118 | +cd examples/ |
| 119 | +python offline_inference.py |
| 120 | +``` |
| 121 | + |
| 122 | +### Online Inference |
| 123 | + |
| 124 | +For **online inference**, vLLM with our connector can also be deployed as a server that implements the OpenAI API protocol. |
| 125 | + |
| 126 | +First, set the Python hash seed: |
| 127 | +```bash |
| 128 | +export PYTHONHASHSEED=123456 |
| 129 | +``` |
| 130 | + |
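The source does not explain why the seed is fixed. A likely reason, stated here as an assumption, is that Python randomizes string hashing per process by default, so hash values only agree across processes when `PYTHONHASHSEED` is set to the same value. The sketch below demonstrates that general Python behavior with the standard library only. It does not touch the connector, and `string_hash_in_subprocess` is an illustrative helper.

```python
import os
import subprocess
import sys


def string_hash_in_subprocess(text: str, seed: str) -> int:
    """Return hash(text) as computed by a fresh interpreter started with the given seed."""
    env = dict(os.environ, PYTHONHASHSEED=seed)
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(hash(sys.argv[1]))", text],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


# Two separate processes with the same seed compute the same string hash.
first = string_hash_in_subprocess("Shanghai is a", "123456")
second = string_hash_in_subprocess("Shanghai is a", "123456")
print(first == second)
```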
| 131 | +Run the following command to start the vLLM server with the Qwen/Qwen2.5-14B-Instruct model: |
| 132 | + |
| 133 | +```bash |
| 134 | +vllm serve /home/models/Qwen2.5-14B-Instruct \ |
| 135 | +--max-model-len 20000 \ |
| 136 | +--tensor-parallel-size 2 \ |
| 137 | +--gpu_memory_utilization 0.87 \ |
| 138 | +--trust-remote-code \ |
| 139 | +--port 7800 \ |
| 140 | +--kv-transfer-config \ |
| 141 | +'{ |
| 142 | + "kv_connector": "UnifiedCacheConnectorV1", |
| 143 | + "kv_connector_module_path": "unifiedcache.integration.vllm.uc_connector", |
| 144 | + "kv_role": "kv_both", |
| 145 | + "kv_connector_extra_config": { |
| 146 | + "ucm_connector_name": "UcmMooncakeStore", |
| 147 | + "ucm_connector_config":{ |
| 148 | + "local_hostname": "127.0.0.1", |
| 149 | + "metadata_server": "http://127.0.0.1:23790/metadata", |
| 150 | + "protocol": "tcp", |
| 151 | + "device_name": "", |
| 152 | + "master_server_address": "127.0.0.1:50001" |
| 153 | + } |
| 154 | + } |
| 156 | +}' |
| 157 | +``` |
| 158 | + |
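Hand-written nested JSON on a command line is easy to get wrong, for example through unbalanced braces. One option is to build the `--kv-transfer-config` value from a Python dictionary and let `json.dumps` produce it. This is a minimal sketch that reuses the values from the command above; generating the argument this way is a suggestion and not part of the original workflow.

```python
import json
import shlex

# Same values as in the `vllm serve` command above.
kv_transfer_config = {
    "kv_connector": "UnifiedCacheConnectorV1",
    "kv_connector_module_path": "unifiedcache.integration.vllm.uc_connector",
    "kv_role": "kv_both",
    "kv_connector_extra_config": {
        "ucm_connector_name": "UcmMooncakeStore",
        "ucm_connector_config": {
            "local_hostname": "127.0.0.1",
            "metadata_server": "http://127.0.0.1:23790/metadata",
            "protocol": "tcp",
            "device_name": "",
            "master_server_address": "127.0.0.1:50001",
        },
    },
}

# json.dumps guarantees well-formed JSON; shlex.quote makes it safe to paste into a shell.
kv_transfer_config_json = json.dumps(kv_transfer_config)
print("--kv-transfer-config", shlex.quote(kv_transfer_config_json))
```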
| 159 | +If you see log output like the following: |
| 160 | + |
| 161 | +```bash |
| 162 | +INFO: Started server process [321290] |
| 163 | +INFO: Waiting for application startup. |
| 164 | +INFO: Application startup complete. |
| 165 | +``` |
| 166 | + |
| 167 | +Congratulations, you have successfully started the vLLM server with the Mooncake connector! |
| 168 | + |
| 169 | +After the vLLM server has started, you can interact with the API as follows: |
| 170 | + |
| 171 | +```bash |
| 172 | +curl http://localhost:7800/v1/completions \ |
| 173 | + -H "Content-Type: application/json" \ |
| 174 | + -d '{ |
| 175 | + "model": "/home/models/Qwen2.5-14B-Instruct", |
| 176 | + "prompt": "Shanghai is a", |
| 177 | + "max_tokens": 7, |
| 178 | + "temperature": 0 |
| 179 | + }' |
| 180 | +``` |
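
The same request can also be sent from Python using only the standard library. This is a minimal sketch that mirrors the `curl` call above: the helper names `build_completion_request` and `send` are illustrative, and the URL and model path are the ones used in this guide and should be adjusted to your deployment.

```python
import json
import urllib.request


def build_completion_request(prompt: str,
                             model: str = "/home/models/Qwen2.5-14B-Instruct",
                             url: str = "http://localhost:7800/v1/completions",
                             max_tokens: int = 7,
                             temperature: float = 0) -> urllib.request.Request:
    """Build a POST request equivalent to the curl example."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON response (requires a running server)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_completion_request("Shanghai is a")
print(request.full_url, request.data.decode("utf-8"))
```

With the server running, `send(request)` returns the decoded JSON response.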