Commit ccbee78

hufumans authored
[Feature] Add mooncake store (#117)
* [WIP] Stash work in progress
* [Feature] Mooncake connector supports both config dict and config file
* [Doc] Add docs for the UCM Mooncake Connector
* [Feature] Add Mooncake to the UCM factory
* [Doc][Fix] Modify the configuration description to match actual usage
* [Feature][Fix] Load the Mooncake config from a dict; fall back to the env config file when params are missing
* [Doc] Update the performance numbers and revise the description
* [Test] Example config file for the Mooncake test `test_mooncake_env.py`
* [Test][Del] Remove unnecessary tests that do not match the current functionality
* [Feat!][Del] Adjust the Mooncake configuration method: remove the configuration-file method and retain only parameter passing
* [Doc][Fix] Modify the performance figure of Mooncake Store
* [Feat] Add __del__() to shut down all the Mooncake components

---------

Co-authored-by: z00452769 <zhangyichen@huawei.com>
Co-authored-by: propanone1006 <1035097916@qq.com>
Co-authored-by: propanone1006 <1035067916@qq.com>
1 parent bb85854 commit ccbee78

File tree

6 files changed (+672, -0 lines)


docs/source/getting-started/example/index.md

Lines changed: 1 addition & 0 deletions

@@ -4,6 +4,7 @@
 :maxdepth: 2
 nfs_conn.md
 dram_conn.md
+mooncake_conn.md
 disaggregated_prefill/index.md
 :::

docs/source/getting-started/example/mooncake_conn.md

Lines changed: 180 additions & 0 deletions

@@ -0,0 +1,180 @@

# Mooncake Connector

This document provides a usage example and configuration guide for the **Mooncake Connector**. The connector offloads KV cache blocks from GPU HBM to the Mooncake store, helping reduce memory pressure and support larger models or batch sizes.

## Performance

| tokens | mooncake-first | mooncake-second | default |
| ------ | ------------------ | ------------------ | ------------------ |
| 2k     | 1.9231491860002279 | 0.8265988459810615 | 0.5419427898712457 |
| 4k     | 3.9460434830747544 | 1.5273493870627135 | 0.991630249004811  |
| 8k     | 7.577957597002387  | 2.7632693520281464 | 2.0716467570047827 |
| 16k    | 16.823639799049126 | 5.515289016952738  | 4.742832682048902  |
| 32k    | 81.98759594326839  | 14.217441103421152 | 12.310140203218907 |

Comparison of the Mooncake connector and the default path:

<p align="center">
  <img alt="UCM" src="../../images/mooncake_performance.png" width="40%">
</p>

## Features

The Mooncake connector supports the following functionalities:

- `dump`: Offload KV cache blocks from HBM to Mooncake.
- `load`: Load KV cache blocks from Mooncake back to HBM.
- `lookup`: Look up KV blocks stored in Mooncake by block hash.
- `wait`: Ensure that all copy streams between CPU and GPU have completed.

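A minimal usage sketch of these operations, adapted from `test/test_mooncake.py` in this commit. The block IDs below are illustrative placeholders, and the sketch assumes the Mooncake services described in the next section are already running:

```python
import torch

from unifiedcache.ucm_connector.ucm_mooncake import UcmMooncakeStore

# Placeholder endpoints; replace with the values from your deployment.
config = {
    "local_hostname": "127.0.0.1",
    "metadata_server": "http://127.0.0.1:23790/metadata",
    "protocol": "tcp",
    "device_name": "",
    "master_server_address": "127.0.0.1:50001",
}

store = UcmMooncakeStore(config)

# dump: offload two small blocks to Mooncake, keyed by block ID.
blocks = [torch.randint(0, 1000, (1, 100), dtype=torch.int) for _ in range(2)]
block_ids = ["block-0", "block-1"]
task = store.dump(block_ids=block_ids, offset=[0, 0], src_tensor=blocks)
assert store.wait(task) == 0  # wait: block until the copy streams finish

# lookup: check which block IDs are present in the store.
assert all(store.lookup(block_ids))

# load: read the blocks back into pre-allocated destination tensors.
dst = [torch.empty_like(b) for b in blocks]
task = store.load(block_ids=block_ids, offset=[0, 0], dst_tensor=dst)
assert store.wait(task) == 0
```
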
## Configuration

### Start Mooncake Services

1. Follow the [Mooncake official guide](https://github.com/kvcache-ai/Mooncake/blob/v0.3.4/doc/en/build.md) to build Mooncake.

> **[Warning]**: Currently, this connector only supports Mooncake v0.3.4; support for newer versions is being adapted.

2. Start the Mooncake Store services.

Change the IP addresses and ports in the following commands to match your environment.

```bash
# Unset HTTP proxies
unset http_proxy https_proxy no_proxy HTTP_PROXY HTTPS_PROXY NO_PROXY
# Navigate to the metadata server directory (the HTTP metadata server, for example)
cd $MOONCAKE_ROOT_DIR/mooncake-transfer-engine/example/http-metadata-server
# Start the metadata service
go run . --addr=0.0.0.0:23790
# Start the master service
mooncake_master --port 50001
```

- Replace `$MOONCAKE_ROOT_DIR` with your Mooncake source root path.
- Make sure to unset any HTTP proxies to prevent networking issues.
- Use the appropriate ports for your environment.

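Optionally, you can confirm that both services are reachable before launching vLLM. The helper below is a hypothetical sanity check, not part of the connector; adjust the addresses and ports to your deployment:

```python
import socket


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Ports from the commands above: metadata service and master service.
assert port_open("127.0.0.1", 23790), "Mooncake metadata server is not reachable"
assert port_open("127.0.0.1", 50001), "mooncake_master is not reachable"
```
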
### Required Parameters

To use the Mooncake connector, configure the `ucm_connector_config` dictionary in your model's launch configuration.

- `local_hostname`:
  The IP address of the current node, used to communicate with the metadata server.
- `metadata_server`:
  The address of the metadata server of the Mooncake transfer engine.
- `master_server_address`:
  The IP address and port of the MooncakeStore master daemon process.
- `protocol` *(optional)*:
  The transfer protocol; defaults to **tcp** if not provided.
- `device_name` *(optional)*:
  The device(s) to use for data transmission; required when `protocol` is set to `rdma`. Multiple NIC devices are separated by commas with no spaces between them, e.g. `erdma_0,erdma_1`.
- `global_segment_size` *(optional)*:
  The size of each global segment in bytes. `DEFAULT_GLOBAL_SEGMENT_SIZE = 3355443200` (**3.125 GiB**).
- `local_buffer_size` *(optional)*:
  The size of the local buffer in bytes. `DEFAULT_LOCAL_BUFFER_SIZE = 1073741824` (**1.0 GiB**).

### Example

```python
kv_connector_extra_config={
    "ucm_connector_name": "UcmMooncakeStore",
    "ucm_connector_config": {
        "local_hostname": "127.0.0.1",
        "metadata_server": "http://127.0.0.1:23790/metadata",
        "protocol": "tcp",
        "device_name": "",
        "master_server_address": "127.0.0.1:50001"
    }
}
```

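The optional parameters described above go in the same `ucm_connector_config` dictionary. The following variant is only a sketch: the RDMA device names and the sizes are illustrative values taken from the defaults listed above, not recommendations:

```python
kv_connector_extra_config={
    "ucm_connector_name": "UcmMooncakeStore",
    "ucm_connector_config": {
        "local_hostname": "127.0.0.1",
        "metadata_server": "http://127.0.0.1:23790/metadata",
        "protocol": "rdma",
        "device_name": "erdma_0,erdma_1",   # comma-separated NICs, no spaces
        "master_server_address": "127.0.0.1:50001",
        "global_segment_size": 3355443200,  # 3.125 GiB, the documented default
        "local_buffer_size": 1073741824,    # 1.0 GiB, the documented default
    }
}
```
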
## Launching Inference

### Offline Inference

To start **offline inference** with the Mooncake connector, modify the script `examples/offline_inference.py` to include the `kv_connector_extra_config` for the Mooncake connector:

```python
# In examples/offline_inference.py
ktc = KVTransferConfig(
    ...
    kv_connector_extra_config={
        "ucm_connector_name": "UcmMooncakeStore",
        "ucm_connector_config": {
            "local_hostname": "127.0.0.1",
            "metadata_server": "http://127.0.0.1:23790/metadata",
            "protocol": "tcp",
            "device_name": "",
            "master_server_address": "127.0.0.1:50001"
        }
    }
)
```

Then run the script:

```bash
cd examples/
python offline_inference.py
```

### Online Inference

For **online inference**, vLLM with our connector can also be deployed as a server that implements the OpenAI API protocol.

First, set the Python hash seed:

```bash
export PYTHONHASHSEED=123456
```

Run the following command to start the vLLM server with the Qwen/Qwen2.5-14B-Instruct model:

```bash
vllm serve /home/models/Qwen2.5-14B-Instruct \
  --max-model-len 20000 \
  --tensor-parallel-size 2 \
  --gpu_memory_utilization 0.87 \
  --trust-remote-code \
  --port 7800 \
  --kv-transfer-config \
  '{
    "kv_connector": "UnifiedCacheConnectorV1",
    "kv_connector_module_path": "unifiedcache.integration.vllm.uc_connector",
    "kv_role": "kv_both",
    "kv_connector_extra_config": {
      "ucm_connector_name": "UcmMooncakeStore",
      "ucm_connector_config": {
        "local_hostname": "127.0.0.1",
        "metadata_server": "http://127.0.0.1:23790/metadata",
        "protocol": "tcp",
        "device_name": "",
        "master_server_address": "127.0.0.1:50001"
      }
    }
  }'
```

If you see logs like the following:

```bash
INFO: Started server process [321290]
INFO: Waiting for application startup.
INFO: Application startup complete.
```

Congratulations, you have successfully started the vLLM server with the Mooncake connector!

After the server has started successfully, you can interact with the API as follows:

```bash
curl http://localhost:7800/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/home/models/Qwen2.5-14B-Instruct",
    "prompt": "Shanghai is a",
    "max_tokens": 7,
    "temperature": 0
  }'
```
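
Equivalently, you can issue the same request with the OpenAI Python client, assuming the `openai` package (v1 or later) is installed; the API key is a placeholder, since the server above was started without one:

```python
from openai import OpenAI

# Point the client at the local vLLM server started above.
client = OpenAI(base_url="http://localhost:7800/v1", api_key="EMPTY")

completion = client.completions.create(
    model="/home/models/Qwen2.5-14B-Instruct",
    prompt="Shanghai is a",
    max_tokens=7,
    temperature=0,
)
print(completion.choices[0].text)
```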

test/test_mooncake.py

Lines changed: 155 additions & 0 deletions
@@ -0,0 +1,155 @@

import hashlib
import uuid

import torch

from unifiedcache.logger import init_logger
from unifiedcache.ucm_connector.base import Task
from unifiedcache.ucm_connector.ucm_mooncake import UcmMooncakeStore

logger = init_logger(__name__)

mooncake_dict_config = {
    "local_hostname": "127.0.0.1",
    "metadata_server": "http://127.0.0.1:23790/metadata",
    "protocol": "tcp",
    "device_name": "",
    "master_server_address": "127.0.0.1:50001",
}


def tensor_hash(tensor: torch.Tensor) -> str:
    """Calculate the hash value of the tensor."""
    tensor_bytes = tensor.clone().detach().cpu().numpy().tobytes()
    hash_object = hashlib.blake2b(tensor_bytes)
    hash_hex = hash_object.hexdigest()
    return str(int(hash_hex[:16], 16))


def test_lookup_not_found():
    """Test that lookup returns False for non-existent block IDs."""
    store = UcmMooncakeStore(mooncake_dict_config)
    block_ids = [uuid.uuid4().hex for _ in range(10)]
    masks = store.lookup(block_ids)
    assert all(mask is False for mask in masks)


def test_lookup_found():
    """Test that lookup returns True for existing block IDs after dumping data."""
    src_block_data = [
        torch.randint(0, 1000, (1, 100), dtype=torch.int) for _ in range(5)
    ]
    block_ids = [tensor_hash(data) for data in src_block_data]
    offset = [0] * len(block_ids)

    store = UcmMooncakeStore(mooncake_dict_config)
    task: Task = store.dump(
        block_ids=block_ids, offset=offset, src_tensor=src_block_data
    )
    ret = store.wait(task)
    assert ret == 0
    masks = store.lookup(block_ids)
    assert all(mask is True for mask in masks)


def test_dump_once():
    """Test dumping data once and verifying it exists in the store."""
    src_block_data = [
        torch.randint(0, 1000, (1, 100), dtype=torch.int) for _ in range(5)
    ]
    block_ids = [tensor_hash(data) for data in src_block_data]
    offset = [0] * len(block_ids)

    store = UcmMooncakeStore(mooncake_dict_config)
    task: Task = store.dump(
        block_ids=block_ids, offset=offset, src_tensor=src_block_data
    )
    ret = store.wait(task)
    assert ret == 0
    masks = store.lookup(block_ids)
    assert all(mask is True for mask in masks)


def test_dump_repeated():
    """Test that repeated dumping of the same data doesn't cause errors."""
    src_block_data = [
        torch.randint(0, 1000, (1, 100), dtype=torch.int) for _ in range(5)
    ]
    block_ids = [tensor_hash(data) for data in src_block_data]
    offset = [0] * len(block_ids)

    store = UcmMooncakeStore(mooncake_dict_config)
    task: Task = store.dump(
        block_ids=block_ids, offset=offset, src_tensor=src_block_data
    )
    ret = store.wait(task)
    assert ret == 0
    masks = store.lookup(block_ids)
    assert all(mask is True for mask in masks)

    task: Task = store.dump(
        block_ids=block_ids, offset=offset, src_tensor=src_block_data
    )
    ret = store.wait(task)
    assert ret == 0


def test_load_existing_data():
    """Test loading data that was previously dumped into the store."""
    src_block_data = [
        torch.randint(0, 1000, (1, 100), dtype=torch.int) for _ in range(5)
    ]
    dst_block_data = [
        torch.empty(data.shape, dtype=data.dtype) for data in src_block_data
    ]
    block_ids = [tensor_hash(data) for data in src_block_data]
    offset = [0] * len(block_ids)

    store = UcmMooncakeStore(mooncake_dict_config)
    task: Task = store.dump(
        block_ids=block_ids, offset=offset, src_tensor=src_block_data
    )
    ret = store.wait(task)
    assert ret == 0

    masks = store.lookup(block_ids)
    assert all(mask is True for mask in masks)

    task: Task = store.load(
        block_ids=block_ids, offset=offset, dst_tensor=dst_block_data
    )
    ret = store.wait(task)
    assert ret == 0
    assert all(
        [
            torch.equal(src_block_data[i], dst_block_data[i]) is True
            for i in range(len(src_block_data))
        ]
    )


def test_load_non_existent_data():
    """Test that loading data that doesn't exist in the store leaves the destination unchanged."""
    src_block_data = [
        torch.randint(0, 1000, (1, 100), dtype=torch.int) for _ in range(5)
    ]
    dst_block_data = [
        torch.empty(data.shape, dtype=data.dtype) for data in src_block_data
    ]
    block_ids = [tensor_hash(data) for data in src_block_data]
    offset = [0] * len(block_ids)
    store = UcmMooncakeStore(mooncake_dict_config)
    masks = store.lookup(block_ids)
    assert all(mask is False for mask in masks)

    task: Task = store.load(
        block_ids=block_ids, offset=offset, dst_tensor=dst_block_data
    )
    ret = store.wait(task)
    assert ret != 0
    assert all(
        [
            torch.equal(src_block_data[i], dst_block_data[i]) is False
            for i in range(len(src_block_data))
        ]
    )

unifiedcache/ucm_connector/factory.py

Lines changed: 3 additions & 0 deletions

@@ -63,3 +63,6 @@ def create_connector(cls, connector_name: str, config: dict) -> UcmKVStoreBase:
 UcmConnectorFactory.register_connector(
     "UcmNfsStore", "unifiedcache.ucm_connector.ucm_nfs_store", "UcmNfsStore"
 )
+UcmConnectorFactory.register_connector(
+    "UcmMooncakeStore", "unifiedcache.ucm_connector.ucm_mooncake", "UcmMooncakeStore"
+)
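
A hypothetical usage sketch (not part of this diff) of how the newly registered name resolves through the factory; the `create_connector` signature comes from the hunk context above, and the config keys mirror the Mooncake documentation added in this commit:

```python
from unifiedcache.ucm_connector.factory import UcmConnectorFactory

# Same placeholder endpoints as in the documentation above.
mooncake_config = {
    "local_hostname": "127.0.0.1",
    "metadata_server": "http://127.0.0.1:23790/metadata",
    "protocol": "tcp",
    "device_name": "",
    "master_server_address": "127.0.0.1:50001",
}

# Resolves "UcmMooncakeStore" to unifiedcache.ucm_connector.ucm_mooncake.UcmMooncakeStore
# and returns a UcmKVStoreBase instance.
store = UcmConnectorFactory.create_connector("UcmMooncakeStore", mooncake_config)
```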
