Commit f531051
Milestone 1 of Internal Process-level Fault Tolerance (#61)
* feat(fault-tolerance): add class skeletons for fault tolerance
Signed-off-by: fangyuchu <fangyuchu@qq.com>
* config: add configuration options for fault tolerance
Signed-off-by: fangyuchu <fangyuchu@qq.com>
* add generate_identity and generate_identitys functions to generate a unique identity for a ZMQ ROUTER node
* add fault report addr to the service startup configuration
* add init WorkerGuard
* add engine_core_cmd_addr, fault_report_addr, client_cmd_addr, engine_core_identitys in EngineZmqAddresses
init engine_core_cmd_addr, fault_report_addr, client_cmd_addr in launch_core_engines func
add _report_engine_dead func in CoreEngineProcManager
* init ClientGuard
init EngineZmqAddresses engine_core_identitys
* init EngineCoreGuard
* change generate_identitys to generate_identity_group
* code typesetting is optimized
* code typesetting is optimized
* changed code format ensure every line < 88 chars
* changed code format ensure every line < 88 chars
fix error Value of type "dict[Any, Any] | None" is not indexable [index]
* fix bug
Error: vllm/v1/engine/utils.py:122:89: E501 Line too long (117 > 88)
Error: vllm/v1/engine/utils.py:1059:9: F402 Import `uuid` from line 6 shadowed by loop variable
* fix
Error: vllm/v1/engine/utils.py:1045: error: Need type annotation for "uuids" (hint: "uuids: set[<type>] = ...") [var-annotated]
* fix
error: Value of type "dict[Any, Any] | None" is not indexable [index]
* fix
error: Value of type "dict[Any, Any] | None" is not indexable [index]
Signed-off-by: a798347923 <2645302020@qq.com>
* add _send_msg in EngineCoreGuard
Signed-off-by: a798347923 <2645302020@qq.com>
* add import torch.cuda
* add _recv_cmd function docstring that clearly explains the meaning of the return value.
* changed recv_fault_msg to recv_msg
add ClientGuard __init__ func parameter types
* add engine monitor
Signed-off-by: TianZhuo <2770730562@qq.com>
* Delete requirements/test.txt~
Signed-off-by: a798347923 <39047817+a798347923@users.noreply.github.com>
* Delete vllm/v1/engine/core_client.py~
Signed-off-by: a798347923 <39047817+a798347923@users.noreply.github.com>
* simplify _send_msg and _recv_cmd in EngineCoreGuard
* simplify recv_msg in ClientGuard
* engine: add fault tolerance features for EngineCore.
Signed-off-by: fangyuchu <fangyuchu@qq.com>
* engine: add timeout mechanism in retry.
Signed-off-by: fangyuchu <fangyuchu@qq.com>
* add engine monitor
* Delete vllm/v1/engine/exceptions.py~
Signed-off-by: 205150940 <112750056+205150940@users.noreply.github.com>
* update actor_index
* update enginedead flag
* handle fault and report exception
Signed-off-by: w00689259 <wangzhuo66@huawei.com>
* fix engine_actor
* fix engine_actor fault_info
* handle fault and report exception
Signed-off-by: w00689259 <wangzhuo66@huawei.com>
* delete num_identity
* changed try except
* fix debug error
* fix one bug.
Signed-off-by: fangyuchu <fangyuchu@qq.com>
* add fault_report_addr in FaultToleranceConfig
* add handle fault&get_fault_info api
Signed-off-by: w00689259 <wangzhuo66@huawei.com>
* remove fault_report_address in CoreEngineActorManager __init__
Signed-off-by: a798347923 <2645302020@qq.com>
* ruff format
Signed-off-by: a798347923 <2645302020@qq.com>
* add handle fault&get_fault_info api
Signed-off-by: w00689259 <wangzhuo66@huawei.com>
* fix one bug.
Signed-off-by: fangyuchu <fangyuchu@qq.com>
* add fault_report_port in FaultToleranceConfig
Signed-off-by: a798347923 <2645302020@qq.com>
* add zmq_addr concatenate with fault_report_addr and fault_report_port
Signed-off-by: a798347923 <2645302020@qq.com>
* fault reporter bug fix
Signed-off-by: w00689259 <wangzhuo66@huawei.com>
* fault reporter bug fix
Signed-off-by: w00689259 <wangzhuo66@huawei.com>
* fault reporter bug fix
Signed-off-by: w00689259 <wangzhuo66@huawei.com>
* fault reporter bug fix
Signed-off-by: w00689259 <wangzhuo66@huawei.com>
* fault reporter bug fix
Signed-off-by: w00689259 <wangzhuo66@huawei.com>
* fault reporter bug fix
Signed-off-by: w00689259 <wangzhuo66@huawei.com>
* fix some bug
* fault reporter bug fix
Signed-off-by: w00689259 <wangzhuo66@huawei.com>
* fault reporter bug fix
Signed-off-by: w00689259 <wangzhuo66@huawei.com>
* remove fault_report_addr in FaultToleranceConfig
Signed-off-by: a798347923 <2645302020@qq.com>
* refactor: relocate method serialization functions to serial_util.py
Signed-off-by: fangyuchu <fangyuchu@qq.com>
* fix actor bug
* fix actor bug
* add engine_core_cmd_addr in FaultToleranceConfig
Signed-off-by: a798347923 <2645302020@qq.com>
* add and use _stop_worker_execution in EngineCoreGuard
Signed-off-by: a798347923 <2645302020@qq.com>
* add and use run in WorkerGuard
Signed-off-by: a798347923 <2645302020@qq.com>
* fix actor bug
* fix bug
* fix sentinel
* fix bug vllm/v1/engine/core.py:847: error: Missing positional argument "tp_size" in call to "EngineCoreGuard"
Signed-off-by: a798347923 <2645302020@qq.com>
* fix bug error: Missing positional arguments "length", "byteorder" in call to "to_bytes" of "int"
Signed-off-by: a798347923 <2645302020@qq.com>
* fix bug in fault tolerance mode
Signed-off-by: w00689259 <wangzhuo66@huawei.com>
* fix bug in fault tolerance mode
Signed-off-by: w00689259 <wangzhuo66@huawei.com>
* change fault_report_port to internal_fault_report_port
add external_fault_notify_port
Signed-off-by: a798347923 <2645302020@qq.com>
* change fault_report_port to internal_fault_report_port
add external_fault_notify_port
Signed-off-by: a798347923 <2645302020@qq.com>
* add _recv_cmd func
use deserialize_method_call and run_method in run func
Signed-off-by: a798347923 <2645302020@qq.com>
* Update core.py
fix bug error: Need type annotation for "kwargs" (hint: "kwargs: dict[<type>, <type>] = ...")
Signed-off-by: a798347923 <39047817+a798347923@users.noreply.github.com>
* add self.ctx.term() in shutdown()
Signed-off-by: a798347923 <2645302020@qq.com>
* changed import deserialize_method_call,serialize_method_call
Signed-off-by: a798347923 <2645302020@qq.com>
* changed init worker_guard in init_device
Signed-off-by: a798347923 <2645302020@qq.com>
* Update core.py
add import serialize_method_call
Signed-off-by: a798347923 <39047817+a798347923@users.noreply.github.com>
* Update gpu_worker.py
changed init WorkerGuard in init_device
Signed-off-by: a798347923 <39047817+a798347923@users.noreply.github.com>
* Update gpu_worker.py
FIX BUG self.worker_guard: WorkerGuard|None = None
Signed-off-by: a798347923 <39047817+a798347923@users.noreply.github.com>
* Update gpu_worker.py
fix bug error: Argument 1 to "deserialize_method_call" has incompatible type "str | None"; expected "str" [arg-type]
Signed-off-by: a798347923 <39047817+a798347923@users.noreply.github.com>
* Update gpu_worker.py
ruff format
Signed-off-by: a798347923 <39047817+a798347923@users.noreply.github.com>
* Update core.py
ruff-format
Signed-off-by: a798347923 <39047817+a798347923@users.noreply.github.com>
* actively send exception information
Signed-off-by: w00689259 <wangzhuo66@huawei.com>
* actively send exception information
Signed-off-by: w00689259 <wangzhuo66@huawei.com>
* actively send exception information
Signed-off-by: w00689259 <wangzhuo66@huawei.com>
* change engine_core_cmd_addr(str) to engine_core_cmd_addrs(list[str]) in EngineZmqAddresses
Signed-off-by: a798347923 <2645302020@qq.com>
* change engine_core_cmd_addr(str) to engine_core_cmd_addrs(list[str]) in EngineZmqAddresses
Signed-off-by: a798347923 <2645302020@qq.com>
* Update utils.py
delete engine_core_cmd_addr in EngineZmqAddresses
Signed-off-by: a798347923 <39047817+a798347923@users.noreply.github.com>
* Remove redundant configuration: fault-pub-port
Signed-off-by: fangyuchu <fangyuchu@qq.com>
* Send pause instructions after receiving fault info in ClientGuard
Signed-off-by: fangyuchu <fangyuchu@qq.com>
* change engine_core_guard_identities from dict[int, bytes] to list[bytes]
Signed-off-by: a798347923 <2645302020@qq.com>
* fix bug "only the worker guard of engine core 0 can receive messages sent from engine core guard"
Signed-off-by: a798347923 <2645302020@qq.com>
* change local_rank to rank_in_group in WorkerGuard
Signed-off-by: a798347923 <2645302020@qq.com>
* changed del self.client_cmd_registry[int(unhealthy_engine.engine_id)]
Signed-off-by: a798347923 <2645302020@qq.com>
* add gloo communication timeout
* fix some bug
* add stateless_process_group gloo_comm_timeout
* reconstruct fault receiver&fault handler
Signed-off-by: w00689259 <wangzhuo66@huawei.com>
* fix some bug
* reconstruct fault receiver&fault handler
Signed-off-by: w00689259 <wangzhuo66@huawei.com>
* reconstruct fault receiver&fault handler
Signed-off-by: w00689259 <wangzhuo66@huawei.com>
* fix return format
Signed-off-by: w00689259 <wangzhuo66@huawei.com>
* fix return format
Signed-off-by: w00689259 <wangzhuo66@huawei.com>
* fix return format
Signed-off-by: w00689259 <wangzhuo66@huawei.com>
* add abort request
* fix some bug
* fix some bug
* fix some bug
* add dt for client guard
Signed-off-by: w00689259 <wangzhuo66@huawei.com>
* add dt for client guard
Signed-off-by: w00689259 <wangzhuo66@huawei.com>
* add dt for client guard
Signed-off-by: w00689259 <wangzhuo66@huawei.com>
* Implementation of two types of pause: a soft one by using flag signals and a hard one by aborting nccl communicators.
Signed-off-by: fangyuchu <fangyuchu@qq.com>
* Refine certain log forms and fix a minor bug in pause function.
Signed-off-by: fangyuchu <fangyuchu@qq.com>
* Refactor and abstract the recv_msg logic in ClientGuard (CG), EngineCoreGuard (ECG), and WorkerGuard (WG).
Signed-off-by: fangyuchu <fangyuchu@qq.com>
* Add and check method uuid when sending commands and receiving results.
Signed-off-by: fangyuchu <fangyuchu@qq.com>
* Abstract the logic of sending instructions and waiting responses from FaultHandler
Signed-off-by: fangyuchu <fangyuchu@qq.com>
* Add options in EngineCoreGuard to recv execution results from WorkerGuard
Signed-off-by: fangyuchu <fangyuchu@qq.com>
* Support worker reinitialization after hard pause; add task queue in FaultHandler to ensure sequential task execution
Signed-off-by: fangyuchu <fangyuchu@qq.com>
* resolve conflicts
Signed-off-by: w00689259 <wangzhuo66@huawei.com>
* resolve conflicts
Signed-off-by: w00689259 <wangzhuo66@huawei.com>
* resolve conflicts
Signed-off-by: w00689259 <wangzhuo66@huawei.com>
* resolve conflicts
Signed-off-by: w00689259 <wangzhuo66@huawei.com>
* resolve conflicts
Signed-off-by: w00689259 <wangzhuo66@huawei.com>
* resolve conflicts
Signed-off-by: w00689259 <wangzhuo66@huawei.com>
* add engine core ut
Signed-off-by: w00689259 <wangzhuo66@huawei.com>
* add engine core ut
Signed-off-by: w00689259 <wangzhuo66@huawei.com>
* Ensure WorkerGuard command execution returns result; fix missing set_device when TP>1
Signed-off-by: fangyuchu <fangyuchu@qq.com>
* rename & format logger
Signed-off-by: w00689259 <wangzhuo66@huawei.com>
* rename & format logger
Signed-off-by: w00689259 <wangzhuo66@huawei.com>
* feat(nccl): enable non-blocking NCCL communicators to support ncclCommAbort
Signed-off-by: fangyuchu <fangyuchu@qq.com>
* reinit dp_group
* fix bug
* fix bug
* fix bug
* fix bug (#54)
* Move requests to waiting queue instead of abandoning them directly.
Signed-off-by: fangyuchu <fangyuchu@qq.com>
* add annotation
Signed-off-by: w00689259 <wangzhuo66@huawei.com>
* fix typos
Signed-off-by: fangyuchu <fangyuchu@qq.com>
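The "move requests to waiting queue instead of abandoning them" item above can be illustrated with a minimal sketch. `SketchScheduler` and `requeue_running` are hypothetical names introduced only for this illustration; they are not the scheduler classes or methods touched by this commit, and the real logic in `vllm/v1/core/sched` may differ.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class SketchScheduler:
    """Hypothetical stand-in for a scheduler; not vLLM's actual class."""

    waiting: deque = field(default_factory=deque)
    running: list = field(default_factory=list)

    def requeue_running(self) -> int:
        """Move every running request to the front of the waiting queue.

        The original order is preserved, so interrupted requests are
        retried before requests that were never scheduled.
        """
        moved = len(self.running)
        # extendleft() reverses its input, so feed it the reversed list.
        self.waiting.extendleft(reversed(self.running))
        self.running.clear()
        return moved


# Two requests were in flight when the fault occurred; one was still queued.
sched = SketchScheduler(waiting=deque(["req-c"]), running=["req-a", "req-b"])
moved = sched.requeue_running()
```

Putting the interrupted requests at the head of the queue is one possible policy; appending them to the tail would be equally simple but would delay requests that had already made progress.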
---------
Signed-off-by: fangyuchu <fangyuchu@qq.com>
Signed-off-by: a798347923 <2645302020@qq.com>
Signed-off-by: TianZhuo <2770730562@qq.com>
Signed-off-by: a798347923 <39047817+a798347923@users.noreply.github.com>
Signed-off-by: 205150940 <112750056+205150940@users.noreply.github.com>
Signed-off-by: w00689259 <wangzhuo66@huawei.com>
Signed-off-by: zWaNg3 <37772915+zWaNg3@users.noreply.github.com>
Co-authored-by: zWaNg3 <37772915+zWaNg3@users.noreply.github.com>
Co-authored-by: a798347923 <2645302020@qq.com>
Co-authored-by: TianZhuo <2770730562@qq.com>
Co-authored-by: 205150940 <112750056+205150940@users.noreply.github.com>
Co-authored-by: a798347923 <39047817+a798347923@users.noreply.github.com>
Co-authored-by: w00689259 <wangzhuo66@huawei.com>
1 parent 4b1ff13 commit f531051
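The commit message mentions serializing method calls and adding a method UUID that is checked when commands are sent and results are received. The sketch below shows one way such a round trip could look. All function names (`sketch_serialize_call`, `sketch_run_call`, `sketch_check_reply`) and the JSON wire format are assumptions made for illustration; they are not the `serialize_method_call` / `deserialize_method_call` helpers from `serial_util.py`, whose signatures are not shown here.

```python
import json
import uuid


def sketch_serialize_call(method: str, **kwargs) -> tuple[str, str]:
    """Return (call_id, payload). Hypothetical JSON wire format."""
    call_id = uuid.uuid4().hex
    payload = json.dumps({"uuid": call_id, "method": method, "kwargs": kwargs})
    return call_id, payload


def sketch_run_call(target: object, payload: str) -> str:
    """Decode a payload, invoke the named method, and echo the UUID back."""
    msg = json.loads(payload)
    try:
        result = getattr(target, msg["method"])(**msg["kwargs"])
        reply = {"uuid": msg["uuid"], "success": True, "result": result}
    except Exception as exc:  # report the failure instead of raising
        reply = {"uuid": msg["uuid"], "success": False, "result": repr(exc)}
    return json.dumps(reply)


def sketch_check_reply(call_id: str, reply: str) -> dict:
    """Accept a reply only if it carries the UUID of the pending command."""
    msg = json.loads(reply)
    if msg["uuid"] != call_id:
        raise ValueError("reply does not match the pending command")
    return msg


class DemoWorker:
    """Hypothetical receiver used only to exercise the sketch."""

    def pause(self, timeout: int) -> str:
        return f"paused within {timeout}s"


call_id, payload = sketch_serialize_call("pause", timeout=5)
reply = sketch_run_call(DemoWorker(), payload)
checked = sketch_check_reply(call_id, reply)
```

Echoing the UUID lets the sender discard a late reply to an earlier command rather than mistaking it for the answer to the current one.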
File tree
28 files changed (+2600, -79 lines)
- tests/v1/engine
- vllm
  - config
  - distributed
    - device_communicators
  - engine
  - entrypoints
    - cli
    - openai
  - utils
  - v1
    - core/sched
    - engine
    - worker
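The commit message refers to an identity helper in `vllm/v1/engine/utils.py` that generates unique identities for ZMQ ROUTER nodes, later renamed to `generate_identity_group`. A minimal sketch of the idea follows; the function name `sketch_generate_identity_group`, its parameters, and the identity layout are assumptions for illustration and do not reproduce the helper added in this commit.

```python
import uuid


def sketch_generate_identity_group(prefix: str, n: int) -> list[bytes]:
    """Return n distinct byte identities sharing a readable prefix.

    Hypothetical helper: collect random UUIDs in a set until there are
    n distinct values, then encode each one as bytes.
    """
    uuids: set[str] = set()
    while len(uuids) < n:
        uuids.add(uuid.uuid4().hex)
    identities = [f"{prefix}_{u}".encode() for u in sorted(uuids)]
    for identity in identities:
        # ZeroMQ routing identities must be between 1 and 255 bytes long.
        if not 1 <= len(identity) <= 255:
            raise ValueError("identity length outside the 1..255 byte range")
    return identities


ids = sketch_generate_identity_group("engine_core_guard", 4)
```

Each identity could then be assigned to a peer socket before it connects, so that a ROUTER socket can address that peer by name.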