Summary
Phase 2 enables automatic recovery from persistent rank failures by integrating Phase 0-1's health monitoring with vLLM's existing scale_elastic_ep() API. When failures persist beyond a threshold, the system automatically triggers scale-down (remove failed ranks) followed by scale-up (restore capacity) to achieve self-healing without manual intervention.
Motivation
Problem: Phase 0-1 Leaves System in Degraded State
Phase 0-1 (#27774) successfully provides graceful degradation:
- Detects failures via per-expert latency monitoring within a bounded number of forward passes
- Applies weight penalty to unhealthy experts during EPLB rebalancing
- Continues serving with reduced capacity
However, critical limitations remain in production:
- Incomplete traffic blocking. The weight penalty is probabilistic rather than absolute, so failed experts continue to receive traffic.
- No capacity restoration. Phases 0–1 never recover capacity, so failed ranks remain in the cluster indefinitely and redundant experts keep compensating for them.
- Manual intervention is required to restore full capacity.
Proposed Change.
Solution: Automatic Recovery via Elastic Scaling
Phase 2 adds three critical capabilities:
- Hard-blocking: Immediately stop ALL traffic to failed ranks (not just penalty)
- Automatic replacement: Trigger elastic scale-down + scale-up to restore capacity
- Self-healing: System recovers without manual intervention
Architecture
┌─────────────────────────────────────────────────────────────┐
│                    External Orchestrator                    │
│              (Dynamo, K8s, Ray, Custom, etc.)               │
│                                                             │
│  - Monitors health API                                      │
│  - Receives recovery webhooks                               │
│  - Decides scaling strategy                                 │
│  - Manages pod/worker lifecycle                             │
└────────────┬────────────────────────────────┬───────────────┘
             │                                │
      ┌──────▼──────────┐             ┌──────▼──────────┐
      │  Health/Events  │             │ Scaling Control │
      │      APIs       │             │      APIs       │
      └──────┬──────────┘             └──────┬──────────┘
             │                                │
┌────────────┴────────────────────────────────┴───────────────┐
│                         vLLM System                         │
│                                                             │
│  Phase 0-1: Detection & Soft Blocking                       │
│  - Health monitoring                                        │
│  - Weight penalties                                         │
│  - Failure tracking                                         │
│                                                             │
│  Phase 2: Recovery Coordination                             │
│  - Expose health status                                     │
│  - Emit recovery events                                     │
│  - Accept scaling commands                                  │
│  - Coordinate drain & transitions                           │
└──────────────────────────────────────────────────────────────┘
Track the failed state of each rank
@dataclass
class EPLBConfig:
    # ... existing EPLB fields ...

    # Phase 0-1 (from #27774)
    health_check_enabled: bool = True
    health_latency_threshold: float = 3.0  # 3x median = unhealthy
    health_penalty_factor: float = 10.0    # Weight penalty for unhealthy experts

    # Phase 2 (NEW) - Automatic Recovery
    recovery_enabled: bool = True
    # Trigger recovery if failure persists for this many forward passes
    failure_persistence_threshold: int = 1000
    # Per-rank threshold: trigger recovery if this fraction of experts failed
    rank_failure_ratio_threshold: float = 0.5


class EplbState:
    def __init__(self, ...):
        # ... existing state ...

        # Phase 2: Recovery state
        self.rank_failure_duration: torch.Tensor = torch.zeros(...)
        # Worker sets this when recovery needed, coordinator reads it
        self.pending_recovery_request: Optional[RecoveryRequest] = None

    def _update_failure_duration(self) -> None:
        """
        Track consecutive unhealthy passes per rank.

        Increment for unhealthy ranks, reset for recovered ranks.
        A rank is unhealthy if ALL its experts are unhealthy.
        """
Trigger recovery for serious degradation
# EplbState
def _should_trigger_recovery(self) -> bool:
    """
    Check if any rank meets either condition for recovery:
      1. Failure ratio exceeds the threshold (e.g., >50% of experts failed)
      2. Failure has persisted for the threshold number of passes

    Returns:
        True if recovery should be triggered for at least one rank.
    """

def step(self, model: MixtureOfExperts, ...) -> None:
    """EPLB step with recovery trigger."""
    # Phase 0-1: Health monitoring
    if self.eplb_config.health_check_enabled:
        self._update_health_mask(model)
        self._update_failure_duration()

    # Phase 2: Recovery trigger (NEW)
    if self.eplb_config.recovery_enabled and self._should_trigger_recovery():
        failed_ranks = self._hard_block_and_prepare_recovery()
        return  # Skip normal rebalancing
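A possible body for _should_trigger_recovery, using the same illustrative health-mask tensor and the Phase 2 config fields above (sketch, not the final implementation):

# Sketch only; relies on the assumed expert_health_mask tensor from above.
def _should_trigger_recovery(self) -> bool:
    cfg = self.eplb_config

    # Condition 1: too large a fraction of a rank's experts has failed.
    unhealthy_ratio = (~self.expert_health_mask).float().mean(dim=-1)
    ratio_exceeded = unhealthy_ratio > cfg.rank_failure_ratio_threshold

    # Condition 2: a rank has been fully unhealthy for too many passes.
    persistence_exceeded = (
        self.rank_failure_duration >= cfg.failure_persistence_threshold
    )

    # Trigger if ANY rank meets EITHER condition.
    return bool((ratio_exceeded | persistence_exceeded).any())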
Poll workers and notify the orchestrator
class AsyncLLM:
    def __init__(self, ...):
        # ... existing init ...
        self.orchestrator_webhook_url = orchestrator_webhook_url
        self.last_recovery_check = time.time()

    async def _check_and_notify_recovery(self) -> None:
        """
        Poll workers for recovery triggers and notify the orchestrator.

        Instead of executing recovery internally, emit an event to the
        external orchestrator (Dynamo, etc.).
        """
        # Poll all workers via collective_rpc
        recovery_triggers = await self.collective_rpc(
            method="get_recovery_trigger",  # ← calls GPUWorker.get_recovery_trigger()
            timeout=0.5,
        )

        # Aggregate failed ranks reported by any worker
        all_failed_ranks = set()
        for worker_idx, failed_ranks in enumerate(recovery_triggers):
            if failed_ranks is not None:
                all_failed_ranks.update(failed_ranks)

        if len(all_failed_ranks) > 0:
            await self._emit_recovery_event(sorted(all_failed_ranks))

    async def _emit_recovery_event(self, failed_ranks: list[int]) -> None:
        """
        Notify the external orchestrator that recovery is needed.

        The orchestrator will then:
          1. Call POST /v1/admin/drain
          2. Wait for drain completion
          3. Scale down failed pods/workers
          4. Call POST /v1/admin/prepare_scale
          5. Launch new pods/workers
          6. Call POST /v1/admin/apply_scale
        """
        event = {
            "event_type": "recovery_required",
            "timestamp": datetime.utcnow().isoformat(),
            "failed_ranks": failed_ranks,
            "current_world_size": self.world_size,
            # ... additional details ...
        }
        if self.orchestrator_webhook_url:
            await self._send_webhook(self.orchestrator_webhook_url, event)
        logger.warning(
            f"Recovery required for ranks {failed_ranks}. "
            f"Waiting for orchestrator to handle scaling."
        )
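For completeness, the worker-side hook polled above could be as small as the following sketch. The method name matches the collective_rpc call; the attribute path and the RecoveryRequest.failed_ranks field are illustrative assumptions, not settled API.

# GPUWorker side (sketch only). Report the pending recovery request once,
# then clear it so the same failure is not re-emitted on every poll.
def get_recovery_trigger(self) -> Optional[list[int]]:
    eplb_state = self.model_runner.eplb_state  # attribute path is illustrative
    request = eplb_state.pending_recovery_request
    if request is None:
        return None
    eplb_state.pending_recovery_request = None
    return request.failed_ranks  # assumed field on RecoveryRequest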
Exposed APIs for Orchestrators
- Health Monitoring
  - GET /v1/health/experts - detailed per-rank health status
- Recovery Events
  - Webhook callbacks to the configured orchestrator endpoint
  - Event types: recovery_required, rank_failed, rank_recovered
- Scaling Control
  - Modify POST /scale_elastic_ep to accept:
    - ranks_to_remove (for explicit rank selection)
    - rank_endpoints (for non-Ray backends)
  - Keep new_data_parallel_size for backward compatibility
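To make the proposed request shape concrete, an orchestrator call might look roughly like this (the JSON field names mirror the parameters listed above; the exact schema, host, and endpoint values are placeholders):

import requests  # orchestrator-side example, not vLLM code

BASE = "http://vllm-host:8000"  # placeholder address

# Cleanup: remove exactly the masked/failed ranks (e.g., rank 0 of 4 -> 3 ranks).
requests.post(f"{BASE}/scale_elastic_ep", json={
    "new_data_parallel_size": 3,
    "ranks_to_remove": [0],  # proposed parameter: explicit rank selection
}).raise_for_status()

# Restore: scale back up once masked ranks are cleared (3 -> 4).
requests.post(f"{BASE}/scale_elastic_ep", json={
    "new_data_parallel_size": 4,
    "rank_endpoints": ["10.0.0.5:13345"],  # proposed parameter for non-Ray backends
}).raise_for_status()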
Enforced Clean-Up Policy
Phase 2 enforces strict constraints when unhealthy ranks exist to maintain clean state transitions:
Constraint 1: Scale-Down Must Remove ALL Masked Ranks
When unhealthy ranks are present, scale-down is restricted to exact removal:
# With rank 0 masked out of 4 ranks:
✅ Allowed: 4 → 3 (removes the masked rank)
❌ Blocked: 4 → 2 (must be exactly 3)
❌ Blocked: 4 → 1 (must be exactly 3)
Constraint 2: Scale-Up Blocked Until Cleanup
Scale-up is completely blocked when masked ranks exist:
# With rank 0 masked:
❌ Blocked: 4 → 5, 4 → 6, 4 → 8 (any scale-up)
✅ Required workflow:
1. Scale-down to remove masked ranks: 4 → 3
2. System clears masked_ranks after removal
3. Then scale-up allowed: 3 → 5, 3 → 6, etc.
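A sketch of the server-side check these constraints imply, assuming the engine tracks the current world size and a masked_ranks set (names are illustrative):

# Sketch of the constraint check; current_world_size / masked_ranks are
# illustrative names for state the engine would track alongside EplbState.
def validate_scale_request(current_world_size: int,
                           masked_ranks: set[int],
                           new_data_parallel_size: int) -> None:
    if not masked_ranks:
        return  # no unhealthy ranks: normal scaling rules apply

    expected = current_world_size - len(masked_ranks)
    if new_data_parallel_size > current_world_size:
        # Constraint 2: scale-up is blocked until masked ranks are removed.
        raise ValueError(
            f"Scale-up blocked while ranks {sorted(masked_ranks)} are masked; "
            f"scale down to {expected} first.")
    if new_data_parallel_size != expected:
        # Constraint 1: scale-down must remove exactly the masked ranks.
        raise ValueError(
            f"Scale-down must remove all masked ranks: expected target size "
            f"{expected}, got {new_data_parallel_size}.")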
Rationale
- Clear State Semantics
  - One operation = one purpose: cleanup (remove masked ranks) OR scaling (capacity adjustment)
  - Prevents ambiguity: "Should we remove the masked rank AND a healthy rank? Which one?"
- Avoid State Accumulation
  - Without constraints, masked ranks persist indefinitely through multiple scale operations
  - Forces the system back to a clean state before further operations
- Simplified Implementation
  - No complex tracking of masked ranks through multiple scale events
  - Clear tensor shape transitions: masking keeps size, scaling changes size (never both)
- Predictable Recovery
  - Orchestrator workflow is explicit: Detect → Cleanup → Restore
  - Each step is atomic and verifiable
Q&A
Q: Why is downtime unavoidable for now?
A: Downtime is required due to PyTorch/NCCL architectural constraints: vLLM must destroy and recreate communication groups during scaling because PyTorch provides no API to dynamically update group membership. During cleanup_dist_env_and_memory(), all NCCL/NVSHMEM groups are destroyed, making inter-rank communication impossible. Also, DeepEP's NVSHMEM barriers (e.g., nvshmemx_barrier_all_block()) require ALL ranks to participate, so a dead rank in the group would cause hangs. Therefore, traffic must be drained before scaling begins via wait_for_requests_to_drain().
Q: Why don't we need hard-blocking during recovery?
A: Since drain is mandatory, explicit hard-blocking is unnecessary: Phase 0-1's weight penalty already reduces traffic to failed ranks during the degraded period. When recovery triggers, wait_for_requests_to_drain() naturally finishes any remaining in-flight requests on all ranks (including minimal traffic still reaching failed ranks). By the time group destruction begins, no requests are active. Thus, Phase 2 only needs to: (1) detect persistent failures, (2) trigger scale_elastic_ep() twice, (3) let the existing drain mechanism handle the rest. Hard-blocking would add complexity without benefit since the drain already guarantees zero active requests during the scaling operation.
Total downtime: drain + 2× scaling + CUDA graph recapture
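For illustration, an orchestrator-side handler for the recovery_required webhook could drive the sequence above roughly as follows. The /v1/admin/* endpoints are the ones proposed here (not existing vLLM APIs), and the request bodies are placeholders:

import requests  # orchestrator-side sketch, not vLLM code

def handle_recovery_required(event: dict, base: str = "http://vllm-host:8000") -> None:
    failed_ranks = event["failed_ranks"]

    # 1. Ask vLLM to drain; 2. waiting/polling for drain completion is elided here.
    requests.post(f"{base}/v1/admin/drain").raise_for_status()

    # 3. Terminate the failed pods/workers (K8s / Ray / Dynamo specific).

    # 4. Prepare vLLM for the reduced world size (request body is a placeholder).
    requests.post(f"{base}/v1/admin/prepare_scale",
                  json={"ranks_to_remove": failed_ranks}).raise_for_status()

    # 5. Launch replacement pods/workers (orchestrator specific).

    # 6. Apply the new topology: groups are rebuilt and CUDA graphs recaptured,
    #    which is the downtime window described above.
    requests.post(f"{base}/v1/admin/apply_scale").raise_for_status()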
Q: Why expose APIs instead of doing internal recovery?
A: This design enables deployment flexibility and cost control:
- Multi-orchestrator support: Works with Dynamo, K8s, Ray, custom frameworks
- Policy control: Orchestrator decides when/how to scale based on:
  - Cost considerations: Replacement nodes may be more expensive
  - SLA requirements: Can delay recovery during low-traffic periods
  - Capacity planning: Coordinate with broader cluster autoscaling policies
- Infrastructure-aware decisions: Orchestrator has topology visibility that vLLM lacks:
  - Network topology: Replacement workers may be on different racks/zones with higher latency
  - GPU availability: New nodes might have different GPU types (A100 → H100)
  - Placement constraints: Affinity rules, availability zones, rack diversity
  - Performance impact: Customer can evaluate if degraded mode is preferable to suboptimal topology
- Stay-in-degraded mode option: Customer can explicitly choose NOT to recover if:
  - Replacement topology would hurt performance (cross-rack communication)
  - Recovery downtime isn't acceptable for current workload
  - Waiting for better placement options (preferred GPU type, same rack)
  - Cost of replacement exceeds value of restored capacity
- Separation of concerns:
  - vLLM focuses on detection and graceful degradation
  - Orchestrator handles lifecycle, placement, and cost optimization
Feedback Period.
No response
CC List.
@ruisearch42 @pavanimajety @GuanLuo @benchislett @xinli-sw
Any Other Things.
No response