
[RFC]: Fault-Tolerant Expert parallelism phase 2 (recovery) #27908

Summary

Phase 2 enables automatic recovery from persistent rank failures by integrating Phase 0-1's health monitoring with vLLM's existing scale_elastic_ep() API. When failures persist beyond a threshold, the system automatically triggers scale-down (remove failed ranks) followed by scale-up (restore capacity) to achieve self-healing without manual intervention.

Motivation

Problem: Phase 0-1 Leaves System in Degraded State
Phase 0-1 (#27774) successfully provides graceful degradation:

  • Detects failures via per-expert latency monitoring over a window of recent forward passes
  • Applies weight penalty to unhealthy experts during EPLB rebalancing
  • Continues serving with reduced capacity

However, critical limitations remain in production:

  1. Incomplete traffic blocking. The weight penalty is probabilistic rather than absolute, so failed experts continue to receive traffic.
  2. No capacity is restored. Phases 0–1 never recover capacity, so failed ranks remain in the cluster indefinitely and redundant experts must compensate for them.
  3. Manual intervention is required to restore full capacity.

Proposed Change.

Solution: Automatic Recovery via Elastic Scaling

Phase 2 adds three critical capabilities:

  • Hard-blocking: Immediately stop ALL traffic to failed ranks (not just penalty)
  • Automatic replacement: Trigger elastic scale-down + scale-up to restore capacity
  • Self-healing: System recovers without manual intervention

Architecture

┌─────────────────────────────────────────────────────────────┐
│                    External Orchestrator                     │
│              (Dynamo, K8s, Ray, Custom, etc.)                │
│                                                              │
│  - Monitors health API                                       │
│  - Receives recovery webhooks                                │
│  - Decides scaling strategy                                  │
│  - Manages pod/worker lifecycle                              │
└────────────┬────────────────────────────────┬───────────────┘
             │                                 │
      ┌──────▼──────────┐              ┌──────▼──────────┐
      │  Health/Events  │              │ Scaling Control │
      │      APIs       │              │      APIs       │
      └──────┬──────────┘              └──────┬──────────┘
             │                                 │
┌────────────┴─────────────────────────────────┴───────────────┐
│                        vLLM System                            │
│                                                               │
│  Phase 0-1: Detection & Soft Blocking                         │
│  - Health monitoring                                          │
│  - Weight penalties                                           │
│  - Failure tracking                                           │
│                                                               │
│  Phase 2: Recovery Coordination                               │
│  - Expose health status                                       │
│  - Emit recovery events                                       │
│  - Accept scaling commands                                    │
│  - Coordinate drain & transitions                             │
└───────────────────────────────────────────────────────────────┘

Track the failed state of each rank

@dataclass
class EPLBConfig:
    # ... existing EPLB fields ...
    
    # Phase 0-1 (from #27774)
    health_check_enabled: bool = True
    health_latency_threshold: float = 3.0  # 3x median = unhealthy
    health_penalty_factor: float = 10.0    # Weight penalty for unhealthy experts
    
    # Phase 2 (NEW) - Automatic Recovery
    recovery_enabled: bool = True
    
    # Trigger recovery if failure persists for this many forward passes
    failure_persistence_threshold: int = 1000
    
    # Per-rank threshold: trigger recovery if this fraction of experts failed
    rank_failure_ratio_threshold: float = 0.5

class EplbState:
    def __init__(self, ...):
        # ... existing state ...
        
        # Phase 2: Recovery state
        self.rank_failure_duration: torch.Tensor = torch.zeros(...)

        # Worker sets this when recovery needed, coordinator reads it
        self.pending_recovery_request: Optional[RecoveryRequest] = None

    def _update_failure_duration(self) -> None:
        """
        Track consecutive unhealthy passes per rank.
        
        Increment for unhealthy ranks, reset for recovered ranks.
        A rank is unhealthy if ALL its experts are unhealthy.
        """

Trigger recovery for serious degradation

# EplbState
def _should_trigger_recovery(self) -> bool:
    """
    Check if any rank meets either of the following conditions for recovery:
    1. Failure ratio exceeds threshold (e.g., >50% experts failed)
    2. Failure persisted for threshold number of passes
    
    Returns:
        True if recovery should be triggered for at least one rank
    """

def step(self, model: MixtureOfExperts, ...) -> None:
    """EPLB step with recovery trigger."""
    
    # Phase 0-1: Health monitoring
    if self.eplb_config.health_check_enabled:
        self._update_health_mask(model)
        self._update_failure_duration()
    
    # Phase 2: Recovery trigger (NEW)
    if self.eplb_config.recovery_enabled and self._should_trigger_recovery():
        failed_ranks = self._hard_block_and_prepare_recovery()
        return  # Skip normal rebalancing
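
_hard_block_and_prepare_recovery() is referenced above but not shown. Below is a minimal sketch under the same assumptions; the RecoveryRequest shape and the exact blocking mechanism are illustrative, not the final design:

from dataclasses import dataclass

@dataclass
class RecoveryRequest:
    failed_ranks: list[int]

# EplbState
def _hard_block_and_prepare_recovery(self) -> list[int]:
    """Identify failed ranks and queue a recovery request for the coordinator."""
    # Ranks whose failures persisted past the threshold (the ratio-based
    # trigger would be OR-ed in the same way).
    failed = (
        self.rank_failure_duration >= self.eplb_config.failure_persistence_threshold
    )
    failed_ranks = torch.nonzero(failed).flatten().tolist()

    # Hard-block: mark all experts on these ranks unhealthy so no new tokens
    # are routed to them (mechanism illustrative).
    self.expert_health_mask[failed_ranks] = False

    # Hand off to AsyncLLM, which polls this via GPUWorker.get_recovery_trigger().
    self.pending_recovery_request = RecoveryRequest(failed_ranks=failed_ranks)
    return failed_ranks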

Recovery

class AsyncLLM:
    def __init__(self, ...):
        # ... existing init ...
        self.orchestrator_webhook_url = orchestrator_webhook_url
        self.last_recovery_check = time.time()
    
    async def _check_and_notify_recovery(self) -> None:
        """
        Poll workers for recovery triggers and notify orchestrator.
        
        Instead of executing recovery internally, emit event to
        external orchestrator (Dynamo, etc.)
        """
        # Poll all workers via collective_rpc
        recovery_triggers = await self.collective_rpc(
            method="get_recovery_trigger",  # ← Calls GPUWorker.get_recovery_trigger()
            timeout=0.5
        )
                
        # Emit a recovery event if any worker flagged failed ranks
        all_failed_ranks = set()
        for failed_ranks in recovery_triggers:
            if failed_ranks is not None:
                all_failed_ranks.update(failed_ranks)
        if all_failed_ranks:
            await self._emit_recovery_event(sorted(all_failed_ranks))
    
    async def _emit_recovery_event(self, failed_ranks: list[int]) -> None:
        """
        Notify external orchestrator that recovery is needed.
        
        Orchestrator will then:
        1. Call POST /v1/admin/drain
        2. Wait for drain completion
        3. Scale down failed pods/workers
        4. Call POST /v1/admin/prepare_scale
        5. Launch new pods/workers
        6. Call POST /v1/admin/apply_scale
        """
        event = {
            "event_type": "recovery_required",
            "timestamp": datetime.utcnow().isoformat(),
            "failed_ranks": failed_ranks,
            "current_world_size": self.world_size,
            # ... additional details ...
        }
        
        if self.orchestrator_webhook_url:
            await self._send_webhook(self.orchestrator_webhook_url, event)
        
        logger.warning(
            f"Recovery required for ranks {failed_ranks}. "
            f"Waiting for orchestrator to handle scaling."
        )
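
The worker-side hook polled via collective_rpc above is not shown here. A minimal sketch follows; the attribute path to the EPLB state and the one-shot clear semantics are assumptions:

class GPUWorker:
    # ... existing worker implementation ...

    def get_recovery_trigger(self) -> list[int] | None:
        """Return failed rank ids if this worker's EPLB state requested recovery."""
        eplb_state = self.model_runner.eplb_state  # attribute path assumed
        request = eplb_state.pending_recovery_request
        if request is None:
            return None
        # One-shot semantics: clear the flag so the same request is not re-emitted.
        eplb_state.pending_recovery_request = None
        return request.failed_ranks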

Exposed APIs for Orchestrators

  1. Health Monitoring
    • GET /v1/health/experts - Detailed per-rank health status
  2. Recovery Events
    • Webhook callbacks to configured orchestrator endpoint
    • Event types: recovery_required, rank_failed, rank_recovered
  3. Scaling Control (see the example request below)
    • Modify POST /scale_elastic_ep to accept a ranks_to_remove parameter (for explicit rank selection)
    • Add a rank_endpoints parameter (for non-Ray backends)
    • Keep new_data_parallel_size for backward compatibility
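
For illustration, a hedged example of how an orchestrator might call the extended scaling endpoint and the health API; the base URL, timeouts, and response handling are placeholders rather than a finalized schema:

import requests

BASE_URL = "http://vllm-host:8000"  # placeholder

# Cleanup step with rank 0 masked out of 4 ranks: remove it explicitly (4 -> 3).
resp = requests.post(
    f"{BASE_URL}/scale_elastic_ep",
    json={
        "new_data_parallel_size": 3,   # kept for backward compatibility
        "ranks_to_remove": [0],        # new: explicit rank selection
    },
    timeout=600,
)
resp.raise_for_status()

# Per-rank expert health, e.g. for the orchestrator's monitoring loop.
health = requests.get(f"{BASE_URL}/v1/health/experts", timeout=5).json()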

Enforced Clean-Up Policy

Phase 2 enforces strict constraints when unhealthy ranks exist to maintain clean state transitions:
Constraint 1: Scale-Down Must Remove ALL Masked Ranks
When unhealthy ranks are present, scale-down is restricted to exact removal:

# With rank 0 masked out of 4 ranks:
✅ Allowed: 4 → 3  (removes the masked rank)
❌ Blocked: 4 → 2  (must be exactly 3)
❌ Blocked: 4 → 1  (must be exactly 3)

Constraint 2: Scale-Up Blocked Until Cleanup
Scale-up is completely blocked when masked ranks exist:

# With rank 0 masked:
❌ Blocked: 4 → 5, 4 → 6, 4 → 8  (any scale-up)

✅ Required workflow:
1. Scale-down to remove masked ranks: 4 → 3
2. System clears masked_ranks after removal
3. Then scale-up allowed: 3 → 5, 3 → 6, etc.
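
A sketch of the validation that would enforce both constraints; the function name and error handling are illustrative:

def validate_scale_request(
    current_size: int,
    new_size: int,
    masked_ranks: set[int],
) -> None:
    """Reject scale requests that violate the clean-up policy."""
    if not masked_ranks:
        return  # No unhealthy ranks: any scale request is allowed.

    if new_size > current_size:
        raise ValueError(
            "Scale-up is blocked while masked ranks exist; "
            "scale down to remove them first."
        )

    expected = current_size - len(masked_ranks)
    if new_size != expected:
        raise ValueError(
            f"Scale-down must remove exactly the masked ranks: "
            f"expected new size {expected}, got {new_size}."
        )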

Rationale

  1. Clear State Semantics
    • One operation = one purpose: cleanup (remove masked ranks) OR scaling (capacity adjustment)
    • Prevents ambiguity: "Should we remove the masked rank AND a healthy rank? Which one?"
  2. Avoid State Accumulation
    • Without constraints, masked ranks persist indefinitely through multiple scale operations
    • Forces the system back to a clean state before further operations
  3. Simplified Implementation
    • No complex tracking of masked ranks through multiple scale events
    • Clear tensor shape transitions: masking keeps size, scaling changes size (never both)
  4. Predictable Recovery
    • Orchestrator workflow is explicit: Detect → Cleanup → Restore
    • Each step is atomic and verifiable

Q&A

Q: Why is downtime unavoidable for now?
A: Downtime is required due to PyTorch/NCCL architectural constraints: vLLM must destroy and recreate communication groups during scaling because PyTorch provides no API to dynamically update group membership. During cleanup_dist_env_and_memory(), all NCCL/NVSHMEM groups are destroyed, making inter-rank communication impossible. Also, DeepEP's NVSHMEM barriers (e.g., nvshmemx_barrier_all_block()) require ALL ranks to participate, so a dead rank in the group would cause hangs. Therefore, traffic must be drained before scaling begins via wait_for_requests_to_drain().

Q: Why don't we need hard-blocking during recovery?
A: Since drain is mandatory, explicit hard-blocking is unnecessary: Phase 0-1's weight penalty already reduces traffic to failed ranks during the degraded period. When recovery triggers, wait_for_requests_to_drain() naturally finishes any remaining in-flight requests on all ranks (including minimal traffic still reaching failed ranks). By the time group destruction begins, no requests are active. Thus, Phase 2 only needs to: (1) detect persistent failures, (2) trigger scale_elastic_ep() twice, (3) let the existing drain mechanism handle the rest. Hard-blocking would add complexity without benefit since the drain already guarantees zero active requests during the scaling operation.
Total downtime: drain + 2× scaling + CUDA graph recapture
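
For reference, a hedged sketch of the orchestrator-side handler for the workflow listed in the webhook docstring above; the admin endpoint paths come from this RFC, while the pod-lifecycle helpers are placeholders for Dynamo/K8s/Ray-specific logic:

import requests

BASE_URL = "http://vllm-host:8000"  # placeholder

def remove_failed_workers(ranks: list[int]) -> None:
    """Placeholder: delete the pods/workers hosting these ranks."""
    ...

def launch_replacement_workers(count: int) -> None:
    """Placeholder: provision replacement pods/workers."""
    ...

def handle_recovery_event(event: dict) -> None:
    """Handle a recovery_required webhook emitted by vLLM (sketch)."""
    failed_ranks = event["failed_ranks"]

    # 1-2. Drain and wait until no requests are in flight.
    requests.post(f"{BASE_URL}/v1/admin/drain", timeout=600).raise_for_status()

    # 3. Scale down / remove the failed pods or workers
    #    (e.g. POST /scale_elastic_ep with ranks_to_remove, or delete the pods).
    remove_failed_workers(failed_ranks)

    # 4-5. Prepare the scale-up, then launch replacement pods or workers.
    requests.post(f"{BASE_URL}/v1/admin/prepare_scale", timeout=60).raise_for_status()
    launch_replacement_workers(len(failed_ranks))

    # 6. Apply the scale-up once the new workers are reachable.
    requests.post(f"{BASE_URL}/v1/admin/apply_scale", timeout=600).raise_for_status()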

Q: Why expose APIs instead of doing internal recovery?
A: This design enables deployment flexibility and cost control:

  1. Multi-orchestrator support: Works with Dynamo, K8s, Ray, custom frameworks
  2. Policy control: Orchestrator decides when/how to scale based on:
    • Cost considerations: Replacement nodes may be more expensive
    • SLA requirements: Can delay recovery during low-traffic periods
    • Capacity planning: Coordinate with broader cluster autoscaling policies
  3. Infrastructure-aware decisions: Orchestrator has topology visibility that vLLM lacks:
    • Network topology: Replacement workers may be on different racks/zones with higher latency
    • GPU availability: New nodes might have different GPU types (A100 → H100)
    • Placement constraints: Affinity rules, availability zones, rack diversity
    • Performance impact: Customer can evaluate if degraded mode is preferable to suboptimal topology
  4. Stay-in-degraded mode option: Customer can explicitly choose NOT to recover if:
    • Replacement topology would hurt performance (cross-rack communication)
    • Recovery downtime isn't acceptable for current workload
    • Waiting for better placement options (preferred GPU type, same rack)
    • Cost of replacement exceeds value of restored capacity
  5. Separation of concerns:
    • vLLM focuses on detection and graceful degradation
    • Orchestrator handles lifecycle, placement, and cost optimization

Feedback Period.

No response

CC List.

@ruisearch42 @pavanimajety @GuanLuo @benchislett @xinli-sw

Any Other Things.

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
