[Enhancement] Add TimerWheel snapshot mechanism for reliable recovery #9735

@guyinyou

Before Creating the Enhancement Request

  • I have confirmed that this should be classified as an enhancement rather than a bug/feature.

Summary

Add a snapshot backup and recovery mechanism for TimerWheel to ensure reliable recovery of timer message state after broker restarts. Currently, the discrete TimerWheel state cannot be fully recovered from TimerLog alone, leading to potential data inconsistency and performance issues during recovery.

Motivation

The current implementation has a critical limitation where TimerWheel state is not persisted, making it impossible to fully recover the timer message scheduling state after a broker restart. This creates several problems:

  • Data Recovery Issues: TimerWheel contains crucial scheduling information that cannot be reconstructed from TimerLog alone
  • Performance Impact: During recovery, the entire TimerWheel needs to be rebuilt from scratch, which is time-consuming and resource-intensive
  • Data Consistency: Without proper state recovery, some timer messages may be lost or incorrectly scheduled
  • Operational Reliability: Production environments require reliable recovery mechanisms to ensure business continuity

This enhancement is particularly important for:

  • High-availability deployments where broker restarts are common
  • Large-scale timer message processing scenarios
  • Production environments requiring consistent timer message delivery

Describe the Solution You'd Like

Implement a comprehensive snapshot mechanism for TimerWheel with the following components:

  1. Configuration Options
    timerWheelSnapshotFlush: Enable/disable snapshot functionality (default: false for backward compatibility)
    timerWheelDefaultFlush: Control default flush behavior (default: true)
    timerWheelSnapshotIntervalMs: Configure snapshot creation interval in milliseconds (default: 10000, i.e. 10 seconds)
  2. Snapshot Management
    Snapshot Creation: Periodically backup TimerWheel state to snapshot files
    Snapshot Recovery: Support recovery from the latest snapshot file during initialization
    Snapshot Cleanup: Automatically clean up expired snapshot files to manage disk space
    Atomic Operations: Ensure snapshot operations are atomic to prevent corruption
  3. Implementation Details
    Add backup() method to TimerWheel for creating snapshots
    Modify TimerWheel constructor to support recovery from snapshots
    Implement snapshot file selection logic based on flush positions
    Add synchronization locks to ensure flush operations are atomic
    Integrate snapshot functionality into TimerFlushService
  4. Backward Compatibility
    Default configuration maintains existing behavior
    Existing deployments continue to work without changes
    Gradual migration path for enabling snapshot functionality
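
To make the snapshot lifecycle above more concrete, here is a minimal sketch of the three file-level operations (atomic creation, selection by flush position, cleanup). It is not the actual patch: the class name `TimerWheelSnapshotSketch`, the `timerwheel.<flushPos>` file naming, the `keep` parameter, and the rule "pick the newest snapshot whose position does not exceed a given limit" are all assumptions made for illustration. Atomicity is approximated by writing to a temporary file, forcing it to disk, and then renaming it.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of RocketMQ; names and file layout are assumptions.
final class TimerWheelSnapshotSketch {
    static final String PREFIX = "timerwheel.";   // assumed naming: timerwheel.<flushPos>
    static final String TMP_SUFFIX = ".tmp";

    // Snapshot creation: write to a temp file, force it to disk, then rename.
    // Readers only ever see a complete snapshot or none at all.
    static Path backup(Path dir, long flushPos, byte[] wheelBytes) throws IOException {
        Files.createDirectories(dir);
        Path tmp = dir.resolve(PREFIX + flushPos + TMP_SUFFIX);
        Path dst = dir.resolve(PREFIX + flushPos);
        try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.CREATE,
                StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
            ByteBuffer buf = ByteBuffer.wrap(wheelBytes);
            while (buf.hasRemaining()) {
                ch.write(buf);
            }
            ch.force(true);
        }
        return Files.move(tmp, dst, StandardCopyOption.ATOMIC_MOVE);
    }

    // Returns the flush position encoded in a snapshot file name, or -1 for
    // anything that is not a completed snapshot (for example leftover .tmp files).
    static long parsePos(Path p) {
        String name = p.getFileName().toString();
        if (!name.startsWith(PREFIX) || name.endsWith(TMP_SUFFIX)) {
            return -1L;
        }
        try {
            return Long.parseLong(name.substring(PREFIX.length()));
        } catch (NumberFormatException e) {
            return -1L;
        }
    }

    // Flush positions of all completed snapshots in dir, in ascending order.
    static List<Long> positions(Path dir) throws IOException {
        try (Stream<Path> files = Files.list(dir)) {
            return files.map(TimerWheelSnapshotSketch::parsePos)
                    .filter(pos -> pos >= 0)
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    // Snapshot recovery: choose the newest snapshot at or below maxFlushPos (assumed rule).
    static Optional<Path> selectLatest(Path dir, long maxFlushPos) throws IOException {
        return positions(dir).stream()
                .filter(pos -> pos <= maxFlushPos)
                .max(Long::compare)
                .map(pos -> dir.resolve(PREFIX + pos));
    }

    // Snapshot cleanup: delete everything except the newest `keep` snapshots.
    static void cleanup(Path dir, int keep) throws IOException {
        List<Long> all = positions(dir);
        for (int i = 0; i < all.size() - keep; i++) {
            Files.deleteIfExists(dir.resolve(PREFIX + all.get(i)));
        }
    }
}
```

In a real integration, something like TimerFlushService would presumably call `backup` every timerWheelSnapshotIntervalMs while holding the flush lock, and the TimerWheel constructor would call `selectLatest` with a position derived from the recovered TimerLog. How that position is obtained is left open here.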

Describe Alternatives You've Considered

Alternative 1: Enhanced TimerLog Recovery

  • Approach: Improve TimerLog to include all necessary TimerWheel state information
  • Why Rejected: This would require significant changes to the existing TimerLog format and could break backward compatibility. It would also increase the complexity of the logging mechanism.

Alternative 2: In-Memory State Persistence

  • Approach: Persist TimerWheel state to a separate state file during normal operations
  • Why Rejected: This approach would require frequent disk I/O operations during normal message processing, which could significantly impact performance.

Alternative 3: Database-Based State Storage

  • Approach: Store TimerWheel state in an external database
  • Why Rejected: This would introduce external dependencies and complexity, making the system harder to deploy and maintain. It would also add network latency and potential failure points.

Alternative 4: Checkpoint-Based Recovery

  • Approach: Use existing checkpoint mechanism to store TimerWheel state
  • Why Rejected: The current checkpoint mechanism is not designed for complex data structures like TimerWheel, and modifying it would affect other parts of the system.

The proposed snapshot mechanism provides the best balance of:

  • Performance: Minimal impact on normal operations
  • Reliability: Atomic operations ensure data consistency
  • Compatibility: Backward compatible with existing deployments
  • Simplicity: Clean separation of concerns without affecting existing code paths

Additional Context

No response
