Description
Before Creating the Enhancement Request
- I have confirmed that this should be classified as an enhancement rather than a bug/feature.
Summary
Add a snapshot backup and recovery mechanism for TimerWheel to ensure reliable recovery of timer message state after broker restarts. Currently, the discrete TimerWheel state cannot be fully recovered from TimerLog alone, leading to potential data inconsistency and performance issues during recovery.
Motivation
The current implementation has a critical limitation: TimerWheel state is not persisted, so the timer message scheduling state cannot be fully recovered after a broker restart. This creates several problems:
Data Recovery Issues: TimerWheel contains crucial scheduling information that cannot be reconstructed from TimerLog alone
Performance Impact: During recovery, the entire TimerWheel needs to be rebuilt from scratch, which is time-consuming and resource-intensive
Data Consistency: Without proper state recovery, some timer messages may be lost or incorrectly scheduled
Operational Reliability: Production environments require reliable recovery mechanisms to ensure business continuity
This enhancement is particularly important for:
High-availability deployments where broker restarts are common
Large-scale timer message processing scenarios
Production environments requiring consistent timer message delivery
Describe the Solution You'd Like
Implement a comprehensive snapshot mechanism for TimerWheel with the following components:
- Configuration Options
timerWheelSnapshotFlush: Enable/disable snapshot functionality (default: false for backward compatibility)
timerWheelDefaultFlush: Control default flush behavior (default: true)
timerWheelSnapshotIntervalMs: Configure snapshot creation interval (default: 10 seconds)
- Snapshot Management
Snapshot Creation: Periodically backup TimerWheel state to snapshot files
Snapshot Recovery: Support recovery from the latest snapshot file during initialization
Snapshot Cleanup: Automatically clean up expired snapshot files to manage disk space
Atomic Operations: Ensure snapshot operations are atomic to prevent corruption
- Implementation Details
Add backup() method to TimerWheel for creating snapshots
Modify TimerWheel constructor to support recovery from snapshots
Implement snapshot file selection logic based on flush positions
Add synchronization locks to ensure flush operations are atomic
Integrate snapshot functionality into TimerFlushService
- Backward Compatibility
Default configuration maintains existing behavior
Existing deployments continue to work without changes
Gradual migration path for enabling snapshot functionality
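To make the snapshot management points above more concrete, here is a minimal sketch of atomic snapshot creation, selection by flush position, and cleanup. It is not the proposed implementation. The class name `TimerWheelSnapshots`, the file naming scheme `timerwheel.<flushPos>.snapshot`, the `keep` parameter, and the rule "pick the newest snapshot whose flush position does not exceed the position recovered from TimerLog" are all assumptions made for illustration. The TimerWheel contents are treated as an opaque byte array.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of RocketMQ.
final class TimerWheelSnapshots {
    // Assumed naming scheme: timerwheel.<flushPos>.snapshot
    private static final String PREFIX = "timerwheel.";
    private static final String SUFFIX = ".snapshot";

    // Writes to a temp file, forces it to disk, then renames it atomically,
    // so readers never observe a partially written snapshot.
    static Path write(Path dir, long flushPos, byte[] wheelBytes) throws IOException {
        Files.createDirectories(dir);
        Path tmp = Files.createTempFile(dir, PREFIX, ".tmp");
        try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.WRITE)) {
            ByteBuffer buf = ByteBuffer.wrap(wheelBytes);
            while (buf.hasRemaining()) {
                ch.write(buf);
            }
            ch.force(true);
        }
        Path target = dir.resolve(PREFIX + flushPos + SUFFIX);
        return Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
    }

    // Returns the flush position encoded in the file name, or -1 if the
    // file is not a snapshot (for example a leftover temp file).
    static long flushPosOf(Path file) {
        String name = file.getFileName().toString();
        if (name.length() < PREFIX.length() + SUFFIX.length()
                || !name.startsWith(PREFIX) || !name.endsWith(SUFFIX)) {
            return -1L;
        }
        try {
            return Long.parseLong(
                name.substring(PREFIX.length(), name.length() - SUFFIX.length()));
        } catch (NumberFormatException e) {
            return -1L;
        }
    }

    // Assumed selection rule: newest snapshot with flushPos <= maxFlushPos.
    static Optional<Path> selectForRecovery(Path dir, long maxFlushPos) throws IOException {
        try (Stream<Path> files = Files.list(dir)) {
            return files
                .filter(p -> {
                    long pos = flushPosOf(p);
                    return pos >= 0 && pos <= maxFlushPos;
                })
                .max(Comparator.comparingLong(TimerWheelSnapshots::flushPosOf));
        }
    }

    // Deletes all but the `keep` newest snapshots; returns the number deleted.
    static int cleanExpired(Path dir, int keep) throws IOException {
        List<Path> oldestFirst;
        try (Stream<Path> files = Files.list(dir)) {
            oldestFirst = files
                .filter(p -> flushPosOf(p) >= 0)
                .sorted(Comparator.comparingLong(TimerWheelSnapshots::flushPosOf))
                .collect(Collectors.toList());
        }
        int deleted = 0;
        int expired = Math.max(0, oldestFirst.size() - keep);
        for (Path p : oldestFirst.subList(0, expired)) {
            if (Files.deleteIfExists(p)) {
                deleted++;
            }
        }
        return deleted;
    }
}
```

In this sketch, a periodic task (for example inside TimerFlushService, at the configured snapshot interval) would call `write` followed by `cleanExpired`, and broker startup would call `selectForRecovery` before replaying TimerLog from the selected position. Encoding the flush position in the file name keeps selection cheap, since no snapshot has to be opened to decide which one to load.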
Describe Alternatives You've Considered
Alternative 1: Enhanced TimerLog Recovery
Approach: Improve TimerLog to include all necessary TimerWheel state information
Why Rejected: This would require significant changes to the existing TimerLog format and could break backward compatibility. It would also increase the complexity of the logging mechanism.
Alternative 2: In-Memory State Persistence
Approach: Persist TimerWheel state to a separate state file during normal operations
Why Rejected: This approach would require frequent disk I/O operations during normal message processing, which could significantly impact performance.
Alternative 3: Database-Based State Storage
Approach: Store TimerWheel state in an external database
Why Rejected: This would introduce external dependencies and complexity, making the system harder to deploy and maintain. It would also add network latency and potential failure points.
Alternative 4: Checkpoint-Based Recovery
Approach: Use existing checkpoint mechanism to store TimerWheel state
Why Rejected: The current checkpoint mechanism is not designed for complex data structures like TimerWheel, and modifying it would affect other parts of the system.
The proposed snapshot mechanism provides the best balance of:
Performance: Minimal impact on normal operations
Reliability: Atomic operations ensure data consistency
Compatibility: Backward compatible with existing deployments
Simplicity: Clean separation of concerns without affecting existing code paths
Additional Context
No response