Skip to content

Conversation

@yx9o
Copy link
Contributor

@yx9o yx9o commented Dec 5, 2025

The failover protection period parameter can be dynamically configured. This time can be adjusted by setting environment variables without requiring code modification and recompilation.

Copilot AI review requested due to automatic review settings December 5, 2025 15:05
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request replaces a hardcoded failover protection period with a configurable environment variable, enabling dynamic adjustment of the timeout without code modification. The main change allows operators to control when a node should be eligible for failover after entering a new epoch.

  • Replaced TimeUnit.MINUTES.toMillis(1) with Systems.getEnvLong() for dynamic configuration
  • Removed unused TimeUnit import and added Systems utility import
  • Default value remains 60000ms (1 minute) when environment variable is not set

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +44 to 45
private static final long DONT_FAILOVER_AFTER_NEW_EPOCH_MS = Systems.getEnvLong("AUTOMQ_CONTROLLER_DONT_FAILOVER_AFTER_NEW_EPOCH_MS", 60L * 1000);
private final int id;
Copy link

Copilot AI Dec 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Systems.getEnvLong method can throw NumberFormatException if the environment variable AUTOMQ_CONTROLLER_DONT_FAILOVER_AFTER_NEW_EPOCH_MS is set to an invalid value (e.g., non-numeric string). This would cause the class to fail during static initialization, potentially crashing the application at startup. Consider adding validation or catching the exception with a fallback to the default value, or document the expected format clearly.

Suggested change
private static final long DONT_FAILOVER_AFTER_NEW_EPOCH_MS = Systems.getEnvLong("AUTOMQ_CONTROLLER_DONT_FAILOVER_AFTER_NEW_EPOCH_MS", 60L * 1000);
private final int id;
private static final long DONT_FAILOVER_AFTER_NEW_EPOCH_MS = getDontFailoverAfterNewEpochMs();
private final int id;
private static long getDontFailoverAfterNewEpochMs() {
final long defaultValue = 60L * 1000;
try {
return Systems.getEnvLong("AUTOMQ_CONTROLLER_DONT_FAILOVER_AFTER_NEW_EPOCH_MS", defaultValue);
} catch (NumberFormatException e) {
// Optionally log the error here if a logger is available
return defaultValue;
}
}

Copilot uses AI. Check for mistakes.
*
* @see ClusterControlManager#getNextNodeId()
*/
private static final int MAX_CONTROLLER_ID = 1000 - 1;
Copy link

Copilot AI Dec 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new environment variable AUTOMQ_CONTROLLER_DONT_FAILOVER_AFTER_NEW_EPOCH_MS should be documented with a comment explaining its purpose, acceptable values (non-negative long in milliseconds), and the default value (60000ms = 1 minute). This helps operators understand how to configure this parameter.

Suggested change
private static final int MAX_CONTROLLER_ID = 1000 - 1;
private static final int MAX_CONTROLLER_ID = 1000 - 1;
/**
* The minimum time (in milliseconds) after a node starts a new epoch before failover is allowed.
* Controlled by the environment variable {@code AUTOMQ_CONTROLLER_DONT_FAILOVER_AFTER_NEW_EPOCH_MS}.
* <p>
* Purpose: Prevents failover immediately after a node restarts, allowing time for recovery.
* Acceptable values: non-negative long (milliseconds).
* Default value: 60000ms (1 minute).
*/

Copilot uses AI. Check for mistakes.
*/
private static final int MAX_CONTROLLER_ID = 1000 - 1;
private static final long DONT_FAILOVER_AFTER_NEW_EPOCH_MS = TimeUnit.MINUTES.toMillis(1);
private static final long DONT_FAILOVER_AFTER_NEW_EPOCH_MS = Systems.getEnvLong("AUTOMQ_CONTROLLER_DONT_FAILOVER_AFTER_NEW_EPOCH_MS", 60L * 1000);
Copy link

Copilot AI Dec 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The behavior of DONT_FAILOVER_AFTER_NEW_EPOCH_MS now depends on an environment variable, but there are no tests verifying that shouldFailover() works correctly with different values of this parameter. Consider adding tests that validate the failover logic with various timeout values to ensure the environment variable configuration works as expected.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant