refactor: add healthcheck manager to decouple upstream #12426
base: master
Conversation
Pull Request Overview
This PR introduces a healthcheck manager to decouple upstream objects from health checkers. Previously, health checkers were tightly coupled to upstream objects and stored within them, creating complexity with lifecycle management and DNS/service discovery updates. The new approach separates the lifecycle of health checkers from upstream objects by managing them through resource paths and versions in a dedicated healthcheck_manager module that runs asynchronously.
Key changes include:
- Creation of a new `healthcheck_manager.lua` module that manages health checkers independently from upstream objects
- Replacement of the `parent` field relationship with `resource_key` and `resource_version` index references
- Asynchronous health checker creation using timers instead of direct creation during requests
Reviewed Changes
Copilot reviewed 25 out of 25 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| apisix/healthcheck_manager.lua | New module implementing asynchronous health checker management with working and waiting pools |
| apisix/upstream.lua | Refactored to use healthcheck_manager instead of direct checker management, replaced parent references |
| apisix/plugin.lua | Removed shallow copy exceptions for upstream.parent references |
| apisix/init.lua | Updated DNS resolution logic to increment _nodes_ver and removed shallow copy exceptions |
| apisix/balancer.lua | Updated to use healthcheck_manager for fetching node status |
| apisix/control/v1.lua | Modified to work with new healthcheck_manager API |
| test files | Updated with additional sleep delays and extra requests to accommodate asynchronous checker creation |
```lua
-- if a checker exists then delete it before creating a new one
local existing_checker = _M.working_pool[resource_path]
if existing_checker then
    existing_checker.checker:delayed_clear(10)
```
The magic number 10 (seconds) for delayed_clear should be defined as a named constant to improve maintainability and make the cleanup timeout configurable.
```diff
-existing_checker.checker:delayed_clear(10)
+existing_checker.checker:delayed_clear(DELAYED_CLEAR_TIMEOUT)
```
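For instance, the constant could be defined once near the top of `healthcheck_manager.lua` (the name follows the suggestion; the value is the currently hard-coded timeout):

```lua
-- seconds to keep a replaced checker around before its state is cleared
local DELAYED_CLEAR_TIMEOUT = 10
```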
apisix/healthcheck_manager.lua
```lua
--- remove from working pool if resource doesn't exist
local res_conf = fetch_latest_conf(resource_path)
if not res_conf then
    item.checker:delayed_clear(10)
```
The magic number 10 (seconds) for delayed_clear should be defined as a named constant to improve maintainability and make the cleanup timeout configurable.
```diff
-item.checker:delayed_clear(10)
+item.checker:delayed_clear(DELAYED_CLEAR_TIMEOUT)
```
apisix/healthcheck_manager.lua
```lua
core.log.info("checking working pool for resource: ", resource_path,
              " current version: ", current_ver, " item version: ", item.version)
if item.version ~= current_ver then
    item.checker:delayed_clear(10)
```
The magic number 10 (seconds) for delayed_clear should be defined as a named constant to improve maintainability and make the cleanup timeout configurable.
```diff
-item.checker:delayed_clear(10)
+item.checker:delayed_clear(DELAYED_CLEAR_TIMEOUT)
```
apisix/balancer.lua
```diff
@@ -27,7 +27,7 @@ local get_last_failure = balancer.get_last_failure
 local set_timeouts = balancer.set_timeouts
 local ngx_now = ngx.now
 local str_byte = string.byte
+local healthcheck_manager = require("apisix.healthcheck_manager")
```
```diff
@@ -75,7 +75,8 @@ local function fetch_health_nodes(upstream, checker)
     local port = upstream.checks and upstream.checks.active and upstream.checks.active.port
     local up_nodes = core.table.new(0, #nodes)
     for _, node in ipairs(nodes) do
-        local ok, err = checker:get_target_status(node.host, port or node.port, host)
+        local ok, err = healthcheck_manager.fetch_node_status(checker,
```
We can cache `healthcheck_manager.fetch_node_status`, a short local function.
Do you mean using the lrucache with `checker` as key?
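If the intent is simply to avoid the repeated module-table lookup on the hot path, caching the function in a local at load time would already be enough. A sketch under that assumption (the argument list mirrors the old `get_target_status` call and is not confirmed by the diff):

```lua
-- at the top of apisix/balancer.lua, next to the other cached locals
local healthcheck_manager = require("apisix.healthcheck_manager")
local fetch_node_status = healthcheck_manager.fetch_node_status

-- inside fetch_health_nodes() the call would then become:
--   local ok, err = fetch_node_status(checker, node.host, port or node.port, host)
```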
apisix/healthcheck_manager.lua
```lua
local events = require("apisix.events")
local tab_clone = core.table.clone
local timer_every = ngx.timer.every
local _M = {
```
It is not safe to export `working_pool` and `waiting_pool`. We will never allow the `working_pool` or `waiting_pool` to be destroyed; they can only be used inside this Lua module.
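One way to follow this advice, sketched here as an assumption rather than the PR's final shape, is to keep both pools as module-local upvalues and expose only narrow accessor functions:

```lua
-- the pools are upvalues, so no other module can replace or destroy them
local working_pool = {}
local waiting_pool = {}

local _M = {}

-- read-only style accessor: callers get the checker, never the pool table
function _M.get_working_checker(resource_path)
    local item = working_pool[resource_path]
    return item and item.checker or nil
end

return _M
```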
It looks good to me now, but we have to wait for more time and more comments, since it is a big change. BTW, the PR title should be
LGTM
Most of our tests are for scenarios where the config_provider is etcd. Could you add health check tests for scenarios where the config_provider is yaml or json?
A lot of the existing tests here already use the yaml config_provider. See https://github.com/apache/apisix/pull/12426/files#diff-d89c41bb4b1cc7c97936090edd39c05e1fcbc35c94039a16dbb392515d4f38d3R34
Description
The current implementation of the health checker has a few problems.
- The `parent` field adds complexity across the source code and needs to be taken care of. This field is only used for health checks.

Solution
NOTE: A lot of tests have been modified to either add a sleep or add an extra request before the actual request, because health checkers are now created asynchronously by a timer and may not exist yet when the first request after a configuration change arrives.
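A rough sketch of the timer-driven creation this note refers to, assuming a periodic timer as described in the overview (the interval, helper names, and shm name are illustrative only, not the PR's exact code):

```lua
local healthcheck = require("resty.healthcheck")
local timer_every = ngx.timer.every

local waiting_pool = {}   -- resource_path -> version, filled on the request path
local working_pool = {}   -- resource_path -> { version = ..., checker = ... }

-- fetch_latest_conf is the PR's helper that looks up the current
-- configuration for a resource path; stubbed here for illustration
local function fetch_latest_conf(resource_path)
    return nil -- the real implementation reads from the config provider
end

-- runs in the background; nothing is created during the request itself,
-- so the very first request after a config change sees no checker yet
local function create_waiting_checkers(premature)
    if premature then
        return
    end

    for resource_path, version in pairs(waiting_pool) do
        local conf = fetch_latest_conf(resource_path)
        if conf and conf.value and conf.value.checks then
            local checker, err = healthcheck.new({
                name = resource_path,
                shm_name = "upstream-healthcheck",
                checks = conf.value.checks,
            })
            if checker then
                working_pool[resource_path] = { version = version, checker = checker }
            end
        end
        waiting_pool[resource_path] = nil
    end
end

timer_every(1, create_waiting_checkers)
```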
Benchmark results confirming no anomalies
Below are the results of manual testing done to verify that there are no variations in CPU and memory usage.
Relevant ENV variables:
```
worker_process: 1
ADMIN_KEY="HaVUyrUtJtpuDuRzHipGziJxdbxOkeqm"
APISIX_ADMIN_URL="http://127.0.0.1:9180"
APISIX_DATA_PLANE="http://127.0.0.1:9080"
BACKEND_PORT=8080        # A go server is run locally for upstream nodes
UPSTREAM_NAME="stress_upstream"
ROUTE_NAME="stress_route"
N_NODES=200
```
1.1 Create resources with 50% failing nodes
1.2 Run wrk and pidstat to calculate CPU usage - Baseline case
Wrk results:
RESULT:
The average CPU usage is 30.88% and memory usage is 0.48%.
Test with healthcheck enabled + updating routes in the background for 24 hrs
Wrk results:
RESULT:
After the 1st hour, CPU usage drops to 16.57% while memory usage remains at 0.27%.
Towards the end, CPU usage drops to 5.51% while memory usage increases to just 0.31%.
The average memory usage was 0.27% and the increase in memory usage over 24 hrs was 0.04%.
Checklist