refactor: add healthcheck manager to decouple upstream #12426


Open
wants to merge 47 commits into base: master

Conversation

Contributor

@Revolyssup Revolyssup commented Jul 14, 2025

Description

The current implementation of the health checker has a few problems:

  1. After a checker is created, it is stored on the object table it belongs to, as a checker attribute. However, that associated object is frequently replaced by actions such as DNS resolution, service discovery, and configuration file merging. Correctly updating the associated object while properly storing the checker object is error-prone and relatively complex.
  2. The parent field adds complexity across the source code and has to be maintained carefully, even though it is only used for health checks.
  3. The lifecycle of the checker is not strongly consistent with that of the upstream object.

Solution

  1. The relationship between an upstream and its checker has been changed from a strong binding to an index reference, using the resource_path and resource_version fields respectively. These references are managed by a separate healthcheck_manager module.
  2. A timer asynchronously creates checkers from a waiting pool. Requests no longer create health checkers directly, so the health checker lifecycle is decoupled from requests (see the sketch after this list).
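
A minimal sketch of this pattern is shown below. The function, field, and option names are illustrative assumptions, not the exact ones in the PR; only the general shape (waiting pool filled on the request path, checkers created by a recurring timer) follows the description above.

```lua
-- sketch of a healthcheck manager: requests only record an index reference,
-- a timer later turns waiting entries into real resty.healthcheck checkers
local healthcheck = require("resty.healthcheck")
local timer_every = ngx.timer.every

local waiting_pool = {}  -- resource_path -> resource_version
local working_pool = {}  -- resource_path -> { version = ..., checker = ... }

local _M = {}

-- request path: cheap lookup, never creates a checker synchronously
function _M.fetch_checker(resource_path, resource_version)
    local item = working_pool[resource_path]
    if item and item.version == resource_version then
        return item.checker
    end
    -- not created yet (or stale): record the reference for the timer
    waiting_pool[resource_path] = resource_version
    return nil
end

-- timer path: create checkers outside of any request
local function create_checkers(premature)
    if premature then
        return
    end

    for path, version in pairs(waiting_pool) do
        waiting_pool[path] = nil

        local checker, err = healthcheck.new({
            name = path,
            shm_name = "upstream-healthcheck",   -- shared dict name is an assumption
            checks = { active = { type = "http", http_path = "/" } },
        })
        if not checker then
            ngx.log(ngx.ERR, "failed to create checker for ", path, ": ", err)
        else
            -- real code would also add the upstream nodes as targets here
            working_pool[path] = { version = version, checker = checker }
        end
    end
end

function _M.init_worker()
    -- 1s interval is an assumption; it matches the 1 second wait added to tests
    timer_every(1, create_checkers)
end

return _M
```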

NOTE: Many tests have been modified to either add a sleep or send an extra request before the actual request, for the following reasons:

  1. The checker is now created asynchronously, so a wait of at least 1 second is required to confirm that it was created.
  2. The first request only puts the checker on the waiting pool. Even with active health checks, the first request never runs the health check itself; the timer picks the entry up and creates the checker later.

Benchmark results confirming no anomalies

Below are the results of manual testing done to verify that there are no anomalies in CPU and memory usage.
Relevant ENV variables:
worker_process: 1
ADMIN_KEY="HaVUyrUtJtpuDuRzHipGziJxdbxOkeqm"
APISIX_ADMIN_URL="http://127.0.0.1:9180"
APISIX_DATA_PLANE="http://127.0.0.1:9080"
BACKEND_PORT=8080 # A go server is run locally for upstream nodes
UPSTREAM_NAME="stress_upstream"
ROUTE_NAME="stress_route"
N_NODES=200
1.1 Create resources with 50% failing nodes

    NODES="["
    for i in $(seq 1 $((N_NODES/2))); do
        NODES+="{\"host\":\"127.0.0.1\",\"port\":$BACKEND_PORT,\"weight\":1},"
    done
    for i in $(seq 1 $((N_NODES/2))); do
        NODES+="{\"host\":\"127.0.0.1\",\"port\":1,\"weight\":1},"
    done
    NODES="${NODES%,}]"

    # Create upstream
    curl -s -X PUT "$APISIX_ADMIN_URL/apisix/admin/upstreams/1" \
        -H "X-API-KEY: $ADMIN_KEY" \
        -d "{
            \"name\": \"$UPSTREAM_NAME\",
            \"type\": \"roundrobin\",
            \"nodes\": $NODES,
            \"retries\": 2
        }"

    # Create route
    curl -s -X PUT "$APISIX_ADMIN_URL/apisix/admin/routes/1" \
        -H "X-API-KEY: $ADMIN_KEY" \
        -d "{
            \"name\": \"$ROUTE_NAME\",
            \"uri\": \"/*\",
            \"upstream_id\": \"1\"
        }"
    
    echo "Created upstream with $N_NODES nodes (50% failing)"

1.2 Run wrk and pidstat to calculate CPU usage - Baseline case
Wrk results:


 wrk -c 100 -t 5 -d 60s -R 900 http://localhost:9080
Running 1m test @ http://localhost:9080
  5 threads and 100 connections
  Thread calibration: mean lat.: 2.114ms, rate sampling interval: 10ms
  Thread calibration: mean lat.: 2.135ms, rate sampling interval: 10ms
  Thread calibration: mean lat.: 2.140ms, rate sampling interval: 10ms
  Thread calibration: mean lat.: 2.122ms, rate sampling interval: 10ms
  Thread calibration: mean lat.: 2.147ms, rate sampling interval: 10ms
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.10ms  735.04us  13.99ms   70.86%
    Req/Sec   189.76     70.81   444.00     69.94%
  54005 requests in 1.00m, 22.71MB read
  Non-2xx or 3xx responses: 54005
Requests/sec:    900.04
Transfer/sec:    387.62KB

CPU and Memory usage results
 pidstat -ur -p $pid 2 30 #sample 30 times - every 2 seconds
 
 #Output after sampling for 60 seconds
 
Average:      UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
Average:     1000   1089374   17.15   13.73    0.00    0.10   30.88     -  openresty

Average:      UID       PID  minflt/s  majflt/s     VSZ     RSS   %MEM  Command
Average:     1000   1089374      5.00      0.00  498812   77144   0.48  openresty

RESULT:
The average CPU% is 30.88% and MEM% is 0.48%

Enable health checks on the upstream:

    local health_config='{
        "active": {
            "type": "http",
            "http_path": "/",
            "healthy": {"interval": 1, "successes": 1},
            "unhealthy": {"interval": 1, "http_failures": 1}
        }
    }'
    curl -s -X PATCH "$APISIX_ADMIN_URL/apisix/admin/upstreams/1" \
        -H "X-API-KEY: $ADMIN_KEY" \
        -d "{\"checks\": $health_config}"
    echo "Enabled health checks"

Test with health checks enabled + routes updated in the background for 24 hrs

Wrk results:

 wrk -c 100 -t 5 -d 86400s -R 900 http://localhost:9080

Running 1440m test @ http://localhost:9080
  5 threads and 100 connections
  Thread calibration: mean lat.: 1.322ms, rate sampling interval: 10ms
  Thread calibration: mean lat.: 1.314ms, rate sampling interval: 10ms
  Thread calibration: mean lat.: 1.302ms, rate sampling interval: 10ms
  Thread calibration: mean lat.: 1.307ms, rate sampling interval: 10ms
  Thread calibration: mean lat.: 1.294ms, rate sampling interval: 10ms
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     9.01ms  562.23ms   1.00m    99.97%
    Req/Sec   188.85     68.87     2.33k    73.45%
  77336041 requests in 1440.00m, 13.33GB read
  Socket errors: connect 0, read 0, write 0, timeout 23542
  Non-2xx or 3xx responses: 15186209
Requests/sec:    895.09
Transfer/sec:    161.72KB
CPU and Memory usage results
 pidstat -ur -p $pid 3600 24 # sample 24 times - every hour
Linux 6.14.4-arch1-2 (ashish-82yl)      24/07/25        _x86_64_        (16 CPU)

05:16:54 PM IST   UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
06:16:54 PM IST  1000    142261   10.77    5.80    0.00    0.10   16.57     1  openresty

05:16:54 PM IST   UID       PID  minflt/s  majflt/s     VSZ     RSS   %MEM  Command
06:16:54 PM IST  1000    142261      9.35      0.00  409232   42680   0.27  openresty

...

04:16:54 PM IST   UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
05:16:54 PM IST  1000    142261    3.71    1.81    0.00    0.02    5.51    14  openresty

04:16:54 PM IST   UID       PID  minflt/s  majflt/s     VSZ     RSS   %MEM  Command
05:16:54 PM IST  1000    142261      0.26      0.00  415664   50452   0.31  openresty


...

Average:      UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
Average:     1000    142261    7.10    3.86    0.00    0.11   10.96     -  openresty

Average:      UID       PID  minflt/s  majflt/s     VSZ     RSS   %MEM  Command
Average:     1000    142261      7.71      0.00  411107   44214   0.27  openresty

RESULT:
After the 1st hour the CPU usage drops to 16.57% while memory usage stays at 0.27%.
Towards the end, the CPU usage drops to 5.51% while memory usage increases to just 0.31%.
The average memory usage was 0.27%, and the increase in memory usage over 24 hrs was 0.04%.

Checklist

  • I have explained the need for this PR and the problem it solves
  • I have explained the changes or the new features added to this PR
  • I have added tests corresponding to this change
  • I have updated the documentation to reflect this change
  • I have verified that this change is backward compatible (If not, please discuss on the APISIX mailing list first)

@Revolyssup Revolyssup marked this pull request as draft July 14, 2025 15:18
@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. enhancement New feature or request labels Jul 14, 2025
@Revolyssup Revolyssup marked this pull request as ready for review July 16, 2025 12:15
@dosubot dosubot bot added size:XL This PR changes 500-999 lines, ignoring generated files. and removed size:L This PR changes 100-499 lines, ignoring generated files. labels Jul 16, 2025
@Revolyssup Revolyssup marked this pull request as draft July 16, 2025 20:27
@moonming moonming requested a review from Copilot July 21, 2025 08:31

@Copilot Copilot AI left a comment


Pull Request Overview

This PR introduces a healthcheck manager to decouple upstream objects from health checkers. Previously, health checkers were tightly coupled to upstream objects and stored within them, creating complexity with lifecycle management and DNS/service discovery updates. The new approach separates the lifecycle of health checkers from upstream objects by managing them through resource paths and versions in a dedicated healthcheck_manager module that runs asynchronously.

Key changes include:

  • Creation of a new healthcheck_manager.lua module that manages health checkers independently from upstream objects
  • Replacement of the parent field relationship with resource_key and resource_version index references
  • Asynchronous health checker creation using timers instead of direct creation during requests

Reviewed Changes

Copilot reviewed 25 out of 25 changed files in this pull request and generated 5 comments.

Summary per file:

  • apisix/healthcheck_manager.lua: New module implementing asynchronous health checker management with working and waiting pools
  • apisix/upstream.lua: Refactored to use healthcheck_manager instead of direct checker management; replaced parent references
  • apisix/plugin.lua: Removed shallow copy exceptions for upstream.parent references
  • apisix/init.lua: Updated DNS resolution logic to increment _nodes_ver and removed shallow copy exceptions
  • apisix/balancer.lua: Updated to use healthcheck_manager for fetching node status
  • apisix/control/v1.lua: Modified to work with the new healthcheck_manager API
  • test files: Updated with additional sleep delays and extra requests to accommodate asynchronous checker creation

-- if a checker exists then delete it before creating a new one
local existing_checker = _M.working_pool[resource_path]
if existing_checker then
    existing_checker.checker:delayed_clear(10)

Copilot AI Jul 21, 2025


The magic number 10 (seconds) for delayed_clear should be defined as a named constant to improve maintainability and make the cleanup timeout configurable.

Suggested change
existing_checker.checker:delayed_clear(10)
existing_checker.checker:delayed_clear(DELAYED_CLEAR_TIMEOUT)
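
For context, the named constant the suggestion refers to would just be a module-level local near the top of healthcheck_manager.lua, along these lines (the value 10 comes from the existing code):

```lua
-- seconds to keep a stale checker around before it is cleared
local DELAYED_CLEAR_TIMEOUT = 10
```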



--- remove from working pool if resource doesn't exist
local res_conf = fetch_latest_conf(resource_path)
if not res_conf then
    item.checker:delayed_clear(10)

Copilot AI Jul 21, 2025


The magic number 10 (seconds) for delayed_clear should be defined as a named constant to improve maintainability and make the cleanup timeout configurable.

Suggested change
item.checker:delayed_clear(10)
item.checker:delayed_clear(DELAYED_CLEAR_TIMEOUT)


core.log.info("checking working pool for resource: ", resource_path,
              " current version: ", current_ver, " item version: ", item.version)
if item.version ~= current_ver then
    item.checker:delayed_clear(10)

Copilot AI Jul 21, 2025


The magic number 10 (seconds) for delayed_clear should be defined as a named constant to improve maintainability and make the cleanup timeout configurable.

Suggested change
item.checker:delayed_clear(10)
item.checker:delayed_clear(DELAYED_CLEAR_TIMEOUT)


nic-6443 previously approved these changes Jul 22, 2025
@nic-6443 nic-6443 requested a review from membphis July 23, 2025 01:31
@membphis membphis self-requested a review July 23, 2025 01:31
@nic-6443 nic-6443 requested a review from bzp2010 July 23, 2025 01:31
@@ -27,7 +27,7 @@ local get_last_failure = balancer.get_last_failure
local set_timeouts = balancer.set_timeouts
local ngx_now = ngx.now
local str_byte = string.byte

local healthcheck_manager = require("apisix.healthcheck_manager")
Member

(image attachment)

@@ -75,7 +75,8 @@ local function fetch_health_nodes(upstream, checker)
local port = upstream.checks and upstream.checks.active and upstream.checks.active.port
local up_nodes = core.table.new(0, #nodes)
for _, node in ipairs(nodes) do
local ok, err = checker:get_target_status(node.host, port or node.port, host)
local ok, err = healthcheck_manager.fetch_node_status(checker,
Member


we can cache healthcheck_manager.fetch_node_status, a short local function
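
One reading of this suggestion is the pattern already used at the top of balancer.lua: cache the module function in a local upvalue so the hot path avoids repeated table lookups. The argument order below is inferred from the diff and the old get_target_status call, so treat it as an assumption.

```lua
-- at module load time: cache the hot-path function as an upvalue
local healthcheck_manager = require("apisix.healthcheck_manager")
local fetch_node_status = healthcheck_manager.fetch_node_status

-- later, inside fetch_health_nodes(), call the local instead of the module field
local ok, err = fetch_node_status(checker, node.host, port or node.port, host)
```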

Contributor Author

@Revolyssup Revolyssup Jul 23, 2025


do you mean using the lrucache with checker as key?

local events = require("apisix.events")
local tab_clone = core.table.clone
local timer_every = ngx.timer.every
local _M = {
Member


it is not safe to export working_pool and waiting_pool.

We must never allow working_pool or waiting_pool to be destroyed from outside.

They should only be used inside this Lua module.
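
A sketch of what keeping the pools private could look like; working_pool and waiting_pool are the names from this comment, while the accessor is a hypothetical example of exposing read-only information without exporting the tables themselves.

```lua
-- pools as module-local upvalues: other modules can only reach them through
-- functions exported on _M, so the tables can never be replaced or cleared
local working_pool = {}
local waiting_pool = {}

local _M = {}

-- hypothetical read-only accessor for debugging/metrics
function _M.working_pool_count()
    local count = 0
    for _ in pairs(working_pool) do
        count = count + 1
    end
    return count
end

return _M
```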

@Revolyssup Revolyssup requested a review from nic-chen July 23, 2025 07:30
@membphis
Member

it looks good to me now, but we should wait a while longer for more comments, since this is a big change

BTW, the PR title should be refactor ^_^

Member

@membphis membphis left a comment


LGTM

@Revolyssup Revolyssup changed the title feat: add healthcheck manager to decouple upstream refactor: add healthcheck manager to decouple upstream Jul 24, 2025
@nic-chen
Member

Most of our tests are for scenarios in which the config_provider is etcd. Could you add health check tests for scenarios in which the config_provider is yaml or json?

@Revolyssup
Contributor Author

Most of our tests are for scenarios in which the config_provider is etcd. Could you add health check tests for scenarios in which the config_provider is yaml or json?

A lot of existing tests here already use yaml config_provider. See https://github.com/apache/apisix/pull/12426/files#diff-d89c41bb4b1cc7c97936090edd39c05e1fcbc35c94039a16dbb392515d4f38d3R34
