refactor: add healthcheck manager to decouple upstream #12426


Open
wants to merge 47 commits into base: master

Conversation

Contributor

@Revolyssup Revolyssup commented Jul 14, 2025

Description

The current implementation of the health checker has a few problems:

  1. After a checker is created, it is stored on the object table it belongs to, as a checker attribute. However, that associated object is frequently replaced by actions such as DNS resolution, service discovery, and configuration file merging. Correctly updating the associated object while properly storing the checker object is error-prone and relatively complex.
  2. The parent field adds complexity across the source code and has to be maintained carefully, even though it is only used for health checks.
  3. The lifecycle of the checker is not strongly consistent with that of the upstream object.

Solution

  1. The relationship between an upstream and its checker has been changed from a strong binding to an index reference, using the resource_path and resource_version fields respectively. These references are managed by a separate healthcheck_manager module.
  2. A timer asynchronously creates checkers from a waiting pool. Requests no longer create health checkers directly, so the health checker lifecycle is decoupled from requests (see the sketch after this list).
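
A minimal sketch of this pattern is shown below. The function, field, and option names are illustrative assumptions, not the exact ones in the PR; only the general shape (waiting pool filled on the request path, checkers created by a recurring timer) follows the description above.

```lua
-- sketch of a healthcheck manager: requests only record an index reference,
-- a timer later turns waiting entries into real resty.healthcheck checkers
local healthcheck = require("resty.healthcheck")
local timer_every = ngx.timer.every

local waiting_pool = {}  -- resource_path -> resource_version
local working_pool = {}  -- resource_path -> { version = ..., checker = ... }

local _M = {}

-- request path: cheap lookup, never creates a checker synchronously
function _M.fetch_checker(resource_path, resource_version)
    local item = working_pool[resource_path]
    if item and item.version == resource_version then
        return item.checker
    end
    -- not created yet (or stale): record the reference for the timer
    waiting_pool[resource_path] = resource_version
    return nil
end

-- timer path: create checkers outside of any request
local function create_checkers(premature)
    if premature then
        return
    end

    for path, version in pairs(waiting_pool) do
        waiting_pool[path] = nil

        local checker, err = healthcheck.new({
            name = path,
            shm_name = "upstream-healthcheck",   -- shared dict name is an assumption
            checks = { active = { type = "http", http_path = "/" } },
        })
        if not checker then
            ngx.log(ngx.ERR, "failed to create checker for ", path, ": ", err)
        else
            -- real code would also add the upstream nodes as targets here
            working_pool[path] = { version = version, checker = checker }
        end
    end
end

function _M.init_worker()
    -- 1s interval is an assumption; it matches the 1 second wait added to tests
    timer_every(1, create_checkers)
end

return _M
```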

NOTE: Many tests have been modified to either add a sleep or send an extra request before the actual request, for the following reasons:

  1. The checker is now created asynchronously, so a wait of at least 1 second is required to confirm that it was created.
  2. The first request only puts the checker on the waiting pool. Even with active health checks, the first request never runs the health check itself; the timer picks the entry up and creates the checker later.

Benchmark results confirming no anomalies

Below are the results of manual testing done to verify that there are no anomalies in CPU and memory usage.
Relevant ENV variables:
worker_process: 1
ADMIN_KEY="HaVUyrUtJtpuDuRzHipGziJxdbxOkeqm"
APISIX_ADMIN_URL="http://127.0.0.1:9180"
APISIX_DATA_PLANE="http://127.0.0.1:9080"
BACKEND_PORT=8080 # A go server is run locally for upstream nodes
UPSTREAM_NAME="stress_upstream"
ROUTE_NAME="stress_route"
N_NODES=200
1.1 Create resources with 50% failing nodes

    NODES="["
    for i in $(seq 1 $((N_NODES/2))); do
        NODES+="{\"host\":\"127.0.0.1\",\"port\":$BACKEND_PORT,\"weight\":1},"
    done
    for i in $(seq 1 $((N_NODES/2))); do
        NODES+="{\"host\":\"127.0.0.1\",\"port\":1,\"weight\":1},"
    done
    NODES="${NODES%,}]"

    # Create upstream
    curl -s -X PUT "$APISIX_ADMIN_URL/apisix/admin/upstreams/1" \
        -H "X-API-KEY: $ADMIN_KEY" \
        -d "{
            \"name\": \"$UPSTREAM_NAME\",
            \"type\": \"roundrobin\",
            \"nodes\": $NODES,
            \"retries\": 2
        }"

    # Create route
    curl -s -X PUT "$APISIX_ADMIN_URL/apisix/admin/routes/1" \
        -H "X-API-KEY: $ADMIN_KEY" \
        -d "{
            \"name\": \"$ROUTE_NAME\",
            \"uri\": \"/*\",
            \"upstream_id\": \"1\"
        }"
    
    echo "Created upstream with $N_NODES nodes (50% failing)"

1.2 Run wrk and pidstat to calculate CPU usage - Baseline case
Wrk results:


 wrk -c 100 -t 5 -d 60s -R 900 http://localhost:9080
Running 1m test @ http://localhost:9080
  5 threads and 100 connections
  Thread calibration: mean lat.: 2.114ms, rate sampling interval: 10ms
  Thread calibration: mean lat.: 2.135ms, rate sampling interval: 10ms
  Thread calibration: mean lat.: 2.140ms, rate sampling interval: 10ms
  Thread calibration: mean lat.: 2.122ms, rate sampling interval: 10ms
  Thread calibration: mean lat.: 2.147ms, rate sampling interval: 10ms
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.10ms  735.04us  13.99ms   70.86%
    Req/Sec   189.76     70.81   444.00     69.94%
  54005 requests in 1.00m, 22.71MB read
  Non-2xx or 3xx responses: 54005
Requests/sec:    900.04
Transfer/sec:    387.62KB

CPU and Memory usage results
 pidstat -ur -p $pid 2 30 #sample 30 times - every 2 seconds
 
 #Output after sampling for 60 seconds
 
Average:      UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
Average:     1000   1089374   17.15   13.73    0.00    0.10   30.88     -  openresty

Average:      UID       PID  minflt/s  majflt/s     VSZ     RSS   %MEM  Command
Average:     1000   1089374      5.00      0.00  498812   77144   0.48  openresty

RESULT:
The average CPU% is 30.88% and MEM% is 0.48%

Enable health checks on the upstream:

    local health_config='{
        "active": {
            "type": "http",
            "http_path": "/",
            "healthy": {"interval": 1, "successes": 1},
            "unhealthy": {"interval": 1, "http_failures": 1}
        }
    }'
    curl -s -X PATCH "$APISIX_ADMIN_URL/apisix/admin/upstreams/1" \
        -H "X-API-KEY: $ADMIN_KEY" \
        -d "{\"checks\": $health_config}"
    echo "Enabled health checks"

Test with health checks enabled + routes updated in the background for 24 hrs

Wrk results:

 wrk -c 100 -t 5 -d 86400s -R 900 http://localhost:9080

Running 1440m test @ http://localhost:9080
  5 threads and 100 connections
  Thread calibration: mean lat.: 1.322ms, rate sampling interval: 10ms
  Thread calibration: mean lat.: 1.314ms, rate sampling interval: 10ms
  Thread calibration: mean lat.: 1.302ms, rate sampling interval: 10ms
  Thread calibration: mean lat.: 1.307ms, rate sampling interval: 10ms
  Thread calibration: mean lat.: 1.294ms, rate sampling interval: 10ms
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     9.01ms  562.23ms   1.00m    99.97%
    Req/Sec   188.85     68.87     2.33k    73.45%
  77336041 requests in 1440.00m, 13.33GB read
  Socket errors: connect 0, read 0, write 0, timeout 23542
  Non-2xx or 3xx responses: 15186209
Requests/sec:    895.09
Transfer/sec:    161.72KB
CPU and Memory usage results
 pidstat -ur -p $pid 3600 24 # sample 24 times - every hour
Linux 6.14.4-arch1-2 (ashish-82yl)      24/07/25        _x86_64_        (16 CPU)

05:16:54 PM IST   UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
06:16:54 PM IST  1000    142261   10.77    5.80    0.00    0.10   16.57     1  openresty

05:16:54 PM IST   UID       PID  minflt/s  majflt/s     VSZ     RSS   %MEM  Command
06:16:54 PM IST  1000    142261      9.35      0.00  409232   42680   0.27  openresty

...

04:16:54 PM IST   UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
05:16:54 PM IST  1000    142261    3.71    1.81    0.00    0.02    5.51    14  openresty

04:16:54 PM IST   UID       PID  minflt/s  majflt/s     VSZ     RSS   %MEM  Command
05:16:54 PM IST  1000    142261      0.26      0.00  415664   50452   0.31  openresty


...

Average:      UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
Average:     1000    142261    7.10    3.86    0.00    0.11   10.96     -  openresty

Average:      UID       PID  minflt/s  majflt/s     VSZ     RSS   %MEM  Command
Average:     1000    142261      7.71      0.00  411107   44214   0.27  openresty

RESULT:
After the 1st hour the CPU usage drops to 16.57% while memory usage stays at 0.27%.
Towards the end, the CPU usage drops to 5.51% while memory usage increases to just 0.31%.
The average memory usage was 0.27%, and the increase in memory usage over 24 hrs was 0.04%.

Checklist

  • I have explained the need for this PR and the problem it solves
  • I have explained the changes or the new features added to this PR
  • I have added tests corresponding to this change
  • I have updated the documentation to reflect this change
  • I have verified that this change is backward compatible (If not, please discuss on the APISIX mailing list first)

@Revolyssup Revolyssup marked this pull request as draft July 14, 2025 15:18
@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. enhancement New feature or request labels Jul 14, 2025
@Revolyssup Revolyssup marked this pull request as ready for review July 16, 2025 12:15
@dosubot dosubot bot added size:XL This PR changes 500-999 lines, ignoring generated files. and removed size:L This PR changes 100-499 lines, ignoring generated files. labels Jul 16, 2025
@Revolyssup Revolyssup marked this pull request as draft July 16, 2025 20:27
@moonming moonming requested a review from Copilot July 21, 2025 08:31

@Copilot Copilot AI left a comment


Pull Request Overview

This PR introduces a healthcheck manager to decouple upstream objects from health checkers. Previously, health checkers were tightly coupled to upstream objects and stored within them, creating complexity with lifecycle management and DNS/service discovery updates. The new approach separates the lifecycle of health checkers from upstream objects by managing them through resource paths and versions in a dedicated healthcheck_manager module that runs asynchronously.

Key changes include:

  • Creation of a new healthcheck_manager.lua module that manages health checkers independently from upstream objects
  • Replacement of the parent field relationship with resource_key and resource_version index references
  • Asynchronous health checker creation using timers instead of direct creation during requests

Reviewed Changes

Copilot reviewed 25 out of 25 changed files in this pull request and generated 5 comments.

Summary per file:

  • apisix/healthcheck_manager.lua: New module implementing asynchronous health checker management with working and waiting pools
  • apisix/upstream.lua: Refactored to use healthcheck_manager instead of direct checker management; replaced parent references
  • apisix/plugin.lua: Removed shallow copy exceptions for upstream.parent references
  • apisix/init.lua: Updated DNS resolution logic to increment _nodes_ver and removed shallow copy exceptions
  • apisix/balancer.lua: Updated to use healthcheck_manager for fetching node status
  • apisix/control/v1.lua: Modified to work with the new healthcheck_manager API
  • test files: Updated with additional sleep delays and extra requests to accommodate asynchronous checker creation

-- if a checker exists then delete it before creating a new one
local existing_checker = _M.working_pool[resource_path]
if existing_checker then
    existing_checker.checker:delayed_clear(10)

Copilot AI Jul 21, 2025


The magic number 10 (seconds) for delayed_clear should be defined as a named constant to improve maintainability and make the cleanup timeout configurable.

Suggested change
existing_checker.checker:delayed_clear(10)
existing_checker.checker:delayed_clear(DELAYED_CLEAR_TIMEOUT)
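
For context, the named constant the suggestion refers to would just be a module-level local near the top of healthcheck_manager.lua, along these lines (the value 10 comes from the existing code):

```lua
-- seconds to keep a stale checker around before it is cleared
local DELAYED_CLEAR_TIMEOUT = 10
```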



--- remove from working pool if resource doesn't exist
local res_conf = fetch_latest_conf(resource_path)
if not res_conf then
    item.checker:delayed_clear(10)

Copilot AI Jul 21, 2025


The magic number 10 (seconds) for delayed_clear should be defined as a named constant to improve maintainability and make the cleanup timeout configurable.

Suggested change
item.checker:delayed_clear(10)
item.checker:delayed_clear(DELAYED_CLEAR_TIMEOUT)


core.log.info("checking working pool for resource: ", resource_path,
              " current version: ", current_ver, " item version: ", item.version)
if item.version ~= current_ver then
    item.checker:delayed_clear(10)

Copilot AI Jul 21, 2025


The magic number 10 (seconds) for delayed_clear should be defined as a named constant to improve maintainability and make the cleanup timeout configurable.

Suggested change
item.checker:delayed_clear(10)
item.checker:delayed_clear(DELAYED_CLEAR_TIMEOUT)


nic-6443 previously approved these changes Jul 22, 2025
@nic-6443 nic-6443 requested a review from membphis July 23, 2025 01:31
@membphis membphis self-requested a review July 23, 2025 01:31
@nic-6443 nic-6443 requested a review from bzp2010 July 23, 2025 01:31
@@ -27,7 +27,7 @@ local get_last_failure = balancer.get_last_failure
local set_timeouts = balancer.set_timeouts
local ngx_now = ngx.now
local str_byte = string.byte

local healthcheck_manager = require("apisix.healthcheck_manager")
Member

(image attachment)

@@ -75,7 +75,8 @@ local function fetch_health_nodes(upstream, checker)
local port = upstream.checks and upstream.checks.active and upstream.checks.active.port
local up_nodes = core.table.new(0, #nodes)
for _, node in ipairs(nodes) do
local ok, err = checker:get_target_status(node.host, port or node.port, host)
local ok, err = healthcheck_manager.fetch_node_status(checker,
Member


we can cache healthcheck_manager.fetch_node_status, a short local function
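
One reading of this suggestion is the pattern already used at the top of balancer.lua: cache the module function in a local upvalue so the hot path avoids repeated table lookups. The argument order below is inferred from the diff and the old get_target_status call, so treat it as an assumption.

```lua
-- at module load time: cache the hot-path function as an upvalue
local healthcheck_manager = require("apisix.healthcheck_manager")
local fetch_node_status = healthcheck_manager.fetch_node_status

-- later, inside fetch_health_nodes(), call the local instead of the module field
local ok, err = fetch_node_status(checker, node.host, port or node.port, host)
```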

Contributor Author

@Revolyssup Revolyssup Jul 23, 2025


do you mean using the lrucache with checker as key?

local events = require("apisix.events")
local tab_clone = core.table.clone
local timer_every = ngx.timer.every
local _M = {
Member


it is not safe to export working_pool and waiting_pool.

We must never allow working_pool or waiting_pool to be destroyed from outside.

They should only be used inside this Lua module.
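
A sketch of what keeping the pools private could look like; working_pool and waiting_pool are the names from this comment, while the accessor is a hypothetical example of exposing read-only information without exporting the tables themselves.

```lua
-- pools as module-local upvalues: other modules can only reach them through
-- functions exported on _M, so the tables can never be replaced or cleared
local working_pool = {}
local waiting_pool = {}

local _M = {}

-- hypothetical read-only accessor for debugging/metrics
function _M.working_pool_count()
    local count = 0
    for _ in pairs(working_pool) do
        count = count + 1
    end
    return count
end

return _M
```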

@Revolyssup Revolyssup requested a review from nic-chen July 23, 2025 07:30
@membphis
Member

it looks good to me now, but we should wait a while longer for more comments, since this is a big change

BTW, the PR title should be refactor ^_^

Member

@membphis membphis left a comment


LGTM

@Revolyssup Revolyssup changed the title feat: add healthcheck manager to decouple upstream refactor: add healthcheck manager to decouple upstream Jul 24, 2025
@nic-chen
Member

Most of our tests are for scenarios in which the config_provider is etcd. Could you add health check tests for scenarios in which the config_provider is yaml or json?

@Revolyssup
Contributor Author

Most of our tests are for scenarios in which the config_provider is etcd. Could you add health check tests for scenarios in which the config_provider is yaml or json?

A lot of existing tests here already use yaml config_provider. See https://github.com/apache/apisix/pull/12426/files#diff-d89c41bb4b1cc7c97936090edd39c05e1fcbc35c94039a16dbb392515d4f38d3R34
