AER-3864 Improved handling changes in number of workers available #104

Hilbrand · 2025-07-21T11:10:28Z

To improve the handling of changes in number of workers running the following improvements are made:

Changed the event handling of changes in number of workers triggered by RabbitMQ channel event queue watchter.
Instead of directly updating the number of workers it will trigger the RabbitMQ api call.
The api will give the actual number of workers available.
Because the event is triggered for every worker added/removed a timer is used to delay the api call.
When a new event is given within 15 seconds, the timer is reset.
The actual recurring api call is included in the timer process, so it will not run also when the timer is running from an event or the other way around.
This time mechanism assures an update id done within 15 seconds after a change.
This seems appropriate. (The alternative is to just do the call each 15 seconds, and remove the whole event triggering).
The changes in number of workers is also propagated to the priority scheduler so it can also check it's internal state.
The workerFinishedHandlers are guarded from RuntimeExceptions, which could result in not all handlers to be called when an exception happens in one of the handlers.
In workerpool to check if a freeworker should be released when a task is finished it doesn't include the number of freeworkers.
Because if no change in number of workers this test would always test on equals size(configured == free + running).
Unless the number of free workers is just a bit out of sync due to concurrency.
While number of running workers is always in sync because when called it's when a task is finished. So if all workers were running it would valid the == condition.
To free the workers when the number of workers is reduced no waiting acquire is used.
It could block if the data isn't completely in sync. By removing the potential block this is avoided.
Even when this would mean less workers are marked as free than are actually free this
would recover from when running tasks a finishing and than won't release a free worker,
this will than recover itself to the correct condition.
It's unclear if it every actually blocked. But just removing the block makes the code more robust.
The timeout values on unit tests have been fixed. The default unit is seconds. Therefore all timeout that had no unit actually had a timeout that was 1000 time seconds. The annotations have been fixed and have a explicit time unit added.

To improve the handling of changes in number of workers running the following improvements are made: 1) Changed the event handling of changes in number of workers triggered by RabbitMQ channel event queue watchter. Instead of directly updating the number of workers it will trigger the RabbitMQ api call. The api will give the actual number of workers available. Because the event is triggered for every worker added/removed a timer is used to delay the api call. When a new event is given within 15 seconds, the timer is reset. The actual recurring api call is included in the timer process, so it will not run also when the timer is running from an event or the other way around. This time mechanism assures an update id done within 15 seconds after a change. This seems appropriate. (The alternative is to just do the call each 15 seconds, and remove the whole event triggering). 2) The changes in number of workers is also propagated to the priority scheduler so it can also check it's internal state. 3) The workerFinishedHandlers are guarded from RuntimeExceptions, which could result in not all handlers to be called when an exception happens in one of the handlers. 4) In workerpool to check if a freeworker should be released when a task is finished it doesn't include the number of freeworkers. Because if no change in number of workers this test would always test on equals size(configured == free + running). Unless the number of free workers is just a bit out of sync due to concurrency. While number of running workers is always in sync because when called it's when a task is finished. So if all workers were running it would valid the == condition. 5) To free the workers when the number of workers is reduced no waiting acquire is used. It could block if the data isn't completely in sync. By removing the potential block this is avoided. Even when this would mean less workers are marked as free than are actually free this would recover from when running tasks a finishing and than won't release a free worker, this will than recover itself to the correct condition. It's unclear if it every actually blocked. But just removing the block makes the code more robust. 6) The timeout values on unit tests have been fixed. The default unit is seconds. Therefore all timeout that had no unit actually had a timeout that was 1000 time seconds. The annotations have been fixed and have a explicit time unit added.

BertScholten

Looks good, just got some questions

source/taskmanager/src/main/java/nl/aerius/taskmanager/WorkerPool.java

source/taskmanager/src/main/java/nl/aerius/taskmanager/mq/RabbitMQWorkerSizeProvider.java

...nager/src/main/java/nl/aerius/taskmanager/scheduler/priorityqueue/PriorityTaskScheduler.java

BertScholten

LGTM

Hilbrand requested review from BertScholten and SerhatG July 21, 2025 11:10

Hilbrand force-pushed the improve-update branch from 524f878 to 64fda7f Compare July 21, 2025 11:11

Hilbrand changed the title ~~Improved handling changes in number of workers available~~ AER-3864 Improved handling changes in number of workers available Jul 21, 2025

BertScholten reviewed Jul 21, 2025

View reviewed changes

Review comments

1177445

Hilbrand requested a review from BertScholten July 22, 2025 08:01

BertScholten approved these changes Jul 22, 2025

View reviewed changes

Hilbrand merged commit 6b8e589 into aerius:main Jul 22, 2025
1 check passed

Hilbrand deleted the improve-update branch July 22, 2025 14:13

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

AER-3864 Improved handling changes in number of workers available #104

AER-3864 Improved handling changes in number of workers available #104

Uh oh!

Hilbrand commented Jul 21, 2025

Uh oh!

BertScholten left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

BertScholten left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

AER-3864 Improved handling changes in number of workers available #104

AER-3864 Improved handling changes in number of workers available #104

Uh oh!

Conversation

Hilbrand commented Jul 21, 2025

Uh oh!

BertScholten left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

BertScholten left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants