Skip to content

Conversation

@Hilbrand
Copy link
Member

To improve the handling of changes in number of workers running the following improvements are made:

  1. Changed the event handling of changes in number of workers triggered by RabbitMQ channel event queue watchter.
    Instead of directly updating the number of workers it will trigger the RabbitMQ api call.
    The api will give the actual number of workers available.
    Because the event is triggered for every worker added/removed a timer is used to delay the api call.
    When a new event is given within 15 seconds, the timer is reset.
    The actual recurring api call is included in the timer process, so it will not run also when the timer is running from an event or the other way around.
    This time mechanism assures an update id done within 15 seconds after a change.
    This seems appropriate. (The alternative is to just do the call each 15 seconds, and remove the whole event triggering).

  2. The changes in number of workers is also propagated to the priority scheduler so it can also check it's internal state.

  3. The workerFinishedHandlers are guarded from RuntimeExceptions, which could result in not all handlers to be called when an exception happens in one of the handlers.

  4. In workerpool to check if a freeworker should be released when a task is finished it doesn't include the number of freeworkers.
    Because if no change in number of workers this test would always test on equals size(configured == free + running).
    Unless the number of free workers is just a bit out of sync due to concurrency.
    While number of running workers is always in sync because when called it's when a task is finished. So if all workers were running it would valid the == condition.

  5. To free the workers when the number of workers is reduced no waiting acquire is used.
    It could block if the data isn't completely in sync. By removing the potential block this is avoided.
    Even when this would mean less workers are marked as free than are actually free this
    would recover from when running tasks a finishing and than won't release a free worker,
    this will than recover itself to the correct condition.
    It's unclear if it every actually blocked. But just removing the block makes the code more robust.

  6. The timeout values on unit tests have been fixed. The default unit is seconds. Therefore all timeout that had no unit actually had a timeout that was 1000 time seconds. The annotations have been fixed and have a explicit time unit added.

@Hilbrand Hilbrand requested review from BertScholten and SerhatG July 21, 2025 11:10
To improve the handling of changes in number of workers running the following improvements are made:

1) Changed the event handling of changes in number of workers triggered by RabbitMQ channel event queue watchter.
   Instead of directly updating the number of workers it will trigger the RabbitMQ api call.
   The api will give the actual number of workers available.
   Because the event is triggered for every worker added/removed a timer is used to delay the api call.
   When a new event is given within 15 seconds, the timer is reset.
   The actual recurring api call is included in the timer process, so it will not run also when the timer is running from an event or the other way around.
   This time mechanism assures an update id done within 15 seconds after a change.
   This seems appropriate. (The alternative is to just do the call each 15 seconds, and remove the whole event triggering).

2) The changes in number of workers is also propagated to the priority scheduler so it can also check it's internal state.

3) The workerFinishedHandlers are guarded from RuntimeExceptions, which could result in not all handlers to be called when an exception happens in one of the handlers.

4) In workerpool to check if a freeworker should be released when a task is finished it doesn't include the number of freeworkers.
   Because if no change in number of workers this test would always test on equals size(configured == free + running).
   Unless the number of free workers is just a bit out of sync due to concurrency.
   While number of running workers is always in sync because when called it's when a task is finished. So if all workers were running it would valid the == condition.

5) To free the workers when the number of workers is reduced no waiting acquire is used.
   It could block if the data isn't completely in sync. By removing the potential block this is avoided.
   Even when this would mean less workers are marked as free than are actually free this
   would recover from when running tasks a finishing and than won't release a free worker,
   this will than recover itself to the correct condition.
   It's unclear if it every actually blocked. But just removing the block makes the code more robust.

6) The timeout values on unit tests have been fixed. The default unit is seconds. Therefore all timeout that had no unit actually had a timeout that was 1000 time seconds. The annotations have been fixed and have a explicit time unit added.
@Hilbrand Hilbrand changed the title Improved handling changes in number of workers available AER-3864 Improved handling changes in number of workers available Jul 21, 2025
Copy link
Member

@BertScholten BertScholten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, just got some questions

@Hilbrand Hilbrand requested a review from BertScholten July 22, 2025 08:01
Copy link
Member

@BertScholten BertScholten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Hilbrand Hilbrand merged commit 6b8e589 into aerius:main Jul 22, 2025
1 check passed
@Hilbrand Hilbrand deleted the improve-update branch July 22, 2025 14:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants