Skip to content

Conversation

@ztison
Copy link
Contributor

@ztison ztison commented Nov 20, 2025

What is the purpose of the change

This PR resolves a race condition that leads to a ConcurrentModificationException when OpenTelemetry (OTel) metrics are collected simultaneously with updates to the tracked task managers in the SlotManager.

Brief change log

  • The new fields for the metrics are introduced
  • These fields are periodically updated on the main thread every 1s
  • OpenTelemetry reads these static fields

Verifying this change

This change added tests and can be verified as follows:

  • New test was added that verifies the fields are periodically updated and values are propagated to the OpenTelemetry collector.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

@flinkbot
Copy link
Collaborator

flinkbot commented Nov 20, 2025

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

@ztison ztison marked this pull request as draft November 21, 2025 09:53
@ztison
Copy link
Contributor Author

ztison commented Nov 21, 2025

CI is not passing due to a failing integration test.

Failures: 
Nov 20 17:36:29 17:36:29.834 [ERROR]   JobManagerMetricsITCase.testJobManagerMetrics:136 expected:<1> but was:<0>

The reason is that the failing test reads metrics before the fields were updated (interval 1s).
Update: I added a sleep to the test.

@ztison ztison marked this pull request as ready for review November 21, 2025 14:01
@ztison
Copy link
Contributor Author

ztison commented Nov 21, 2025

@flinkbot run azur

}

private void registerSlotManagerMetrics() {
slotManagerMetricGroup.gauge(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it would not be simpler to synchronise on slotManagerMetricGroup around the original 2 lines. WDYT?

Copy link
Contributor Author

@ztison ztison Nov 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @davidradl, thanks for looking at this PR.

Can you elaborate more on what you mean?

If I understand that correctly then a synchronization on slotManagerMetricGroup level wouldn't help because the ConcurrentModificationException is still possible if the FineGrainedTaskManagerTracker#addTaskManager or the FineGrainedTaskManagerTracker#removeTaskManager is called from the main thread at the same time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ztison I see that this change is ensuring that some fields in FineGrainedManager are updated on the same thread, so they cannot be concurrently modified. If the code modifying these fields were synchronized on the FineGrainedManager object, then thread 1 would get the lock and make the updates, then if thread 2 tries to update the same object, it would also request the lock, which it would only get, when thread 1 completed its updates. So it would ensure only one thread ever updated the fields.

Copy link
Contributor Author

@ztison ztison Dec 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@davidradl
The easiest fix would be to make the FineGrainedTaskManagerTracker.taskManagerRegistrations map a concurrent map. But what about the other maps in the FineGrainedTaskManagerTracker? My point is that this component doesn't claim to be thread-safe. I don't want to make it partially thread-safe, as that could confuse future readers of this class. That's why I moved the synchronization logic up.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ztison Personally I would go with the synchronise or concurrent map as it solves the same problem without the future, but let's see what others have to say. If others are ok with your approach - I will back it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, sure. I have already discussed it with @zentol internally. So let's wait for his feedback.
Thanks anyway.

@github-actions github-actions bot added the community-reviewed PR has been reviewed by the community. label Nov 21, 2025
@snuyanzin
Copy link
Contributor

fyi: to have green ci, rebase to the latest master
e2e was fixed at FLINK-38700

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-reviewed PR has been reviewed by the community.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants