-
Notifications
You must be signed in to change notification settings - Fork 13.8k
[FLINK-38703][runtime] Update slot manager metrics in thread-safety manner #27257
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
CI is not passing due to a failing integration test. The reason is that the failing test reads metrics before the fields were updated (interval 1s). |
eafab52 to
1ae156a
Compare
|
@flinkbot run azur |
| } | ||
|
|
||
| private void registerSlotManagerMetrics() { | ||
| slotManagerMetricGroup.gauge( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it would not be simpler to synchronise on slotManagerMetricGroup around the original 2 lines. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @davidradl, thanks for looking at this PR.
Can you elaborate more on what you mean?
If I understand that correctly then a synchronization on slotManagerMetricGroup level wouldn't help because the ConcurrentModificationException is still possible if the FineGrainedTaskManagerTracker#addTaskManager or the FineGrainedTaskManagerTracker#removeTaskManager is called from the main thread at the same time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ztison I see that this change is ensuring that some fields in FineGrainedManager are updated on the same thread, so they cannot be concurrently modified. If the code modifying these fields were synchronized on the FineGrainedManager object, then thread 1 would get the lock and make the updates, then if thread 2 tries to update the same object, it would also request the lock, which it would only get, when thread 1 completed its updates. So it would ensure only one thread ever updated the fields.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@davidradl
The easiest fix would be to make the FineGrainedTaskManagerTracker.taskManagerRegistrations map a concurrent map. But what about the other maps in the FineGrainedTaskManagerTracker? My point is that this component doesn't claim to be thread-safe. I don't want to make it partially thread-safe, as that could confuse future readers of this class. That's why I moved the synchronization logic up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ztison Personally I would go with the synchronise or concurrent map as it solves the same problem without the future, but let's see what others have to say. If others are ok with your approach - I will back it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, sure. I have already discussed it with @zentol internally. So let's wait for his feedback.
Thanks anyway.
|
fyi: to have green ci, rebase to the latest master |
What is the purpose of the change
This PR resolves a race condition that leads to a
ConcurrentModificationExceptionwhen OpenTelemetry (OTel) metrics are collected simultaneously with updates to the tracked task managers in theSlotManager.Brief change log
Verifying this change
This change added tests and can be verified as follows:
Does this pull request potentially affect one of the following parts:
@Public(Evolving): noDocumentation