dp: Fix DP scheduler locking #10327
When at least two DP modules are running, each on a separate core, using irq_lock() may lead to interrupts being disabled for a very long time. When module_is_ready_to_process() always returns true, the DP task is executed in a loop all the time except for periods when preempted by higher priority threads. irq_lock() disables interrupts globally. Using irq_lock() on multiple cores can lead to unbalanced double locks without unlock in between.

Consider the case: core 1 calls irq_lock(1); this does not prevent core 2 from also calling flags = irq_lock(2); now flags contains the "interrupts disabled" state as interrupts were previously globally disabled by core 1. Then core 1 calls irq_unlock() -- interrupts are re-enabled; then core 2 calls irq_unlock(flags) to restore interrupts, which actually leads to interrupts being disabled. On the next loop iteration, core 1 calls flags = irq_lock(1), and since then interrupts might be disabled forever with only two DP threads constantly running.

This fixes a regression in multicore DP tests. The issue is triggered by this commit 4225c27, which just allows the DP task to run all the available time without being triggered by LL for every cycle.

Signed-off-by: Serhiy Katsyuba <serhiy.katsyuba@intel.com>
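The interleaving described above can be replayed deterministically in a small model. This is only a sketch under the assumption stated in the description, namely that a single interrupt-enable state is shared by both cores. The names `sim_irq_lock()`, `sim_irq_unlock()` and `replay_unbalanced_locks()` are hypothetical stand-ins, not Zephyr or SOF APIs, and the model does not reflect how any real architecture implements `irq_lock()`.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model, not Zephyr code: one interrupt-enable flag shared by
 * all cores, as the description above assumes.
 */
static bool irq_enabled = true;

/* Stand-in for irq_lock(): return the previous state, then disable. */
static bool sim_irq_lock(void)
{
	bool key = irq_enabled;

	irq_enabled = false;
	return key;
}

/* Stand-in for irq_unlock(): restore the state captured in key. */
static void sim_irq_unlock(bool key)
{
	irq_enabled = key;
}

/* Replay the two-core interleaving and return the final interrupt state. */
bool replay_unbalanced_locks(void)
{
	irq_enabled = true;

	bool key1 = sim_irq_lock();	/* core 1: key1 records "enabled" */
	bool key2 = sim_irq_lock();	/* core 2: key2 records "disabled" */

	sim_irq_unlock(key1);		/* core 1: interrupts re-enabled */
	assert(irq_enabled);

	sim_irq_unlock(key2);		/* core 2: "restores" the disabled state */

	/* Next loop iteration on core 1: the saved key is now "disabled",
	 * so the matching unlock cannot re-enable interrupts.
	 */
	key1 = sim_irq_lock();
	sim_irq_unlock(key1);

	return irq_enabled;
}
```

In this model `replay_unbalanced_locks()` returns `false`: once core 2 has written back the stale key, every later lock/unlock pair saves and restores the disabled state, matching the "disabled forever" outcome in the description.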
softwarecki
left a comment
I'm glad my solution helped resolve the issue :)
@serhiy-katsyuba-intel can you check the internal CI? Thanks!
The CI fails on the test_00_11_enter_d3_with_topology_stress test on NVL FPGA. That test does not use DP, so it should not be directly affected by the changes in this PR. I checked a few neighboring PRs: they all fail the same test. cc: @lrudyX, @tmleman.
The failure is related to a DUT issue. Working on solving this problem.
Internal Intel CI is now working and has passed successfully.