-
Notifications
You must be signed in to change notification settings - Fork 349
fix IPC timeouts #9926
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix IPC timeouts #9926
Conversation
src/audio/host-zephyr.c
Outdated
| static uint64_t nobytes_last_logged; | ||
| uint64_t now = k_uptime_get(); | ||
|
|
||
| if (now - nobytes_last_logged > SOF_MIN_NO_BYTES_INTERVAL_MS) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given uptime starts from zero, seems safe to ignore rollover (and seems standard practice in Zephyr code using k_uptime_get()).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kv2019i can we hire an intern to calculate 64-bit overflow date in milliseconds? ;-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my rough estimate takes me somewhere to 400mln years from now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we are not going to log can we count the filtered messages and print them in the log i.e. "blah ... event occurred %d times since last message"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably only ratelimit consecutive 'no bytes to copy' prints?
This is also going to 'mute' all host copiers and might hide prints from non stopping paths and hides possibly valuable information?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we are not going to log can we count the filtered messages and print them in the log i.e. "blah ... event occurred %d times since last message"
@lgirdwood sure, as also suggested by @marcinszkudlinski - done in the current version
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably only ratelimit consecutive 'no bytes to copy' prints?
This is also going to 'mute' all host copiers and might hide prints from non stopping paths and hides possibly valuable information?
@ujfalusi sorry, not sure I understand. (1) - how would you check for only consecutive prints? You'd need to "hack" into the logging core to check if any other messages intermix. And also not sure why it matters - it does happen from time to time that in a flood of this messages another one appears, so with this rate-limiting we'll throw away the flood before and after and other messages will still come through unobstructed. (2) it will not completely mute these messages, it will just reduce their number. And now we'll also see how many of them got dropped. Also note, that in fact it can improve logs - this flood can trigger the rate-limiter in the logging core, which then would drop random messages
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lyakh, what I meant is that for example you have two streams running, you stop one of them, which will try to flood the log with 'no bytes to copy', but there is a bug in somewhere which causes the other, still running stream to glitch and would want to print also 'no bytes to copy'
|
we have a build regression from #9913 https://github.com/thesofproject/sof/actions/runs/13980547833/job/39144648495 tracked in zephyrproject-rtos/zephyr#87475 |
| if (now - nobytes_last_logged > SOF_MIN_NO_BYTES_INTERVAL_MS) { | ||
| nobytes_last_logged = now; | ||
| comp_info(dev, "no bytes to copy, available samples: %d, free_samples: %d", | ||
| avail_samples, free_samples); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Idea: add a counter. Log how many messages have been hidden like
"no bytes to copy, %u such events in last %u ms, available samples: %d, free_samples: %d",
lgirdwood
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets count the volume of messages as no data can point to other issues.
src/audio/host-zephyr.c
Outdated
| static uint64_t nobytes_last_logged; | ||
| uint64_t now = k_uptime_get(); | ||
|
|
||
| if (now - nobytes_last_logged > SOF_MIN_NO_BYTES_INTERVAL_MS) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we are not going to log can we count the filtered messages and print them in the log i.e. "blah ... event occurred %d times since last message"
|
fixes #9914 |
@lgirdwood I looked at priorities, they look correct. There's a thread for LL, running with a very high priority, then there's a logging thread, running with a very low priority, then there's an incoming IPC processing thread and an outgoing IPC queue / thread, running with a middle priority. I think it's that outgoing IPC work queue, that gets filled with these log messages. I've checked Linux logs and see there multiple repeated log notifications. |
|
CI:
|
src/audio/host-zephyr.c
Outdated
| nobytes_last_logged = now; | ||
| comp_info(dev, | ||
| "no bytes to copy, %u such events in last %llu ms, available samples: %d, free_samples: %d", | ||
| n_skipped, delta, avail_samples, free_samples); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick: in this context the avail_samples and free_samples are misleading. They are valid for this specific print, but they might have been different for the skipped ones...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ujfalusi I can split this print into 2: keep first unchanged for this specific event, and if (n_skipped) an additional print with the number and time interval
17547dc to
8dac4e5
Compare
src/audio/host-zephyr.c
Outdated
| if (!dma_copy_bytes) | ||
| comp_info(dev, "no bytes to copy, available samples: %d, free_samples: %d", | ||
| avail_samples, free_samples); | ||
| if (!dma_copy_bytes) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another suggestion. Now when an event occur twice in less than a timeout, the second notification will be lost (or reported at next event occurrence, maybe minutes later).
better this way:
if (n_skipped>0 && is_a_timeout) log a message regardless of current dma_copy_bytes state.
I found those messages extremely useful in many debuggings, so I prefer not to loose them ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@marcinszkudlinski ok, done that, it adds work to the "good" case like reading the timer, checking stuff, take a look if it's worth it
When tearing down streams, when some of the pipelines have already been deleted and some are still active, the remaining active pipelines might experience no-data conditions. Currently this is logged on every LL-scheduler period, i.e. every millisecond. This isn't adding any useful information and in fact can create a flood of outgoing IPC notifications, eventually blocking valid IPC replies and leading to IPC timeouts. Rate-limit these logging entries to fix the problem and relax the log. Signed-off-by: Guennadi Liakhovetski <guennadi.liakhovetski@linux.intel.com>
This restores commit 0e53393. Signed-off-by: Guennadi Liakhovetski <guennadi.liakhovetski@linux.intel.com>
|
Logging looks good now |
|
"close with comment" and "comment" buttons are close to each other ;) |
Yes, perf monitoring would be good! At least we should be protected against accidentally moving hot paths to DRAM when #9907 is merged |
|
Build results for Intel jenkins build are fine (internal id 6814) and it shows failure only due to the accidental close/reopen. Proceedin with merge. |
fix IPC timeouts and move IPC processing back to DRAM