Conversation

@lyakh (Collaborator) commented Mar 26, 2025

fix IPC timeouts and move IPC processing back to DRAM

static uint64_t nobytes_last_logged;
uint64_t now = k_uptime_get();

if (now - nobytes_last_logged > SOF_MIN_NO_BYTES_INTERVAL_MS) {
@kv2019i (Collaborator):

Given uptime starts from zero, seems safe to ignore rollover (and seems standard practice in Zephyr code using k_uptime_get()).

@lyakh (Collaborator, Author):

@kv2019i can we hire an intern to calculate the 64-bit overflow date in milliseconds? ;-)

@lyakh (Collaborator, Author):

my rough estimate takes me to somewhere around 400 million years from now
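Back-of-the-envelope, assuming k_uptime_get() counts milliseconds in a uint64_t: 2^64 ms ≈ 1.8 * 10^16 s ≈ 5.8 * 10^8 years, so on the order of hundreds of millions of years either way; safe to ignore.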

@lgirdwood (Member):

If we are not going to log, can we count the filtered messages and print the count in the log, i.e. "blah ... event occurred %d times since last message"?

@ujfalusi (Contributor):

Probably only rate-limit consecutive 'no bytes to copy' prints?

This is also going to 'mute' all host copiers, which might hide prints from non-stopping paths and thus hide possibly valuable information?

@lyakh (Collaborator, Author):

> If we are not going to log, can we count the filtered messages and print the count in the log, i.e. "blah ... event occurred %d times since last message"?

@lgirdwood sure, as also suggested by @marcinszkudlinski - done in the current version

@lyakh (Collaborator, Author):

> Probably only rate-limit consecutive 'no bytes to copy' prints?
>
> This is also going to 'mute' all host copiers, which might hide prints from non-stopping paths and thus hide possibly valuable information?

@ujfalusi sorry, not sure I understand. (1) How would you check for only consecutive prints? You'd need to "hack" into the logging core to check whether any other messages are intermixed. And I'm also not sure why it matters: it does happen from time to time that another message appears in a flood of these messages, so with this rate-limiting we'll throw away the flood before and after it, and other messages will still come through unobstructed. (2) It will not completely mute these messages, it will just reduce their number, and now we'll also see how many of them got dropped. Also note that it can in fact improve the logs: this flood can trigger the rate-limiter in the logging core, which would then drop random messages.

@ujfalusi (Contributor):

@lyakh, what I meant is that, for example, you have two streams running and you stop one of them, which will try to flood the log with 'no bytes to copy', but there is a bug somewhere which causes the other, still-running stream to glitch, and it would also want to print 'no bytes to copy'.

@lyakh (Collaborator, Author) commented Mar 26, 2025

if (now - nobytes_last_logged > SOF_MIN_NO_BYTES_INTERVAL_MS) {
	nobytes_last_logged = now;
	comp_info(dev, "no bytes to copy, available samples: %d, free_samples: %d",
		  avail_samples, free_samples);
Contributor:

Idea: add a counter. Log how many messages have been hidden, like:
"no bytes to copy, %u such events in last %u ms, available samples: %d, free_samples: %d"

@lgirdwood (Member) left a comment:

Let's count the volume of messages, as no data can point to other issues.


@lyakh (Collaborator, Author) commented Mar 26, 2025

fixes #9914

@lgirdwood (Member):

@lyakh @kv2019i do we know why a high logging rate impacted IPC messaging? I would assume IPC would still reply after a LOG flood? Do we need to look at the priority of the LOG thread so it's preemptable, or even make it tunable via Kconfig?

@lyakh (Collaborator, Author) commented Mar 26, 2025

> @lyakh @kv2019i do we know why a high logging rate impacted IPC messaging? I would assume IPC would still reply after a LOG flood? Do we need to look at the priority of the LOG thread so it's preemptable, or even make it tunable via Kconfig?

@lgirdwood I looked at the priorities, they look correct. There's a thread for LL, running with a very high priority, then there's a logging thread, running with a very low priority, and then there's an incoming IPC processing thread and an outgoing IPC queue / thread, running with a middle priority. I think it's the outgoing IPC work queue that gets filled with these log messages. I've checked the Linux logs and I see multiple repeated log notifications there.

@lyakh (Collaborator, Author) commented Mar 26, 2025

	nobytes_last_logged = now;
	comp_info(dev,
		  "no bytes to copy, %u such events in last %llu ms, available samples: %d, free_samples: %d",
		  n_skipped, delta, avail_samples, free_samples);
@ujfalusi (Contributor) commented Mar 27, 2025:

nitpick: in this context the avail_samples and free_samples are misleading. They are valid for this specific print, but they might have been different for the skipped ones...

@lyakh (Collaborator, Author):

@ujfalusi I can split this print into 2: keep the first one unchanged for this specific event and, if (n_skipped), add an additional print with the number and the time interval.
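
Something along these lines (a sketch, reusing n_skipped and delta from the current diff):

comp_info(dev, "no bytes to copy, available samples: %d, free_samples: %d",
	  avail_samples, free_samples);
if (n_skipped)
	comp_info(dev, "%u more such events in the last %llu ms", n_skipped, delta);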

@lyakh force-pushed the ipc branch 2 times, most recently from 17547dc to 8dac4e5 on March 27, 2025 at 12:12
	if (!dma_copy_bytes)
		comp_info(dev, "no bytes to copy, available samples: %d, free_samples: %d",
			  avail_samples, free_samples);
	if (!dma_copy_bytes) {
@marcinszkudlinski (Contributor) commented Mar 27, 2025:

Another suggestion. Now, when the event occurs twice in less than the timeout, the second notification will be lost (or reported at the next event occurrence, maybe minutes later).
Better this way: if (n_skipped > 0 && is_a_timeout), log a message regardless of the current dma_copy_bytes state.

I found those messages extremely useful in many debugging sessions, so I prefer not to lose them ;)

@lyakh (Collaborator, Author):

@marcinszkudlinski ok, done that. It adds some work to the "good" case (reading the timer, checking the counter), take a look whether it's worth it.
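
For illustration, the extra work in the "good" case is roughly this (a sketch, not the exact diff):

/* data was copied, but still flush a pending "no bytes to copy" summary */
if (n_skipped && k_uptime_get() - nobytes_last_logged > SOF_MIN_NO_BYTES_INTERVAL_MS) {
	comp_info(dev, "%u 'no bytes to copy' events since the last report", n_skipped);
	n_skipped = 0;
}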

lyakh added 2 commits March 28, 2025 12:03
When tearing down streams, when some of the pipelines have already
been deleted and some are still active, the remaining active
pipelines might experience no-data conditions. Currently this is
logged on every LL-scheduler period, i.e. every millisecond. This
isn't adding any useful information and in fact can create a flood
of outgoing IPC notifications, eventually blocking valid IPC replies
and leading to IPC timeouts. Rate-limit these logging entries to fix
the problem and relax the log.

Signed-off-by: Guennadi Liakhovetski <guennadi.liakhovetski@linux.intel.com>
This restores commit 0e53393.

Signed-off-by: Guennadi Liakhovetski <guennadi.liakhovetski@linux.intel.com>
@marcinszkudlinski (Contributor):

Logging looks good now.
Still, I'm not convinced by moving so much to __cold; we do need close monitoring for perf regressions.
Anyway, no objections to merging for now.

@marcinszkudlinski (Contributor):

"close with comment" and "comment" buttons are close to each other ;)

@lyakh (Collaborator, Author) commented Mar 28, 2025

> Logging looks good now. Still, I'm not convinced by moving so much to __cold; we do need close monitoring for perf regressions. Anyway, no objections to merging for now.

Yes, perf monitoring would be good! At least we should be protected against accidentally moving hot paths to DRAM when #9907 is merged

@kv2019i (Collaborator) commented Mar 28, 2025

Build results for the Intel Jenkins build are fine (internal id 6814); the only failure shown is due to the accidental close/reopen. Proceeding with merge.

@kv2019i merged commit 8bf2d92 into thesofproject:main on Mar 28, 2025 (80 of 91 checks passed)
@lyakh deleted the ipc branch on March 28, 2025 at 14:51