Conversation

@lyakh (Collaborator) commented Mar 26, 2025

fix IPC timeouts and move IPC processing back to DRAM

static uint64_t nobytes_last_logged;
uint64_t now = k_uptime_get();

if (now - nobytes_last_logged > SOF_MIN_NO_BYTES_INTERVAL_MS) {
@kv2019i (Collaborator):

Given uptime starts from zero, seems safe to ignore rollover (and seems standard practice in Zephyr code using k_uptime_get()).

@lyakh (Collaborator, Author):

@kv2019i can we hire an intern to calculate the 64-bit overflow date in milliseconds? ;-)

@lyakh (Collaborator, Author):

my rough estimate takes me to somewhere around 400 million years from now
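Back-of-the-envelope, assuming k_uptime_get() counts milliseconds in a uint64_t: 2^64 ms ≈ 1.8 * 10^16 s ≈ 5.8 * 10^8 years, so on the order of hundreds of millions of years either way; safe to ignore.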

@lgirdwood (Member):

If we are not going to log, can we count the filtered messages and print the count in the log, i.e. "blah ... event occurred %d times since last message"?

@ujfalusi (Contributor):

Probably only rate-limit consecutive 'no bytes to copy' prints?

This is also going to 'mute' all host copiers, which might hide prints from non-stopping paths and thus hide possibly valuable information?

@lyakh (Collaborator, Author):

> If we are not going to log, can we count the filtered messages and print the count in the log, i.e. "blah ... event occurred %d times since last message"?

@lgirdwood sure, as also suggested by @marcinszkudlinski - done in the current version

@lyakh (Collaborator, Author):

> Probably only rate-limit consecutive 'no bytes to copy' prints?
>
> This is also going to 'mute' all host copiers, which might hide prints from non-stopping paths and thus hide possibly valuable information?

@ujfalusi sorry, not sure I understand. (1) How would you check for only consecutive prints? You'd need to "hack" into the logging core to check whether any other messages are intermixed. And I'm also not sure why it matters: it does happen from time to time that another message appears in a flood of these messages, so with this rate-limiting we'll throw away the flood before and after it, and other messages will still come through unobstructed. (2) It will not completely mute these messages, it will just reduce their number, and now we'll also see how many of them got dropped. Also note that it can in fact improve the logs: this flood can trigger the rate-limiter in the logging core, which would then drop random messages.

@ujfalusi (Contributor):

@lyakh, what I meant is that, for example, you have two streams running and you stop one of them, which will try to flood the log with 'no bytes to copy', but there is a bug somewhere which causes the other, still-running stream to glitch, and it would also want to print 'no bytes to copy'.

@lyakh (Collaborator, Author) commented Mar 26, 2025

if (now - nobytes_last_logged > SOF_MIN_NO_BYTES_INTERVAL_MS) {
	nobytes_last_logged = now;
	comp_info(dev, "no bytes to copy, available samples: %d, free_samples: %d",
		  avail_samples, free_samples);
Contributor:

Idea: add a counter. Log how many messages have been hidden, like:
"no bytes to copy, %u such events in last %u ms, available samples: %d, free_samples: %d"

@lgirdwood (Member) left a comment:

Let's count the volume of messages, as no data can point to other issues.


@lyakh (Collaborator, Author) commented Mar 26, 2025

fixes #9914

@lgirdwood (Member):

@lyakh @kv2019i do we know why a high logging rate impacted IPC messaging? I would assume IPC would still reply after a LOG flood? Do we need to look at the priority of the LOG thread so it's preemptable, or even make it tunable via Kconfig?

@lyakh (Collaborator, Author) commented Mar 26, 2025

> @lyakh @kv2019i do we know why a high logging rate impacted IPC messaging? I would assume IPC would still reply after a LOG flood? Do we need to look at the priority of the LOG thread so it's preemptable, or even make it tunable via Kconfig?

@lgirdwood I looked at the priorities, they look correct. There's a thread for LL, running with a very high priority, then there's a logging thread, running with a very low priority, and then there's an incoming IPC processing thread and an outgoing IPC queue / thread, running with a middle priority. I think it's the outgoing IPC work queue that gets filled with these log messages. I've checked the Linux logs and I see multiple repeated log notifications there.

@lyakh (Collaborator, Author) commented Mar 26, 2025

	nobytes_last_logged = now;
	comp_info(dev,
		  "no bytes to copy, %u such events in last %llu ms, available samples: %d, free_samples: %d",
		  n_skipped, delta, avail_samples, free_samples);
@ujfalusi (Contributor) commented Mar 27, 2025:

nitpick: in this context the avail_samples and free_samples are misleading. They are valid for this specific print, but they might have been different for the skipped ones...

@lyakh (Collaborator, Author):

@ujfalusi I can split this print into 2: keep the first one unchanged for this specific event and, if (n_skipped), add an additional print with the number and the time interval.
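
Something along these lines (a sketch, reusing n_skipped and delta from the current diff):

comp_info(dev, "no bytes to copy, available samples: %d, free_samples: %d",
	  avail_samples, free_samples);
if (n_skipped)
	comp_info(dev, "%u more such events in the last %llu ms", n_skipped, delta);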

@lyakh force-pushed the ipc branch 2 times, most recently from 17547dc to 8dac4e5 on March 27, 2025 at 12:12
	if (!dma_copy_bytes)
		comp_info(dev, "no bytes to copy, available samples: %d, free_samples: %d",
			  avail_samples, free_samples);
	if (!dma_copy_bytes) {
@marcinszkudlinski (Contributor) commented Mar 27, 2025:

Another suggestion. Now, when the event occurs twice in less than the timeout, the second notification will be lost (or reported at the next event occurrence, maybe minutes later).
Better this way: if (n_skipped > 0 && is_a_timeout), log a message regardless of the current dma_copy_bytes state.

I found those messages extremely useful in many debugging sessions, so I prefer not to lose them ;)

@lyakh (Collaborator, Author):

@marcinszkudlinski ok, done that. It adds some work to the "good" case (reading the timer, checking the counter), take a look whether it's worth it.
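
For illustration, the extra work in the "good" case is roughly this (a sketch, not the exact diff):

/* data was copied, but still flush a pending "no bytes to copy" summary */
if (n_skipped && k_uptime_get() - nobytes_last_logged > SOF_MIN_NO_BYTES_INTERVAL_MS) {
	comp_info(dev, "%u 'no bytes to copy' events since the last report", n_skipped);
	n_skipped = 0;
}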

lyakh added 2 commits March 28, 2025 12:03
When tearing down streams, when some of the pipelines have already
been deleted and some are still active, the remaining active
pipelines might experience no-data conditions. Currently this is
logged on every LL-scheduler period, i.e. every millisecond. This
isn't adding any useful information and in fact can create a flood
of outgoing IPC notifications, eventually blocking valid IPC replies
and leading to IPC timeouts. Rate-limit these logging entries to fix
the problem and relax the log.

Signed-off-by: Guennadi Liakhovetski <guennadi.liakhovetski@linux.intel.com>
This restores commit 0e53393.

Signed-off-by: Guennadi Liakhovetski <guennadi.liakhovetski@linux.intel.com>
@marcinszkudlinski (Contributor):

Logging looks good now.
Still, I'm not convinced by moving so much to __cold; we do need close monitoring for perf regressions.
Anyway, no objections to merging for now.

@marcinszkudlinski (Contributor):

"close with comment" and "comment" buttons are close to each other ;)

@lyakh (Collaborator, Author) commented Mar 28, 2025

> Logging looks good now. Still, I'm not convinced by moving so much to __cold; we do need close monitoring for perf regressions. Anyway, no objections to merging for now.

Yes, perf monitoring would be good! At least we should be protected against accidentally moving hot paths to DRAM when #9907 is merged

@kv2019i (Collaborator) commented Mar 28, 2025

Build results for the Intel Jenkins build are fine (internal id 6814); the only failure shown is due to the accidental close/reopen. Proceeding with merge.

@kv2019i merged commit 8bf2d92 into thesofproject:main on Mar 28, 2025 (80 of 91 checks passed)
@lyakh deleted the ipc branch on March 28, 2025 at 14:51