
Conversation

@serhiy-katsyuba-intel (Contributor) commented Oct 17, 2025

Zephyr stores heap metadata just before each allocated chunk. So not only must the chunk start be aligned to cache line size, but the chunk size must also be aligned up to cache line size. This is to prevent metadata corruption in case an invalidate or writeback is called for a chunk -- it may corrupt metadata for the next chunk if it is located in the same cache line.
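
As a minimal sketch of the rule (ALIGN_UP and PLATFORM_DCACHE_ALIGN are the macros used in the diff below; the standalone definition and the 64-byte line size are assumptions for illustration):

	/* Round both the chunk start and the chunk size up to the cache line
	 * size, so Zephyr's per-chunk metadata always lands on a cache line
	 * that no allocation owns. Assumes power-of-two alignment. */
	#define ALIGN_UP(size, alignment) (((size) + (alignment) - 1) & ~((alignment) - 1))

	size_t chunk_size = ALIGN_UP(bytes, PLATFORM_DCACHE_ALIGN);
	/* e.g. bytes == 4 with 64-byte cache lines -> chunk_size == 64, so an
	 * invalidate/writeback of this chunk can only touch its own lines. */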

@serhiy-katsyuba-intel (Contributor, Author) commented Oct 17, 2025

Cached allocations already get this size ALIGN_UP() in heap_alloc_aligned_cached(). This PR adds the same for non-cached allocations.

This fixes sporadic CI failures on Google AEC mock test_001_06_rtc_arc_dmic_driver_topo.

@tmleman (Contributor) left a comment

As usual, a week of debugging and hours of analysis result in a one-line fix. Great job!

@kv2019i (Collaborator) left a comment

Thanks for catching this @serhiy-katsyuba-intel! Some suggested changes inline, but the fix is solid.

FYI @jsarha @lyakh -- Jyri, this explains why you saw issues when combining rmalloc()/rballoc(). rmalloc() had a bug, so if you routed more allocs to it, you started to hit this bug (which is difficult to debug, as you need a suitable alloc pattern for the metadata to get corrupted). UPDATE: this doesn't explain it after all -- rballoc() was changed in commit cc24c9c, and since then (in 2022) it has done the same cached/noncached split as rmalloc(), so this doesn't relate to the recent refactoring after all.

* allocations.
*/
alignment = MAX(PLATFORM_DCACHE_ALIGN, alignment);
bytes = ALIGN_UP(bytes, alignment);
@kv2019i (Collaborator) commented Oct 17, 2025

Wow @serhiy-katsyuba-intel, this is bad. I mean, applying the same rules was added in the initial implementation in commit 956a587 (2021). It seems rballoc() has been good, but the bug has been lurking in rmalloc(). Now, in the recent memory refactoring work, there have been commits like commit b7a1616 that replaced old rballoc() usage with framework calls that may now end up using rmalloc() instead, so the bug (which was always in rmalloc()) is now hit. UPDATE: hmm, in main the logic is the same in rballoc(), so this hypothesis w.r.t. rmalloc()/rballoc() does not hold.

Collaborator commented:

@serhiy-katsyuba-intel Can you move this whole segment:

	/*
	 * Zephyr sys_heap stores metadata at start of each
	 * heap allocation. To ensure no allocated cached buffer
	 * overlaps the same cacheline with the metadata chunk,
	 * align both allocation start and size of allocation
	 * to cacheline. As cached and non-cached allocations are
	 * mixed, same rules need to be followed for both type of
	 * allocations.
	 */
#ifdef CONFIG_SOF_ZEPHYR_HEAP_CACHED
	min_align = MAX(PLATFORM_DCACHE_ALIGN, min_align);
	bytes = ALIGN_UP(bytes, min_align);

.. and put it in heap_alloc_aligned(). No need to have this logic (and the comment) twice in the file, as these are both static functions. This also correctly handles what to do when CONFIG_SOF_ZEPHYR_HEAP_CACHED is not set.

@lyakh (Collaborator) commented:

This shouldn't be needed. heap_alloc_aligned_cached() already aligns both the size and the beginning of cached allocations, so cached and uncached allocations should already never cross cache-line boundaries. Since it fixes (or seems to fix) some failures, I'm reluctant to "ask for change", but it really doesn't seem correct to me.

Collaborator commented:

Right @lyakh, so the comment (in the code) is a bit misleading. The rule that the metadata chunk must not live on the same cacheline does hold for all allocations. It is misleading in that we enforce this rule by aligning up the cached allocation sizes in heap_alloc_aligned_cached(). So if you make a 4-byte cached alloc, we turn it into a 64-byte alloc on a 64-byte boundary -> the Zephyr heap cannot put any metadata on this cacheline as it is not free.
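
A worked example with 64-byte cache lines (the addresses are illustrative):

	size  = ALIGN_UP(4, 64) = 64
	chunk = [0x40, 0x80)	/* start aligned to a cache line */

	/* The whole line [0x40, 0x80) is owned by this allocation, so the
	 * heap cannot place any other chunk's metadata on it; a flush or
	 * invalidate of this buffer touches only its own data. */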

This does have the big caveat that if any user allocates a noncached heap object and then converts it to cached usage on their own, bad things will happen (the end of the object's allocation may share a cache line with heap metadata for an unrelated object, and a cache flush can corrupt it). This was one of the reasons for introducing the __sparse annotations in our interfaces: to catch these cases.
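
A hedged sketch of that anti-pattern (sys_cache_cached_ptr_get() and sys_cache_data_flush_range() are real Zephyr calls; the rmalloc() flag and sizes are assumptions for illustration):

	#include <zephyr/cache.h>

	void *p = rmalloc(SOF_MEM_FLAG_COHERENT, 4);	/* uncached allocation */
	void *pc = sys_cache_cached_ptr_get(p);		/* manual uncached->cached conversion */

	/* ... writes through the cached alias ... */

	/* The flush is rounded out to whole cache lines. Before this PR the
	 * uncached chunk size was not cache-line aligned, so the chunk's last
	 * line could also hold Zephyr's metadata for the next chunk, and the
	 * flush would clobber that metadata with stale cached bytes. */
	sys_cache_data_flush_range(pc, 4);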

@serhiy-katsyuba-intel does the above logic hold in the case you are observing with the RTC test case?

@serhiy-katsyuba-intel (Contributor, Author) commented:

This does have the big caveat that if any user allocates a noncached heap object and then converts it to cached usage on their own, bad things will happen (the end of the object's allocation may share a cache line with heap metadata for an unrelated object, and a cache flush can corrupt it). This was one of the reasons for introducing the __sparse annotations in our interfaces: to catch these cases.

What happens if someone allocates noncached memory and, without converting the address to cached, calls invalidate/writeback? I'm not sure we have checks in invalidate and writeback to verify that the address belongs to cached space, and I'm not sure whether an actual invalidate/writeback to memory will happen. Such a bug could easily be introduced by a mistake during refactoring.

In fact, I found one such bug, introduced by commit d519e7e at line 362: ring_buffer->_data_buffer was previously always cached; after that commit it became noncached when is_shared==true (a mistake during the refactoring). All invalidates/writebacks for ring_buffer->_data_buffer are still in the code.
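
An illustrative reconstruction of that kind of bug (not the actual ring_buffer code; the flag and field names are assumptions, while dcache_invalidate_region() is SOF's real cache API):

	/* refactoring quietly made the buffer uncached for the shared case */
	rb->_data_buffer = rmalloc(is_shared ? SOF_MEM_FLAG_COHERENT : SOF_MEM_FLAG_USER, size);

	/* ...but the cache ops from the cached-only days were left in place: */
	dcache_invalidate_region(rb->_data_buffer, size);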

Just in case, fixing ring_buffer->_data_buffer allocation does not help AEC CI problem. However, similar bugs could (and will :) ) be easily introduced as a mistake at refactoring. Protecting heat metadata from corruption might be an OK workaround to have. Especially as such bugs could trigger randomly then something unrelated just added to code.

Collaborator commented:

@serhiy-katsyuba-intel wrote:

What happens if someone allocates noncached memory and, without converting the address to cached, calls invalidate/writeback?

I was checking this today, and it should be safe. The cache line lookup is based on physical addresses, and uncached addresses will simply never have dirty lines in the cache. It's still wasted DSP cycles to do the invalidate/wb, but it should not corrupt the metadata. The fact that your fix for the ring_buffer->_data_buffer allocation did not fix the AEC CI problem is at least compatible with this analysis.

So it's still a bit of a mystery why this PR helps with the AEC case. I surveyed the codebase again, and we have limited the uncache-to-cache conversions very aggressively, so they are done in only a very few places.

Oh well, this PR is for sure safer as it protects against rogue cache ops.

@kv2019i requested review from jsarha and lyakh, October 17, 2025 10:06
@serhiy-katsyuba-intel changed the title from "zephyr: Fix Zephyr heap metadata corruption" to "zephyr: Protect Zephyr heap metadata from corruption", Oct 17, 2025
@serhiy-katsyuba-intel (Contributor, Author) commented:

Pushed a new version with the size alignment moved inside heap_alloc_aligned() and an updated commit description.

*/
#ifdef CONFIG_SOF_ZEPHYR_HEAP_CACHED
min_align = MAX(PLATFORM_DCACHE_ALIGN, min_align);
bytes = ALIGN_UP(bytes, min_align);
Collaborator commented:

I don't think we need to also align the size to the same value. If the user requests a page-aligned buffer, only the beginning should be page-aligned; the size can stay cache-line aligned. Yes, usually if you request a page-aligned buffer the size is also a multiple of the page size, but for correctness / cache management we only need cache-line alignment.

@serhiy-katsyuba-intel (Contributor, Author) commented:

Agreed. This was a copy-paste from the cached allocation implementation. Anyway, I pushed a new version with the size aligned to the cache line.
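
Presumably the final form then looks roughly like this (condensed from the diff above and the discussion; not the verbatim file contents):

#ifdef CONFIG_SOF_ZEPHYR_HEAP_CACHED
	/* chunk start: at least cache-line aligned, more if the caller
	 * asked for a larger alignment (e.g. a page) */
	min_align = MAX(PLATFORM_DCACHE_ALIGN, min_align);
	/* chunk size: cache-line alignment is enough to keep the next
	 * chunk's metadata off this chunk's last cache line; no need to
	 * round the size up to the larger alignment as well */
	bytes = ALIGN_UP(bytes, PLATFORM_DCACHE_ALIGN);
#endif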

Zephyr stores heap metadata just before each allocated chunk. This change
ensures metadata for each chunk is stored in its own separate cache line.
So if invalidate/writeback is mistakenly called for non-cached memory,
metadata of a neighboring chunk does not get corrupted.

We already have such size alignment constraints implemented for cached
allocations; this change adds the same size alignment for non-cached
allocations.

This is a workaround for potential problems caused by invalidate/writeback
calls for non-cached memory, which are wrong and should never happen in
well-written code. However, such problems could be easily introduced and
are quite hard to debug.

The trade-off -- we waste an additional cache line for each chunk's
metadata for non-cached allocations (as we already do for cached
allocations).

Signed-off-by: Serhiy Katsyuba <serhiy.katsyuba@intel.com>
@kv2019i (Collaborator) left a comment

Thanks. So in the end, this was a smaller issue. But seems the consensus is to take the more conservative approach, to protect against such issues in the future. I'm ok with this approach as well. We may now revisit the discussions to optimize the heap w.r.t. SRAM usage.

@lgirdwood merged commit 0b37f25 into thesofproject:main, Oct 19, 2025. 38 of 45 checks passed.