
Conversation

@serhiy-katsyuba-intel (Contributor) commented Oct 17, 2025

Zephyr stores heap metadata just before each allocated chunk. So not only must the chunk start be aligned to cache line size, but the chunk size must also be aligned up to cache line size. This is to prevent metadata corruption in case an invalidate or writeback is called for a chunk -- it may corrupt metadata for the next chunk if it is located in the same cache line.
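
As a minimal sketch of the rule (ALIGN_UP and PLATFORM_DCACHE_ALIGN are the macros used in the diff below; the standalone definition and the 64-byte line size are assumptions for illustration):

	/* Round both the chunk start and the chunk size up to the cache line
	 * size, so Zephyr's per-chunk metadata always lands on a cache line
	 * that no allocation owns. Assumes power-of-two alignment. */
	#define ALIGN_UP(size, alignment) (((size) + (alignment) - 1) & ~((alignment) - 1))

	size_t chunk_size = ALIGN_UP(bytes, PLATFORM_DCACHE_ALIGN);
	/* e.g. bytes == 4 with 64-byte cache lines -> chunk_size == 64, so an
	 * invalidate/writeback of this chunk can only touch its own lines. */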

@serhiy-katsyuba-intel (Contributor, Author) commented Oct 17, 2025

Cached allocations already get this size ALIGN_UP() in heap_alloc_aligned_cached(). This PR adds the same for non-cached allocations.

This fixes sporadic CI failures on Google AEC mock test_001_06_rtc_arc_dmic_driver_topo.

@tmleman (Contributor) left a comment

As usual, a week of debugging and hours of analysis result in a one-line fix. Great job!

@kv2019i (Collaborator) left a comment

Thanks for catching this @serhiy-katsyuba-intel! Some suggested changes inline, but the fix is solid.

FYI @jsarha @lyakh -- Jyri, this explains why you saw issues when combining rmalloc()/rballoc(). rmalloc() had a bug, so if you routed more allocs to it, you started to hit this bug (which is difficult to debug, as you need a suitable alloc pattern for the metadata to get corrupted). UPDATE: this doesn't explain it after all -- rballoc() was changed in commit cc24c9c, and since then (in 2022) it has done the same cached/noncached split as rmalloc(), so this doesn't relate to the recent refactoring after all.

* allocations.
*/
alignment = MAX(PLATFORM_DCACHE_ALIGN, alignment);
bytes = ALIGN_UP(bytes, alignment);
@kv2019i (Collaborator) commented Oct 17, 2025

Wow @serhiy-katsyuba-intel, this is bad. I mean, applying the same rules was added in the initial implementation in commit 956a587 (2021). It seems rballoc() has been good, but the bug has been lurking in rmalloc(). Now, in the recent memory refactoring work, there have been commits like commit b7a1616 that replaced old rballoc() usage with framework calls that may now end up using rmalloc() instead, so the bug (which was always in rmalloc()) is now hit. UPDATE: hmm, in main the logic is the same in rballoc(), so this hypothesis w.r.t. rmalloc()/rballoc() does not hold.

Collaborator commented:

@serhiy-katsyuba-intel Can you move this whole segment:

	/*
	 * Zephyr sys_heap stores metadata at start of each
	 * heap allocation. To ensure no allocated cached buffer
	 * overlaps the same cacheline with the metadata chunk,
	 * align both allocation start and size of allocation
	 * to cacheline. As cached and non-cached allocations are
	 * mixed, same rules need to be followed for both type of
	 * allocations.
	 */
#ifdef CONFIG_SOF_ZEPHYR_HEAP_CACHED
	min_align = MAX(PLATFORM_DCACHE_ALIGN, min_align);
	bytes = ALIGN_UP(bytes, min_align);

.. and put it in heap_alloc_aligned(). No need to have this logic (and the comment) twice in the file, as these are both static functions. This also correctly handles what to do when CONFIG_SOF_ZEPHYR_HEAP_CACHED is not set.

@lyakh (Collaborator) commented:

This shouldn't be needed. heap_alloc_aligned_cached() already aligns both the size and the beginning of cached allocations, so cached and uncached allocations should already never cross cache-line boundaries. Since it fixes (or seems to fix) some failures, I'm reluctant to "ask for change", but it really doesn't seem correct to me.

Collaborator commented:

Right @lyakh, so the comment (in the code) is a bit misleading. The rule that the metadata chunk must not live on the same cacheline does hold for all allocations. It is misleading in that we enforce this rule by aligning up the cached allocation sizes in heap_alloc_aligned_cached(). So if you make a 4-byte cached alloc, we turn it into a 64-byte alloc on a 64-byte boundary -> the Zephyr heap cannot put any metadata on this cacheline as it is not free.
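
A worked example with 64-byte cache lines (the addresses are illustrative):

	size  = ALIGN_UP(4, 64) = 64
	chunk = [0x40, 0x80)	/* start aligned to a cache line */

	/* The whole line [0x40, 0x80) is owned by this allocation, so the
	 * heap cannot place any other chunk's metadata on it; a flush or
	 * invalidate of this buffer touches only its own data. */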

This does have the big caveat that if any user allocates a noncached heap object and then converts it to cached usage on their own, bad things will happen (the end of the object's allocation may share a cache line with heap metadata for an unrelated object, and a cache flush can corrupt it). This was one of the reasons for introducing the __sparse annotations in our interfaces: to catch these cases.
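
A hedged sketch of that anti-pattern (sys_cache_cached_ptr_get() and sys_cache_data_flush_range() are real Zephyr calls; the rmalloc() flag and sizes are assumptions for illustration):

	#include <zephyr/cache.h>

	void *p = rmalloc(SOF_MEM_FLAG_COHERENT, 4);	/* uncached allocation */
	void *pc = sys_cache_cached_ptr_get(p);		/* manual uncached->cached conversion */

	/* ... writes through the cached alias ... */

	/* The flush is rounded out to whole cache lines. Before this PR the
	 * uncached chunk size was not cache-line aligned, so the chunk's last
	 * line could also hold Zephyr's metadata for the next chunk, and the
	 * flush would clobber that metadata with stale cached bytes. */
	sys_cache_data_flush_range(pc, 4);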

@serhiy-katsyuba-intel does the above logic hold in the case you are observing with the RTC test case?

@serhiy-katsyuba-intel (Contributor, Author) commented:

This does have the big caveat that if any user allocates a noncached heap object and then converts it to cached usage on their own, bad things will happen (the end of the object's allocation may share a cache line with heap metadata for an unrelated object, and a cache flush can corrupt it). This was one of the reasons for introducing the __sparse annotations in our interfaces: to catch these cases.

What happens if someone allocates noncached memory and, without converting the address to cached, calls invalidate/writeback? I'm not sure we have checks in invalidate and writeback to verify that the address belongs to cached space, and I'm not sure whether an actual invalidate/writeback to memory will happen. Such a bug could easily be introduced by a mistake during refactoring.

In fact, I found one such bug, introduced by commit d519e7e at line 362: ring_buffer->_data_buffer was previously always cached; after that commit it became noncached when is_shared==true (a mistake during the refactoring). All invalidates/writebacks for ring_buffer->_data_buffer are still in the code.
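
An illustrative reconstruction of that kind of bug (not the actual ring_buffer code; the flag and field names are assumptions, while dcache_invalidate_region() is SOF's real cache API):

	/* refactoring quietly made the buffer uncached for the shared case */
	rb->_data_buffer = rmalloc(is_shared ? SOF_MEM_FLAG_COHERENT : SOF_MEM_FLAG_USER, size);

	/* ...but the cache ops from the cached-only days were left in place: */
	dcache_invalidate_region(rb->_data_buffer, size);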

Just in case, fixing ring_buffer->_data_buffer allocation does not help AEC CI problem. However, similar bugs could (and will :) ) be easily introduced as a mistake at refactoring. Protecting heat metadata from corruption might be an OK workaround to have. Especially as such bugs could trigger randomly then something unrelated just added to code.

Collaborator commented:

@serhiy-katsyuba-intel wrote:

What happens if someone allocates noncached memory and, without converting the address to cached, calls invalidate/writeback?

I was checking this today, and it should be safe. The cache line lookup is based on physical addresses, and uncached addresses will simply never have dirty lines in the cache. It's still wasted DSP cycles to do the invalidate/wb, but it should not corrupt the metadata. The fact that your fix for the ring_buffer->_data_buffer allocation did not fix the AEC CI problem is at least compatible with this analysis.

So it's still a bit of a mystery why this PR helps with the AEC case. I surveyed the codebase again, and we have limited the uncache-to-cache conversions very aggressively, so they are done in only a very few places.

Oh well, this PR is for sure safer as it protects against rogue cache ops.

@kv2019i requested review from jsarha and lyakh, October 17, 2025 10:06
@serhiy-katsyuba-intel changed the title from "zephyr: Fix Zephyr heap metadata corruption" to "zephyr: Protect Zephyr heap metadata from corruption", Oct 17, 2025
@serhiy-katsyuba-intel (Contributor, Author) commented:

Pushed a new version with the size alignment moved inside heap_alloc_aligned() and an updated commit description.

*/
#ifdef CONFIG_SOF_ZEPHYR_HEAP_CACHED
min_align = MAX(PLATFORM_DCACHE_ALIGN, min_align);
bytes = ALIGN_UP(bytes, min_align);
Collaborator commented:

I don't think we need to also align the size to the same value. If the user requests a page-aligned buffer, only the beginning should be page-aligned; the size can stay cache-line aligned. Yes, usually if you request a page-aligned buffer the size is also a multiple of the page size, but for correctness / cache management we only need cache-line alignment.

@serhiy-katsyuba-intel (Contributor, Author) commented:

Agreed. This was a copy-paste from the cached allocation implementation. Anyway, I pushed a new version with the size aligned to the cache line.
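
Presumably the final form then looks roughly like this (condensed from the diff above and the discussion; not the verbatim file contents):

#ifdef CONFIG_SOF_ZEPHYR_HEAP_CACHED
	/* chunk start: at least cache-line aligned, more if the caller
	 * asked for a larger alignment (e.g. a page) */
	min_align = MAX(PLATFORM_DCACHE_ALIGN, min_align);
	/* chunk size: cache-line alignment is enough to keep the next
	 * chunk's metadata off this chunk's last cache line; no need to
	 * round the size up to the larger alignment as well */
	bytes = ALIGN_UP(bytes, PLATFORM_DCACHE_ALIGN);
#endif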

Zephyr stores heap metadata just before each allocated chunk. This change
ensures metadata for each chunk is stored in its own separate cache line.
So if invalidate/writeback is mistakenly called for non-cached memory,
metadata of a neighboring chunk does not get corrupted.

We already have such size alignment constraints implemented for cached
allocations; this change adds the same size alignment for non-cached
allocations.

This is a workaround for potential problems caused by invalidate/writeback
calls for non-cached memory, which are wrong and should never happen in
well-written code. However, such problems could be easily introduced and
are quite hard to debug.

The trade-off -- we waste an additional cache line for each chunk's
metadata for non-cached allocations (as we already do for cached
allocations).

Signed-off-by: Serhiy Katsyuba <serhiy.katsyuba@intel.com>
@kv2019i (Collaborator) left a comment

Thanks. So in the end, this was a smaller issue. But seems the consensus is to take the more conservative approach, to protect against such issues in the future. I'm ok with this approach as well. We may now revisit the discussions to optimize the heap w.r.t. SRAM usage.

@lgirdwood merged commit 0b37f25 into thesofproject:main, Oct 19, 2025. 38 of 45 checks passed.