Conversation

@yy214123 yy214123 commented Oct 29, 2025

Extend the existing architecture to cache the last fetched PC instruction, improving the instruction fetch hit rate by approximately 2%, based on the performance metrics introduced in #99 (MMU Cache Statistics).

=== MMU Cache Statistics ===

Hart 0:
   Fetch: 443724828 hits,    7671580 misses (98.30% hit rate)
   Load:   65003296 hits,   34592372 misses (2-way) (65.27% hit rate)
   Store:  59967516 hits,   11483892 misses (83.93% hit rate)

Also includes clang-format fixes for several expressions.


Summary by cubic

Implemented a direct-mapped instruction fetch cache with a small victim cache to reduce conflict misses and speed up fetches. Improves instruction fetch hit rate by ~2% and reduces fetch translations by ~10%, with detailed MMU fetch stats for icache/vcache/TLB.

  • New Features

    • Direct-mapped I-cache (256 blocks × 256 B).
    • 16-block victim cache with swap-on-hit to mitigate conflict misses.
    • I-cache invalidation on MMU invalidate.
    • 2-entry direct-mapped fetch page cache with parity hash to avoid thrashing.
  • Refactors

    • Documented I-cache macros/masks and renamed IC/VC prefixes to ICACHE/VCACHE.
    • Fixed tag calculation, victim-cache fill, and hit/miss accounting in mmu_fetch.
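As a rough illustration of the fetch path these bullets describe (direct-mapped probe first, then victim-cache probe with swap-on-hit), the following sketch uses invented names and an illustrative tag layout; it is not the PR's actual code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative parameters matching the PR description:
 * 256 blocks x 256 B direct-mapped I-cache, 16-block victim cache. */
#define ICACHE_BLOCKS 256
#define ICACHE_BLOCK_SIZE 256
#define ICACHE_OFFSET_BITS 8 /* log2(ICACHE_BLOCK_SIZE) */
#define ICACHE_INDEX_BITS 8  /* log2(ICACHE_BLOCKS) */
#define VCACHE_BLOCKS 16

typedef struct {
    bool valid;
    uint32_t tag; /* I-cache: upper tag; victim cache: [tag | index] */
    uint8_t data[ICACHE_BLOCK_SIZE];
} icache_block_t;

typedef struct {
    icache_block_t i_block[ICACHE_BLOCKS];
    icache_block_t v_block[VCACHE_BLOCKS];
} icache_t;

/* Look up addr; on an I-cache miss, probe the victim cache and swap
 * on hit. Returns the hit block, or NULL on a miss in both levels
 * (the caller would then translate/fetch and fill). */
static icache_block_t *icache_lookup(icache_t *c, uint32_t addr)
{
    uint32_t idx = (addr >> ICACHE_OFFSET_BITS) & (ICACHE_BLOCKS - 1);
    uint32_t tag = addr >> (ICACHE_OFFSET_BITS + ICACHE_INDEX_BITS);
    icache_block_t *blk = &c->i_block[idx];

    if (blk->valid && blk->tag == tag)
        return blk; /* I-cache hit */

    /* Victim cache is searched by the wider [tag | index] identifier */
    uint32_t vtag = addr >> ICACHE_OFFSET_BITS;
    for (int i = 0; i < VCACHE_BLOCKS; i++) {
        icache_block_t *vblk = &c->v_block[i];
        if (vblk->valid && vblk->tag == vtag) {
            /* Swap the victim block into the I-cache slot */
            icache_block_t tmp = *blk;
            *blk = *vblk;
            blk->tag = tag; /* narrow I-cache tag */
            *vblk = tmp;
            if (tmp.valid) /* widen the demoted block's tag back */
                vblk->tag = (tmp.tag << ICACHE_INDEX_BITS) | idx;
            return blk;
        }
    }
    return NULL; /* miss in both levels */
}
```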

Written for commit bc84f9b. Summary will update automatically on new commits.

@yy214123 yy214123 marked this pull request as draft October 29, 2025 18:56
@yy214123 yy214123 force-pushed the direct-mapped-cache branch 2 times, most recently from 98114a7 to 0e4f67b Compare October 30, 2025 05:32
@jserv

jserv commented Oct 30, 2025

Consider a 2-way page cache: given that the current page-level cache already achieves a 98.30% hit rate, consider simply upgrading it instead:

    // Current: single-entry page cache
    mmu_fetch_cache_t cache_fetch;

    // Proposed: 2-way set-associative page cache
    mmu_fetch_cache_t cache_fetch[2];  // Only +16 bytes overhead

Use a parity hash (like the load/store caches):

    uint32_t idx = __builtin_parity(vpn) & 0x1;
    if (unlikely(vpn != vm->cache_fetch[idx].n_pages)) {
        // ... fill
    }

@yy214123 yy214123 force-pushed the direct-mapped-cache branch 2 times, most recently from bb9e6cb to 74e3b99 Compare October 30, 2025 19:24
@yy214123 yy214123 closed this Oct 31, 2025
@yy214123 yy214123 reopened this Oct 31, 2025
@yy214123 yy214123 force-pushed the direct-mapped-cache branch from 686cede to 5478710 Compare November 1, 2025 06:40
@yy214123 yy214123 requested a review from jserv November 2, 2025 09:08

@jserv jserv left a comment


Unify the naming scheme.

@yy214123 yy214123 force-pushed the direct-mapped-cache branch from fe2d95e to f657fb2 Compare November 4, 2025 16:55

@jserv jserv left a comment


Rebase the latest 'master' branch and resolve build errors.

@yy214123 yy214123 force-pushed the direct-mapped-cache branch from aab465c to db3a37f Compare November 4, 2025 18:20
@sysprog21 sysprog21 deleted a comment from yy214123 Nov 4, 2025

@jserv jserv left a comment


Squash commits and refine commit message.


@visitorckw visitorckw left a comment


This series appears to contain several "fix-up," "refactor," or "build-fix" commits that correct or adjust a preceding patch.

To maintain a clean history and ensure the project is bisectable, each patch in a series should be complete and correct on its own.

@visitorckw

As a friendly reminder regarding project communication:

Please ensure that when you quote-reply to others' comments, you do not translate the quoted text into any language other than English.

This is an open-source project, and it's important that we keep all discussions in English. This ensures that the conversation remains accessible to everyone in the community, including current and future participants who may not be familiar with other languages.

icache_block_t tmp = *blk;
*blk = *vblk;
*vblk = tmp;
blk->tag = tag;
Collaborator


This code looks suspicious to me.

When you move the evicted I-cache block (tmp) back into the victim cache, you are setting the vblk->tag to tmp.tag, which is the 16-bit I-cache tag.

Won't this corrupt the victim cache entry? The VC search logic requires a 24-bit tag ([ICache Tag | ICache Index]) to function. Because you're only storing the 16-bit tag, this VCache entry will never be hit again.

Author


Won't this corrupt the victim cache entry? The VC search logic requires a 24-bit tag ([ICache Tag | ICache Index]) to function. Because you're only storing the 16-bit tag, this VCache entry will never be hit again.

Thank you for pointing that out. I've added the following expression to ensure correctness:

+   vblk->tag = (tmp.tag << ICACHE_INDEX_BITS) | idx;
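Put together, the corrected swap-on-hit sequence looks roughly like this (the types and helper name are illustrative; only the final tag-widening line reflects the actual fix):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ICACHE_INDEX_BITS 8 /* log2 of the 256-block I-cache */

typedef struct {
    bool valid;
    uint32_t tag;
} icache_block_t;

/* On a victim-cache hit: promote vblk into the I-cache slot and demote
 * the evicted block, widening its 16-bit I-cache tag back to the 24-bit
 * victim-cache tag by appending the set index. */
static void vcache_swap(icache_block_t *blk, icache_block_t *vblk,
                        uint32_t tag, uint32_t idx)
{
    icache_block_t tmp = *blk;
    *blk = *vblk;
    blk->tag = tag; /* narrow I-cache tag for the promoted block */
    *vblk = tmp;
    vblk->tag = (tmp.tag << ICACHE_INDEX_BITS) | idx; /* the fix */
}
```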

@yy214123 yy214123 force-pushed the direct-mapped-cache branch from db3a37f to c6485fc Compare November 6, 2025 09:56
@yy214123

yy214123 commented Nov 6, 2025

...

To maintain a clean history and ensure the project is bisectable, each patch in a series should be complete and correct on its own.

I've finished cleaning up the commits and split them into three functionally independent patches.

Please ensure that when you quote-reply to others' comments, you do not translate the quoted text into any language other than English.

Thank you for pointing that out. I've corrected the mistake and will be more careful in the future.

@yy214123 yy214123 requested review from jserv and visitorckw November 6, 2025 10:07
@yy214123 yy214123 force-pushed the direct-mapped-cache branch from c6485fc to f270dca Compare November 6, 2025 18:16
riscv.h Outdated
* - index bits = log2(ICACHE_BLOCKS)
*
* For power-of-two values, log2(x) equals the number of trailing zero bits in
* x. Therefore, we use __builtin_ctz(x) (count trailing zeros) to compute these

@jserv jserv Nov 6, 2025


It is not worthwhile to invoke __builtin_ctz for constants.

Author


It was designed this way because I initially needed to experiment with various cache-size combinations.
Do you mean it would be better to fill in the log2 values directly?

-   #define ICACHE_OFFSET_BITS (__builtin_ctz((ICACHE_BLOCKS_SIZE)))
-   #define ICACHE_INDEX_BITS (__builtin_ctz((ICACHE_BLOCKS)))
+   #define ICACHE_OFFSET_BITS  8
+   #define ICACHE_INDEX_BITS 8

Collaborator


Shouldn't clz be used here instead of ctz?

Additionally, I don't think the logic for calculating ilog2 and its relationship with clz belongs in riscv.h. IMHO, this isn't RISC-V specific and should be placed in a more general helper file.

Furthermore, if we were to introduce such a helper, we shouldn't rely solely on the GNU extensions __builtin_clz()/__builtin_ctz(). We would also need to provide a fallback implementation for compilers that don't support them to avoid compatibility issues.

However, since this is just a pre-computable value in this context, I don't believe it's necessary to introduce such a helper at all.
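If the log2 values end up hand-written, a compile-time check could still support the cache-size experiments mentioned earlier without runtime cost (an editor's sketch; macro names follow the diff above):

```c
/* Values follow the 256-block x 256 B configuration in this PR;
 * macro names follow the diff above. */
#define ICACHE_BLOCKS_SIZE 256 /* block size in bytes */
#define ICACHE_BLOCKS 256      /* number of blocks */
#define ICACHE_OFFSET_BITS 8   /* log2(ICACHE_BLOCKS_SIZE) */
#define ICACHE_INDEX_BITS 8    /* log2(ICACHE_BLOCKS) */

/* Fail the build if a size and its hand-written log2 ever diverge */
_Static_assert((1u << ICACHE_OFFSET_BITS) == ICACHE_BLOCKS_SIZE,
               "ICACHE_OFFSET_BITS must equal log2(ICACHE_BLOCKS_SIZE)");
_Static_assert((1u << ICACHE_INDEX_BITS) == ICACHE_BLOCKS,
               "ICACHE_INDEX_BITS must equal log2(ICACHE_BLOCKS)");
```

_Static_assert is C11; on older compilers a negative-array-size trick would serve the same purpose.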

Author


Shouldn't clz be used here instead of ctz?

I only considered the case where the value is a power of two. For non–power-of-two values, clz should indeed be used.

We would also need to provide a fallback implementation for compilers that don't support them to avoid compatibility issues.

Thanks for pointing that out. After reviewing the code, I noticed that __builtin_parity already accounts for this situation.
I’ll be more careful when using similar compiler extensions in the future.

#if !defined(__GNUC__) && !defined(__clang__)
/* Portable parity implementation for non-GCC/Clang compilers */
static inline unsigned int __builtin_parity(unsigned int x)
{
    x ^= x >> 16;
    x ^= x >> 8;
    x ^= x >> 4;
    x ^= x >> 2;
    x ^= x >> 1;
    return x & 1;
}
#endif

Additionally, I don't think the logic for calculating ilog2 and its relationship with clz belongs in riscv.h. IMHO, this isn't RISC-V specific and should be placed in a more general helper file.

I’ve already defined it as a constant.

On a related note, I’d like to ask for your advice — I currently place the structures for the i-cache and victim cache, along with their related #defines, inside riscv.h.

Would you suggest separating these parts into a new header file instead?

Collaborator


Rather than discussing which header they should go into, I’m more interested in knowing how likely it is that these will be shared with other C files in the future. If the probability is low, should they just be placed directly in riscv.c instead of a header?

Author


But placing these definitions in riscv.c would create a compilation error,
because icache_t is a member of struct __hart_internal:

struct __hart_internal {
    icache_t icache;
    ...
};

Collaborator


Sorry, I missed that.
I don't have a strong opinion on this, so both options are fine with me.

@yy214123 yy214123 force-pushed the direct-mapped-cache branch from f270dca to 5e9504b Compare November 8, 2025 15:46
Extend the existing architecture to include a direct-mapped
instruction cache that stores recently fetched instructions.

Add related constants and macros for cache size and address fields.
@yy214123 yy214123 force-pushed the direct-mapped-cache branch 2 times, most recently from e2527ab to 1ff3c37 Compare November 8, 2025 18:34
Replace the previous 1-entry direct-mapped design with a 2-entry
direct-mapped cache using hash-based indexing (same parity hash as
cache_load). This allows two hot virtual pages to coexist without
thrashing.

Measurement shows that the number of virtual-to-physical translations
during instruction fetch (mmu_translate() calls) decreased by ~10%.

Introduce a small victim cache to reduce conflict misses
in the direct-mapped instruction cache. On an I-cache miss,
probe the victim cache; on hit, swap the victim block with
the current I-cache block and return the data.

Measurement shows that the number of virtual-to-physical translations
during instruction fetch (mmu_translate() calls) decreased by ~8%.
@yy214123 yy214123 force-pushed the direct-mapped-cache branch from 1ff3c37 to efde8b1 Compare November 11, 2025 07:55

@cubic-dev-ai cubic-dev-ai bot left a comment


2 issues found across 3 files

<file name="main.c">

<violation number="1" location="main.c:1028">
`cache_fetch` only has two entries (indices 0 and 1), so reading index 2 here is undefined behavior. Please sum entries 0 and 1 instead.</violation>
</file>

<file name="riscv.c">

<violation number="1" location="riscv.c:238">
mmu_invalidate_range now needs to flush the new instruction cache as well; otherwise pages whose mappings change keep returning stale instructions from vm->icache.</violation>
</file>


if (vm->cache_fetch.n_pages >= start_vpn &&
vm->cache_fetch.n_pages <= end_vpn)
vm->cache_fetch.n_pages = 0xFFFFFFFF;
for (int i = 0; i < 2; i++) {

@cubic-dev-ai cubic-dev-ai bot Nov 11, 2025


mmu_invalidate_range now needs to flush the new instruction cache as well; otherwise pages whose mappings change keep returning stale instructions from vm->icache.


<file context>
@@ -227,9 +235,11 @@ void mmu_invalidate_range(hart_t *vm, uint32_t start_addr, uint32_t size)
-    if (vm->cache_fetch.n_pages >= start_vpn &&
-        vm->cache_fetch.n_pages <= end_vpn)
-        vm->cache_fetch.n_pages = 0xFFFFFFFF;
+    for (int i = 0; i < 2; i++) {
+        if (vm->cache_fetch[i].n_pages >= start_vpn &&
+            vm->cache_fetch[i].n_pages <= end_vpn)
</file context>

Author


I’m a bit confused about this.
The current implementation is as follows:

/* Cache invalidation for fetch cache */
for (int i = 0; i < 2; i++) {
    if (vm->cache_fetch[i].n_pages >= start_vpn &&
        vm->cache_fetch[i].n_pages <= end_vpn)
        vm->cache_fetch[i].n_pages = 0xFFFFFFFF;
}



@yy214123 You're right to be confused. The code snippet you showed only handles invalidating the page translation caches (cache_fetch), but the comment was pointing out a bigger gap: the actual instruction cache blocks (vm->icache.i_block and vm->icache.v_block) aren't being touched at all.

Right now, when you call mmu_invalidate_range, it only invalidates the 2-entry page cache entries that fall in your VPN range. But if the instruction cache already holds translated/cached instruction blocks from those pages (which it does after fetches), they'll keep serving stale data even after the page mappings change—until those specific cache lines get evicted naturally by future misses.

To fix it properly, you need to add logic in mmu_invalidate_range to scan the I-cache and victim cache for blocks whose tags correspond to the invalidated address range, and mark those valid = false. Something like:

/* Invalidate I-cache blocks in the address range */
uint32_t start_block_addr = (start_addr >> ICACHE_OFFSET_BITS) << ICACHE_OFFSET_BITS;
uint32_t end_block_addr = ((end_addr + ICACHE_BLOCK_SIZE - 1) >> ICACHE_OFFSET_BITS) << ICACHE_OFFSET_BITS;

for (uint32_t block_idx = 0; block_idx < ICACHE_BLOCKS; block_idx++) {
    uint32_t block_base = (block_idx << ICACHE_OFFSET_BITS) | (vm->icache.i_block[block_idx].tag << (ICACHE_OFFSET_BITS + ICACHE_INDEX_BITS));
    if (block_base >= start_block_addr && block_base <= end_block_addr && vm->icache.i_block[block_idx].valid) {
        vm->icache.i_block[block_idx].valid = false;
    }
}

/* Similar loop for victim cache - check if v_block[i].tag matches any address in range */
for (int i = 0; i < VCACHE_BLOCKS; i++) {
    if (vm->icache.v_block[i].valid) {
        uint32_t v_block_addr = vm->icache.v_block[i].tag << ICACHE_OFFSET_BITS;
        if (v_block_addr >= start_block_addr && v_block_addr <= end_block_addr) {
            vm->icache.v_block[i].valid = false;
        }
    }
}

This is a bit more involved than the full icache_invalidate_all (which you already call in mmu_invalidate), but it's necessary for range-based invalidation to work correctly without flushing the entire cache unnecessarily. The tag math might need tweaking based on how you're storing virtual vs. physical addresses in the blocks—double-check that the reconstructed address comparison works as expected.

Add detailed MMU cache statistics for fetch path

Introduce fine-grained counters to replace the previous aggregated
fetch hit/miss statistics. The new metrics track:
- total fetches
- icache hits/misses
- victim cache hits/misses
- TLB hits/misses

The old MMU statistics only distinguished total fetch hits/misses,
which no longer reflect real performance after the addition of
icache and victim cache layers. This change enables more accurate
profiling and debugging.

This provides a clearer view of instruction fetch performance and helps
identify which cache level contributes most to stalls or misses.
@yy214123

Found that the victim cache implementation still has significant room for improvement:

=== MMU Cache Statistics ===

=== Introduction Cache Statistics ===
  Total access:     626471945
  Icache hits:      620281067 (99.01%)
  Icache misses:      6190878 (0.99%)
   ├ Vcache hits:      405962 (6.56% of Icache misses)
   └ Vcache misses:   5784916 (93.44% of Icache misses)
      ├ TLB hits:     2797259 (48.35%)
      └ TLB misses:   2987657 (51.65%)

=== Data Cache Statistics ===
  Load:     137431399 hits,      3261925 misses (8x2) (97.68% hit rate)
  Store:     95309605 hits,      1269149 misses (8x2) (98.69% hit rate)
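These counters nest: every fetch hits or misses the I-cache, every I-cache miss falls through to the victim cache, and every victim-cache miss goes to the TLB. A quick consistency check on the figures above (the helper name is illustrative):

```c
#include <stdbool.h>
#include <stdint.h>

/* Check that each counter level sums exactly to its parent level */
static bool fetch_stats_consistent(uint64_t total,
                                   uint64_t ic_hit, uint64_t ic_miss,
                                   uint64_t vc_hit, uint64_t vc_miss,
                                   uint64_t tlb_hit, uint64_t tlb_miss)
{
    return total == ic_hit + ic_miss &&   /* fetches split at the I-cache */
           ic_miss == vc_hit + vc_miss && /* misses fall through to the VC */
           vc_miss == tlb_hit + tlb_miss; /* VC misses go to the TLB */
}
```

The quoted Hart figures satisfy all three identities, so the new accounting is internally consistent.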

@sysprog21 sysprog21 deleted a comment from cubic-dev-ai bot Nov 12, 2025
@sysprog21 sysprog21 deleted a comment from cubic-dev-ai bot Nov 12, 2025
@jserv

jserv commented Nov 12, 2025

Found that the victim cache implementation still has significant room for improvement:

How do you plan to proceed?


@cubic-dev-ai cubic-dev-ai bot left a comment


3 issues found across 3 files

<file name="main.c">

<violation number="1" location="main.c:1076">
Guard the I-cache percentage calculations against access_total==0 before dividing so idle harts don’t emit NaN/inf.</violation>

<violation number="2" location="main.c:1083">
Add a zero-miss guard before dividing by fetch_misses_icache so that perfect-icache runs don’t emit NaN/inf.</violation>

<violation number="3" location="main.c:1092">
Check that the TLB total is non-zero before computing the percentage to avoid NaN/inf when no TLB accesses occurred.</violation>
</file>
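All three findings share one root cause: an unguarded division. A small helper along these lines (a sketch; the patch may guard each call site differently) would fix them uniformly:

```c
/* Percentage that degrades to 0.0 instead of NaN/inf when the
 * denominator is zero (idle hart, perfect I-cache, no TLB accesses) */
static double pct(unsigned long long part, unsigned long long whole)
{
    return whole ? (part * 100.0) / whole : 0.0;
}
```

Each fprintf would then call, e.g., pct(fetch_hits_icache, access_total) instead of dividing inline.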


(fetch_misses_vcache * 100.0) / access_total);
fprintf(stderr, " ├ TLB hits: %4llu (%.2f%%)\n",
fetch_hits_tlb,
(fetch_hits_tlb * 100.0) / (fetch_hits_tlb + fetch_misses_tlb));

@cubic-dev-ai cubic-dev-ai bot Nov 12, 2025


Check that the TLB total is non-zero before computing the percentage to avoid NaN/inf when no TLB accesses occurred.


<file context>
@@ -1047,14 +1069,32 @@ static void print_mmu_cache_stats(vm_t *vm)
+                (fetch_misses_vcache * 100.0) / access_total);
+        fprintf(stderr, "      ├ TLB hits:     %4llu (%.2f%%)\n",
+                fetch_hits_tlb,
+                (fetch_hits_tlb * 100.0) / (fetch_hits_tlb + fetch_misses_tlb));
+        fprintf(
+            stderr, "      └ TLB misses:   %4llu (%.2f%%)\n", fetch_misses_tlb,
</file context>

fprintf(stderr,
" ├ Vcache hits: %8llu (%.2f%% of Icache misses)\n",
fetch_hits_vcache,
(fetch_hits_vcache * 100.0) / fetch_misses_icache,

@cubic-dev-ai cubic-dev-ai bot Nov 12, 2025


Add a zero-miss guard before dividing by fetch_misses_icache so that perfect-icache runs don’t emit NaN/inf.


<file context>
@@ -1047,14 +1069,32 @@ static void print_mmu_cache_stats(vm_t *vm)
+        fprintf(stderr,
+                "   ├ Vcache hits:    %8llu (%.2f%% of Icache misses)\n",
+                fetch_hits_vcache,
+                (fetch_hits_vcache * 100.0) / fetch_misses_icache,
+                (fetch_hits_vcache * 100.0) / access_total);
+        fprintf(stderr,
</file context>

fprintf(stderr, "\n=== Introduction Cache Statistics ===\n");
fprintf(stderr, " Total access: %12llu\n", access_total);
fprintf(stderr, " Icache hits: %12llu (%.2f%%)\n", fetch_hits_icache,
(fetch_hits_icache * 100.0) / access_total);

@cubic-dev-ai cubic-dev-ai bot Nov 12, 2025


Guard the I-cache percentage calculations against access_total==0 before dividing so idle harts don’t emit NaN/inf.


<file context>
@@ -1047,14 +1069,32 @@ static void print_mmu_cache_stats(vm_t *vm)
+        fprintf(stderr, "\n=== Introduction Cache Statistics ===\n");
+        fprintf(stderr, "  Total access:  %12llu\n", access_total);
+        fprintf(stderr, "  Icache hits:   %12llu (%.2f%%)\n", fetch_hits_icache,
+                (fetch_hits_icache * 100.0) / access_total);
+        fprintf(stderr, "  Icache misses: %12llu (%.2f%%)\n",
+                fetch_misses_icache,
</file context>

@yy214123

How do you plan to proceed?

I'm not sure yet whether it's due to the FIFO policy or the victim cache size.
I'll try a simple approximate LRU version (with a counter) to see how much the replacement policy affects the stats.

Does this approach make sense to you, or would you suggest a different direction?
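The counter-based approximate LRU mentioned above might be sketched like this (names and layout are illustrative, not from the branch):

```c
#include <stdbool.h>
#include <stdint.h>

#define VCACHE_BLOCKS 16

typedef struct {
    bool valid;
    uint32_t tag;
    uint32_t last_used; /* stamp from a monotonically increasing counter */
} vcache_block_t;

/* Pick the replacement victim: first invalid block, else the block
 * with the smallest (oldest) last_used stamp. */
static int vcache_pick_victim(const vcache_block_t *v)
{
    int victim = 0;
    for (int i = 0; i < VCACHE_BLOCKS; i++) {
        if (!v[i].valid)
            return i;
        if (v[i].last_used < v[victim].last_used)
            victim = i;
    }
    return victim;
}
```

On each victim-cache hit, last_used would be refreshed from a global fetch counter, so replacement evicts the stalest block instead of following FIFO order.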
