fix: make evm_execution more robust #2942

tac0turtle · 2025-12-17T14:05:31Z

Overview

Summary

Prevents fork explosions where multiple EL blocks are created for the same app block height by adding three layers of protection:

Idempotency check - Before building, checks if EL already has a block at the requested height with matching timestamp and reuses it
Fixed safe/finalized semantics - setFinal(isFinal=false) no longer incorrectly advances safe block; safe now lags head by SafeBlockLag (2 blocks)

Changes

Added executeMu sync.Mutex to serialize ExecuteTxs and ResumePayload calls
Added idempotency check at start of ExecuteTxs using height + timestamp matching
Fixed forkchoice args to use tracked currentSafeBlockHash/currentFinalizedBlockHash instead of prevBlockHash
Added SafeBlockLag constant (2 blocks) for automatic safe block advancement
Added new APIs: SetSafe(), SetSafeByHeight(), SetFinalized(), ResumePayload()
Refactored common RPC logic into `doForkchoiceUpdate()

claude · 2025-12-17T14:06:02Z

Claude finished @tac0turtle's task in 2m 30s —— View job

Code Review: EVM Execution Robustness Improvements

I've completed a comprehensive review of this PR. The changes introduce critical improvements for preventing fork explosions and enabling crash recovery. Overall, the implementation is solid with good error handling patterns, but there are several important issues to address.

Critical Issues

1. Missing Concurrency Protection for ExecuteTxs 🔴

Location: execution/evm/execution.go:293

The PR description mentions "Added executeMu sync.Mutex to serialize ExecuteTxs and ResumePayload calls", but this mutex was removed in commit 349c052. This is a critical concurrency bug.

Problem: Without serialization, concurrent calls to ExecuteTxs can:

Create multiple payloads for the same height
Race on currentHeadBlockHash updates
Cause fork explosions (the exact problem this PR aims to solve)

Evidence:

Only mu sync.Mutex exists at line 146, which protects internal state (currentHeadBlockHash, etc.)
ExecuteTxs has no top-level mutex lock/unlock
The idempotency check alone is insufficient - it prevents reusing blocks, but doesn't prevent concurrent builds

Recommendation: Add an executeMu sync.Mutex field and wrap the entire ExecuteTxs method body with:

c.executeMu.Lock()
defer c.executeMu.Unlock()

This is essential to prevent the fork explosion scenario described in the PR.

2. Transaction Validation Gap in Idempotency Check 🟡

Location: execution/evm/execution.go:656-713 (checkIdempotency)

Problem: The idempotency check at line 686 only verifies height + timestamp when reusing an existing EL block, but does not verify that the transactions match. As @alpe correctly noted in review comments, this creates a state divergence risk.

Attack Scenario:

Sequencer executes block N with txs [A, B, C] → produces stateRoot X
Sequencer crashes/restarts
Malicious/buggy sequencer sends different txs [D, E, F] for same height/timestamp
Idempotency check returns stateRoot X (from first execution)
State divergence: nodes disagree on which txs were in block N

Current Code:

// Line 686-691: Only checks timestamp, not transaction content
if existingTimestamp == uint64(timestamp.Unix()) {
    c.logger.Info()...Msg("ExecuteTxs: reusing existing block at height (EL idempotency)")
    // Returns existing stateRoot WITHOUT verifying txs match
    return existingStateRoot.Bytes(), nil, true, nil
}

Recommendation:
Option A (Strict): Compute tx hash and compare with stored ExecMeta.TxHash. If mismatch, return error indicating fork attempt.

Option B (Defensive): If EL block exists at height with matching timestamp, retrieve the actual block transactions from EL and compare. If mismatch, log critical warning and reject.

The TxHash field is already being computed (line 841-846) but not used during idempotency checks.

High Priority Issues

3. Unused Public APIs Without Clear Purpose 🟡

Locations:

SetSafe() (line 546)
SetSafeByHeight() (line 563)
SetFinalized() (line 612)
ResumePayload() (line 633)

Issue: As @alpe noted, these public methods are not called anywhere in the codebase. The PR description mentions they're for "DA inclusion loop" integration, but:

No follow-up PR is linked
No TODO comments explain future usage
They add API surface area without current value

Recommendation:

If these are for future DA integration, add clear TODO comments with issue/PR links
Consider making them internal/unexported until actually needed
Document the intended usage in godoc comments

4. Safe Block Lag Semantics Need Clarification 🟡

Location: execution/evm/execution.go:36-44

Issue: The constants SafeBlockLag and FinalizedBlockLag implement a "mock" finalization strategy (lines 41-43), but:

Line 40 has a TODO comment that was flagged by reviewer @tac0turtle to integrate with DA inclusion loop
Current lag-based approach doesn't reflect actual DA confirmation
Mixed with real head advancement logic, making it unclear what's temporary vs permanent

Recommendation:

Add clear documentation that this is a temporary placeholder until DA-based finalization
Link to tracking issue for DA integration
Consider a feature flag or config option to disable auto-advancement when real DA finalization is available

Code Quality & Architecture

5. Store Integration Pattern ✅ (Resolved)

The discussion between @julienrbrt and others about store dependencies was resolved by:

Moving ExecMeta proto to proto/execution/evm/v1/state.proto
Using ds.Batching interface at the boundary
Prefixing with evmStorePrefix (line 16 in store.go)

This is a good pattern that keeps execution/evm modular while still using ev-node's datastore abstraction.

6. Error Handling: Good Retry Logic ✅

The exponential backoff retry pattern (retryWithBackoffOnPayloadStatus, lines 96-130) correctly implements Engine API spec:

Distinguishes transient (SYNCING) from permanent (INVALID) failures
Respects context cancellation
Uses exponential backoff

Minor suggestion: Consider making retry constants configurable for testing.

7. Logging: Generally Good ✅

Structured logging with appropriate levels. Good use of contextual fields.

Minor nit: Some debug logs at line 451-453, 472-475 could be elevated to Info level since block production failures are operationally significant.

Security Considerations

8. JWT Secret Handling ✅

JWT secret management in apps/evm/cmd/run.go (lines 226-234) properly:

Reads from file (not command-line args)
Trims whitespace
Validates non-empty

9. Transaction Filtering ✅

filterTransactions (line 716) correctly validates force-included txs from DA while trusting mempool txs (already validated). This is appropriate.

10. State Root Validation ✅

Line 316-318: Validates that prevStateRoot matches execution layer state. Good defensive check.

Testing

11. Test Coverage: Adequate but Missing Concurrency Tests 🟡

execution/evm/test/execution_test.go has good end-to-end coverage:

Genesis initialization
Multi-block execution
Empty blocks (edge case at block 4)
Sync/replay scenario

Missing:

No tests for concurrent ExecuteTxs calls (critical given issue ABCI interface #1)
No tests for idempotency with different txs (issue Peer discovery #2)
No tests for crash recovery using stored ExecMeta
No tests for ResumePayload functionality

Recommendation: Add tests for:

// Concurrent execution attempts
func TestConcurrentExecuteTxs(t *testing.T) { ... }

// Idempotency with tx mismatch
func TestIdempotencyDifferentTxs(t *testing.T) { ... }

// Crash recovery
func TestResumePayloadAfterCrash(t *testing.T) { ... }

Documentation

12. Code Comments: Excellent ✅

Function-level documentation is thorough and explains:

Purpose and behavior
Engine API compliance
Retry semantics
State transitions

The comment style aligns with Go conventions.

Performance

13. Block Hash Cache Management ✅

cacheBlockHash (line 587) implements bounded cache with pruning. Size limit of 10 is appropriate for SafeBlockLag=2.

14. Potential Optimization: Store Writes 🟡

saveExecMeta is called multiple times per block execution (lines 400, 700, 808). Each call does:

Protobuf marshaling
Datastore Put

Suggestion: Consider batching these writes or making intermediate stages optional. The critical saves are "started" (for crash recovery) and "promoted" (for idempotency).

Summary of Required Changes

Must Fix Before Merge:

✅ Re-add executeMu serialization for ExecuteTxs (Issue ABCI interface #1)
✅ Add transaction validation to idempotency check (Issue Peer discovery #2)

Should Address:
3. Document/justify unused public APIs or defer them (Issue #3)
4. Clarify temporary nature of lag-based finalization (Issue #4)
5. Add concurrency and crash recovery tests (Issue #11)

Nice to Have:
6. Optimize saveExecMeta calls (Issue #14)
7. Elevate some debug logs to info level (Issue #7)

Overall Assessment

This PR makes important progress on preventing fork explosions and enabling crash recovery. The core idempotency design with ExecMeta tracking is sound. However, the removal of executeMu (issue #1) and missing transaction validation (issue #2) are critical gaps that must be addressed.

The code quality is generally high with good error handling, logging, and documentation. Once the critical issues are fixed and tests are added, this will be a solid improvement to execution layer robustness.

Recommendation: Request changes for issues #1 and #2 before merging.

gemini-code-assist · 2025-12-17T14:06:03Z

Summary of Changes

Hello @tac0turtle, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the robustness of EVM execution by addressing potential "fork explosions" and improving crash recovery mechanisms. It introduces a multi-layered approach including serialization of block builds, idempotent execution checks, and the ability to resume interrupted payload creations. Furthermore, it refines the handling of safe and finalized block states, ensuring more stable and predictable chain progression.

Highlights

Concurrency Control: Implemented a mutex (executeMu) to serialize ExecuteTxs calls, preventing concurrent block builds and potential "fork explosions."
Idempotent Execution: Added a check in ExecuteTxs to reuse existing blocks if an EL block with a matching height and timestamp is already present, improving retry robustness.
Crash Recovery for Payloads: Introduced a payloadResumer interface and logic to resume in-progress payload builds using stored payloadIDs after crashes, avoiding sibling block creation.
Refined Safe/Finalized Semantics: Corrected the advancement of the safe block, ensuring it now lags the head by a configurable SafeBlockLag (2 blocks) for better reorg protection.
New State Tracking and APIs: Introduced ExecMeta to persistently track execution stages and added new APIs (SetSafe, SetSafeByHeight, SetFinalized, ResumePayload) for more granular control over forkchoice state.
RPC Logic Refactoring: Consolidated common forkchoiceUpdatedV3 RPC call logic into a new doForkchoiceUpdate helper function.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

github-actions · 2025-12-17T14:06:43Z

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

Build	Format	Lint	Breaking	Updated (UTC)
`✅ passed`	`⏩ skipped`	`✅ passed`	`✅ passed`	Dec 19, 2025, 8:33 PM

gemini-code-assist

Code Review

This pull request introduces significant changes to enhance the robustness and idempotency of block execution, particularly for EVM-based chains. Key changes include adding an ExecMeta struct and associated storage (pkg/store) to track the execution state of blocks at different stages (started, promoted), enabling crash recovery and preventing the creation of sibling blocks during retries. The Executor now saves and retrieves this metadata, and a new payloadResumer interface allows EVM execution clients to resume in-progress payload builds. The EngineClient in execution/evm has been updated to serialize ExecuteTxs calls, implement idempotency checks by reusing existing blocks, and manage forkchoice state more granularly with SafeBlockLag and new setHead, SetSafe, SetFinalized, and ResumePayload methods. Mock generation configurations (.mockery.yaml) and protobuf definitions (proto/evnode/v1/state.proto) were updated to support these new structures and interfaces, and the buf.gen.yaml configuration was simplified by removing the clean: true directive.

codecov · 2025-12-17T14:15:58Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 59.27%. Comparing base (d4a2da9) to head (03335c4).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2942      +/-   ##
==========================================
+ Coverage   59.19%   59.27%   +0.08%     
==========================================
  Files          90       90              
  Lines        8627     8627              
==========================================
+ Hits         5107     5114       +7     
+ Misses       2938     2931       -7     
  Partials      582      582

Flag	Coverage Δ
combined	`59.27% <ø> (+0.08%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

tac0turtle · 2025-12-17T18:04:17Z

execution/evm/execution.go

+	// This provides a buffer for reorg protection - safe blocks won't be reorged
+	// under normal operation. A value of 2 means when head is at block N,
+	// safe is at block N-2.
+	SafeBlockLag = 2


Suggested change

SafeBlockLag = 2

//TODO: make this work with da inclusion loop or syncer when using da.

SafeBlockLag = 2

julienrbrt

Still reviewing the core logic. My comments are still about moving most of it outside of ev-node.

execution/evm/execution.go

apps/evm/cmd/run.go

block/internal/executing/executor.go

block/internal/executing/executor_logic_test.go

execution/evm/test/execution_test.go

pkg/store/batch.go

execution/evm/execution.go

julienrbrt · 2025-12-17T19:54:32Z

execution/evm/execution.go

-	err = c.engineClient.CallContext(ctx, &payloadResult, "engine_getPayloadV4", *payloadID)
-	if err != nil {
-		return nil, 0, fmt.Errorf("get payload failed: %w", err)
+	// Save ExecMeta with payloadID for crash recovery (Stage="started")


let's remove this nil check and assume store is always set. it is in the constructor.

julienrbrt · 2025-12-17T19:55:09Z

execution/evm/execution.go

+	return c.doForkchoiceUpdate(ctx, args, "setHead")
 }

 func (c *EngineClient) setFinal(ctx context.Context, blockHash common.Hash, isFinal bool) error {


bit weird to have two private setFinal and setFinalWithHeight. Can we merge them?

alpe

Nice work.
It would be great to get rid of the store nil checks in the code

alpe · 2025-12-18T13:34:24Z

execution/evm/execution.go

+// Uses cached block hashes when available to avoid RPC calls. Falls back to RPC on cache miss
+// (e.g., during restart before cache is warmed).
+// Returns nil if the height doesn't exist yet (block not produced).
+func (c *EngineClient) SetSafeByHeight(ctx context.Context, height uint64) error {


This and SetSafe are public methods but I could not find where they are called from. Do you need them?

this will need to a be a follow up in order to build it against the da inclusion loop.

alpe · 2025-12-18T13:37:12Z

execution/evm/execution.go

+// SetFinalized explicitly sets the finalized block hash.
+// This allows the derivation layer to advance finalization independently.
+// Finalized indicates a block that will never be reorged (e.g., included in DA with sufficient confirmations).
+func (c *EngineClient) SetFinalized(ctx context.Context, blockHash common.Hash) error {


Same here and other public methods. I could not find where they are called.

execution/evm/store.go

alpe · 2025-12-18T14:25:26Z

execution/evm/execution.go

+// - err: error during checks
+func (c *EngineClient) checkIdempotency(ctx context.Context, height uint64, timestamp time.Time, txs [][]byte) (stateRoot []byte, payloadID *engine.PayloadID, found bool, err error) {
+	// 1. Check ExecMeta (if store is configured)
+	if c.store != nil {


TXs are unused.
The function checks if a block exists at a certain height with a specific timestamp. However, it does not verify if the transactions (txs) match what was previously executed or what exists on the Execution Layer (EL).

Risk: If the sequencer sends a different set of transactions for the same height/timestamp (e.g., due to a bug or malicious actor), this function will return the stateRoot of the previous execution, leading to state divergence between nodes.

I
f there is an already executed block we shouldnt override it as this would be considered a fork.

execution/evm/test/execution_test.go

proto/evnode/v1/state.proto

alpe

Thank you for the updates and additional informations.
Change look good. Please fix the test before merging

* main: fix: make evm_execution more robust (#2942) fix(sequencers/single): deterministic queue (#2938) fix(block): fix init logic sequencer for da epoch fetching (#2926)

github-actions bot assigned tac0turtle Dec 17, 2025

make evm_execution more robust

f2a5390

tac0turtle force-pushed the marko/execution_refactor branch from 9e47304 to f2a5390 Compare December 17, 2025 14:09

gemini-code-assist bot reviewed Dec 17, 2025

View reviewed changes

tac0turtle added 5 commits December 17, 2025 06:25

consolidate

fd2d51e

check

377793c

simplify

11325f3

more checks

70b42a6

remove unneeded mutex

349c052

tac0turtle commented Dec 17, 2025

View reviewed changes

tac0turtle marked this pull request as ready for review December 17, 2025 18:34

julienrbrt self-requested a review December 17, 2025 19:39

julienrbrt reviewed Dec 17, 2025

View reviewed changes

tac0turtle added 2 commits December 17, 2025 18:13

revert changes

c74e7f5

fix build

bbefeba

tac0turtle requested a review from julienrbrt December 18, 2025 13:54

alpe reviewed Dec 18, 2025

View reviewed changes

tac0turtle force-pushed the marko/execution_refactor branch from 2a1d79a to bf5c9fa Compare December 18, 2025 15:40

remove nil checks and store setter

ecec4e7

tac0turtle force-pushed the marko/execution_refactor branch from bf5c9fa to ecec4e7 Compare December 18, 2025 15:41

tac0turtle requested a review from alpe December 18, 2025 16:34

Merge branch 'main' into marko/execution_refactor

182e768

julienrbrt reviewed Dec 19, 2025

View reviewed changes

execution/evm/test/execution_test.go Outdated Show resolved Hide resolved

julienrbrt reviewed Dec 19, 2025

View reviewed changes

proto/evnode/v1/state.proto Show resolved Hide resolved

alpe previously approved these changes Dec 19, 2025

View reviewed changes

feat: add FinalizedBlockLag constant and update setFinalWithHeight logic

1672e25

tac0turtle dismissed alpe’s stale review via 1672e25 December 19, 2025 14:10

julienrbrt previously approved these changes Dec 19, 2025

View reviewed changes

move proto + go mod tidy

e9ba0f9

julienrbrt dismissed their stale review via e9ba0f9 December 19, 2025 19:54

julienrbrt approved these changes Dec 19, 2025

View reviewed changes

julienrbrt enabled auto-merge December 19, 2025 19:56

Merge branch 'main' into marko/execution_refactor

03335c4

julienrbrt disabled auto-merge December 19, 2025 21:05

julienrbrt merged commit d386df5 into main Dec 19, 2025
49 of 53 checks passed

julienrbrt deleted the marko/execution_refactor branch December 19, 2025 21:05

alpe added a commit that referenced this pull request Dec 22, 2025

Merge branch 'main' into alex/2609_hints

993c2b1

* main: fix: make evm_execution more robust (#2942) fix(sequencers/single): deterministic queue (#2938) fix(block): fix init logic sequencer for da epoch fetching (#2926)

	SafeBlockLag = 2
	//TODO: make this work with da inclusion loop or syncer when using da.
	SafeBlockLag = 2

fix: make evm_execution more robust #2942

fix: make evm_execution more robust #2942

Uh oh!

Conversation

tac0turtle commented Dec 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview

Summary

Changes

Uh oh!

claude bot commented Dec 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review: EVM Execution Robustness Improvements

Critical Issues

1. Missing Concurrency Protection for ExecuteTxs 🔴

2. Transaction Validation Gap in Idempotency Check 🟡

High Priority Issues

3. Unused Public APIs Without Clear Purpose 🟡

4. Safe Block Lag Semantics Need Clarification 🟡

Code Quality & Architecture

5. Store Integration Pattern ✅ (Resolved)

6. Error Handling: Good Retry Logic ✅

7. Logging: Generally Good ✅

Security Considerations

8. JWT Secret Handling ✅

9. Transaction Filtering ✅

10. State Root Validation ✅

Testing

11. Test Coverage: Adequate but Missing Concurrency Tests 🟡

Documentation

12. Code Comments: Excellent ✅

Performance

13. Block Hash Cache Management ✅

14. Potential Optimization: Store Writes 🟡

Summary of Required Changes

Overall Assessment

Uh oh!

gemini-code-assist bot commented Dec 17, 2025

Summary of Changes

Highlights

Footnotes

Uh oh!

github-actions bot commented Dec 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

codecov bot commented Dec 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Choose a reason for hiding this comment

Uh oh!

julienrbrt left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

alpe left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

tac0turtle commented Dec 17, 2025 •

edited

Loading

claude bot commented Dec 17, 2025 •

edited

Loading

github-actions bot commented Dec 17, 2025 •

edited

Loading

codecov bot commented Dec 17, 2025 •

edited

Loading