Skip to content

Conversation

@tac0turtle
Copy link
Contributor

@tac0turtle tac0turtle commented Dec 17, 2025

Overview

Summary

Prevents fork explosions where multiple EL blocks are created for the same app block height by adding three layers of protection:

  1. Idempotency check - Before building, checks if EL already has a block at the requested height with matching timestamp and reuses it
  2. Fixed safe/finalized semantics - setFinal(isFinal=false) no longer incorrectly advances safe block; safe now lags head by SafeBlockLag (2 blocks)

Changes

  • Added executeMu sync.Mutex to serialize ExecuteTxs and ResumePayload calls
  • Added idempotency check at start of ExecuteTxs using height + timestamp matching
  • Fixed forkchoice args to use tracked currentSafeBlockHash/currentFinalizedBlockHash instead of prevBlockHash
  • Added SafeBlockLag constant (2 blocks) for automatic safe block advancement
  • Added new APIs: SetSafe(), SetSafeByHeight(), SetFinalized(), ResumePayload()
  • Refactored common RPC logic into `doForkchoiceUpdate()

@claude
Copy link
Contributor

claude bot commented Dec 17, 2025

Claude finished @tac0turtle's task in 2m 30s —— View job


Code Review: EVM Execution Robustness Improvements

I've completed a comprehensive review of this PR. The changes introduce critical improvements for preventing fork explosions and enabling crash recovery. Overall, the implementation is solid with good error handling patterns, but there are several important issues to address.


Critical Issues

1. Missing Concurrency Protection for ExecuteTxs 🔴

Location: execution/evm/execution.go:293

The PR description mentions "Added executeMu sync.Mutex to serialize ExecuteTxs and ResumePayload calls", but this mutex was removed in commit 349c052. This is a critical concurrency bug.

Problem: Without serialization, concurrent calls to ExecuteTxs can:

  • Create multiple payloads for the same height
  • Race on currentHeadBlockHash updates
  • Cause fork explosions (the exact problem this PR aims to solve)

Evidence:

  • Only mu sync.Mutex exists at line 146, which protects internal state (currentHeadBlockHash, etc.)
  • ExecuteTxs has no top-level mutex lock/unlock
  • The idempotency check alone is insufficient - it prevents reusing blocks, but doesn't prevent concurrent builds

Recommendation: Add an executeMu sync.Mutex field and wrap the entire ExecuteTxs method body with:

c.executeMu.Lock()
defer c.executeMu.Unlock()

This is essential to prevent the fork explosion scenario described in the PR.


2. Transaction Validation Gap in Idempotency Check 🟡

Location: execution/evm/execution.go:656-713 (checkIdempotency)

Problem: The idempotency check at line 686 only verifies height + timestamp when reusing an existing EL block, but does not verify that the transactions match. As @alpe correctly noted in review comments, this creates a state divergence risk.

Attack Scenario:

  1. Sequencer executes block N with txs [A, B, C] → produces stateRoot X
  2. Sequencer crashes/restarts
  3. Malicious/buggy sequencer sends different txs [D, E, F] for same height/timestamp
  4. Idempotency check returns stateRoot X (from first execution)
  5. State divergence: nodes disagree on which txs were in block N

Current Code:

// Line 686-691: Only checks timestamp, not transaction content
if existingTimestamp == uint64(timestamp.Unix()) {
    c.logger.Info()...Msg("ExecuteTxs: reusing existing block at height (EL idempotency)")
    // Returns existing stateRoot WITHOUT verifying txs match
    return existingStateRoot.Bytes(), nil, true, nil
}

Recommendation:
Option A (Strict): Compute tx hash and compare with stored ExecMeta.TxHash. If mismatch, return error indicating fork attempt.

Option B (Defensive): If EL block exists at height with matching timestamp, retrieve the actual block transactions from EL and compare. If mismatch, log critical warning and reject.

The TxHash field is already being computed (line 841-846) but not used during idempotency checks.


High Priority Issues

3. Unused Public APIs Without Clear Purpose 🟡

Locations:

  • SetSafe() (line 546)
  • SetSafeByHeight() (line 563)
  • SetFinalized() (line 612)
  • ResumePayload() (line 633)

Issue: As @alpe noted, these public methods are not called anywhere in the codebase. The PR description mentions they're for "DA inclusion loop" integration, but:

  • No follow-up PR is linked
  • No TODO comments explain future usage
  • They add API surface area without current value

Recommendation:

  • If these are for future DA integration, add clear TODO comments with issue/PR links
  • Consider making them internal/unexported until actually needed
  • Document the intended usage in godoc comments

4. Safe Block Lag Semantics Need Clarification 🟡

Location: execution/evm/execution.go:36-44

Issue: The constants SafeBlockLag and FinalizedBlockLag implement a "mock" finalization strategy (lines 41-43), but:

  • Line 40 has a TODO comment that was flagged by reviewer @tac0turtle to integrate with DA inclusion loop
  • Current lag-based approach doesn't reflect actual DA confirmation
  • Mixed with real head advancement logic, making it unclear what's temporary vs permanent

Recommendation:

  • Add clear documentation that this is a temporary placeholder until DA-based finalization
  • Link to tracking issue for DA integration
  • Consider a feature flag or config option to disable auto-advancement when real DA finalization is available

Code Quality & Architecture

5. Store Integration Pattern ✅ (Resolved)

The discussion between @julienrbrt and others about store dependencies was resolved by:

  • Moving ExecMeta proto to proto/execution/evm/v1/state.proto
  • Using ds.Batching interface at the boundary
  • Prefixing with evmStorePrefix (line 16 in store.go)

This is a good pattern that keeps execution/evm modular while still using ev-node's datastore abstraction.

6. Error Handling: Good Retry Logic

The exponential backoff retry pattern (retryWithBackoffOnPayloadStatus, lines 96-130) correctly implements Engine API spec:

  • Distinguishes transient (SYNCING) from permanent (INVALID) failures
  • Respects context cancellation
  • Uses exponential backoff

Minor suggestion: Consider making retry constants configurable for testing.

7. Logging: Generally Good

Structured logging with appropriate levels. Good use of contextual fields.

Minor nit: Some debug logs at line 451-453, 472-475 could be elevated to Info level since block production failures are operationally significant.


Security Considerations

8. JWT Secret Handling

JWT secret management in apps/evm/cmd/run.go (lines 226-234) properly:

  • Reads from file (not command-line args)
  • Trims whitespace
  • Validates non-empty

9. Transaction Filtering

filterTransactions (line 716) correctly validates force-included txs from DA while trusting mempool txs (already validated). This is appropriate.

10. State Root Validation

Line 316-318: Validates that prevStateRoot matches execution layer state. Good defensive check.


Testing

11. Test Coverage: Adequate but Missing Concurrency Tests 🟡

execution/evm/test/execution_test.go has good end-to-end coverage:

  • Genesis initialization
  • Multi-block execution
  • Empty blocks (edge case at block 4)
  • Sync/replay scenario

Missing:

  • No tests for concurrent ExecuteTxs calls (critical given issue ABCI interface #1)
  • No tests for idempotency with different txs (issue Peer discovery #2)
  • No tests for crash recovery using stored ExecMeta
  • No tests for ResumePayload functionality

Recommendation: Add tests for:

// Concurrent execution attempts
func TestConcurrentExecuteTxs(t *testing.T) { ... }

// Idempotency with tx mismatch
func TestIdempotencyDifferentTxs(t *testing.T) { ... }

// Crash recovery
func TestResumePayloadAfterCrash(t *testing.T) { ... }

Documentation

12. Code Comments: Excellent

Function-level documentation is thorough and explains:

  • Purpose and behavior
  • Engine API compliance
  • Retry semantics
  • State transitions

The comment style aligns with Go conventions.


Performance

13. Block Hash Cache Management

cacheBlockHash (line 587) implements bounded cache with pruning. Size limit of 10 is appropriate for SafeBlockLag=2.

14. Potential Optimization: Store Writes 🟡

saveExecMeta is called multiple times per block execution (lines 400, 700, 808). Each call does:

  • Protobuf marshaling
  • Datastore Put

Suggestion: Consider batching these writes or making intermediate stages optional. The critical saves are "started" (for crash recovery) and "promoted" (for idempotency).


Summary of Required Changes

Must Fix Before Merge:

  1. ✅ Re-add executeMu serialization for ExecuteTxs (Issue ABCI interface #1)
  2. ✅ Add transaction validation to idempotency check (Issue Peer discovery #2)

Should Address:
3. Document/justify unused public APIs or defer them (Issue #3)
4. Clarify temporary nature of lag-based finalization (Issue #4)
5. Add concurrency and crash recovery tests (Issue #11)

Nice to Have:
6. Optimize saveExecMeta calls (Issue #14)
7. Elevate some debug logs to info level (Issue #7)


Overall Assessment

This PR makes important progress on preventing fork explosions and enabling crash recovery. The core idempotency design with ExecMeta tracking is sound. However, the removal of executeMu (issue #1) and missing transaction validation (issue #2) are critical gaps that must be addressed.

The code quality is generally high with good error handling, logging, and documentation. Once the critical issues are fixed and tests are added, this will be a solid improvement to execution layer robustness.

Recommendation: Request changes for issues #1 and #2 before merging.

@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello @tac0turtle, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the robustness of EVM execution by addressing potential "fork explosions" and improving crash recovery mechanisms. It introduces a multi-layered approach including serialization of block builds, idempotent execution checks, and the ability to resume interrupted payload creations. Furthermore, it refines the handling of safe and finalized block states, ensuring more stable and predictable chain progression.

Highlights

  • Concurrency Control: Implemented a mutex (executeMu) to serialize ExecuteTxs calls, preventing concurrent block builds and potential "fork explosions."
  • Idempotent Execution: Added a check in ExecuteTxs to reuse existing blocks if an EL block with a matching height and timestamp is already present, improving retry robustness.
  • Crash Recovery for Payloads: Introduced a payloadResumer interface and logic to resume in-progress payload builds using stored payloadIDs after crashes, avoiding sibling block creation.
  • Refined Safe/Finalized Semantics: Corrected the advancement of the safe block, ensuring it now lags the head by a configurable SafeBlockLag (2 blocks) for better reorg protection.
  • New State Tracking and APIs: Introduced ExecMeta to persistently track execution stages and added new APIs (SetSafe, SetSafeByHeight, SetFinalized, ResumePayload) for more granular control over forkchoice state.
  • RPC Logic Refactoring: Consolidated common forkchoiceUpdatedV3 RPC call logic into a new doForkchoiceUpdate helper function.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@github-actions
Copy link
Contributor

github-actions bot commented Dec 17, 2025

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

BuildFormatLintBreakingUpdated (UTC)
✅ passed⏩ skipped✅ passed✅ passedDec 19, 2025, 8:33 PM

@tac0turtle tac0turtle force-pushed the marko/execution_refactor branch from 9e47304 to f2a5390 Compare December 17, 2025 14:09
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces significant changes to enhance the robustness and idempotency of block execution, particularly for EVM-based chains. Key changes include adding an ExecMeta struct and associated storage (pkg/store) to track the execution state of blocks at different stages (started, promoted), enabling crash recovery and preventing the creation of sibling blocks during retries. The Executor now saves and retrieves this metadata, and a new payloadResumer interface allows EVM execution clients to resume in-progress payload builds. The EngineClient in execution/evm has been updated to serialize ExecuteTxs calls, implement idempotency checks by reusing existing blocks, and manage forkchoice state more granularly with SafeBlockLag and new setHead, SetSafe, SetFinalized, and ResumePayload methods. Mock generation configurations (.mockery.yaml) and protobuf definitions (proto/evnode/v1/state.proto) were updated to support these new structures and interfaces, and the buf.gen.yaml configuration was simplified by removing the clean: true directive.

@codecov
Copy link

codecov bot commented Dec 17, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 59.27%. Comparing base (d4a2da9) to head (03335c4).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2942      +/-   ##
==========================================
+ Coverage   59.19%   59.27%   +0.08%     
==========================================
  Files          90       90              
  Lines        8627     8627              
==========================================
+ Hits         5107     5114       +7     
+ Misses       2938     2931       -7     
  Partials      582      582              
Flag Coverage Δ
combined 59.27% <ø> (+0.08%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

// This provides a buffer for reorg protection - safe blocks won't be reorged
// under normal operation. A value of 2 means when head is at block N,
// safe is at block N-2.
SafeBlockLag = 2
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
SafeBlockLag = 2
//TODO: make this work with da inclusion loop or syncer when using da.
SafeBlockLag = 2

@tac0turtle tac0turtle marked this pull request as ready for review December 17, 2025 18:34
@julienrbrt julienrbrt self-requested a review December 17, 2025 19:39
Copy link
Member

@julienrbrt julienrbrt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still reviewing the core logic. My comments are still about moving most of it outside of ev-node.

err = c.engineClient.CallContext(ctx, &payloadResult, "engine_getPayloadV4", *payloadID)
if err != nil {
return nil, 0, fmt.Errorf("get payload failed: %w", err)
// Save ExecMeta with payloadID for crash recovery (Stage="started")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's remove this nil check and assume store is always set. it is in the constructor.

return c.doForkchoiceUpdate(ctx, args, "setHead")
}

func (c *EngineClient) setFinal(ctx context.Context, blockHash common.Hash, isFinal bool) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bit weird to have two private setFinal and setFinalWithHeight. Can we merge them?

Copy link
Contributor

@alpe alpe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work.
It would be great to get rid of the store nil checks in the code

// Uses cached block hashes when available to avoid RPC calls. Falls back to RPC on cache miss
// (e.g., during restart before cache is warmed).
// Returns nil if the height doesn't exist yet (block not produced).
func (c *EngineClient) SetSafeByHeight(ctx context.Context, height uint64) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This and SetSafe are public methods but I could not find where they are called from. Do you need them?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will need to a be a follow up in order to build it against the da inclusion loop.

// SetFinalized explicitly sets the finalized block hash.
// This allows the derivation layer to advance finalization independently.
// Finalized indicates a block that will never be reorged (e.g., included in DA with sufficient confirmations).
func (c *EngineClient) SetFinalized(ctx context.Context, blockHash common.Hash) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here and other public methods. I could not find where they are called.

// - err: error during checks
func (c *EngineClient) checkIdempotency(ctx context.Context, height uint64, timestamp time.Time, txs [][]byte) (stateRoot []byte, payloadID *engine.PayloadID, found bool, err error) {
// 1. Check ExecMeta (if store is configured)
if c.store != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TXs are unused.
The function checks if a block exists at a certain height with a specific timestamp. However, it does not verify if the transactions (txs) match what was previously executed or what exists on the Execution Layer (EL).

Risk: If the sequencer sends a different set of transactions for the same height/timestamp (e.g., due to a bug or malicious actor), this function will return the stateRoot of the previous execution, leading to state divergence between nodes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I
f there is an already executed block we shouldnt override it as this would be considered a fork.

@tac0turtle tac0turtle force-pushed the marko/execution_refactor branch from 2a1d79a to bf5c9fa Compare December 18, 2025 15:40
@tac0turtle tac0turtle force-pushed the marko/execution_refactor branch from bf5c9fa to ecec4e7 Compare December 18, 2025 15:41
@tac0turtle tac0turtle requested a review from alpe December 18, 2025 16:34
alpe
alpe previously approved these changes Dec 19, 2025
Copy link
Contributor

@alpe alpe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the updates and additional informations.
Change look good. Please fix the test before merging

julienrbrt
julienrbrt previously approved these changes Dec 19, 2025
@julienrbrt julienrbrt enabled auto-merge December 19, 2025 19:56
@julienrbrt julienrbrt disabled auto-merge December 19, 2025 21:05
@julienrbrt julienrbrt merged commit d386df5 into main Dec 19, 2025
49 of 53 checks passed
@julienrbrt julienrbrt deleted the marko/execution_refactor branch December 19, 2025 21:05
alpe added a commit that referenced this pull request Dec 22, 2025
* main:
  fix: make evm_execution more robust (#2942)
  fix(sequencers/single): deterministic queue (#2938)
  fix(block): fix init logic sequencer for da epoch fetching (#2926)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants