eth_call/StateCall needlessly refused at the epoch after an expensive migration, with an undetectable error code #13642

@rvagg

Originally reported by @ArseniiPetrovich

Summary

Explicit calls (eth_call, EthEstimateGas, StateCall) pinned to a specific tipset are rejected with ErrExpensiveFork ("refusing explicit call due to state fork at epoch") not only at an expensive upgrade epoch U, but also at U+1, the first epoch after the migration, whose state is already fully materialised. The error is also returned with the generic JSON-RPC code 1, so downstream tooling cannot distinguish it from any other application error.

Two independent problems, both worth fixing:

  1. The refusal window is one epoch too wide (U+1 should be served).
  2. The error has no stable, registered code.

Symptom

Reproduced on mainnet across the nv28 / FireHorse upgrade (UpgradeFireHorseHeight = 6052800). Same eth_call (feeGrowthGlobal0X128() on a Uniswap v3 pool), varying only the pinned block:

epoch U-1 (6052799): OK   -> 0x...15675c877f4119b6014c3ff7346ceae74
epoch U   (6052800): ERR  -> {"code":1,"message":"refusing explicit call due to state fork at epoch"}   <-- legit
epoch U+1 (6052801): ERR  -> {"code":1,"message":"refusing explicit call due to state fork at epoch"}   <-- the problem
epoch U+2 (6052802): OK   -> 0x...15675c877f4119b6014c3ff7346ceae74

The state at U+1 is plainly available: eth_getStorageAt and eth_getCode both succeed there and return the correct values. Only the explicit-call path refuses.

Downstream impact

The Graph's graph-node indexes FEVM contracts by replaying eth_call at the block where each event was emitted. When an event lands in block U or U+1 the call is refused, and graph-node does not recognise the message as a deterministic error, so it treats it as a possible reorg and retries indefinitely. The subgraph wedges permanently at the upgrade epoch. This gets more likely every upgrade as FEVM activity grows.

Why it happens

The guard (in node/impl/eth/gas.go and chain/stmgr/call.go) refuses a call when an expensive migration sits anywhere between the parent epoch and the called epoch, both included. Including the parent makes a call at U+1 trip on the migration at U, even though that migration is already baked into the state U+1 runs on: that state is recoverable without re-execution, so serving the call would not run the migration on demand. Epoch U itself must stay refused, because serving it would run the migration on demand against the wrong state.
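To make the off-by-one concrete, here is a minimal sketch of the two guard shapes. It is not the Lotus code: the function names, the plain int64 epochs, and the expensiveUpgrades map are all hypothetical, and it assumes no null rounds, so the parent is always the called epoch minus one.

```go
package main

import "fmt"

// expensiveUpgrades is a hypothetical stand-in for the set of epochs that
// carry an expensive migration. 6052800 is UpgradeFireHorseHeight from the report.
var expensiveUpgrades = map[int64]bool{6052800: true}

// refusedInclusive models the behaviour described above: refuse when an
// expensive migration sits anywhere in [parent, height], parent included.
func refusedInclusive(parent, height int64) bool {
	for h := parent; h <= height; h++ {
		if expensiveUpgrades[h] {
			return true
		}
	}
	return false
}

// refusedExclusive models the proposed narrowing: only epochs in
// (parent, height] count, so a migration at the parent epoch no longer trips.
func refusedExclusive(parent, height int64) bool {
	for h := parent + 1; h <= height; h++ {
		if expensiveUpgrades[h] {
			return true
		}
	}
	return false
}

func main() {
	const u = 6052800
	for _, height := range []int64{u - 1, u, u + 1, u + 2} {
		parent := height - 1 // assumption: no null rounds around the upgrade
		fmt.Printf("epoch %d: inclusive refused=%v, exclusive refused=%v\n",
			height, refusedInclusive(parent, height), refusedExclusive(parent, height))
	}
}
```

Under these assumptions the inclusive form refuses both U and U+1, matching the symptom table, while the exclusive form refuses only U.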

Separately, ErrExpensiveFork is a bare sentinel that is never converted to a typed RPC error, so go-jsonrpc falls back to code 1. Lotus already registers typed errors with stable codes (api/api_errors.go); ErrExpensiveFork is the natural sibling of ErrNullRound (both mean "this epoch can't be served as requested").

Workarounds today

  • graph-node operators: set GRAPH_GETH_ETH_CALL_ERRORS="refusing explicit call due to state fork" so the message is treated as deterministic, which stops the retry loop. This relies on the exact message string, and until a proper fix is applied some indexable content may be missed.
  • callers: avoid pinning explicit calls to the upgrade epoch or the one immediately after; U+2 onward is fine.
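For callers that need to recognise the refusal today, string matching is the only handle available, since the code is the generic 1. A rough sketch of that check follows; the rpcError type and both function names are hypothetical, and only the message phrase and the sample payload come from the report.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// expensiveForkPhrase is the message fragment quoted in the report. Matching
// on it is brittle: it breaks if the node ever rewords the error.
const expensiveForkPhrase = "refusing explicit call due to state fork"

// rpcError is a hypothetical minimal shape for a JSON-RPC error object.
type rpcError struct {
	Code    int    `json:"code"`
	Message string `json:"message"`
}

// parseRPCError decodes a raw JSON-RPC error object.
func parseRPCError(raw string) (rpcError, error) {
	var e rpcError
	err := json.Unmarshal([]byte(raw), &e)
	return e, err
}

// isExpensiveForkRefusal reports whether the error looks like the
// expensive-fork refusal. The code is ignored because 1 is shared with
// every other application error.
func isExpensiveForkRefusal(e rpcError) bool {
	return strings.Contains(e.Message, expensiveForkPhrase)
}

func main() {
	e, err := parseRPCError(`{"code":1,"message":"refusing explicit call due to state fork at epoch"}`)
	if err != nil {
		panic(err)
	}
	fmt.Println("expensive-fork refusal:", isExpensiveForkRefusal(e))
}
```

This is the same substring match the graph-node setting relies on, which is why a stable code would be a sturdier signal.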

Possible fixes

  • Serve U+1: narrow the guard so it no longer refuses on a migration at the parent epoch, keeping U refused. This is what actually unblocks indexers, since events almost always land at U+1 rather than exactly at U.
  • Signal it the way the ecosystem already does: there's no widely-recognised code for this, but the de-facto convention for "state not servable at this block" (pruned/archive nodes) is code -32000 plus a recognisable message phrase such as required historical state unavailable or state ... is not available. Lotus emits 1 with a bespoke message that no tooling recognises. Moving to -32000 and extending the message to carry that phrase (e.g. required historical state unavailable: refusing explicit call due to state fork at epoch N) aligns with geth-oriented tooling. We should not borrow revert-style phrasing (e.g. -32015 / "execution reverted"): graph-node would then treat the call as a successful revert and index null data, which is worse than the retry loop.

The serve-U+1 fix is the one that resolves this in practice. The residual case (U itself, genuinely unservable) needs graph-node to gain a "state unavailable, advance past this block" path; it currently only knows retry-forever or treat-as-revert. The code/message change is the precondition for such an upstream fix, which is less within our control.
