What does this research mean for the field?

State-of-the-art LLM-agent systems achieve up to 88.1% zero-shot success on cross-chain smart contract translation and generation, while several open-weight systems reach 52%–60%. Performance depends heavily on the target blockchain and task type. Novelty: methodological. Consensus alignment: neutral.

What question did this study set out to answer?

This research aims to evaluate the performance of AI agents in generating cross-chain smart contracts and translating code across different chains.

May 20, 2026 · Open Access

ChainBench: An LLM Benchmark for Cross-Chain Code Generation

Key Points

Evaluates how well AI agents generate cross-chain smart contracts and translate contract code between chains.
Introduces ChainBench, a benchmark of 42 tasks built from production repositories and spanning EVM Solidity, NEAR Rust, Aptos Move, Sui Move, and Starknet Cairo.
Evaluates nine deployed model-agent systems zero-shot, scoring each by Pass@1 task success.
A task counts as solved only when the system's implementation passes the repository's existing verification suite.
The best-performing system achieves 88.1% success.
Several open-weight systems reach 52%–60% success, with substantially longer solve times.
Performance varies by target chain and task type; failures range from narrow behavioral mismatches, such as missing edge-case guards, to build and repository-compatibility issues.

Abstract

AI agents are increasingly capable of repository-scale software work, and smart-contract engineering is a particularly high-stakes setting where toolchains and tests enforce correctness. We introduce ChainBench, a benchmark for cross-chain smart contract translation and contract generation built from production repositories and their real verification suites. Each task provides a structured specification and a containerized environment: systems implement missing functionality to pass the existing verification suite and are scored by Pass@1 task success. ChainBench contains 42 tasks spanning EVM Solidity, NEAR Rust, Aptos Move, Sui Move, and Starknet Cairo. In a zero-shot evaluation of nine deployed model-agent systems, including both standardized same-harness runs and preferred-harness runs where available, the best achieves 88.1% success, while several open-weight systems reach 52%–60% with substantially longer solve times. Performance further varies by target chain and task type, with failures ranging from narrow behavioral mismatches (e.g., missing edge-case guards) to build and repository-compatibility issues.
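The Pass@1 scoring described in the abstract can be illustrated with a minimal sketch. It assumes each task yields one pass/fail outcome from a single attempt. The `TaskResult` record, the helper names, and the synthetic run are hypothetical and are not taken from the ChainBench harness. The 37-of-42 split is an inference: it is the only whole-task count that rounds to 88.1% on 42 tasks, but the abstract does not report it, and the per-chain assignment here is invented purely for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """One task outcome (hypothetical record, not the ChainBench schema)."""
    task_id: str
    chain: str
    passed: bool  # True if the single attempt passed the verification suite


def pass_at_1(results):
    """Fraction of tasks solved on their single attempt."""
    if not results:
        raise ValueError("no results to score")
    return sum(r.passed for r in results) / len(results)


def pass_at_1_by_chain(results):
    """Pass@1 computed separately for each target chain."""
    grouped = defaultdict(list)
    for r in results:
        grouped[r.chain].append(r)
    return {chain: pass_at_1(rs) for chain, rs in grouped.items()}


def make_synthetic_run(n_tasks=42, n_passed=37):
    """Build made-up results; the chain assignment is NOT from the paper."""
    chains = ["EVM Solidity", "NEAR Rust", "Aptos Move", "Sui Move", "Starknet Cairo"]
    return [
        TaskResult(f"task-{i:02d}", chains[i % len(chains)], i < n_passed)
        for i in range(n_tasks)
    ]


if __name__ == "__main__":
    run = make_synthetic_run()
    print(f"overall Pass@1: {100 * pass_at_1(run):.1f}%")
    for chain, score in pass_at_1_by_chain(run).items():
        print(f"{chain}: {100 * score:.1f}%")
```

Breaking the score down by chain mirrors the abstract's observation that performance varies by target chain. A real analysis would also group by task type (translation versus generation), which this sketch leaves out.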
