AI agents are increasingly capable of repository-scale software work, and smart-contract engineering is a particularly high-stakes setting where toolchains and tests enforce correctness. We introduce ChainBench, a benchmark for cross-chain smart-contract translation and contract generation built from production repositories and their real verification suites. Each task provides a structured specification and a containerized environment: systems implement missing functionality to pass the existing verification suite and are scored by Pass@1 task success. ChainBench contains 42 tasks spanning EVM Solidity, NEAR Rust, Aptos Move, Sui Move, and Starknet Cairo. In a zero-shot evaluation of nine deployed model-agent systems, including both standardized same-harness runs and preferred-harness runs where available, the best achieves 88.1% success, while several open-weight systems reach 52%–60% with substantially longer solve times. Performance further varies by target chain and task type, with failures ranging from narrow behavioral mismatches (e.g., missing edge-case guards) to build and repository-compatibility issues.
Shah et al. (Mon,) studied this question.