What question did this study set out to answer?

This research aims to establish a benchmark for evaluating autonomous agents on macOS.

May 12, 2026 | Open Access

macbench: A macOS-Native Computer-Use Benchmark for Autonomous Agents

Key Points

Developed a benchmark with 369 task slots in 15 categories.
Implemented dual scoring (IMPLEMENTED and STRICT) and an agent-agnostic Go runner.
Conducted reference runs to document debugging and performance trajectories.
Scored 67.3% IMPLEMENTED in the first reference run (kinclaw v1.15.0 with Kimi-K2.5).
Documented the IMPLEMENTED score rising from 49.3% to 62% to 67.3% across debugging iterations.
Established a platform ceiling for the Notes category: canonical solutions pass 21 of 31 tasks (67.7%), and the 10 unreachable tasks fall into 4 platform-locked categories.

Abstract

The first publicly released macOS-native computer-use benchmark. 369 task slots across 15 categories (Finder, Safari, Mail, Notes, Calendar, Reminders, Settings, Terminal, Pages, Numbers, Keynote, Music, Photos, Maps, Multi-app), agent-agnostic Go runner, dual scoring (IMPLEMENTED + STRICT), per-task PID-snapshot isolation. First reference run: kinclaw v1.15.0 + Kimi-K2.5 = 67.3% IMPLEMENTED. Documents the full 49.3 -> 62 -> 67.3 debugging trajectory as a methodology contribution.

Note (2026-05-09): This version bundles English + 中文 in a single PDF (English first, then Chinese), generated directly from the canonical Markdown source files.

v0.1.1 (2026-05-10): Adds §6.7 "The platform ceiling — separating agent capability from environmental limits." Notes category: 21/31 → 31/31 fully implemented. 5 eval bug fixes. New tools/reference_verifier.sh runs every Notes task with canonical osascript/shell solutions in ~100s, establishing the platform ceiling (21/31 = 67.7%) as an upper bound on any agent's score and decomposing the 10 unreachable tasks into 4 platform-locked categories.
