The first publicly published macOS-native computer-use benchmark. 369 task slots across 15 categories (Finder, Safari, Mail, Notes, Calendar, Reminders, Settings, Terminal, Pages, Numbers, Keynote, Music, Photos, Maps, Multi-app), an agent-agnostic Go runner, dual scoring (IMPLEMENTED + STRICT), and per-task PID-snapshot isolation. First reference run: kinclaw v1.15.0 + Kimi-K2.5 = 67.3% IMPLEMENTED. Documents the full 49.3 -> 62 -> 67.3 debugging trajectory as a methodology contribution.

Note (2026-05-09): This version bundles English + 中文 in a single PDF (English first, then Chinese), generated directly from the canonical Markdown source files.

v0.1.1 (2026-05-10): Adds §6.7 "The platform ceiling: separating agent capability from environmental limits." Notes category: 21/31 → 31/31 fully implemented. Five eval bug fixes. New tools/reference_verifier.sh runs every Notes task with canonical osascript/shell solutions in ~100s, establishing the platform ceiling (21/31 = 67.7%) as an upper bound on any agent's score and decomposing the 10 unreachable tasks into 4 platform-locked categories.
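The platform-ceiling arithmetic for the Notes category can be sketched in a few lines of Go. This is an illustration only: the type `CategoryStats`, its fields, and its methods are hypothetical names, not taken from the benchmark's runner, and the rescaling in `RelativeToCeiling` is one possible way to read an agent's score against the ceiling, not a metric the source defines.

```go
package main

import "fmt"

// CategoryStats holds task counts for one benchmark category.
// The type and field names are hypothetical, chosen for this sketch.
type CategoryStats struct {
	Total     int // tasks with an implemented evaluator
	Reachable int // tasks the canonical reference solutions can pass
}

// Ceiling returns the highest pass rate any agent could reach, in percent.
func (c CategoryStats) Ceiling() float64 {
	if c.Total == 0 {
		return 0
	}
	return 100 * float64(c.Reachable) / float64(c.Total)
}

// Unreachable returns how many tasks are locked by the platform.
func (c CategoryStats) Unreachable() int {
	return c.Total - c.Reachable
}

// RelativeToCeiling rescales a pass count against reachable tasks only.
// This is an assumed normalization, not one stated in the source.
func (c CategoryStats) RelativeToCeiling(passed int) float64 {
	if c.Reachable == 0 {
		return 0
	}
	return 100 * float64(passed) / float64(c.Reachable)
}

func main() {
	// Counts reported for the Notes category: 31 tasks, 21 reachable.
	notes := CategoryStats{Total: 31, Reachable: 21}

	fmt.Printf("ceiling: %.1f%%\n", notes.Ceiling())      // ceiling: 67.7%
	fmt.Printf("unreachable: %d\n", notes.Unreachable()) // unreachable: 10
}
```

Separating `Total` from `Reachable` keeps the two questions apart: how many tasks the platform permits at all, and how many of those an agent actually solves.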
The LocalKin Team
www.synapsesocial.com/papers/6a02c380ce8c8c81e9640cb0 — DOI: https://doi.org/10.5281/zenodo.20113062