Human action recognition (HAR) is a fundamental component of ubiquitous computing, yet its wide‐ranging application is hindered by privacy concerns. Specifically, high‐accuracy models typically require cloud‐based processing that exposes sensitive visual data, while privacy‐preserving on‐device models suffer from limited reasoning capacity and frequent hallucinations. To resolve this conflict, we introduce multiagent debate for HAR (MAD‐HAR), a novel framework designed for strictly local environments. MAD‐HAR uses a lightweight vision–language model (VLM) with a granular prompt to convert visual inputs into semantic captions, anonymizing the data before inference. To mitigate reasoning failures, a heterogeneous ensemble of small and medium language model agents (ranging from 8B to 14B parameters) engages in a structured multiround debate. Rather than outputting bare labels, agents are prompted to generate structured rationales that explicitly justify their reasoning, using collaborative critique to override hallucinations. We evaluate our approach on public benchmarks. Preliminary experiments guided the selection of the VLM backbone, while extensive main and ablation studies suggest that scaling to a seven‐agent pool with rationale‐driven debate synthesizes higher‐order reasoning. Experimental results show that MAD‐HAR significantly improves macro‐F1 while maximizing consensus and yielding consistent net error rectification.
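The debate stage described above can be illustrated with a minimal sketch. It is not the MAD‐HAR implementation: the names `Opinion`, `debate`, `majority_label`, and `make_stub_agent` are hypothetical, the agents are deterministic stubs standing in for locally hosted language models, and the stopping rule (unanimity or a fixed round limit, followed by a majority vote) is an assumption made for illustration. The input is a text caption, as would be produced by the VLM captioning step.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Opinion:
    label: str      # predicted action label
    rationale: str  # structured justification shared with peers


# Hypothetical agent interface: (caption, peers' previous opinions) -> Opinion.
Agent = Callable[[str, List[Opinion]], Opinion]


def majority_label(opinions: List[Opinion]) -> str:
    """Return the most frequent label among the given opinions."""
    return Counter(o.label for o in opinions).most_common(1)[0][0]


def debate(caption: str, agents: List[Agent], max_rounds: int = 3) -> Tuple[str, List[Opinion]]:
    """Run a multiround debate over a caption (assumed stopping rule: unanimity)."""
    # Round 1: every agent answers independently, seeing no peer opinions.
    opinions = [agent(caption, []) for agent in agents]
    for _ in range(max_rounds - 1):
        if len({o.label for o in opinions}) == 1:
            break  # consensus reached
        # Later rounds: each agent sees the other agents' labels and rationales.
        opinions = [
            agent(caption, opinions[:i] + opinions[i + 1:])
            for i, agent in enumerate(agents)
        ]
    return majority_label(opinions), opinions


def make_stub_agent(initial_label: str, stubborn: bool = False) -> Agent:
    """Stub standing in for a language model agent (illustration only)."""
    def agent(caption: str, peers: List[Opinion]) -> Opinion:
        if not peers or stubborn:
            return Opinion(initial_label, f"caption suggests {initial_label}")
        peer_choice = majority_label(peers)
        return Opinion(peer_choice, f"peer rationales support {peer_choice}")
    return agent


agents = [
    make_stub_agent("walking", stubborn=True),
    make_stub_agent("walking", stubborn=True),
    make_stub_agent("running"),
]
label, final_opinions = debate("a person moves slowly along a corridor", agents)
print(label)
```

In this sketch only the caption and the agents' rationales are passed between components, which mirrors the idea that raw visual data never reaches the reasoning agents.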
Zhou et al. (Thu,) studied this question.