What question did this study set out to answer?

The study aims to enhance task completion rates by utilizing a multi-model orchestration approach that integrates heterogeneous LLM-Skills.

April 13, 2026 | Open Access

LLM-Skill Orchestration: Achieving 202/202 Subtask Completion via Rule-Augmented Multi-Model Collaboration in 50 Agentic Tasks

Key Points

Introduced a three-layer architecture for task execution.
Developed a reasoning model to generate orchestration rules.
Used a planning model to decompose tasks into skill graphs with dependencies.
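Evaluated 50 agentic tasks across five task types, each with 4–6 binary checklist items (202 items in total).
Achieved 202/202 checklist-item completion with an average quality score of 17.5/20 using the rule-augmented system (Hb).
By comparison, the single-model baseline (A) completed 137/202 checklist items (68%) with an average quality score of 7.4/20.
Found that same-model decomposition (8/22) performed worse than no decomposition (13/22), indicating that model diversity, not parallelism, drives the collaborative gains.
Found that rule-blind generation (96/100) outperformed rule-informed generation (76/100), suggesting that deductive reasoning from system invariants generalizes better than inductive learning from failure cases.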

Abstract

LLM agents typically rely on a single model for multi-step tool-using tasks, creating a tension between required capability breadth and individual model limitations. We introduce LLM-Skill Orchestration, a three-layer architecture where: (1) a reasoning model generates orchestration rules from system constraints alone; (2) a planning model decomposes tasks into skill graphs with explicit dependencies; and (3) heterogeneous LLM-Skills, both pure-text and tool-equipped, execute in parallel through a shared context pool. We evaluate 50 agentic tasks across five types (information retrieval, code construction, cross-system analysis, multi-step reasoning, compound decision-making). Each task has 4–6 binary checklist items, totaling 202 items. The rule-augmented system (Hb) achieves 202/202 completion and 17.5/20 average quality (LLM-as-Judge, σ=2.0), compared to 137/202 (68%) and 7.4/20 for the single-model baseline (A), and 166/202 (82%) and 13.7/20 for static-rule orchestration (C). Key findings: (i) same-model decomposition (D: 8/22) performs worse than no decomposition (A: 13/22), proving that model diversity, not parallelism, drives collaborative gains; (ii) rule-blind generation (Hb: 96/100) outperforms rule-informed generation (Hi: 76/100), demonstrating that deductive reasoning from system invariants generalizes better than inductive learning from failure cases; (iii) 34 of 227 skills (15%) produced 0-byte output due to API anomalies, yet all were autonomously compensated by the synthesis stage, an emergent architectural resilience not designed into the system.

This is Part 2 of 3 in the 'AI Managing AI' paper series on multi-model orchestration. Part 1: Dimension-Direct Routing, knowledge synthesis via multi-model orchestration (DOI: 10.5281/zenodo.19387375). Part 2 (this paper): LLM-Skill Orchestration, agentic task execution via rule-augmented multi-model collaboration. Part 3 (in preparation): AI Managing AI, a dual-mode framework for deterministic execution and creative exploration, including hallucination duality theory.
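
The abstract describes skill graphs with explicit dependencies whose skills execute in parallel through a shared context pool. The sketch below is one minimal way such an executor could look, not the authors' implementation: the names `Skill`, `execute_graph`, and `Context` are hypothetical, each skill is a plain Python function standing in for a model or tool call, and the rule-generation and planning layers are omitted entirely. It uses the standard-library `graphlib.TopologicalSorter` to find skills whose dependencies are satisfied and runs each such batch in a thread pool.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from graphlib import TopologicalSorter
from typing import Callable, Dict, List

# Hypothetical types for illustration; the paper does not publish its interfaces.
Context = Dict[str, str]


@dataclass
class Skill:
    name: str
    run: Callable[[Context], str]  # stand-in for a call to some model or tool
    depends_on: List[str] = field(default_factory=list)


def execute_graph(skills: List[Skill]) -> Context:
    """Run skills batch by batch; skills within a batch run in parallel."""
    by_name = {s.name: s for s in skills}
    sorter = TopologicalSorter({s.name: s.depends_on for s in skills})
    sorter.prepare()  # raises graphlib.CycleError if dependencies form a cycle
    pool: Context = {}  # shared context pool: skill name -> output
    with ThreadPoolExecutor() as executor:
        while sorter.is_active():
            ready = list(sorter.get_ready())  # all dependencies already done
            snapshot = dict(pool)  # every skill in the batch reads the same view
            outputs = list(executor.map(lambda n: by_name[n].run(snapshot), ready))
            for name, output in zip(ready, outputs):
                pool[name] = output
                sorter.done(name)
    return pool


# Toy graph: "search" and "extract" are independent, "report" needs both.
skills = [
    Skill("search", lambda ctx: "three sources found"),
    Skill("extract", lambda ctx: "key figures: 42"),
    Skill("report",
          lambda ctx: f"{ctx['search']}; {ctx['extract']}",
          depends_on=["search", "extract"]),
]
demo_pool = execute_graph(skills)
print(demo_pool["report"])  # prints "three sources found; key figures: 42"
```

Running in batches keeps the sketch short at the cost of some parallelism, since a batch waits for its slowest skill before the next one starts. A fuller version would submit each skill as soon as its own dependencies finish, and would add the synthesis stage that the abstract credits with compensating for empty outputs.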
