Research and development (R&D) organizations face significant operational bottlenecks caused by the manual processing of diverse, unstructured documents. This paper presents the design, implementation, and pilot evaluation of an on-premises, multi-agent natural language processing (NLP) system developed for the GIG National Research Institute (GIG-NRI). Built on a LangGraph architecture, the system uses open-weight large language models (LLMs) to perform zero-shot document classification, dynamic routing, and specialized information extraction. We evaluated the classification agent across twelve local LLMs under two testing regimes: first, on a strictly defined dataset of known administrative and scientific document types, and second, with an added subset of out-of-distribution (unclassified) documents to test real-world robustness. The 70-billion-parameter model (cogito:70b) achieved a peak accuracy of 97.3% in the first regime and maintained 94.3% accuracy when confronted with out-of-distribution data. However, our analysis reveals a critical operational trade-off in computational efficiency: the 24-billion-parameter (magistral:24b) and 32-billion-parameter (qwen3:32b) models ranked next in overall accuracy while requiring less than half the processing time of the 70B model. Notably, magistral:24b performed better on strictly defined document streams, whereas qwen3:32b was more robust to out-of-distribution inputs. Furthermore, we demonstrate the efficacy of heterogeneous model assignments for complex multi-stage tasks, such as scientific article summarization via hierarchical Map-Reduce.
Iwaszenko et al. studied this question.
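To make the hierarchical Map-Reduce idea concrete, the following is a minimal sketch and not the system's actual implementation. It assumes that each model is exposed as a plain callable from text to text; the names `chunk_text`, `hierarchical_summarize`, `map_model`, `reduce_model`, `max_words`, and `fan_in` are hypothetical and do not come from the paper or from LangGraph. Passing two different callables stands in for a heterogeneous model assignment, for example one model for the per-chunk map step and another for the merge steps.

```python
from typing import Callable, List

# Assumption: any LLM call can be wrapped as a function from text to text.
Summarizer = Callable[[str], str]


def chunk_text(text: str, max_words: int) -> List[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]


def hierarchical_summarize(text: str,
                           map_model: Summarizer,
                           reduce_model: Summarizer,
                           max_words: int = 200,
                           fan_in: int = 4) -> str:
    """Summarize each chunk (map), then merge summaries in groups of
    fan_in (reduce) until a single summary remains."""
    if max_words < 1 or fan_in < 2:
        raise ValueError("max_words must be >= 1 and fan_in must be >= 2")

    chunks = chunk_text(text, max_words)
    if not chunks:
        return ""

    # Map step: one call per chunk, using the map-stage model.
    summaries = [map_model(chunk) for chunk in chunks]

    # Reduce step: each pass shrinks the list, since fan_in >= 2.
    while len(summaries) > 1:
        summaries = [reduce_model("\n".join(summaries[i:i + fan_in]))
                     for i in range(0, len(summaries), fan_in)]
    return summaries[0]
```

In this sketch a document that fits in a single chunk is returned after the map step alone, so the reduce-stage model is only invoked when there is something to merge.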