Fault diagnosis is a time-intensive maintenance task that often relies on the expertise of senior technicians. As this workforce ages and demand for digital tools grows, so does the need to capture and automate this knowledge while maintaining the precision required for technical applications. This study introduces an evaluation-driven framework for fault code recommendation, applied to a ground vehicle diagnosis system. Two tasks were designed to reflect potential system configurations: (1) a chat-style task simulating large language model (LLM) interaction, and (2) a label-constrained task using structured fault codes from technical manuals. Multiple retrieval-augmented generation (RAG) configurations were compared against LLM-only and retrieval-only baselines. Retrieval-based methods outperformed LLM-based ones on label-matching tasks, while the chat task exposed difficulties in linking observations to fault codes from the manual. These results highlight the importance of aligning task design with evaluation goals and of considering retrieval-first approaches as viable alternatives to LLMs in technical language processing (TLP) applications. Beyond the experimental findings, we outline industrial lessons learned: align system design with use-case goals, adopt evaluation-first validation, and pilot LLM-based systems under realistic conditions. These lessons provide practical guidance for developing effective diagnostic support systems in industrial contexts.
Lukens et al. (Sun,) studied this question.