Automated anomaly detection is vital to industrial quality control, yet conventional deep learning detectors often scale poorly. These models typically follow a rigid “one-model-per-task” paradigm and require a separate system for each product line, which increases operational complexity and cost in diverse manufacturing environments. To address this limitation, we propose a unified defect detection framework based on a Multimodal Large Language Model (MLLM). Our approach uses a two-stage fine-tuning strategy: Supervised Fine-Tuning (SFT) to impart domain-specific knowledge, followed by a novel Reinforcement Fine-Tuning (RFT) process that refines visual reasoning. The RFT stage is guided by a multi-faceted verifiable reward function designed to optimize localization accuracy, classification correctness, and output structure. On a challenging real-world glove manufacturing dataset, our RFT-enhanced MLLM achieves a mean Average Precision (mAP) of 0.63, comparable to a highly specialized YOLO baseline (0.62). More importantly, a single, unified MLLM trained on a mixed-product dataset maintains competitive performance (mAP 0.61), demonstrating that it can handle different products and defect types dynamically via natural language prompts. This study validates the feasibility of using a single, flexible MLLM to replace multiple rigid models in complex industrial inspection, offering a scalable and cost-effective paradigm for future intelligent quality control systems. The open-source code will be released at https://github.com/GloamXun/Glove-MLLM.
Zhao et al. studied this question.
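The abstract names three components of the verifiable reward (localization accuracy, classification correctness, and output structure) without giving a formula. The following is a minimal sketch of how such a reward could be composed, not the authors' implementation. It assumes the model emits a JSON list of objects with a `label` string and a `bbox` of `[x1, y1, x2, y2]`; that output schema, the greedy IoU matching, the weights `w_format`, `w_class`, `w_loc`, and all function names are hypothetical choices made for illustration.

```python
import json


def iou(a, b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def parse_prediction(text):
    """Return the list of predicted defects, or None if the output is malformed.

    Assumed (hypothetical) schema: [{"label": str, "bbox": [x1, y1, x2, y2]}, ...]
    """
    try:
        data = json.loads(text)
    except (json.JSONDecodeError, TypeError):
        return None
    if not isinstance(data, list):
        return None
    for item in data:
        if not isinstance(item, dict) or not isinstance(item.get("label"), str):
            return None
        box = item.get("bbox")
        if not (isinstance(box, list) and len(box) == 4):
            return None
        if not all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in box):
            return None
    return data


def reward(text, targets, w_format=0.2, w_class=0.3, w_loc=0.5):
    """Combine structure, classification, and localization terms into one score.

    The weights are illustrative assumptions, not values from the paper.
    """
    preds = parse_prediction(text)
    if preds is None:
        return 0.0  # unparseable output earns nothing
    if not targets:
        # Defect-free image: full credit only when nothing is predicted.
        return w_format + (w_class + w_loc if not preds else 0.0)

    unmatched = list(preds)
    class_hits, iou_sum = 0, 0.0
    for target in targets:
        if not unmatched:
            break
        # Greedy matching: pair each ground-truth box with its best remaining prediction.
        best = max(unmatched, key=lambda p: iou(p["bbox"], target["bbox"]))
        best_iou = iou(best["bbox"], target["bbox"])
        if best_iou > 0:
            unmatched.remove(best)
            iou_sum += best_iou
            class_hits += int(best["label"] == target["label"])

    # Dividing by the larger count penalizes both missed and spurious detections.
    denom = max(len(preds), len(targets))
    return w_format + w_class * class_hits / denom + w_loc * iou_sum / denom
```

Every term here can be checked programmatically against ground-truth annotations, which is what makes the reward verifiable. A well-formed output with the wrong label still earns the structure and localization terms, so the three objectives contribute separately to the training signal.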