Large Vision-Language Models (VLMs), which jointly process visual and textual information, have demonstrated strong performance in general scene understanding; however, their application to safety-critical industrial environments remains limited by the need for precise semantic reasoning, interpretability, and computational efficiency. Conventional worker safety monitoring systems typically rely on rigid object detection pipelines, which are prone to error propagation and provide limited explanatory capability. This paper presents mLLM-CRD, a multimodal framework that formulates safety compliance assessment as an image-to-language reasoning task. Built upon Bootstrapping Language-Image Pre-training (BLIP-2), the proposed approach applies a joint parameter-efficient fine-tuning strategy using Low-Rank Adaptation (LoRA), which is integrated into both the Querying Transformer (Q-Former) and the encoder-decoder-based language model while keeping the vision encoder frozen, thereby enabling task adaptation with minimal additional parameters. To support rule-aware semantic reasoning, we construct a scenario-based question-answer (QA) dataset that encodes safety compliance conditions as structured language queries derived from industrial safety rules. Experimental results show that mLLM-CRD consistently outperforms zero-shot VLM baselines in both classification accuracy and natural language generation quality, particularly under class-imbalanced conditions. These findings suggest that unified multimodal reasoning can provide interpretable, language-based assessments of worker safety compliance, offering a practical alternative to conventional multi-stage detection pipelines in real-world industrial settings.
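To make the parameter-efficiency claim concrete, the following is a minimal, dependency-free sketch of the general LoRA idea: a frozen weight matrix W is augmented with a trainable low-rank update scaled by alpha/r, so only r(d_in + d_out) parameters are trained per adapted layer. It is not the mLLM-CRD implementation; the class name `LoRALinear`, the helper `lora_param_count`, and the dimensions, rank, and scaling values are illustrative assumptions, and a practical system would use a deep learning framework rather than Python lists.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_param_count(d_in, d_out, rank):
    """Trainable parameters added by one LoRA adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


class LoRALinear:
    """Frozen weight W (d_out x d_in) plus a low-rank update (alpha / rank) * B @ A.

    Hypothetical illustration only; names and defaults are assumptions.
    """

    def __init__(self, weight, rank, alpha, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight            # frozen: never updated during fine-tuning
        self.scale = alpha / rank
        # A gets a small random init; B starts at zero so the adapted layer
        # initially reproduces the frozen layer exactly.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.weight, x)               # W x
        update = matvec(self.B, matvec(self.A, x))  # B (A x)
        return [b + self.scale * u for b, u in zip(base, update)]


# Usage: a 3 x 2 frozen weight with a rank-1 adapter.
layer = LoRALinear([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], rank=1, alpha=2.0)
initial_output = layer.forward([1.0, 1.0])  # equals W x because B is zero

# Simulate a trained adapter by setting A and B by hand.
layer.A = [[1.0, 1.0]]
layer.B = [[1.0], [0.0], [0.0]]
adapted_output = layer.forward([1.0, 1.0])

# Assumed example size: a 768 x 768 projection adapted with rank 8.
added = lora_param_count(768, 768, 8)
frozen = 768 * 768
```

Under these assumed sizes, the adapter adds 12,288 trainable parameters against 589,824 frozen ones, which is the sense in which low-rank adaptation keeps the number of additional parameters small. Initializing B to zero means fine-tuning starts from the pretrained model's behavior.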
Lim et al. studied this question.