What question did this study set out to answer?

How can a multimodal framework assess worker safety compliance using image-to-language reasoning?

April 13, 2026 · Open Access

mLLM-CRD: multi-modal LLM based context-aware rule detection using LoRA-integrated attention for worker safety monitoring

Key Points

Developed mLLM-CRD, a multimodal framework that assesses worker safety compliance through image-to-language reasoning.
Applied joint parameter-efficient fine-tuning with Low-Rank Adaptation (LoRA) to both the Querying Transformer (Q-Former) and the language model, keeping the vision encoder frozen.
Constructed a scenario-based question-answer dataset derived from industrial safety rules.
Built the framework on Bootstrapping Language-Image Pre-training (BLIP-2).
Demonstrated superior classification accuracy compared to zero-shot VLM baselines.
Improved natural language generation quality, especially in class-imbalanced scenarios.
Provided interpretable language-based assessments for worker safety compliance.

Abstract

Large Vision-Language Models (VLMs), which jointly process visual and textual information, have demonstrated strong performance in general scene understanding; however, their application to safety-critical industrial environments remains limited by the need for precise semantic reasoning, interpretability, and computational efficiency. Conventional worker safety monitoring systems typically rely on rigid object detection pipelines, which are prone to error propagation and provide limited explanatory capability. This paper presents mLLM-CRD, a multimodal framework that formulates safety compliance assessment as an image-to-language reasoning task. Built upon Bootstrapping Language-Image Pre-training (BLIP-2), the proposed approach applies a joint parameter-efficient fine-tuning strategy using Low-Rank Adaptation (LoRA), which is integrated into both the Querying Transformer (Q-Former) and the encoder-decoder-based language model while keeping the vision encoder frozen, thereby enabling task adaptation with minimal additional parameters. To support rule-aware semantic reasoning, we construct a scenario-based question-answer (QA) dataset that encodes safety compliance conditions as structured language queries derived from industrial safety rules. Experimental results show that mLLM-CRD consistently outperforms zero-shot VLM baselines in both classification accuracy and natural language generation quality, particularly under class-imbalanced conditions. These findings suggest that unified multimodal reasoning can provide interpretable, language-based assessments of worker safety compliance, offering a practical alternative to conventional multi-stage detection pipelines in real-world industrial settings.
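The abstract states that LoRA adapts the Q-Former and the language model with minimal additional parameters. The sketch below illustrates the general LoRA idea only and is not the authors' implementation: a frozen weight matrix W is augmented with a trainable low-rank product B·A, scaled by alpha / rank. The names `LoRALinear`, `matvec`, and `lora_param_counts`, along with the 768-dimensional layer size, rank, and alpha, are illustrative assumptions and do not come from the paper.

```python
import random


def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


class LoRALinear:
    """Frozen linear map W plus a trainable low-rank update (hypothetical sketch)."""

    def __init__(self, W, rank, alpha, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(W), len(W[0])
        self.W = W                      # frozen pretrained weights, d_out x d_in
        self.scale = alpha / rank       # standard LoRA scaling factor
        # A gets a small random init; B starts at zero, so the adapted
        # layer initially behaves exactly like the frozen layer.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.W, x)                      # W x
        update = matvec(self.B, matvec(self.A, x))    # B (A x)
        return [b + self.scale * u for b, u in zip(base, update)]


def lora_param_counts(d_in, d_out, rank):
    """Return (parameters in the full matrix, trainable LoRA parameters)."""
    return d_in * d_out, rank * (d_in + d_out)


if __name__ == "__main__":
    # Illustrative sizes only; the paper's actual dimensions are not given here.
    full, lora = lora_param_counts(768, 768, 8)
    print(full, lora)  # 589824 12288
```

Under these assumed sizes, the low-rank update trains roughly 2% as many parameters as the full matrix, which is the sense in which this style of fine-tuning is parameter-efficient.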

Cite This Study

Lim et al. studied this question.

synapsesocial.com/papers/69dc89183afacbeac03eacf7 https://doi.org/10.1007/s44443-026-00572-2
