
April 21, 2026 · Open Access

Development of a Vermin Detection System Using Multimodal Large Language Models

Key Points

Aimed to create a zero-shot vermin detection system for agricultural monitoring using multimodal large language models.
Collected 1,073 images using cameras at three locations in Nanae-cho, Hokkaido, Japan; 22 of them showed target animals.
Tested and compared the detection performance of GPT-4o, LLaVA, YOLO-World, and Grounding DINO.
Evaluated model performance based on recall and false detection rates.
GPT-4o achieved perfect recall (1.00) in detecting vermin during the preliminary deployment.
GPT-4o also produced 17 false detections in images without target animals, indicating room for improvement.

Abstract

In recent years, damage to humans and crops caused by bears and other vermin has become increasingly serious across Japan. Although smart agricultural monitoring systems have shown some promise, they remain limited by species-specific designs, high cost, and a lack of adaptability. This study focused on creating and testing a zero-shot vermin detection system using a multimodal large language model. A total of 1,073 images were collected using cameras installed at three locations in Nanae-cho, Hokkaido, Japan, between May and September 2025. Twenty-two images showed the target animals: 12 bears, nine deer, and one crow. A comparative evaluation of GPT-4o, LLaVA, YOLO-World, and Grounding DINO showed that GPT-4o had promising recall in our preliminary deployment (recall = 1.00), although 17 false detections occurred in images without animals.
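The two reported metrics can be illustrated with a short calculation from the counts given in the abstract. This is a minimal sketch and not the authors' evaluation code. It assumes that recall is computed per image over the 22 target-animal images, that the 17 false detections correspond to 17 distinct images, and that the false detection rate is taken over the 1,051 remaining images; the function and variable names are hypothetical.

```python
def recall(true_positives: int, false_negatives: int) -> float:
    """Fraction of target-animal images that the model flagged."""
    total = true_positives + false_negatives
    if total == 0:
        raise ValueError("recall is undefined without any positive images")
    return true_positives / total


def false_detection_rate(false_detections: int, negative_images: int) -> float:
    """False detections per image without a target animal (assumed definition)."""
    if negative_images == 0:
        raise ValueError("rate is undefined without any negative images")
    return false_detections / negative_images


# Counts taken from the abstract.
TOTAL_IMAGES = 1073
TARGET_IMAGES = 12 + 9 + 1   # bears + deer + crow = 22
FALSE_DETECTIONS = 17

# Assumption: a reported recall of 1.00 means no target image was missed.
MISSED_TARGETS = 0

negative_images = TOTAL_IMAGES - TARGET_IMAGES
gpt4o_recall = recall(TARGET_IMAGES - MISSED_TARGETS, MISSED_TARGETS)
gpt4o_fdr = false_detection_rate(FALSE_DETECTIONS, negative_images)

print(f"recall = {gpt4o_recall:.2f}")
print(f"false detection rate = {gpt4o_fdr:.4f} ({FALSE_DETECTIONS}/{negative_images})")
```

Under these assumptions, the false detection rate works out to roughly 1.6% of the images without target animals. The study may define the rate differently, so this figure is illustrative only.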
