What type of study is this?

This is an experimental study.

October 8, 2025 · Open Access

VIPER: Visual Perception and Explainable Reasoning for Sequential Decision-Making

Key Points

VIPER significantly outperforms state-of-the-art visual instruction-based planners on sequential decision-making tasks.
Experiments on the ALFWorld benchmark show that VIPER narrows the gap with purely text-based oracles.
The framework uses a modular pipeline: a frozen VLM describes each image observation in text, and an LLM policy, fine-tuned with behavioral cloning and reinforcement learning, predicts the next action from that description and the task goal.
Using text as the intermediate representation improves explainability and allows a fine-grained analysis of the perception and reasoning components.

Abstract

While Large Language Models (LLMs) excel at reasoning on text and Vision-Language Models (VLMs) are highly effective for visual perception, applying those models for visual instruction-based planning remains a widely open problem. In this paper, we introduce VIPER, a novel framework for multimodal instruction-based planning that integrates VLM-based perception with LLM-based reasoning. Our approach uses a modular pipeline where a frozen VLM generates textual descriptions of image observations, which are then processed by an LLM policy to predict actions based on the task goal. We fine-tune the reasoning module using behavioral cloning and reinforcement learning, improving our agent's decision-making capabilities. Experiments on the ALFWorld benchmark show that VIPER significantly outperforms state-of-the-art visual instruction-based planners while narrowing the gap with purely text-based oracles. By leveraging text as an intermediate representation, VIPER also enhances explainability, paving the way for a fine-grained analysis of perception and reasoning components.
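
The perceive-then-reason loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the class name `PerceiveThenReasonAgent`, the `Describer` and `Policy` signatures, and the two toy stand-in functions are all hypothetical, and no real VLM or LLM is called. It only shows how a text description can sit between perception and action selection, and how keeping that text yields an inspectable trace.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical signatures (assumptions, not taken from the paper):
# a frozen VLM maps an image observation to a text description,
# and an LLM policy maps (goal, description, past actions) to an action.
Describer = Callable[[object], str]
Policy = Callable[[str, str, List[str]], str]


@dataclass
class PerceiveThenReasonAgent:
    """Chains a perception module and a reasoning module through text."""
    describe: Describer
    policy: Policy
    trace: List[dict] = field(default_factory=list)

    def act(self, goal: str, image: object) -> str:
        # Perception step: turn the image into text.
        description = self.describe(image)
        # Reasoning step: choose an action from the goal, text, and history.
        history = [step["action"] for step in self.trace]
        action = self.policy(goal, description, history)
        # Keep the intermediate text so each decision can be inspected later.
        self.trace.append({"description": description, "action": action})
        return action


def toy_describer(image: object) -> str:
    # Stand-in for a frozen VLM; here the "image" is already a string.
    return f"You see: {image}"


def toy_policy(goal: str, description: str, history: List[str]) -> str:
    # Stand-in for a fine-tuned LLM policy; a fixed rule, not a model.
    if "mug" in description and "take mug" not in history:
        return "take mug"
    return "look around"


agent = PerceiveThenReasonAgent(toy_describer, toy_policy)
first = agent.act("put a mug on the desk", "a mug on the counter")
second = agent.act("put a mug on the desk", "a mug on the counter")
print(first, "|", second)
for step in agent.trace:
    print(step)
```

Because every step stores the description alongside the chosen action, a wrong action can be traced either to a faulty description (perception) or to a faulty choice given a correct description (reasoning), which is the kind of component-level analysis the abstract mentions.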

Cite This Study

Aissi et al. (2025) studied this question.

synapsesocial.com/papers/68e62de1a8c0c6d45874001f
https://doi.org/10.48550/arxiv.2503.15108
