What question did this study set out to answer?

To evaluate the effectiveness of a decision support agent, MARTIAN, for autonomous maritime operations.

March 14, 2026 · Open Access

Deploying Efficient LLM Agents on Maritime Autonomous Surface Ships: Fine-Tuning, RAG, and Function Calling in a Mid-Size Model

Key Points

To evaluate the effectiveness of a decision support agent, MARTIAN, for autonomous maritime operations.
Development of the MARTIAN agent utilizing a Cognitive Core architecture
Evaluation of performance using the Bilingual Maritime Multiple-Choice Questionnaire
Ablation studies to investigate the effect of RAG on task performance, motivating a future adaptive routing mechanism
MARTIAN achieves 73.23% accuracy with SFT and 81.16% with SFT + RAG on BM-MCQ
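In the SFT-only configuration, the agent surpasses the 72B-parameter Qwen-2.5 model on COLREG logic tasks, achieving 78.53% accuracy
RAG improves factual recall but introduces semantic noise that lowers COLREG logic accuracy from 78.53% to 77.36%; adaptive routing is deferred to future work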
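
The reported figures imply a few derived quantities that are easy to check by hand. The short Python sketch below only restates numbers given in the abstract; the variable and function names are illustrative and do not come from the paper.

```python
# Figures as reported in the abstract (accuracy in percent, latency in ms per token).
overall_sft = 73.23
overall_sft_rag = 81.16
colreg_sft = 78.53
colreg_sft_rag = 77.36
latency_ms_per_token = 22.4


def delta_points(before: float, after: float) -> float:
    """Change in accuracy, in percentage points, rounded to two decimals."""
    return round(after - before, 2)


# Overall BM-MCQ gain from adding RAG to the SFT model.
overall_gain = delta_points(overall_sft, overall_sft_rag)

# Change on COLREG logic tasks when RAG is added (negative means a drop).
colreg_change = delta_points(colreg_sft, colreg_sft_rag)

# Throughput implied by the reported per-token latency.
tokens_per_second = round(1000 / latency_ms_per_token, 1)

print(overall_gain, colreg_change, tokens_per_second)
```

In other words, RAG adds roughly 7.9 percentage points overall while costing about 1.2 points on COLREG logic tasks, and 22.4 ms/token corresponds to roughly 45 tokens per second.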

Abstract

Deploying Large Language Models (LLMs) on Maritime Autonomous Surface Ships (MASS) entails a critical trade-off between reasoning depth, inference latency, and hardware constraints. To address this trade-off, we introduce MARTIAN (Maritime Agent for Real-time Tactical Inference And Navigation), a 14B-parameter decision support agent engineered for edge deployment on standard vessel hardware (e.g., the NVIDIA Jetson AGX Orin). Central to our approach is the Cognitive Core architecture, which uses a verified dataset of 21,800 Chain-of-Thought (CoT) instruction–response pairs to align general linguistic capabilities with maritime procedural logic. Empirical evaluations demonstrate that MARTIAN achieves an overall accuracy of 73.23% with supervised fine-tuning (SFT) only and 81.16% with SFT plus retrieval-augmented generation (RAG) on the Bilingual Maritime Multiple-Choice Questionnaire (BM-MCQ), a standardized assessment dataset built on Officer of the Watch (OOW) competencies. Notably, the SFT-only configuration attains 78.53% on logic-intensive COLREG tasks, surpassing the 72B-parameter Qwen-2.5 foundation model in this domain, while maintaining a real-time inference latency of 22.4 ms/token. Crucially, our ablation studies support a nuanced Interference Hypothesis: while RAG significantly enhances factual recall in knowledge-intensive domains (boosting total accuracy from 73.23% to 81.16%), it concurrently introduces semantic noise that degrades performance in pure logic reasoning tasks (e.g., COLREG maneuvering accuracy decreases from 78.53% to 77.36%). On the basis of this finding, we identify and empirically motivate a decoupled cognitive design principle that separates procedural reflexes (via SFT) from declarative knowledge (via RAG).
While the full implementation of an adaptive routing mechanism is deferred to future work, the ablation results presented herein offer a validated, cost-effective reference architecture for deploying transparent and regulation-compliant AI on resource-constrained merchant vessels.
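
To make the decoupled design principle concrete, here is a minimal Python sketch of what a query router could look like. It is an assumption for illustration only: the abstract defers the adaptive routing mechanism to future work and does not describe how queries would be classified. The cue list, `RoutedPrompt`, `is_procedural`, `route_query`, and the retriever callable are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical cue phrases for procedural (COLREG-style) queries.
# The paper does not specify a classification method; keyword matching
# is used here only to keep the sketch self-contained.
PROCEDURAL_CUES = ("colreg", "give way", "stand-on", "crossing", "overtaking", "head-on")


@dataclass
class RoutedPrompt:
    route: str   # "sft_only" or "sft_rag"
    prompt: str  # text that would be passed to the fine-tuned model


def is_procedural(query: str) -> bool:
    """Return True if the query looks like a rule-application (logic) task."""
    q = query.lower()
    return any(cue in q for cue in PROCEDURAL_CUES)


def route_query(query: str, retrieve: Callable[[str], List[str]]) -> RoutedPrompt:
    """Send procedural queries to the model without retrieved context;
    prepend retrieved passages only for knowledge-oriented queries."""
    if is_procedural(query):
        # Skip retrieval so no extra context is added to logic tasks.
        return RoutedPrompt("sft_only", query)
    passages = retrieve(query)
    context = "\n".join(passages)
    return RoutedPrompt("sft_rag", f"Context:\n{context}\n\nQuestion: {query}")


def toy_retrieve(query: str) -> List[str]:
    """Placeholder retriever standing in for a real document store."""
    return ["(retrieved passage placeholder)"]


if __name__ == "__main__":
    print(route_query("Which vessel must give way in a crossing situation?", toy_retrieve).route)
    print(route_query("What documents must be carried on board?", toy_retrieve).route)
```

The point of the sketch is the branch itself: retrieval is bypassed for rule-application queries, which is the separation the ablation results motivate. A real router would likely need a learned classifier rather than keyword cues.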

Authors

Yuyu Ren

Minglong Chen

Junjie Weng

Institutions

Wuhan University of Technology
