March 3, 2026 · Open Access

Optimizing System Prompts as RL-like Policy Search with Prompt Evolution

Key Points

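Prompt optimization is framed as an RL-like policy search problem: system prompts are the policy parameters and LLM-generated feedback serves as a surrogate gradient.
On the HH-RLHF dataset, the judge highlights safety/utility trade-offs, and its structured feedback guides the controller's prompt edits.
A contextual multi-armed bandit ensemble selects an action from the outputs proposed by prompt-conditioned arms, and prompts are improved iteratively from judge feedback.
Run examples suggest that prompts should stay stable unless an edit is needed: on HH-RLHF the controller keeps prompts stable, while on SGD it edits them to enforce brevity and action-first outputs.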

Abstract

We reframe prompt optimization as an RL-like policy search problem in which system prompts are the policy parameters and LLM-generated feedback serves as a surrogate gradient. We study LLM-Based Contextual Multi-Armed Bandit with Prompt Evolution (LLM-CMAB-PE): multiple prompt-conditioned arms propose outputs, a bandit ensemble selects an action, a judge scores the result, and a controller edits prompts using structured feedback and explicit prompt diffs. Unlike black-box prompt tuning, this approach provides gradient-like updates in natural language (controller edits + diffs) and preserves full traceability across iterations. We report run examples from two distinct datasets: (i) HH-RLHF preference judging and (ii) Schema-Guided Dialogue (SGD) next-action prediction. The runs show how prompt edits, judge rationales, and prompt diffs interact to improve alignment: on HH-RLHF, the judge highlights safety/utility trade-offs and the controller keeps prompts stable; on SGD, early iterations fail because the model returns explanations rather than exact action labels, prompting controller edits to enforce brevity and action-first outputs. We summarize dataset characteristics, evolution behaviors, and measured outcomes, then place a best-effort regret sketch at the end for scheduling intuition.
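The loop described in the abstract (arms propose, a bandit selects, a judge scores, a controller edits prompts and records diffs) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: UCB1 stands in for the bandit ensemble, and `toy_propose`, `toy_judge`, and `toy_controller` are hypothetical stubs where LLM calls would go. The toy task mimics the SGD failure mode, where an arm returns an explanation instead of an exact action label.

```python
import difflib
import math
from dataclasses import dataclass, field


@dataclass
class Arm:
    """One prompt-conditioned arm: a system prompt plus bandit statistics."""
    prompt: str
    pulls: int = 0
    total_reward: float = 0.0
    history: list = field(default_factory=list)  # (iteration, diff) pairs

    def mean(self) -> float:
        return self.total_reward / self.pulls if self.pulls else 0.0


def ucb1_select(arms, t):
    """UCB1 selection (an assumed stand-in for the bandit ensemble)."""
    for i, arm in enumerate(arms):
        if arm.pulls == 0:
            return i  # play every arm once before using the UCB score
    return max(
        range(len(arms)),
        key=lambda i: arms[i].mean() + math.sqrt(2 * math.log(t) / arms[i].pulls),
    )


def toy_propose(prompt, query):
    """Hypothetical arm output; a real system would call an LLM here."""
    if "label only" in prompt:
        return "book_flight"
    return "I think the user wants book_flight"


def toy_judge(output, gold):
    """Hypothetical judge returning (score, rationale)."""
    if output == gold:
        return 1.0, "exact action label"
    return 0.0, "returned an explanation rather than the exact action label"


def toy_controller(prompt, rationale):
    """Hypothetical controller: maps the judge rationale to a prompt edit."""
    if "explanation" in rationale and "label only" not in prompt:
        return prompt + "\nAnswer with the action label only."
    return prompt  # no edit needed, so the prompt stays stable


def prompt_diff(old, new):
    """Explicit prompt diff, stored so every edit remains traceable."""
    return "\n".join(
        difflib.unified_diff(
            old.splitlines(), new.splitlines(),
            fromfile="before", tofile="after", lineterm="",
        )
    )


def run(arms, query, gold, iterations=6):
    for t in range(1, iterations + 1):
        proposals = [toy_propose(arm.prompt, query) for arm in arms]  # all arms propose
        i = ucb1_select(arms, t)                                      # bandit selects
        score, rationale = toy_judge(proposals[i], gold)              # judge scores
        arms[i].pulls += 1
        arms[i].total_reward += score
        new_prompt = toy_controller(arms[i].prompt, rationale)        # controller edits
        if new_prompt != arms[i].prompt:
            arms[i].history.append((t, prompt_diff(arms[i].prompt, new_prompt)))
            arms[i].prompt = new_prompt
    return arms


if __name__ == "__main__":
    arms = run(
        [Arm("You are a dialogue agent."), Arm("You predict the next action.")],
        query="I need a flight to Paris.",
        gold="book_flight",
    )
    for arm in arms:
        print(arm.pulls, arm.mean())
        for t, diff in arm.history:
            print(f"iteration {t}:\n{diff}")
```

In this toy setting the controller's edit plays the role of the natural-language "gradient" step, and the stored diffs give the per-iteration traceability the abstract describes.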

Cite This Study

Bechir Trabelsi studied this question.

synapsesocial.com/papers/69a75b42c6e9836116a2249a https://doi.org/10.5281/zenodo.18385004
