We reframe prompt optimization as an RL-like policy search problem in which system prompts are the policy parameters and LLM-generated feedback serves as a surrogate gradient. We study LLM-Based Contextual Multi-Armed Bandit with Prompt Evolution (LLM-CMAB-PE): multiple prompt-conditioned arms propose outputs, a bandit ensemble selects an action, a judge scores the result, and a controller edits the prompts using structured feedback and explicit prompt diffs. Unlike black-box prompt tuning, this approach produces gradient-like updates in natural language (controller edits plus diffs) and preserves full traceability across iterations. We report example runs on two datasets: (i) HH-RLHF preference judging and (ii) Schema-Guided Dialogue (SGD) next-action prediction. The runs show how prompt edits, judge rationales, and prompt diffs interact to improve alignment. On HH-RLHF, the judge highlights safety/utility trade-offs and the controller keeps the prompts stable. On SGD, early iterations fail because the model returns explanations rather than exact action labels, which leads the controller to edit the prompts to enforce brevity and action-first outputs. We summarize dataset characteristics, evolution behaviors, and measured outcomes, and close with a best-effort regret sketch for scheduling intuition.
BECHIR TRABELSI studied this question.
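To make the propose / select / judge / edit loop described in the abstract concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. A single UCB1 selector stands in for the bandit ensemble, and `propose`, `judge`, and `controller_edit` are hypothetical deterministic stubs that replace the LLM calls. Their behavior is chosen only to mimic the SGD failure mode mentioned above, where outputs contain explanations instead of exact action labels. Prompt diffs are recorded with the standard-library `difflib` so that every edit stays traceable.

```python
import difflib
import math
from dataclasses import dataclass, field


@dataclass
class Arm:
    """One prompt-conditioned arm; `history` keeps prompt diffs for traceability."""
    prompt: str
    pulls: int = 0
    total_reward: float = 0.0
    history: list = field(default_factory=list)

    def mean(self) -> float:
        return self.total_reward / self.pulls if self.pulls else 0.0


def ucb1_select(arms, t, c=1.0):
    """Assumption: plain UCB1 stands in for the paper's bandit ensemble."""
    for i, arm in enumerate(arms):
        if arm.pulls == 0:  # play every arm once before using the bonus term
            return i
    return max(
        range(len(arms)),
        key=lambda i: arms[i].mean() + c * math.sqrt(2 * math.log(t) / arms[i].pulls),
    )


def propose(prompt, context):
    """Hypothetical stub for the LLM arm: verbose unless the prompt demands a bare label."""
    if "label only" in prompt:
        return context["gold"]
    return f"I think the next action is {context['gold']} because ..."


def judge(output, context):
    """Hypothetical stub judge: exact-match reward plus a short rationale."""
    if output == context["gold"]:
        return 1.0, "exact action label"
    return 0.0, "returned an explanation instead of an exact action label"


def controller_edit(prompt, rationale):
    """Hypothetical stub controller: turns the judge rationale into a prompt edit."""
    if "explanation" in rationale and "label only" not in prompt:
        return prompt + " Answer with the action label only."
    return prompt


def prompt_diff(old, new):
    """Explicit prompt diff in unified format."""
    lines = difflib.unified_diff(
        old.splitlines(), new.splitlines(), "before", "after", lineterm=""
    )
    return "\n".join(lines)


def run(arms, contexts):
    """One pass of the loop: all arms propose, the bandit selects, the judge scores,
    and the controller edits the selected arm's prompt."""
    rewards = []
    for t, context in enumerate(contexts, start=1):
        proposals = [propose(arm.prompt, context) for arm in arms]
        i = ucb1_select(arms, t)
        reward, rationale = judge(proposals[i], context)
        arm = arms[i]
        arm.pulls += 1
        arm.total_reward += reward
        new_prompt = controller_edit(arm.prompt, rationale)
        if new_prompt != arm.prompt:
            arm.history.append(prompt_diff(arm.prompt, new_prompt))
            arm.prompt = new_prompt
        rewards.append(reward)
    return rewards
```

In this toy setting the reward is an exact-match score, so the controller's edit plays the role of the surrogate gradient step: the judge rationale is the only signal that changes the prompt, and each change is stored as a diff on the arm that produced it.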