I am writing this paper following a moment that genuinely changed how I think about reinforcement learning in modern AI systems. While interacting with a widely used large-scale AI model (name intentionally omitted), I was engaged in a conversation about makeup, skincare, and personal appearance. Without any explicit request, and with nothing in the conversation referring to it, the model generated an unrelated image depicting a group of men standing at what appeared to be a construction or contact site. The output was not offensive or harmful, but it was unexpected. More importantly, it raised a fundamental question: why did the model decide that this action was appropriate? This paper is the result of investigating that question.
Snehal Kalebag (Thu,) studied this question.