Stabilizing Policy Gradients for Sample-Efficient Reinforcement Learning in LLM Reasoning | Synapse