Stepwise Guided Policy Optimization: Coloring your Incorrect Reasoning in GRPO | Synapse