What question did this study set out to answer?

The study aims to develop a Q-learning-based algorithm for optimal control of two-dimensional (2D) discrete-time systems with unknown dynamics.

April 19, 2026

Reinforcement Q-learning optimal control of 2D discrete-time systems with unknown dynamics

Key Points

Constructed the value function within a Lyapunov function framework.
Derived the algebraic Riccati inequality (ARI) and the Bellman inequality for the LQR problem.
Developed a suboptimal state feedback controller and an offline policy iteration algorithm based on semi-definite programming (SDP).
Transformed the objective function and the Bellman inequality into a Q-function and its corresponding inequality.
Designed an online policy iteration algorithm that collects data during each iteration.
Validated the effectiveness of the proposed control scheme through two examples.
Showed in those examples that the scheme controls 2D systems whose dynamics are unknown.
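
For orientation, the inequalities named in the points above have familiar counterparts in the ordinary (1D) discrete-time LQR problem. The block below states those 1D forms only. The system $x_{k+1} = A x_k + B u_k$, the weights $Q \succeq 0$ and $R \succ 0$, the gain $K$, and the symbols $P$, $H$, and $\mathcal{Q}$ are assumptions chosen for illustration; they are not the paper's 2D notation or its exact conditions.

```latex
% 1D analogue in assumed notation, not the paper's 2D formulation.
\begin{align}
  % Quadratic value function and Bellman inequality under u_k = -K x_k
  V(x) &= x^\top P x, \qquad
  V(x_k) \ge x_k^\top Q x_k + u_k^\top R u_k + V(x_{k+1}), \\
  % The same inequality written as a matrix inequality in P
  0 &\succeq (A - BK)^\top P (A - BK) - P + Q + K^\top R K, \\
  % Q-function: stage cost plus value of the successor state
  \mathcal{Q}(x_k, u_k) &= x_k^\top Q x_k + u_k^\top R u_k + V(x_{k+1})
    = \begin{bmatrix} x_k \\ u_k \end{bmatrix}^\top
      H
      \begin{bmatrix} x_k \\ u_k \end{bmatrix}, \\
  H &= \begin{bmatrix} H_{xx} & H_{xu} \\ H_{ux} & H_{uu} \end{bmatrix}
     = \begin{bmatrix} Q + A^\top P A & A^\top P B \\
                       B^\top P A & R + B^\top P B \end{bmatrix}, \\
  % Greedy policy read directly from the blocks of H
  u_k &= -H_{uu}^{-1} H_{ux}\, x_k .
\end{align}
```

In this 1D setting, any $P \succeq 0$ that satisfies the matrix inequality gives the upper bound $x_0^\top P x_0$ on the cost of $u_k = -K x_k$, which is the sense in which a controller obtained from such inequalities is suboptimal. The last line shows why the Q-function form suits unknown dynamics: the gain depends only on $H$, which can be estimated from data without knowing $A$ or $B$.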

Abstract

This paper proposes a Q-learning-based algorithm to solve the linear quadratic regulator (LQR) problem for two-dimensional (2D) discrete-time systems with unknown dynamics. First, based on a value function constructed within the Lyapunov function framework, the algebraic Riccati inequality (ARI) and the Bellman inequality for solving the LQR problem are derived. A suboptimal state feedback controller is then obtained from these inequalities, and an offline policy iteration algorithm based on semi-definite programming (SDP) is introduced. Building on this, the concept of Q-learning is used to transform the objective function and the Bellman inequality into the Q-function and its corresponding inequality. A Q-learning-based offline policy iteration equation is derived, and an online policy iteration algorithm based on Q-learning is then designed. Data are collected online during each iteration to solve the LQR problem for 2D discrete-time systems with unknown dynamics. Finally, the effectiveness of the proposed control scheme is validated through two examples.
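
The policy iteration loop described in the abstract can be illustrated on a much simpler case. The following is a minimal sketch, assuming an ordinary 1D discrete-time system x_next = A x + B u and a least-squares policy evaluation step in place of the paper's SDP-based inequalities; it is not the authors' 2D algorithm. All names (`quad_features`, `theta_to_H`, `q_learning_lqr`, `Qw`, `Rw`, `K0`) are hypothetical. The matrices `A` and `B` appear only to generate measured data, standing in for the unknown plant, and the initial gain `K0` is assumed to be stabilizing.

```python
import numpy as np


def quad_features(z):
    """Monomials z_i * z_j (i <= j) so that z' H z = theta . phi(z) for symmetric H."""
    i, j = np.triu_indices(z.size)
    scale = np.where(i == j, 1.0, 2.0)  # off-diagonal entries of H count twice
    return scale * np.outer(z, z)[i, j]


def theta_to_H(theta, size):
    """Rebuild the symmetric matrix H from its upper-triangular entries."""
    H = np.zeros((size, size))
    i, j = np.triu_indices(size)
    H[i, j] = theta
    H[j, i] = theta
    return H


def q_learning_lqr(A, B, Qw, Rw, K0, n_iter=10, n_samples=200, seed=0):
    """Q-learning policy iteration for a 1D LQR problem (illustrative sketch).

    A and B are used only to simulate measurements of the plant; the learner
    itself works from (x, u, cost, x_next) samples. K0 must be stabilizing.
    """
    rng = np.random.default_rng(seed)
    n, m = B.shape
    K = np.array(K0, dtype=float)
    H = None
    for _ in range(n_iter):
        Phi, costs = [], []
        for _ in range(n_samples):
            x = rng.standard_normal(n)
            u = -K @ x + rng.standard_normal(m)  # exploration noise on the input
            x_next = A @ x + B @ u               # measured response of the plant
            u_next = -K @ x_next                 # action of the current policy
            z = np.concatenate([x, u])
            z_next = np.concatenate([x_next, u_next])
            # Bellman equation for the current policy:
            # z' H z - z_next' H z_next = stage cost
            Phi.append(quad_features(z) - quad_features(z_next))
            costs.append(x @ Qw @ x + u @ Rw @ u)
        # Policy evaluation: least-squares fit of the entries of H
        theta, *_ = np.linalg.lstsq(np.array(Phi), np.array(costs), rcond=None)
        H = theta_to_H(theta, n + m)
        # Policy improvement: u = -K x with K = H_uu^{-1} H_ux
        K = np.linalg.solve(H[n:, n:], H[n:, :n])
    return K, H
```

As a usage example, calling `q_learning_lqr(A, B, np.eye(2), np.eye(1), np.zeros((1, 2)))` with a stable 2-state, 1-input pair `(A, B)` returns a gain that can be checked against the solution of the discrete-time Riccati equation. Each state is resampled at random here to keep the sketch short; an online variant would instead collect the samples along a running trajectory.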
