Imitation learning (IL) learns a policy from expert trajectories and serves as a fundamental paradigm in both large language model training and embodied AI. The problem is challenging because, in sequential decision-making, errors can accumulate and the state distribution can shift over the horizon. Nevertheless, one family of IL approaches, adversarial imitation learning (AIL), has been found to perform exceptionally well in practice. With just one expert trajectory, AIL often matches the expert performance even over a long horizon, on tasks such as robotic locomotion control. Two fundamental questions remain open: why does AIL perform well with so few trajectories, and why does it maintain good performance over long horizons? Previous theoretical results fail to answer these questions because they are meaningful only in the large-sample regime (i.e., with many expert trajectories) and depend on the decision horizon. In this paper, we analyze a total-variation-distance-based AIL (called TV-AIL), showing a horizon-free imitation gap $\mathcal{O}(\min\{1, \sqrt{|\mathcal{S}|/N}\})$ on a class of instances abstracted from robotic locomotion control tasks. Here $|\mathcal{S}|$ is the state space size of a Markov Decision Process (MDP), and $N$ is the number of expert trajectories. We emphasize two important features of our bound. First, the bound is meaningful in both the small-sample and large-sample regimes. Second, the bound suggests that the imitation gap of TV-AIL does not increase with the decision horizon. Together, these features can explain the empirical observations and provide insights into how AIL addresses the distribution shift issue. Our analysis leverages the multi-stage policy optimization structure in TV-AIL and presents a new stage-coupled analysis. This tool also helps analyze the worst-case imitation gap of TV-AIL, disclosing its limitations in general MDPs.
Xu et al. (Thu,) studied this question.
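
To make the two sample regimes concrete, the bound can be read case by case. This is a minimal sketch, assuming the bound has the form $\mathcal{O}(\min\{1, \sqrt{|\mathcal{S}|/N}\})$; the hidden constants and the normalization of the imitation gap are not specified in the abstract.

```latex
% Assumed form of the bound; constants and normalization are not given here.
\[
\min\left\{1, \sqrt{\frac{|\mathcal{S}|}{N}}\right\}
=
\begin{cases}
1, & N \le |\mathcal{S}| \quad \text{(small-sample regime)},\\[4pt]
\sqrt{|\mathcal{S}|/N}, & N > |\mathcal{S}| \quad \text{(large-sample regime)}.
\end{cases}
\]
```

Under this reading, the bound stays at a constant when expert trajectories are scarce, decays at rate $N^{-1/2}$ once $N$ exceeds $|\mathcal{S}|$, and the decision horizon appears in neither case.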