f-Policy Gradients: A General Framework for Goal Conditioned RL using f-Divergences

Siddhant Agarwal¹

Ishan Durugkar²

Peter Stone^{1, 2}

Amy Zhang¹

¹The University of Texas at Austin

²Sony AI

Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS) 2023

Abstract

Goal-Conditioned Reinforcement Learning (RL) problems often have access to sparse rewards where the agent receives a reward signal only when it has achieved the goal, making policy optimization a difficult problem. Several works augment this sparse reward with a learned dense reward function, but this can lead to sub-optimal policies if the reward is misaligned. Moreover, recent works have demonstrated that effective shaping rewards for a particular problem can depend on the underlying learning algorithm. This paper introduces a novel way to encourage exploration called $f$-Policy Gradients, or $f$-PG. $f$-PG minimizes the $f$-divergence between the agent's state visitation distribution and the goal, which we show can lead to an optimal policy. We derive gradients for various $f$-divergences to optimize this objective. Our learning paradigm provides dense learning signals for exploration in sparse reward settings. We further introduce an entropy-regularized policy optimization objective, which we call $state$-MaxEnt RL (or $s$-MaxEnt RL), as a special case of our objective. We show that several metric-based shaping rewards like L2 can be used with $s$-MaxEnt RL, providing a common ground to study such metric-based shaping rewards with efficient exploration. We find that $f$-PG has better performance compared to standard policy gradient methods on a challenging gridworld as well as the Point Maze and FetchReach environments.

Introduction

Goal-conditioned reinforcement learning requires learning from sparse rewards. Prior works have augmented the sparse reward with a learned reward function, but this can lead to suboptimal policies if the learned reward is misaligned with the goal. Divergence minimization has been studied extensively in imitation learning, but its use in RL has been limited. Commonly used imitation learning methods that minimize some form of divergence between the agent's visitation distribution and an expert's visitation distribution construct a min-max objective as a lower bound on the divergence. Moreover, they use a discriminator to construct the reward function, which makes the reward non-stationary. We propose a novel framework, $f$-Policy Gradients or $f$-PG, that minimizes the $f$-divergence between the agent's state-visitation distribution and the goal distribution using analytical gradients. We also show that special cases of our objective optimize a reward (which can also be a metric-based shaping reward) along with the entropy of the state-visitation distribution, introducing the $state$-MaxEnt RL objective.

The $f$-PG objective

The agent learns by minimizing the following $f$-divergence:

$J(\theta) = D_f(p_\theta(s) \,\|\, p_g(s))$
where $p_\theta(s)$ is the agent's state-visitation distribution and $p_g(s)$ is the goal distribution. We can derive an analytical gradient for this objective that resembles the standard policy gradient.
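
The gradient is not reproduced on this page. As a rough sketch only, suppose a discrete state space, a finite horizon $T$, the convention $D_f(P \,\|\, Q) = \mathbb{E}_Q[f(P/Q)]$, and $p_\theta(s)$ defined as the time-averaged visitation $\frac{1}{T}\mathbb{E}_{\tau \sim p_\theta(\tau)}\big[\sum_{t} \mathbb{1}[s_t = s]\big]$. These are assumptions made here for illustration, and the paper's exact statement may differ. Under them, the chain rule and the log-derivative trick give an expression of the policy-gradient form, in which $-f'$ plays a role analogous to a per-state reward:

```latex
\begin{align*}
J(\theta) &= \sum_s p_g(s)\, f\!\left(\frac{p_\theta(s)}{p_g(s)}\right), \\
\nabla_\theta J(\theta) &= \sum_s f'\!\left(\frac{p_\theta(s)}{p_g(s)}\right) \nabla_\theta p_\theta(s), \\
\nabla_\theta p_\theta(s) &= \frac{1}{T}\, \mathbb{E}_{\tau \sim p_\theta(\tau)}\!\left[
  \Big(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\Big)
  \sum_{t'=1}^{T} \mathbb{1}[s_{t'} = s] \right], \\
\nabla_\theta J(\theta) &= \frac{1}{T}\, \mathbb{E}_{\tau \sim p_\theta(\tau)}\!\left[
  \Big(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\Big)
  \Big(\sum_{t'=1}^{T} f'\!\left(\frac{p_\theta(s_{t'})}{p_g(s_{t'})}\right)\Big) \right].
\end{align*}
```

The last line follows by substituting the third line into the second and summing the indicator over states, which turns the sum over $s$ into a sum over the states visited along the trajectory.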

$state$-MaxEnt RL

We present a lemma stating that, in the special case $f(u) = u\log{u}$, $f$-PG maximizes a reward of $\log{p_g(s)}$ along with the entropy of the state-visitation distribution. This differs from the commonly studied MaxEnt RL (which we will call $\pi$-MaxEnt RL), which maximizes the entropy of the policy.
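
The decomposition behind this special case can be checked numerically. The sketch below assumes the convention $D_f(P \,\|\, Q) = \mathbb{E}_Q[f(P/Q)]$ and uses two small hand-picked discrete distributions with full support; the distributions and function names are hypothetical and chosen only for illustration.

```python
import math

def f_divergence(f, p_theta, p_g):
    """D_f(p_theta || p_g) = sum_s p_g(s) * f(p_theta(s) / p_g(s))."""
    return sum(pg * f(pt / pg) for pt, pg in zip(p_theta, p_g))

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0.0)

def expected_log_goal(p_theta, p_g):
    """E_{s ~ p_theta}[log p_g(s)], i.e. the expected reward log p_g(s)."""
    return sum(pt * math.log(pg) for pt, pg in zip(p_theta, p_g))

# Hypothetical distributions over four states (both strictly positive).
p_theta = [0.50, 0.30, 0.15, 0.05]
p_g = [0.05, 0.05, 0.10, 0.80]

divergence = f_divergence(lambda u: u * math.log(u), p_theta, p_g)
s_maxent_objective = expected_log_goal(p_theta, p_g) + entropy(p_theta)

# Minimizing the divergence is the same as maximizing reward plus state entropy.
print(divergence, -s_maxent_objective)
```

The two printed values agree up to floating-point error, since $\sum_s p_\theta(s) \log\frac{p_\theta(s)}{p_g(s)} = -H(p_\theta) - \mathbb{E}_{p_\theta}[\log p_g(s)]$.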

Consider a gridworld where the agent's start and goal distributions are separated by a wall, so the agent must travel around the wall to reach the goal. The $\pi$-MaxEnt RL and $s$-MaxEnt RL agents explore differently, as shown below by the evolution of their state-visitation distributions.

$\pi$-MaxEnt RL

$s$-MaxEnt RL
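
To see why policy entropy and state-visitation entropy are different quantities, consider the following toy sketch. It is not the gridworld from the paper: it assumes a hypothetical 1-D chain of 20 states in which a walker starts at state 0 and moves right with a fixed probability `p_right`. A uniformly random walker has the highest possible policy entropy, yet a more deterministic walker can cover more of the chain within the same horizon.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0.0)

def state_visitation(n_states, horizon, p_right):
    """Time-averaged state visitation of a walker on a 1-D chain.

    The walker starts in state 0 and, in every state, moves right with
    probability p_right and left otherwise (moves past either end are clipped).
    """
    current = [0.0] * n_states
    current[0] = 1.0
    visits = [0.0] * n_states
    for _ in range(horizon):
        nxt = [0.0] * n_states
        for s, mass in enumerate(current):
            visits[s] += mass / horizon
            nxt[min(s + 1, n_states - 1)] += mass * p_right
            nxt[max(s - 1, 0)] += mass * (1.0 - p_right)
        current = nxt
    return visits

for p_right in (0.5, 0.9):
    policy_entropy = entropy([p_right, 1.0 - p_right])
    visitation_entropy = entropy(state_visitation(20, 20, p_right))
    print(f"p_right={p_right}: policy entropy={policy_entropy:.3f}, "
          f"state-visitation entropy={visitation_entropy:.3f}")
```

In this toy setting the walker with `p_right=0.5` has the larger policy entropy, while the walker with `p_right=0.9` has the larger state-visitation entropy, which is the quantity that $s$-MaxEnt RL rewards.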

Learning Signals

$f$-PG uses the learning signal $f'(\frac{p_\theta(s)}{p_g(s)})$ to weight the log probabilities of the policy. It is thus important to understand how this signal behaves in goal-conditioned RL settings. During the initial stages of training, the agent visits regions with very low $p_g$. For such states, the signal has a higher value than for states with lower $p_\theta$, i.e., the unexplored states. This is because, for any convex function $f$, $f'(x)$ is an increasing function, so minimizing $f'(\frac{p_\theta(s)}{p_g(s)})$ (recall that we are minimizing the $f$-divergence) implies minimizing $p_\theta(s)$ for the states with low $p_g(s)$. The only way to do this is to increase the entropy of the state-visitation distribution, which directly makes the agent explore new states. As long as there is no significant overlap between the two distributions, the signal pushes $p_\theta$ down toward a flatter distribution; once there is enough overlap with the goal distribution, it pulls the agent's visitation back toward the goal distribution.
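
The behavior described above can be illustrated with a small numerical sketch. It assumes the convention $D_f(P \,\|\, Q) = \mathbb{E}_Q[f(P/Q)]$, standard textbook generators for each divergence (the paper's exact scaling and its Forward/Reverse KL naming may differ), and a hypothetical four-state example in which $p_g$ is smoothed so that every ratio is finite.

```python
import math

# Derivatives f'(u) of common f-divergence generators (assumed forms).
F_PRIME = {
    "u*log(u)": lambda u: 1.0 + math.log(u),                 # KL(p_theta || p_g)
    "-log(u)": lambda u: -1.0 / u,                           # KL(p_g || p_theta)
    "jensen_shannon": lambda u: math.log(2.0 * u / (1.0 + u)),
    "chi_squared": lambda u: 2.0 * (u - 1.0),                # f(u) = (u - 1)^2
}

def learning_signal(f_prime, p_theta, p_g):
    """Per-state signal f'(p_theta(s) / p_g(s))."""
    return [f_prime(pt / pg) for pt, pg in zip(p_theta, p_g)]

# Hypothetical states: [start, visited, unexplored, goal].
p_theta = [0.60, 0.38, 0.01, 0.01]   # early in training: mass near the start
p_g = [0.01, 0.01, 0.01, 0.97]       # smoothed goal distribution

signals = {name: learning_signal(fp, p_theta, p_g) for name, fp in F_PRIME.items()}
for name, signal in signals.items():
    print(name, [round(x, 3) for x in signal])
```

For every generator in this sketch, the signal is largest at the heavily visited start state, smaller at the unexplored state, and smallest at the goal state. Because the objective is minimized, probability mass is pushed away from over-visited states and toward unexplored and goal states.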

$f'\big(\frac{p_\theta(s)}{p_g(s)}\big)$	Forward KL	Reverse KL	Jensen-Shannon	$\chi^2$
$p_\theta(s)$

Results

We compare $f$-PG with several prior methods that use distribution matching to provide shaped rewards, such as AIM, GAIL, and AIRL. We first perform experiments on a gridworld, followed by the Point Maze and FetchReach environments.

Gridworld

Here, the baselines are implemented on top of Soft Q-Learning, which is a $\pi$-MaxEnt RL algorithm.

$fkl$-PG	$rkl$-PG	AIM
GAIL	AIRL	FAIRL

Point Maze and FetchReach

In these experiments, the baselines are implemented on top of on-policy PPO to provide a fair comparison of sample complexity.

Acknowledgements

This work was in part supported by Cisco Research. Any opinions, findings and conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of Cisco Research.
This work has partially taken place in the Learning Agents Research Group (LARG) at UT Austin. LARG research is supported in part by NSF (FAIN-2019844, NRT-2125858), ONR (N00014-18-2243), ARO (E2061621), Bosch, Lockheed Martin, and UT Austin’s Good Systems grand challenge. Peter Stone serves as the Executive Director of Sony AI America and receives financial compensation for this work. The terms of this arrangement have been reviewed and approved by the University of Texas at Austin in accordance with its policy on objectivity in research.

@inproceedings{agarwal2023fpg,
  author = {Agarwal, Siddhant and Durugkar, Ishan and Stone, Peter and Zhang, Amy},
  booktitle = {Advances in Neural Information Processing Systems},
  title = {$f$ Policy Gradients: A General Framework for Goal Conditioned RL using $f$-Divergences},
  volume = {36},
  year = {2023}
}