Published on Tue May 29 2018

Supervised Policy Update for Deep Reinforcement Learning

Quan Vuong, Yiming Zhang, Keith W. Ross

Supervised Policy Update (SPU) is a new sample-efficient methodology for deep reinforcement learning. SPU formulates and solves a constrained optimization problem in the non-parameterized proximal policy space. In terms of sample efficiency, SPU outperforms TRPO in simulated robotic tasks.

Abstract

We propose a new sample-efficient methodology, called Supervised Policy Update (SPU), for deep reinforcement learning. Starting with data generated by the current policy, SPU formulates and solves a constrained optimization problem in the non-parameterized proximal policy space. Using supervised regression, it then converts the optimal non-parameterized policy to a parameterized policy, from which it draws new samples. The methodology is general in that it applies to both discrete and continuous action spaces, and can handle a wide variety of proximity constraints for the non-parameterized optimization problem. We show how the Natural Policy Gradient and Trust Region Policy Optimization (NPG/TRPO) problems, and the Proximal Policy Optimization (PPO) problem can be addressed by this methodology. The SPU implementation is much simpler than TRPO. In terms of sample efficiency, our extensive experiments show SPU outperforms TRPO in Mujoco simulated robotic tasks and outperforms PPO in Atari video game tasks.
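To make the two-step update described in the abstract concrete, here is a minimal sketch of one SPU-style iteration for discrete actions, written in PyTorch. The exponential-tilting form of the non-parameterized target, the sample-based approximation that tilts only the sampled action, and names such as `spu_update`, `lam`, and `epochs` are illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative sketch of one SPU-style iteration (discrete actions, PyTorch).
# Assumptions: advantages are already estimated; the non-parameterized target
# tilts the current policy toward high-advantage actions under a proximity
# constraint controlled by lam; the supervised step fits pi_theta to that
# target with a KL regression loss.
import torch
import torch.nn.functional as F


def spu_update(policy, optimizer, states, actions, advantages,
               lam=1.0, epochs=10):
    """One SPU-style update: build a non-parameterized target policy,
    then fit the parameterized policy to it by supervised regression."""
    with torch.no_grad():
        old_probs = F.softmax(policy(states), dim=-1)      # (N, num_actions)
        # Step 1: non-parameterized target (sample-based approximation):
        # up-weight each sampled action by exp(A / lam), then renormalize.
        tilt = torch.zeros_like(old_probs)
        tilt.scatter_(1, actions.unsqueeze(1), (advantages / lam).unsqueeze(1))
        target_probs = old_probs * torch.exp(tilt)
        target_probs = target_probs / target_probs.sum(dim=-1, keepdim=True)

    # Step 2: supervised regression of pi_theta onto the target policy.
    for _ in range(epochs):
        log_probs = F.log_softmax(policy(states), dim=-1)
        loss = F.kl_div(log_probs, target_probs, reduction="batchmean")
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

Because step 2 is ordinary supervised regression, it can be run for several epochs on the same batch, which is where the claimed sample efficiency comes from.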

Thu Jul 20 2017
Machine Learning
Proximal Policy Optimization Algorithms
We propose a new family of policy gradient methods for reinforcement learning. They alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective function using stochastic gradient ascent. We show that PPO outperforms other online policy gradient methods.
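For reference, the clipped surrogate objective at the heart of PPO can be sketched in a few lines of PyTorch. The variable names and the 0.2 clipping coefficient below are common conventions, not taken from the abstract above.

```python
import torch


def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """PPO clipped surrogate objective, returned as a loss to minimize."""
    ratio = torch.exp(log_probs_new - log_probs_old)   # pi_theta / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```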
Tue Oct 27 2020
Artificial Intelligence
Batch Reinforcement Learning with a Nonparametric Off-Policy Policy Gradient
Off-policy Reinforcement Learning (RL) holds the promise of better data efficiency. Current off-policy policy gradient methods either suffer from high bias or high variance, delivering often unreliable estimates.
Tue Oct 23 2018
Artificial Intelligence
Hierarchical Approaches for Reinforcement Learning in Parameterized Action Space
We explore Deep Reinforcement Learning in a parameterized action space. We propose a new compact architecture for the tasks. We also propose two new methods based on the state-of-the-art algorithms.
Tue Jun 16 2020
Machine Learning
AWAC: Accelerating Online Reinforcement Learning with Offline Datasets
Reinforcement learning (RL) provides an appealing formalism for learning policies from experience. However, it remains exceptionally difficult to train a policy with offline data and improve it further with online RL. We propose an algorithm that combines sample-efficient dynamic programming with maximum likelihood policy updates.
Tue Apr 17 2018
Artificial Intelligence
An Adaptive Clipping Approach for Proximal Policy Optimization
A new algorithm, known as PPO-λ, repeatedly optimizes policies based on a theoretical target for adaptive policy improvement. Destructively large policy updates can be effectively prevented through both clipping and adaptive control of a hyperparameter.
Fri Feb 12 2021
Machine Learning
Q-Value Weighted Regression: Reinforcement Learning with Limited Data
QWR is an extension of Advantage Weighted Regression (AWR), an off-policy actor-critic algorithm. AWR performs very well on continuous control tasks, but has low sample efficiency and struggles with high-dimensional observation spaces. We show that QWR matches the state-of-the-art.