We consider reinforcement learning (RL) in continuous time and study the
problem of achieving the best trade-off between exploration of a black box
environment and exploitation of current knowledge. We propose an
entropy-regularized reward function involving the differential entropy of the
distributions of actions, and motivate and devise an exploratory formulation
for the state dynamics that captures repetitive learning under exploration.
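Schematically (in one dimension, and with notation that is illustrative rather
than the exact statements in the body of the paper), a distributional control
$\pi_t$ over actions $u\in U$ turns the classical dynamics
$dx_t=b(x_t,u_t)\,dt+\sigma(x_t,u_t)\,dW_t$ into the exploratory dynamics
\[
dX_t=\Big(\int_U b(X_t,u)\,\pi_t(u)\,du\Big)\,dt
+\Big(\int_U \sigma^2(X_t,u)\,\pi_t(u)\,du\Big)^{1/2}\,dW_t,
\]
and the entropy-regularized objective is
\[
\sup_{\pi}\ \mathbb{E}\left[\int_0^\infty e^{-\rho t}
\Big(\int_U r(X_t,u)\,\pi_t(u)\,du+\lambda\,\mathcal{H}(\pi_t)\Big)\,dt\right],
\qquad \mathcal{H}(\pi)=-\int_U \pi(u)\ln\pi(u)\,du,
\]
where $\lambda>0$ is the exploration weight and $\rho>0$ the discount rate.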
The resulting optimization problem is a revitalization of the classical relaxed
stochastic control. We carry out a complete analysis of the problem in the
linear--quadratic (LQ) setting and deduce that the optimal feedback control
distribution for balancing exploitation and exploration is Gaussian. This in
turn interprets and justifies the widely adopted Gaussian exploration in RL,
beyond its simplicity for sampling. Moreover, exploitation and exploration are
captured, respectively and mutually exclusively, by the mean and the variance
of the Gaussian distribution. We also find that a more random environment
contains more learning opportunities, in the sense that less exploration is
needed.
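Schematically, with $k_1$, $k_2$, and $k_3>0$ as placeholder constants
determined by the LQ model coefficients (not the exact expressions derived in
the paper), the optimal feedback distribution takes the form
\[
\pi^{*}(\cdot\mid x)=\mathcal{N}\!\Big(k_1 x+k_2,\ \frac{\lambda}{2k_3}\Big),
\]
where the mean $k_1 x+k_2$ encodes the exploitation and the state-independent
variance $\lambda/(2k_3)$ encodes the exploration; $k_3$ grows with the
volatility of the environment, so a noisier environment prescribes a smaller
exploration variance.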
We characterize the cost of exploration, which, for the LQ case, is shown to
be proportional to the entropy regularization weight and inversely
proportional to the discount rate.
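In particular (our paraphrase of the paper's LQ result, in the notation
above), the cost of exploration, defined roughly as the gap between the
classical optimal value and the value achieved by the exploratory policy once
the entropy bonus is deducted, admits the closed form
\[
\frac{\lambda}{2\rho},
\]
which is independent of the state, linear in the exploration weight $\lambda$,
and inversely proportional to the discount rate $\rho$.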
Finally, as the exploration weight decays to zero, we prove that the solution
of the entropy-regularized LQ problem converges to that of the classical LQ
problem.
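This convergence is already visible in the schematic form above: as
$\lambda\to 0$, the variance $\lambda/(2k_3)$ vanishes, the Gaussian
$\pi^{*}(\cdot\mid x)$ collapses to a point mass at the classical LQ feedback
$k_1 x+k_2$, and the exploration cost $\lambda/(2\rho)$ tends to zero.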