Published on Wed Mar 17 2021

Infinite-Horizon Offline Reinforcement Learning with Linear Function Approximation: Curse of Dimensionality and Algorithm

Lin Chen, Bruno Scherrer, Peter L. Bartlett
Abstract

In this paper, we investigate the sample complexity of policy evaluation in infinite-horizon offline reinforcement learning (also known as the off-policy evaluation problem) with linear function approximation. We identify a hard regime $d\gamma^{2}>1$, where $d$ is the dimension of the feature vector and $\gamma$ is the discount rate. In this regime, for any $q\in[\gamma^{2},1]$, we can construct a hard instance such that the smallest eigenvalue of its feature covariance matrix is $q/d$ and it requires $\Omega\left(\frac{d}{\gamma^{2}\left(q-\gamma^{2}\right)\varepsilon^{2}}\exp\left(\Theta\left(d\gamma^{2}\right)\right)\right)$ samples to approximate the value function up to an additive error $\varepsilon$. Note that this sample-complexity lower bound is exponential in $d$. If $q=\gamma^{2}$, even infinite data cannot suffice. Under the low distribution shift assumption, we show that there is an algorithm that needs at most $O\left(\max\left\{ \frac{\left\Vert \theta^{\pi}\right\Vert _{2}^{4}}{\varepsilon^{4}}\log\frac{d}{\delta},\frac{1}{\varepsilon^{2}}\left(d+\log\frac{1}{\delta}\right)\right\} \right)$ samples ($\theta^{\pi}$ is the parameter of the policy in linear function approximation) and guarantees approximation of the value function up to an additive error of $\varepsilon$ with probability at least $1-\delta$.
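To make the problem setting concrete, the sketch below shows a generic least-squares temporal-difference (LSTD) style estimator for off-policy evaluation with linear function approximation, i.e., fitting $\theta^{\pi}$ such that $V^{\pi}(s)\approx\phi(s)^{\top}\theta^{\pi}$. This is only an illustrative baseline under the assumption that the offline data consists of transitions $(s, r, s')$ generated by the target policy from some offline state distribution; it is not the algorithm analyzed in the paper, and the function name and ridge parameter are illustrative.

```python
import numpy as np

def lstd_policy_evaluation(phi, phi_next, rewards, gamma, reg=1e-6):
    """Estimate theta such that V^pi(s) ~ phi(s)^T theta (illustrative sketch).

    phi:      (n, d) feature matrix of sampled states s_i
    phi_next: (n, d) feature matrix of next states s_i' under the target policy
    rewards:  (n,)   observed rewards r_i
    gamma:    discount factor in [0, 1)
    reg:      small ridge term for numerical stability (illustrative choice)
    """
    n, d = phi.shape
    # A = (1/n) * sum_i phi(s_i) (phi(s_i) - gamma * phi(s_i'))^T
    A = phi.T @ (phi - gamma * phi_next) / n
    # b = (1/n) * sum_i phi(s_i) * r_i
    b = phi.T @ rewards / n
    # Solve the (regularized) linear system A theta = b
    return np.linalg.solve(A + reg * np.eye(d), b)

# Hypothetical usage with synthetic data
rng = np.random.default_rng(0)
n, d, gamma = 1000, 5, 0.9
phi = rng.normal(size=(n, d))
phi_next = rng.normal(size=(n, d))
rewards = rng.normal(size=n)
theta_hat = lstd_policy_evaluation(phi, phi_next, rewards, gamma)
print(theta_hat)  # estimated parameter; V^pi(s) is approximated by phi(s) @ theta_hat
```

The quality of such an estimate depends on how well the offline feature covariance matrix is conditioned, which is exactly the quantity (its smallest eigenvalue) that the hard instances in the paper control.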