Published on Sun Jul 22 2012

Meta-Learning of Exploration/Exploitation Strategies: The Multi-Armed Bandit Case

Francis Maes, Damien Ernst, Louis Wehenkel

Abstract

The exploration/exploitation (E/E) dilemma arises naturally in many subfields of Science. Multi-armed bandit problems formalize this dilemma in its canonical form. Most current research in this field focuses on generic solutions that can be applied to a wide range of problems. However, in practice, it is often the case that a form of prior information is available about the specific class of target problems. Prior knowledge is rarely used in current solutions due to the lack of a systematic approach to incorporate it into the E/E strategy. To address a specific class of E/E problems, we propose to proceed in three steps: (i) model prior knowledge in the form of a probability distribution over the target class of E/E problems; (ii) choose a large hypothesis space of candidate E/E strategies; and (iii) solve an optimization problem to find a candidate E/E strategy of maximal average performance over a sample of problems drawn from the prior distribution. We illustrate this meta-learning approach with two different hypothesis spaces: one where E/E strategies are numerically parameterized and another where E/E strategies are represented as small symbolic formulas. We propose appropriate optimization algorithms for both cases. Our experiments, with two-armed Bernoulli bandit problems and various playing budgets, show that the meta-learnt E/E strategies outperform generic strategies from the literature (UCB1, UCB1-Tuned, UCB-V, KL-UCB and ε-greedy); they also evaluate the robustness of the learnt E/E strategies through tests carried out on arms whose rewards follow a truncated Gaussian distribution.
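To make the three-step recipe concrete, below is a minimal sketch on two-armed Bernoulli bandits. It uses a deliberately small hypothesis space of UCB-like index strategies with a single numeric parameter c, a uniform prior over arm means, and plain grid search as the optimizer; these choices, along with all function names, are illustrative assumptions, not the hypothesis spaces or optimization algorithms studied in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)


def play_bandit(strategy, true_means, budget):
    """Play a two-armed Bernoulli bandit for `budget` steps; return the total reward."""
    counts = np.zeros(2)   # number of pulls per arm
    sums = np.zeros(2)     # sum of observed rewards per arm
    total = 0.0
    for t in range(budget):
        arm = strategy(counts, sums, t)
        reward = float(rng.random() < true_means[arm])
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    return total


def make_index_strategy(c):
    """A numerically parameterized, UCB-like index strategy (toy hypothesis space)."""
    def strategy(counts, sums, t):
        if 0 in counts:                      # play each arm once first
            return int(np.argmin(counts))
        emp_means = sums / counts
        index = emp_means + c * np.sqrt(np.log(t + 1) / counts)
        return int(np.argmax(index))
    return strategy


# Step (i): prior knowledge as a distribution over problems (arm means uniform in [0, 1]).
# Step (ii): hypothesis space of candidate strategies, indexed by the parameter c.
# Step (iii): keep the candidate with the best average reward on problems sampled from the prior.
budget, n_problems = 100, 200
problems = rng.random((n_problems, 2))
candidates = np.linspace(0.0, 2.0, 11)
scores = [
    np.mean([play_bandit(make_index_strategy(c), p, budget) for p in problems])
    for c in candidates
]
best_c = candidates[int(np.argmax(scores))]
print(f"best c = {best_c:.2f}, mean reward over the sample = {max(scores):.2f}")
```

The learnt parameter is tuned to the prior and the playing budget, which is the effect the paper exploits with its richer numerically parameterized and formula-based strategy spaces.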

Sun Sep 20 2020
Artificial Intelligence
Regret Bounds and Reinforcement Learning Exploration of EXP-based Algorithms
EXP-based algorithms are often used for exploration in multi-armed bandit games. We establish both lower and upper regret bounds in the Gaussian multi-armed bandit setting. Unlike classical regret analyses, our analyses do not require bounded rewards.
Wed Apr 12 2017
Artificial Intelligence
Value Directed Exploration in Multi-Armed Bandits with Structured Priors
Multi-armed bandits are a quintessential machine learning problem requiring a balance of exploration and exploitation. We introduce linearly separable value functions that take both the expected return and the benefit of exploration into consideration. The algorithm enjoys sub-linear performance with a strong theoretical guarantee.
Tue Jul 02 2019
Machine Learning
Exploration Through Reward Biasing: Reward-Biased Maximum Likelihood Estimation for Stochastic Multi-Armed Bandits
RBMLE is a novel family of learning algorithms for stochastic multi-armed bandits (SMABs). We prove that RBMLE attains order-optimality by adaptively estimating the unknown constants in the expression of the regret bound.
Sun May 29 2016
Machine Learning
On Explore-Then-Commit Strategies
We study the problem of minimising regret in two-armed bandit problems with Gaussian rewards. Our objective is to use this simple setting to illustrate that strategies based on an exploration phase (up to a stopping time) followed by exploitation are necessarily suboptimal.
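As a contrast with the meta-learnt strategies above, the sketch below shows the structure that this paper argues is necessarily suboptimal: an exploration phase followed by commitment, on a two-armed Gaussian bandit. The fixed exploration length m, the function name and the parameter values are illustrative assumptions (the paper allows the exploration phase to end at a stopping time).

```python
import numpy as np

rng = np.random.default_rng(1)


def explore_then_commit(means, sigma, m, horizon):
    """Pull each arm m times, then commit to the empirically better arm."""
    # Exploration phase: alternate between the two arms.
    arms = [t % 2 for t in range(2 * m)]
    rewards = [rng.normal(means[a], sigma) for a in arms]
    emp = [np.mean([r for a, r in zip(arms, rewards) if a == k]) for k in (0, 1)]
    # Exploitation phase: commit to the arm with the higher empirical mean.
    best = int(emp[1] > emp[0])
    rewards += [rng.normal(means[best], sigma) for _ in range(horizon - 2 * m)]
    return sum(rewards)


# Per the abstract, strategies of this explore-then-commit form are necessarily
# suboptimal compared to fully sequential strategies in this setting.
print(explore_then_commit(means=(0.5, 0.6), sigma=1.0, m=25, horizon=1000))
```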
Wed Jun 10 2015
Machine Learning
Explore no more: Improved high-probability regret bounds for non-stochastic bandits
This work addresses the problem of regret minimization in non-stochastic multi-armed bandit problems. We focus on performance guarantees that hold with high probability. Such results are rather scarce in the literature.
Tue Jan 12 2016
Machine Learning
Infomax strategies for an optimal balance between exploration and exploitation
The Infomax principle postulates that the maximization of information directs the function of diverse systems. However, the validity of information as a proxy for reward remains unclear, and the highest mean reward considered by Info-p is not the quantity actually needed.