Published on Tue Mar 13 2018

Thompson Sampling for Combinatorial Semi-Bandits

Siwei Wang, Wei Chen

We study the application of the Thompson sampling (TS) methodology to the combinatorial multi-armed bandit (CMAB) framework. We show that one cannot directly replace the exact offline oracle with an approximation oracle in the TS algorithm, even for the classical MAB problem.

Abstract

We study the application of the Thompson sampling (TS) methodology to the stochastic combinatorial multi-armed bandit (CMAB) framework. We analyze the standard TS algorithm for the general CMAB and obtain the first distribution-dependent regret bound for this setting, which grows logarithmically in the time horizon and depends on the number of arms, the size of the largest super arm, and the minimum gap between the expected reward of the optimal solution and that of any non-optimal solution. We also show that one cannot directly replace the exact offline oracle with an approximation oracle in the TS algorithm, even for the classical MAB problem. We then extend the analysis to two special cases: the linear reward case and the matroid bandit case. When the reward function is linear, the TS algorithm achieves a tighter regret bound. For matroid bandits, we remove the independence assumption across arms and achieve a regret upper bound that matches the lower bound for the matroid case. Finally, we present experiments comparing the regret of TS with that of existing algorithms such as CUCB and ESCB.
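To make the setting concrete, here is a minimal sketch of Thompson sampling for a combinatorial semi-bandit with Bernoulli base arms and a linear (top-k) reward, assuming an exact offline oracle. The top_k_oracle below is a hypothetical stand-in for a problem-specific solver, and the Beta-posterior bookkeeping is an illustrative assumption rather than the paper's exact algorithm.

import numpy as np

def top_k_oracle(theta, k):
    # Exact offline oracle for the top-k problem: return the indices
    # of the k base arms with the largest sampled means.
    return np.argsort(theta)[-k:]

def ts_semibandit(true_means, k, horizon, rng=None):
    rng = rng or np.random.default_rng()
    m = len(true_means)
    alpha = np.ones(m)   # Beta posterior: successes + 1 per base arm
    beta = np.ones(m)    # Beta posterior: failures + 1 per base arm
    total_reward = 0.0
    for _ in range(horizon):
        theta = rng.beta(alpha, beta)            # sample a mean for every base arm
        super_arm = top_k_oracle(theta, k)       # call the offline oracle on the sample
        outcomes = (rng.random(m) < true_means)  # Bernoulli outcomes of all base arms
        # Semi-bandit feedback: only base arms in the played super arm are observed.
        alpha[super_arm] += outcomes[super_arm]
        beta[super_arm] += 1 - outcomes[super_arm]
        total_reward += outcomes[super_arm].sum()
    return total_reward

Swapping top_k_oracle for an approximation oracle in this loop is the substitution the abstract cautions against, even in the classical single-play MAB case.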

Mon Apr 20 2020
Machine Learning
Thompson Sampling for Linearly Constrained Bandits
The objective is to maximize the cumulative reward under a probabilistic linear constraint. For a few real-world instances of this problem, constrained extensions of the well-known Thompson Sampling (TS) heuristic have recently been proposed.
Wed Nov 01 2017
Artificial Intelligence
Minimal Exploration in Structured Stochastic Bandits
This paper introduces and addresses a wide class of stochastic bandit problems. We derive an asymptotic instance-specific regret lower bound for these problems. We then develop OSSB, an algorithm whose regret matches this limit.
Thu Jun 03 2021
Machine Learning
A Closer Look at the Worst-case Behavior of Multi-armed Bandit Algorithms
The Upper Confidence Bound (UCB) policy is among the simplest optimism-based MAB algorithms. It achieves optimal O(log n) regret on instances with "large" gaps, and near-optimal O(sqrt(n log n)) minimax regret when the gap is "small".
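As a point of reference for that claim, below is a minimal sketch of the classical UCB1 index policy for Bernoulli arms: each arm's index is its empirical mean plus a confidence radius of order sqrt(2 log t / n_i), and the arm with the largest index is played. This is a generic textbook-style implementation offered for orientation, not code from the cited paper.

import math
import random

def ucb1(true_means, horizon, seed=0):
    rng = random.Random(seed)
    m = len(true_means)
    counts = [0] * m     # number of pulls of each arm
    sums = [0.0] * m     # cumulative reward of each arm
    for t in range(1, horizon + 1):
        if t <= m:
            arm = t - 1  # pull every arm once to initialize
        else:
            # optimism: empirical mean plus confidence radius
            arm = max(range(m), key=lambda i: sums[i] / counts[i]
                      + math.sqrt(2.0 * math.log(t) / counts[i]))
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts, sums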
Thu May 14 2020
Machine Learning
Thompson Sampling for Combinatorial Semi-bandits with Sleeping Arms and Long-Term Fairness Constraints
We study the combinatorial sleeping multi-armed semi-bandit problem with long-term fairness constraints. We adopt Thompson Sampling (TS) to maximize the total rewards and use virtual queue techniques to handle the fairness constraints. We prove that TSCSF-B satisfies the fairness constraints.
Thu Jul 31 2014
Machine Learning
Combinatorial Multi-Armed Bandit and Its Extension to Probabilistically Triggered Arms
We define a general framework for a large class of combinatorial multi-armed bandit (CMAB) problems. In each round, a super arm is played and the outcomes of the base arms contained in it are observed. We provide the CUCB algorithm, which achieves a distribution-dependent regret bound.
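For contrast with the Thompson-sampling sketch above, here is a minimal CUCB-style sketch in the same hypothetical top-k setting: instead of sampling from a posterior, each base arm gets an upper confidence bound, and the offline oracle is called on those UCB values. The confidence-radius constant and the top-k oracle are illustrative assumptions, not the cited paper's exact specification.

import numpy as np

def cucb_top_k(true_means, k, horizon, rng=None):
    rng = rng or np.random.default_rng()
    m = len(true_means)
    counts = np.zeros(m)
    means = np.zeros(m)
    for t in range(1, horizon + 1):
        # Upper confidence bounds; unplayed base arms get an infinite index.
        with np.errstate(divide="ignore", invalid="ignore"):
            ucb = np.where(counts > 0,
                           means + np.sqrt(1.5 * np.log(t) / counts),
                           np.inf)
        super_arm = np.argsort(ucb)[-k:]                 # oracle on the UCB values
        outcomes = (rng.random(m) < true_means).astype(float)
        # Semi-bandit feedback: update only the base arms in the played super arm.
        counts[super_arm] += 1
        means[super_arm] += (outcomes[super_arm] - means[super_arm]) / counts[super_arm]
    return means, counts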
Fri Oct 11 2019
Machine Learning
Old Dog Learns New Tricks: Randomized UCB for Bandit Problems
We propose a bandit strategy that builds on theoretically derived confidence intervals similar to those of upper confidence bound (UCB) algorithms, but uses randomization to trade off exploration and exploitation. In the K-armed bandit setting, we show that there are infinitely many variants of the strategy.