Multi-Armed Bandits
196 papers with code • 1 benchmark • 2 datasets
Multi-armed bandits refers to a task where a fixed, limited amount of resources must be allocated among competing choices (arms) so as to maximize expected gain, when each choice's reward distribution is only partially known. Typically these problems involve an exploration/exploitation trade-off: gathering information about uncertain arms versus playing the arm that currently looks best.
(Image credit: Microsoft Research)
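The exploration/exploitation trade-off can be illustrated with a minimal epsilon-greedy sketch on Bernoulli arms. This is an illustrative toy, not code from any of the papers below; the function name and arm means are made up for the example.

```python
import random

def epsilon_greedy(true_means, epsilon=0.1, horizon=2000, seed=0):
    """Epsilon-greedy bandit: with probability epsilon pick a random arm
    (explore), otherwise pick the arm with the best estimated mean (exploit)."""
    rng = random.Random(seed)
    k = len(true_means)
    counts = [0] * k        # pulls per arm
    estimates = [0.0] * k   # running mean reward per arm
    total_reward = 0.0
    for _ in range(horizon):
        if rng.random() < epsilon:
            arm = rng.randrange(k)  # explore
        else:
            arm = max(range(k), key=lambda a: estimates[a])  # exploit
        reward = 1.0 if rng.random() < true_means[arm] else 0.0  # Bernoulli pull
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]  # incremental mean
        total_reward += reward
    return estimates, total_reward
```

With a long enough horizon, the estimates concentrate around the true arm means and play concentrates on the best arm, while the fixed epsilon keeps a floor of exploration.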
Libraries
Use these libraries to find Multi-Armed Bandits models and implementations

Most implemented papers
Quantile Bandits for Best Arms Identification
We consider a variant of the best arm identification task in stochastic multi-armed bandits.
Inverse Contextual Bandits: Learning How Behavior Evolves over Time
Understanding a decision-maker's priorities by observing their behavior is critical for transparency and accountability in decision processes, such as in healthcare.
Empirical analysis of representation learning and exploration in neural kernel bandits
We consider policies based on a GP and a Student's t-process (TP).
Doubly Robust Off-Policy Evaluation for Ranking Policies under the Cascade Behavior Model
We show that the proposed estimator is unbiased in more cases compared to existing estimators that make stronger assumptions.
Truncated LinUCB for Stochastic Linear Bandits
This paper considers contextual bandits with a finite number of arms, where the contexts are independent and identically distributed $d$-dimensional random vectors, and the expected rewards are linear in both the arm parameters and contexts.
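The setting described above (linear expected rewards in the context features) is the one LinUCB targets. Below is a minimal LinUCB sketch under my own assumptions (arm features stacked per round, a known `theta_star` only for simulating rewards), not the paper's truncated variant:

```python
import numpy as np

def linucb(contexts, theta_star, alpha=1.0, noise=0.1, seed=0):
    """LinUCB for stochastic linear bandits: keep a ridge-regression estimate
    of theta and play the arm maximizing (estimate + exploration bonus).
    contexts: list of (n_arms, d) feature matrices, one per round.
    theta_star is used here only to simulate noisy rewards."""
    rng = np.random.default_rng(seed)
    d = len(theta_star)
    A = np.eye(d)        # ridge-regularized Gram matrix
    b = np.zeros(d)
    chosen = []
    for X in contexts:
        A_inv = np.linalg.inv(A)
        theta_hat = A_inv @ b
        # UCB score per arm: x.theta_hat + alpha * sqrt(x^T A^{-1} x)
        bonus = np.sqrt(np.einsum("ij,jk,ik->i", X, A_inv, X))
        arm = int(np.argmax(X @ theta_hat + alpha * bonus))
        x = X[arm]
        r = x @ theta_star + noise * rng.standard_normal()  # simulated reward
        A += np.outer(x, x)
        b += r * x
        chosen.append(arm)
    return chosen, np.linalg.inv(A) @ b
```

The exploration bonus shrinks in directions the algorithm has already sampled, so uncertainty-driven exploration fades as the estimate sharpens.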
Kernel Conditional Moment Constraints for Confounding Robust Inference
It can be shown that our estimator contains the recently proposed sharp estimator by Dorn and Guo (2022) as a special case, and our method enables a novel extension of the classical marginal sensitivity model using f-divergence.
Doubly Robust Policy Evaluation and Learning
The key challenge is that the past data typically does not faithfully represent proportions of actions taken by a new policy.
Thompson Sampling for Contextual Bandits with Linear Payoffs
Thompson Sampling is one of the oldest heuristics for multi-armed bandit problems.
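The heuristic is simple to state: maintain a posterior over each arm's mean, sample from each posterior, and play the arm with the highest sample. A minimal Beta-Bernoulli sketch (my own toy setup, not the paper's experiments):

```python
import random

def thompson_sampling(true_means, horizon=2000, seed=0):
    """Beta-Bernoulli Thompson Sampling: draw a plausible mean for each arm
    from its Beta posterior and play the arm with the highest draw."""
    rng = random.Random(seed)
    k = len(true_means)
    successes = [1] * k   # Beta(1, 1) uniform prior
    failures = [1] * k
    pulls = [0] * k
    for _ in range(horizon):
        samples = [rng.betavariate(successes[a], failures[a]) for a in range(k)]
        arm = samples.index(max(samples))
        if rng.random() < true_means[arm]:   # Bernoulli reward
            successes[arm] += 1
        else:
            failures[arm] += 1
        pulls[arm] += 1
    return pulls
```

Exploration emerges automatically: arms with wide posteriors occasionally produce the top sample, and posteriors narrow as evidence accumulates.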
Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits
We present a new algorithm for the contextual bandit learning problem, where the learner repeatedly takes one of $K$ actions in response to the observed context, and observes the reward only for that chosen action.
Regulating Greed Over Time in Multi-Armed Bandits
In the corrected methods, exploitation (greed) is regulated over time, so that more exploitation occurs during higher reward periods, and more exploration occurs in periods of low reward.
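A toy illustration of that idea, and only of the idea (this is not the paper's algorithm): tie the exploration rate to a running average of recent rewards, so high-reward periods produce more greed and low-reward periods more exploration.

```python
import random

def regulated_greed(true_means, horizon=2000, seed=0):
    """Toy reward-modulated epsilon-greedy (illustrative, not the paper's
    method): epsilon shrinks when recent rewards are high."""
    rng = random.Random(seed)
    k = len(true_means)
    counts, estimates = [0] * k, [0.0] * k
    recent = 0.5          # exponential moving average of recent rewards
    history = []
    for _ in range(horizon):
        epsilon = 0.3 * (1.0 - recent)  # high recent reward -> less exploration
        if rng.random() < epsilon:
            arm = rng.randrange(k)
        else:
            arm = max(range(k), key=lambda a: estimates[a])
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        recent = 0.95 * recent + 0.05 * reward
        history.append(arm)
    return estimates, history
```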