Multi-Armed Bandits

196 papers with code • 1 benchmark • 2 datasets

Multi-armed bandits refer to a task in which a fixed, limited amount of resources must be allocated between competing choices in a way that maximizes expected gain. These problems typically involve an exploration/exploitation trade-off.

(Image credit: Microsoft Research)
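
To make the exploration/exploitation trade-off concrete, here is a minimal sketch of the classic UCB1 strategy on simulated Bernoulli arms; the arm success probabilities and horizon below are made-up values for illustration only.

```python
import math
import random

def ucb1(arm_probs, horizon=10_000):
    """Minimal UCB1 on simulated Bernoulli arms (illustrative sketch)."""
    n_arms = len(arm_probs)
    counts = [0] * n_arms          # pulls per arm
    rewards = [0.0] * n_arms       # cumulative reward per arm

    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1            # pull each arm once to initialize
        else:
            # exploit the empirical mean, explore via the confidence bonus
            arm = max(
                range(n_arms),
                key=lambda a: rewards[a] / counts[a]
                + math.sqrt(2 * math.log(t) / counts[a]),
            )
        reward = 1.0 if random.random() < arm_probs[arm] else 0.0
        counts[arm] += 1
        rewards[arm] += reward

    return counts, rewards

counts, rewards = ucb1([0.2, 0.5, 0.7])
print(counts)  # most pulls should concentrate on the 0.7 arm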

Most implemented papers

Quantile Bandits for Best Arms Identification

Mengyanz/QSAR 22 Oct 2020

We consider a variant of the best arm identification task in stochastic multi-armed bandits.
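
For readers unfamiliar with the setting, here is a hedged sketch of a standard fixed-budget best-arm identification routine (successive halving on empirical means). The paper's variant instead ranks arms by a quantile of the reward distribution, which this toy code does not implement; the arm probabilities and budget are assumptions for the example.

```python
import random

def successive_halving(pull, n_arms, budget=10_000):
    """Fixed-budget best-arm identification by successive halving (toy sketch)."""
    arms = list(range(n_arms))
    rounds = max(1, (n_arms - 1).bit_length())    # roughly log2(n_arms) elimination rounds
    per_round = budget // rounds
    means = {a: 0.0 for a in arms}

    while len(arms) > 1:
        pulls = max(1, per_round // len(arms))
        for a in arms:
            means[a] = sum(pull(a) for _ in range(pulls)) / pulls
        arms.sort(key=lambda a: means[a], reverse=True)
        arms = arms[: max(1, len(arms) // 2)]     # keep the better half
    return arms[0]

# hypothetical Bernoulli arms
probs = [0.3, 0.5, 0.45, 0.7, 0.6]
best = successive_halving(lambda a: 1.0 if random.random() < probs[a] else 0.0, len(probs))
print("identified best arm:", best)
```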

Inverse Contextual Bandits: Learning How Behavior Evolves over Time

alihanhyk/invconban 13 Jul 2021

Understanding a decision-maker's priorities by observing their behavior is critical for transparency and accountability in decision processes, such as in healthcare.

Doubly Robust Off-Policy Evaluation for Ranking Policies under the Cascade Behavior Model

st-tech/zr-obp 3 Feb 2022

We show that the proposed estimator is unbiased in more cases compared to existing estimators that make stronger assumptions.

Truncated LinUCB for Stochastic Linear Bandits

simonzhou86/tr_linucb 23 Feb 2022

This paper considers contextual bandits with a finite number of arms, where the contexts are independent and identically distributed $d$-dimensional random vectors, and the expected rewards are linear in both the arm parameters and contexts.
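
As context for the linear-reward setting described in this abstract, below is a hedged sketch of the standard (untruncated) disjoint-parameter LinUCB update; the dimensions, regularizer, and exploration weight are illustrative, and the paper's truncation step is not shown.

```python
import numpy as np

class LinUCB:
    """Plain disjoint-parameter LinUCB (illustrative sketch, no truncation)."""

    def __init__(self, n_arms, dim, alpha=1.0, reg=1.0):
        self.alpha = alpha
        self.A = [reg * np.eye(dim) for _ in range(n_arms)]   # per-arm design matrices
        self.b = [np.zeros(dim) for _ in range(n_arms)]       # per-arm reward sums

    def select(self, context):
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                                   # ridge estimate of arm parameters
            bonus = self.alpha * np.sqrt(context @ A_inv @ context)
            scores.append(theta @ context + bonus)              # optimism in the face of uncertainty
        return int(np.argmax(scores))

    def update(self, arm, context, reward):
        self.A[arm] += np.outer(context, context)
        self.b[arm] += reward * context

# toy usage with random contexts and hypothetical true arm parameters
rng = np.random.default_rng(0)
true_theta = rng.normal(size=(3, 5))
bandit = LinUCB(n_arms=3, dim=5)
for _ in range(1000):
    x = rng.normal(size=5)
    a = bandit.select(x)
    bandit.update(a, x, true_theta[a] @ x + 0.1 * rng.normal())
```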

Kernel Conditional Moment Constraints for Confounding Robust Inference

kstoneriv3/confounding-robust-inference-old 26 Feb 2023

It can be shown that our estimator contains the recently proposed sharp estimator by Dorn and Guo (2022) as a special case, and our method enables a novel extension of the classical marginal sensitivity model using f-divergence.

Doubly Robust Policy Evaluation and Learning

leoguelman/BLBF 23 Mar 2011

The key challenge is that the past data typically does not faithfully represent proportions of actions taken by a new policy.
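
To make the off-policy evaluation problem concrete, here is a hedged sketch of a doubly robust value estimate for logged bandit data, combining a reward model with an inverse-propensity correction; the array shapes and the fitted reward model are assumptions for the example.

```python
import numpy as np

def doubly_robust_value(target_probs, logged_probs, logged_actions,
                        logged_rewards, reward_model):
    """Doubly robust estimate of a new policy's value from logged bandit data (sketch).

    target_probs:   (n, K) action probabilities under the policy being evaluated
    logged_probs:   (n,)   propensity of the logged action under the logging policy
    logged_actions: (n,)   integer actions actually taken
    logged_rewards: (n,)   rewards actually observed
    reward_model:   (n, K) predicted reward for every context/action pair
    """
    n = len(logged_rewards)
    # model-based term: expected reward of the target policy under the reward model
    direct = np.sum(target_probs * reward_model, axis=1)
    # importance-weighted correction using only the logged actions
    idx = np.arange(n)
    weight = target_probs[idx, logged_actions] / logged_probs
    correction = weight * (logged_rewards - reward_model[idx, logged_actions])
    return float(np.mean(direct + correction))
```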

Thompson Sampling for Contextual Bandits with Linear Payoffs

yanyangbaobeiIsEmma/Reinforcement-Learning-Contextual-Bandits 15 Sep 2012

Thompson Sampling is one of the oldest heuristics for multi-armed bandit problems.
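
For the non-contextual case, Thompson Sampling admits a very short implementation; the sketch below uses Beta posteriors over Bernoulli arms, whereas the linear-payoff contextual version analyzed in the paper instead samples from a Gaussian posterior over arm parameters. The arm probabilities and horizon are illustrative.

```python
import random

def thompson_bernoulli(arm_probs, horizon=10_000):
    """Beta-Bernoulli Thompson Sampling (illustrative sketch)."""
    n_arms = len(arm_probs)
    alpha = [1] * n_arms   # successes + 1
    beta = [1] * n_arms    # failures + 1

    total = 0.0
    for _ in range(horizon):
        # sample a plausible mean for each arm from its posterior, play the best
        samples = [random.betavariate(alpha[a], beta[a]) for a in range(n_arms)]
        arm = max(range(n_arms), key=lambda a: samples[a])
        reward = 1 if random.random() < arm_probs[arm] else 0
        alpha[arm] += reward
        beta[arm] += 1 - reward
        total += reward
    return total

print(thompson_bernoulli([0.2, 0.5, 0.7]))  # should approach roughly 0.7 * horizon
```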

Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits

VowpalWabbit/vowpal_wabbit 4 Feb 2014

We present a new algorithm for the contextual bandit learning problem, where the learner repeatedly takes one of $K$ actions in response to the observed context, and observes the reward only for that chosen action.
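
The abstract describes the contextual bandit protocol itself; below is a hedged sketch of that interaction loop with a simple epsilon-greedy learner standing in for the paper's oracle-based algorithm (this is not the Vowpal Wabbit API, and the environment callables are assumptions for the example).

```python
import random

def contextual_bandit_loop(contexts, reward_fn, n_actions, horizon=1000, epsilon=0.1):
    """Contextual bandit protocol: observe a context, pick one of K actions,
    observe the reward for that action only. Epsilon-greedy over per-(context, action)
    averages stands in for a real policy class here (toy sketch)."""
    q = {}        # (context, action) -> running mean reward
    counts = {}   # (context, action) -> number of observations
    total = 0.0

    for t in range(horizon):
        context = contexts[t % len(contexts)]
        if random.random() < epsilon:
            action = random.randrange(n_actions)                    # explore
        else:
            action = max(range(n_actions),
                         key=lambda a: q.get((context, a), 0.0))    # exploit
        reward = reward_fn(context, action)                         # bandit feedback only
        key = (context, action)
        counts[key] = counts.get(key, 0) + 1
        q[key] = q.get(key, 0.0) + (reward - q.get(key, 0.0)) / counts[key]
        total += reward
    return total
```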

Regulating Greed Over Time in Multi-Armed Bandits

5tefan0/Regulating-Greed-Over-Time 21 May 2015

In the corrected methods, exploitation (greed) is regulated over time, so that more exploitation occurs during higher reward periods, and more exploration occurs in periods of low reward.
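
As a rough illustration of that idea, the sketch below modulates the exploration rate with a known periodic reward signal so the learner exploits more during high-reward periods; the sinusoidal schedule, period, and epsilon range are made-up stand-ins, not the paper's exact regulated-greed methods.

```python
import math
import random

def regulated_greedy(arm_probs, horizon=10_000, period=500,
                     eps_low=0.02, eps_high=0.3):
    """Epsilon-greedy whose exploration rate follows a known reward seasonality:
    exploit more (small epsilon) when rewards are high, explore more when low.
    Toy stand-in for the regulated-greed idea (illustrative sketch)."""
    n_arms = len(arm_probs)
    counts = [0] * n_arms
    means = [0.0] * n_arms

    for t in range(horizon):
        # assumed seasonal multiplier on all rewards (1 = peak, 0 = trough)
        season = 0.5 * (1 + math.sin(2 * math.pi * t / period))
        epsilon = eps_high - (eps_high - eps_low) * season   # greedier at the peak
        if random.random() < epsilon:
            arm = random.randrange(n_arms)
        else:
            arm = max(range(n_arms), key=lambda a: means[a])
        reward = season * (1.0 if random.random() < arm_probs[arm] else 0.0)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]
    return means
```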