no code implementations • 27 Feb 2024 • Rong Jiang, Cong Ma
We study nonparametric contextual bandits under batch constraints, where the expected reward for each action is modeled as a smooth function of covariates, and the policy updates are made at the end of each batch of observations.