no code implementations • ICML 2020 • Pooria Joulani, Anant Raj, András György, Csaba Szepesvari
In this paper, we show that there is a simpler approach to obtaining accelerated rates: applying generic, well-known optimistic online learning algorithms and using the online average of their predictions to query the (deterministic or stochastic) first-order optimization oracle at each time step.
no code implementations • 27 Feb 2024 • Jincheng Mei, Zixin Zhong, Bo Dai, Alekh Agarwal, Csaba Szepesvari, Dale Schuurmans
We show that the \emph{stochastic gradient} bandit algorithm converges to a \emph{globally optimal} policy at an $O(1/t)$ rate, even with a \emph{constant} step size.
no code implementations • 29 Jan 2023 • Dong Yin, Sridhar Thiagarajan, Nevena Lazic, Nived Rajaraman, Botao Hao, Csaba Szepesvari
One useful property of simulators is that it is typically easy to reset the environment to a previously observed state.
no code implementations • 16 Jan 2023 • Jincheng Mei, Wesley Chung, Valentin Thomas, Bo Dai, Csaba Szepesvari, Dale Schuurmans
Instead, the analysis reveals that the primary effect of the value baseline is to \textbf{reduce the aggressiveness of the updates} rather than their variance.
1 code implementation • 11 Apr 2022 • Arushi Jain, Sharan Vaswani, Reza Babanezhad, Csaba Szepesvari, Doina Precup
We propose a generic primal-dual framework that allows us to bound the reward sub-optimality and constraint violation for arbitrary algorithms in terms of their primal and dual regret on online linear optimization problems.
no code implementations • NeurIPS 2021 • Jincheng Mei, Bo Dai, Chenjun Xiao, Csaba Szepesvari, Dale Schuurmans
We study the effect of stochasticity in on-policy policy optimization, and make the following four contributions.
no code implementations • 18 Jun 2021 • Chenjun Xiao, Ilbin Lee, Bo Dai, Dale Schuurmans, Csaba Szepesvari
In high stake applications, active experimentation may be considered too risky and thus data are often collected passively.
no code implementations • 15 Jun 2021 • Abbas Abdolmaleki, Sandy H. Huang, Giulia Vezzani, Bobak Shahriari, Jost Tobias Springenberg, Shruti Mishra, Dhruva TB, Arunkumar Byravan, Konstantinos Bousmalis, Andras Gyorgy, Csaba Szepesvari, Raia Hadsell, Nicolas Heess, Martin Riedmiller
Many advances that have improved the robustness and efficiency of deep reinforcement learning (RL) algorithms can, in one way or another, be understood as introducing additional objectives or constraints in the policy optimization step.
no code implementations • 13 May 2021 • Jincheng Mei, Yue Gao, Bo Dai, Csaba Szepesvari, Dale Schuurmans
Classical global convergence results for first-order methods rely on uniform smoothness and the \L{}ojasiewicz inequality.
no code implementations • 6 Apr 2021 • Chenjun Xiao, Yifan Wu, Tor Lattimore, Bo Dai, Jincheng Mei, Lihong Li, Csaba Szepesvari, Dale Schuurmans
First, we introduce a class of confidence-adjusted index algorithms that unifies optimistic and pessimistic principles in a common framework, which enables a general analysis.
no code implementations • 25 Feb 2021 • Nevena Lazic, Dong Yin, Yasin Abbasi-Yadkori, Csaba Szepesvari
We first show that the regret analysis of the Politex algorithm (a version of regularized policy iteration) can be sharpened from $O(T^{3/4})$ to $O(\sqrt{T})$ under nearly identical assumptions, and instantiate the bound with linear function approximation.
no code implementations • NeurIPS 2021 • Junyu Zhang, Chengzhuo Ni, Zheng Yu, Csaba Szepesvari, Mengdi Wang
By assuming the overparameterizaiton of policy and exploiting the hidden convexity of the problem, we further show that TSIVR-PG converges to global $\epsilon$-optimal policy with $\tilde{\mathcal{O}}(\epsilon^{-2})$ samples.
no code implementations • 11 Feb 2021 • Branislav Kveton, Mikhail Konobeev, Manzil Zaheer, Chih-Wei Hsu, Martin Mladenov, Craig Boutilier, Csaba Szepesvari
Efficient exploration in bandits is a fundamental online learning problem.
no code implementations • 1 Jan 2021 • Qi Cai, Zhuoran Yang, Csaba Szepesvari, Zhaoran Wang
Although policy optimization with neural networks has a track record of achieving state-of-the-art results in reinforcement learning on various domains, the theoretical understanding of the computational and sample efficiency of policy optimization remains restricted to linear function approximations with finite-dimensional feature representations, which hinders the design of principled, effective, and efficient algorithms.
no code implementations • 15 Dec 2020 • Dongruo Zhou, Quanquan Gu, Csaba Szepesvari
Based on the new inequality, we propose a new, computationally efficient algorithm with linear function approximation named $\text{UCRL-VTR}^{+}$ for the aforementioned linear mixture MDPs in the episodic undiscounted setting.
no code implementations • NeurIPS 2020 • Craig Boutilier, Chih-Wei Hsu, Branislav Kveton, Martin Mladenov, Csaba Szepesvari, Manzil Zaheer
Exploration policies in Bayesian bandits maximize the average reward over problem instances drawn from some distribution P. In this work, we learn such policies for an unknown distribution P using samples from P. Our approach is a form of meta-learning and exploits properties of P without making strong assumptions about its form.
no code implementations • NeurIPS 2020 • Gellert Weisz, András György, Wei-I Lin, Devon Graham, Kevin Leyton-Brown, Csaba Szepesvari, Brendan Lucier
Algorithm configuration procedures optimize parameters of a given algorithm to perform well over a distribution of inputs.
no code implementations • NeurIPS 2020 • Jincheng Mei, Chenjun Xiao, Bo Dai, Lihong Li, Csaba Szepesvari, Dale Schuurmans
Both findings are based on an analysis of convergence rates using the Non-uniform \L{}ojasiewicz (N\L{}) inequalities.
no code implementations • NeurIPS 2020 • Junyu Zhang, Alec Koppel, Amrit Singh Bedi, Csaba Szepesvari, Mengdi Wang
Analogously to the Policy Gradient Theorem \cite{sutton2000policy} available for RL with cumulative rewards, we derive a new Variational Policy Gradient Theorem for RL with general utilities, which establishes that the parametrized policy gradient may be obtained as the solution of a stochastic saddle point problem involving the Fenchel dual of the utility function.
no code implementations • NeurIPS 2020 • Omar Rivasplata, Ilja Kuzborskij, Csaba Szepesvari, John Shawe-Taylor
Specifically, we present a basic PAC-Bayes inequality for stochastic kernels, from which one may derive extensions of various known PAC-Bayes bounds as well as novel bounds.
no code implementations • 9 Jun 2020 • Branislav Kveton, Martin Mladenov, Chih-Wei Hsu, Manzil Zaheer, Csaba Szepesvari, Craig Boutilier
Most bandit policies are designed to either minimize regret in any problem instance, making very few assumptions about the underlying environment, or in a Bayesian sense, assuming a prior distribution over environment parameters.
no code implementations • ICML 2020 • Alex Ayoub, Zeyu Jia, Csaba Szepesvari, Mengdi Wang, Lin F. Yang
We propose a model based RL algorithm that is based on optimism principle: In each episode, the set of models that are `consistent' with the data collected is constructed.
no code implementations • ICML 2020 • Jincheng Mei, Chenjun Xiao, Csaba Szepesvari, Dale Schuurmans
First, we show that with the true gradient, policy gradient with a softmax parametrization converges at a $O(1/t)$ rate, with constants depending on the problem and initialization.
no code implementations • NeurIPS 2020 • Aldo Pacchiano, My Phan, Yasin Abbasi-Yadkori, Anup Rao, Julian Zimmert, Tor Lattimore, Csaba Szepesvari
Our methods rely on a novel and generic smoothing transformation for bandit algorithms that permits us to obtain optimal $O(\sqrt{T})$ model selection guarantees for stochastic contextual bandit problems as long as the optimal base algorithm satisfies a high probability regret guarantee.
no code implementations • NeurIPS 2020 • Craig Boutilier, Chih-Wei Hsu, Branislav Kveton, Martin Mladenov, Csaba Szepesvari, Manzil Zaheer
In this work, we learn such policies for an unknown distribution $\mathcal{P}$ using samples from $\mathcal{P}$.
1 code implementation • 8 Feb 2020 • Botao Hao, Nevena Lazic, Yasin Abbasi-Yadkori, Pooria Joulani, Csaba Szepesvari
This is an improvement over the best existing bound of $\tilde{O}(T^{3/4})$ for the average-reward case with function approximation.
no code implementations • NeurIPS 2019 • Pooria Joulani, András György, Csaba Szepesvari
ASYNCADA is, to our knowledge, the first asynchronous stochastic optimization algorithm with finite-time data-dependent convergence guarantees for generic convex constraints.
no code implementations • ICML 2020 • Tor Lattimore, Csaba Szepesvari, Gellert Weisz
The construction by Du et al. (2019) implies that even if a learner is given linear features in $\mathbb R^d$ that approximate the rewards in a bandit with a uniform error of $\epsilon$, then searching for an action that is optimal up to $O(\epsilon)$ requires examining essentially all actions.
no code implementations • 18 Oct 2019 • Pratik Gajane, Ronald Ortner, Peter Auer, Csaba Szepesvari
We consider a setting in which the objective is to learn to navigate in a controlled Markov process (CMP) where transition probabilities may abruptly change.
no code implementations • 15 Oct 2019 • Botao Hao, Tor Lattimore, Csaba Szepesvari
Contextual bandits serve as a fundamental model for many sequential decision making tasks.
no code implementations • 27 Aug 2019 • Yasin Abbasi-Yadkori, Nevena Lazic, Csaba Szepesvari, Gellert Weisz
POLITEX has sublinear regret guarantees in uniformly-mixing MDPs when the value estimation error can be controlled, which can be satisfied if all policies sufficiently explore the environment.
no code implementations • 19 Aug 2019 • Omar Rivasplata, Vikram M Tankasali, Csaba Szepesvari
We explore the family of methods "PAC-Bayes with Backprop" (PBB) to train probabilistic neural networks by minimizing PAC-Bayes bounds.
2 code implementations • ICLR 2020 • Ian Osband, Yotam Doron, Matteo Hessel, John Aslanides, Eren Sezener, Andre Saraiva, Katrina McKinney, Tor Lattimore, Csaba Szepesvari, Satinder Singh, Benjamin Van Roy, Richard Sutton, David Silver, Hado van Hasselt
bsuite is a collection of carefully-designed experiments that investigate core capabilities of reinforcement learning (RL) agents with two objectives.
no code implementations • 12 Jul 2019 • Tor Lattimore, Csaba Szepesvari
We provide a simple and efficient algorithm for adversarial $k$-action $d$-outcome non-degenerate locally observable partial monitoring game for which the $n$-round minimax regret is bounded by $6(d+1) k^{3/2} \sqrt{n \log(k)}$, matching the best known information-theoretic upper bound.
no code implementations • 21 Jun 2019 • Branislav Kveton, Manzil Zaheer, Csaba Szepesvari, Lihong Li, Mohammad Ghavamzadeh, Craig Boutilier
The first, GLM-TSL, samples a generalized linear model (GLM) from the Laplace approximation to the posterior distribution.
no code implementations • ICML 2018 • Yao Ma, Alex Olshevsky, Venkatesh Saligrama, Csaba Szepesvari
We consider worker skill estimation for the single-coin Dawid-Skene crowdsourcing model.
no code implementations • 4 Apr 2019 • Chih-Wei Hsu, Branislav Kveton, Ofer Meshi, Martin Mladenov, Csaba Szepesvari
In this work, we pioneer the idea of algorithm design by minimizing the empirical Bayes regret, the average regret over problem instances sampled from a known distribution.
no code implementations • 21 Mar 2019 • Branislav Kveton, Csaba Szepesvari, Mohammad Ghavamzadeh, Craig Boutilier
We evaluate our algorithms empirically and show that they are practical.
no code implementations • 12 Mar 2019 • Karim Abou-Moustafa, Csaba Szepesvari
In particular, the main question we address here is \emph{whether it is possible to derive exponential generalization bounds for the estimated risk using a notion of stability that is computationally tractable and distribution dependent, but weaker than uniform stability.
no code implementations • 26 Feb 2019 • Branislav Kveton, Csaba Szepesvari, Mohammad Ghavamzadeh, Craig Boutilier
Finally, we empirically evaluate PHE and show that it is competitive with state-of-the-art baselines.
no code implementations • 1 Feb 2019 • Tor Lattimore, Csaba Szepesvari
We prove a new minimax theorem connecting the worst-case Bayesian regret and minimax regret under partial monitoring with no assumptions on the space of signals or decisions of the adversary.
no code implementations • ICLR 2019 • Jonathan Uesato, Ananya Kumar, Csaba Szepesvari, Tom Erez, Avraham Ruderman, Keith Anderson, Krishmamurthy, Dvijotham, Nicolas Heess, Pushmeet Kohli
We demonstrate this is an issue for current agents, where even matching the compute used for training is sometimes insufficient for evaluation.
no code implementations • 13 Nov 2018 • Branislav Kveton, Csaba Szepesvari, Sharan Vaswani, Zheng Wen, Mohammad Ghavamzadeh, Tor Lattimore
Specifically, it pulls the arm with the highest mean reward in a non-parametric bootstrap sample of its history with pseudo rewards.
no code implementations • ICML 2018 • Gellert Weisz, Andras Gyorgy, Csaba Szepesvari
We consider the problem of configuring general-purpose solvers to run efficiently on problem instances drawn from an unknown distribution.
no code implementations • NeurIPS 2018 • Omar Rivasplata, Emilio Parrado-Hernandez, John Shawe-Taylor, Shiliang Sun, Csaba Szepesvari
Our main result estimates the risk of the randomized algorithm in terms of the hypothesis stability coefficients.
no code implementations • 15 Jun 2018 • Chang Li, Branislav Kveton, Tor Lattimore, Ilya Markov, Maarten de Rijke, Csaba Szepesvari, Masrour Zoghi
In this paper, we study the problem of safe online learning to re-rank, where user feedback is used to improve the quality of displayed lists.
no code implementations • NeurIPS 2018 • Tor Lattimore, Branislav Kveton, Shuai Li, Csaba Szepesvari
Online learning to rank is a sequential decision-making problem where in each round the learning agent chooses a list of items and receives feedback in the form of clicks from the user.
no code implementations • 23 May 2018 • Tor Lattimore, Csaba Szepesvari
Partial monitoring is a generalization of the well-known multi-armed bandit framework where the loss is not directly observed by the learner.
no code implementations • 17 Apr 2018 • Yasin Abbasi-Yadkori, Nevena Lazic, Csaba Szepesvari
Model-free approaches for reinforcement learning (RL) and continuous control find policies based only on past states and rewards, without fitting a model of the system dynamics.
no code implementations • 13 Dec 2017 • Branislav Kveton, Csaba Szepesvari, Anup Rao, Zheng Wen, Yasin Abbasi-Yadkori, S. Muthukrishnan
Many problems in computer vision and recommender systems involve low-rank matrices.
no code implementations • NeurIPS 2017 • Mahdi Karami, Martha White, Dale Schuurmans, Csaba Szepesvari
In this paper, we instead reconsider likelihood maximization and develop an optimization based strategy for recovering the latent states and transition parameters.
no code implementations • ICML 2018 • Ciara Pike-Burke, Shipra Agrawal, Csaba Szepesvari, Steffen Grunewalder
In this problem, when the player pulls an arm, a reward is generated, however it is not immediately observed.
no code implementations • 20 Jun 2017 • Yao Ma, Alex Olshevsky, Venkatesh Saligrama, Csaba Szepesvari
We then formulate a weighted rank-one optimization problem to estimate skills based on observations on an irreducible, aperiodic interaction graph.
no code implementations • 19 Jun 2017 • Karim Abou-Moustafa, Csaba Szepesvari
Next, under some reasonable notion of stability, we use this exponential tail bound to analyze the concentration of the k-fold cross-validation (KFCV) estimate around the true risk of a hypothesis generated by a general learning rule.
no code implementations • ICML 2017 • Masrour Zoghi, Tomas Tunys, Mohammad Ghavamzadeh, Branislav Kveton, Csaba Szepesvari, Zheng Wen
In this work, we propose BatchRank, the first online learning to rank algorithm for a broad class of click models.
no code implementations • 18 Oct 2016 • Manjesh Hanawal, Csaba Szepesvari, Venkatesh Saligrama
We reduce USS to a special case of multi-armed bandit problem with side information and develop polynomial time algorithms that achieve sublinear regret.
no code implementations • 14 Oct 2016 • Tor Lattimore, Csaba Szepesvari
Stochastic linear bandits are a natural and simple generalisation of finite-armed bandits with numerous practical applications.
no code implementations • 10 Aug 2016 • Sumeet Katariya, Branislav Kveton, Csaba Szepesvari, Claire Vernade, Zheng Wen
The main challenge of the problem is that the individual values of the row and column are unobserved.
no code implementations • NeurIPS 2015 • Tor Lattimore, Koby Crammer, Csaba Szepesvari
In each time step the learner chooses an allocation of several resource types between a number of tasks.
1 code implementation • 10 Nov 2015 • Ruitong Huang, Bing Xu, Dale Schuurmans, Csaba Szepesvari
The robustness of neural networks to intended perturbations has recently attracted significant attention.
no code implementations • NeurIPS 2015 • Branislav Kveton, Zheng Wen, Azin Ashkan, Csaba Szepesvari
The agent observes the index of the first chosen item whose weight is zero.
no code implementations • 10 Feb 2015 • Branislav Kveton, Csaba Szepesvari, Zheng Wen, Azin Ashkan
We also prove gap-dependent upper bounds on the regret of these algorithms and derive a lower bound on the regret in cascading bandits.
no code implementations • NeurIPS 2014 • Hengshuai Yao, Csaba Szepesvari, Richard S. Sutton, Joseph Modayil, Shalabh Bhatnagar
We prove that the UOM of an option can construct a traditional option model given a reward function, and the option-conditional return is computed directly by a single dot-product of the UOM with the reward function.
no code implementations • 3 Oct 2014 • Branislav Kveton, Zheng Wen, Azin Ashkan, Csaba Szepesvari
A stochastic combinatorial semi-bandit is an online learning problem where at each step a learning agent chooses a subset of ground items subject to constraints, and then observes stochastic weights of these items and receives their sum as a payoff.
no code implementations • 12 Sep 2014 • Lihong Li, Remi Munos, Csaba Szepesvari
This paper studies the off-policy evaluation problem, where one aims to estimate the value of a target policy based on a sample of observations collected by another policy.
no code implementations • 16 Jun 2014 • Yasin Abbasi-Yadkori, Csaba Szepesvari
We study Bayesian optimal control of a general class of smoothly parameterized Markov decision problems.
no code implementations • NeurIPS 2013 • Yasin Abbasi, Peter L. Bartlett, Varun Kanade, Yevgeny Seldin, Csaba Szepesvari
The goal of the learning algorithm is to choose a path that minimizes the loss while traversing from the start to finish node.
no code implementations • 20 Jun 2012 • Gergely Neu, Csaba Szepesvari
In this paper we propose a novel gradient algorithm to learn a policy from an expert's observed behavior assuming that the expert behaves optimally with respect to some unknown reward function of a Markovian Decision Problem.
no code implementations • 13 Jun 2012 • Richard S. Sutton, Csaba Szepesvari, Alborz Geramifard, Michael P. Bowling
Our main results are to prove that linear Dyna-style planning converges to a unique solution independent of the generating distribution, under natural conditions.
no code implementations • 14 Feb 2012 • Mahdi Milani Fard, Joelle Pineau, Csaba Szepesvari
PAC-Bayesian methods overcome this problem by providing bounds that hold regardless of the correctness of the prior distribution.