no code implementations • ICML 2020 • Chi Jin, Tiancheng Jin, Haipeng Luo, Suvrit Sra, Tiancheng Yu
We consider the task of learning in episodic finite-horizon Markov decision processes with an unknown transition function, bandit feedback, and adversarial losses.
no code implementations • 13 Mar 2024 • Yang Cai, Constantinos Daskalakis, Haipeng Luo, Chen-Yu Wei, Weiqiang Zheng
While Online Gradient Descent and other no-regret learning procedures are known to efficiently converge to coarse correlated equilibrium in games where each agent's utility is concave in their own strategy, this is not the case when the utilities are non-concave. Such non-concavity is common in machine learning applications where the agents' strategies are parameterized by deep neural networks, where the agents' utilities are computed by a neural network, or both.
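For context on the concave (convex-loss) regime the entry contrasts against, here is a minimal sketch of projected Online Gradient Descent in one dimension; the loss sequence, interval, and step size are illustrative choices, not taken from the paper:

```python
import numpy as np

def projected_ogd(grad, x0, T, lo=-1.0, hi=1.0):
    """Projected Online Gradient Descent on the interval [lo, hi]."""
    x = float(x0)
    plays = []
    for t in range(1, T + 1):
        plays.append(x)
        g = grad(t, x)           # gradient of the round-t loss at the played point
        x -= g / np.sqrt(t)      # standard 1/sqrt(t) step size
        x = min(max(x, lo), hi)  # Euclidean projection back onto [lo, hi]
    return plays

# Illustrative convex losses f_t(x) = (x - c_t)^2 with alternating targets.
T = 200
targets = [0.5 if t % 2 else 0.0 for t in range(T)]
plays = projected_ogd(lambda t, x: 2.0 * (x - targets[t - 1]), x0=-1.0, T=T)

alg_loss = sum((x - c) ** 2 for x, c in zip(plays, targets))
best_fixed = min(sum((u - c) ** 2 for c in targets) for u in np.linspace(-1, 1, 201))
avg_regret = (alg_loss - best_fixed) / T  # shrinks at rate O(1/sqrt(T))
```

Against this fixed-comparator benchmark the average regret vanishes; the paper's point is that no such guarantee is available once the losses (utilities) are non-concave.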
no code implementations • 12 Feb 2024 • Mengxiao Zhang, Haipeng Luo
Contextual multinomial logit (MNL) bandits capture many real-world assortment recommendation problems such as online retailing/advertising.
no code implementations • 12 Feb 2024 • Mengxiao Zhang, Yuheng Zhang, Haipeng Luo, Paul Mineiro
Bandits with feedback graphs are powerful online learning models that interpolate between the full information and classic bandit problems, capturing many real-life applications.
no code implementations • 26 Jan 2024 • Yang Cai, Haipeng Luo, Chen-Yu Wei, Weiqiang Zheng
In this paper, we improve both results significantly by providing an uncoupled policy optimization algorithm that attains a near-optimal $\tilde{O}(T^{-1})$ convergence rate for computing a correlated equilibrium.
no code implementations • 1 Nov 2023 • Yang Cai, Gabriele Farina, Julien Grand-Clément, Christian Kroer, Chung-Wei Lee, Haipeng Luo, Weiqiang Zheng
Algorithms based on regret matching, specifically regret matching$^+$ (RM$^+$), and its variants are the most popular approaches for solving large-scale two-player zero-sum games in practice.
no code implementations • 8 Oct 2023 • Mengxiao Zhang, Haipeng Luo
We study online learning in contextual pay-per-click auctions where at each of the $T$ rounds, the learner receives some context along with a set of ads and needs to make an estimate on their click-through rate (CTR) in order to run a second-price pay-per-click auction.
1 code implementation • 18 Aug 2023 • Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, JianGuang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng Chen, Dongmei Zhang
Through extensive experiments on two mathematical reasoning benchmarks, namely GSM8k and MATH, we reveal the extraordinary capabilities of our model.
Ranked #50 on Arithmetic Reasoning on GSM8K (using extra training data)
no code implementations • 2 Feb 2023 • Akhil Agnihotri, Rahul Jain, Haipeng Luo
Often, the average criterion is more suitable than the discounted criterion.
no code implementations • 30 Jan 2023 • Yan Dai, Haipeng Luo, Chen-Yu Wei, Julian Zimmert
This analysis allows the loss estimators to be arbitrarily negative and might be of independent interest.
5 code implementations • CVPR 2023 • Wenhao Wu, Xiaohan Wang, Haipeng Luo, Jingdong Wang, Yi Yang, Wanli Ouyang
In this paper, we propose a novel framework called BIKE, which utilizes the cross-modal bridge to explore bidirectional knowledge: i) We introduce the Video Attribute Association mechanism, which leverages the Video-to-Text knowledge to generate textual auxiliary attributes for complementing video recognition.
Ranked #1 on Zero-Shot Action Recognition on ActivityNet
4 code implementations • CVPR 2023 • Wenhao Wu, Haipeng Luo, Bo Fang, Jingdong Wang, Wanli Ouyang
Most existing text-video retrieval methods focus on cross-modal matching between the visual content of videos and textual query sentences.
Ranked #7 on Video Retrieval on VATEX
no code implementations • 23 Oct 2022 • Mengxiao Zhang, Shi Chen, Haipeng Luo, Yingfei Wang
Supply chain management (SCM) has been recognized as an important discipline with applications to many industries. The two-echelon stochastic inventory model, involving one downstream retailer and one upstream supplier, plays a fundamental role in developing firms' SCM strategies.
no code implementations • 4 Oct 2022 • Haipeng Luo, Hanghang Tong, Mengxiao Zhang, Yuheng Zhang
For general strongly observable graphs, we develop an algorithm that achieves the optimal regret $\widetilde{\mathcal{O}}((\sum_{t=1}^T\alpha_t)^{1/2}+\max_{t\in[T]}\alpha_t)$ with high probability, where $\alpha_t$ is the independence number of the feedback graph at round $t$.
no code implementations • 17 Jun 2022 • Gabriele Farina, Ioannis Anagnostides, Haipeng Luo, Chung-Wei Lee, Christian Kroer, Tuomas Sandholm
In this paper, we answer this in the positive by establishing the first uncoupled learning algorithm with $O(\log T)$ per-player regret in general \emph{convex games}, that is, games with concave utility functions supported on arbitrary convex and compact strategy sets.
no code implementations • 26 May 2022 • Yan Dai, Haipeng Luo, Liyu Chen
More importantly, we then find two significant applications: First, the analysis of FTPL turns out to be readily generalizable to delayed bandit feedback with order-optimal regret, while OMD methods exhibit extra difficulties (Jin et al., 2022).
no code implementations • 25 May 2022 • Liyu Chen, Haipeng Luo
We initiate the study of dynamic regret minimization for goal-oriented reinforcement learning modeled by a non-stationary stochastic shortest path problem with changing cost and transition functions.
no code implementations • 25 Apr 2022 • Ioannis Anagnostides, Gabriele Farina, Christian Kroer, Chung-Wei Lee, Haipeng Luo, Tuomas Sandholm
In this paper we establish efficient and \emph{uncoupled} learning dynamics so that, when employed by all players in a general-sum multiplayer game, the \emph{swap regret} of each player after $T$ repetitions of the game is bounded by $O(\log T)$, improving over the prior best bounds of $O(\log^4 (T))$.
no code implementations • 12 Feb 2022 • Haipeng Luo, Mengxiao Zhang, Peng Zhao
We consider the problem of adversarial bandit convex optimization, that is, online learning over a sequence of arbitrary convex loss functions with only one function evaluation for each of them.
no code implementations • 12 Feb 2022 • Haipeng Luo, Mengxiao Zhang, Peng Zhao, Zhi-Hua Zhou
The CORRAL algorithm of Agarwal et al. (2017) and its variants (Foster et al., 2020a) achieve this goal with a regret overhead of order $\widetilde{O}(\sqrt{MT})$ where $M$ is the number of base algorithms and $T$ is the time horizon.
no code implementations • 7 Feb 2022 • Liyu Chen, Haipeng Luo, Aviv Rosenberg
Policy optimization is among the most popular and successful reinforcement learning algorithms, and there is increasing interest in understanding its theoretical guarantees.
no code implementations • 1 Feb 2022 • Gabriele Farina, Chung-Wei Lee, Haipeng Luo, Christian Kroer
In this paper we show that the Optimistic Multiplicative Weights Update (OMWU) algorithm -- the premier learning algorithm for NFGs -- can be simulated on the normal-form equivalent of an EFG in linear time per iteration in the game tree size using a kernel trick.
no code implementations • 31 Jan 2022 • Liyu Chen, Rahul Jain, Haipeng Luo
We study regret minimization for infinite-horizon average-reward Markov Decision Processes (MDPs) under cost constraints.
no code implementations • 31 Jan 2022 • Tiancheng Jin, Tal Lancewicki, Haipeng Luo, Yishay Mansour, Aviv Rosenberg
The standard assumption in reinforcement learning (RL) is that agents observe feedback for their actions immediately.
no code implementations • 30 Jan 2022 • Mengxiao Zhang, Peng Zhao, Haipeng Luo, Zhi-Hua Zhou
Learning from repeated play in a fixed two-player zero-sum game is a classic problem in game theory and online learning.
no code implementations • NeurIPS 2021 • Haipeng Luo, Chen-Yu Wei, Chung-Wei Lee
When a simulator is unavailable, we further consider a linear MDP setting and obtain $\widetilde{\mathcal{O}}({T}^{14/15})$ regret, which is the first result for linear MDPs with adversarial losses and bandit feedback.
no code implementations • NeurIPS 2021 • Chung-Wei Lee, Christian Kroer, Haipeng Luo
Inspired by recent advances on last-iterate convergence of optimistic algorithms in zero-sum normal-form games, we study this phenomenon in sequential games, and provide a comprehensive study of last-iterate convergence for zero-sum extensive-form games with perfect recall (EFGs), using various optimistic regret-minimization algorithms over treeplexes.
no code implementations • NeurIPS 2021 • Liyu Chen, Mehdi Jafarnia-Jahromi, Rahul Jain, Haipeng Luo
We introduce a generic template for developing regret minimization algorithms in the Stochastic Shortest Path (SSP) model, which achieves minimax optimal regret as long as certain properties are ensured.
no code implementations • 9 Jun 2021 • Mehdi Jafarnia-Jahromi, Liyu Chen, Rahul Jain, Haipeng Luo
We consider the problem of online reinforcement learning for the Stochastic Shortest Path (SSP) problem modeled as an unknown MDP with an absorbing state.
no code implementations • NeurIPS 2021 • Tiancheng Jin, Longbo Huang, Haipeng Luo
We consider the best-of-both-worlds problem for learning an episodic Markov Decision Process through $T$ episodes, with the goal of achieving $\widetilde{\mathcal{O}}(\sqrt{T})$ regret when the losses are adversarial and simultaneously $\mathcal{O}(\text{polylog}(T))$ regret when the losses are (almost) stochastic.
no code implementations • 11 Feb 2021 • Chung-Wei Lee, Haipeng Luo, Chen-Yu Wei, Mengxiao Zhang, Xiaojin Zhang
In this work, we develop linear bandit algorithms that automatically adapt to different environments.
no code implementations • 10 Feb 2021 • Liyu Chen, Haipeng Luo
Our work strictly improves (Rosenberg and Mansour, 2020) in the full information setting, extends (Chen et al., 2020) from known transition to unknown transition, and is also the first to consider the most challenging combination: bandit feedback with adversarial costs and unknown transition.
no code implementations • 10 Feb 2021 • Chen-Yu Wei, Haipeng Luo
Specifically, in most cases our algorithm achieves the optimal dynamic regret $\widetilde{\mathcal{O}}(\min\{\sqrt{LT}, \Delta^{1/3}T^{2/3}\})$ where $T$ is the number of rounds and $L$ and $\Delta$ are the number and amount of changes of the world respectively, while previous works only obtain suboptimal bounds and/or require the knowledge of $L$ and $\Delta$.
no code implementations • 8 Feb 2021 • Chen-Yu Wei, Chung-Wei Lee, Mengxiao Zhang, Haipeng Luo
We study infinite-horizon discounted two-player zero-sum Markov games, and develop a decentralized algorithm that provably converges to the set of Nash equilibria under self-play.
no code implementations • 1 Feb 2021 • Liyu Chen, Haipeng Luo, Chen-Yu Wei
We resolve the long-standing "impossible tuning" issue for the classic expert problem and show that, it is in fact possible to achieve regret $O\left(\sqrt{(\ln d)\sum_t \ell_{t, i}^2}\right)$ simultaneously for all expert $i$ in a $T$-round $d$-expert problem where $\ell_{t, i}$ is the loss for expert $i$ in round $t$.
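For context, the standard fixed-learning-rate baseline for this $d$-expert problem is exponential weights (Hedge), whose one-size-fits-all tuning is what the "impossible tuning" result improves upon. A minimal sketch on synthetic losses (the learning rate and data are illustrative, and this is the classic baseline, not the paper's algorithm):

```python
import numpy as np

def hedge(losses, eta):
    """Exponential weights (Hedge) for the d-expert problem.
    losses: a (T, d) array of per-round expert losses in [0, 1]."""
    T, d = losses.shape
    L = np.zeros(d)                       # cumulative expert losses
    alg_loss = 0.0
    for t in range(T):
        w = np.exp(-eta * (L - L.min()))  # shift by L.min() for numerical stability
        p = w / w.sum()                   # play the exponential-weights distribution
        alg_loss += p @ losses[t]
        L += losses[t]
    return alg_loss, L

# Synthetic losses in which expert 2 is best on average (illustrative data).
rng = np.random.default_rng(0)
losses = rng.random((1000, 5))
losses[:, 2] *= 0.5
eta = np.sqrt(2 * np.log(5) / 1000)       # fixed tuning for T=1000, d=5
alg_loss, L = hedge(losses, eta)
regret = alg_loss - L.min()               # bounded by O(sqrt(T log d))
```

With a single fixed $\eta$, Hedge only guarantees the worst-case $O(\sqrt{T\ln d})$ bound against every expert simultaneously; the entry's result achieves the per-expert loss-dependent bound that a per-expert tuning would suggest.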
no code implementations • 7 Dec 2020 • Liyu Chen, Haipeng Luo, Chen-Yu Wei
We study the stochastic shortest path problem with adversarial costs and known transition, and show that the minimax regret is $\widetilde{O}(\sqrt{DT^\star K})$ and $\widetilde{O}(\sqrt{DT^\star SA K})$ for the full-information setting and the bandit feedback setting respectively, where $D$ is the diameter, $T^\star$ is the expected hitting time of the optimal policy, $S$ is the number of states, $A$ is the number of actions, and $K$ is the number of episodes.
no code implementations • 23 Jul 2020 • Chen-Yu Wei, Mehdi Jafarnia-Jahromi, Haipeng Luo, Rahul Jain
We develop several new algorithms for learning Markov Decision Processes in an infinite-horizon average-reward setting with linear function approximation.
no code implementations • NeurIPS 2020 • Dirk van der Hoeven, Ashok Cutkosky, Haipeng Luo
We study bandit convex optimization methods that adapt to the norm of the comparator, a topic that has only been studied before for its full-information counterpart.
no code implementations • 25 Jun 2020 • Yining Chen, Haipeng Luo, Tengyu Ma, Chicheng Zhang
We propose a surprisingly simple algorithm that adaptively balances its regret and its number of label queries in settings where the data streams are from a mixture of hidden domains.
no code implementations • 19 Jun 2020 • Dylan J. Foster, Akshay Krishnamurthy, Haipeng Luo
In statistical learning, algorithms for model selection allow the learner to adapt to the complexity of the best hypothesis class in a sequence.
1 code implementation • ICLR 2021 • Chen-Yu Wei, Chung-Wei Lee, Mengxiao Zhang, Haipeng Luo
Specifically, for OMWU in bilinear games over the simplex, we show that when the equilibrium is unique, linear last-iterate convergence is achieved with a learning rate whose value is set to a universal constant, improving the result of (Daskalakis & Panageas, 2019b) under the same assumption.
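A minimal self-play sketch of Optimistic Multiplicative Weights Update (OMWU) on a bilinear game over the simplex; the payoff matrix, learning rate, and horizon are assumptions for the demo, not values from the paper. The time-averaged strategies converge to the game's unique mixed equilibrium $x^* = y^* = (0.4, 0.6)$:

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # shift for numerical stability
    w = np.exp(z)
    return w / w.sum()

def omwu_selfplay(A, T=5000, eta=0.05):
    """Both players run Optimistic MWU in self-play on the zero-sum game A
    (row player maximizes x^T A y, column player minimizes it)."""
    n, m = A.shape
    Sx, Sy = np.zeros(n), np.zeros(m)  # cumulative payoff vectors
    gx, gy = np.zeros(n), np.zeros(m)  # last round's payoffs, used as the prediction
    avg_x, avg_y = np.zeros(n), np.zeros(m)
    for _ in range(T):
        # Equivalent to the OMWU update x_{t+1} ∝ x_t * exp(eta*(2*g_t - g_{t-1})).
        x = softmax(eta * (Sx + gx))
        y = softmax(eta * (Sy + gy))
        gx = A @ y                      # row player's per-action payoff
        gy = -(A.T @ x)                 # column player's per-action payoff
        Sx += gx
        Sy += gy
        avg_x += x / T
        avg_y += y / T
    return x, y, avg_x, avg_y

# Illustrative 2x2 game with unique mixed equilibrium (0.4, 0.6) for both players.
A = np.array([[2.0, -1.0], [-1.0, 1.0]])
x, y, avg_x, avg_y = omwu_selfplay(A)
```

The last iterates `x, y` also spiral inward toward the equilibrium, which is the linear last-iterate convergence phenomenon the entry establishes (under a universal-constant learning rate).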
no code implementations • NeurIPS 2020 • Chung-Wei Lee, Haipeng Luo, Chen-Yu Wei, Mengxiao Zhang
We develop a new approach to obtaining high probability regret bounds for online learning with bandit feedback against an adaptive adversary.
no code implementations • ICML Workshop LifelongML 2020 • Yining Chen, Haipeng Luo, Tengyu Ma, Chicheng Zhang
We propose a surprisingly simple algorithm that adaptively balances its regret and its number of label queries in settings where the data streams are from a mixture of hidden domains.
no code implementations • NeurIPS 2020 • Tiancheng Jin, Haipeng Luo
This work studies the problem of learning episodic Markov Decision Processes with known transition and bandit feedback.
no code implementations • 8 Jun 2020 • Mehdi Jafarnia-Jahromi, Chen-Yu Wei, Rahul Jain, Haipeng Luo
Recently, model-free reinforcement learning has attracted research attention due to its simplicity, its memory and computation efficiency, and its flexibility to combine with function approximation.
no code implementations • 7 Mar 2020 • Ehsan Emamjomeh-Zadeh, Chen-Yu Wei, Haipeng Luo, David Kempe
We revisit the problem of online learning with sleeping experts/bandits: in each time step, only a subset of the actions are available for the algorithm to choose from (and learn about).
no code implementations • 4 Mar 2020 • Chen-Yu Wei, Haipeng Luo, Alekh Agarwal
We initiate the study of learning in contextual bandits with the help of loss predictors.
no code implementations • 2 Feb 2020 • Chung-Wei Lee, Haipeng Luo, Mengxiao Zhang
We study small-loss bounds for adversarial multi-armed bandits with graph feedback, that is, adaptive regret bounds that depend on the loss of the best arm or related quantities, instead of the total number of rounds.
no code implementations • 13 Dec 2019 • Yifang Chen, Alex Cuellar, Haipeng Luo, Jignesh Modi, Heramb Nemlekar, Stefanos Nikolaidis
We introduce a Multi-Armed Bandit algorithm with fairness constraints, where fairness is defined as a minimum rate that a task or a resource is assigned to a user.
no code implementations • 3 Dec 2019 • Chi Jin, Tiancheng Jin, Haipeng Luo, Suvrit Sra, Tiancheng Yu
We consider the problem of learning in episodic finite-horizon Markov decision processes with an unknown transition function, bandit feedback, and adversarial losses.
1 code implementation • ICML 2020 • Chen-Yu Wei, Mehdi Jafarnia-Jahromi, Haipeng Luo, Hiteshi Sharma, Rahul Jain
Model-free reinforcement learning is known to be memory and computation efficient and more amenable to large scale problems.
1 code implementation • NeurIPS 2019 • Dylan J. Foster, Akshay Krishnamurthy, Haipeng Luo
We work in the stochastic realizable setting with a sequence of nested linear policy classes of dimension $d_1 < d_2 < \ldots$, where the $m^\star$-th class contains the optimal policy, and we design an algorithm that achieves $\tilde{O}(T^{2/3}d^{1/3}_{m^\star})$ regret with no prior knowledge of the optimal dimension $d_{m^\star}$.
no code implementations • NeurIPS 2019 • Kai Zheng, Haipeng Luo, Ilias Diakonikolas, Li-Wei Wang
We propose the first reduction-based approach to obtaining long-term memory guarantees for online learning in the sense of Bousquet and Warmuth, 2002, by reducing the problem to achieving typical switching regret.
no code implementations • NeurIPS 2019 • Dylan J. Foster, Spencer Greenberg, Satyen Kale, Haipeng Luo, Mehryar Mohri, Karthik Sridharan
Our main result is a generalization bound for data-dependent hypothesis sets expressed in terms of a notion of hypothesis set stability and a notion of Rademacher complexity for data-dependent hypothesis sets that we introduce.
no code implementations • 3 Feb 2019 • Yifang Chen, Chung-Wei Lee, Haipeng Luo, Chen-Yu Wei
We propose the first contextual bandit algorithm that is parameter-free, efficient, and optimal in terms of dynamic regret.
no code implementations • 29 Jan 2019 • Sébastien Bubeck, Yuanzhi Li, Haipeng Luo, Chen-Yu Wei
We study adaptive regret bounds in terms of the variation of the losses (the so-called path-length bounds) for both multi-armed bandit and more generally linear bandit.
no code implementations • 25 Jan 2019 • Julian Zimmert, Haipeng Luo, Chen-Yu Wei
We develop the first general semi-bandit algorithm that simultaneously achieves $\mathcal{O}(\log T)$ regret for stochastic environments and $\mathcal{O}(\sqrt{T})$ regret for adversarial environments without knowledge of the regime or the number of rounds $T$.
no code implementations • NeurIPS 2018 • Haipeng Luo, Chen-Yu Wei, Kai Zheng
We study the decades-old problem of online portfolio management and propose the first algorithm with logarithmic regret that is not based on Cover's Universal Portfolio algorithm and admits a much faster implementation.
no code implementations • 25 Mar 2018 • Dylan J. Foster, Satyen Kale, Haipeng Luo, Mehryar Mohri, Karthik Sridharan
Starting with the simple observation that the logistic loss is $1$-mixable, we design a new efficient improper learning algorithm for online logistic regression that circumvents the aforementioned lower bound with a regret bound exhibiting a doubly-exponential improvement in dependence on the predictor norm.
no code implementations • ICML 2018 • Dylan J. Foster, Alekh Agarwal, Miroslav Dudík, Haipeng Luo, Robert E. Schapire
A major challenge in contextual bandits is to design general-purpose algorithms that are both practically useful and theoretically well-founded.
no code implementations • 10 Jan 2018 • Chen-Yu Wei, Haipeng Luo
We develop a novel and generic algorithm for the adversarial multi-armed bandit problem (or more generally the combinatorial semi-bandit problem).
no code implementations • 5 Aug 2017 • Haipeng Luo, Chen-Yu Wei, Alekh Agarwal, John Langford
In this work, we develop several efficient contextual bandit algorithms for non-stationary environments by equipping existing methods for i.i.d.
1 code implementation • 19 Dec 2016 • Alekh Agarwal, Haipeng Luo, Behnam Neyshabur, Robert E. Schapire
We study the problem of combining multiple bandit algorithms (that is, online learning algorithms with partial feedback) with the goal of creating a master algorithm that performs almost as well as the best base algorithm if it were to be run on its own.
no code implementations • 5 Nov 2016 • Miroslav Dudík, Nika Haghtalab, Haipeng Luo, Robert E. Schapire, Vasilis Syrgkanis, Jennifer Wortman Vaughan
We consider the design of computationally efficient online learning algorithms in an adversarial setting in which the learner has access to an offline optimization oracle.
no code implementations • NeurIPS 2016 • Vasilis Syrgkanis, Haipeng Luo, Akshay Krishnamurthy, Robert E. Schapire
We give an oracle-based algorithm for the adversarial contextual bandit problem, where either contexts are drawn i.i.d.
no code implementations • NeurIPS 2016 • Haipeng Luo, Alekh Agarwal, Nicolo Cesa-Bianchi, John Langford
We propose Sketched Online Newton (SON), an online second order learning algorithm that enjoys substantially improved regret guarantees for ill-conditioned data.
no code implementations • 5 Feb 2016 • Elad Hazan, Haipeng Luo
The Frank-Wolfe optimization algorithm has recently regained popularity for machine learning applications due to its projection-free property and its ability to handle structured constraints.
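A minimal sketch of the projection-free property the entry highlights: over the probability simplex, Frank-Wolfe's linear minimization oracle reduces to a single argmin over coordinates, so every iterate stays feasible as a convex combination of vertices and no projection is ever computed. The quadratic objective and target `b` are illustrative choices:

```python
import numpy as np

def frank_wolfe_simplex(grad, x0, T):
    """Frank-Wolfe over the probability simplex: projection-free,
    since the linear minimization oracle is just an argmin coordinate."""
    x = x0.copy()
    for t in range(1, T + 1):
        g = grad(x)
        s = np.zeros_like(x)
        s[np.argmin(g)] = 1.0       # simplex vertex minimizing <g, s>
        gamma = 2.0 / (t + 2.0)     # standard Frank-Wolfe step size
        x = (1 - gamma) * x + gamma * s  # convex combination: always feasible
    return x

# Illustrative problem: minimize f(x) = 0.5*||x - b||^2 over the simplex,
# with b itself on the simplex, so the minimizer is b.
b = np.array([0.2, 0.5, 0.3])
x = frank_wolfe_simplex(lambda x: x - b, np.array([1.0, 0.0, 0.0]), T=2000)
```

The iterate converges at the standard $O(1/T)$ rate for smooth objectives while remaining exactly on the simplex throughout.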
no code implementations • NeurIPS 2015 • Vasilis Syrgkanis, Alekh Agarwal, Haipeng Luo, Robert E. Schapire
We show that natural classes of regularized learning algorithms with a form of recency bias achieve faster convergence rates to approximate efficiency and to coarse correlated equilibria in multiplayer normal form games.
no code implementations • NeurIPS 2015 • Alina Beygelzimer, Elad Hazan, Satyen Kale, Haipeng Luo
We extend the theory of boosting for regression problems to the online learning setting.
no code implementations • 20 Feb 2015 • Haipeng Luo, Robert E. Schapire
We study the classic online learning problem of predicting with expert advice, and propose a truly parameter-free and adaptive algorithm that achieves several objectives simultaneously without using any prior information.
no code implementations • 9 Feb 2015 • Alina Beygelzimer, Satyen Kale, Haipeng Luo
We study online boosting, the task of converting any weak online learner into a strong online learner.
no code implementations • 25 Nov 2014 • Haipeng Luo, Patrick Haffner, Jean-Francois Paiement
The growing amount of high dimensional data in different machine learning applications requires more efficient and scalable optimization algorithms.
no code implementations • NeurIPS 2014 • Haipeng Luo, Robert E. Schapire
Different online learning settings (Hedge, multi-armed bandit problems, and online convex optimization) are studied by converting them into various kinds of drifting games.
no code implementations • 31 Jul 2013 • Haipeng Luo, Robert E. Schapire
We apply a minimax analysis, beginning with the fixed horizon case, and then moving on to two unknown-horizon settings, one that assumes the horizon is chosen randomly according to some known distribution, and the other which allows the adversary full control over the horizon.