no code implementations • 17 Nov 2021 • Yanqiu Wu, Xinyue Chen, Che Wang, Yiming Zhang, Keith W. Ross
In particular, Truncated Quantile Critics (TQC) achieves state-of-the-art asymptotic training performance on the MuJoCo benchmark with a distributional representation of critics, while Randomized Ensembled Double Q-Learning (REDQ) achieves high sample efficiency, competitive with state-of-the-art model-based methods, by using a high update-to-data ratio and target randomization.
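As a concrete anchor for the entry above, here is a minimal PyTorch sketch of a REDQ-style critic update: an ensemble of Q-networks trained many times per environment step (a high update-to-data ratio), with each target taken as the minimum over a randomly drawn subset of target critics. It assumes a SAC-style stochastic `policy` callable and standard replay-batch tensors; network sizes and hyperparameters are illustrative, not the papers' exact code.

```python
import random
import torch
import torch.nn as nn

N_CRITICS, SUBSET_SIZE, UTD_RATIO, GAMMA, ALPHA = 10, 2, 20, 0.99, 0.2

def make_q(obs_dim, act_dim):
    return nn.Sequential(nn.Linear(obs_dim + act_dim, 256), nn.ReLU(),
                         nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 1))

obs_dim, act_dim = 17, 6  # illustrative MuJoCo-like dimensions
critics = [make_q(obs_dim, act_dim) for _ in range(N_CRITICS)]
targets = [make_q(obs_dim, act_dim) for _ in range(N_CRITICS)]
for q, q_targ in zip(critics, targets):
    q_targ.load_state_dict(q.state_dict())
optimizers = [torch.optim.Adam(q.parameters(), lr=3e-4) for q in critics]

def redq_critic_update(batch, policy):
    """One gradient step on every critic; run UTD_RATIO times per env step."""
    obs, act, rew, next_obs, done = batch
    with torch.no_grad():
        next_act, next_logp = policy(next_obs)  # assumed SAC-style policy
        # Target randomization: minimize over a random subset of the ensemble.
        idx = random.sample(range(N_CRITICS), SUBSET_SIZE)
        q_next = torch.min(torch.stack(
            [targets[i](torch.cat([next_obs, next_act], dim=-1)) for i in idx]),
            dim=0).values
        backup = rew + GAMMA * (1.0 - done) * (q_next - ALPHA * next_logp)
    for q, opt in zip(critics, optimizers):
        loss = ((q(torch.cat([obs, act], dim=-1)) - backup) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
```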
no code implementations • 14 Jun 2021 • Yiming Zhang, Keith W. Ross
Based on this bound, we develop an iterative procedure that produces a sequence of monotonically improving policies under the average-reward criterion.
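As a hedged illustration of what changes in the average-reward setting, the sketch below computes differential (average-reward) advantages, in which an estimate of the policy's average reward per step takes the place of a discount factor; the function name, the crude batch estimate of that average, and the dummy data are illustrative assumptions, not the paper's code.

```python
import numpy as np

def differential_advantages(rewards, values, next_values):
    """Average-reward one-step advantages: A = r - rho + V(s') - V(s).

    No discount factor appears; rho is the average reward per time step."""
    rho = rewards.mean()  # crude in-batch estimate of the average reward
    return rewards - rho + next_values - values

# Illustrative usage with dummy data:
rew = np.random.randn(5)
v, v_next = np.random.randn(5), np.random.randn(5)
print(differential_advantages(rew, v, v_next))
```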
no code implementations • 1 Jan 2021 • Yiming Zhang, Keith W. Ross
In continuing control tasks, an agent's average reward per time step is a more natural performance measure than the commonly used discounted return, as it better captures the agent's long-term behavior.
2 code implementations • NeurIPS 2020 • Yiming Zhang, Quan Vuong, Keith W. Ross
We propose a novel approach called First Order Constrained Optimization in Policy Space (FOCOPS), which maximizes an agent's overall reward while ensuring that the agent satisfies a set of cost constraints.
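A minimal sketch of a FOCOPS-style loss, following the two-step idea the name points to: the constrained problem solved in nonparametric policy space reweights the old policy by exponentiated reward-minus-cost advantages, and the parametric policy is then pulled toward that target with a purely first-order objective. The temperature `lam`, the multiplier `nu`, its projected-gradient update, and the Gaussian-policy setup are illustrative assumptions, not the paper's exact losses.

```python
import torch
from torch.distributions import Normal, kl_divergence

def focops_loss(dist, old_dist, act, adv, cost_adv, lam=1.5, nu=0.1):
    """First-order step toward pi*(a|s) ~ pi_old(a|s) * exp((A - nu*A_C)/lam)."""
    logp = dist.log_prob(act).sum(-1)
    old_logp = old_dist.log_prob(act).sum(-1).detach()
    ratio = torch.exp(logp - old_logp)
    kl = kl_divergence(dist, old_dist).sum(-1)  # stay close to the old policy
    return (kl - (1.0 / lam) * ratio * (adv - nu * cost_adv)).mean()

def update_nu(nu, avg_cost, cost_limit, lr=0.01, nu_max=2.0):
    """Projected gradient ascent on the cost multiplier nu."""
    return min(max(nu + lr * (avg_cost - cost_limit), 0.0), nu_max)

# Illustrative usage with dummy Gaussian policies over a 6-dim action:
old_dist = Normal(torch.zeros(32, 6), torch.ones(32, 6))
dist = Normal(torch.zeros(32, 6, requires_grad=True), torch.ones(32, 6))
act, adv, cost_adv = old_dist.sample(), torch.randn(32), torch.randn(32)
focops_loss(dist, old_dist, act, adv, cost_adv).backward()
```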
1 code implementation • ICLR 2019 • Quan Vuong, Yiming Zhang, Keith W. Ross
We show how both the Natural Policy Gradient / Trust Region Policy Optimization (NPG/TRPO) problems and the Proximal Policy Optimization (PPO) problem can be addressed by this methodology.
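For a sense of how a trust-region problem can be handled this way, here is a hedged sketch of the supervised step: the constrained problem's nonparametric solution reweights the old policy by exponentiated advantages, and the parametric policy is fit to it by weighted maximum likelihood. The temperature `lam`, the self-normalization, and the dummy Gaussian setup are illustrative assumptions, not the paper's exact losses.

```python
import torch
from torch.distributions import Normal

def supervised_projection_loss(dist, act, adv, lam=1.0):
    """Fit pi_theta to pi*(a|s) ~ pi_old(a|s) * exp(A(s, a)/lam) by
    weighted maximum likelihood over actions sampled from pi_old."""
    weights = torch.exp(adv / lam)
    weights = (weights / weights.mean()).detach()  # self-normalized weights
    return -(weights * dist.log_prob(act).sum(-1)).mean()

# Illustrative usage with a dummy Gaussian policy:
mean = torch.zeros(32, 6, requires_grad=True)
dist = Normal(mean, torch.ones(32, 6))
act, adv = torch.randn(32, 6), torch.randn(32)
supervised_projection_loss(dist, act, adv).backward()
```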
no code implementations • 2 Jun 2018 • Yiming Zhang, Quan Ho Vuong, Kenny Song, Xiao-Yue Gong, Keith W. Ross
We develop several novel unbiased estimators for the entropy bonus and its gradient.
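One standard construction of such estimators, offered here as a hedged sketch rather than the paper's exact estimators: for actions sampled from the policy itself, -log pi(a|s) is an unbiased one-sample estimate of the entropy, and a detached-weight surrogate yields an unbiased estimate of the entropy gradient.

```python
import torch
from torch.distributions import Categorical

def entropy_bonus_terms(logp):
    """logp: log pi(a|s) for actions a ~ pi(.|s).

    E[-log pi(a|s)] = H(pi), so -logp is an unbiased entropy estimate; and
    since grad H = -E[log pi(a|s) * grad log pi(a|s)], the surrogate below
    has an unbiased gradient."""
    entropy_estimate = -logp.detach().mean()
    surrogate = -(logp.detach() * logp).mean()
    return entropy_estimate, surrogate

# Sanity check against the exact entropy of a small categorical policy:
logits = torch.randn(5, requires_grad=True)
dist = Categorical(logits=logits)
a = dist.sample((10000,))
est, surr = entropy_bonus_terms(dist.log_prob(a))
print(float(est), float(dist.entropy()))  # the two should be close
```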
no code implementations • ICLR 2018 • Quan Ho Vuong, Yiming Zhang, Kenny Song, Xiao-Yue Gong, Keith W. Ross
In the case of high-dimensional action spaces, calculating the entropy and its gradient requires enumerating every action in the action space and running a forward and backward pass for each one, which may be computationally infeasible.
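To make the bottleneck concrete, the toy numbers below (illustrative assumptions) show how quickly exact enumeration grows with the number of action dimensions, and compute a brute-force exact entropy for a joint action space small enough to enumerate.

```python
import numpy as np

k, d = 10, 8                                      # 10 choices per sub-action, 8 sub-actions
print(f"joint actions to enumerate: {k ** d:,}")  # 100,000,000 forward passes in general

# Brute-force exact entropy for a small joint action space (4**4 = 256 actions):
logits = np.random.randn(4 ** 4)                  # one logit per joint action
p = np.exp(logits - logits.max())
p /= p.sum()                                      # softmax over every joint action
print(f"exact entropy over {p.size} actions: {-(p * np.log(p)).sum():.3f}")
```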