1 code implementation • 25 Jul 2020 • Zehong Cao, KaiChiu Wong, Chin-Teng Lin
Current approaches to reward learning from human preferences can solve complex reinforcement learning (RL) tasks without access to a reward function by defining a single fixed preference between pairs of trajectory segments.
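As a rough illustration of what a single fixed preference between a pair of trajectory segments could look like in code, the sketch below scores two segments with a Bradley-Terry style model and a cross-entropy loss. The source does not say which preference model is used, so this formulation, along with the hypothetical names `preference_probability` and `preference_loss`, is an assumption made for illustration only.

```python
import math


def preference_probability(rewards_a, rewards_b):
    """Probability that segment A is preferred over segment B.

    Assumption: a Bradley-Terry style model over summed predicted rewards,
    P(A > B) = exp(sum_a) / (exp(sum_a) + exp(sum_b)).
    """
    diff = sum(rewards_b) - sum(rewards_a)
    # Logistic function of the return difference, written so that
    # math.exp is only ever called on a non-positive argument.
    if diff >= 0:
        e = math.exp(-diff)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(diff))


def preference_loss(rewards_a, rewards_b, label):
    """Cross-entropy between the predicted and the recorded preference.

    label is 1.0 if the annotator preferred segment A, 0.0 if segment B.
    """
    p = preference_probability(rewards_a, rewards_b)
    eps = 1e-12  # guards against log(0)
    return -(label * math.log(p + eps) + (1.0 - label) * math.log(1.0 - p + eps))


# Hypothetical per-step reward predictions for two trajectory segments.
segment_a = [0.5, 1.0, 0.5]
segment_b = [0.0, 0.2, 0.1]
print(preference_probability(segment_a, segment_b))
print(preference_loss(segment_a, segment_b, label=1.0))
```

In a full pipeline the per-step rewards would come from a learned reward model, and this loss would be minimized over many labelled segment pairs; here the rewards are fixed numbers so the example stays self-contained.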