Bi-linear Value Networks for Multi-goal Reinforcement Learning
Universal value functions are used to score the long-term utility of actions for achieving a goal from the current state. In contrast to prior methods that learn a monolithic function to approximate the value, we propose a bi-linear decomposition of the value function. The first component, akin to a global plan, models how the state should be changed to reach the goal. The second component, akin to a local controller, selects the optimal action to actualize the desired change in state. We learn both components simultaneously. This decomposition enables the global and local components to make efficient use of interaction data and to generalize independently. The consequence is superior overall generalization and performance of our system on a wide range of challenging goal-conditioned tasks in comparison to the current state-of-the-art.
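The decomposition described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: all dimensions, the random linear maps standing in for learned networks, and the function names (`f`, `phi`, `q_value`) are hypothetical, and the assumed form is a bi-linear value Q(s, a, g) = f(s, g) · phi(s, a), where f plays the role of the global plan and phi the local controller.

```python
# Hedged sketch of a bi-linear goal-conditioned value function:
#   Q(s, a, g) = f(s, g) . phi(s, a)
# f  : global component, scores the desired change toward the goal
# phi: local component, scores the effect of each candidate action
import numpy as np

rng = np.random.default_rng(0)

STATE_DIM, ACTION_DIM, GOAL_DIM, EMBED_DIM = 4, 2, 4, 8

# Hypothetical random linear maps standing in for learned networks.
W_f = rng.standard_normal((STATE_DIM + GOAL_DIM, EMBED_DIM))
W_phi = rng.standard_normal((STATE_DIM + ACTION_DIM, EMBED_DIM))

def f(state, goal):
    """Global component: embed (state, goal) into a desired-change code."""
    return np.concatenate([state, goal]) @ W_f

def phi(state, action):
    """Local component: embed (state, action) into an effect-of-action code."""
    return np.concatenate([state, action]) @ W_phi

def q_value(state, action, goal):
    """Bi-linear value: inner product of the two component embeddings."""
    return float(f(state, goal) @ phi(state, action))

s = rng.standard_normal(STATE_DIM)
g = rng.standard_normal(GOAL_DIM)
a = rng.standard_normal(ACTION_DIM)
print(q_value(s, a, g))
```

Because the value factorizes, each component can in principle be trained and generalized somewhat independently: f only sees (state, goal) pairs, while phi only sees (state, action) pairs.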