Towards Combining On-Off-Policy Methods for Real-World Applications

In this paper, we point out a fundamental property of the reinforcement learning objective that lets us reformulate the policy gradient objective into a perceptron-like loss function, removing the need to distinguish between on- and off-policy training. Namely, we posit that it is sufficient to update a policy $\pi$ only for cases that satisfy the condition $A(\frac{\pi}{\mu}-1)\leq 0$, where $A$ is the advantage and $\mu$ is another policy...
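
The sketch below is not the authors' code; it is a minimal illustration, under our own assumptions, of how the stated condition $A(\frac{\pi}{\mu}-1)\leq 0$ could gate a policy-gradient update so that only samples satisfying it contribute gradient, which is what makes the loss perceptron-like. The function and argument names (`perceptron_like_pg_loss`, `log_pi`, `log_mu`, `advantage`) are illustrative and do not appear in the paper.

```python
# Minimal sketch (not the paper's implementation) of a perceptron-like
# policy-gradient loss gated by the condition A * (pi/mu - 1) <= 0.
import torch


def perceptron_like_pg_loss(log_pi, log_mu, advantage):
    """Surrogate loss that only updates on samples satisfying the condition.

    log_pi:    log-probabilities of the taken actions under the current policy pi
    log_mu:    log-probabilities of the same actions under the behaviour policy mu
    advantage: advantage estimates A for the taken actions
    """
    ratio = torch.exp(log_pi - log_mu)  # importance ratio pi / mu
    # Samples where A * (pi/mu - 1) <= 0 are "active" and receive gradient;
    # the mask is detached so the gating itself is not differentiated.
    active = (advantage * (ratio - 1.0) <= 0.0).float().detach()
    # Standard importance-weighted policy-gradient surrogate, restricted to
    # the active samples; inactive samples contribute nothing.
    surrogate = advantage.detach() * ratio
    return -(active * surrogate).mean()
```

Under these assumptions, samples whose ratio has already moved in the direction favoured by the advantage are simply masked out, rather than clipped or reweighted as in PPO or V-trace.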

Methods used in the Paper


METHOD                    TYPE
Sigmoid Activation        Activation Functions
Tanh Activation           Activation Functions
V-trace                   Value Function Estimation
Experience Replay         Replay Memory
Entropy Regularization    Regularization
Residual Connection       Skip Connections
Gradient Clipping         Optimization
RMSProp                   Stochastic Optimization
ReLU                      Activation Functions
Max Pooling               Pooling Operations
Convolution               Convolutions
LSTM                      Recurrent Neural Networks
IMPALA                    Policy Gradient Methods
PPO                       Policy Gradient Methods