To the best of the authors' knowledge, DeepThermal is the first AI application to solve a real-world, complex, mission-critical control task using the offline RL approach.

Offline reinforcement learning approaches can generally be divided into proximal and uncertainty-aware methods.

Instrumental variables (IVs), in the context of RL, are variables whose influence on the state variables is mediated entirely through the action.
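A hedged formalization of this condition (the notation is assumed here, not taken from the source): an instrument $z_t$ may affect the chosen action, but given that action it carries no further information about the next state.

```latex
% z_t is an instrument if, conditional on the action a_t and state s_t,
% it has no direct effect on the state transition:
P(s_{t+1} \mid s_t, a_t, z_t) = P(s_{t+1} \mid s_t, a_t)
```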

We consider offline reinforcement learning (RL) with heterogeneous agents under severe data scarcity, i.e., we observe only a single historical trajectory for every agent, collected under an unknown, potentially sub-optimal policy.

QWR is an extension of Advantage Weighted Regression (AWR), an off-policy actor-critic algorithm that performs very well on continuous control tasks, including in the offline setting, but suffers from low sample efficiency and struggles with high-dimensional observation spaces.
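The core of AWR is a weighted regression step: the policy is fit to the dataset's actions, with each sample weighted by the exponentiated advantage. A minimal sketch of that weighting, assuming Monte Carlo returns and a learned value baseline (the names `beta` and `max_weight` are illustrative, not from the source):

```python
import numpy as np

def awr_weights(returns, values, beta=1.0, max_weight=20.0):
    """Per-sample regression weights from advantage estimates (AWR-style).

    Higher-advantage actions receive exponentially larger weight in the
    subsequent supervised regression onto the dataset's actions.
    """
    advantages = returns - values           # A(s, a) ~ R - V(s)
    weights = np.exp(advantages / beta)     # exponential advantage weighting
    return np.minimum(weights, max_weight)  # clip for numerical stability

# Example: three transitions with different estimated advantages.
returns = np.array([1.0, 0.5, 2.0])
values = np.array([0.8, 0.9, 1.5])
w = awr_weights(returns, values)  # largest weight on the highest-advantage sample
```

The temperature `beta` trades off greediness against staying close to the behavior policy, which is why this family of methods transfers naturally to the offline setting.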

The recent success of supervised learning methods on ever larger offline datasets has spurred interest in the reinforcement learning (RL) field to investigate whether the same paradigms can be translated to RL algorithms.

Our main result shows that OPDVR provably identifies an $\epsilon$-optimal policy with $\widetilde{O}(H^2/d_m\epsilon^2)$ episodes of offline data in the finite-horizon stationary transition setting, where $H$ is the horizon length and $d_m$ is the minimal marginal state-action distribution induced by the behavior policy.

These errors can be compounded by bootstrapping when the function approximator overestimates, causing the value function to *grow unbounded* and thereby crippling learning.
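The overestimation mechanism is easy to demonstrate in isolation (this toy example is illustrative, not from the source): even when all actions have the same true value, taking the max over noisy Q-estimates yields a bootstrap target that is biased upward, since E[max(Q + noise)] >= max(E[Q + noise]).

```python
import numpy as np

rng = np.random.default_rng(0)

# Ten actions, all with true value zero; estimates are corrupted by noise.
true_q = np.zeros(10)
noise = rng.normal(0.0, 1.0, size=(10_000, 10))
estimates = true_q + noise

# Bootstrapped greedy target: max over actions for each noisy estimate.
greedy_targets = estimates.max(axis=1)
bias = greedy_targets.mean()  # positive, despite all true values being 0
```

When such biased targets are fed back into the value function through repeated bootstrapping, the bias can accumulate rather than average out.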

As it turns out, fine-tuning offline RL agents is a non-trivial challenge, due to distribution shift – the agent encounters out-of-distribution samples during online interaction, which may cause bootstrapping error in Q-learning and instability during fine-tuning.

Learning policies from fixed offline datasets is a key challenge to scale up reinforcement learning (RL) algorithms towards practical applications.