Analyzing Policy Distillation on Multi-Task Learning and Meta-Reinforcement Learning in Meta-World

Policy distillation partitions a Markov Decision Process into different subsections and learns expert policies in each individual partition before combining them into a single policy for the entire space. Much as a sports team has different positions that each contribute their own abilities to the team, policy distillation leverages the structure of a Markov Decision Process by first learning partition-specific experts that do not need to generalize widely. When combined into one global policy, the experts each contribute the features learned from their partitions, and depending on which part of the state space the global policy faces, it can draw on the features gained from the local policy for that partition.

Meta-reinforcement learning and multi-task learning are closely intertwined fields. While meta-reinforcement learning aims to quickly solve new tasks based on prior experience, multi-task learning focuses on the ability of an algorithm to generalize to a wide distribution of tasks at the same time. Nevertheless, successful meta-learning is typically correlated with better performance on multi-task learning, and vice versa. An agent that can quickly adapt to a new task is, by definition, better at learning that new task; similarly, an agent that has generalized to many tasks is likely to learn more quickly when presented with a new but related task. Because both meta-learning and multi-task learning are composed of many individual tasks, they are naturally amenable to partitioning. Policy distillation has shown promise in multi-task learning, but the results are limited and have not been extensively studied.

We explore the application of a policy distillation algorithm, Divide-and-Conquer (DnC), to the Meta-World benchmark. DnC uses a context to represent information about each partition of the state space. Based on these contexts, local policies are trained with KL divergence constraints that keep them similar to one another, and they are then combined into a single global policy under another KL divergence constraint. Meta-World is a new benchmark for multi-task learning and meta-learning. We analyze DnC's performance on both its meta-learning (ML) and multi-task learning (MT) benchmarks, using Trust-Region Policy Optimization (TRPO) as the baseline. For the ML benchmark, we partition the state space by the separate tasks: during meta-training, the training tasks (without the test tasks) serve as DnC's partitions, and once we have the final global policy from meta-training, we apply it to the test tasks to determine final rewards and success rates. For the MT benchmark, we again partition the state space by task, but there are no held-out tasks: DnC trains on all of the tasks and is tested on the same tasks. Each individual task also has variable goal states, so the local policies must learn to adapt to these variations, and the global policy must not only solve the distinct training tasks but also adapt to the different goal states within each task.

We find that DnC matches our baseline, TRPO, on the meta-learning benchmark. When we partition the state space into the individual tasks, the local policies learn to solve each of their tasks with success rates of around 4-5%, and the global policy composed of these expert policies attains the same performance and success rate as the local policies.
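
As a rough illustration of the KL-based coupling described above, the sketch below shows one way the two constraints could be written in PyTorch: a pairwise KL penalty that keeps the context-specific experts close to one another during training, and a distillation KL that fits a single global policy to the experts on their own partitions. This is a minimal sketch of the idea, not the exact DnC objective; the names `GaussianPolicy`, `dnc_penalty`, and `distill_to_global` are hypothetical.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal, kl_divergence

class GaussianPolicy(nn.Module):
    """Hypothetical diagonal-Gaussian policy: maps states to an action distribution."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.mean = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(), nn.Linear(hidden, action_dim)
        )
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, states):
        return Normal(self.mean(states), self.log_std.exp())

def dnc_penalty(local_policies, states_per_context, alpha=0.1):
    """Pairwise KL penalty keeping the context-specific experts similar to one another.

    local_policies: one policy per partition (context) of the state space.
    states_per_context: one batch of states per context.
    """
    penalty = 0.0
    k = len(local_policies)
    for i in range(k):
        dist_i = local_policies[i](states_per_context[i])
        for j in range(k):
            if i == j:
                continue
            dist_j = local_policies[j](states_per_context[i])
            penalty = penalty + kl_divergence(dist_i, dist_j).mean()
    return alpha * penalty / (k * (k - 1))

def distill_to_global(global_policy, local_policies, states_per_context):
    """Distillation loss: the global policy matches each expert on its own partition."""
    loss = 0.0
    for pi_local, states in zip(local_policies, states_per_context):
        with torch.no_grad():
            target = pi_local(states)
        loss = loss + kl_divergence(target, global_policy(states)).mean()
    return loss / len(local_policies)
```

In DnC itself, the pairwise penalty is added to the local policies' TRPO surrogate objective and distillation into the global policy alternates with local training; the sketch only shows the two KL terms.
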
On the multi-task learning benchmark, DnC achieves success rates of around 65%. We believe that because DnC is a policy distillation algorithm, and because the multi-task setting uses the same tasks at train and test time, DnC can memorize each of the individual tasks and perform well on all of them at test time. With meta-learning, by contrast, DnC must adapt to new tasks at test time, which is more difficult, and its performance is therefore not nearly as good.
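
For concreteness, the snippet below sketches how the task-level partitions for the two settings might be constructed, assuming the publicly released `metaworld` package API (ML10/MT10 benchmarks exposing `train_classes`, `train_tasks`, and `test_classes`); attribute names may differ between versions, and the helper `make_partitions` is hypothetical.

```python
import random
import metaworld

def make_partitions(benchmark):
    """Build one environment per training task family: one DnC partition each."""
    partitions = {}
    for name, env_cls in benchmark.train_classes.items():
        env = env_cls()
        # Every task family ships many goal variations; sample one to instantiate.
        tasks = [t for t in benchmark.train_tasks if t.env_name == name]
        env.set_task(random.choice(tasks))
        partitions[name] = env
    return partitions

# Meta-learning: training partitions exclude the held-out test tasks.
ml10 = metaworld.ML10()
ml_partitions = make_partitions(ml10)      # local DnC policies train here
ml_test_tasks = list(ml10.test_classes)    # final global policy is evaluated here

# Multi-task learning: no held-out tasks; train and test task sets coincide.
mt10 = metaworld.MT10()
mt_partitions = make_partitions(mt10)
```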


Results from the Paper


| Task          | Dataset | Model | Metric Name             | Metric Value | Global Rank |
|---------------|---------|-------|--------------------------|--------------|-------------|
| Meta-Learning | ML10    | DnC   | Meta-test success rate   | 5.4%         | #3          |
