#1 Research Paper Explained: Mix & Match — Agent Curricula for Reinforcement Learning [arxiv]

The reinforcement learning techniques in common use today, based on gradient-based policy optimisation, are quick and give immediate results in less complex environments. But gradient-based policies are competitive, not collaborative. What if we need lasting results in more complex environments, with agents that have complex tasks to perform? There are many real-world environments that we do not have the ability to modify, and performing reinforcement learning directly on real-world tasks is very time consuming. This paper tries to solve that problem with the help of Curriculum Learning and Population Based Training.

Before You Start

Curriculum Learning [Paper]: design a sequence of tasks, using transfer learning, for an agent to train on, such that final performance or learning speed is improved.

Population Based Training for Neural Networks [DeepMind Blog]: PBT has the ability to modify the hyper-parameters of a network while it is training.

The Main Idea

The main idea behind the Mix & Match paper is to generate multiple variants of an RL agent, arranged according to training complexity, for tasks where we do not have the ability to modify the environment itself, and to use them as a single Mix & Match agent constructed from agents that vary structurally in their policy generation process.

MIX & MATCH Framework

Multiple agents arranged by learning complexity (shown above) are treated as one Mix & Match agent acting with a mixture of policies. The knowledge transfer (i.e. distillation) is done in such a way that the complex agents are matched to the simpler ones early on. The mixing coefficient is controlled so that ultimately the complex target agent is the one used to generate experience.

Method Details

Let's assume we are given a sequence of trainable agents (1 to K) with corresponding policies π_1, …, π_K, each parametrised by some θ_i ⊂ θ, which can share some parameters. The aim is to train π_K, while all remaining agents are there to induce faster/easier learning.

Let's introduce a categorical random variable c ∼ Cat(1, …, K | α), which will be used to select a policy at a given time, with probability mass function p(c = i) = α_i.

The point of Mix & Match is to allow curriculum learning, so the probability mass function (pmf) of c needs to change over time. Initially the pmf should have α_1 = 1, and near the end of training α_K = 1, thus allowing a curriculum of policies from the simple π_1 to the target π_K.

The agents should also be trained in a way that produces a lasting increase in performance and lets them share knowledge, rather than merely competing through their gradients. To address this we use a distillation-like cost D which aligns the policies with one another.

The final optimisation problem is just a weighted sum of the original RL loss L^RL, applied to the control policy π^mm, and the knowledge transfer loss:

L(θ) = L^RL(π^mm) + λ · L^M&M(θ)

Now, let's understand the Mix & Match architecture in steps →

Policy Mixing

Policy mixing is done by explicitly mixing the policies, π^mm(a|s) = Σ_i α_i π_i(a|s), rather than sampling c, for the sake of variance reduction.

Knowledge Transfer

For simplicity we consider the case of K = 2. Consider the problem of ensuring that the final policy (π_2) matches the simpler policy (π_1), while only having access to samples from the control policy (π^mm). As before, for simplicity we define our M&M loss over states (s ∈ S) from trajectories sampled directly from the control policy:

L^M&M(θ) = (1 − α) · E_{s ∼ π^mm} [ D_KL( π_1(·|s) ‖ π_2(·|s) ) ]

The 1 − α term is introduced so that the distillation cost disappears once we have switched fully to π_2.
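To make the mixing and the knowledge-transfer loss concrete, here is a minimal PyTorch sketch of the K = 2 case (my own illustration, not the paper's code). It assumes both policies are categorical over the same action set; `logits_simple` and `logits_target` stand in for the outputs of the π_1 and π_2 networks, and `alpha` is the single mixing weight.

```python
import torch
import torch.nn.functional as F

def mixture_policy(logits_simple, logits_target, alpha):
    """pi_mm(a|s) = (1 - alpha) * pi_1(a|s) + alpha * pi_2(a|s) (explicit mixing)."""
    p1 = F.softmax(logits_simple, dim=-1)
    p2 = F.softmax(logits_target, dim=-1)
    return (1.0 - alpha) * p1 + alpha * p2

def mm_distillation_loss(logits_simple, logits_target, alpha):
    """(1 - alpha) * E_s[ KL(pi_1(.|s) || pi_2(.|s)) ] over states from the control policy.

    The (1 - alpha) factor makes the cost vanish once the curriculum has switched
    fully to pi_2. pi_1 is detached here, i.e. treated as a fixed teacher; this is a
    common distillation choice and an assumption, not necessarily the paper's exact setup.
    """
    log_p1 = F.log_softmax(logits_simple, dim=-1).detach()
    log_p2 = F.log_softmax(logits_target, dim=-1)
    kl = (log_p1.exp() * (log_p1 - log_p2)).sum(dim=-1)  # KL(pi_1 || pi_2) per state
    return (1.0 - alpha) * kl.mean()

# Example: a batch of 5 states, 9 actions shared by both policies.
logits_1 = torch.randn(5, 9)
logits_2 = torch.randn(5, 9)
alpha = 0.3
pi_mm = mixture_policy(logits_1, logits_2, alpha)       # used as the control policy
loss = mm_distillation_loss(logits_1, logits_2, alpha)  # added to L^RL with weight lambda
```

The returned distillation term would simply be added to the usual RL loss on π^mm, weighted by λ, to form the full objective above.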
Adjusting α (alpha) Through Training

α, the probability mass function of c, is the variable used in the first equation above. An important component of the proposed method is how to set its values through time. For simplicity, let us again consider the case of K = 2, where one needs just a single α (as c now comes from a Bernoulli distribution), which we treat as a function of time t. Since α changes through time, one cannot use typical hyperparameter tuning techniques: the space of possible values is exponential in the number of timesteps (α = (α^(1), …, α^(T)) ∈ Δ_{k−1}^T, where Δ_k denotes a k-dimensional simplex).

Online hyperparameter tuning → To solve this issue we use Population Based Training (PBT).

Population Based Training and M&M

Population Based Training (PBT) keeps a population of agents, trained in parallel, in order to optimise hyperparameters through time. While training, the agents periodically query each other to check how well they are doing relative to the others. Badly performing agents copy the weights (neural network parameters) of stronger agents and perform local modifications of their hyperparameters. This ability of PBT to modify hyperparameters throughout a single training run makes it possible to discover powerful adaptive strategies, e.g. auto-tuned learning rate annealing schedules. In this way, poorly performing agents are used to explore the hyperparameter space.

So, we need to define two functions:

eval → measures how strong the current agent is
explore → defines how to perturb the hyperparameters

Note: keep in mind that the PBT agents here are the Mix & Match agents, each of which is already a mixture of constituent agents.

We then use one of the two schemes below, depending on the characteristics of the problem we are interested in.

1. If the model gains performance by switching from the simple to the more complex agent (a sketch of this scheme follows below):
a) eval: provide PBT with the performance (i.e. reward over the last k episodes) of the mixed policy.
b) explore: for α, randomly add or subtract a fixed value (truncating between 0 and 1).
Thus, once there is a significant benefit to switching to the more complex agent, PBT will do it automatically.

2. Often we want to switch from an unconstrained architecture to some specific, heavily constrained one, where there may not be an obvious performance benefit from switching, e.g. when training a multitask policy from constituent single-task policies. In this case we can make eval an independent evaluation job which only looks at the performance of the agent with α_K = 1. This way we directly optimise for the final performance of the model of interest, but at the cost of the additional evaluations needed for PBT.
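Here is a hedged sketch of how PBT could drive α under scheme 1. `Worker` stands for a hypothetical container holding one Mix & Match agent of the population; the attribute names (`recent_rewards`, `weights`, `alpha`), the perturbation size, and the selection cutoff are assumptions for illustration, not values from the paper.

```python
import copy
import random

EXPLORE_STEP = 0.05  # assumed fixed perturbation size for alpha

def eval_worker(worker):
    # Fitness of one population member: average reward of its control
    # (mixture) policy over recent episodes (the text uses the last 30).
    return sum(worker.recent_rewards) / max(len(worker.recent_rewards), 1)

def explore_alpha(alpha):
    # Randomly add or subtract a fixed value, truncating to [0, 1].
    step = EXPLORE_STEP if random.random() < 0.5 else -EXPLORE_STEP
    return min(1.0, max(0.0, alpha + step))

def pbt_step(population):
    # Poorly performing workers copy the weights of stronger ones (exploit),
    # then locally perturb their hyperparameters, here just alpha (explore).
    ranked = sorted(population, key=eval_worker, reverse=True)
    cutoff = max(1, len(ranked) // 4)
    for weak in ranked[-cutoff:]:
        strong = random.choice(ranked[:cutoff])
        weak.weights = copy.deepcopy(strong.weights)
        weak.alpha = explore_alpha(strong.alpha)
```

Under scheme 2, `eval_worker` would instead be an independent evaluation job that scores the agent with α_K = 1, so that PBT optimises the final constrained model directly.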
EXPERIMENTS

Let's test and analyse how the M&M method behaves in practice. We will consider 3 sets of reinforcement learning experiments: scaling to large and complex action spaces, curricula over agent architecture, and learning a multitask policy. In all experiments we initialise α around 0 and analyse its adaptation through time.

Note → even though the experimental sections use K = 2, the actual curriculum goes through potentially infinitely many agents, as a result of mixing between π_1 and π_2.

Unless otherwise stated, the eval function returns the averaged reward from the last 30 episodes of the control policy.

The Environment

The DeepMind Lab suite offers a range of challenging 3D, first-person-view tasks (see appendix) for RL agents. Agents perceive 96 × 72 pixel RGB observations at 60 fps and can move, rotate, jump and tag built-in bots.

1. Curricula Over Number of Actions Used

The action space is complex: it is represented as a 6-dimensional vector in which two action groups are very high resolution (rotation and looking up/down) and the remaining four are low resolution (moving forward, backward or not moving at all, shooting or not shooting, etc.).

Here, we use 9 actions to construct π_1, the simple policy (Small action space). Similarly to research in continuous control using diagonal Gaussian distributions, we use a factorised policy π_2(a_1, a_2, …, a_6 | s) := ∏_{j=1}^{6} π̂_j(a_j | s), which we refer to as the Big action space. In order to be able to mix these two policies, we map the actions of π_1 onto the corresponding ones in the action space of π_2. The mixing of values between the networks of the two agents is shown in the figure below. The paper also discusses Shared Head and Masked KL techniques, but both perform worse than M&M.

RESULTS

We see that the small action space leads to faster learning but hampers final performance compared to the big action space. Mix & Match applied to this setting gets the best of both worlds: it learns fast, and it surpasses the final performance of the big action space.

FIG: Training in action-space experiments.

When plotting α through time (Fig. 5, left) we see that the agent switches fully to the big action space early on, showing that the small action space was useful only in the initial phase of learning. This is further confirmed by looking at how varied the actions taken by the agent are through training. Fig. 5 (right) shows how the marginal distribution over actions evolves through time. We see that new actions are unlocked through training, and that the final distribution is more entropic than the initial one: the agent ends up using a richer set of actions.

2. Curricula Over Agent Architecture

Here the simple agent substitutes the LSTM with a linear projection from the processed convolutional signal. We share both the convolutional modules and the policy/value function projections (a rough sketch of this wiring appears at the end of this section). The experiments focus on various navigation tasks. On one hand, reactive policies (which can be represented solely by a feed-forward policy) should learn reasonably quickly to move around and explore; on the other hand, recurrent networks (which have memory) are needed to maximise final performance, by either learning to navigate new maze layouts (Explore Object Locations Small) or avoiding (seeking) previously explored unsuccessful (successful) paths through the maze.

RESULTS

FIG 6: M&M applied to the FF-to-LSTM transition leads to a significant improvement in final performance, although it does not learn quite as fast as its FF counterpart.

FIG 7 and FIG 8 lead to two observations:
→ The green curve switches to the LSTM late, due to the complexity of the level.
→ The model (blue curve) has the ability to switch back to the mixture policy if needed.
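Before moving on to the multitask experiments, here is the promised sketch of how such a two-core architecture could be wired up in PyTorch. It is my own illustration: the layer sizes, the exact wiring, and the choice to read the value from the recurrent core are assumptions, not details taken from the paper. The key idea it shows is a shared convolutional torso and shared policy/value projections, with a feed-forward core acting as π_1 and an LSTM core acting as π_2.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixMatchArchitecture(nn.Module):
    """Shared conv torso + shared policy/value heads; FF core (pi_1) vs LSTM core (pi_2)."""

    def __init__(self, num_actions, hidden=256):
        super().__init__()
        self.conv = nn.Sequential(                       # shared visual encoder
            nn.Conv2d(3, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.proj = nn.LazyLinear(hidden)                # linear projection of conv features
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)  # recurrent core (pi_2 only)
        self.policy = nn.Linear(hidden, num_actions)     # shared policy projection
        self.value = nn.Linear(hidden, 1)                # shared value projection

    def forward(self, obs, lstm_state, alpha):
        feat = F.relu(self.proj(self.conv(obs)))                    # (batch, hidden)
        out, lstm_state = self.lstm(feat.unsqueeze(1), lstm_state)  # one-step rollout
        out = out.squeeze(1)
        pi_ff = F.softmax(self.policy(feat), dim=-1)     # reactive policy, pi_1
        pi_lstm = F.softmax(self.policy(out), dim=-1)    # recurrent policy, pi_2
        pi_mm = (1.0 - alpha) * pi_ff + alpha * pi_lstm  # control policy
        return pi_mm, self.value(out), lstm_state        # value read from the LSTM core (assumed)

# Example: a batch of 4 RGB observations (channels-first), 9 actions, alpha = 0.2.
net = MixMatchArchitecture(num_actions=9)
obs = torch.randn(4, 3, 72, 96)
pi_mm, value, state = net(obs, lstm_state=None, alpha=0.2)
```

Because the policy projection is shared, both cores emit distributions over the same action set, so adjusting α smoothly interpolates between the reactive and the recurrent behaviour.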
3. Curricula for Multitask

As a final proof of concept, we consider the task of learning a single policy capable of solving multiple RL problems at the same time. The basic approach for this sort of task is to train a model on a mixture of environments, or equivalently to train a shared model in multiple environments in parallel. This type of training can suffer from two drawbacks:
→ It is heavily reward-scale dependent, and will be biased towards high-reward environments.
→ Environments that are easy to train on provide many updates for the model and consequently can also bias the solution towards themselves.

Let's take a look at the results (FIG 9). One of the environments, Explore Object Locations Small, has high rewards and a steep initial learning curve (lots of reward signal coming from gathering apples). In the remaining two problems training is harder: the agent is interacting with other bots as well as more complex mechanics (picking up bonuses, tagging, etc.). The multitask solution focuses on solving the navigation task, while performing comparatively poorly on the more challenging problems.

To apply M&M to this problem we construct one agent per environment (each acting as π_1 from the previous sections) and one centralised "multitask" agent (π_2 from the previous sections). Crucially, the agents share convolutional layers but have independent LSTMs. Training is done in a multitask way, but the control policy in each environment is again a mixture between the task-specific π_i (the specialist) and π_mt (the centralised agent).

Since it is no longer automatically beneficial to switch to the centralised policy, we use the performance of π_mt (i.e. the central policy) as the optimisation criterion (eval) for PBT, instead of the control policy. We evaluate both the performance of the mixture and of the centralised agent independently.

Fig. 9 shows the per-task performance of the proposed method. One notices much more uniform performance: the M&M agent learns to play well in both challenging laser tag environments, while slightly sacrificing performance on the single navigation task. One of the reasons for this success is that knowledge transfer is done in policy space, which is invariant to reward scaling. While the agent can still focus purely on high-reward environments once it has switched to using only the central policy, this inductive bias during training with M&M ensures a much higher minimum score.

CONCLUSION

A collection of agents is bound together into a single composite whole using a mixture policy. Over time the component weightings of this mixture are adapted so that at the end of training we are left with a single active component consisting of the most complex agent. This gives improved and accelerated performance in complex environments. Information can be shared between the components via shared experience or shared architectural elements, and also through a distillation-like KL-matching loss.

I will be posting 2 posts per week, so don't miss the tutorials. Follow me on Medium and Twitter to see similar posts.

If you have any comments or questions, write them in the comments. Clap it (multiple times)! Share it! Follow me! Happy to be helpful. Kudos…

Previous stories you will love:
→ 50 TensorFlow.js API Explained in 5 Minutes | TensorFlow.js Cheetsheet (towardsdatascience.com)
→ TensorFlow 1.9 has Arrived! (towardsdatascience.com)
→ TensorFlow on Mobile: Tutorial. On Android and iOS (towardsdatascience.com)
→ Activation Functions: Neural Networks. Sigmoid, tanh, Softmax, ReLU, Leaky ReLU EXPLAINED!!! (towardsdatascience.com)