“Reinforcement Learning is simply the Science of Decision Making. That’s what makes it so General.” — David Silver

Abstract

You might have observed a level of saturation in Machine Learning recently. Well, that’s actually saturation in ‘Supervised Learning’ (poor Kaggle). Most of us don’t know any learning algorithm other than Back-Propagation. There are a few recent ones like ‘Equilibrium Propagation’ and ‘Synthetic Gradients’, but they more-or-less fall under a similar paradigm as Back-Propagation. Meanwhile, Unsupervised Learning and Reinforcement Learning remain unscratched (comparatively, in the mainstream at least). In this blog we’ll be diving into Reinforcement Learning, or as I like to call it, ‘Stupidity-followed-by-Regret’ or ‘What-If’ learning. Those are actually pretty accurate names.

Intro

Reinforcement Learning is interaction-based learning in a ‘cause-effect’ environment. The effect or consequence is based upon the actions, and the cause is the goal to be achieved. Much like talking to your crush. The goal is to impress her, but every word you say will have a consequence. She’ll either: find you charming, or randomly remember she has to be somewhere. #BelieveMeIAmAGuy #IntrovertProblems

RL is goal-oriented learning, where the learner (referred to as the ‘agent’ in RL) is not dictated what action to take; rather, it is encouraged to explore and discover the best action to yield the best results. Vaguely speaking, exploration is based upon a trial-and-error approach.

In RL, your present action affects not only your immediate reward (getting that kiss) but also your future rewards (her agreeing to a second date). So the agent might opt for higher future rewards instead of a high current reward.

Remember the following statement; you’ll understand it better later: RL is not defined by characterising a learning method, but by characterising the learning problem itself.

Agent

The learner in RL is referred to as the agent.
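The agent-environment interaction loop above can be sketched in a few lines of Python. This is purely my own illustrative toy (the `CoinFlipEnv` class and its `step` method are made up for this sketch, not any real RL library API): the agent picks an action, and the environment responds with a reward.

```python
import random

random.seed(0)  # for reproducibility of this toy run

class CoinFlipEnv:
    """Toy 'cause-effect' environment: guess a coin flip, reward 1 if right.
    Entirely illustrative -- not a real RL library API."""
    def step(self, action):
        outcome = random.choice(["heads", "tails"])
        return 1 if action == outcome else 0  # consequence of the action

env = CoinFlipEnv()
total_reward = 0
for _ in range(100):
    action = random.choice(["heads", "tails"])  # the agent's decision
    total_reward += env.step(action)            # reward from the environment
print(total_reward)  # roughly 50 on average for random guessing
```

A real RL agent would use those rewards to change how it picks actions; here the point is only the interaction loop itself: sense, act, receive reward.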
An ‘agent’ has a sense of its environment and is able to take actions, in the form of interactions with the environment, in order to maximise the reward received from the environment.

Q) Who is a supervisor?
Ans) In Supervised Learning, the training set supervises the learning of the learner. Hence it acts like a supervisor.

Q) How and when is an RL agent better than a supervisor?
Ans) In uncharted territory, a supervisor (your ever-single friend) does not have much idea about the correct actions (steps to impress your crush). In this case, much better learning can be done from the agent’s own experiences.

Exploration-Exploitation Trade-Off

Obviously there is a trade-off; there is always a trade-off! Supervised Learning has the Bias-Variance trade-off, and RL has the Exploration-Exploitation trade-off. To obtain high reward, the agent can either:
prefer actions that were fruitful in the past
explore the possibility of better actions

Let’s break it down:
Exploitation: use the pre-learned knowledge to obtain a GOOD reward.
Exploration: try something new to obtain either a GREAT or a WORSE reward.

In stochastic tasks, an action has to be tried multiple times to gain a reliable estimate of its reward. Yet, this can’t guarantee finding the best possible action. The agent can’t explore and exploit at the same step, hence the trade-off. Information gained during exploration can be exploited multiple times during future steps.

Extra

At times RL involves supervised learning. This is done to determine which capabilities are critical for achieving rewards. Example: Evolution makes sure you know how to breathe; otherwise, no matter how awesome you are, you won’t be able to achieve anything. Growing up, whether you achieve a Nobel Prize or keep playing with fidget spinners is not the concern of the supervised part. These algorithms are categorised as Deep Reinforcement Learning. More on that later; for now just have a look at how well they can perform!
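The exploration-exploitation trade-off can be made concrete with a classic epsilon-greedy scheme on a multi-armed bandit. This is a minimal sketch of my own (the function and parameter names like `eps` and `true_means` are mine, not from the original post): with probability `eps` the agent explores a random action; otherwise it exploits the action with the best reward estimate so far. Note how each action must be tried repeatedly to get a reliable estimate of its stochastic reward, and how knowledge gained while exploring is exploited on later steps.

```python
import random

def epsilon_greedy_bandit(true_means, eps=0.1, steps=1000, seed=0):
    """Epsilon-greedy on a multi-armed bandit: explore with probability eps,
    otherwise exploit the arm with the best estimated reward so far."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms        # how many times each arm was tried
    estimates = [0.0] * n_arms   # running average reward per arm
    total = 0.0
    for _ in range(steps):
        if rng.random() < eps:   # EXPLORE: try something new
            arm = rng.randrange(n_arms)
        else:                    # EXPLOIT: use pre-learned knowledge
            arm = max(range(n_arms), key=lambda a: estimates[a])
        reward = true_means[arm] + rng.gauss(0, 1)  # noisy (stochastic) reward
        counts[arm] += 1
        # incremental average: exploration info is reused on future steps
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total += reward
    return estimates, total

estimates, total = epsilon_greedy_bandit([0.2, 0.5, 0.9])
print(estimates)  # the estimate for the last arm should end up highest
```

With `eps=0` the agent can get stuck on the first arm that looked decent; with `eps=1` it never cashes in on what it learned. The trade-off is picking something in between.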
Fundamentals of RL

RL has 4 fundamental concepts:
Policy
Reward Function
Value Function
Model (of the Environment) [ideally]

Policy

It defines the behaviour of the agent at any given time. Roughly, a policy is a mapping from currently known states to the actions to be taken when in those states. In terms of psychology, it corresponds to stimulus-response rules. A policy can be in the form of a look-up table or an extensive search process.

Reward & Value

Reward is the gain that indicates what is a good action in an immediate sense. Value is the cumulative gain, an estimate of how good an action will be in the long run. Example: Eating cake will make me happy right now. Though in the long run, cake is bad for my health.

The Reward System combines reward and value under a single concept and conveys:
how rewarding an action is w.r.t. rewards
hence also how desirable a state is

In a biological sense it refers to pleasure and pain. In a relationship sense it refers to her saying “you are looking cute” and “you should shave that beard off” (it still hurts).

Mostly, the reward system is unaltered by the agent, though it may serve as a basis for altering the policy. Example: Falling off the cycle hurts, so it is very low-rewarding. I can’t alter the reward system to take pleasure in that pain. BUT I can alter the way I ride the cycle so I don’t fall again.

Q) Why is our main concern the Value and not the Reward?
Ans) Reward only makes sure that the immediate gain is maximal; future gains could be way worse. Value makes sure that we have the maximum overall gain.

Q) What’s the problem then?
Ans) Usually immediate gains have more surety than future gains.

Q) What are value functions?
Ans) As the name suggests, value functions are functions used to compute the value of an action. The tricky part is how they do it. These usually use complex inference-based algorithms.

Example: You buy a stock. The current market status is good. So with much surety you can sell it tomorrow and get a good reward.
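The reward-versus-value distinction above can be shown with the standard discounted-return calculation. This is my own toy sketch of the cake example (the reward sequences and the discount factor `gamma=0.9` are invented numbers for illustration): value is the cumulative discounted sum of future rewards, so an action with a great immediate reward can still have poor value.

```python
def discounted_return(rewards, gamma=0.9):
    """Value of a choice = cumulative discounted sum of its future rewards.
    gamma < 1 weights near-term rewards more than distant ones."""
    g = 0.0
    for r in reversed(rewards):  # fold from the last reward backwards
        g = r + gamma * g
    return g

# Cake: big immediate reward (+5 now), then health costs every step after.
cake  = [5, -2, -2, -2, -2, -2]
# Salad: modest immediate reward, no later penalty.
salad = [1, 0, 0, 0, 0, 0]

print(discounted_return(cake))   # negative overall, despite the +5 now
print(discounted_return(salad))  # small but positive
```

Reward alone would pick the cake (+5 beats +1); value, which accounts for the long run, picks the salad. This is exactly why the agent cares about value rather than raw reward.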
Now if you keep the shares for another year, the value might sky-rocket multiple folds, but one can never be very sure about that.

Evolutionary Algorithms

Although they sound very similar, RL and Evolutionary Algorithms are not the same thing. Evolutionary algorithms don’t use a value function; they directly search the space of policies.
Pros: Effective when the agent doesn’t have much sense of the environment.
Cons: They don’t notice which states the agent passes through during its lifetime. They do not exploit the fact that the policy they are searching for is a function from states to actions.

This was just the first chapter, a basic intro to RL. We still have a world to explore in upcoming chapters. Yeah, that’s a random GIF.