Feature image: "The Reward Function," generated by Midjourney.
Generative transformer-based models have received a lot of attention since the introduction of ChatGPT with its remarkable Large Language Model (LLM) capabilities. But another, less-heralded artificial intelligence technique keeps peeking out from behind the curtain: reinforcement learning, which is being used to tame unruly generative AI with a carrot-and-stick approach.
Reinforcement Learning (RL) is finally getting more attention as the big names in AI like Google, DeepMind, and OpenAI are publishing more papers on its use to solve problems that LLMs cannot reach. Nature published a paper this year from DeepMind that used reinforcement learning to discover a more efficient implementation of the sorting routines in the C++ standard library.
RL algorithms were developed by Richard Sutton and Andrew Barto, drawing on behavioral psychology: the operant conditioning described by B.F. Skinner and the classical conditioning described by Ivan Pavlov.
But it has languished in the background of artificial intelligence research - Turing Award winner Yann LeCun famously likened it to the cherry on top of the machine learning cake (with supervised learning as the icing and unsupervised learning as the cake itself).
RL is a type of machine learning where the AI system, called an agent, learns through trial-and-error interaction with its environment. It's how humans learn many skills - by trying different things and seeing what works - and how we learn what to avoid, like the pain of being stung by an insect.
The agent gets feedback on its actions as rewards (punishments are negative rewards), and its goal is to collect as much reward as possible over time. This makes reinforcement learning a great fit for iterative decision-making problems.
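In code, that interaction loop is short. Here is a minimal sketch using the open-source Gymnasium library and its CartPole balancing task purely as an illustration; the agent below just acts at random, but the structure is the same for any learned policy:

```python
# A minimal agent-environment loop using Gymnasium's CartPole task.
# The agent acts, receives a reward, and observes the new state, over and over.
import gymnasium as gym

env = gym.make("CartPole-v1")
observation, info = env.reset(seed=42)

total_reward = 0.0
for _ in range(200):
    action = env.action_space.sample()  # a random policy, for illustration only
    observation, reward, terminated, truncated, info = env.step(action)
    total_reward += reward              # the agent's goal is to maximize this over time
    if terminated or truncated:
        observation, info = env.reset()

env.close()
print(f"Reward collected: {total_reward}")
```

A real agent would replace the random action with one chosen by its learned policy, but the loop of act, observe, collect reward stays the same.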
Compared to supervised learning, where the AI is trained on labeled datasets of correct answers, and unsupervised learning, where the AI finds patterns in unlabeled data, reinforcement learning doesn't need a pre-assembled dataset at all.
The agent simply learns from its own experience interacting with the environment, like children discovering how the world works.
Reinforcement learning also differs from genetic algorithms, where the fittest agents are selected over generations - in reinforcement learning, each agent learns within its lifetime.
Certain qualities make a problem a good fit for reinforcement learning. First, the problem should have a quantifiable reward function that provides clear feedback on how well the agent's actions accomplish the intended goal.
For example, a maze navigation task would score the agent higher for moving closer to the exit. Without this measure of progress, the agent cannot determine which behaviors are effective.
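As a concrete, entirely hypothetical sketch, a reward function for that maze might combine a small cost per step, a bonus for moving closer to the exit, and a large payoff for reaching it (the positions, weights, and scoring are invented for illustration):

```python
# Hypothetical reward function for a grid maze: a per-step penalty,
# a shaping bonus for progress toward the exit, and a large terminal reward.

def maze_reward(old_pos, new_pos, exit_pos):
    def dist(a, b):
        # Manhattan distance on the grid
        return abs(a[0] - b[0]) + abs(a[1] - b[1])

    if new_pos == exit_pos:
        return 100.0                 # reached the goal
    reward = -1.0                    # cost per step encourages short paths
    if dist(new_pos, exit_pos) < dist(old_pos, exit_pos):
        reward += 0.5                # shaping bonus: moved closer to the exit
    return reward
```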
Second, the problem should allow the agent to continually take actions and learn from the results. If the agent must make all its decisions simultaneously, it will not be able to adjust based on experience. Instead, reinforcement learning works best in iterative environments where the agent can try different approaches over time.
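One classic way an agent folds each result back into what it has learned is the tabular Q-learning update, which nudges its estimate of an action's value toward the reward it just received plus the discounted value of the state it landed in. A minimal sketch, with arbitrary choices for the learning rate and discount factor:

```python
# Tabular Q-learning update: move the value estimate for (state, action)
# toward the observed reward plus the discounted value of the next state.
from collections import defaultdict

Q = defaultdict(float)       # Q[(state, action)] -> estimated long-term value
alpha, gamma = 0.1, 0.99     # learning rate and discount factor (arbitrary)

def q_update(state, action, reward, next_state, actions):
    best_next = max(Q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])
```

Each pass through the environment loop calls an update like this once per step, which is why the iterative setting matters.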
Finally, the problem must permit mistakes during the trial-and-error learning process.
Reinforcement learning fundamentally involves exploring less-than-optimal actions. If errors are acceptable while the agent learns, reinforcement learning can discover novel and effective strategies through experimentation.
Problems like robot control, game-playing, code-writing, automated unit testing, and automated trading satisfy these criteria. The agent can take actions, evaluate their impact quantitatively, and adapt its behavior over time.
While tradeoffs exist, reinforcement learning opens up solutions that are not possible with other machine learning approaches. By embracing the goal-seeking process, reinforcement learning expands the range of what AI can accomplish.
Reinforcement learning has the potential to transform industries where human oversight is impractical or solutions aren't obvious. Dr Phil Winder, author of a recent book on Reinforcement Learning, curates case studies of its application (https://winder.ai/tags/case-study/)
Developing new medicines involves immense trial and error. Reinforcement learning can simulate this more efficiently by finding relationships in existing medical data. Since testing unproven compounds on people is unethical, reinforcement learning offers a faster alternative to traditional techniques.
The algorithm explores links between chemical properties, genes, and disease interactions to pinpoint promising research directions for scientists.
Reinforcement learning trading systems like RBC's Aiden execute trades by recognizing patterns learned from market experience. Reinforcement learning suits the noisy, rapidly changing world of finance, where hand-coded models fail.
Trading algorithms using reinforcement learning adapt based on what works rather than needing constant human updates. However, challenges remain in avoiding overfitting limited data and ensuring ethical agent behavior.
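What "what works" means in practice comes down to the reward. A hypothetical per-step reward for a trading agent might be the change in portfolio value minus a penalty for transaction costs; the function below is a sketch with invented names and weights, not a description of any real trading system:

```python
# Hypothetical per-step reward for a trading agent: profit or loss this step,
# minus a penalty for transaction costs to discourage excessive churn.

def trading_reward(prev_value, new_value, shares_traded, cost_per_share=0.01):
    pnl = new_value - prev_value               # change in portfolio value
    costs = shares_traded * cost_per_share     # penalize heavy trading
    return pnl - costs
```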
In the context of unit test code-writing, the environment is the space of potential unit test programs, and the agent tries to find unit tests that maximize coverage of the code under test and expose the most bugs.
Reinforcement learning can probabilistically search the set of possible test methods for a given program, analyze the program, and then write tests autonomously. The trial-and-error approach means the test code generated has been run and therefore its behavior is known and can be evaluated.
This approach allows entire suites of fully functioning unit test code – tens or hundreds of thousands of tests – to be written without human code review.
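The reward signal for such a test-writing agent has to be quantifiable, as discussed earlier. A hypothetical sketch might score a generated test on whether it compiles and runs, how much new coverage it adds, and how many bugs it exposes (the names and weights here are invented for illustration):

```python
# Hypothetical reward for a test-generating agent: favor tests that add
# coverage and expose failures, and penalize tests that don't even compile.

def test_reward(compiles, coverage_gain, bugs_found):
    if not compiles:
        return -1.0                                      # unusable test code
    return coverage_gain * 10.0 + bugs_found * 50.0      # illustrative weights
```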
Reinforcement learning enables more lifelike, human-like game AI. Instead of following scripted rules, reinforcement learning agents actively explore environments, discovering creative solutions a human may miss.
Open-ended games like Minecraft are an exciting new frontier for reinforcement learning in games. So far, it shows promise for interesting, emergent AI behaviors.
One major challenge with reinforcement learning agents is that they require large amounts of experience from their environments. With insufficient data, the agents will fail to generalize properly to new situations and perform poorly.
Unfortunately, in many real-world scenarios, acquiring enough quality data is difficult and time-consuming. Without enough data, the agent's learned behaviors will be brittle and limited.
Another hurdle is that real-world conditions are often shifting, not stationary. Changes in the environment may suddenly render an agent's previously learned behaviors obsolete or suboptimal.
To deploy reinforcement learning successfully, agents must have techniques to quickly adapt their policies to respond to changing conditions.
A third issue is balancing exploration against performance. Exploration through random actions is key to reinforcement learning, but excessive random actions will hurt short-term results. Agents must temper undirected exploration to focus on exploiting known rewards while still trying new approaches when needed. Striking the right balance remains an open challenge.
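A standard recipe for striking that balance is an epsilon-greedy policy: exploit the best-known action most of the time, and explore at random with a small probability. A minimal sketch, with an arbitrary choice of epsilon:

```python
# Epsilon-greedy action selection: mostly exploit, occasionally explore.
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Pick a random action with probability epsilon, else the best-known one."""
    if random.random() < epsilon:
        return random.choice(actions)                     # explore: try something new
    return max(actions, key=lambda a: Q[(state, a)])      # exploit: best known action

# Example usage with an empty value table (all estimates default to 0.0).
Q = defaultdict(float)
action = epsilon_greedy(Q, state="start", actions=["left", "right"])
```

More sophisticated schemes decay epsilon over time or direct exploration toward uncertain states, but the tradeoff itself never goes away.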
Finally, without close human oversight, agents may learn harmful policies that exploit loopholes in the reward system. Researchers must develop methods to align agent objectives with human values and predict how learned policies will generalize.
Reinforcement learning offers enormous potential, and research is progressing swiftly to address these challenges in areas like sim-to-real transfer (training a reinforcement learning agent in a simulated environment and then transferring the learned policy to the real world), reward design, and interpretability.
LLMs currently hog the limelight, but RL unlocks solutions where supervised and unsupervised techniques falter. As research tackles current weaknesses, expect to see growing use of reinforcement learning in medicine, finance, coding, robotics, and more.