Multi-Armed Bandits (MABs) form a cornerstone in the world of Reinforcement Learning, offering a powerful mechanism to negotiate scenarios that necessitate optimal equilibrium between exploration and exploitation. Exploration refers to trying out novel options, while exploitation signifies sticking to the current best option. In such circumstances, MABs often prove to be a stellar choice due to their intelligent learning mechanisms. First, let's unravel the concept behind MABs. Picture a gambler surrounded by an array of slot machines, each known as a "one-armed bandit." The critical decisions to make here involve determining which machines to play, the order of playing, and the frequency of play. The catch is that each machine offers a different, uncertain reward. The ultimate goal is to amass the highest cumulative reward across a sequence of plays. What is a Multi-Armed Bandit? The name "multi-armed bandit" comes from a thought experiment involving a gambler at a row of slot machines (often known as "one-armed bandits"). The gambler needs to decide which machines to play, how many times to play each machine and in what order, with the objective of maximizing their total reward. In a more formal definition, a MAB problem is a tuple (A, R) The goal in the MAB problem is to find a policy (a mapping from historical data to actions) that maximizes the expected total reward over a given number of rounds. π Q The key dilemma in MAB is the trade-off between "exploration" and "exploitation": involves trying different arms to learn more about the reward they might yield. This might involve pulling sub-optimal arms, but it helps to gather information about the system. Exploration involves pulling the arm that the agent currently believes to have the highest expected reward, based on the information it has gathered so far. Exploitation Balancing exploration and exploitation is a core challenge in reinforcement learning and MAB problems. To address this problem, a host of algorithms have been developed, including the Epsilon-Greedy, the Upper Confidence Bound (UCB) and Thompson Sampling strategies. The detailed review of Contextual Bandits as an extension of MAB will be covered in next articles. The Epsilon(ε)-Greedy The Epsilon(ε)-Greedy strategy adheres to a straightforward principle. Most of the time, quantified as (1 - ε), it opts for the action yielding the highest estimated reward, thereby exploiting the best known option. However, the rest of the time, quantified as ε, it chooses a random action, thus exploring new possibilities. The ingenuity of this strategy lies in its simplicity and effectiveness, despite the caveat that it might select less rewarding actions with the same probability as the most rewarding ones during its exploration phase and the value of ε determines the balance between exploration and exploitation. Mathematically it can be defined as: This can be defined by the following pseudocode: Initialize Q(a) for all a
Repeat:
  Generate a random number p between 0 and 1
  If p < epsilon:
    Select a random action a
  Else:
    Select action a with the highest Q(a)
  Execute action a and observe reward r
  Update Q(a) = Q(a) + alpha * (r - Q(a)) The Pythonic implementation of the Epsilon-Greedy strategy is fairly straightforward: import numpy as np

class EpsilonGreedy:
    def __init__(self, epsilon, counts, values):
        self.epsilon = epsilon
        self.counts = counts
        self.values = values

    def select_arm(self):
        if np.random.random() < self.epsilon:
            return np.random.choice(len(self.values))
        else:
            return np.argmax(self.values)

    def update(self, chosen_arm, reward):
        self.counts[chosen_arm] += 1
        n = self.counts[chosen_arm]
        value = self.values[chosen_arm]
        new_value = ((n-1) / float(n)) * value + (1 / float(n)) * reward
        self.values[chosen_arm] = new_value The ε-greedy algorithm has the benefit of simplicity and the guarantee that it will continue to explore (a bit) even after many steps. However, it does not take into account how much is known about each action when it explores, unlike UCB or Thompson Sampling. UCB Conversely, the UCB strategy factors in both the estimated reward and the associated uncertainty or variance when deciding on an action. It exhibits a preference for actions with high uncertainty, seeking to diminish it and ensuring a more productive exploration phase. The UCB strategy is mathematically defined as follows: Its Python implementation is as follows: class UCB:
    def __init__(self, counts, values):
        self.counts = counts
        self.values = values

    def select_arm(self):
        n_arms = len(self.counts)
        for arm in range(n_arms):
            if self.counts[arm] == 0:
                return arm
        ucb_values = [0.0 for arm in range(n_arms)]
        total_counts = sum(self.counts)
        for arm in range(n_arms):
            bonus = sqrt((2 * log(total_counts)) / float(self.counts[arm]))
            ucb_values[arm] = self.values[arm] + bonus
        return np.argmax(ucb_values)

    def update(self, chosen_arm, reward):
        self.counts[chosen_arm] += 1
        n = self.counts[chosen_arm]
        value = self.values[chosen_arm]
        new_value = ((n-1) / float(n)) * value + (1 / float(n)) * reward
        self.values[chosen_arm] = new_value The balance between exploration and exploitation in UCB comes from its formula: the estimated value of an action plus a term that decreases over time (as more about the action is learned) but increases with the uncertainty about that action. Thus, the algorithm tends to explore arms with high uncertainty and high potential reward. Thompson Sampling Thompson Sampling is a Bayesian-inspired algorithm for the multi-armed bandit problem. It assigns a prior distribution to the reward probabilities of each bandit (or each action), then updates these priors as rewards are observed. The action selection is probabilistic, according to the posterior distribution over the best action. In a Beta-Bernoulli framework, we consider each bandit's reward distribution as a Bernoulli distribution (i.e., binary rewards 0 or 1). We then assign a Beta prior to the probability of getting a reward. The Beta distribution is the conjugate prior of the Bernoulli distribution, which allows for an easy posterior update. Logic: Assign each bandit a Beta distribution with parameters α=1, β=1 (Uniform prior). For each round: Sample a random number from each bandit's current Beta distribution. Select the bandit with the highest sampled number and pull that bandit. Observe the reward from the pulled bandit. If it's a success (1), increase the bandit's α by one; if it's a failure (0), increase the bandit's β by one. Repeat steps 2 and 3. import numpy as np
from scipy.stats import beta

class Bandit:
    def __init__(self, true_probability):
        self.true_probability = true_probability
        self.alpha = 1
        self.beta = 1

    def pull(self):
        return np.random.random() < self.true_probability

    def sample(self):
        return np.random.beta(self.alpha, self.beta)

    def update(self, reward):
        self.alpha += reward
        self.beta += (1 - reward)

def Thompson(bandits, num_trials):
    rewards = np.zeros(num_trials)
    for i in range(num_trials):
        # Thompson sampling
        j = np.argmax([b.sample() for b in bandits])

        # Pull the arm for the bandit with the largest sample
        reward = bandits[j].pull()

        # Update rewards log
        rewards[i] = reward

        # Update the distribution for the bandit whose arm we just pulled
        bandits[j].update(reward)
    return rewards

# Suppose we have 3 bandits with these true probabilities
true_probabilities = [0.2, 0.5, 0.75]
bandits = [Bandit(p) for p in true_probabilities]

# Run experiment
rewards = Thompson(bandits, num_trials=10000)

# Print the total reward
print("Total reward earned:", rewards.sum())
print("Overall win rate:", rewards.sum() / len(rewards)) Thompson Sampling thus "explores" actions that it is uncertain about (those for which the distribution of values is spread out) and "exploits" actions that it believes may have high value (those for which the distribution is skewed towards high values). Over time, as more is learned about each action, the distributions become more peaked and the actions chosen by the algorithm tend to converge on the one with the highest expected value. The exploration/exploitation balance in Thompson Sampling comes naturally from the shape of the distributions. This method is quite effective, but can be more complex to implement than UCB or ε-greedy, particularly for problems with large or continuous action spaces, or complex reward structures. Why are Multi-Armed Bandits the Best RL for Your Task? : MAB algorithms are simpler and more computationally efficient than full-fledged RL algorithms, which require maintaining and updating a potentially large state-action value table or approximator. Simplicity : MAB algorithms provide robust methods for managing the trade-off between trying new actions and sticking with known good actions. Balance of exploration and exploitation : MAB algorithms can adapt in real-time to changes in the reward distributions of the actions. Real-time adaptability : The simplicity and efficiency of MABs allow for easy integration into existing systems, providing immediate benefits with minimal disruption. Easy integration : MABs have been successfully applied in diverse fields, including advertising (choosing which ad to show to maximize click-through rate), healthcare (personalizing treatment strategies), and web page optimization (A/B testing). Broad applicability Applications of MABs Multi-Armed Bandits (MABs) have a wide range of applications across various industries and domains. Here are some examples of how they can be used: : MABs can be used to dynamically adjust the selection of advertisements to display to users based on their interactions. This helps maximize click-through rates or conversions over time. Online Advertising : In medical research, MAB algorithms can be used to dynamically assign patients to different treatments. This ensures that more patients receive the most effective treatments, thus reducing regret, i.e., the loss incurred due to not always choosing the best treatment. Clinical Trials : News websites can use MABs to personalize the articles shown to each user. The MAB algorithm can learn over time which types of articles each user is interested in, and adjust the recommendations accordingly. News Article Recommendation : E-commerce platforms or airlines can use MAB algorithms to optimize their pricing strategies in real time, maximizing revenue based on customer behavior and market dynamics. Dynamic Pricing : In computer networking, MAB algorithms can be used to manage congestion and optimize the routing of packets. Each route can be treated as an arm, and the algorithm can dynamically select routes to minimize packet loss or latency. Network Routing : MABs can also be used to optimize the selection of hyperparameters in machine learning models. Each set of hyperparameters can be treated as an arm, and the algorithm can iteratively refine the selection to find the optimal model configuration. Machine Learning Hyperparameter Tuning In essence, the utility of MABs extends far beyond conventional reinforcement learning tasks. They symbolize an effective framework for enhancing decision-making processes in environments of uncertainty, providing practical solutions to real-world problems across various domains. Therefore, when the task at hand involves balancing exploration and exploitation, MABs often emerge as the go-to option, providing a versatile, robust, and adaptive solution to decision-making conundrums. Conclusion Multi-Armed Bandits, with their ability to balance exploration and exploitation effectively, provide a robust solution to many real-world decision-making problems. Their inherent adaptability and versatility make them a valuable tool, not just within the realm of reinforcement learning, but also across a broad range of applications, from healthcare to online advertising. Whether you're a data scientist, a machine learning enthusiast, or a professional looking to enhance your decision-making processes, understanding and implementing MAB strategies can be an enriching and rewarding experience.

The code in this story is for educational purposes. The readers are solely responsible for whatever they build with it.

Walkthroughs, tutorials, guides, and tips. This story will teach you how to do something new or how to do something better.

Multi-Armed Bandits: The Best Reinforcement Learning Solution for Your Task

About Author

Comments

TOPICS

THIS ARTICLE WAS FEATURED IN

Related Stories

Advanced Techniques for Time Series Data Feature Engineering

The Noonification: Use This 7-Step McKinsey Framework to Solve Any Problem (1/10/2023)

The Noonification: A Taxonomy of Inclusiveness (1/11/2024)

The Noonification: What is the InfiniteNature-Zero AI Model? (11/19/2022)

10 Ways AI Has Changed Our Lives

100 Days of AI, Day 8: Experimenting With Microsoft's Semantic Kernel Using GPT-4

Advanced Techniques for Time Series Data Feature Engineering

The Noonification: Use This 7-Step McKinsey Framework to Solve Any Problem (1/10/2023)

The Noonification: A Taxonomy of Inclusiveness (1/11/2024)

The Noonification: What is the InfiniteNature-Zero AI Model? (11/19/2022)

10 Ways AI Has Changed Our Lives

100 Days of AI, Day 8: Experimenting With Microsoft's Semantic Kernel Using GPT-4

Light-Mode

Classic

Newspaper

Minty

Dark-Mode

Neon Noir

Minty

HN StartUps