Multi-Armed Bandits: The Best Reinforcement Learning Solution for Your Taskby@teenl0ve
1,915 reads
1,915 reads

Multi-Armed Bandits: The Best Reinforcement Learning Solution for Your Task

by Valentine ShkulovJuly 20th, 2023
Read on Terminal Reader
Read this story w/o Javascript

Too Long; Didn't Read

The article explores Multi-Armed Bandits (MABs), a reinforcement learning technique used to balance exploration (trying new options) and exploitation (using the current best option). It introduces various MAB algorithms like ε-greedy, UCB, and Thompson Sampling. The ε-greedy method exploits the best known option most of the time, but also explores new options. UCB, on the other hand, considers the estimated reward and associated uncertainty. Thompson Sampling, a Bayesian approach, uses a probabilistic action selection. MABs have wide-ranging applications in advertising, healthcare, web optimization, dynamic pricing, network routing, and machine learning. Their balance of exploration and exploitation makes them ideal for decision-making in uncertain environments.

People Mentioned

Mention Thumbnail
featured image - Multi-Armed Bandits: The Best Reinforcement Learning Solution for Your Task
Valentine Shkulov HackerNoon profile picture

Multi-Armed Bandits (MABs) form a cornerstone in the world of Reinforcement Learning, offering a powerful mechanism to negotiate scenarios that necessitate optimal equilibrium between exploration and exploitation. Exploration refers to trying out novel options, while exploitation signifies sticking to the current best option. In such circumstances, MABs often prove to be a stellar choice due to their intelligent learning mechanisms.

First, let's unravel the concept behind MABs. Picture a gambler surrounded by an array of slot machines, each known as a "one-armed bandit." The critical decisions to make here involve determining which machines to play, the order of playing, and the frequency of play. The catch is that each machine offers a different, uncertain reward. The ultimate goal is to amass the highest cumulative reward across a sequence of plays.

What is a Multi-Armed Bandit?

The name "multi-armed bandit" comes from a thought experiment involving a gambler at a row of slot machines (often known as "one-armed bandits"). The gambler needs to decide which machines to play, how many times to play each machine and in what order, with the objective of maximizing their total reward.

In a more formal definition, a MAB problem is a tuple (A, R)

The goal in the MAB problem is to find a policy π (a mapping from historical data to actions) that maximizes the expected total reward Q over a given number of rounds.

The key dilemma in MAB is the trade-off between "exploration" and "exploitation":

  1. Exploration involves trying different arms to learn more about the reward they might yield. This might involve pulling sub-optimal arms, but it helps to gather information about the system.
  2. Exploitation involves pulling the arm that the agent currently believes to have the highest expected reward, based on the information it has gathered so far.

Balancing exploration and exploitation is a core challenge in reinforcement learning and MAB problems.

To address this problem, a host of algorithms have been developed, including the Epsilon-Greedy, the Upper Confidence Bound (UCB) and Thompson Sampling strategies.

The detailed review of Contextual Bandits as an extension of MAB will be covered in next articles.

The Epsilon(ε)-Greedy

The Epsilon(ε)-Greedy strategy adheres to a straightforward principle. Most of the time, quantified as (1 - ε), it opts for the action yielding the highest estimated reward, thereby exploiting the best known option. However, the rest of the time, quantified as ε, it chooses a random action, thus exploring new possibilities. The ingenuity of this strategy lies in its simplicity and effectiveness, despite the caveat that it might select less rewarding actions with the same probability as the most rewarding ones during its exploration phase and the value of ε determines the balance between exploration and exploitation.

Mathematically it can be defined as:

This can be defined by the following pseudocode:

Initialize Q(a) for all a
  Generate a random number p between 0 and 1
  If p < epsilon:
    Select a random action a
    Select action a with the highest Q(a)
  Execute action a and observe reward r
  Update Q(a) = Q(a) + alpha * (r - Q(a))

The Pythonic implementation of the Epsilon-Greedy strategy is fairly straightforward:

import numpy as np

class EpsilonGreedy:
    def __init__(self, epsilon, counts, values):
        self.epsilon = epsilon
        self.counts = counts
        self.values = values

    def select_arm(self):
        if np.random.random() < self.epsilon:
            return np.random.choice(len(self.values))
            return np.argmax(self.values)

    def update(self, chosen_arm, reward):
        self.counts[chosen_arm] += 1
        n = self.counts[chosen_arm]
        value = self.values[chosen_arm]
        new_value = ((n-1) / float(n)) * value + (1 / float(n)) * reward
        self.values[chosen_arm] = new_value

The ε-greedy algorithm has the benefit of simplicity and the guarantee that it will continue to explore (a bit) even after many steps. However, it does not take into account how much is known about each action when it explores, unlike UCB or Thompson Sampling.


Conversely, the UCB strategy factors in both the estimated reward and the associated uncertainty or variance when deciding on an action. It exhibits a preference for actions with high uncertainty, seeking to diminish it and ensuring a more productive exploration phase. The UCB strategy is mathematically defined as follows:

Its Python implementation is as follows:

class UCB:
    def __init__(self, counts, values):
        self.counts = counts
        self.values = values

    def select_arm(self):
        n_arms = len(self.counts)
        for arm in range(n_arms):
            if self.counts[arm] == 0:
                return arm
        ucb_values = [0.0 for arm in range(n_arms)]
        total_counts = sum(self.counts)
        for arm in range(n_arms):
            bonus = sqrt((2 * log(total_counts)) / float(self.counts[arm]))
            ucb_values[arm] = self.values[arm] + bonus
        return np.argmax(ucb_values)

    def update(self, chosen_arm, reward):
        self.counts[chosen_arm] += 1
        n = self.counts[chosen_arm]
        value = self.values[chosen_arm]
        new_value = ((n-1) / float(n)) * value + (1 / float(n)) * reward
        self.values[chosen_arm] = new_value

The balance between exploration and exploitation in UCB comes from its formula: the estimated value of an action plus a term that decreases over time (as more about the action is learned) but increases with the uncertainty about that action. Thus, the algorithm tends to explore arms with high uncertainty and high potential reward.

Thompson Sampling

Thompson Sampling is a Bayesian-inspired algorithm for the multi-armed bandit problem. It assigns a prior distribution to the reward probabilities of each bandit (or each action), then updates these priors as rewards are observed. The action selection is probabilistic, according to the posterior distribution over the best action.

In a Beta-Bernoulli framework, we consider each bandit's reward distribution as a Bernoulli distribution (i.e., binary rewards 0 or 1). We then assign a Beta prior to the probability of getting a reward. The Beta distribution is the conjugate prior of the Bernoulli distribution, which allows for an easy posterior update.


  1. Assign each bandit a Beta distribution with parameters α=1, β=1 (Uniform prior).

  2. For each round:

    1. Sample a random number from each bandit's current Beta distribution.
    2. Select the bandit with the highest sampled number and pull that bandit.
    3. Observe the reward from the pulled bandit. If it's a success (1), increase the bandit's α by one; if it's a failure (0), increase the bandit's β by one.
    4. Repeat steps 2 and 3.
import numpy as np
from scipy.stats import beta

class Bandit:
    def __init__(self, true_probability):
        self.true_probability = true_probability
        self.alpha = 1
        self.beta = 1

    def pull(self):
        return np.random.random() < self.true_probability

    def sample(self):
        return np.random.beta(self.alpha, self.beta)

    def update(self, reward):
        self.alpha += reward
        self.beta += (1 - reward)

def Thompson(bandits, num_trials):
    rewards = np.zeros(num_trials)
    for i in range(num_trials):
        # Thompson sampling
        j = np.argmax([b.sample() for b in bandits])

        # Pull the arm for the bandit with the largest sample
        reward = bandits[j].pull()

        # Update rewards log
        rewards[i] = reward

        # Update the distribution for the bandit whose arm we just pulled
    return rewards

# Suppose we have 3 bandits with these true probabilities
true_probabilities = [0.2, 0.5, 0.75]
bandits = [Bandit(p) for p in true_probabilities]

# Run experiment
rewards = Thompson(bandits, num_trials=10000)

# Print the total reward
print("Total reward earned:", rewards.sum())
print("Overall win rate:", rewards.sum() / len(rewards))

Thompson Sampling thus "explores" actions that it is uncertain about (those for which the distribution of values is spread out) and "exploits" actions that it believes may have high value (those for which the distribution is skewed towards high values).

Over time, as more is learned about each action, the distributions become more peaked and the actions chosen by the algorithm tend to converge on the one with the highest expected value.

The exploration/exploitation balance in Thompson Sampling comes naturally from the shape of the distributions. This method is quite effective, but can be more complex to implement than UCB or ε-greedy, particularly for problems with large or continuous action spaces, or complex reward structures.

Why are Multi-Armed Bandits the Best RL for Your Task?

  1. Simplicity: MAB algorithms are simpler and more computationally efficient than full-fledged RL algorithms, which require maintaining and updating a potentially large state-action value table or approximator.
  2. Balance of exploration and exploitation: MAB algorithms provide robust methods for managing the trade-off between trying new actions and sticking with known good actions.
  3. Real-time adaptability: MAB algorithms can adapt in real-time to changes in the reward distributions of the actions.
  4. Easy integration: The simplicity and efficiency of MABs allow for easy integration into existing systems, providing immediate benefits with minimal disruption.
  5. Broad applicability: MABs have been successfully applied in diverse fields, including advertising (choosing which ad to show to maximize click-through rate), healthcare (personalizing treatment strategies), and web page optimization (A/B testing).

Applications of MABs

Multi-Armed Bandits (MABs) have a wide range of applications across various industries and domains. Here are some examples of how they can be used:

  1. Online Advertising: MABs can be used to dynamically adjust the selection of advertisements to display to users based on their interactions. This helps maximize click-through rates or conversions over time.
  2. Clinical Trials: In medical research, MAB algorithms can be used to dynamically assign patients to different treatments. This ensures that more patients receive the most effective treatments, thus reducing regret, i.e., the loss incurred due to not always choosing the best treatment.
  3. News Article Recommendation: News websites can use MABs to personalize the articles shown to each user. The MAB algorithm can learn over time which types of articles each user is interested in, and adjust the recommendations accordingly.
  4. Dynamic Pricing: E-commerce platforms or airlines can use MAB algorithms to optimize their pricing strategies in real time, maximizing revenue based on customer behavior and market dynamics.
  5. Network Routing: In computer networking, MAB algorithms can be used to manage congestion and optimize the routing of packets. Each route can be treated as an arm, and the algorithm can dynamically select routes to minimize packet loss or latency.
  6. Machine Learning Hyperparameter Tuning: MABs can also be used to optimize the selection of hyperparameters in machine learning models. Each set of hyperparameters can be treated as an arm, and the algorithm can iteratively refine the selection to find the optimal model configuration.

In essence, the utility of MABs extends far beyond conventional reinforcement learning tasks. They symbolize an effective framework for enhancing decision-making processes in environments of uncertainty, providing practical solutions to real-world problems across various domains. Therefore, when the task at hand involves balancing exploration and exploitation, MABs often emerge as the go-to option, providing a versatile, robust, and adaptive solution to decision-making conundrums.


Multi-Armed Bandits, with their ability to balance exploration and exploitation effectively, provide a robust solution to many real-world decision-making problems. Their inherent adaptability and versatility make them a valuable tool, not just within the realm of reinforcement learning, but also across a broad range of applications, from healthcare to online advertising. Whether you're a data scientist, a machine learning enthusiast, or a professional looking to enhance your decision-making processes, understanding and implementing MAB strategies can be an enriching and rewarding experience.