Multi-Armed Bandits (MABs) are a cornerstone of Reinforcement Learning, offering a simple yet powerful framework for problems that require balancing exploration and exploitation. Exploration means trying out new options to gather information, while exploitation means sticking with the option that currently looks best. In such settings, MABs are often an excellent choice because they learn efficiently from the rewards they observe.
The name "multi-armed bandit" comes from a classic thought experiment. Picture a gambler facing a row of slot machines, each known as a "one-armed bandit." The gambler must decide which machines to play, in what order, and how often, while each machine pays out according to its own unknown reward distribution. The goal is to maximize the cumulative reward earned over a sequence of plays.
More formally, a MAB problem can be described as a tuple (A, R), where A is a known set of actions (the arms) and R(r | a) is an unknown probability distribution over rewards for each action a. The goal is to find a policy π (a mapping from the observed history of plays and rewards to actions) that maximizes the expected total reward Q over a given number of rounds.
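To make this concrete, here is a minimal sketch of such a tuple with three Bernoulli arms and a deliberately naive random policy; the reward probabilities are made-up values for illustration and would be unknown to the agent:

import numpy as np

# A = {0, 1, 2}; R(r | a) is a Bernoulli distribution with parameter p_a.
# These probabilities are illustrative only.
reward_probs = [0.2, 0.5, 0.75]

def pull(arm):
    # Sample a reward r ~ R(. | arm): 1.0 with probability p_arm, else 0.0
    return float(np.random.random() < reward_probs[arm])

def random_policy(history):
    # A trivial policy: ignore the history and pick an arm uniformly at random
    return np.random.randint(len(reward_probs))

history, total_reward = [], 0.0
for _ in range(1000):
    a = random_policy(history)
    r = pull(a)
    history.append((a, r))
    total_reward += r

print("Average reward of the random policy:", total_reward / 1000)

The algorithms described below aim to do much better than this random baseline by using the history to decide which arm to pull next.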
The key dilemma in MAB is the trade-off between exploration (trying arms we are still uncertain about in order to learn more) and exploitation (playing the arm that currently appears best). Striking this balance is a core challenge both in MAB problems and in reinforcement learning more broadly.
To address this dilemma, a host of algorithms have been developed, including the Epsilon-Greedy, Upper Confidence Bound (UCB), and Thompson Sampling strategies. A detailed review of Contextual Bandits, an extension of MAB, will be covered in a future article.
The Epsilon(ε)-Greedy strategy follows a straightforward principle. Most of the time, with probability (1 - ε), it chooses the action with the highest estimated reward, thereby exploiting the best-known option. The rest of the time, with probability ε, it picks an action uniformly at random, thus exploring new possibilities. Its appeal lies in its simplicity and effectiveness, with the caveat that during exploration it is just as likely to pick a poorly performing action as a promising one; the value of ε controls the balance between exploration and exploitation.
Mathematically, the action chosen at step t can be written as:

A_t = argmax_a Q_t(a)    with probability 1 - ε
A_t = a random action    with probability ε

where Q_t(a) is the current estimate of the value of action a. This can be expressed in the following pseudocode:
Initialize Q(a) for all actions a
Repeat for each round:
    Generate a random number p between 0 and 1
    If p < epsilon:
        Select a random action a                      (explore)
    Else:
        Select the action a with the highest Q(a)     (exploit)
    Execute action a and observe reward r
    Update Q(a) = Q(a) + alpha * (r - Q(a))
A Python implementation of the Epsilon-Greedy strategy is fairly straightforward:
import numpy as np

class EpsilonGreedy:
    def __init__(self, epsilon, counts, values):
        self.epsilon = epsilon      # exploration probability
        self.counts = counts        # number of times each arm has been pulled
        self.values = values        # running estimate of each arm's mean reward

    def select_arm(self):
        if np.random.random() < self.epsilon:
            # Explore: pick an arm uniformly at random
            return np.random.choice(len(self.values))
        else:
            # Exploit: pick the arm with the highest estimated value
            return np.argmax(self.values)

    def update(self, chosen_arm, reward):
        # Incrementally update the running average reward of the chosen arm
        self.counts[chosen_arm] += 1
        n = self.counts[chosen_arm]
        value = self.values[chosen_arm]
        new_value = ((n - 1) / float(n)) * value + (1 / float(n)) * reward
        self.values[chosen_arm] = new_value
The ε-greedy algorithm has the benefit of simplicity and the guarantee that it will continue to explore (a bit) even after many steps. However, it does not take into account how much is known about each action when it explores, unlike UCB or Thompson Sampling.
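As a quick illustration of how the class above might be used (a minimal sketch; the arm probabilities and ε value are arbitrary choices), one can simulate Bernoulli arms and let the agent learn their values:

import numpy as np

true_probs = [0.2, 0.5, 0.75]          # illustrative, unknown to the agent
agent = EpsilonGreedy(epsilon=0.1,
                      counts=[0] * len(true_probs),
                      values=[0.0] * len(true_probs))

total = 0.0
for _ in range(10000):
    arm = agent.select_arm()
    reward = float(np.random.random() < true_probs[arm])   # Bernoulli reward
    agent.update(arm, reward)
    total += reward

print("Estimated values per arm:", [round(v, 3) for v in agent.values])
print("Average reward:", total / 10000)

With ε = 0.1, roughly 10% of the plays remain exploratory no matter how long the agent runs, which is exactly the behavior described above.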
In contrast, the UCB strategy factors in both the estimated reward and the uncertainty of that estimate when deciding on an action. It prefers actions whose value is still uncertain, which steadily reduces that uncertainty and makes the exploration phase more productive. The UCB1 variant is mathematically defined as follows:

A_t = argmax_a [ Q_t(a) + sqrt(2 * ln(t) / N_t(a)) ]

where t is the total number of plays so far and N_t(a) is the number of times action a has been played.
Its Python implementation is as follows:
from math import log, sqrt
import numpy as np

class UCB:
    def __init__(self, counts, values):
        self.counts = counts        # number of times each arm has been pulled
        self.values = values        # running estimate of each arm's mean reward

    def select_arm(self):
        n_arms = len(self.counts)
        # Play each arm once before applying the UCB formula
        for arm in range(n_arms):
            if self.counts[arm] == 0:
                return arm
        ucb_values = [0.0 for arm in range(n_arms)]
        total_counts = sum(self.counts)
        for arm in range(n_arms):
            # Exploration bonus: grows with total plays, shrinks as this arm is sampled
            bonus = sqrt((2 * log(total_counts)) / float(self.counts[arm]))
            ucb_values[arm] = self.values[arm] + bonus
        return np.argmax(ucb_values)

    def update(self, chosen_arm, reward):
        # Incrementally update the running average reward of the chosen arm
        self.counts[chosen_arm] += 1
        n = self.counts[chosen_arm]
        value = self.values[chosen_arm]
        new_value = ((n - 1) / float(n)) * value + (1 / float(n)) * reward
        self.values[chosen_arm] = new_value
The balance between exploration and exploitation in UCB comes directly from its formula: the estimated value of an action plus an exploration bonus that shrinks as that action is sampled more often and grows slowly with the total number of plays. As a result, the algorithm keeps revisiting arms whose value is still uncertain while concentrating on those with high estimated reward.
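To see this numerically, the short sketch below (with made-up counts) evaluates the exploration bonus sqrt(2 * ln(total_counts) / counts[arm]) for arms that have been tried a different number of times:

from math import log, sqrt

total_counts = 100            # total plays so far (illustrative)
for n_pulls in (2, 10, 50):   # how often a particular arm has been tried
    bonus = sqrt((2 * log(total_counts)) / n_pulls)
    print(f"pulled {n_pulls:>2} times -> exploration bonus {bonus:.3f}")

The rarely tried arm receives a much larger bonus, so UCB is drawn to it even if its estimated value is currently lower.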
Thompson Sampling is a Bayesian-inspired algorithm for the multi-armed bandit problem. It assigns a prior distribution to the reward probabilities of each bandit (or each action), then updates these priors as rewards are observed. The action selection is probabilistic, according to the posterior distribution over the best action.
In a Beta-Bernoulli framework, we consider each bandit's reward distribution as a Bernoulli distribution (i.e., binary rewards 0 or 1). We then assign a Beta prior to the probability of getting a reward. The Beta distribution is the conjugate prior of the Bernoulli distribution, which allows for an easy posterior update.
Logic:
1. Assign each bandit a Beta distribution with parameters α = 1, β = 1 (a uniform prior).
2. For each round:
   - Sample a value from each bandit's current Beta distribution.
   - Pull the arm of the bandit whose sample is largest and observe the reward.
   - Update that bandit's posterior: α += reward, β += (1 - reward).

A Python implementation of this scheme:
import numpy as np

class Bandit:
    def __init__(self, true_probability):
        self.true_probability = true_probability   # unknown to the agent
        self.alpha = 1                              # Beta posterior parameters
        self.beta = 1

    def pull(self):
        # Bernoulli reward: 1 with probability true_probability, else 0
        return np.random.random() < self.true_probability

    def sample(self):
        # Draw a sample from the current Beta posterior
        return np.random.beta(self.alpha, self.beta)

    def update(self, reward):
        # Conjugate Beta-Bernoulli posterior update
        self.alpha += reward
        self.beta += (1 - reward)

def Thompson(bandits, num_trials):
    rewards = np.zeros(num_trials)
    for i in range(num_trials):
        # Thompson sampling: sample from each posterior and take the argmax
        j = np.argmax([b.sample() for b in bandits])
        # Pull the arm for the bandit with the largest sample
        reward = bandits[j].pull()
        # Update rewards log
        rewards[i] = reward
        # Update the distribution for the bandit whose arm we just pulled
        bandits[j].update(reward)
    return rewards
# Suppose we have 3 bandits with these true probabilities
true_probabilities = [0.2, 0.5, 0.75]
bandits = [Bandit(p) for p in true_probabilities]
# Run experiment
rewards = Thompson(bandits, num_trials=10000)
# Print the total reward
print("Total reward earned:", rewards.sum())
print("Overall win rate:", rewards.sum() / len(rewards))
Thompson Sampling thus "explores" actions that it is uncertain about (those for which the distribution of values is spread out) and "exploits" actions that it believes may have high value (those for which the distribution is skewed towards high values).
Over time, as more is learned about each action, the distributions become more peaked and the actions chosen by the algorithm tend to converge on the one with the highest expected value.
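As a quick check of this convergence, one can inspect each bandit's posterior after the 10,000-trial run above (this snippet simply reuses the bandits and true_probabilities defined earlier):

# Inspect each bandit's posterior after the run above
for p, b in zip(true_probabilities, bandits):
    posterior_mean = b.alpha / (b.alpha + b.beta)
    print(f"true p = {p:.2f} | posterior mean = {posterior_mean:.3f} "
          f"(alpha = {int(b.alpha)}, beta = {int(b.beta)})")

The best arm ends up with large α and β (a sharply peaked posterior near its true probability), while the weaker arms are pulled far less often and keep wider posteriors.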
The exploration/exploitation balance in Thompson Sampling comes naturally from the shape of the distributions. This method is quite effective, but can be more complex to implement than UCB or ε-greedy, particularly for problems with large or continuous action spaces, or complex reward structures.
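For a rough head-to-head comparison, the sketch below runs all three strategies on the same set of Bernoulli arms; it assumes the EpsilonGreedy, UCB, Bandit, and Thompson definitions from earlier in this article are in scope, and the arm probabilities are again arbitrary illustrative values.

import numpy as np

true_probs = [0.2, 0.5, 0.75]
n_arms, n_trials = len(true_probs), 10000

def run(agent):
    # Run an agent exposing the select_arm/update interface for n_trials rounds
    total = 0.0
    for _ in range(n_trials):
        arm = agent.select_arm()
        reward = float(np.random.random() < true_probs[arm])
        agent.update(arm, reward)
        total += reward
    return total

print("Epsilon-Greedy total reward:", run(EpsilonGreedy(0.1, [0] * n_arms, [0.0] * n_arms)))
print("UCB total reward:           ", run(UCB([0] * n_arms, [0.0] * n_arms)))
print("Thompson total reward:      ", Thompson([Bandit(p) for p in true_probs], n_trials).sum())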
Multi-Armed Bandits (MABs) have a wide range of applications across industries and domains, from clinical trials in healthcare and online advertising to recommendation systems, A/B testing, and dynamic pricing.
In essence, the utility of MABs extends far beyond conventional reinforcement learning tasks. They provide an effective framework for decision-making under uncertainty, and their adaptability and versatility make them valuable across a broad range of domains, from healthcare to online advertising. Whenever the task at hand involves balancing exploration and exploitation, MABs are often the go-to option. Whether you are a data scientist, a machine learning enthusiast, or a professional looking to improve your decision-making processes, understanding and implementing MAB strategies can be an enriching and rewarding experience.