Use Beta Distribution and Thompson Sampling to Beat The Multi-armed Bandit at the Casino by@ryan-yu

# Use Beta Distribution and Thompson Sampling to Beat The Multi-armed Bandit at the Casino

Beta distribution is a family of continuous probability distributions defined on the interval [0, 1] parametrized by two positive shape parameters, denoted by α and β, that appear as exponents of the random variable and control the shape of the distribution. We use Beta distribution to model the simplest form of the multi-armed bandit problem, which is the binary outcome/reward. In the casino example, each machine will pay a reward of $1 when the outcome is success, and$0 when it is fail. Our goal is to identify the machine with the highest probability of success.

## As a logical person at the casino. you want to put your money on the machine with the maximum expected return. This is the origin of the multi-armed bandit problem. We will cover the two most basic concept here: Beta distribution and Thompson sampling.Beta Distribution

We use Beta distribution to model the simplest form of the multi-armed bandit problem, which is the binary outcome/reward. In the casino example, each machine will pay a reward of $1 when the outcome is success, and$0 when the outcome is fail.

Our goal is to identify the machine with the highest probability of success. Beta distribution is a family of continuous probability distributions defined on the interval [0, 1] parametrized by two positive shape parameters, denoted by α and β, that appear as exponents of the random variable and control the shape of the distribution.

We use Beta distribution for the following two important properties:
1. Beta distribution is defined between 0 and 1, which correlates to the range of our estimators.
2. The posterior probability is Beta with updated parameters.

Use baseball as an example. Batting average (BA) is a metric to evaluate a player, and we want to estimate BA for a new player whose true BA is 30%. Since the new player does not have any batting history, we can assume the distribution of estimated BA follows Beta(1,1), i.e. our estimate is equally distributed between 0 and 1.

With progress of games, we can collect more data and produce a better estimate of the player’s BA. Using the property of Beta, we observe that the estimated BA is ~10% after 10 AB(At Bats), but is very close to the true value 30% after 1000 AB (code).

## Thompson Sampling

Continuing the discussion we use 3 slot machines with Beta distribution to explain how Thompson Sampling works. The true success probability of slot 0, 1, and 2 is 0.7, 0.8, and 0.9 respectively, nevertheless the probability is unknown to us.

Our goal is to identify the machine with highest success probability.

Again since we have no information, we assume all 3 machines have the same prior distribution Beta(1,1), i.e. the 3 machines are equally likely to be selected for experiment.

At starting time t = 0, we run 1000 experiments and observe that machine 0, 1, and 2 are being selected 324, 345, and 331 experiments respectively. We then use the experiment results to update α and β of the three machines’ Beta distribution, and then use the updated Beta as the new prior probability at time t = 1 for the next 1000 experiments.

We continue the process until t = 1000 and the results are shown below:

{
"nbformat": 4,
"nbformat_minor": 0,
"colab": {
"name": "Untitled0.ipynb",
"provenance": [],
"authorship_tag": "ABX9TyOpA2LaWRSAI9kAR4SILrfF",
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"accelerator": "GPU"
},
"cells": [
{
"cell_type": "markdown",
"id": "view-in-github",
"colab_type": "text"
},
"source": [
]
},
{
"cell_type": "code",
"id": "gzCI9Bmgvkd3",
"colab_type": "code",
"colab": {}
},
"source": [
"# https://arxiv.org/pdf/1707.02038.pdf\n",
"from scipy.stats import beta\n",
"import matplotlib.pyplot as plt\n",
"import matplotlib as mpl\n",
"import seaborn as sns\n",
"import numpy as np\n",
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"id": "UHBzEqo-SaZ8",
"colab_type": "code",
"outputId": "8f0bf3f5-57e0-476a-81d6-b5550af777a7",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 282
}
},
"source": [
"action1 = beta.rvs(a=600, b=400, size=1000)\n",
"action2 = beta.rvs(a=400, b=600, size=1000)\n",
"action3 = beta.rvs(a=4, b=6, size=10)\n",
"sns.distplot(action1,hist=False,color='blue')\n",
"sns.distplot(action2,hist=False,color='green')\n",
"sns.distplot(action3,hist=False,color='red')"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
]
},
"tags": []
},
"execution_count": 2
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"tags": []
}
}
]
},
{
"cell_type": "code",
"id": "xtdGWuqfScrM",
"colab_type": "code",
"colab": {}
},
"source": [
"simloops = 1000\n",
"periods = 1000\n",
"probs= [0.7,0.8,0.9]\n",
"pct = {}\n",
"pct[0] = [0]*periods\n",
"pct[1] = [0]*periods\n",
"pct[2] = [0]*periods"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"id": "yq2fHxjc5UPG",
"colab_type": "code",
"outputId": "a2aa6e48-9340-47b1-c25b-4cc45316eac4",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 185
}
},
"source": [
"for sim in range(simloops):\n",
"    a = [1,1,1]\n",
"    b = [1,1,1]\n",
"    for t in range(periods):\n",
"        posterior_prob = np.random.beta(a,b)\n",
"        action = np.random.choice(np.where(posterior_prob == posterior_prob.max())[0])\n",
"        reward = np.random.binomial(1, probs[action])\n",
"        pct[action][t] += 1\n",
"        a[action] += reward\n",
"        b[action] += 1-reward\n",
"    if sim%100 == 1:\n",
"        print (\"{} out of {} simulations have been completed\".format(sim,simloops))"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"1 out of 1000 simulations have been completed\n",
"101 out of 1000 simulations have been completed\n",
"201 out of 1000 simulations have been completed\n",
"301 out of 1000 simulations have been completed\n",
"401 out of 1000 simulations have been completed\n",
"501 out of 1000 simulations have been completed\n",
"601 out of 1000 simulations have been completed\n",
"701 out of 1000 simulations have been completed\n",
"801 out of 1000 simulations have been completed\n",
"901 out of 1000 simulations have been completed\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"id": "_5YVrLeC-gLP",
"colab_type": "code",
"outputId": "d42564ce-2d84-4552-d7b9-6c754c4c82df",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 67
}
},
"source": [
"for i in range(3):\n",
"    print (i,pct[i][:5])"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"0 [324, 287, 284, 312, 292]\n",
"1 [345, 330, 340, 325, 322]\n",
"2 [331, 383, 376, 363, 386]\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"id": "B0jSQfqe6Dot",
"colab_type": "code",
"colab": {}
},
"source": [
"for i in range(3):\n",
"    pct[i] = [x/simloops for x in pct[i]]"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"id": "QqNGF4VZ-xt8",
"colab_type": "code",
"outputId": "8404928a-80ef-43de-8c4f-ff6286334cea",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 67
}
},
"source": [
"# check the first 5 records - 3 actions have similar probability\n",
"for i in range(3):\n",
"    print (i,pct[i][:5])"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"0 [0.324, 0.287, 0.284, 0.312, 0.292]\n",
"1 [0.345, 0.33, 0.34, 0.325, 0.322]\n",
"2 [0.331, 0.383, 0.376, 0.363, 0.386]\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"id": "tVZ-9nMG6Dro",
"colab_type": "code",
"outputId": "8fb21c8c-544f-4b1c-c5a2-814816909d03",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 378
}
},
"source": [
"t = [x for x in range(periods)]\n",
"mpl.style.use('seaborn')\n",
"plt.plot(t,pct[0],color='red',label='action_0')\n",
"plt.plot(t,pct[1],color='blue',label='action_1')\n",
"plt.plot(t,pct[2],color='green',label='action_2')\n",
"plt.legend()\n",
"plt.xlabel('t')\n",
"plt.ylabel('action probability')"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"Text(0, 0.5, 'action probability')"
]
},
"tags": []
},
"execution_count": 10
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"<Figure size 576x396 with 1 Axes>"
]
},
"tags": []
}
}
]
},
{
"cell_type": "code",
"id": "QJxz4YGP9532",
"colab_type": "code",
"outputId": "38c2c8e6-bb1d-4c6a-e484-5d91ac0e0160",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 67
}
},
"source": [
"# check the last 5 records - 3 actions have very different probability\n",
"for i in range(3):\n",
"    print (i,pct[i][-5:])"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"0 [0.004, 0.008, 0.002, 0.001, 0.003]\n",
"1 [0.026, 0.016, 0.016, 0.018, 0.015]\n",
"2 [0.97, 0.976, 0.982, 0.981, 0.982]\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"id": "QBGsGPYmDV85",
"colab_type": "code",
"colab": {}
},
"source": [
""
],
"execution_count": 0,
"outputs": []
}
]
}

Thompson Sampling allows us to identify the true best machine (machine #2).

This example is also known as Beta-Bernoulli Bandit.

## Reference

#### Related Stories

L O A D I N G
. . . comments & more!