In Chapter 5 of AI Crash Course, the author writes:
nSelected = nPosReward + nNegReward
for i in range(d):
    print('Machine number ' + str(i + 1) + ' was selected ' + str(nSelected[i]) + ' times')
print('Conclusion: Best machine is machine number ' + str(np.argmax(nSelected) + 1))
Why is the number of negative rewards added to the number of positive rewards? To find the best machine, shouldn't we only be concerned with the machine that has the most positive rewards? I'm confused as to why we need to add the negative rewards to the positive rewards. Also, I understand that this is a simulation where you randomly assign successes and pre-assign the success rates. However, in a real-life situation, how do you know the success rates of the slot machines ahead of time? And how do you know which machines should be assigned a "1"? Thank you so much! Here is the full code:
# Importing the libraries
import numpy as np
# Setting conversion rates and the number of samples
conversionRates = [0.15, 0.04, 0.13, 0.11, 0.05]
N = 10000
d = len(conversionRates)
# Creating the dataset
X = np.zeros((N, d))
for i in range(N):
    for j in range(d):
        if np.random.rand() < conversionRates[j]:
            X[i][j] = 1
# Making arrays to count our losses and wins
nPosReward = np.zeros(d)
nNegReward = np.zeros(d)
# Taking our best slot machine through beta distribution and updating its losses and wins
for i in range(N):
    selected = 0
    maxRandom = 0
    for j in range(d):
        randomBeta = np.random.beta(nPosReward[j] + 1, nNegReward[j] + 1)
        if randomBeta > maxRandom:
            maxRandom = randomBeta
            selected = j
    if X[i][selected] == 1:
        nPosReward[selected] += 1
    else:
        nNegReward[selected] += 1
# Showing which slot machine is considered the best
nSelected = nPosReward + nNegReward
for i in range(d):
    print('Machine number ' + str(i + 1) + ' was selected ' + str(nSelected[i]) + ' times')
print('Conclusion: Best machine is machine number ' + str(np.argmax(nSelected) + 1))
2 Answers
With more and more feedback, Thompson Sampling shifts its focus from exploration to exploitation. That is, with large nSelected values across the board (due to the large N), all the Beta distributions become tightly concentrated around their empirical means (nPosReward[i] / nSelected[i]), so in later iterations, with increasing probability, Thompson Sampling picks the machine it thinks is the most rewarding. Over a long enough horizon, the probability that the best machine is also the most often selected machine approaches 1.
To sum up, your intuition is correct. The machine that is the most rewarding in expectation (given the feedback observed so far) is the one with the highest empirical mean. Because of the probabilistic behavior just described, if you run the algorithm for long enough, the most frequently picked machine and the machine with the highest expected reward will coincide with probability approaching 1.
About the second part of your question: we don't know the success rates. If we knew them, the optimal algorithm would simply pick the machine with the highest success rate every time. What we do have in real life are the observed outcomes of these random processes. For example, when you show an online advertisement, you don't know the probability that a given user will click it. However, under the (gross) simplification that everybody behaves the same way, by showing the ad to people and observing whether or not they click, we learn the success rate on the fly.
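As a small hedged illustration of "learning the rate on the fly" (the trueRate of 0.12 here is made up and is only used to simulate clicks, never read by the learner): we just count clicks and impressions and watch the empirical rate converge.
import numpy as np
np.random.seed(1)
trueRate = 0.12          # hidden from the learner; only used to simulate behavior
clicks, impressions = 0, 0
for t in range(1, 10001):
    clicked = np.random.rand() < trueRate   # observe one user's behavior
    clicks += clicked
    impressions += 1
    if t in (10, 100, 1000, 10000):
        print('after %5d impressions: estimated rate = %.3f' % (t, clicks / impressions))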
Steven,
(I'm writing this almost 7 months after you posted. Hopefully you will reply and share some of your own insights, and the code that you used to gain them.) I started the AI Crash Course at the end of November 2020, and like you I was curious about the Thompson Sampling in Chapter 5. In my case I was mostly interested in the cases where Thompson Sampling doesn't select the best machine; I was curious how often the "worst" machine was selected. So over the past six weeks I've tried well over a thousand variations of code to gain some insight, made at least as many coding mistakes, and gone down a hundred rabbit holes trying to "catch" the cases where Thompson Sampling doesn't work, and to understand how the random Beta draws and the adding of posRewards and negRewards behave. There are likely mistakes in the following code, and the overall approach to gaining insight could be made more graphical, so please be kind. :-)
curlycharcoal provides some insight in his thoughtful answers, and even Hadelin, in the same chapter, offers the reader a lot of insight. What follows is an attempt at an iterative, "trap the errors" approach that helped me gain some insight. We can try something like the code below and compare the result of posReward + negReward versus posReward only.
Consider the following. First, insert a few lines of code that accumulate posRewardsOnly. Add some extra parameters to the concluding print statement so you can see both results, and also print the true conversion rates (i.e. the column means of X) so you can see the actual rates used. Remove the per-machine print statements about the selections, just to clean up the output.
Second, wrap most of Hadelin's original code in a large outer loop and iterate over it. Since the posRewardOnly result is now in the concluding print, you can compare the results of adding in the negative rewards versus selecting the best machine from positive rewards only; a sketch of this outer-loop experiment is shown below. (You can think of this outer loop as a crude "AI" test environment in which you gain insight into which approach performs better.)
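Here is a minimal sketch of that outer-loop experiment, under my own assumptions (nRuns repetitions, N = 600 trials per run, the book's conversion rates; names like nRuns, correctBoth and correctPosOnly are mine, not from the book). It is not my original experiment code, just one compact way to compare the two selection rules:
import numpy as np
conversionRates = [0.15, 0.04, 0.13, 0.11, 0.05]
d = len(conversionRates)
N = 600
nRuns = 200
trueBest = int(np.argmax(conversionRates))
correctBoth, correctPosOnly = 0, 0
for run in range(nRuns):
    # Fresh dataset and fresh reward counters for every run
    X = (np.random.rand(N, d) < conversionRates).astype(int)
    nPosReward = np.zeros(d)
    nNegReward = np.zeros(d)
    for i in range(N):
        # One Beta draw per machine; pick the machine with the largest draw
        selected = int(np.argmax(np.random.beta(nPosReward + 1, nNegReward + 1)))
        if X[i][selected] == 1:
            nPosReward[selected] += 1
        else:
            nNegReward[selected] += 1
    # Rule 1: most selected machine (pos + neg); Rule 2: most positive rewards only
    correctBoth += int(np.argmax(nPosReward + nNegReward) == trueBest)
    correctPosOnly += int(np.argmax(nPosReward) == trueBest)
print('pos + neg picked the true best machine in %d / %d runs' % (correctBoth, nRuns))
print('pos only picked the true best machine in %d / %d runs' % (correctPosOnly, nRuns))
Nothing here depends on there being five machines; adjust conversionRates, N, and nRuns to probe different regimes.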
We could even store, at each iteration of the outer loop, whether each rule (pos + neg versus posRewardsOnly) selected the correct machine, and graph that at the end. (I haven't done this, but it would be fun to see.)
We can also keep an array, on the inner loop, tracking which machine the random Beta draws select at each time step compared to the actual best machine, and watch how that selection does a drunkard's walk over the time steps and eventually sobers up and settles on the best machine if N is large enough, usually a few thousand time steps (say N > 5000).
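A hedged sketch of that selection-history idea (selections, firstHalf, and secondHalf are my own names): record which machine is chosen at every step, then compare how often the true best machine is chosen early versus late in the run. Plotting selections against the step index would show the walk directly.
import numpy as np
conversionRates = [0.15, 0.04, 0.13, 0.11, 0.05]
d = len(conversionRates)
N = 5000
trueBest = int(np.argmax(conversionRates))
X = (np.random.rand(N, d) < conversionRates).astype(int)
nPosReward = np.zeros(d)
nNegReward = np.zeros(d)
selections = np.zeros(N, dtype=int)          # machine picked at each step
for i in range(N):
    selected = int(np.argmax(np.random.beta(nPosReward + 1, nNegReward + 1)))
    selections[i] = selected
    if X[i][selected] == 1:
        nPosReward[selected] += 1
    else:
        nNegReward[selected] += 1
firstHalf = np.mean(selections[:N // 2] == trueBest)
secondHalf = np.mean(selections[N // 2:] == trueBest)
print('true best machine picked %.0f%% of the first half of steps, %.0f%% of the second half'
      % (100 * firstHalf, 100 * secondHalf))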
Also, we can check how often the best machine isn't selected (this gives some insight into the overall error rate of Thompson Sampling). With five machines and N = 600, it's interesting to see that as much as 25% of the time the best machine is not selected, and sometimes even the worst machine is (although rarely).
Also, as curlycharcoal noted, negative rewards aren't assigned on every one of the N trials for every machine; a machine's counts are only updated when its random Beta draw is the maximum, so that machine is selected to provide "a sample". That said, if you play with the code below you might find that your posOnlyReward idea performs better, and converges faster, than the pos + neg reward... or does it? 😉