
In Chapter 5 of AI Crash Course, the author writes

 nSelected = nPosReward + nNegReward 

for i in range(d):
    print('Machine number ' + str(i + 1) + ' was selected ' + str(nSelected[i]) + ' times')
print('Conclusion: Best machine is machine number ' + str(np.argmax(nSelected) + 1))

Why are the number of negative rewards added to the number of positive rewards? To find the best machine, shouldn’t we only be concerned with the machine that has the most positive rewards? I’m confused about why we need to add the negative rewards to the positive rewards. Also, I understand that this is a simulation where you randomly assign successes and pre-assign success rates. However, in a real-life situation, how do you know the success rates of each slot machine ahead of time? And how do you know which machines should be assigned a “1”? Thank you so much! Here is the full code:

# Importing the libraries
import numpy as np

# Setting conversion rates and the number of samples
conversionRates = [0.15, 0.04, 0.13, 0.11, 0.05]
N = 10000
d = len(conversionRates)


# Creating the dataset
X = np.zeros((N, d))

for i in range(N):

    for j in range(d):
        if np.random.rand() < conversionRates[j]:
            X[i][j] = 1


# Making arrays to count our losses and wins
nPosReward = np.zeros(d)
nNegReward = np.zeros(d)


# Taking our best slot machine through beta distribution and updating its losses and wins
for i in range(N):
    selected = 0
    maxRandom = 0


    for j in range(d):
        randomBeta = np.random.beta(nPosReward[j] + 1, nNegReward[j] + 1)
        if randomBeta > maxRandom:
            maxRandom = randomBeta
            selected = j


    if X[i][selected] == 1:
        nPosReward[selected] += 1
    else:
        nNegReward[selected] += 1


# Showing which slot machine is considered the best

nSelected = nPosReward + nNegReward 

for i in range(d):
    print('Machine number ' + str(i + 1) + ' was selected ' + str(nSelected[i]) + ' times')
print('Conclusion: Best machine is machine number ' + str(np.argmax(nSelected) + 1))


Answers


  1. With more and more feedback, Thompson Sampling shifts its focus from exploration to exploitation. That is, with large nSelected values across the board (due to the large N), all the Beta distributions become concentrated around their means (approximately nPosReward[i]/nSelected[i]), so in later iterations, with increasing probability, Thompson Sampling picks the machine it currently believes is the most rewarding. Over a long enough horizon, the probability that the machine considered best is also the most often picked approaches 1.
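
    A quick way to see this concentration is to draw from Beta distributions built from small versus large counts. A minimal sketch (the counts below are illustrative, not from the book):

    import numpy as np
    np.random.seed(0)
    # Same 15% empirical rate, but 20x more observations in the second case.
    few  = np.random.beta(3 + 1, 17 + 1, size=5)    # 20 pulls so far
    many = np.random.beta(60 + 1, 340 + 1, size=5)  # 400 pulls so far
    print(few)   # draws vary widely around 0.15
    print(many)  # draws cluster tightly around 0.15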

    To sum up, your intuition is correct. The machine that is the most rewarding in expectation (given the feedback observed so far) is the one with the highest empirical mean. Due to the probabilistic phenomenon just described, if you run the algorithm for long enough, the most frequently picked machine and the machine with the highest expected reward will coincide with probability approaching 1.

    About the second part of your question: we don’t know the success rates. If we knew them, the optimal algorithm would simply pick the machine with the highest success rate at all times. What we do have in real life is the ability to observe outputs from these random processes. For example, when you show an online advertisement, you don’t know the probability that a given user will click it. However, under the gross simplification that everybody behaves the same way, by showing the ad to people and observing whether or not they click, we learn the success rate on the fly.
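
    As a minimal sketch of that idea (the variable names are mine; trueRate is hidden from the learner and is only used to simulate clicks):

    import numpy as np
    trueRate = 0.13                   # unknown in real life
    clicks = shows = 0
    for _ in range(5000):
        shows += 1
        if np.random.rand() < trueRate:
            clicks += 1
    print(clicks / shows)             # empirical rate, learned on the fly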

  2. Steven,
    (I’m writing this almost 7 months after you posted; hopefully you will reply and share some of your own insights, and the code you used to gain them.) I started the AI Crash Course at the end of November 2020, and I was similarly curious about the Thompson Sampling in Chapter 5. In my case I was mostly interested in the cases where Thompson Sampling doesn’t select the best machine, and in how often the worst machine was selected. So over the past six weeks I’ve tried over a thousand different variations of code to gain some insight. I’ve likely made over a thousand coding mistakes and gone down a hundred different rabbit holes in an attempt to "catch" when Thompson Sampling doesn’t work, and to understand how the np.random.beta function and the addition of posRewards and negRewards behave. There are likely still mistakes in the following code, and the overall approach to gaining insight could be made more graphical, so please be kind. :-)

    curlycharcoal provides some insight in his thoughtful answer above. Even Hadelin, in the same chapter, offers the reader a lot of insight. What follows is an attempt at an iterative, "trap the errors" approach that helped me. We can try something like the code below and compare the result of posReward + negReward vs posRewardOnly.

    Consider the following. First, insert a few lines of code that accumulate posRewardsOnly, and add some extra values to the concluding print statement so you can see both results. Also insert the true conversion rates (i.e., the means of the X columns) so you can show the actual rates used. Remove the multiple print statements about the selections of the other machines, just to clean up the output.

    Second, wrap a large outer loop around most of Hadelin’s original code and iterate over it. Since we inserted the posRewardOnly result into the concluding print, you can compare the results of adding in the negative rewards vs selecting the best machine with positive rewards only. (You can think of this outer loop as a crude "AI" test environment in which you gain insight into which approach is the better performer.)

    We could even insert an array that records, at each iteration, whether the machine was selected correctly with negativeRewards vs posRewardsOnly, and graph that at the end. (I haven’t done this, but it would be fun to see.)

    We can also insert an array to keep track of the np.random.beta selections in the inner loop, compare them to the actual best machine, and watch that selection do a drunkard’s walk over the time steps, eventually sobering up and settling on the best machine if N is large enough (usually a few thousand time steps, say N > 5000). A sketch of this follows.
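
    Here is a minimal, self-contained sketch of that tracking idea (a vectorized rewrite of Hadelin’s loop; selectionHistory and the plotting are my additions, not from the book):

    import numpy as np
    import matplotlib.pyplot as plt

    conversionRates = [0.15, 0.04, 0.13, 0.11, 0.05]
    N = 5000
    d = len(conversionRates)
    X = (np.random.rand(N, d) < conversionRates).astype(int)
    nPos = np.zeros(d)
    nNeg = np.zeros(d)
    selectionHistory = []
    for i in range(N):
        draws = np.random.beta(nPos + 1, nNeg + 1)  # one draw per machine
        selected = np.argmax(draws)
        selectionHistory.append(selected + 1)
        if X[i][selected] == 1:
            nPos[selected] += 1
        else:
            nNeg[selected] += 1
    plt.plot(selectionHistory, '.', markersize=2)
    plt.xlabel('time step')
    plt.ylabel('machine selected')
    plt.show()  # wanders early, then settles on the best machine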

    Also, we can check whether there are times when the best machine isn’t selected (this provides some insight into the error rate of Thompson Sampling overall). With five machines and N = 600, it’s interesting to see the best machine fail to be selected as much as 25% of the time, and sometimes (although rarely) even the worst machine selected.

    Also, as curlycharcoal noted, negative rewards aren’t assigned at every time step for every machine; they’re only assigned to the machine whose np.random.beta draw is the maximum, since that machine is the one selected to provide "a sample". That said, if you play with the code below you might find that your posOnlyReward idea performs better and converges faster than the pos+neg reward… or does it? 😉



    ######################################################################
    # Try and catch when Thompson fails:
    # Also, compare performance of selecting 
    # based on negRewards+posRewards vs just posRewards
    # 03 Jan 2021 JFB
    #
    #######################################################################
    import numpy as np
    np.set_printoptions(precision=2)
    
    # Note: the following are the target conversion rates.
    # Further down, remember to compare the actual rates against the
    # selected machine. Also, in later versions, reorder the rates from
    # low to high and vice versa to determine whether there is some
    # "greedy Thompson" bias based on the order of the best rates.
    conversionRates = [0.15, 0.04, 0.13, 0.11, 0.05]  # Hadelin's AI Crash Course
    
    N = 200
    # Increasing N improves the result; Hadelin explains this in the same
    # chapter. I've found that 10,000 results in about 1% error,
    # 2,000 in about 20% error (give or take) when using Hadelin's
    # original conversion rates above, and 100 results in about 48% error.
    # posRewards + negRewards disagrees with posRewardOnly by a varying
    # percentage; my initial sampling indicates it will be tricky to
    # determine which performs better over a variety of situations. But
    # Hadelin provides code to create "tests" with various input states
    # and policies.
    
    catchLimit = 100
    
    d = len(conversionRates)
    wrong = 0.0
    pcntWrong = 0.0

    selectedWrong = 0.0
    posOnlyWrong = 0.0

    nSelectedArgMax = -1
    nSelectedArgMaxPosOnly = -1
    
    # Run exactly catchLimit trials so the percentages below are exact.
    for ii in range(catchLimit):
    
        ################---- Original X generator ----#########################
        # Creating the set of bandit payouts at each time step t.
        # Five columns, many rows: a 1 means the slot machine would have
        # paid out had you selected it at this point in time.
        # This can be improved so we can order the rates from best to
        # worst, and vice versa.
        #
        X = np.zeros((N, d))
        for i in range(N):
            for j in range(d):
                if np.random.rand() < conversionRates[j]:
                    X[i][j] = 1
    
        Xmean = X.mean(axis=0)
        ##############-- end of the Original X generator----###################
        
        # Make arrays to count rewards from the table of losses and wins.
        nPosReward = np.zeros(d)
        nNegReward = np.zeros(d)
        
        # Take each round's Beta draws, select the machine with the highest
        # draw, and update its wins and losses, with the goal of determining
        # which slot machine is the best (sometimes the best one isn't found).
        for i in range(N):
            selected = 0
            maxRandom = 0
            for j in range(d):
                randomBeta = np.random.beta(nPosReward[j] + 1, 
                                            nNegReward[j] + 1)
                if randomBeta > maxRandom:
                    maxRandom = randomBeta
                    selected = j
            if X[i][selected] == 1:
                nPosReward[selected] +=1
            else:
                nNegReward[selected] +=1
                
        nSelected = nPosReward + nNegReward
        nSelectedPosOnly = nPosReward
        
        nSelectedArgMax = np.argmax(nSelected) + 1
        nSelectedArgMaxPosOnly = np.argmax(nSelectedPosOnly) + 1
        
        XMeanArgMax = np.argmax(Xmean) + 1  # find the actual true best slot machine
    
        if ( nSelectedArgMax != XMeanArgMax and
             XMeanArgMax != nSelectedArgMaxPosOnly):
            #for i in range(d):
                #print('Machine number ' + str(i+1) + ' was selected ' + str(nSelected[i]) + ' times')
            print('Fail: Pos&Neg predicts slot ' + str(nSelectedArgMax),
                  'posOnly predicts ' + str(nSelectedArgMaxPosOnly),
                  'but Xconv rates', Xmean, 'actual best =', XMeanArgMax, '<>')
            wrong += 1

        elif ( nSelectedArgMax != XMeanArgMax and
               XMeanArgMax == nSelectedArgMaxPosOnly):
            print('PosOnly==Actual: pos&neg predicts ' + str(nSelectedArgMax),
                  'posOnly predicts ' + str(nSelectedArgMaxPosOnly),
                  'but Xconv rates', Xmean, 'actual best =', XMeanArgMax, '*')
            selectedWrong += 1

        elif ( nSelectedArgMax == XMeanArgMax and
               XMeanArgMax != nSelectedArgMaxPosOnly):
            print('PosNeg==Actual: pos&neg predicts ' + str(nSelectedArgMax),
                  'posOnly predicts ' + str(nSelectedArgMaxPosOnly),
                  'but Xconv rates', Xmean, 'actual best =', XMeanArgMax, '***')
            posOnlyWrong += 1

        # (The case where pos+neg == posOnly but both differ from the actual
        # best is already counted as wrong by the first branch above.)
            
    pcntWrong = wrong / catchLimit * 100
    pcntSelectedWrong = selectedWrong / catchLimit * 100
    pcntPosOnlyWrong = posOnlyWrong / catchLimit * 100

    print('Catch Limit =', catchLimit, 'N =', N)
    print('<>  pos+neg != Actual and PosOnly != Actual   Failure rate = %.1f' % pcntWrong, '%')
    print('*   PosOnly == Actual but pos+neg != Actual   Failure rate = %.1f' % pcntSelectedWrong, '%')
    print('*** pos+neg == Actual but PosOnly != Actual   Failure rate = %.1f' % pcntPosOnlyWrong, '%')
    
    ############# END #################
    