# CS440: Introduction to Artificial Intelligence

Instructor: Chuck Anderson


# Reinforcement Learning for Two-Player Games

How does Tic-Tac-Toe differ from the maze problem?

• Different state and action sets.
• Two players rather than one.
• Reinforcement is 0 until end of game, when it is 1 for win, 0 for draw, or -1 for loss.
• Maximizing sum of reinforcement rather than minimizing.
• Anything else?

## Representing the Q Table

The state is the board configuration. There are $3^9$ of them, though not all are reachable. Is this too big?

It is a bit less than 20,000. Not bad. Is this the full size of the Q table?

No. We must add the action dimension. There are at most 9 actions, one for each cell on the board. So the Q table will contain at most about $20{,}000 \cdot 9 = 180{,}000$ values. No worries.
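The arithmetic can be checked directly. This is only a quick sketch, and the variable names `nStates` and `nQValues` are illustrative, not part of the code that follows.

```python
# Upper bound on board configurations: each of the 9 cells is 'X', 'O', or blank.
nStates = 3 ** 9
# At most 9 moves per board, so this bounds the number of Q values.
nQValues = nStates * 9

print(nStates)    # 19683
print(nQValues)   # 177147
```

Both numbers are upper bounds, since many configurations are unreachable and filled cells are not legal moves.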

Instead of thinking about the Q table as a three-dimensional array, as we did last time, let's be more Pythonic and use a dictionary. Use the pair of current state and move as the key; the value associated with that key is the Q value for taking that move in that state.

We still need a way to represent a board.

How about an array of characters? So

    X |   | O
    ---------
      | X | O
    ---------
    X |   |

would be

    board = np.array(['X',' ','O', ' ','X','O', 'X',' ',' '])

The initial board would be

    board = np.array([' ']*9)

We can represent a move as an index, 0 to 8, into this array.
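A short sketch of how moves act on this representation, assuming NumPy is imported as `np`. The name `validMoves` mirrors the one used in the main loop later on. Copying the board before changing it is a choice made here for illustration.

```python
import numpy as np

# The example board from above, stored row by row.
board = np.array(['X', ' ', 'O',
                  ' ', 'X', 'O',
                  'X', ' ', ' '])

# Indices of the empty cells are the legal moves.
validMoves = np.where(board == ' ')[0]
print(validMoves)            # [1 3 7 8]

# Playing a move means writing the player's mark at that index.
boardNew = board.copy()      # leave the original board untouched
boardNew[3] = 'X'
print(boardNew.reshape(3, 3))
```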

What should the reinforcement values be?

How about 0 every move except when X wins, with a reinforcement of 1, and when O wins, with a reinforcement of -1.

For the above board, let's say we, meaning Player X, prefer the move to index 3. This move completes the left column, so it always results in a win, and the Q value for the move to 3 should be 1. What other Q values do you know?

The move to 8 also wins, by completing a diagonal, so its Q value should be 1 as well. If we don't play a winning move, O could win in one move by playing 8. So the other moves might have Q values close to -1, depending on the skill of Player O. In the following discussion we will be using a random player for O, which does not always find its winning move, so the Q value for a move other than 3 or 8 will be above -1.
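The claim about which moves win immediately can be checked by trying each legal move. This is a minimal sketch, and `LINES` and `wins` are hypothetical helper names introduced only for this check.

```python
import numpy as np

# The eight winning lines: three rows, three columns, two diagonals.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def wins(board, mark):
    """Return True if `mark` fills any complete line on `board`."""
    return any(all(board[i] == mark for i in line) for line in LINES)

board = np.array(['X', ' ', 'O',
                  ' ', 'X', 'O',
                  'X', ' ', ' '])

# Try every empty cell for X and keep the moves that win at once.
winningMoves = []
for m in np.where(board == ' ')[0]:
    trial = board.copy()
    trial[m] = 'X'
    if wins(trial, 'X'):
        winningMoves.append(int(m))

print(winningMoves)   # [3, 8]
```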

How will these values be stored? We could assign them like

    for i, v in enumerate([0, -0.8, 0,  1, 0, 0,  0, -0.7, 0.1]):
        Q[(tuple(board), i)] = v

To update a value for move m, we can just

    Q[(tuple(board), m)] = newValue    # some new value

If a key has not yet been assigned, we must assign it something before accessing it.

    if (tuple(board), move) not in Q:
        Q[(tuple(board), move)] = 0
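The board is converted with `tuple(board)` because NumPy arrays cannot be used as dictionary keys. As an alternative to testing for the key each time, the standard dictionary methods `get` and `setdefault` can supply the default. This is a minimal sketch, and the learning rate of 0.1 and target of 1 in the update line are arbitrary illustrative values.

```python
import numpy as np

Q = {}
board = np.array([' '] * 9)
key = (tuple(board), 4)          # state and move together form the key

# dict.get returns the default without storing anything in Q.
print(Q.get(key, 0))             # 0
print(key in Q)                  # False

# dict.setdefault stores the default if the key is missing, then returns it.
Q.setdefault(key, 0)
Q[key] += 0.1 * (1 - Q[key])     # now safe to update in place
print(Q[key])                    # 0.1
```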

## Agent-World Interaction Loop

For our agent to interact with its world, we must implement

• Initialize Q.
• Set initial state, as empty board.
• Repeat:
  • Agent chooses next X move.
  • If X wins, set Q(board,move) to 1.
  • Else, if board is full, set Q(board,move) to 0.
  • Else, let O take move.
    • If O won, update Q(board,move) by rho (-1 - Q(board,move)), where rho is the learning rate.
  • For all cases, except on the first move when there is no old board, update Q(oldboard,oldmove) by rho (Q(board,move) - Q(oldboard,oldmove)).
  • Shift current board and move to old ones.

## Now the Python

First, here is the result of running many games.

[Figure: game outcomes plotted over many games.]

Now, let's get some function definitions out of the way.

    import numpy as np
    import matplotlib.pyplot as plt
    import random
    from copy import copy

    ######################################################################
    def winner(board):
        # Each consecutive triple of indices is one winning line.
        combos = np.array((0,1,2, 3,4,5, 6,7,8, 0,3,6, 1,4,7, 2,5,8, 0,4,8, 2,4,6))
        lines = board[combos].reshape((-1, 3))
        return bool(np.any(np.logical_or(np.all(lines == 'X', axis=1),
                                         np.all(lines == 'O', axis=1))))

    def printBoard(board):
        print("""
    %c|%c|%c
    -----
    %c|%c|%c
    -----
    %c|%c|%c""" % tuple(board))

    def plotOutcomes(outcomes, nGames, game):
        if game == 0:
            return
        plt.clf()
        nBins = 100
        nPer = nGames // nBins          # games per bin; must be an integer for reshape
        outcomeRows = outcomes.reshape((-1, nPer))
        outcomeRows = outcomeRows[:int(game/float(nPer))+1, :]
        avgs = np.mean(outcomeRows, axis=1)
        # Top plot: average outcome within each bin of games
        plt.subplot(2, 1, 1)
        xs = np.linspace(nPer, game, len(avgs))
        plt.plot(xs, avgs)
        plt.xlabel('Games')
        plt.ylabel('Result (0=draw, 1=X win, -1=O win)')
        # Bottom plot: number of losses, draws, and wins within each bin
        plt.subplot(2, 1, 2)
        plt.plot(xs, np.sum(outcomeRows == -1, axis=1), 'r-', label='Losses')
        plt.plot(xs, np.sum(outcomeRows == 0, axis=1), 'b-', label='Draws')
        plt.plot(xs, np.sum(outcomeRows == 1, axis=1), 'g-', label='Wins')
        plt.legend(loc="center")
        plt.draw()


Now some initializations. How do we initialize the Q table?

    plt.ion()
    nGames = 10000                          # number of games
    rho = 0.1                               # learning rate
    epsilonExp = 0.999                      # rate of epsilon decay
    outcomes = np.zeros(nGames)             # 0 draw, 1 X win, -1 O win
    Q = {}                                  # initialize Q dictionary
    epsilon = 1.0                           # initial epsilon value
    showMoves = False                       # flag to print each board change

We must talk a bit about epsilon. This is the probability that a random action is taken at each step. If it is 1, then every action is random. Initially we want random actions, but as experience is gained, this should be reduced towards zero. An easy way to do this is to decay it exponentially towards zero, like

$$\begin{align*} \epsilon_0 &= 1 \\ \epsilon_{k+1} &= 0.999\,\epsilon_k \end{align*}$$

or with some other decay factor close to 1.
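To get a feel for how quickly this schedule decays, the sketch below applies the factor once per game, as the main loop does, and records epsilon at a few checkpoints. The checkpoint game numbers and the `checkpoints` dictionary are arbitrary choices for illustration.

```python
epsilonExp = 0.999   # decay factor, as in the initializations above
epsilon = 1.0

# Apply the decay once per game and record epsilon at a few checkpoints.
checkpoints = {}
for game in range(1, 10001):
    epsilon *= epsilonExp
    if game in (100, 1000, 5000, 10000):
        checkpoints[game] = epsilon

for game, eps in checkpoints.items():
    print(game, eps)
```

With this factor, epsilon is still roughly 0.9 after 100 games and falls to roughly 0.37 after 1,000 games, so most early moves are exploratory and play is almost fully greedy by the end of 10,000 games.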

Now for the main loop over multiple games. The basic structure of the main loop is

    for game in range(nGames):              # iterate over multiple games

        epsilon *= epsilonExp
        step = 0
        board = np.array([' ']*9)
        done = False

        while not done:

            # play one game

Now again, with the guts of the loop.

    for game in range(nGames):              # iterate over multiple games

        epsilon *= epsilonExp
        step = 0
        board = np.array([' ']*9)
        done = False

        while not done:
            step += 1
            # X's turn

            validMoves = np.where(board == ' ')[0]
            if np.random.uniform() < epsilon:
                # Random move
                move = np.random.choice(validMoves)
            else:
                # Greedy move. Collect Q values for valid moves from current board.
                # Select move with highest Q value
                qs = []
                for m in validMoves:
                    qs.append(Q.get((tuple(board), m), 0))
                move = validMoves[np.argmax(np.asarray(qs))]

            if (tuple(board), move) not in Q:
                Q[(tuple(board), move)] = 0

            boardNew = copy(board)
            boardNew[move] = 'X'
            if showMoves:
                printBoard(boardNew)

            if winner(boardNew):
                # X won
                Q[(tuple(board), move)] = 1
                done = True
                outcomes[game] = 1

            elif not np.any(boardNew == ' '):
                # Game over. No winner.
                Q[(tuple(board), move)] = 0
                done = True
                outcomes[game] = 0

            else:
                # O's turn. Random player
                validMoves = np.where(boardNew == ' ')[0]
                moveO = np.random.choice(validMoves)
                boardNew[moveO] = 'O'
                if showMoves:
                    printBoard(boardNew)
                if winner(boardNew):
                    # O won
                    Q[(tuple(board), move)] += rho * (-1 - Q[(tuple(board), move)])
                    done = True
                    outcomes[game] = -1

            if step > 1:
                # Move the previous Q value toward the current one
                Q[(tuple(boardOld), moveOld)] += rho * (Q[(tuple(board), move)] - Q[(tuple(boardOld), moveOld)])

            boardOld, moveOld = board, move
            board = boardNew

        if game % (nGames // 10) == 0 or game == nGames - 1:
            plotOutcomes(outcomes, nGames, game)

    print("Outcomes: %d draws, %d X wins, %d O wins" %
          (np.sum(outcomes == 0), np.sum(outcomes == 1), np.sum(outcomes == -1)))

Let's discuss all of the lines where the Q function is modified.
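Collected in one place, the four updates in the loop can be written as below. The notation is an assumption made here for compactness: $s, a$ stand for the current `board` and `move`, $s_{old}, a_{old}$ for `boardOld` and `moveOld`, and $\rho$ for the learning rate `rho`. This is a summary read off the code, not an additional rule.

```latex
\begin{align*}
Q(s,a) &\leftarrow 1 && \text{if X wins} \\
Q(s,a) &\leftarrow 0 && \text{if the board fills with no winner} \\
Q(s,a) &\leftarrow Q(s,a) + \rho\,\bigl(-1 - Q(s,a)\bigr) && \text{if O wins} \\
Q(s_{old},a_{old}) &\leftarrow Q(s_{old},a_{old}) + \rho\,\bigl(Q(s,a) - Q(s_{old},a_{old})\bigr) && \text{on every step after the first}
\end{align*}
```

The first two are direct assignments because the outcome of X's final move is certain. The O-win update uses the learning rate because the random opponent makes the outcome of X's move uncertain.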

That's not much code. Now let's see the rest of it.

Ha ha! There's nothing else! Python conciseness!

You should be able to download and paste together these pieces into running code. You can also adapt it to many other two-player games, by first extracting the parts specific to Tic-Tac-Toe into functions so the basic loop need not be modified to apply it to another game.
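One way such an extraction might look is sketched below. Everything here is an assumption for illustration: the function `trainQ` and the four hooks `newGame`, `validMoves`, `makeMove`, and `winner` are hypothetical names, plotting is omitted, and boards are stored as tuples rather than NumPy arrays so they can serve directly as dictionary keys. The sketch also adds a draw check after O's move, which Tic-Tac-Toe never needs but other games may.

```python
import random

def trainQ(nGames, newGame, validMoves, makeMove, winner, rho=0.1, epsilonExp=0.999):
    """Game-independent training loop; the four function arguments hold
    everything that is specific to one game."""
    Q, outcomes, epsilon = {}, [], 1.0
    for game in range(nGames):
        epsilon *= epsilonExp
        state, old, done = newGame(), None, False
        while not done:
            # X's turn: random move with probability epsilon, else greedy.
            moves = validMoves(state)
            if random.random() < epsilon:
                move = random.choice(moves)
            else:
                move = max(moves, key=lambda m: Q.get((state, m), 0))
            key = (state, move)
            Q.setdefault(key, 0)
            stateNew = makeMove(state, move, 'X')
            if winner(stateNew):                      # X won
                Q[key] = 1
                outcomes.append(1)
                done = True
            elif not validMoves(stateNew):            # draw after X's move
                Q[key] = 0
                outcomes.append(0)
                done = True
            else:                                     # O's turn, random player
                moveO = random.choice(validMoves(stateNew))
                stateNew = makeMove(stateNew, moveO, 'O')
                if winner(stateNew):                  # O won
                    Q[key] += rho * (-1 - Q[key])
                    outcomes.append(-1)
                    done = True
                elif not validMoves(stateNew):        # draw after O's move
                    outcomes.append(0)
                    done = True
            if old is not None:
                # Move the previous Q value toward the current one.
                Q[old] += rho * (Q[key] - Q[old])
            old, state = key, stateNew
    return Q, outcomes

# Tic-Tac-Toe specific pieces, with the board as a tuple of 9 characters.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
         (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]

def newGame():
    return (' ',) * 9

def validMoves(board):
    return [i for i, c in enumerate(board) if c == ' ']

def makeMove(board, move, mark):
    return board[:move] + (mark,) + board[move + 1:]

def winner(board):
    return any(board[a] != ' ' and board[a] == board[b] == board[c]
               for a, b, c in LINES)

random.seed(0)
Q, outcomes = trainQ(2000, newGame, validMoves, makeMove, winner)
print(len(Q), "Q values learned;", outcomes.count(1), "X wins")
```

To try another game under this design, only the four hook functions would need to be replaced, and `trainQ` stays unchanged.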

Now let's discuss the reinforcement learning assignment, Assignment 5.