<jsm> \newcommand{\Rv}{\mathbf{R}} \newcommand{\rv}{\mathbf{r}} \newcommand{\Qv}{\mathbf{Q}} \newcommand{\Av}{\mathbf{A}} \newcommand{\Aiv}{\mathbf{Ai}} \newcommand{\av}{\mathbf{a}} \newcommand{\xv}{\mathbf{x}} \newcommand{\Xv}{\mathbf{X}} \newcommand{\yv}{\mathbf{y}} \newcommand{\Yv}{\mathbf{Y}} \newcommand{\zv}{\mathbf{z}} \newcommand{\av}{\mathbf{a}} \newcommand{\Wv}{\mathbf{W}} \newcommand{\wv}{\mathbf{w}} \newcommand{\betav}{\mathbf{\beta}} \newcommand{\gv}{\mathbf{g}} \newcommand{\Hv}{\mathbf{H}} \newcommand{\dv}{\mathbf{d}} \newcommand{\Vv}{\mathbf{V}} \newcommand{\vv}{\mathbf{v}} \newcommand{\Uv}{\mathbf{U}} \newcommand{\uv}{\mathbf{u}} \newcommand{\tv}{\mathbf{t}} \newcommand{\Tv}{\mathbf{T}} \newcommand{\TDv}{\mathbf{TD}} \newcommand{\Tiv}{\mathbf{Ti}} \newcommand{\Sv}{\mathbf{S}} \newcommand{\Gv}{\mathbf{G}} \newcommand{\zv}{\mathbf{z}} \newcommand{\Zv}{\mathbf{Z}} \newcommand{\Norm}{\mathcal{N}} \newcommand{\muv}{\boldsymbol{\mu}} \newcommand{\sigmav}{\boldsymbol{\sigma}} \newcommand{\phiv}{\boldsymbol{\phi}} \newcommand{\Phiv}{\boldsymbol{\Phi}} \newcommand{\Sigmav}{\boldsymbol{\Sigma}} \newcommand{\Lambdav}{\boldsymbol{\Lambda}} \newcommand{\half}{\frac{1}{2}} \newcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}} \newcommand{\argmin}[1]{\underset{#1}{\operatorname{argmin}}} \newcommand{\dimensionbar}[1]{\underset{#1}{\operatorname{|}}} \newcommand{\grad}{\mathbf{\nabla}} \newcommand{\ebx}[1]{e^{\betav_{#1}^T \xv_n}} \newcommand{\eby}[1]{e^{y_{n,#1}}} \newcommand{\Tiv}{\mathbf{Ti}} \newcommand{\Fv}{\mathbf{F}} \newcommand{\ones}[1]{\mathbf{1}_{#1}} </jsm>
How does Tic-Tac-Toe differ from the maze problem?
The state is the board configuration. There are $3^9$ of them, though not all are reachable. Is this too big?
It is a bit less than 20,000. Not bad. Is this the full size of the Q table?
No. We must add the action dimension. There are at most 9 actions, one for each cell on the board. So the Q table will contain at most about $20,000 \cdot 9$ values, or about 180,000. No worries.
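As a quick check of these numbers, here is a small sketch that computes the upper bounds directly. The variable names are only illustrative, and the counts include boards that can never occur in a real game.

```python
# Upper bound on board configurations: each of the 9 cells is 'X', 'O', or blank.
nCells = 9
nStates = 3 ** nCells            # counts unreachable boards too
nActions = 9                     # at most one move per cell
nQValues = nStates * nActions    # upper bound on the number of Q values

print(nStates, nQValues)
```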
Instead of thinking about the Q table as a three-dimensional array, as we did last time, let's be more pythonic and use a dictionary. Use the current state together with a move as the key, and the value associated with that key is the Q value for taking that move in that state.
We still need a way to represent a board.
How about an array of characters? So
X |   | O
---------
  | X | O
---------
X |   |
would be
board = np.array(['X',' ','O', ' ','X','O', 'X',' ',' '])
The initial board would be
board = np.array([' ']*9)
We can represent a move as an index, 0 to 8, into this array.
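To make this concrete, here is a minimal sketch of applying a move to a board and listing the cells that remain open. The choice of the center cell is just an example, and `key` shows one way, assumed here, to turn a board and a move into something that can be used as a dictionary key.

```python
import numpy as np

board = np.array([' '] * 9)          # empty board
move = 4                             # center cell, as an index 0 to 8

boardNew = board.copy()              # leave the original board untouched
boardNew[move] = 'X'

validMoves = np.where(boardNew == ' ')[0]   # indices of the empty cells
key = (tuple(board), move)           # arrays are not hashable, tuples are

print(validMoves)
```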
What should the reinforcement values be?
How about 0 for every move, except a reinforcement of 1 when X wins and a reinforcement of -1 when O wins?
For the above board, let's say we, meaning Player X, prefer a move to index 3. This completes the left column, so it always results in a win. So the Q value for a move to 3 should be 1. What other Q values do you know?
If we don't play a move to win, O could win in one move. So the other moves might have Q values close to -1, depending on the skill of Player O. In the following discussion we will be using a random player for O, so the Q value for a move other than 8 or 3 will be close to but not exactly -1.
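We can check which moves win immediately for X on the example board. The sketch below tries each open cell in turn. The helper `xWins` is a hypothetical name introduced only for this check.

```python
import numpy as np

def xWins(board):
    # Hypothetical helper: True if 'X' fills any row, column, or diagonal.
    lines = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]
    return any(all(board[i] == 'X' for i in line) for line in lines)

board = np.array(['X', ' ', 'O',  ' ', 'X', 'O',  'X', ' ', ' '])

winningMoves = []
for m in np.where(board == ' ')[0]:
    trial = board.copy()
    trial[m] = 'X'                 # try X in this open cell
    if xWins(trial):
        winningMoves.append(int(m))

print(winningMoves)
```

Only indices 3 and 8 are reported, which matches the two winning moves mentioned above.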
How will these values be stored? We could assign them like
for i,v in enumerate([0,-0.8,0, 1,0,0, 0,-0.7,0.1]):
    Q[(tuple(board),i)] = v
To update a value for move m, we can just
Q[(tuple(board),m)] = some new value
If a key has not yet been assigned, we must assign it something before accessing it.
if (tuple(board),move) not in Q:
    Q[(tuple(board),move)] = 0
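The standard dictionary methods `get` and `setdefault` do the same job in one line each. This is a sketch of one option, not a required change; the step size 0.1 and the target 1 are arbitrary example numbers.

```python
import numpy as np

Q = {}
board = np.array([' '] * 9)
move = 4
key = (tuple(board), move)

q = Q.get(key, 0)              # read: returns 0 for an unseen key, Q is unchanged
Q.setdefault(key, 0)           # write: inserts 0 only if the key is missing
Q[key] += 0.1 * (1 - Q[key])   # now it is safe to update in place

print(q, Q[key])
```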
For our agent to interact with its world, we must implement
First, here is the result of running tons of games.
First, let's get some function definitions out of the way.
import numpy as np
import matplotlib.pyplot as plt
import random
from copy import copy

######################################################################

def winner(board):
    # Indices of the 8 winning lines: 3 rows, 3 columns, 2 diagonals
    combos = np.array((0,1,2, 3,4,5, 6,7,8, 0,3,6, 1,4,7, 2,5,8, 0,4,8, 2,4,6))
    # True if either player fills all three cells of any line
    return bool(np.any(np.logical_or(np.all('X' == board[combos].reshape((-1,3)), axis=1),
                                     np.all('O' == board[combos].reshape((-1,3)), axis=1))))

def printBoard(board):
    print("""
%c|%c|%c
-----
%c|%c|%c
-----
%c|%c|%c""" % tuple(board))

def plotOutcomes(outcomes,nGames,game):
    if game==0:
        return
    plt.clf()
    nBins = 100
    nPer = nGames // nBins     # games per bin; must be an integer for reshape
    outcomeRows = outcomes.reshape((-1,nPer))
    outcomeRows = outcomeRows[:int(game/float(nPer))+1,:]   # bins played so far
    avgs = np.mean(outcomeRows,axis=1)
    # Top plot: average outcome per bin
    plt.subplot(2,1,1)
    xs = np.linspace(nPer,game,len(avgs))
    plt.plot(xs, avgs)
    plt.xlabel('Games')
    plt.ylabel('Result (0=draw, 1=X win, -1=O win)')
    # Bottom plot: number of losses, draws, and wins per bin
    plt.subplot(2,1,2)
    plt.plot(xs,np.sum(outcomeRows==-1,axis=1),'r-',label='Losses')
    plt.plot(xs,np.sum(outcomeRows==0,axis=1),'b-',label='Draws')
    plt.plot(xs,np.sum(outcomeRows==1,axis=1),'g-',label='Wins')
    plt.legend(loc="center")
    plt.draw()
Now some initializations. How do we initialize the Q table???
plt.ion()
nGames = 10000 # number of games
rho = 0.1 # learning rate
epsilonExp = 0.999 # rate of epsilon decay
outcomes = np.zeros(nGames) # 0 draw, 1 X win, -1 O win
Q = {} # initialize Q dictionary
epsilon = 1.0 # initial epsilon value
showMoves = False # flag to print each board change
We must talk a bit about epsilon. This is the probability that a
random action is taken each step. If it is 1, then every action is
random. Initially we want random actions, but as experience is
gained, this should be reduced, towards zero. An easy way to do this
is to exponentially decay it towards zero, like
<jsmath>
\begin{align*}
\epsilon_0 &= 1 \\
\epsilon_{k+1} &= 0.999\,\epsilon_k
\end{align*}
</jsmath>
or some other decay factor slightly less than 1.
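To get a feel for how fast this decays, here is a small sketch that applies the update once per game and compares it with the closed form $\epsilon_k = 0.999^k$. The checkpoints are arbitrary choices for illustration.

```python
epsilonExp = 0.999     # decay factor, as in the initializations above
epsilon = 1.0
history = []
for game in range(10000):
    epsilon *= epsilonExp
    history.append(epsilon)

# After k games, epsilon should equal epsilonExp ** k
for k in (1000, 5000, 10000):
    print(k, history[k - 1], epsilonExp ** k)
```

With this factor, roughly a third of the moves are still random after 1000 games, and almost none are by the end of 10,000 games.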
Now for the main loop over multiple games. The basic structure of the main loop is
for game in range(nGames):        # iterate over multiple games
    epsilon *= epsilonExp
    step = 0
    board = np.array([' ']*9)
    done = False
    while not done:
        # play one game
Now again, with the guts of the loop.
for game in range(nGames):        # iterate over multiple games
    epsilon *= epsilonExp
    step = 0
    board = np.array([' ']*9)
    done = False

    while not done:
        step += 1

        # X's turn
        validMoves = np.where(board==' ')[0]
        if np.random.uniform() < epsilon:
            # Random move
            move = np.random.choice(validMoves)
        else:
            # Greedy move. Collect Q values for valid moves from current board.
            # Select move with highest Q value
            qs = []
            for m in validMoves:
                qs.append(Q.get((tuple(board),m), 0))
            move = validMoves[np.argmax(np.asarray(qs))]

        # Make sure this (board, move) pair has an entry before updating it
        if (tuple(board),move) not in Q:
            Q[(tuple(board),move)] = 0

        boardNew = copy(board)
        boardNew[move] = 'X'
        if showMoves:
            printBoard(boardNew)

        if winner(boardNew):
            # X won
            Q[(tuple(board),move)] = 1
            done = True
            outcomes[game] = 1

        elif not np.any(boardNew == ' '):
            # Game over. No winner.
            Q[(tuple(board),move)] = 0
            done = True
            outcomes[game] = 0

        else:
            # O's turn. Random player
            validMoves = np.where(boardNew==' ')[0]
            moveO = np.random.choice(validMoves)
            boardNew[moveO] = 'O'
            if showMoves:
                printBoard(boardNew)
            if winner(boardNew):
                # O won
                Q[(tuple(board),move)] += rho * (-1 - Q[(tuple(board),move)])
                done = True
                outcomes[game] = -1

        if step > 1:
            # Temporal-difference update of X's previous move
            Q[(tuple(boardOld),moveOld)] += rho * (Q[(tuple(board),move)] - Q[(tuple(boardOld),moveOld)])

        boardOld,moveOld = board,move
        board = boardNew

    if game % (nGames//10) == 0 or game == nGames-1:
        plotOutcomes(outcomes,nGames,game)
        print("Outcomes %d draws, %d X wins, %d O wins" %
              (np.sum(outcomes==0), np.sum(outcomes==1), np.sum(outcomes==-1)))
Let's discuss all of the lines where the Q function is modified.
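As a starting point for that discussion, here is a sketch of the kinds of update that appear in the loop, pulled out into a small helper. The name `tdUpdate` and the placeholder keys are assumptions made for illustration; they are not part of the code above.

```python
def tdUpdate(Q, key, target, rho):
    # Hypothetical helper: move Q[key] a fraction rho of the way toward target.
    old = Q.get(key, 0)
    Q[key] = old + rho * (target - old)
    return Q[key]

rho = 0.1
Q = {}
prev, last = ('s0', 0), ('s1', 3)     # placeholder (state, move) keys

# 1. X wins (or the game is drawn): the outcome follows directly from
#    X's move, so the Q value is assigned directly.
Q[last] = 1

# 2. Earlier move by X: step toward the Q value of the move that followed it.
tdUpdate(Q, prev, Q[last], rho)

# 3. O wins: step toward -1. Presumably a step rather than an assignment
#    because the loss depends on O's random reply.
tdUpdate(Q, last, -1, rho)

print(Q)
```

The three calls are shown in sequence only to exercise each kind of update; they do not represent a single game.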
That's not much code. Now let's see the rest of it.
Ha ha! There's nothing else! Python conciseness!
You should be able to download and paste together these pieces into running code. You can also adapt it to many other two-player games, by first extracting the parts specific to Tic-Tac-Toe into functions so the basic loop need not be modified to apply it to another game.
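One possible way to start that extraction is sketched below. All of the function names here are hypothetical, and this is only one of many reasonable ways to split the code. The game-specific functions know about the board, while `epsilonGreedy` only calls those functions and so could be reused for another game.

```python
import numpy as np

# Game-specific functions for Tic-Tac-Toe (hypothetical names).
def initialState():
    return np.array([' '] * 9)

def validMoves(state):
    return np.where(state == ' ')[0]

def makeMove(state, move, player):
    stateNew = state.copy()
    stateNew[move] = player
    return stateNew

def stateKey(state, move):
    # Hashable key for the Q dictionary
    return (tuple(state), int(move))

# Game-independent function: epsilon-greedy choice among the valid moves.
def epsilonGreedy(Q, state, epsilon):
    moves = validMoves(state)
    if np.random.uniform() < epsilon:
        return int(np.random.choice(moves))            # random move
    qs = [Q.get(stateKey(state, m), 0) for m in moves]
    return int(moves[np.argmax(qs)])                   # greedy move

# Usage: with epsilon = 0 the choice is purely greedy.
Q = {stateKey(initialState(), 4): 0.5}
move = epsilonGreedy(Q, initialState(), epsilon=0.0)
print(move)
```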
Now let's discuss the reinforcement learning assignment, Assignment 5.