\newcommand{\Rv}{\mathbf{R}} \newcommand{\rv}{\mathbf{r}} \newcommand{\Qv}{\mathbf{Q}} \newcommand{\Av}{\mathbf{A}} \newcommand{\Aiv}{\mathbf{Ai}} \newcommand{\av}{\mathbf{a}}
\newcommand{\xv}{\mathbf{x}} \newcommand{\Xv}{\mathbf{X}} \newcommand{\yv}{\mathbf{y}} \newcommand{\Yv}{\mathbf{Y}} \newcommand{\zv}{\mathbf{z}} \newcommand{\Wv}{\mathbf{W}}
\newcommand{\wv}{\mathbf{w}} \newcommand{\betav}{\mathbf{\beta}} \newcommand{\gv}{\mathbf{g}} \newcommand{\Hv}{\mathbf{H}} \newcommand{\dv}{\mathbf{d}} \newcommand{\Vv}{\mathbf{V}}
\newcommand{\vv}{\mathbf{v}} \newcommand{\Uv}{\mathbf{U}} \newcommand{\uv}{\mathbf{u}} \newcommand{\tv}{\mathbf{t}} \newcommand{\Tv}{\mathbf{T}} \newcommand{\TDv}{\mathbf{TD}}
\newcommand{\Tiv}{\mathbf{Ti}} \newcommand{\Sv}{\mathbf{S}} \newcommand{\Gv}{\mathbf{G}} \newcommand{\Zv}{\mathbf{Z}} \newcommand{\Norm}{\mathcal{N}} \newcommand{\muv}{\boldsymbol{\mu}}
\newcommand{\sigmav}{\boldsymbol{\sigma}} \newcommand{\phiv}{\boldsymbol{\phi}} \newcommand{\Phiv}{\boldsymbol{\Phi}} \newcommand{\Sigmav}{\boldsymbol{\Sigma}} \newcommand{\Lambdav}{\boldsymbol{\Lambda}} \newcommand{\half}{\frac{1}{2}}
\newcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}} \newcommand{\argmin}[1]{\underset{#1}{\operatorname{argmin}}} \newcommand{\dimensionbar}[1]{\underset{#1}{\operatorname{|}}} \newcommand{\grad}{\mathbf{\nabla}}
\newcommand{\ebx}[1]{e^{\betav_{#1}^T \xv_n}} \newcommand{\eby}[1]{e^{y_{n,#1}}} \newcommand{\Fv}{\mathbf{F}} \newcommand{\ones}[1]{\mathbf{1}_{#1}}

====== Reinforcement Learning for Two-Player Games ======

How does Tic-Tac-Toe differ from the maze problem?

  * Different state and action sets.
  * Two players rather than one.
  * Reinforcement is 0 until the end of the game, when it is 1 for a win, 0 for a draw, or -1 for a loss.
  * We maximize the sum of reinforcements rather than minimizing it.
  * Anything else?

===== Representing the Q Table =====

The state is the board configuration. There are $3^9$ of them, though not all are reachable. Is this too big? It is a bit less than 20,000. Not bad.

Is this the full size of the Q table? No. We must add the action dimension. There are at most 9 actions, one for each cell on the board. So the Q table will contain about $20,000 \cdot 9$ values, or about 200,000. No worries.

Instead of thinking of the Q table as a three-dimensional array, as we did last time, let's be more pythonic and use a dictionary. Use the current board together with the chosen move as the key, and store as that key's value the Q value for taking that move from that board.

We still need a way to represent a board. How about an array of characters? So

<code>
X |   | O
---------
  | X | O
---------
X |   |
</code>

would be

<code python>
board = np.array(['X', ' ', 'O',
                  ' ', 'X', 'O',
                  'X', ' ', ' '])
</code>

The initial board would be

<code python>
board = np.array([' '] * 9)
</code>

We can represent a move as an index, 0 to 8, into this array.

What should the reinforcement values be? How about 0 for every move, except a reinforcement of 1 when X wins and -1 when O wins.

For the above board, let's say we, meaning Player X, prefer the move to index 3. In fact, this always results in a win, so the Q value for the move to 3 should be 1. What other Q values do you know? If we don't play a winning move, O could win in one move, so the other moves should have Q values close to -1, depending on the skill of Player O. In the following discussion we will be using a random player for O, so the Q value for a move other than 3 or 8 (the other winning move) will be close to, but not exactly, -1.
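To make this representation concrete, here is a tiny runnable check of the example board above. This snippet is just an illustration, not part of the lecture code; the ''validMoves'' idiom reappears in the main loop later.

<code python>
import numpy as np

# The example board above, as a flat array of nine one-character strings.
board = np.array(['X', ' ', 'O',
                  ' ', 'X', 'O',
                  'X', ' ', ' '])

# Moves are the indices of the empty cells.
validMoves = np.where(board == ' ')[0]
print(validMoves)                          # [1 3 7 8]

# Playing index 3 completes the left column (cells 0, 3, 6) for X.
board[3] = 'X'
print(np.all(board[[0, 3, 6]] == 'X'))     # True
</code>

Now, back to the Q values for this board.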
How will these values be stored? We could assign them like

<code python>
for i, v in enumerate([0, -0.8, 0,   1, 0, 0,   0, -0.7, 0.1]):
    Q[(tuple(board), i)] = v
</code>

To update the value for move ''m'', we can just

<code python>
Q[(tuple(board), m)] = newValue      # some new value
</code>

If a key has not yet been assigned, we must assign it something before accessing it.

<code python>
if (tuple(board), move) not in Q:
    Q[(tuple(board), move)] = 0
</code>

===== Agent-World Interaction Loop =====

For our agent to interact with its world, we must implement the following.

  * Initialize Q.
  * Set the initial state to the empty board.
  * Repeat:
    * Agent chooses the next X move.
    * If X wins, set Q(board,move) to 1.
    * Else, if the board is full, set Q(board,move) to 0.
    * Else, let O take a move.
      * If O won, update Q(board,move) by a fraction $\rho$ of (-1 - Q(board,move)), where $\rho$ is a learning rate.
    * In all cases (after the first step), update Q(oldboard,oldmove) by a fraction $\rho$ of (Q(board,move) - Q(oldboard,oldmove)).
    * Shift the current board and move into the old board and move.

===== Now the Python =====

First, here is the result of running tons of games.

{{ Notes:tttResult.png?500 }}

Now, let's get some function definitions out of the way.

<code python>
import numpy as np
import matplotlib.pyplot as plt
import random
from copy import copy

######################################################################

def winner(board):
    combos = np.array((0,1,2, 3,4,5, 6,7,8, 0,3,6, 1,4,7, 2,5,8, 0,4,8, 2,4,6))
    if np.any(np.logical_or(np.all('X' == board[combos].reshape((-1,3)), axis=1),
                            np.all('O' == board[combos].reshape((-1,3)), axis=1))):
        return True
    else:
        return False

def printBoard(board):
    print('''
 %c|%c|%c
 -----
 %c|%c|%c
 -----
 %c|%c|%c''' % tuple(board))

def plotOutcomes(outcomes, nGames, game):
    if game == 0:
        return
    plt.clf()
    nBins = 100
    nPer = nGames // nBins                    # games per bin
    outcomeRows = outcomes.reshape((-1, nPer))
    outcomeRows = outcomeRows[:game // nPer + 1, :]
    avgs = np.mean(outcomeRows, axis=1)

    plt.subplot(2, 1, 1)
    xs = np.linspace(nPer, game, len(avgs))
    plt.plot(xs, avgs)
    plt.xlabel('Games')
    plt.ylabel('Result (0=draw, 1=X win, -1=O win)')

    plt.subplot(2, 1, 2)
    plt.plot(xs, np.sum(outcomeRows == -1, axis=1), 'r-', label='Losses')
    plt.plot(xs, np.sum(outcomeRows == 0, axis=1), 'b-', label='Draws')
    plt.plot(xs, np.sum(outcomeRows == 1, axis=1), 'g-', label='Wins')
    plt.legend(loc='center')
    plt.draw()
</code>

Now some initializations. How do we initialize the Q table?

<code python>
plt.ion()

nGames = 10000               # number of games
rho = 0.1                    # learning rate
epsilonExp = 0.999           # rate of epsilon decay
outcomes = np.zeros(nGames)  # 0 draw, 1 X win, -1 O win
Q = {}                       # initialize Q dictionary
epsilon = 1.0                # initial epsilon value
showMoves = False            # flag to print each board change
</code>

We must talk a bit about ''epsilon''. This is the probability that a random action is taken at each step. If it is 1, then every action is random. Initially we want random actions, but as experience is gained this probability should be reduced toward zero. An easy way to do this is to decay it exponentially toward zero, like

\begin{align*}
\epsilon_0 &= 1\\
\epsilon_{k+1} &= 0.999\,\epsilon_k
\end{align*}

or with some other decay factor close to 1.

Now for the main loop over multiple games. The basic structure of the main loop is

<code python>
for game in range(nGames):      # iterate over multiple games

    epsilon *= epsilonExp
    step = 0
    board = np.array([' '] * 9)
    done = False

    while not done:             # play one game
        ...
</code>

Now we need to fill in the guts of this loop.
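Before the full listing, one piece of those guts worth isolating is the greedy choice among valid moves, which is where the Q dictionary and its default value of 0 come in. Here is a minimal, self-contained sketch; the helper name ''greedyMove'' is my own and is not part of the lecture code, which does the same thing inline.

<code python>
import numpy as np

def greedyMove(board, Q):
    # Valid moves are the indices of the empty cells.
    validMoves = np.where(board == ' ')[0]
    # Unseen (board, move) pairs default to a Q value of 0.
    qs = np.array([Q.get((tuple(board), m), 0) for m in validMoves])
    return validMoves[np.argmax(qs)]

# Tiny usage example with a hand-built Q table.
board = np.array([' '] * 9)
Q = {(tuple(board), 4): 0.5}    # pretend the center currently looks best
print(greedyMove(board, Q))     # prints 4
</code>

Here, then, is the full loop, with the guts filled in. It embeds this same selection, along with the $\epsilon$-greedy random choice, in line.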
<code python>
for game in range(nGames):      # iterate over multiple games

    epsilon *= epsilonExp
    step = 0
    board = np.array([' '] * 9)
    done = False

    while not done:
        step += 1

        # X's turn
        validMoves = np.where(board == ' ')[0]
        if np.random.uniform() < epsilon:
            # Random move
            move = validMoves[random.sample(range(len(validMoves)), 1)]
            move = move[0]      # to convert to a scalar
        else:
            # Greedy move. Collect Q values for valid moves from the current board
            # and select the move with the highest Q value.
            qs = []
            for m in validMoves:
                qs.append(Q.get((tuple(board), m), 0))
            move = validMoves[np.argmax(np.asarray(qs))]

        if (tuple(board), move) not in Q:
            Q[(tuple(board), move)] = 0

        boardNew = copy(board)
        boardNew[move] = 'X'
        if showMoves:
            printBoard(boardNew)

        if winner(boardNew):
            # X won
            Q[(tuple(board), move)] = 1
            done = True
            outcomes[game] = 1

        elif not np.any(boardNew == ' '):
            # Game over. No winner.
            Q[(tuple(board), move)] = 0
            done = True
            outcomes[game] = 0

        else:
            # O's turn. Random player.
            validMoves = np.where(boardNew == ' ')[0]
            moveO = validMoves[random.sample(range(len(validMoves)), 1)]
            boardNew[moveO] = 'O'
            if showMoves:
                printBoard(boardNew)
            if winner(boardNew):
                # O won
                Q[(tuple(board), move)] += rho * (-1 - Q[(tuple(board), move)])
                done = True
                outcomes[game] = -1

        if step > 1:
            Q[(tuple(boardOld), moveOld)] += rho * (Q[(tuple(board), move)] -
                                                    Q[(tuple(boardOld), moveOld)])

        boardOld, moveOld = board, move
        board = boardNew

    if game % (nGames // 10) == 0 or game == nGames - 1:
        plotOutcomes(outcomes, nGames, game)
        print('Outcomes:', np.sum(outcomes == 0), 'draws,',
              np.sum(outcomes == 1), 'X wins,', np.sum(outcomes == -1), 'O wins')
</code>

Let's discuss all of the lines where the Q function is modified.

That's not much code. Now let's see the rest of it. Ha ha! There's nothing else! Python conciseness!

You should be able to download and paste these pieces together into running code. You can also adapt it to many other two-player games by first extracting the parts specific to Tic-Tac-Toe into functions, so the basic loop need not be modified to apply it to another game.

Now let's discuss the reinforcement learning assignment, [[Assignments:assignment5-reinforcement-learning|Assignment 5]].
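Finally, here is a small optional check, my own addition rather than part of the lecture code. After training finishes, you can inspect the learned Q values for the empty board to see which opening move the greedy policy would choose.

<code python>
# Assumes the trained Q dictionary and the numpy import from the code above.
board = np.array([' '] * 9)
qs = [Q.get((tuple(board), m), 0) for m in range(9)]
print(np.array(qs).reshape(3, 3))            # learned value of each opening cell
print('Greedy opening move:', np.argmax(qs))
</code>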