\newcommand{\Rv}{\mathbf{R}} \newcommand{\rv}{\mathbf{r}} \newcommand{\Qv}{\mathbf{Q}} \newcommand{\Av}{\mathbf{A}} \newcommand{\Aiv}{\mathbf{Ai}} \newcommand{\av}{\mathbf{a}}
\newcommand{\xv}{\mathbf{x}} \newcommand{\Xv}{\mathbf{X}} \newcommand{\yv}{\mathbf{y}} \newcommand{\Yv}{\mathbf{Y}} \newcommand{\zv}{\mathbf{z}} \newcommand{\Wv}{\mathbf{W}}
\newcommand{\wv}{\mathbf{w}} \newcommand{\betav}{\mathbf{\beta}} \newcommand{\gv}{\mathbf{g}} \newcommand{\Hv}{\mathbf{H}} \newcommand{\dv}{\mathbf{d}} \newcommand{\Vv}{\mathbf{V}}
\newcommand{\vv}{\mathbf{v}} \newcommand{\Uv}{\mathbf{U}} \newcommand{\uv}{\mathbf{u}} \newcommand{\tv}{\mathbf{t}} \newcommand{\Tv}{\mathbf{T}} \newcommand{\TDv}{\mathbf{TD}}
\newcommand{\Tiv}{\mathbf{Ti}} \newcommand{\Sv}{\mathbf{S}} \newcommand{\Gv}{\mathbf{G}} \newcommand{\Zv}{\mathbf{Z}} \newcommand{\Norm}{\mathcal{N}} \newcommand{\muv}{\boldsymbol{\mu}}
\newcommand{\sigmav}{\boldsymbol{\sigma}} \newcommand{\phiv}{\boldsymbol{\phi}} \newcommand{\Phiv}{\boldsymbol{\Phi}} \newcommand{\Sigmav}{\boldsymbol{\Sigma}} \newcommand{\Lambdav}{\boldsymbol{\Lambda}} \newcommand{\half}{\frac{1}{2}}
\newcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}} \newcommand{\argmin}[1]{\underset{#1}{\operatorname{argmin}}} \newcommand{\dimensionbar}[1]{\underset{#1}{\operatorname{|}}} \newcommand{\grad}{\mathbf{\nabla}}
\newcommand{\ebx}[1]{e^{\betav_{#1}^T \xv_n}} \newcommand{\eby}[1]{e^{y_{n,#1}}} \newcommand{\Fv}{\mathbf{F}} \newcommand{\ones}[1]{\mathbf{1}_{#1}}

====== Reinforcement Learning for Two-Player Games ======

How does Tic-Tac-Toe differ from the maze problem?

  * Different state and action sets.
  * Two players rather than one.
  * Reinforcement is 0 until the end of the game, when it is 1 for a win, 0 for a draw, or -1 for a loss.
  * We maximize the sum of reinforcements rather than minimizing it.
  * Anything else?

===== Representing the Q Table =====

The state is the board configuration. There are $3^9$ of them, though not all are reachable. Is this too big? It is a bit less than 20,000. Not bad.

Is this the full size of the Q table? No. We must add the action dimension. There are at most 9 actions, one for each cell on the board. So the Q table will contain about $20,000 \cdot 9$ values, or about 200,000. No worries.

Instead of thinking of the Q table as a three-dimensional array, as we did last time, let's be more pythonic and use a dictionary. Use the current board together with the chosen move as the key, and store as that key's value the Q value for taking that move from that board.

We still need a way to represent a board. How about an array of characters? So

<code>
X |   | O
---------
  | X | O
---------
X |   |
</code>

would be

<code python>
board = np.array(['X', ' ', 'O',
                  ' ', 'X', 'O',
                  'X', ' ', ' '])
</code>

The initial board would be

<code python>
board = np.array([' '] * 9)
</code>

We can represent a move as an index, 0 to 8, into this array.

What should the reinforcement values be? How about 0 for every move, except a reinforcement of 1 when X wins and -1 when O wins.

For the above board, let's say we, meaning Player X, prefer the move to index 3. In fact, this always results in a win, so the Q value for the move to 3 should be 1. What other Q values do you know? If we don't play a winning move, O could win in one move, so the other moves should have Q values close to -1, depending on the skill of Player O. In the following discussion we will be using a random player for O, so the Q value for a move other than 3 or 8 (the other winning move) will be close to, but not exactly, -1.
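To make this representation concrete, here is a tiny runnable check of the example board above. This snippet is just an illustration, not part of the lecture code; the ''validMoves'' idiom reappears in the main loop later.

<code python>
import numpy as np

# The example board above, as a flat array of nine one-character strings.
board = np.array(['X', ' ', 'O',
                  ' ', 'X', 'O',
                  'X', ' ', ' '])

# Moves are the indices of the empty cells.
validMoves = np.where(board == ' ')[0]
print(validMoves)                          # [1 3 7 8]

# Playing index 3 completes the left column (cells 0, 3, 6) for X.
board[3] = 'X'
print(np.all(board[[0, 3, 6]] == 'X'))     # True
</code>

Now, back to the Q values for this board.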
How will these values be stored? We could assign them like

<code python>
for i, v in enumerate([0, -0.8, 0,   1, 0, 0,   0, -0.7, 0.1]):
    Q[(tuple(board), i)] = v
</code>

To update the value for move ''m'', we can just

<code python>
Q[(tuple(board), m)] = newValue      # some new value
</code>

If a key has not yet been assigned, we must assign it something before accessing it.

<code python>
if (tuple(board), move) not in Q:
    Q[(tuple(board), move)] = 0
</code>

===== Agent-World Interaction Loop =====

For our agent to interact with its world, we must implement the following.

  * Initialize Q.
  * Set the initial state to the empty board.
  * Repeat:
    * Agent chooses the next X move.
    * If X wins, set Q(board,move) to 1.
    * Else, if the board is full, set Q(board,move) to 0.
    * Else, let O take a move.
      * If O won, update Q(board,move) by a fraction $\rho$ of (-1 - Q(board,move)), where $\rho$ is a learning rate.
    * In all cases (after the first step), update Q(oldboard,oldmove) by a fraction $\rho$ of (Q(board,move) - Q(oldboard,oldmove)).
    * Shift the current board and move into the old board and move.

===== Now the Python =====

First, here is the result of running tons of games.

{{ Notes:tttResult.png?500 }}

Now, let's get some function definitions out of the way.

<code python>
import numpy as np
import matplotlib.pyplot as plt
import random
from copy import copy

######################################################################

def winner(board):
    combos = np.array((0,1,2, 3,4,5, 6,7,8, 0,3,6, 1,4,7, 2,5,8, 0,4,8, 2,4,6))
    if np.any(np.logical_or(np.all('X' == board[combos].reshape((-1,3)), axis=1),
                            np.all('O' == board[combos].reshape((-1,3)), axis=1))):
        return True
    else:
        return False

def printBoard(board):
    print('''
 %c|%c|%c
 -----
 %c|%c|%c
 -----
 %c|%c|%c''' % tuple(board))

def plotOutcomes(outcomes, nGames, game):
    if game == 0:
        return
    plt.clf()
    nBins = 100
    nPer = nGames // nBins                    # games per bin
    outcomeRows = outcomes.reshape((-1, nPer))
    outcomeRows = outcomeRows[:game // nPer + 1, :]
    avgs = np.mean(outcomeRows, axis=1)

    plt.subplot(2, 1, 1)
    xs = np.linspace(nPer, game, len(avgs))
    plt.plot(xs, avgs)
    plt.xlabel('Games')
    plt.ylabel('Result (0=draw, 1=X win, -1=O win)')

    plt.subplot(2, 1, 2)
    plt.plot(xs, np.sum(outcomeRows == -1, axis=1), 'r-', label='Losses')
    plt.plot(xs, np.sum(outcomeRows == 0, axis=1), 'b-', label='Draws')
    plt.plot(xs, np.sum(outcomeRows == 1, axis=1), 'g-', label='Wins')
    plt.legend(loc='center')
    plt.draw()
</code>

Now some initializations. How do we initialize the Q table?

<code python>
plt.ion()

nGames = 10000               # number of games
rho = 0.1                    # learning rate
epsilonExp = 0.999           # rate of epsilon decay
outcomes = np.zeros(nGames)  # 0 draw, 1 X win, -1 O win
Q = {}                       # initialize Q dictionary
epsilon = 1.0                # initial epsilon value
showMoves = False            # flag to print each board change
</code>

We must talk a bit about ''epsilon''. This is the probability that a random action is taken at each step. If it is 1, then every action is random. Initially we want random actions, but as experience is gained this probability should be reduced toward zero. An easy way to do this is to decay it exponentially toward zero, like

\begin{align*}
\epsilon_0 &= 1\\
\epsilon_{k+1} &= 0.999\,\epsilon_k
\end{align*}

or with some other decay factor close to 1.

Now for the main loop over multiple games. The basic structure of the main loop is

<code python>
for game in range(nGames):      # iterate over multiple games

    epsilon *= epsilonExp
    step = 0
    board = np.array([' '] * 9)
    done = False

    while not done:             # play one game
        ...
</code>

Now we need to fill in the guts of this loop.
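Before the full listing, one piece of those guts worth isolating is the greedy choice among valid moves, which is where the Q dictionary and its default value of 0 come in. Here is a minimal, self-contained sketch; the helper name ''greedyMove'' is my own and is not part of the lecture code, which does the same thing inline.

<code python>
import numpy as np

def greedyMove(board, Q):
    # Valid moves are the indices of the empty cells.
    validMoves = np.where(board == ' ')[0]
    # Unseen (board, move) pairs default to a Q value of 0.
    qs = np.array([Q.get((tuple(board), m), 0) for m in validMoves])
    return validMoves[np.argmax(qs)]

# Tiny usage example with a hand-built Q table.
board = np.array([' '] * 9)
Q = {(tuple(board), 4): 0.5}    # pretend the center currently looks best
print(greedyMove(board, Q))     # prints 4
</code>

Here, then, is the full loop, with the guts filled in. It embeds this same selection, along with the $\epsilon$-greedy random choice, in line.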
<code python>
for game in range(nGames):      # iterate over multiple games

    epsilon *= epsilonExp
    step = 0
    board = np.array([' '] * 9)
    done = False

    while not done:
        step += 1

        # X's turn
        validMoves = np.where(board == ' ')[0]
        if np.random.uniform() < epsilon:
            # Random move
            move = validMoves[random.sample(range(len(validMoves)), 1)]
            move = move[0]      # to convert to a scalar
        else:
            # Greedy move. Collect Q values for valid moves from the current board
            # and select the move with the highest Q value.
            qs = []
            for m in validMoves:
                qs.append(Q.get((tuple(board), m), 0))
            move = validMoves[np.argmax(np.asarray(qs))]

        if (tuple(board), move) not in Q:
            Q[(tuple(board), move)] = 0

        boardNew = copy(board)
        boardNew[move] = 'X'
        if showMoves:
            printBoard(boardNew)

        if winner(boardNew):
            # X won
            Q[(tuple(board), move)] = 1
            done = True
            outcomes[game] = 1

        elif not np.any(boardNew == ' '):
            # Game over. No winner.
            Q[(tuple(board), move)] = 0
            done = True
            outcomes[game] = 0

        else:
            # O's turn. Random player.
            validMoves = np.where(boardNew == ' ')[0]
            moveO = validMoves[random.sample(range(len(validMoves)), 1)]
            boardNew[moveO] = 'O'
            if showMoves:
                printBoard(boardNew)
            if winner(boardNew):
                # O won
                Q[(tuple(board), move)] += rho * (-1 - Q[(tuple(board), move)])
                done = True
                outcomes[game] = -1

        if step > 1:
            Q[(tuple(boardOld), moveOld)] += rho * (Q[(tuple(board), move)] -
                                                    Q[(tuple(boardOld), moveOld)])

        boardOld, moveOld = board, move
        board = boardNew

    if game % (nGames // 10) == 0 or game == nGames - 1:
        plotOutcomes(outcomes, nGames, game)
        print('Outcomes:', np.sum(outcomes == 0), 'draws,',
              np.sum(outcomes == 1), 'X wins,', np.sum(outcomes == -1), 'O wins')
</code>

Let's discuss all of the lines where the Q function is modified.

That's not much code. Now let's see the rest of it. Ha ha! There's nothing else! Python conciseness!

You should be able to download and paste these pieces together into running code. You can also adapt it to many other two-player games by first extracting the parts specific to Tic-Tac-Toe into functions, so the basic loop need not be modified to apply it to another game.

Now let's discuss the reinforcement learning assignment, [[Assignments:assignment5-reinforcement-learning|Assignment 5]].
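Finally, here is a small optional check, my own addition rather than part of the lecture code. After training finishes, you can inspect the learned Q values for the empty board to see which opening move the greedy policy would choose.

<code python>
# Assumes the trained Q dictionary and the numpy import from the code above.
board = np.array([' '] * 9)
qs = [Q.get((tuple(board), m), 0) for m in range(9)]
print(np.array(qs).reshape(3, 3))            # learned value of each opening cell
print('Greedy opening move:', np.argmax(qs))
</code>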