$\newcommand{\xv}{\mathbf{x}} \newcommand{\Xv}{\mathbf{X}} \newcommand{\yv}{\mathbf{y}} \newcommand{\zv}{\mathbf{z}} \newcommand{\av}{\mathbf{a}} \newcommand{\Wv}{\mathbf{W}} \newcommand{\wv}{\mathbf{w}} \newcommand{\tv}{\mathbf{t}} \newcommand{\Tv}{\mathbf{T}} \newcommand{\muv}{\boldsymbol{\mu}} \newcommand{\sigmav}{\boldsymbol{\sigma}} \newcommand{\phiv}{\boldsymbol{\phi}} \newcommand{\Phiv}{\boldsymbol{\Phi}} \newcommand{\Sigmav}{\boldsymbol{\Sigma}} \newcommand{\Lambdav}{\boldsymbol{\Lambda}} \newcommand{\half}{\frac{1}{2}} \newcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}} \newcommand{\argmin}[1]{\underset{#1}{\operatorname{argmin}}}$

Assignment 5: Control a Marble with Reinforcement Learning

Steve Kommrusch

Key neural network subroutines provided by professor Chuck Anderson

Overview

In this assignment, I modify provided reinforcement learning code to solve the dynamic marble problem. I solve a more complex version of the problem in which the goal position is supplied as an additional state variable, and for extra credit I further adapt the functions to a two-dimensional gravity problem in which a rocket must dock with a space station orbiting a planet.

The reinforcement learning objective is to learn a Q function that evaluates the actions available in a given state by making the following approximation as accurate as possible:

$$ \begin{align*} Q(s_t,a_t) \approx \sum_{k=0}^\infty r_{t+k+1} \end{align*} $$

This is usually formulated as the least squares objective:

$$ \begin{align*} \text{Minimize } \mathrm{E} \left ( \sum_{k=0}^\infty r_{t+k+1} - Q(s_t,a_t)\right )^2 \end{align*} $$

Using SARSA (state, action, reinforcement, next state, next action) for training, and discounting future reinforcements in the temporal-difference error, we can formulate the learning problem as minimizing:

$$ \begin{align*} \text{Minimize } \mathrm{E} \Bigl ( r_{t+1} + \gamma Q(s_{t+1},a_{t+1};\wv) - Q(s_t,a_t;\wv) \Bigr )^2 \end{align*} $$
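Concretely, each interaction gives a SARSA tuple $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})$, and the network is regressed toward the bootstrapped target $r_{t+1} + \gamma Q(s_{t+1},a_{t+1};\wv)$. Below is a minimal sketch of this target construction, assuming sample rows laid out as $[s, a, r, s', a']$ and a fitted Q model with use and train methods like the network used later in this notebook.

def sarsa_targets(qnet, samples, gamma, n_state, n_action):
    # Slice sample rows laid out as [s, a, r, s', a'].
    X = samples[:, :n_state + n_action]                        # (s_t, a_t)
    R = samples[:, n_state + n_action:n_state + n_action + 1]  # r_{t+1}
    nextX = samples[:, n_state + n_action + 1:]                # (s_{t+1}, a_{t+1})
    # Bootstrapped temporal-difference target: r + gamma * Q(s', a')
    T = R + gamma * qnet.use(nextX)
    return X, T   # then regress with qnet.train(X, T, nIterations=...)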

Initially provided code

The code in this section was provided by Professor Chuck Anderson; the functions are left unchanged to serve as a starting example. Professor Anderson also provided A5.tar, which included the files below. At the end of this section I describe the graphs that are generated.

  • marble.py
  • neuralnetworksbylayer.py
  • layers.py
  • scaledconjugategradient.py
  • mlutils.py
In [1]:
import neuralnetworksbylayer as nn
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display, clear_output
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm
import copy

%matplotlib inline

To define a reinforcement learning problem, we need functions that

  • initialize the state of the environment,
  • return the next state given a current state and an action,
  • return a reinforcement value, and
  • choose an action to take, given the current state. For this we define an $\epsilon$-greedy policy.

Let's first define the set of valid actions. For our simple one-dimensional marble problem, we can push left, right, or not push at all.

In [2]:
validActions = np.array([ -1, 0, 1])

def initialState():
    return np.array([10*np.random.random_sample(), 3*(0.5-np.random.random_sample())])

def nextState(s,a):
    s = copy.copy(s)   # s[0] is position, s[1] is velocity. a is -1, 0 or 1
    deltaT = 0.1                           # Euler integration time step
    s[0] += deltaT * s[1]                  # Update position
    s[1] += deltaT * (2 * a - 0.2 * s[1])  # Update velocity. Includes friction
    if s[0] < 0:        # Bound next position. If at limits, set velocity to 0.
        s = np.array([0,0])
    elif s[0] > 10:
        s = np.array([10,0])
    return s

def reinforcement(s):  # s is new state
    goal = 5
    return 0 if abs(s[0]-goal) < 1 else -0.1

def policy(qnet, state, epsilon):
    if np.random.rand(1) < epsilon:
        actioni = np.random.randint(validActions.shape[0])
    else:
        inputs = np.hstack(( np.tile(state, (validActions.shape[0], 1)), validActions.reshape((-1,1))))
        qs = qnet.use(inputs)
        actioni = np.argmax(qs)
    return validActions[actioni]

Now we need a function to generate a bunch of samples that are interactions with the marble.

In [3]:
def makeSamples(qnet, nStepsPerStart):
    samples = []
    state = initialState()
    act = policy(qnet, state, epsilon)
    oldact = act
    for iStep in range(nStepsPerStart):
        newState = nextState(state, act)
        r = reinforcement(newState)
        newAct = policy(qnet, newState, epsilon)
        # SARSA
        samples.append(state.tolist() + [act, r] + newState.tolist() + [newAct])
        state = newState
        oldact = act
        act = newAct
    return np.array(samples)

Now we can play. Define constants we need.

In [4]:
def plotStatus(qnet, X, R, trial, epsilonTrace, rtrace):
    plt.subplot(4,3,1)
    plt.plot(epsilonTrace[:trial+1])
    plt.ylabel("Random Action Probability ($\epsilon$)")
    plt.ylim(0,1)
    plt.subplot(4,3,2)
    plt.plot(X[:,0])
    plt.plot([0,X.shape[0]], [5,5],'--',alpha=0.5,lw=5)
    plt.ylabel("$x$")
    plt.ylim(-1,11)
    qs = qnet.use(np.array([[s,0,a] for a in validActions for s in range(11)]))
    plt.subplot(4,3,3)
    acts = ["L","0","R"]
    actsiByState = np.argmax(qs.reshape((len(validActions),-1)),axis=0)
    for i in range(11):
        plt.text(i,0,acts[actsiByState[i]])
        plt.xlim(-1,11)
        plt.ylim(-1,1)
    plt.text(2,0.2,"Policy for Zero Velocity")
    plt.axis("off")
    plt.subplot(4,3,4)
    plt.plot(rtrace[:trial+1],alpha=0.5)
    binSize = 20
    if trial+1 > binSize:
        # Calculate mean of every bin of binSize reinforcement values
        smoothed = np.mean(rtrace[:int(trial/binSize)*binSize].reshape((int(trial/binSize),binSize)),axis=1)
        plt.plot(np.arange(1,1+int(trial/binSize))*binSize,smoothed)
    plt.ylabel("Mean reinforcement")
    plt.subplot(4,3,5)
    plt.plot(X[:,0],X[:,1])
    plt.plot(X[0,0],X[0,1],'o')
    plt.xlabel("$x$")
    plt.ylabel("$\dot{x}$")
    plt.fill_between([4,6],[-5,-5],[5,5],color="red",alpha=0.3)
    plt.xlim(-1,11)
    plt.ylim(-5,5)
    plt.subplot(4,3,6)
    qnet.draw(["$x$","$\dot{x}$","$a$"],["Q"])

    plt.subplot(4,3,7)
    n = 20
    positions = np.linspace(0,10,n)
    velocities =  np.linspace(-5,5,n)
    xs,ys = np.meshgrid(positions,velocities)
    xsflat = xs.flat
    ysflat = ys.flat
    qs = qnet.use(np.array([[xsflat[i],ysflat[i],a] for a in validActions for i in range(len(xsflat))]))
    qs = qs.reshape((len(validActions),-1)).T
    qsmax = np.max(qs,axis=1).reshape(xs.shape)
    cs = plt.contourf(xs,ys,qsmax)
    plt.colorbar(cs)
    plt.xlabel("$x$")
    plt.ylabel("$\dot{x}$")
    plt.title("Max Q")
    plt.subplot(4,3,8)
    acts = np.array(validActions)[np.argmax(qs,axis=1)].reshape(xs.shape)
    cs = plt.contourf(xs,ys,acts,[-2, -0.5, 0.5, 2])
    plt.colorbar(cs)
    plt.xlabel("$x$")
    plt.ylabel("$\dot{x}$")
    plt.title("Actions")

    s = plt.subplot(4,3,10)
    rect = s.get_position()
    ax = Axes3D(plt.gcf(),rect=rect)
    ax.plot_surface(xs,ys,qsmax,cstride=1,rstride=1,cmap=cm.viridis,linewidth=0)
    ax.set_xlabel("$x$")
    ax.set_ylabel("$\dot{x}$")
    plt.title("Max Q")

    s = plt.subplot(4,3,11)
    rect = s.get_position()
    ax = Axes3D(plt.gcf(),rect=rect)
    ax.plot_surface(xs,ys,acts,cstride=1,rstride=1,cmap=cm.viridis,linewidth=0)
    ax.set_xlabel("$x$")
    ax.set_ylabel("$\dot{x}$")
    plt.title("Action")

def testIt(qnet,nTrials,nStepsPerTrial):
    xs = np.linspace(0,10,nTrials)
    plt.subplot(4,3,12)
    for x in xs:
        s = [x,0] ## 0 velocity
        xtrace = np.zeros((nStepsPerTrial,2))
        for step in range(nStepsPerTrial):
            a = policy(qnet, s, 0.0)  # epsilon = 0
            s = nextState(s,a)
            xtrace[step,:] = s
        plt.plot(xtrace[:,0],xtrace[:,1])
        plt.xlim(-1,11)
        plt.ylim(-5,5)
        plt.plot([5,5],[-5,5],'--',alpha=0.5,lw=5)
        plt.ylabel('$\dot{x}$')
        plt.xlabel('$x$')
        plt.title('State Trajectories for $\epsilon=0$')
In [5]:
gamma = 0.999
nTrials = 300
nStepsPerTrial = 500 
nSCGIterations = 30
finalEpsilon = 0.01
epsilonDecay = np.exp(np.log(finalEpsilon)/(nTrials))  # to produce this final value
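The decay factor is chosen so that, starting from $\epsilon = 1$ and multiplying by epsilonDecay once per trial, the random action probability ends at finalEpsilon after nTrials trials:

$$ \begin{align*} \epsilon_{\text{final}} = 1 \cdot \left( e^{\ln(\text{finalEpsilon}) / \text{nTrials}} \right)^{\text{nTrials}} = e^{\ln(\text{finalEpsilon})} = \text{finalEpsilon} \end{align*} $$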
In [6]:
nh = [5,5]
qnet = nn.NeuralNetwork([3] + nh + [1])  # [3, 5, 5, 1]
qnet.setInputRanges(( (0, 10), (-3, 3), (-1,1)))
In [7]:
fig = plt.figure(figsize=(15,15))

epsilon = 1
epsilonTrace = np.zeros(nTrials)
rtrace = np.zeros(nTrials)

for trial in range(nTrials):

    # Collect nStepsPerRep samples of X, R, Qn, and Q, and update epsilon
    samples = makeSamples(qnet, nStepsPerTrial)

    ns = 2
    na = 1
    X = samples[:, :ns+na]
    R = samples[:, ns+na:ns+na+1]
    nextX = samples[:, ns+na+1:]
    nextQ = qnet.use(nextX)

    qnet.train(X, R + gamma * nextQ, nIterations = nSCGIterations)
    
    # Decay epsilon
    epsilon *= epsilonDecay

    # Rest is for plotting
    epsilonTrace[trial] = epsilon
    rtrace[trial] = np.mean(R)

    if trial % (nTrials//10) == 0 or trial == nTrials-1:
        plt.clf()
        plotStatus(qnet, X, R, trial,epsilonTrace,rtrace)
        testIt(qnet,10,500)
        clear_output(wait=True)
        display(fig);
        plt.pause(0.01)

clear_output(wait=True)

Here is a brief description of the diagrams above, starting with the upper left diagram and then describing row-by-row:

  • Random action probability decays from 1.0 to finalEpsilon over the nTrials trials
  • An example trial showing the position of the marble during the nStepsPerTrial steps
  • The policy for states at each position with zero velocity
  • Mean reinforcement of the samples for each trial, with an average shown for batches of 20 trials
  • A velocity-versus-position trace for the same trial shown in the row above
  • A diagram of the network, showing the weights associated with the various neurons
  • The maximum Q value over all actions for each marble position and velocity. Generally the highest Q values should be at zero velocity at the goal, at positive velocities left of the goal, and at negative velocities right of the goal
  • Actions to take (-1 = accelerate left, 0 = no change to velocity, 1 = accelerate right) for a given marble position and velocity
  • 3D graph of the maximum Q value (same data as the diagram in the row above)
  • 3D graph of the actions to take (same data as the diagram in the row above)
  • Example velocity and position traces for marbles starting at positions 0, 1, 2, ..., 10

Assignment Requirements

I modified the above code to change the reinforcement learning problem to one in which the goal position of the marble can be varied. The new code trains a single Q network for the marble problem with a variable goal. I ran over 60 experiments to find values of the network architecture (nh), gamma, nTrials, nStepsPerTrial, nSCGIterations, and finalEpsilon that result in a single Q network that can drive the marble to the goal position.

The general approach is as follows.

The state of the marble in the original code is $(x_t, \dot{x}_t)$. In this new problem, the state is $(x_t, \dot{x}_t, g_t)$, where $g_t$ is the goal at time $t$. The initialState function now randomly chooses a new goal $g_t$ as an integer between 1 and 9. nextState is changed so that the goal value carries over unchanged from the old state to the new state. Finally, the reinforcement function now depends on the current $g_t$ value.
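For reference, here is a minimal sketch of how each row produced by makeSamples is laid out for the variable-goal problem, and how it is sliced; this matches the ns = 3, na = 1 slicing used in the training cells below.

# Each sample row now has 9 columns:
#   [x, xdot, g, a, r, x', xdot', g', a']
#   columns 0-2: state, column 3: action, column 4: reinforcement,
#   columns 5-7: next state, column 8: next action
ns, na = 3, 1                              # state size, action size
X     = samples[:, :ns + na]               # (x, xdot, g, a)
R     = samples[:, ns + na:ns + na + 1]    # r
nextX = samples[:, ns + na + 1:]           # (x', xdot', g', a')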

Some figures are unchanged, but as requested I now plot the contour and surface plots for goals 1, 5, and 9, and I plot the zero-velocity test plots for all 11 goal values (0 through 10).

As discussed below, I was able to find parameters for the system that work reasonably well but are not consistently perfect. The main discovery is that it helped to leave the final epsilon rather high (0.6) to ensure the network still performs well in unexpected states during use.

This first run shows the results of the system with the same parameters as provided in the initial code section above.

In [8]:
validActions = np.array([ -1, 0, 1])

def initialState():
    goal = np.floor(9*np.random.random_sample()+1)  # Random from 1 to 9 inclusive
    return np.array([10*np.random.random_sample(), 3*(0.5-np.random.random_sample()), goal])

def nextState(s,a):
    s = copy.copy(s)   # s[0] is position, s[1] is velocity. a is -1, 0 or 1
    deltaT = 0.1                           # Euler integration time step
    s[0] += deltaT * s[1]                  # Update position
    s[1] += deltaT * (2 * a - 0.2 * s[1])  # Update velocity. Includes friction
    if s[0] < 0:        # Bound next position. If at limits, set velocity to 0.
        s = np.array([0,0,s[2]])
    elif s[0] > 10:
        s = np.array([10,0,s[2]])
    return s

def reinforcement(s):  # s is new state
    goal = s[2]
    return 0 if abs(s[0]-goal) < 1 else -0.1

def policy(qnet, state, epsilon):
    if np.random.rand(1) < epsilon:
        actioni = np.random.randint(validActions.shape[0])
    else:
        inputs = np.hstack(( np.tile(state, (validActions.shape[0], 1)), validActions.reshape((-1,1))))
        qs = qnet.use(inputs)
        actioni = np.argmax(qs)
    return validActions[actioni]
In [9]:
nh = [5,5]
qnet = nn.NeuralNetwork([4] + nh + [1])  # [4, 5, 5, 1]
qnet.setInputRanges(( (0, 10), (-3, 3), (-1,1), (0, 10)))
In [10]:
def plotStatus(qnet, X, R, trial, epsilonTrace, rtrace):
    plt.subplot(10,3,1)
    plt.plot(epsilonTrace[:trial+1])
    plt.ylabel("Random Action Probability ($\epsilon$)")
    plt.ylim(0,1)
    plt.subplot(10,3,2)
    plt.plot(X[:,0])
    plt.plot([0,X.shape[0]], [X[0,2],X[0,2]],'--',alpha=0.5,lw=5)
    plt.ylabel("$x$")
    plt.ylim(-1,11)
    qs = qnet.use(np.array([[s,0,5,a] for a in validActions for s in range(11)]))
    plt.subplot(10,3,3)
    acts = ["L","0","R"]
    actsiByState = np.argmax(qs.reshape((len(validActions),-1)),axis=0)
    for i in range(11):
        plt.text(i,0,acts[actsiByState[i]])
        plt.xlim(-1,11)
        plt.ylim(-1,1)
    plt.text(1,0.2,"Policy for Zero Velocity and goal=5")
    plt.axis("off")
    plt.subplot(10,3,4)
    plt.plot(rtrace[:trial+1],alpha=0.5)
    binSize = 20
    if trial+1 > binSize:
        # Calculate mean of every bin of binSize reinforcement values
        smoothed = np.mean(rtrace[:int(trial/binSize)*binSize].reshape((int(trial/binSize),binSize)),axis=1)
        plt.plot(np.arange(1,1+int(trial/binSize))*binSize,smoothed)
    plt.ylabel("Mean reinforcement")
    plt.subplot(10,3,5)
    plt.plot(X[:,0],X[:,1])
    plt.plot(X[0,0],X[0,1],'o')
    plt.xlabel("$x$")
    plt.ylabel("$\dot{x}$")
    plt.fill_between([X[0,2]-1,X[0,2]+1],[-5,-5],[5,5],color="red",alpha=0.3)
    plt.xlim(-1,11)
    plt.ylim(-5,5)
    plt.subplot(10,3,6)
    qnet.draw(["$x$","$\dot{x}$","$goal$","$a$"],["Q"])

    n = 20
    positions = np.linspace(0,10,n)
    velocities =  np.linspace(-5,5,n)
    xs,ys = np.meshgrid(positions,velocities)
    xsflat = xs.flat
    ysflat = ys.flat
    plt.subplot(10,3,7)
    qs = qnet.use(np.array([[xsflat[i],ysflat[i],1,a] for a in validActions for i in range(len(xsflat))]))
    qs = qs.reshape((len(validActions),-1)).T
    qsmax = np.max(qs,axis=1).reshape(xs.shape)
    cs = plt.contourf(xs,ys,qsmax)
    plt.colorbar(cs)
    plt.xlabel("$x$")
    plt.ylabel("$\dot{x}$")
    plt.title("Max Q for goal=1")
    plt.subplot(10,3,8)
    acts = np.array(validActions)[np.argmax(qs,axis=1)].reshape(xs.shape)
    cs = plt.contourf(xs,ys,acts,[-2, -0.5, 0.5, 2])
    plt.colorbar(cs)
    plt.xlabel("$x$")
    plt.ylabel("$\dot{x}$")
    plt.title("Actions for goal=1")
    s = plt.subplot(10,3,9)
    rect = s.get_position()
    ax = Axes3D(plt.gcf(),rect=rect)
    ax.plot_surface(xs,ys,qsmax,cstride=1,rstride=1,cmap=cm.viridis,linewidth=0)
    ax.set_xlabel("$x$")
    ax.set_ylabel("$\dot{x}$")
    plt.title("Max Q for goal=1")
    s = plt.subplot(10,3,10)
    rect = s.get_position()
    ax = Axes3D(plt.gcf(),rect=rect)
    ax.plot_surface(xs,ys,acts,cstride=1,rstride=1,cmap=cm.viridis,linewidth=0)
    ax.set_xlabel("$x$")
    ax.set_ylabel("$\dot{x}$")
    plt.title("Actions for goal=1")
    
    plt.subplot(10,3,11)
    qs = qnet.use(np.array([[xsflat[i],ysflat[i],5,a] for a in validActions for i in range(len(xsflat))]))
    qs = qs.reshape((len(validActions),-1)).T
    qsmax = np.max(qs,axis=1).reshape(xs.shape)
    cs = plt.contourf(xs,ys,qsmax)
    plt.colorbar(cs)
    plt.xlabel("$x$")
    plt.ylabel("$\dot{x}$")
    plt.title("Max Q for goal=5")
    plt.subplot(10,3,12)
    acts = np.array(validActions)[np.argmax(qs,axis=1)].reshape(xs.shape)
    cs = plt.contourf(xs,ys,acts,[-2, -0.5, 0.5, 2])
    plt.colorbar(cs)
    plt.xlabel("$x$")
    plt.ylabel("$\dot{x}$")
    plt.title("Actions for goal=5")
    s = plt.subplot(10,3,13)
    rect = s.get_position()
    ax = Axes3D(plt.gcf(),rect=rect)
    ax.plot_surface(xs,ys,qsmax,cstride=1,rstride=1,cmap=cm.viridis,linewidth=0)
    ax.set_xlabel("$x$")
    ax.set_ylabel("$\dot{x}$")
    plt.title("Max Q for goal=5")
    s = plt.subplot(10,3,14)
    rect = s.get_position()
    ax = Axes3D(plt.gcf(),rect=rect)
    ax.plot_surface(xs,ys,acts,cstride=1,rstride=1,cmap=cm.viridis,linewidth=0)
    ax.set_xlabel("$x$")
    ax.set_ylabel("$\dot{x}$")
    plt.title("Actions for goal=5")
    
    plt.subplot(10,3,15)
    qs = qnet.use(np.array([[xsflat[i],ysflat[i],9,a] for a in validActions for i in range(len(xsflat))]))
    qs = qs.reshape((len(validActions),-1)).T
    qsmax = np.max(qs,axis=1).reshape(xs.shape)
    cs = plt.contourf(xs,ys,qsmax)
    plt.colorbar(cs)
    plt.xlabel("$x$")
    plt.ylabel("$\dot{x}$")
    plt.title("Max Q for goal=9")
    plt.subplot(10,3,16)
    acts = np.array(validActions)[np.argmax(qs,axis=1)].reshape(xs.shape)
    cs = plt.contourf(xs,ys,acts,[-2, -0.5, 0.5, 2])
    plt.colorbar(cs)
    plt.xlabel("$x$")
    plt.ylabel("$\dot{x}$")
    plt.title("Actions for goal=9")
    s = plt.subplot(10,3,17)
    rect = s.get_position()
    ax = Axes3D(plt.gcf(),rect=rect)
    ax.plot_surface(xs,ys,qsmax,cstride=1,rstride=1,cmap=cm.viridis,linewidth=0)
    ax.set_xlabel("$x$")
    ax.set_ylabel("$\dot{x}$")
    plt.title("Max Q for goal=9")
    s = plt.subplot(10,3,18)
    rect = s.get_position()
    ax = Axes3D(plt.gcf(),rect=rect)
    ax.plot_surface(xs,ys,acts,cstride=1,rstride=1,cmap=cm.viridis,linewidth=0)
    ax.set_xlabel("$x$")
    ax.set_ylabel("$\dot{x}$")
    plt.title("Actions for goal=9")
    
def testIt(qnet,nTrials,nStepsPerTrial):
    for i in range(11):
        xs = np.linspace(0,10,nTrials)
        plt.subplot(10,3,19+i)
        for x in xs:
            s = [x,0,i] ## 0 velocity, goal=i
            xtrace = np.zeros((nStepsPerTrial,3))
            for step in range(nStepsPerTrial):
                a = policy(qnet, s, 0.0)  # epsilon = 0
                s = nextState(s,a)
                xtrace[step,:] = s
            plt.plot(xtrace[:,0],xtrace[:,1])
            plt.xlim(-1,11)
            plt.ylim(-5,5)
            plt.plot([i,i],[-5,5],'--',alpha=0.5,lw=5)
            plt.ylabel('$\dot{x}$')
            plt.xlabel('$x$')
            plt.title('State Trajectories for goal=' + str(i) + ', $\epsilon=0$')

The run below uses the same parameters as the provided code (same hidden layers, epsilon schedule, etc.), but is updated to train with randomly chosen goal positions (1 through 9) and to test on goals 0 through 10. The final results are not very good, as seen in the graphs.

In [13]:
# Generate graphs for arbitrary goal position with original (not optimized) parameters
# (see later run after the experiment table for optimized parameters)

fig = plt.figure(figsize=(15,50))

epsilon = 1
epsilonTrace = np.zeros(nTrials)
rtrace = np.zeros(nTrials)

for trial in range(nTrials):

    # Collect nStepsPerRep samples of X, R, Qn, and Q, and update epsilon
    samples = makeSamples(qnet, nStepsPerTrial)

    ns = 3 # position, velocity, goal
    na = 1
    X = samples[:, :ns+na]
    R = samples[:, ns+na:ns+na+1]
    nextX = samples[:, ns+na+1:]
    nextQ = qnet.use(nextX)

    qnet.train(X, R + gamma * nextQ, nIterations = nSCGIterations)
    
    # Decay epsilon
    epsilon *= epsilonDecay

    # Rest is for plotting
    epsilonTrace[trial] = epsilon
    rtrace[trial] = np.mean(R)

    if trial % (nTrials//5) == 0 or trial == nTrials-1:
        plt.clf()
        plotStatus(qnet, X, R, trial,epsilonTrace,rtrace)
        testIt(qnet,10,500)
        clear_output(wait=True)
        display(fig);
        plt.pause(0.01)

clear_output(wait=True)

After the results above with the initial parameters, I tried a variety of parameter changes whose results are summarized in the table below.

| Run | nh | gamma | nTrials | nSteps/Trial | nSCGIterations | finalEpsilon | Execution time | Goals correct (of 11) |
|---|---|---|---|---|---|---|---|---|
| Initial | 5,5 | 0.999 | 300 | 500 | 30 | 0.01 | ~61 sec | 1 |
| 2 | 10,10,10 | 0.999 | 300 | 600 | 60 | 0.001 | ~86 sec | 2 |
| 3 | 10,10,10 | 0.999 | 600 | 1200 | 120 | 0.001 | ~321 sec | 8 |
| 4 | 10,10,10 | 0.999 | 400 | 1000 | 100 | 0.01 | ~171 sec | 1 |
| 5 | 10,10,10 | 0.999 | 600 | 1000 | 120 | 0.01 | ~277 sec | 7 |
| 6 | 10,10,10 | 0.999 | 600 | 500 | 120 | 0.01 | ~184 sec | 2 |
| 7 | 10,10,10 | 0.99 | 600 | 500 | 120 | 0.01 | ~187 sec | 1 |
| 8 | 10,10,10 | 0.99 | 600 | 300 | 120 | 0.01 | ~147 sec | 2 |
| 9 | 10,10,10 | 0.99 | 600 | 100 | 120 | 0.01 | ~85 sec | 1 |
| 10 | 20,10 | 0.99 | 600 | 100 | 120 | 0.01 | ~87 sec | 2 |
| 11 | 20,10 | 0.99 | 600 | 100 | 120 | 0.001 | ~80 sec | 1 |
| 12 | 30,10,5 | 0.99 | 600 | 50 | 60 | 0.002 | ~70 sec | 2 |
| 13 | 30,10,5 | 0.99 | 600 | 50 | 20 | 0.002 | ~28 sec | 1 |
| 14 | 30,10,5 | 0.99 | 600 | 10 | 20 | 0.002 | ~28 sec | 0 |
| 15 | 30,10,5 | 0.99 | 2000 | 100 | 60 | 0.002 | ~118 sec | 2 |
| 16 | 20,10 | 0.99 | 2000 | 100 | 60 | 0.002 | ~95 sec | 2 |
| 16 | 20,10 | 0.99 | 2000 | 100 | 30 | 0.01 | ~73 sec | 0 |
| 17 | 20,10 | 0.99 | 1000 | 100 | 30 | 0.02 | ~48 sec | 1 |
| 18 | 30,10,10 | 0.999 | 600 | 1200 | 120 | 0.001 | ~479 sec | 1 |
| 19 | 10,10 | 0.999 | 600 | 1200 | 120 | 0.001 | ~259 sec | 5 |
| 20 | 10,10,10 | 0.999 | 1200 | 600 | 60 | 0.001 | ~231 sec | 2 |
| 21 | 10,10,10 | 0.999 | 600 | 1200 | 60 | 0.02 | ~236 sec | 2 |
| 22 | 10,10,10 | 0.999 | 600 | 600 | 120 | 0.02 | ~207 sec | 3 |
| 23 | 10,10,10 | 0.999 | 600 | 600 | 60 | 0.02 | ~160 sec | 4 |
| 24 | 10,10,10 | 0.999 | 600 | 1200 | 60 | 0.02 | ~240 sec | 4 |
| 25 | 10,10,10 | 0.999 | 600 | 1200 | 120 | 0.02 | ~313 sec | 2 |
| 26 | 10,10,10 | 0.9 | 600 | 600 | 60 | 0.02 | ~145 sec | 1 |
| 27 | 5,5,5 | 0.999 | 600 | 1200 | 120 | 0.001 | ~255 sec | 1 |
| 28 | 10,10,10 | 0.999 | 3000 | 1200 | 120 | 0.001 | under 2000 sec | 2 |
| 29 | 5,5 | 0.99 | 4000 | 200 | 60 | 0.01 | under 1700 sec | 2 |
| 30 | 3,3 | 0.99 | 600 | 200 | 60 | 0.1 | under 60 sec | 2 |
| 31 | 3 | 0.99 | 600 | 200 | 60 | 0.1 | ~45 sec | 0 |
| 32 | 5 | 0.99 | 600 | 200 | 60 | 0.1 | ~50 sec | 2 |
| 33 | 7 | 0.99 | 600 | 200 | 60 | 0.1 | ~47 sec | 2 |
| 34 | 7,3 | 0.99 | 600 | 500 | 60 | 0.1 | ~97 sec | 6 |
| 35 | 7,3 | 0.999 | 600 | 500 | 60 | 0.1 | ~91 sec | 5 |
| 36 | 7,3 | 0.99 | 600 | 500 | 30 | 0.1 | ~69 sec | 3 |
| 37 | 7,3 | 0.99 | 1000 | 400 | 30 | 0.2 | ~80 sec | 4 |
| 38 | 7,3 | 0.99 | 1000 | 400 | 30 | 0.5 | ~80 sec | 11 (perfect) |
| 39 | 7,3 | 0.99 | 1000 | 200 | 30 | 0.5 | ~48 sec | 9 |
| 40 | 7,3 | 0.99 | 1000 | 200 | 30 | 0.9 | ~43 sec | 5 |
| 41 | 7,3 | 0.999 | 1000 | 200 | 30 | 0.5 | ~43 sec | 1 |
| 42 | 7,3 | 0.9 | 1000 | 200 | 30 | 0.5 | ~48 sec | 11 (perfect) |
| 43 | 7,3 | 0.5 | 1000 | 200 | 30 | 0.5 | ~48 sec | 8 |
| 44 | 7,3 | 0.98 | 1000 | 200 | 30 | 0.5 | ~55 sec | 4 |
| 45 | 7,3 | 0.9 | 500 | 200 | 30 | 0.5 | ~49 sec | 1 |
| 46 | 7,3 | 0.9 | 1000 | 200 | 20 | 0.5 | ~60 sec | 11 (perfect) |
| 47 | 7,3 | 0.9 | 1000 | 200 | 10 | 0.5 | ~47 sec | 9 |
| 48 | 7,3 | 0.9 | 1000 | 200 | 60 | 0.5 | ~79 sec | 3 |
| 49 | 7,3 | 0.9 | 1000 | 100 | 25 | 0.5 | ~35 sec | 0 |
| 50 | 7,3 | 0.9 | 1000 | 200 | 25 | 0.5 | ~54 sec | 0 |
| 51 | 7,3 | 0.95 | 1000 | 200 | 25 | 0.5 | ~53 sec | 0 |
| 52 | 7,3 | 0.98 | 1000 | 200 | 25 | 0.5 | ~61 sec | 4 |
| 53 | 7,3 | 0.98 | 1000 | 300 | 30 | 0.5 | ~70 sec | 9 |
| 54 | 5,3 | 0.98 | 1000 | 300 | 30 | 0.5 | ~67 sec | 8 |
| 55 | 9,3 | 0.98 | 1000 | 300 | 30 | 0.5 | ~80 sec | 2 |
| 56 | 7,3 | 0.98 | 2000 | 300 | 30 | 0.6 | ~125 sec | 11 (perfect) |
| 57 | 7,5 | 0.98 | 1000 | 300 | 30 | 0.5 | ~75 sec | 4 |
| 58 | 7,3 | 0.99 | 2000 | 300 | 30 | 0.6 | ~98 sec | 1 |
| 59 | 7,3 | 0.98 | 1200 | 300 | 30 | 0.5 | ~75 sec | 8 |
| 60 | 7,3 | 0.98 | 2000 | 300 | 30 | 0.6 | ~110 sec | 10 |
| 61 | 7,3 | 0.98 | 2000 | 300 | 30 | 0.6 | ~110 sec | 11 (perfect) |

Discussion of final parameters and results

My 61st run used the same parameters as my 56th and 60th and it trained well, but the single mistake on run 60 shows how challenging this problem is to train. The table of experiments lists the parameters I tested, the training time, and the number of the 11 goal positions for which the final test trajectories reached the goal (11 is a perfect score). My early success at run 3 (8 correct tests) was likely a randomly fortunate Q network; as run 28 shows, training the same configuration longer did not improve results. The parameter that helped the most was raising finalEpsilon: 0.9 was too high (compare runs 39 and 40), but 0.5 and 0.6 worked well. I expect that keeping the random action probability high helped the network stay stable across the random trajectories of the marble. The "mean reinforcement" graph was helpful for seeing when a network's average reinforcement peaked. Another interesting result is that setting gamma a bit lower seemed to help (0.98 versus 0.99 or higher). I had noticed that, once trained, the marble could reach the goal in about 50 steps, so setting gamma to 1 - 1/50 seemed like a good choice: reinforcement for reaching the goal still carries weight 50 steps away, while initial training is not dominated by the noisy early Q-network outputs used to estimate the next state's value. nSCGIterations seemed best at around 20-30 iterations; I suspect setting it too high caused the network to overtrain on early results, preventing good convergence. The table shows the other experiments I ran to find parameters and training counts that worked well.
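As a rough check on that choice of $\gamma$, using the roughly 50 steps a trained marble needs to reach the goal:

$$ \begin{align*} \gamma = 1 - \tfrac{1}{50} = 0.98, \qquad \gamma^{50} = 0.98^{50} \approx 0.36 \end{align*} $$

so reinforcement earned about 50 steps ahead still retains roughly a third of its value, while contributions from much longer horizons, and with them the noisy early Q-network outputs in the bootstrapped term $\gamma Q(s_{t+1},a_{t+1};\wv)$, are strongly damped.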

In [14]:
# This cell runs the optimized parameters that I found for a variable goal position

gamma = 0.98
nTrials = 2000
nStepsPerTrial = 300
nSCGIterations = 30
finalEpsilon = 0.6
epsilonDecay = np.exp(np.log(finalEpsilon)/(nTrials))  # to produce this final value
nh = [7,3]
qnet = nn.NeuralNetwork([4] + nh + [1])
qnet.setInputRanges(( (0, 10), (-3, 3), (-1,1), (0, 10)))

fig = plt.figure(figsize=(15,50))

epsilon = 1
epsilonTrace = np.zeros(nTrials)
rtrace = np.zeros(nTrials)

for trial in range(nTrials):

    # Collect nStepsPerRep samples of X, R, Qn, and Q, and update epsilon
    samples = makeSamples(qnet, nStepsPerTrial)

    ns = 3 # position, velocity, goal
    na = 1
    X = samples[:, :ns+na]
    R = samples[:, ns+na:ns+na+1]
    nextX = samples[:, ns+na+1:]
    nextQ = qnet.use(nextX)

    qnet.train(X, R + gamma * nextQ, nIterations = nSCGIterations)
    
    # Decay epsilon
    epsilon *= epsilonDecay

    # Rest is for plotting
    epsilonTrace[trial] = epsilon
    rtrace[trial] = np.mean(R)

    if trial % (nTrials//5) == 0 or trial == nTrials-1:
        plt.clf()
        plotStatus(qnet, X, R, trial,epsilonTrace,rtrace)
        testIt(qnet,10,nStepsPerTrial)
        clear_output(wait=True)
        display(fig);
        plt.pause(0.01)

clear_output(wait=True)

Extra Credit

I'm interested in seeing how well I can learn to fly a rocket ship to dock with an orbiting space station. For this exercise, a planet sits at location (0,0) and the 'goal' has position and velocity affected by gravity (an attractive force toward the planet that weakens with distance). The 'world' ranges from -10 to 10 in X and Y, with the 'goal' orbiting at a distance of about 6 from the planet. The 'marble' (rocket) also has a position and velocity and is likewise pulled by the planet; no gravitational force is applied between the rocket and the 'goal' (space station). As recommended in the extra credit section, I use 7 actions: no rocket thrust, and thrust in one of 6 directions spaced at 60-degree angles. Negative reinforcement of -0.2 is applied if the rocket is more than 5 units from the space station and moving counterclockwise around the planet (the wrong direction), and -0.1 if it is more than 5 units away but moving clockwise. Positive reinforcement of 0.1 is given if the distance from the rocket to the space station is less than 3 units, and 0.2 if it is less than 2 units. Crashing into the planet is bad: the rocket stays stuck for a while to accumulate negative reinforcement and then randomly jumps to a corner of the world to help with training.
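Written compactly, with $d$ the rocket-to-station distance, the reinforcement scheme just described (and implemented in the reinforcement function below) is

$$ \begin{align*} r(s) = \begin{cases} 0.2 & d \le 2 \\ 0.1 & 2 < d \le 3 \\ 0 & 3 < d \le 5 \\ -0.1 & d > 5 \text{ and moving clockwise} \\ -0.2 & d > 5 \text{ and moving counterclockwise} \end{cases} \end{align*} $$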

In [15]:
# Action directions of thrust
#   6   1
#    \ /
#  5--0--2
#    / \
#   4   3
validActions = np.array([[0,0],[0.25,np.sqrt(3)/4],[0.5,0],[0.25,-np.sqrt(3)/4],
                         [-0.25,-np.sqrt(3)/4],[-0.5,0],[-0.25,np.sqrt(3)/4]])

def initialState():
    # State is: x&y position, x&y velocity, goal x&y position, goal x&y velocity
    # World has planet at 0,0 and the grid is -10 to 10 in both X and Y
    # The space station (goal) starts at X=0, Y=6 with velocity chosen to give a slightly elliptical orbit
    return np.array([20*np.random.random_sample()-10, 20*np.random.random_sample()-10,
                     3*(0.5-np.random.random_sample()), 3*(0.5-np.random.random_sample()),0,6,1,0]);

def nextState(s,a):
    s = copy.copy(s)   
    deltaT = 0.1                           # Euler integration time step
    s[0] += deltaT * s[2]                  # Update rocket X position
    s[1] += deltaT * s[3]                  # Update rocket Y position
    s[4] += deltaT * s[6]                  # Update space station X position
    s[5] += deltaT * s[7]                  # Update space station Y position
    ax,ay = validActions[a]
    r2rocket = s[0]*s[0] + s[1]*s[1]
    if r2rocket < 0.0001:  # avoid divide by 0 error, crash detection done later
        r2rocket = 0.00009
    r2station = s[4]*s[4] + s[5]*s[5]
    G = 1                             # Gravitational strength
    gxr = -G*s[0]/r2rocket            # Rocket acceleration toward planet (simplified gravity, magnitude G/r)
    gyr = -G*s[1]/r2rocket            #   so the pull weakens with distance to the planet
    gxs = -G*s[4]/r2station           # Space station acceleration toward planet
    gys = -G*s[5]/r2station
    s[2] += deltaT * (ax + gxr)       # Update rocket X velocity
    s[3] += deltaT * (ay + gyr)       # Update rocket Y velocity
    s[6] += deltaT * gxs              # Update space station X velocity
    s[7] += deltaT * gys              # Update space station Y velocity
    if s[0] < -10:    # Bound next position. If at limits, set velocity to 0.
        s = np.array([-10,s[1],0,s[3],s[4],s[5],s[6],s[7]])
    elif s[0] > 10:
        s = np.array([10,s[1],0,s[3],s[4],s[5],s[6],s[7]])
    if s[1] < -10:    # Bound next position. If at limits, set velocity to 0.
        s = np.array([s[0],-10,s[2],0,s[4],s[5],s[6],s[7]])
    elif s[1] > 10:
        s = np.array([s[0],10,s[2],0,s[4],s[5],s[6],s[7]])
    if r2rocket < 0.01:   # crashed into planet
        if r2rocket < 0.0002:  # sometimes, pop out to random +/-9 corner to help learning
            s = np.array([np.random.choice([-9,9]),np.random.choice([-9,9]),s[0],s[1],s[4],s[5],s[6],s[7]])
        else: # otherwise, randomize position within crash zone to improve learning
            s = np.array([np.random.random_sample()/10-0.05,np.random.random_sample()/10-0.05,s[0]/9,s[1]/9,s[4],s[5],s[6],s[7]])
    return s   # No need to bound space station position, it will stay in bounds

def reinforcement(s):  # s is new state
    distance = np.sqrt((s[0]-s[4])**2 + (s[1]-s[5])**2)
    # dot product of clockwise direction and current velocity
    clockwise = (s[1]*s[2]-s[0]*s[3]) > 0
    return (-0.1 if clockwise else -0.2) if distance > 5 else 0 if distance > 3 else 0.1 if distance > 2 else 0.2

def policy(qnet, state, epsilon):
    if np.random.rand(1) < epsilon:
        actioni = np.random.randint(validActions.shape[0])
    else:
        # inputs = np.hstack(( np.tile(state, (validActions.shape[0], 1)), validActions.reshape((-1,1))))
        inputs = np.hstack(( np.tile(state, (validActions.shape[0], 1)), validActions))
        qs = qnet.use(inputs)
        actioni = np.argmax(qs)
    return actioni
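A quick note on the orbit implied by this simplified gravity: the inward acceleration computed in nextState has magnitude $G/r$ (rather than the usual $G/r^2$), so a circular orbit requires $v^2/r = G/r$, i.e. $v = \sqrt{G} = 1$ at any radius. The station's initial state, position (0,6) with velocity (1,0), therefore traces a nearly circular orbit (Euler integration adds a small drift), with period $2\pi \cdot 6 / 1 \approx 37.7$ time units, or about 377 steps at $\Delta t = 0.1$; this is why the 1100 steps per trial used later correspond to almost three orbits.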
In [16]:
def makeSamples(qnet, nStepsPerStart):
    samples = []
    state = initialState()
    a = policy(qnet, state, epsilon)
    oldact = a
    for iStep in range(nStepsPerStart):
        newState = nextState(state, a)
        r = reinforcement(newState)
        newAct = policy(qnet, newState, epsilon)
        # SARSA
        samples.append(state.tolist() + validActions[a].tolist() + [r] + newState.tolist() +
                       validActions[newAct].tolist())
        state = newState
        oldact = a
        a = newAct
    return np.array(samples)
In [18]:
def plotStatus(qnet, X, R, trial, epsilonTrace, rtrace):
    plt.subplot(5,3,1)
    plt.plot(epsilonTrace[:trial+1])
    plt.ylabel("Random Action Probability ($\epsilon$)")
    plt.ylim(0,1)
    plt.subplot(5,3,2)
    plt.plot(X[:,0])
    plt.plot(X[:,4],'--',alpha=0.5,lw=5)
    plt.ylabel("$x$")
    plt.ylim(-11,11)
    plt.subplot(5,3,6)
    plt.plot(X[:,0],X[:,1])
    plt.plot(X[0,0],X[0,1],'o')
    plt.plot(X[:,4],X[:,5],'--',alpha=0.5,lw=5)
    plt.plot(X[0,4],X[0,5],'o',alpha=0.5,lw=5)
    plt.title("Rocket and Station path trial")
    plt.xlim(-11,11)
    plt.ylim(-11,11)
    qs = qnet.use(np.array([[s*2-10,5,0,0,0,6,1,0] + validActions[a].tolist() 
                            for a in range(validActions.shape[0]) for s in range(11)]))
    plt.subplot(5,3,12)
    acts = ["0","1","2","3","4","5","6"]
    actsiByState = np.argmax(qs.reshape((validActions.shape[0],-1)),axis=0)
    for i in range(11):
        plt.text(i*2-10,0,acts[actsiByState[i]])
        plt.xlim(-11,11)
        plt.ylim(-1,1)
    plt.text(-9,0.2,"Policy for Zero Velocity, y=5, goal at (0,6)")
    plt.axis("off")
    plt.subplot(5,3,4)
    plt.plot(rtrace[:trial+1],alpha=0.5)
    binSize = 20
    if trial+1 > binSize:
        # Calculate mean of every bin of binSize reinforcement values
        smoothed = np.mean(rtrace[:int(trial/binSize)*binSize].reshape((int(trial/binSize),binSize)),axis=1)
        plt.plot(np.arange(1,1+int(trial/binSize))*binSize,smoothed)
    plt.ylabel("Mean reinforcement")
    plt.subplot(5,3,5)
    plt.plot(X[:,1])
    plt.plot(X[:,5],'--',alpha=0.5,lw=5)
    plt.ylabel("$y$")
    plt.ylim(-11,11)
    plt.subplot(5,3,3)
    plt.plot(R[:,0]*10)
    plt.plot(X[:,4],'--',alpha=0.5,lw=5)
    plt.ylabel("reinforcement * 10")
    plt.ylim(-11,11)
    plt.subplot(5,3,9)
    qnet.draw(["$x$","$\dot{x}$","$y$","$\dot{y}$","$gx$","$\dot{gx}$","$gy$","$\dot{gy}$","$a$"],["Q"])

    plt.subplot(5,3,7)
    n = 20
    posx = np.linspace(-10,10,n)
    posy =  np.linspace(-10,10,n)
    xs,ys = np.meshgrid(posx,posy)
    xsflat = xs.flat
    ysflat = ys.flat
    qs = qnet.use(np.array([[xsflat[i],ysflat[i],0,0,0,6,1,0] + validActions[a].tolist()
                            for a in range(validActions.shape[0]) for i in range(len(xsflat))]))
    qs = qs.reshape((validActions.shape[0],-1)).T
    qsmax = np.max(qs,axis=1).reshape(xs.shape)
    cs = plt.contourf(xs,ys,qsmax)
    plt.colorbar(cs)
    plt.xlabel("$x$")
    plt.ylabel("$y$")
    plt.title("Max Q: goal (0,6), velocity 0")
    plt.subplot(5,3,8)
    acts = np.argmax(qs,axis=1).reshape(xs.shape)
    cs = plt.contourf(xs,ys,acts,[-0.5,0.5,1.5,2.5,3.5,4.5,5.5,6.5])
    plt.colorbar(cs)
    plt.xlabel("$x$")
    plt.ylabel("$y$")
    plt.title("Action: goal (0,6), velocity 0")

    s = plt.subplot(5,3,10)
    rect = s.get_position()
    ax = Axes3D(plt.gcf(),rect=rect)
    ax.plot_surface(xs,ys,qsmax,cstride=1,rstride=1,cmap=cm.viridis,linewidth=0)
    ax.set_xlabel("$x$")
    ax.set_ylabel("$y$")
    plt.title("Max Q: goal (0,6), velocity 0")

    s = plt.subplot(5,3,11)
    rect = s.get_position()
    ax = Axes3D(plt.gcf(),rect=rect)
    ax.plot_surface(xs,ys,acts,cstride=1,rstride=1,cmap=cm.viridis,linewidth=0)
    ax.set_xlabel("$x$")
    ax.set_ylabel("$y$")
    plt.title("Action: goal (0,6), velocity 0")
    
def testIt(qnet,nStepsPerTrial):
    for i in range(3):  # small, medium, large squares for start positions
        plt.subplot(5,3,13+i)
        for x in range(4):
            sx = -i*3-3 if (x==0) or (x==1) else i*3+3
            sy = -i*3-3 if (x==0) or (x==2) else i*3+3
            s = [sx,sy,0,0,0,6,1,0] ## 0 velocity
            xtrace = np.zeros((nStepsPerTrial,8))
            for step in range(nStepsPerTrial):
                a = policy(qnet, s, 0.0)  # epsilon = 0
                s = nextState(s,a)
                xtrace[step,:] = s
            plt.plot(xtrace[:,0],xtrace[:,1])
            plt.plot(xtrace[0,0],xtrace[0,1],'o')
            plt.xlim(-11,11)
            plt.ylim(-11,11)
            plt.plot(xtrace[:,4],xtrace[:,5],'--',alpha=0.5,lw=5)
            plt.ylabel('$y$')
            plt.xlabel('$x$')
            plt.title('State Trajectories for corners at +/-' + str(i*3+3) + ', $\epsilon=0$')
In [24]:
# This cell runs the optimal network parameters I found for tracking the
# orbiting space station with a rocket.

gamma = 0.98
nTrials = 1000
nStepsPerTrial = 1100   # Almost 3 orbits of space station around planet
nSCGIterations = 20
finalEpsilon = 0.7
epsilonDecay = np.exp(np.log(finalEpsilon)/(nTrials))  # to produce this final value
nh = [8,2]
qnet = nn.NeuralNetwork([10] + nh + [1])
qnet.setInputRanges(( (-10, 10), (-10,10), (-3, 3), (-3, 3), (-10, 10), (-10, 10), (-3, 3), (-3, 3), (-0.5,0.5), (-0.5,0.5)))

fig = plt.figure(figsize=(15,25))

epsilon = 1
epsilonTrace = np.zeros(nTrials)
rtrace = np.zeros(nTrials)

for trial in range(nTrials):

    # Collect nStepsPerRep samples of X, R, Qn, and Q, and update epsilon
    samples = makeSamples(qnet, nStepsPerTrial)

    ns = 8 # X & Y position and velocity for space station and rocket
    na = 2
    X = samples[:, :ns+na]
    R = samples[:, ns+na:ns+na+1]
    nextX = samples[:, ns+na+1:]
    nextQ = qnet.use(nextX)

    qnet.train(X, R + gamma * nextQ, nIterations = nSCGIterations)
    
    # Decay epsilon
    epsilon *= epsilonDecay

    # Rest is for plotting
    epsilonTrace[trial] = epsilon
    rtrace[trial] = np.mean(R)

    if trial % (nTrials//5) == 0 or trial == nTrials-1:
        plt.clf()
        plotStatus(qnet, X, R, trial,epsilonTrace,rtrace)
        testIt(qnet,nStepsPerTrial)
        clear_output(wait=True)
        display(fig);
        plt.pause(0.01)

clear_output(wait=True)

One point of extra credit will be given for completing each of the following things.

  • Change the marble's world from one dimension to two. Add graphs of the marble's movement in the two-dimensional plane.
    • I first tested my code with the goal held at position (0,6) and light gravity. I added the diagrams shown above to help me monitor and improve training behavior. The graphs above are for the full-gravity, orbiting space station training.
    • Proceeding row-by-row from the upper left: the first graph is still the random action probability.
    • The 2nd graph shows the rocket X position and station X position each time step of a trial.
    • The 3rd graph shows the reinforcement applied and the station X position for the same trial as graph 2.
    • The 4th graph shows the reinforcement average per trial and a moving average of 20 trials. Similar to the 1D moving goal case, this graph was a valuable way to see if various parameters improved learned behavior.
    • The 5th graph shows the rocket Y position and station Y position for the same trial as graph 2.
    • The 6th graph shows the rocket and space station paths in 2 dimensions for the trial, with the start points highlighted.
    • The 7th graph shows the max Q values over rocket XY positions when the station is at (0,6) and the rocket has velocity 0. When well trained, the highest Q values are near or to the right of (0,6) (the station is moving to the right when it is at position (0,6)).
    • The 8th graph shows the policy chosen for the rocket XY position when the station is at 0,6 and the rocket has velocity 0.
    • The 9th graph is the drawing of the neural network.
    • The 10th and 11th graphs show a 3D image of graphs 7 and 8.
    • The 12th diagram shows the policy for various X positions when the rocket is at Y position 5, has velocity 0, and the station is at (0,6).
    • The 13th through 15th diagrams show the trajectories of 4 rockets starting at the four corners of squares with vertices at ($\pm$3,$\pm$3), ($\pm$6,$\pm$6), and ($\pm$9,$\pm$9), one square per graph.
  • Increase the number of valid actions from three to seven.
    • I've done this. My first attempt was to use a single network input varying from 0 to 6 for the 7 actions I chose, but the network had trouble learning the value of different actions in different situations. Next I tried 7 inputs to the network, one per action, but the network still seemed to find the 7 actions hard to process. The final approach, which worked well, was to use 2 inputs for the action: the X and Y components of the thrust acceleration vector being applied (see the short sketch after this list).
  • Add a variable wind as a force on the marble, along with another state variable that indicates wind speed and direction.
    • I added gravity instead of wind, and the goal is now in orbit, with both its position and velocity included in the full state. Observing the traces, I can see that the system learns from the reinforcement that it should move clockwise around the planet and stay near the space station.
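Below is a minimal sketch of the three action encodings described in the second extra-credit item above, illustrated for action index 3 (thrust down and to the right). Only the last encoding, the raw (ax, ay) thrust components taken from validActions, is what the final code feeds to the Q network; the scalar and one-hot variants are my own illustrations of the earlier attempts.

a = 3                                         # action index: thrust down-right

# 1) single scalar input 0..6 -- the network learned action values poorly
scalar_encoding = np.array([a])

# 2) one input per action (one-hot) -- still hard for the network to use
onehot_encoding = np.zeros(len(validActions))
onehot_encoding[a] = 1

# 3) final choice: the (ax, ay) thrust components themselves
vector_encoding = validActions[a]             # [0.25, -sqrt(3)/4]

# The Q-network input is the 8-element state with the 2 action components appended,
# matching the 10-input network defined above.
state = np.array([3, 4, 0, 0, 0, 6, 1, 0])    # example rocket and station state
qnet_input = np.hstack((state, vector_encoding))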