$\newcommand{\xv}{\mathbf{x}} \newcommand{\Xv}{\mathbf{X}} \newcommand{\yv}{\mathbf{y}} \newcommand{\zv}{\mathbf{z}} \newcommand{\av}{\mathbf{a}} \newcommand{\Wv}{\mathbf{W}} \newcommand{\wv}{\mathbf{w}} \newcommand{\tv}{\mathbf{t}} \newcommand{\Tv}{\mathbf{T}} \newcommand{\muv}{\boldsymbol{\mu}} \newcommand{\sigmav}{\boldsymbol{\sigma}} \newcommand{\phiv}{\boldsymbol{\phi}} \newcommand{\Phiv}{\boldsymbol{\Phi}} \newcommand{\Sigmav}{\boldsymbol{\Sigma}} \newcommand{\Lambdav}{\boldsymbol{\Lambda}} \newcommand{\half}{\frac{1}{2}} \newcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}} \newcommand{\argmin}[1]{\underset{#1}{\operatorname{argmin}}}$
Steve Kommrusch
Key neural network subroutines provided by Professor Chuck Anderson
In this assignment, I modify provided reinforcement learning code to solve the dynamic marble problem. I first solve a more complex version of the marble problem in which the goal position is specified as an additional state variable, and I then adapt the functions to a two-dimensional gravity problem: docking a rocket with a space station orbiting a planet.
The reinforcement learning objective is to find Q values that correctly evaluate the best action for a given state, by making this approximation as accurate as possible:
$$ \begin{align*} Q(s_t,a_t) \approx \sum_{k=0}^\infty r_{t+k+1} \end{align*} $$

This is usually formulated as the least squares objective:

$$ \begin{align*} \text{Minimize } \mathrm{E} \left ( \sum_{k=0}^\infty r_{t+k+1} - Q(s_t,a_t)\right )^2 \end{align*} $$

Using SARSA (state, action, reinforcement, next state, next action) for training, and discounting future reinforcements in the temporal-difference error, we can formulate the learning problem as minimizing:

$$ \begin{align*} \text{Minimize } \mathrm{E} \Bigl ( r_{t+1} + \gamma Q(s_{t+1},a_{t+1};\wv) - Q(s_t,a_t;\wv) \Bigr )^2 \end{align*} $$

The code in this section was provided by Professor Chuck Anderson; the functions here are unchanged so as to provide a starting example. Professor Anderson also provided A5.tar, which included the files below. At the end of this section I describe the graphs that are generated.
marble.py
neuralnetworksbylayer.py
layers.py
scaledconjugategradient.py
mlutils.py
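To make the temporal-difference targets concrete before the provided code, here is a minimal standalone sketch of forming the regression targets $r_{t+1} + \gamma Q(s_{t+1},a_{t+1};\wv)$ from a batch of samples. The helper name sarsa_targets is mine and is not part of A5.tar; the training loops below compute exactly this quantity with qnet.use and qnet.train.
import numpy as np
def sarsa_targets(R, nextQ, gamma):
    # SARSA regression targets: r_{t+1} + gamma * Q(s_{t+1}, a_{t+1});
    # R and nextQ are column vectors aligned row-by-row with the samples
    return R + gamma * nextQ
# Example: with gamma = 0.9, reinforcement -0.1 followed by a next state-action
# value of -1.0 gives the target -0.1 + 0.9*(-1.0) = -1.0
sarsa_targets(np.array([[-0.1]]), np.array([[-1.0]]), 0.9)  # [[-1.0]]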
import neuralnetworksbylayer as nn
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display, clear_output
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm
import copy
%matplotlib inline
To define a reinforcement learning problem, we need functions that return the set of valid actions, an initial state, the next state that results from taking an action in the current state, and the reinforcement received for arriving in a new state. We also need an $\epsilon$-greedy policy function that uses the Q network to pick actions.
Let's first define the set of valid actions. For our simple one-dimensional marble problem, we can push left, right, or not push at all.
validActions = np.array([ -1, 0, 1])
def initialState():
return np.array([10*np.random.random_sample(), 3*(0.5-np.random.random_sample())]) # position in [0,10), velocity in (-1.5, 1.5]
def nextState(s,a):
s = copy.copy(s) # s[0] is position, s[1] is velocity. a is -1, 0 or 1
deltaT = 0.1 # Euler integration time step
s[0] += deltaT * s[1] # Update position
s[1] += deltaT * (2 * a - 0.2 * s[1]) # Update velocity. Includes friction
if s[0] < 0: # Bound next position. If at limits, set velocity to 0.
s = np.array([0,0])
elif s[0] > 10:
s = np.array([10,0])
return s
def reinforcement(s): # s is new state
goal = 5
return 0 if abs(s[0]-goal) < 1 else -0.1
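# For example, reinforcement(np.array([5.5, 0.0])) is 0 (within 1 unit of the goal at 5),
# while reinforcement(np.array([2.0, 1.0])) is -0.1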
def policy(qnet, state, epsilon):
if np.random.rand(1) < epsilon:
actioni = np.random.randint(validActions.shape[0])
else:
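# Evaluate Q(s,a) for all actions at once: tile the state into one row per action
# and append the candidate action as the last input column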
inputs = np.hstack(( np.tile(state, (validActions.shape[0], 1)), validActions.reshape((-1,1))))
qs = qnet.use(inputs)
actioni = np.argmax(qs)
return validActions[actioni]
Now we need a function to generate a batch of samples of interactions with the marble.
def makeSamples(qnet, nStepsPerStart):
samples = []
state = initialState()
act = policy(qnet, state, epsilon)
oldact = act
for iStep in range(nStepsPerStart):
newState = nextState(state, act)
r = reinforcement(newState)
newAct = policy(qnet, newState, epsilon)
# SARSA
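# Each sample row has 7 columns: state (2), action (1), reinforcement (1), next state (2), next action (1)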
samples.append(state.tolist() + [act, r] + newState.tolist() + [newAct])
state = newState
oldact = act
act = newAct
return np.array(samples)
Now we can play. First we define functions for plotting status and testing the learned policy, then the constants we need.
def plotStatus(qnet, X, R, trial, epsilonTrace, rtrace):
plt.subplot(4,3,1)
plt.plot(epsilonTrace[:trial+1])
plt.ylabel("Random Action Probability ($\epsilon$)")
plt.ylim(0,1)
plt.subplot(4,3,2)
plt.plot(X[:,0])
plt.plot([0,X.shape[0]], [5,5],'--',alpha=0.5,lw=5)
plt.ylabel("$x$")
plt.ylim(-1,11)
qs = qnet.use(np.array([[s,0,a] for a in validActions for s in range(11)]))
plt.subplot(4,3,3)
acts = ["L","0","R"]
actsiByState = np.argmax(qs.reshape((len(validActions),-1)),axis=0)
for i in range(11):
plt.text(i,0,acts[actsiByState[i]])
plt.xlim(-1,11)
plt.ylim(-1,1)
plt.text(2,0.2,"Policy for Zero Velocity")
plt.axis("off")
plt.subplot(4,3,4)
plt.plot(rtrace[:trial+1],alpha=0.5)
binSize = 20
if trial+1 > binSize:
# Calculate mean of every bin of binSize reinforcement values
smoothed = np.mean(rtrace[:int(trial/binSize)*binSize].reshape((int(trial/binSize),binSize)),axis=1)
plt.plot(np.arange(1,1+int(trial/binSize))*binSize,smoothed)
plt.ylabel("Mean reinforcement")
plt.subplot(4,3,5)
plt.plot(X[:,0],X[:,1])
plt.plot(X[0,0],X[0,1],'o')
plt.xlabel("$x$")
plt.ylabel("$\dot{x}$")
plt.fill_between([4,6],[-5,-5],[5,5],color="red",alpha=0.3)
plt.xlim(-1,11)
plt.ylim(-5,5)
plt.subplot(4,3,6)
qnet.draw(["$x$","$\dot{x}$","$a$"],["Q"])
plt.subplot(4,3,7)
n = 20
positions = np.linspace(0,10,n)
velocities = np.linspace(-5,5,n)
xs,ys = np.meshgrid(positions,velocities)
xsflat = xs.flat
ysflat = ys.flat
qs = qnet.use(np.array([[xsflat[i],ysflat[i],a] for a in validActions for i in range(len(xsflat))]))
qs = qs.reshape((len(validActions),-1)).T
qsmax = np.max(qs,axis=1).reshape(xs.shape)
cs = plt.contourf(xs,ys,qsmax)
plt.colorbar(cs)
plt.xlabel("$x$")
plt.ylabel("$\dot{x}$")
plt.title("Max Q")
plt.subplot(4,3,8)
acts = np.array(validActions)[np.argmax(qs,axis=1)].reshape(xs.shape)
cs = plt.contourf(xs,ys,acts,[-2, -0.5, 0.5, 2])
plt.colorbar(cs)
plt.xlabel("$x$")
plt.ylabel("$\dot{x}$")
plt.title("Actions")
s = plt.subplot(4,3,10)
rect = s.get_position()
ax = Axes3D(plt.gcf(),rect=rect)
ax.plot_surface(xs,ys,qsmax,cstride=1,rstride=1,cmap=cm.viridis,linewidth=0)
ax.set_xlabel("$x$")
ax.set_ylabel("$\dot{x}$")
plt.title("Max Q")
s = plt.subplot(4,3,11)
rect = s.get_position()
ax = Axes3D(plt.gcf(),rect=rect)
ax.plot_surface(xs,ys,acts,cstride=1,rstride=1,cmap=cm.viridis,linewidth=0)
ax.set_xlabel("$x$")
ax.set_ylabel("$\dot{x}$")
plt.title("Action")
def testIt(qnet,nTrials,nStepsPerTrial):
xs = np.linspace(0,10,nTrials)
plt.subplot(4,3,12)
for x in xs:
s = [x,0] ## 0 velocity
xtrace = np.zeros((nStepsPerTrial,2))
for step in range(nStepsPerTrial):
a = policy(qnet, s, 0.0) # epsilon = 0
s = nextState(s,a)
xtrace[step,:] = s
plt.plot(xtrace[:,0],xtrace[:,1])
plt.xlim(-1,11)
plt.ylim(-5,5)
plt.plot([5,5],[-5,5],'--',alpha=0.5,lw=5)
plt.ylabel('$\dot{x}$')
plt.xlabel('$x$')
plt.title('State Trajectories for $\epsilon=0$')
gamma = 0.999
nTrials = 300
nStepsPerTrial = 500
nSCGIterations = 30
finalEpsilon = 0.01
epsilonDecay = np.exp(np.log(finalEpsilon)/(nTrials)) # to produce this final value
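# Sanity check: epsilon starts at 1 and is multiplied by epsilonDecay once per trial,
# so after nTrials trials epsilon = epsilonDecay**nTrials = exp(log(finalEpsilon)) = finalEpsilon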
nh = [5,5]
qnet = nn.NeuralNetwork([3] + nh + [1]) # [3, 5, 5, 1]
qnet.setInputRanges(( (0, 10), (-3, 3), (-1,1)))
fig = plt.figure(figsize=(15,15))
epsilon = 1
epsilonTrace = np.zeros(nTrials)
rtrace = np.zeros(nTrials)
for trial in range(nTrials):
# Collect nStepsPerTrial samples of X, R, Qn, and Q, and update epsilon
samples = makeSamples(qnet, nStepsPerTrial)
ns = 2 # position, velocity
na = 1
X = samples[:, :ns+na]
R = samples[:, ns+na:ns+na+1]
nextX = samples[:, ns+na+1:]
nextQ = qnet.use(nextX)
qnet.train(X, R + gamma * nextQ, nIterations = nSCGIterations)
# Decay epsilon
epsilon *= epsilonDecay
# Rest is for plotting
epsilonTrace[trial] = epsilon
rtrace[trial] = np.mean(R)
if trial % (nTrials//10) == 0 or trial == nTrials-1:
plt.clf()
plotStatus(qnet, X, R, trial,epsilonTrace,rtrace)
testIt(qnet,10,500)
clear_output(wait=True)
display(fig);
plt.pause(0.01)
clear_output(wait=True)
Here is a brief description of the diagrams above, starting with the upper left diagram and proceeding row by row:
- Random Action Probability: the decay of $\epsilon$ over the trials.
- $x$: the marble's position over the steps of the most recent training trial, with the goal position (5) shown as a dashed line.
- Policy for Zero Velocity: the greedy action (L, 0, or R) at each integer position when the velocity is zero.
- Mean reinforcement: the mean reinforcement per trial, with a smoothed curve averaging bins of 20 trials.
- The state trajectory ($x$ versus $\dot{x}$) of the most recent trial, with the goal region shaded red.
- A drawing of the Q network's weights.
- Max Q: a contour plot of $\max_a Q(x,\dot{x},a)$ over position and velocity.
- Actions: a contour plot of the greedy action over position and velocity.
- Max Q and Action: surface-plot versions of the two contour plots.
- State Trajectories for $\epsilon=0$: greedy test runs from several zero-velocity start positions, with the goal shown as a dashed line.
I modified the above code to change the reinforcement learning problem to one in which the goal position of the marble can vary. The new code trains a single Q network for the marble problem with a variable goal. I ran over 60 experiments to find values for the network architecture (nh), gamma, nTrials, nStepsPerTrial, nSCGIterations, and finalEpsilon that result in a single Q network that can drive the marble to any goal position.
The general approach is as follows.
The state of the marble in the original code is $(x_t, \dot{x}_t)$. In this new problem, the state is $(x_t, \dot{x}_t, g_t)$, where $g_t$ is the goal at time $t$. The initialState function now randomly chooses a new goal $g_t$ as an integer between 1 and 9. nextState is changed so that the goal value carries over unchanged from the old state to the new state. Finally, the reinforcement function now depends on the current $g_t$ value.
Some figures are unchanged, but as requested I now plot the contour and surface plots for goals 1, 5, and 9, and I plot the zero-velocity test plots for all 11 goal values (0 through 10).
As discussed below, I was able to find parameters that work reasonably well, but they are not consistently perfect. The main discovery is that it helped to leave the final epsilon rather high (0.6) to ensure the network still performs well for unexpected states during usage.
This first run shows the results of the system with the same parameters as provided in the initial code section above.
validActions = np.array([ -1, 0, 1])
def initialState():
goal = np.floor(9*np.random.random_sample()+1) # Random integer goal from 1 to 9 inclusive
return np.array([10*np.random.random_sample(), 3*(0.5-np.random.random_sample()), goal])
def nextState(s,a):
s = copy.copy(s) # s[0] is position, s[1] is velocity. a is -1, 0 or 1
deltaT = 0.1 # Euler integration time step
s[0] += deltaT * s[1] # Update position
s[1] += deltaT * (2 * a - 0.2 * s[1]) # Update velocity. Includes friction
if s[0] < 0: # Bound next position. If at limits, set velocity to 0.
s = np.array([0,0,s[2]])
elif s[0] > 10:
s = np.array([10,0,s[2]])
return s
def reinforcement(s): # s is new state
goal = s[2]
return 0 if abs(s[0]-goal) < 1 else -0.1
def policy(qnet, state, epsilon):
if np.random.rand(1) < epsilon:
actioni = np.random.randint(validActions.shape[0])
else:
inputs = np.hstack(( np.tile(state, (validActions.shape[0], 1)), validActions.reshape((-1,1))))
qs = qnet.use(inputs)
actioni = np.argmax(qs)
return validActions[actioni]
nh = [5,5]
qnet = nn.NeuralNetwork([4] + nh + [1]) # [4, 5, 5, 1]
qnet.setInputRanges(( (0, 10), (-3, 3), (0, 10), (-1, 1))) # ranges for x, xdot, goal, action
def plotStatus(qnet, X, R, trial, epsilonTrace, rtrace):
plt.subplot(10,3,1)
plt.plot(epsilonTrace[:trial+1])
plt.ylabel("Random Action Probability ($\epsilon$)")
plt.ylim(0,1)
plt.subplot(10,3,2)
plt.plot(X[:,0])
plt.plot([0,X.shape[0]], [X[0,2],X[0,2]],'--',alpha=0.5,lw=5)
plt.ylabel("$x$")
plt.ylim(-1,11)
qs = qnet.use(np.array([[s,0,5,a] for a in validActions for s in range(11)]))
plt.subplot(10,3,3)
acts = ["L","0","R"]
actsiByState = np.argmax(qs.reshape((len(validActions),-1)),axis=0)
for i in range(11):
plt.text(i,0,acts[actsiByState[i]])
plt.xlim(-1,11)
plt.ylim(-1,1)
plt.text(1,0.2,"Policy for Zero Velocity and goal=5")
plt.axis("off")
plt.subplot(10,3,4)
plt.plot(rtrace[:trial+1],alpha=0.5)
binSize = 20
if trial+1 > binSize:
# Calculate mean of every bin of binSize reinforcement values
smoothed = np.mean(rtrace[:int(trial/binSize)*binSize].reshape((int(trial/binSize),binSize)),axis=1)
plt.plot(np.arange(1,1+int(trial/binSize))*binSize,smoothed)
plt.ylabel("Mean reinforcement")
plt.subplot(10,3,5)
plt.plot(X[:,0],X[:,1])
plt.plot(X[0,0],X[0,1],'o')
plt.xlabel("$x$")
plt.ylabel("$\dot{x}$")
plt.fill_between([X[0,2]-1,X[0,2]+1],[-5,-5],[5,5],color="red",alpha=0.3)
plt.xlim(-1,11)
plt.ylim(-5,5)
plt.subplot(10,3,6)
qnet.draw(["$x$","$\dot{x}$","$goal$","$a$"],["Q"])
n = 20
positions = np.linspace(0,10,n)
velocities = np.linspace(-5,5,n)
xs,ys = np.meshgrid(positions,velocities)
xsflat = xs.flat
ysflat = ys.flat
plt.subplot(10,3,7)
qs = qnet.use(np.array([[xsflat[i],ysflat[i],1,a] for a in validActions for i in range(len(xsflat))]))
qs = qs.reshape((len(validActions),-1)).T
qsmax = np.max(qs,axis=1).reshape(xs.shape)
cs = plt.contourf(xs,ys,qsmax)
plt.colorbar(cs)
plt.xlabel("$x$")
plt.ylabel("$\dot{x}$")
plt.title("Max Q for goal=1")
plt.subplot(10,3,8)
acts = np.array(validActions)[np.argmax(qs,axis=1)].reshape(xs.shape)
cs = plt.contourf(xs,ys,acts,[-2, -0.5, 0.5, 2])
plt.colorbar(cs)
plt.xlabel("$x$")
plt.ylabel("$\dot{x}$")
plt.title("Actions for goal=1")
s = plt.subplot(10,3,9)
rect = s.get_position()
ax = Axes3D(plt.gcf(),rect=rect)
ax.plot_surface(xs,ys,qsmax,cstride=1,rstride=1,cmap=cm.viridis,linewidth=0)
ax.set_xlabel("$x$")
ax.set_ylabel("$\dot{x}$")
plt.title("Max Q for goal=1")
s = plt.subplot(10,3,10)
rect = s.get_position()
ax = Axes3D(plt.gcf(),rect=rect)
ax.plot_surface(xs,ys,acts,cstride=1,rstride=1,cmap=cm.viridis,linewidth=0)
ax.set_xlabel("$x$")
ax.set_ylabel("$\dot{x}$")
plt.title("Actions for goal=1")
plt.subplot(10,3,11)
qs = qnet.use(np.array([[xsflat[i],ysflat[i],5,a] for a in validActions for i in range(len(xsflat))]))
qs = qs.reshape((len(validActions),-1)).T
qsmax = np.max(qs,axis=1).reshape(xs.shape)
cs = plt.contourf(xs,ys,qsmax)
plt.colorbar(cs)
plt.xlabel("$x$")
plt.ylabel("$\dot{x}$")
plt.title("Max Q for goal=5")
plt.subplot(10,3,12)
acts = np.array(validActions)[np.argmax(qs,axis=1)].reshape(xs.shape)
cs = plt.contourf(xs,ys,acts,[-2, -0.5, 0.5, 2])
plt.colorbar(cs)
plt.xlabel("$x$")
plt.ylabel("$\dot{x}$")
plt.title("Actions for goal=5")
s = plt.subplot(10,3,13)
rect = s.get_position()
ax = Axes3D(plt.gcf(),rect=rect)
ax.plot_surface(xs,ys,qsmax,cstride=1,rstride=1,cmap=cm.viridis,linewidth=0)
ax.set_xlabel("$x$")
ax.set_ylabel("$\dot{x}$")
plt.title("Max Q for goal=5")
s = plt.subplot(10,3,14)
rect = s.get_position()
ax = Axes3D(plt.gcf(),rect=rect)
ax.plot_surface(xs,ys,acts,cstride=1,rstride=1,cmap=cm.viridis,linewidth=0)
ax.set_xlabel("$x$")
ax.set_ylabel("$\dot{x}$")
plt.title("Actions for goal=5")
plt.subplot(10,3,15)
qs = qnet.use(np.array([[xsflat[i],ysflat[i],9,a] for a in validActions for i in range(len(xsflat))]))
qs = qs.reshape((len(validActions),-1)).T
qsmax = np.max(qs,axis=1).reshape(xs.shape)
cs = plt.contourf(xs,ys,qsmax)
plt.colorbar(cs)
plt.xlabel("$x$")
plt.ylabel("$\dot{x}$")
plt.title("Max Q for goal=9")
plt.subplot(10,3,16)
acts = np.array(validActions)[np.argmax(qs,axis=1)].reshape(xs.shape)
cs = plt.contourf(xs,ys,acts,[-2, -0.5, 0.5, 2])
plt.colorbar(cs)
plt.xlabel("$x$")
plt.ylabel("$\dot{x}$")
plt.title("Actions for goal=9")
s = plt.subplot(10,3,17)
rect = s.get_position()
ax = Axes3D(plt.gcf(),rect=rect)
ax.plot_surface(xs,ys,qsmax,cstride=1,rstride=1,cmap=cm.viridis,linewidth=0)
ax.set_xlabel("$x$")
ax.set_ylabel("$\dot{x}$")
plt.title("Max Q for goal=9")
s = plt.subplot(10,3,18)
rect = s.get_position()
ax = Axes3D(plt.gcf(),rect=rect)
ax.plot_surface(xs,ys,acts,cstride=1,rstride=1,cmap=cm.viridis,linewidth=0)
ax.set_xlabel("$x$")
ax.set_ylabel("$\dot{x}$")
plt.title("Actions for goal=9")
def testIt(qnet,nTrials,nStepsPerTrial):
for i in range(11):
xs = np.linspace(0,10,nTrials)
plt.subplot(10,3,19+i)
for x in xs:
s = [x,0,i] ## 0 velocity, goal=i
xtrace = np.zeros((nStepsPerTrial,3))
for step in range(nStepsPerTrial):
a = policy(qnet, s, 0.0) # epsilon = 0
s = nextState(s,a)
xtrace[step,:] = s
plt.plot(xtrace[:,0],xtrace[:,1])
plt.xlim(-1,11)
plt.ylim(-5,5)
plt.plot([i,i],[-5,5],'--',alpha=0.5,lw=5)
plt.ylabel('$\dot{x}$')
plt.xlabel('$x$')
plt.title('State Trajectories for goal=' + str(i) + ', $\epsilon=0$')
The run below uses all the same parameters as the code provided (same hidden layers, epsilon, etc.), but updated for the variable-goal problem: training trials use random goals from 1 to 9, and testing covers goals 0 through 10. The final results are not very good, as seen in the graphs.
# Generate graphs for arbitrary goal position with original (not optimized) parameters
# (see later run after the experiment table for optimized parameters)
fig = plt.figure(figsize=(15,50))
epsilon = 1
epsilonTrace = np.zeros(nTrials)
rtrace = np.zeros(nTrials)
for trial in range(nTrials):
# Collect nStepsPerTrial samples of X, R, Qn, and Q, and update epsilon
samples = makeSamples(qnet, nStepsPerTrial)
ns = 3 # position, velocity, goal
na = 1
X = samples[:, :ns+na]
R = samples[:, ns+na:ns+na+1]
nextX = samples[:, ns+na+1:]
nextQ = qnet.use(nextX)
qnet.train(X, R + gamma * nextQ, nIterations = nSCGIterations)
# Decay epsilon
epsilon *= epsilonDecay
# Rest is for plotting
epsilonTrace[trial] = epsilon
rtrace[trial] = np.mean(R)
if trial % (nTrials//5) == 0 or trial == nTrials-1:
plt.clf()
plotStatus(qnet, X, R, trial,epsilonTrace,rtrace)
testIt(qnet,10,500)
clear_output(wait=True)
display(fig);
plt.pause(0.01)
clear_output(wait=True)
After the results above with the initial parameters, I tried a variety of parameter changes whose results are summarized in the table below.
Run | nh | gamma | nTrials | nSteps/Trial | nSCGIterations | finalEpsilon | execution time | goals correct (of 11) |
---|---|---|---|---|---|---|---|---|
Initial | 5,5 | 0.999 | 300 | 500 | 30 | 0.01 | ~61 sec | 1 |
2 | 10,10,10 | 0.999 | 300 | 600 | 60 | 0.001 | ~86 sec | 2 |
3 | 10,10,10 | 0.999 | 600 | 1200 | 120 | 0.001 | ~321 sec | 8 |
4 | 10,10,10 | 0.999 | 400 | 1000 | 100 | 0.01 | ~171 sec | 1 |
5 | 10,10,10 | 0.999 | 600 | 1000 | 120 | 0.01 | ~277 sec | 7 |
6 | 10,10,10 | 0.999 | 600 | 500 | 120 | 0.01 | ~184 sec | 2 |
7 | 10,10,10 | 0.99 | 600 | 500 | 120 | 0.01 | ~187 sec | 1 |
8 | 10,10,10 | 0.99 | 600 | 300 | 120 | 0.01 | ~147 sec | 2 |
9 | 10,10,10 | 0.99 | 600 | 100 | 120 | 0.01 | ~85 sec | 1 |
10 | 20,10 | 0.99 | 600 | 100 | 120 | 0.01 | ~87 sec | 2 |
11 | 20,10 | 0.99 | 600 | 100 | 120 | 0.001 | ~80 sec | 1 |
12 | 30,10,5 | 0.99 | 600 | 50 | 60 | 0.002 | ~70 sec | 2 |
13 | 30,10,5 | 0.99 | 600 | 50 | 20 | 0.002 | ~28 sec | 1 |
14 | 30,10,5 | 0.99 | 600 | 10 | 20 | 0.002 | ~28 sec | 0 |
15 | 30,10,5 | 0.99 | 2000 | 100 | 60 | 0.002 | ~118 sec | 2 |
16 | 20,10 | 0.99 | 2000 | 100 | 60 | 0.002 | ~95 sec | 2 |
16b | 20,10 | 0.99 | 2000 | 100 | 30 | 0.01 | ~73 sec | 0 |
17 | 20,10 | 0.99 | 1000 | 100 | 30 | 0.02 | ~48 sec | 1 |
18 | 30,10,10 | 0.999 | 600 | 1200 | 120 | 0.001 | ~479 sec | 1 |
19 | 10,10 | 0.999 | 600 | 1200 | 120 | 0.001 | ~259 sec | 5 |
20 | 10,10,10 | 0.999 | 1200 | 600 | 60 | 0.001 | ~231 sec | 2 |
21 | 10,10,10 | 0.999 | 600 | 1200 | 60 | 0.02 | ~236 sec | 2 |
22 | 10,10,10 | 0.999 | 600 | 600 | 120 | 0.02 | ~207 sec | 3 |
23 | 10,10,10 | 0.999 | 600 | 600 | 60 | 0.02 | ~160 sec | 4 |
24 | 10,10,10 | 0.999 | 600 | 1200 | 60 | 0.02 | ~240 sec | 4 |
25 | 10,10,10 | 0.999 | 600 | 1200 | 120 | 0.02 | ~313 sec | 2 |
26 | 10,10,10 | 0.9 | 600 | 600 | 60 | 0.02 | ~145 sec | 1 |
27 | 5,5,5 | 0.999 | 600 | 1200 | 120 | 0.001 | ~255 sec | 1 |
28 | 10,10,10 | 0.999 | 3000 | 1200 | 120 | 0.001 | under 2000 sec | 2 |
29 | 5,5 | 0.99 | 4000 | 200 | 60 | 0.01 | under 1700 sec | 2 |
30 | 3,3 | 0.99 | 600 | 200 | 60 | 0.1 | under 60 sec | 2 |
31 | 3 | 0.99 | 600 | 200 | 60 | 0.1 | ~45 sec | 0 |
32 | 5 | 0.99 | 600 | 200 | 60 | 0.1 | ~50 sec | 2 |
33 | 7 | 0.99 | 600 | 200 | 60 | 0.1 | ~47 sec | 2 |
34 | 7,3 | 0.99 | 600 | 500 | 60 | 0.1 | ~97 sec | 6 |
35 | 7,3 | 0.999 | 600 | 500 | 60 | 0.1 | ~91 sec | 5 |
36 | 7,3 | 0.99 | 600 | 500 | 30 | 0.1 | ~69 sec | 3 |
37 | 7,3 | 0.99 | 1000 | 400 | 30 | 0.2 | ~80 sec | 4 |
38 | 7,3 | 0.99 | 1000 | 400 | 30 | 0.5 | ~80 sec | 11 (perfect) |
39 | 7,3 | 0.99 | 1000 | 200 | 30 | 0.5 | ~48 sec | 9 |
40 | 7,3 | 0.99 | 1000 | 200 | 30 | 0.9 | ~43 sec | 5 |
41 | 7,3 | 0.999 | 1000 | 200 | 30 | 0.5 | ~43 sec | 1 |
42 | 7,3 | 0.9 | 1000 | 200 | 30 | 0.5 | ~48 sec | 11 (perfect) |
43 | 7,3 | 0.5 | 1000 | 200 | 30 | 0.5 | ~48 sec | 8 |
44 | 7,3 | 0.98 | 1000 | 200 | 30 | 0.5 | ~55 sec | 4 |
45 | 7,3 | 0.9 | 500 | 200 | 30 | 0.5 | ~49 sec | 1 |
46 | 7,3 | 0.9 | 1000 | 200 | 20 | 0.5 | ~60 sec | 11 (perfect) |
47 | 7,3 | 0.9 | 1000 | 200 | 10 | 0.5 | ~47 sec | 9 |
48 | 7,3 | 0.9 | 1000 | 200 | 60 | 0.5 | ~79 sec | 3 |
49 | 7,3 | 0.9 | 1000 | 100 | 25 | 0.5 | ~35 sec | 0 |
50 | 7,3 | 0.9 | 1000 | 200 | 25 | 0.5 | ~54 sec | 0 |
51 | 7,3 | 0.95 | 1000 | 200 | 25 | 0.5 | ~53 sec | 0 |
52 | 7,3 | 0.98 | 1000 | 200 | 25 | 0.5 | ~61 sec | 4 |
53 | 7,3 | 0.98 | 1000 | 300 | 30 | 0.5 | ~70 sec | 9 |
54 | 5,3 | 0.98 | 1000 | 300 | 30 | 0.5 | ~67 sec | 8 |
55 | 9,3 | 0.98 | 1000 | 300 | 30 | 0.5 | ~80 sec | 2 |
56 | 7,3 | 0.98 | 2000 | 300 | 30 | 0.6 | ~125 sec | 11 (perfect) |
57 | 7,5 | 0.98 | 1000 | 300 | 30 | 0.5 | ~75 sec | 4 |
58 | 7,3 | 0.99 | 2000 | 300 | 30 | 0.6 | ~98 sec | 1 |
59 | 7,3 | 0.98 | 1200 | 300 | 30 | 0.5 | ~75 sec | 8 |
60 | 7,3 | 0.98 | 2000 | 300 | 30 | 0.6 | ~110 sec | 10 |
61 | 7,3 | 0.98 | 2000 | 300 | 30 | 0.6 | ~110 sec | 11 (perfect) |
My 61st run used the same parameters as my 56th and 60th runs and trained well, but the single mistake in run 60 shows how challenging this problem is to train. The table of experiments lists the parameters I tested, the time to train, and the number of the 11 goal positions for which the final greedy test trials drove the marble to the goal (11 is a perfect score). My initial success at run 3 (8 correct goals) was likely a fortunate random Q network; as run 28 shows, training the same architecture longer did not improve results.

The parameter that helped the most was raising finalEpsilon. A value of 0.9 was too high (compare runs 39 and 40), but 0.5 and 0.6 worked well. I expect that keeping the random action probability high helped the network stay stable across random trajectories of the marble. The "mean reinforcement" graph was helpful for seeing when a network's average reinforcement peaked.

Another interesting result is that setting gamma a bit lower seemed to help (0.98 versus 0.99 or more). I had noticed that the trained marble could reach the goal in about 50 steps, so setting gamma to 1 - 1/50 = 0.98 seemed like a good choice: the goal-attainment reinforcement still carries meaningful weight 50 steps away, but early in training the Q network's inaccurate next-state estimates are not overvalued. nSCGIterations seemed best at around 20-30 iterations; I suspect setting it too high causes the network to overtrain on early results, preventing good convergence. The table shows the other experiments I ran to find parameters and training counts that worked well.
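As a quick numeric check on that choice of $\gamma$ (a side calculation, not part of the training code):
gamma = 0.98
print(1 / (1 - gamma)) # 50.0 : the effective horizon in steps
print(gamma ** 50)     # ~0.364 : weight of a reinforcement 50 steps in the future
print(0.999 ** 300)    # ~0.741 : gamma=0.999 barely discounts over a 300-step trial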
# This cell runs the optimized parameters that I found for a variable goal position
gamma = 0.98
nTrials = 2000
nStepsPerTrial = 300
nSCGIterations = 30
finalEpsilon = 0.6
epsilonDecay = np.exp(np.log(finalEpsilon)/(nTrials)) # to produce this final value
nh = [7,3]
qnet = nn.NeuralNetwork([4] + nh + [1])
qnet.setInputRanges(( (0, 10), (-3, 3), (0, 10), (-1, 1))) # ranges for x, xdot, goal, action
fig = plt.figure(figsize=(15,50))
epsilon = 1
epsilonTrace = np.zeros(nTrials)
rtrace = np.zeros(nTrials)
for trial in range(nTrials):
# Collect nStepsPerTrial samples of X, R, Qn, and Q, and update epsilon
samples = makeSamples(qnet, nStepsPerTrial)
ns = 3 # position, velocity, goal
na = 1
X = samples[:, :ns+na]
R = samples[:, ns+na:ns+na+1]
nextX = samples[:, ns+na+1:]
nextQ = qnet.use(nextX)
qnet.train(X, R + gamma * nextQ, nIterations = nSCGIterations)
# Decay epsilon
epsilon *= epsilonDecay
# Rest is for plotting
epsilonTrace[trial] = epsilon
rtrace[trial] = np.mean(R)
if trial % (nTrials//5) == 0 or trial == nTrials-1:
plt.clf()
plotStatus(qnet, X, R, trial,epsilonTrace,rtrace)
testIt(qnet,10,nStepsPerTrial)
clear_output(wait=True)
display(fig);
plt.pause(0.01)
clear_output(wait=True)
I'm interested in seeing how well I can learn to fly a rocket ship to dock with an orbiting space station. For this exercise, a planet sits at location (0,0), and the 'goal' (space station) has position and velocity affected by the planet's gravity (in this simplified model the acceleration points toward the planet and decays with distance). The 'world' ranges from -10 to 10 in X and Y, with the station orbiting at a distance of about 6 from the planet. The 'marble' (rocket) also has position and velocity and is also pulled by the planet; gravity between the rocket and the station is ignored. As recommended in the extra credit section, I use 7 actions: no rocket thrust, and thrust in one of 6 directions spaced at 60-degree angles. Negative reinforcement of -0.2 is given if the rocket is more than 5 units from the space station and moving counterclockwise around the planet (the wrong direction), and -0.1 if it is more than 5 units away but moving clockwise; between 3 and 5 units the reinforcement is 0. Positive reinforcement of 0.1 is given if the rocket is within 3 units of the station, and 0.2 if it is within 2 units. Crashing into the planet is bad: the rocket stays stuck for a while to accumulate negative reinforcement and then randomly jumps to a corner of the world to help with training.
# Action directions of thrust
# 6 1
# \ /
# 5--0--2
# / \
# 4 3
validActions = np.array([[0,0],[0.25,np.sqrt(3)/4],[0.5,0],[0.25,-np.sqrt(3)/4],
[-0.25,-np.sqrt(3)/4],[-0.5,0],[-0.25,np.sqrt(3)/4]])
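# Sanity check: the six thrust vectors above have magnitude 0.5 and are spaced at 60-degree angles
angles = np.deg2rad(60 - 60*np.arange(6)) # 60, 0, -60, -120, -180, -240 degrees
assert np.allclose(np.vstack(([0,0], 0.5*np.column_stack((np.cos(angles), np.sin(angles))))), validActions)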
def initialState():
# State is: x&y position, x&y velocity, goal x&y position, goal x&y velocity
# World has planet at 0,0 and the grid is -10 to 10 in both X and Y
# The space station starts at X=0, Y=6 with velocity chosen to give a slightly elliptical orbit
return np.array([20*np.random.random_sample()-10, 20*np.random.random_sample()-10,
3*(0.5-np.random.random_sample()), 3*(0.5-np.random.random_sample()),0,6,1,0]);
def nextState(s,a):
s = copy.copy(s)
deltaT = 0.1 # Euler integration time step
s[0] += deltaT * s[2] # Update rocket X position
s[1] += deltaT * s[3] # Update rocket Y position
s[4] += deltaT * s[6] # Update space station X position
s[5] += deltaT * s[7] # Update space station Y position
ax,ay = validActions[a]
r2rocket = s[0]*s[0] + s[1]*s[1]
if r2rocket < 0.0001: # avoid divide by 0 error, crash detection done later
r2rocket = 0.00009
r2station = s[4]*s[4] + s[5]*s[5]
G = 1 # Gravitational strength
gxr = -G*s[0]/r2rocket # Simplified gravity: acceleration components -G*(x,y)/r^2, i.e. magnitude G/r toward the planet
gyr = -G*s[1]/r2rocket
gxs = -G*s[4]/r2station # Same simplified gravity applied to the space station
gys = -G*s[5]/r2station
s[2] += deltaT * (ax + gxr) # Update rocket X velocity
s[3] += deltaT * (ay + gyr) # Update rocket Y velocity
s[6] += deltaT * gxs # Update space station X velocity
s[7] += deltaT * gys # Update space station Y velocity
if s[0] < -10: # Bound next position. If at limits, set velocity to 0.
s = np.array([-10,s[1],0,s[3],s[4],s[5],s[6],s[7]])
elif s[0] > 10:
s = np.array([10,s[1],0,s[3],s[4],s[5],s[6],s[7]])
if s[1] < -10: # Bound next position. If at limits, set velocity to 0.
s = np.array([s[0],-10,s[2],0,s[4],s[5],s[6],s[7]])
elif s[1] > 10:
s = np.array([s[0],10,s[2],0,s[4],s[5],s[6],s[7]])
if r2rocket < 0.01: # crashed into planet
if r2rocket < 0.0002: # if deep in the crash zone, pop out to a random (+/-9, +/-9) corner to help learning
s = np.array([np.random.choice([-9,9]),np.random.choice([-9,9]),s[0],s[1],s[4],s[5],s[6],s[7]]) # new velocity is the tiny crash position, essentially zero
else: # otherwise, randomize position within crash zone to improve learning
s = np.array([np.random.random_sample()/10-0.05,np.random.random_sample()/10-0.05,s[0]/9,s[1]/9,s[4],s[5],s[6],s[7]])
return s # No need to bound space station position, it will stay in bounds
def reinforcement(s): # s is new state
distance = np.sqrt((s[0]-s[4])**2 + (s[1]-s[5])**2)
# Dot product of the clockwise tangent direction (y,-x) with the velocity (dx,dy);
# positive means the rocket is moving clockwise around the planet
clockwise = (s[1]*s[2]-s[0]*s[3]) > 0
# Bands: more than 5 units away: -0.1 clockwise / -0.2 counterclockwise; 3-5 units: 0; 2-3 units: 0.1; within 2 units: 0.2
return (-0.1 if clockwise else -0.2) if distance > 5 else 0 if distance > 3 else 0.1 if distance > 2 else 0.2
def policy(qnet, state, epsilon):
if np.random.rand(1) < epsilon:
actioni = np.random.randint(validActions.shape[0])
else:
# Each action is already a 2-vector of thrust components, so append both columns
inputs = np.hstack(( np.tile(state, (validActions.shape[0], 1)), validActions))
qs = qnet.use(inputs)
actioni = np.argmax(qs)
return actioni
def makeSamples(qnet, nStepsPerStart):
samples = []
state = initialState()
a = policy(qnet, state, epsilon)
for iStep in range(nStepsPerStart):
newState = nextState(state, a)
r = reinforcement(newState)
newAct = policy(qnet, newState, epsilon)
# SARSA
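# Each sample row has 21 columns: state (8), thrust vector (2), reinforcement (1),
# next state (8), next thrust vector (2) -- matching the ns=8, na=2 slicing in the training loop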
samples.append(state.tolist() + validActions[a].tolist() + [r] + newState.tolist() +
validActions[newAct].tolist())
state = newState
a = newAct
return np.array(samples)
def plotStatus(qnet, X, R, trial, epsilonTrace, rtrace):
plt.subplot(5,3,1)
plt.plot(epsilonTrace[:trial+1])
plt.ylabel("Random Action Probability ($\epsilon$)")
plt.ylim(0,1)
plt.subplot(5,3,2)
plt.plot(X[:,0])
plt.plot(X[:,4],'--',alpha=0.5,lw=5)
plt.ylabel("$x$")
plt.ylim(-11,11)
plt.subplot(5,3,6)
plt.plot(X[:,0],X[:,1])
plt.plot(X[0,0],X[0,1],'o')
plt.plot(X[:,4],X[:,5],'--',alpha=0.5,lw=5)
plt.plot(X[0,4],X[0,5],'o',alpha=0.5,lw=5)
plt.title("Rocket and Station path trial")
plt.xlim(-11,11)
plt.ylim(-11,11)
qs = qnet.use(np.array([[s*2-10,5,0,0,0,6,1,0] + validActions[a].tolist()
for a in range(validActions.shape[0]) for s in range(11)]))
plt.subplot(5,3,12)
acts = ["0","1","2","3","4","5","6"]
actsiByState = np.argmax(qs.reshape((validActions.shape[0],-1)),axis=0)
for i in range(11):
plt.text(i*2-10,0,acts[actsiByState[i]])
plt.xlim(-11,11)
plt.ylim(-1,1)
plt.text(-9,0.2,"Policy for Zero Velocity, y=5, goal at (0,6)")
plt.axis("off")
plt.subplot(5,3,4)
plt.plot(rtrace[:trial+1],alpha=0.5)
binSize = 20
if trial+1 > binSize:
# Calculate mean of every bin of binSize reinforcement values
smoothed = np.mean(rtrace[:int(trial/binSize)*binSize].reshape((int(trial/binSize),binSize)),axis=1)
plt.plot(np.arange(1,1+int(trial/binSize))*binSize,smoothed)
plt.ylabel("Mean reinforcement")
plt.subplot(5,3,5)
plt.plot(X[:,1])
plt.plot(X[:,5],'--',alpha=0.5,lw=5)
plt.ylabel("$y$")
plt.ylim(-11,11)
plt.subplot(5,3,3)
plt.plot(R[:,0]*10)
plt.plot(X[:,4],'--',alpha=0.5,lw=5)
plt.ylabel("reinforcement * 10")
plt.ylim(-11,11)
plt.subplot(5,3,9)
qnet.draw(["$x$","$\dot{x}$","$y$","$\dot{y}$","$gx$","$\dot{gx}$","$gy$","$\dot{gy}$","$a$"],["Q"])
plt.subplot(5,3,7)
n = 20
posx = np.linspace(-10,10,n)
posy = np.linspace(-10,10,n)
xs,ys = np.meshgrid(posx,posy)
xsflat = xs.flat
ysflat = ys.flat
qs = qnet.use(np.array([[xsflat[i],ysflat[i],0,0,0,6,1,0] + validActions[a].tolist()
for a in range(validActions.shape[0]) for i in range(len(xsflat))]))
qs = qs.reshape((validActions.shape[0],-1)).T
qsmax = np.max(qs,axis=1).reshape(xs.shape)
cs = plt.contourf(xs,ys,qsmax)
plt.colorbar(cs)
plt.xlabel("$x$")
plt.ylabel("$y$")
plt.title("Max Q: goal (0,6), velocity 0")
plt.subplot(5,3,8)
acts = np.argmax(qs,axis=1).reshape(xs.shape)
cs = plt.contourf(xs,ys,acts,[-0.5,0.5,1.5,2.5,3.5,4.5,5.5,6.5])
plt.colorbar(cs)
plt.xlabel("$x$")
plt.ylabel("$y$")
plt.title("Action: goal (0,6), velocity 0")
s = plt.subplot(5,3,10)
rect = s.get_position()
ax = Axes3D(plt.gcf(),rect=rect)
ax.plot_surface(xs,ys,qsmax,cstride=1,rstride=1,cmap=cm.viridis,linewidth=0)
ax.set_xlabel("$x$")
ax.set_ylabel("$y$")
plt.title("Max Q: goal (0,6), velocity 0")
s = plt.subplot(5,3,11)
rect = s.get_position()
ax = Axes3D(plt.gcf(),rect=rect)
ax.plot_surface(xs,ys,acts,cstride=1,rstride=1,cmap=cm.viridis,linewidth=0)
ax.set_xlabel("$x$")
ax.set_ylabel("$y$")
plt.title("Action: goal (0,6), velocity 0")
def testIt(qnet,nStepsPerTrial):
for i in range(3): # small, medium, large squares of start positions
plt.subplot(5,3,13+i)
for x in range(4):
sx = -i*3-3 if (x==0) or (x==1) else i*3+3
sy = -i*3-3 if (x==0) or (x==2) else i*3+3
s = [sx,sy,0,0,0,6,1,0] ## 0 velocity
xtrace = np.zeros((nStepsPerTrial,8))
for step in range(nStepsPerTrial):
a = policy(qnet, s, 0.0) # epsilon = 0
s = nextState(s,a)
xtrace[step,:] = s
plt.plot(xtrace[:,0],xtrace[:,1])
plt.plot(xtrace[0,0],xtrace[0,1],'o')
plt.xlim(-11,11)
plt.ylim(-11,11)
plt.plot(xtrace[:,4],xtrace[:,5],'--',alpha=0.5,lw=5)
plt.ylabel('$y$')
plt.xlabel('$x$')
plt.title('State Trajectories for corners at +/-' + str(i*3+3) + ', $\epsilon=0$')
# This cell runs the best network parameters I found for tracking the
# orbiting space station with a rocket.
gamma = 0.98
nTrials = 1000
nStepsPerTrial = 1100 # Almost 3 orbits of space station around planet
nSCGIterations = 20
finalEpsilon = 0.7
epsilonDecay = np.exp(np.log(finalEpsilon)/(nTrials)) # to produce this final value
nh = [8,2]
qnet = nn.NeuralNetwork([10] + nh + [1])
qnet.setInputRanges(( (-10, 10), (-10,10), (-3, 3), (-3, 3), (-10, 10), (-10, 10), (-3, 3), (-3, 3), (-0.5,0.5), (-0.5,0.5)))
fig = plt.figure(figsize=(15,25))
epsilon = 1
epsilonTrace = np.zeros(nTrials)
rtrace = np.zeros(nTrials)
for trial in range(nTrials):
# Collect nStepsPerTrial samples of X, R, Qn, and Q, and update epsilon
samples = makeSamples(qnet, nStepsPerTrial)
ns = 8 # X & Y position and velocity for space station and rocket
na = 2
X = samples[:, :ns+na]
R = samples[:, ns+na:ns+na+1]
nextX = samples[:, ns+na+1:]
nextQ = qnet.use(nextX)
qnet.train(X, R + gamma * nextQ, nIterations = nSCGIterations)
# Decay epsilon
epsilon *= epsilonDecay
# Rest is for plotting
epsilonTrace[trial] = epsilon
rtrace[trial] = np.mean(R)
if trial % (nTrials//5) == 0 or trial == nTrials-1:
plt.clf()
plotStatus(qnet, X, R, trial,epsilonTrace,rtrace)
testIt(qnet,nStepsPerTrial)
clear_output(wait=True)
display(fig);
plt.pause(0.01)
clear_output(wait=True)