Solving Optimal Control and Search Problems with Reinforcement Learning in MATLAB

Charles W. Anderson and
R. Matthew Kretchmar

Dept. of Computer Science, Colorado State University, Fort Collins, CO,
80523.

anderson@cs.colostate.edu, 970-491-7491, FAX: 970-491-2466

Application categories: Fuzzy Logic/Neural Networks, Control Systems Design

Problems whose solutions optimize an objective function defined over multiple steps generally require considerable a prior knowledge. Dynamic programming techniques are able to solve such multi-stage, sequential decision problems, but they require complete knowledge of the state space, including state transition probabilities. Bertsekas (1995) has recently published a two-volume text that covers current dynamic programming theory and practice. These techniques start with the value of an objective function for the final states. This value is backed up to all states from which each final state can be reached in one step. Using the known transition probabilities, these preceding states are assigned values. This process continues until all states are assigned values. Once this value function is learned, the optimal action can be selected for any state by choosing the action that maximizes the expected value of the next state. The learned value function is a prediction of the sum of future values. Values defined by the problem's objective function are often called reinforcements, because the learning algorithms were first developed as models of classical and instrumental conditioning in animals.

In most real problems, state transition probabilities are not known. Reinforcement learning methods (Bertsekas and Tsitsiklis, 1995) are a way to deal with this lack of knowledge by using each sequence of state, action, and resulting state and reinforcement as a sample of the unknown underlying probability distribution. This information is used to incrementally learn the correct value function. Typically this requires a large number of such samples. The need for long learning periods is offset by the ability to find solutions to problems that could not be solved using conventional dynamic programming methods.

Recent theoretical developments in the reinforcement learning field have made strong connections between dynamic programming and reinforcement learning. Some reinforcement learning algorithms have been proved to converge to the dynamic programming solution. However, these proofs rely on particular forms of the adaptive value function; these forms are such that there is little transfer of learning from one situation to others, so they do not scale well to large problems. To handle larger problems, continuous value functions are required that do transfer from one learning experience to another.

One structure commonly used to learn value functions are neural networks. This is the approach I have taken, starting in 1986 when I trained neural networks to estimate the value function for an inverted pendulum problem (Anderson 1986, 1989). I have also applied reinforcement learning to other problems, including real control problems such as control of heating in buildings (Anderson, et al., 1997) and difficult search problems such as the Towers of Hanoi puzzle (Anderson, 1987). A number of other control problems that are good candidates for reinforcement learning are defined in Anderson and Miller (1990).

The mountain car problem is another problem that has been used by several researchers to test new reinforcement learning algorithms. This involves a two-dimensional world consisting of a valley and a mass that must be pushed back and forth to gain enough momentum to escape the valley. The following section describes my implementation of this problem and a general MATLAB environment for simulating reinforcement learning control problems and solutions.

Combining all of the visualization methods with the ability to modify parameters and control the running of the simulation via a graphical user interface (GUI) results in the researcher being "closer" to the algorithm. A more intuitive grasp of the effects of various parameter values is gained by removing the programming effort it would take to change parameters without the GUI. The GUI editor guide has been very useful in quickly putting together a very functional user interface.

The figure below shows the GUI I have built for demonstrating
reinforcement learning algorithms. The code is publicly available in
the gzipped tar file mtncarMatlab.tar.gz. You are welcome to
download and use this code; please acknowledge this source if you
find it useful. It is not guaranteed to be free of bugs. The value function
is being learned by a
neural network consisting of radial basis functions (Kretchmar and Anderson,
1997). In the **upper left** is a graph of the two-dimensional
world in which the mountain car lives. The car is represented by a box whose
color indicates which direction, left or right, the reinforcement learning
agent is currently pushing. This display can be activated and deactivated by
clicking on the update button below the graph. Deactivating it allows the
simulation to run faster. The number of trials between graph updates can be
modified by changing the value in the text box next to the update button.
Similar update buttons and text boxes appear for every other graph. The
**upper right** graph shows the performance of the reinforcement
learning agent while it is learning. Performance is measured by the number of
steps from initial random positions and velocities of the car to the step at
which the car leaves the valley. Performance is plotted versus the number of
trials. In the **middle region** of the figure are current
parameter values in editable text fields. Changes made by the user to any of
these values are immediately effective. The **lower left** graph
shows the current estimate of the value function. The reinforcement learning
agent is learning a prediction of the number of steps required to leave the
valley from every state, where a state consists of a position and velocity of
the car. The user can change the view of this three-dimensional surface by
clicking and moving the mouse on this graph. The **lower right**
graph shows the actions the learning agent would take for each state of the
car using the current estimate of the value function. Pictured is a late
stage of learning after a good value function has been learned. The green area
covers states for which the car is pushed right, and in the red area the car
is pushed left. A plot of the trajectory of the car's state for the current
trial will be added to this axis.

In the menubar, one pull-down menu has been added, called
**Run**. When pulled down, the user sees the choices
**Start**, **Reset**, and **Pause**.
**Start** obviously will start the simulation. Once started, the
**Pause** menu item becomes enabled, allowing the user to pause
the simulation at any time. The **Reset** menu item will
re-initialize the reinforcement learning agent so it can again learn from
scratch.

The implementation is based on three main structures for the task, the learning agent, and the simulation. Each structure includes fields for parameters, such as the mass of the car in the task structure, and display update flags and tag names. This simulation environment and GUI are still under development. The implementation will be moved to the object-based representation to facilitate the construction of new learning agents and tasks. This will be tested by implementing a second control task and at least one search task, such as a simple game like Tic-Tac-Toe or a puzzle like the Towers of Hanoi.

The resulting environment for experimenting with reinforcement learning methods will be very helpful, both to students wanting to learn more about reinforcement learning and to researchers wanting to study novel extensions of current reinforcement learning algorithms or to apply reinforcement learning to new tasks. A primary goal in designing this environment is flexibility to try different tasks and different value function representations (Kretchmar and Anderson, 1997). The popularity of reinforcement learning is growing rapidly. Stories in the popular press are covering reinforcement learning accomplishments, such as the achievements of Tesauro's program that learns to play backgammon at a master's level, and textbooks are starting to appear. An easy-to-use environment for learning about and experimenting with reinforcement learning would be very popular and a good companion to new textbooks.

More information on this research project is available at http://www.cs.colostate.edu/~anderson. As mentioned above, the Matlab code for this demonstration is publicly upavailable in the gzipped tar file mtncarMatlab.tar.gz.

- Anderson (1986) Learning and Problem Solving with Multilayered Connectionist Systems, PhD thesis, University of Massachusetts, Amherst, MA.
- Anderson (1987) Strategy learning with multilayer connectionist representations. In Proceedings of the Fourth International Workshop on Machine Learning, pp. 103-114.
- Anderson (1989) Learning to control an inverted pendulum with neural networks. IEEE Control Systems Magazine, vol. 9, no. 3, pp. 31-37, April 1989.
- Anderson, Hittle, Katz, and Kretchmar (1997) Synthesis of reinforcement learning, neural networks, and pi control applied to a simulated heating coil. Artificial Intelligence in Engineering.
- Anderson and Miller (1990) A Set of Challenging Control Problems. In Neural Networks for Control, edited by Miller, Sutton, and Werbos, MIT Press, Cambridge, MA, pp. 475-510.
- Bertsekas (1995) Dynamic Programming and Optimal Control, Volumes I and II. Athena Scientific, Belmont, MA.
- Bertsekas and Tsitsiklis (1995) Neuro-Dynamic Programming. Athena Scientific, Belmont, MA.
- Kretchmar and Anderson (1997) Comparison of CMACs and Radial Basis Functions for Local Function Approximators in Reinforcement Learning. In Proceedings of the International Conference on Neural Networks, ICNN'97.

GUI for observing and manipulating the learning and performance of
reinforcement learning algorithms while solving the mountain car problem.