This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.
Abstract: Deep learning algorithms have recently appeared that pretrain the hidden layers of neural networks in unsupervised ways, leading to state-of-the-art performance on large classification problems. These methods can also pretrain networks used for reinforcement learning. However, such pretraining ignores the additional information available in a reinforcement learning paradigm: the ongoing sequence of (state, action, next state) tuples. This paper demonstrates that learning a predictive model of the state dynamics can produce a pretrained hidden layer structure that reduces the time needed to solve reinforcement learning problems.
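The pretraining idea in the abstract can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: the one-dimensional point-mass task, the network sizes, and the learning rates are all hypothetical. A single tanh hidden layer is trained by plain gradient descent to predict the next state from the current state and action; its weights would then initialize the hidden layer of the value network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dynamics: a 1-D point mass whose velocity is pushed
# left or right by the action.
def step(state, action):
    pos, vel = state
    vel = 0.9 * vel + 0.1 * action
    return np.array([pos + vel, vel])

# Collect (state, action, next state) samples by random exploration.
X, Y = [], []
state = np.zeros(2)
for _ in range(500):
    action = rng.choice([-1.0, 1.0])
    nxt = step(state, action)
    X.append(np.concatenate([state, [action]]))
    Y.append(nxt)
    state = nxt if abs(nxt[0]) < 5 else np.zeros(2)   # reset when far away
X, Y = np.array(X), np.array(Y)

add_bias = lambda A: np.hstack([np.ones((A.shape[0], 1)), A])

# One tanh hidden layer trained to predict the next state: the predictive model.
n_hidden = 20
W1 = rng.normal(scale=0.1, size=(X.shape[1] + 1, n_hidden))
W2 = rng.normal(scale=0.1, size=(n_hidden + 1, Y.shape[1]))

mse_before = np.mean((add_bias(np.tanh(add_bias(X) @ W1)) @ W2 - Y) ** 2)
for _ in range(2000):
    H = np.tanh(add_bias(X) @ W1)
    err = add_bias(H) @ W2 - Y
    dH = (err @ W2[1:].T) * (1 - H ** 2)        # backprop through tanh
    W2 -= 0.01 * add_bias(H).T @ err / len(X)
    W1 -= 0.01 * add_bias(X).T @ dH / len(X)
mse_after = np.mean((add_bias(np.tanh(add_bias(X) @ W1)) @ W2 - Y) ** 2)

# W1 would now initialize the hidden layer of the network used for
# reinforcement learning; only its output layer starts from scratch.
```

The point of the sketch is only the last comment: the hidden-layer weights learned from state-transition data are reused, so the reinforcement learner starts with features already shaped by the task's dynamics.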
Videos of the learning agent: after training for 0, 10, 50, 100, and 200 minutes; testing with no exploration (also shown in slow motion); and another test sequence with no exploration, in slow motion.
In 2010, we received a grant from the Colorado State University Clean Energy Supercluster titled "Predictive Modeling of Wind Farm Power and On-Line Optimization of Wind Turbine Control". This grant is described in the CES Supercluster 2009-2010 Annual Report.
Resources we have found useful:
During an extended visit to Colorado State University, Andre Barreto developed a modified gradient-descent algorithm for training networks of radial basis functions. His modification is a more robust approach for learning value functions for reinforcement learning problems. The following publication describes this work.
Jilin Tu completed his MS thesis in 2001. The following is an excerpt from his abstract.
This thesis studies how to integrate state-space models of control systems with reinforcement learning and analyzes why one common reinforcement learning architecture does not work for control systems with Proportional-Integral (PI) controllers. As many control problems are best solved with continuous state and control signals, a continuous reinforcement learning algorithm is then developed and applied to a simulated control problem involving the refinement of a PI controller for the control of a simple plant. The results show that a learning architecture based on a state-space model of the control system outperforms the previous reinforcement learning architecture, and that the continuous reinforcement learning algorithm outperforms discrete reinforcement learning algorithms.
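As a concrete illustration of the setting the thesis studies, here is a minimal sketch of a PI controller driving a simple first-order plant. Everything here is hypothetical, not taken from the thesis: the plant model, the gains, and the cost. The accumulated tracking error is the kind of signal a reinforcement learner could use to refine the gains kp and ki.

```python
def simulate(kp, ki, setpoint=1.0, steps=100, dt=0.1):
    """Track a setpoint with a PI controller on the plant dy/dt = -y + u."""
    y, integ = 0.0, 0.0
    errs = []
    for _ in range(steps):
        e = setpoint - y
        integ += e * dt
        u = kp * e + ki * integ        # PI control law
        y += dt * (-y + u)             # Euler step of the plant
        errs.append(e)
    return errs

# Squared tracking error is a natural (negated) reward for the learner:
# better gains give lower cost.
cost_tuned = sum(e * e for e in simulate(kp=2.0, ki=0.5))
cost_off = sum(e * e for e in simulate(kp=0.0, ki=0.0))
```

With the gains turned off, the plant never moves and the error stays at 1.0 every step; any reasonable gains drive the cost well below that, which is exactly the gradient a learner can exploit.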
In 1999, Baxter and Bartlett developed their direct-gradient class of algorithms for learning policies directly, without also learning value functions. This intrigues me from the viewpoint of function approximation: there may be many problems for which the policy is easier to represent than the value function. It is well known that a value function need not exactly reflect the true value of state-action pairs; it must only rank the optimal action in each state higher than the rest. A function approximator that strives for minimum error may therefore waste valuable approximation resources. We devised a simple Markov chain task and a very limited neural network that demonstrate this. When applied to this task, Q-learning tends to oscillate between optimal and suboptimal solutions. However, using the same restricted neural network, Baxter and Bartlett's direct-gradient algorithm converges to the optimal policy. This work is described in:
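The direct-gradient half of this contrast can be sketched on its own. Below is a minimal REINFORCE-style update on a hypothetical five-state chain (not our published task): one logistic policy parameter per state is adjusted using only sampled returns, with no value function anywhere.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical chain: action 1 moves right toward the rewarding end state,
# action 0 moves left. Reward accrues each step spent at the right end.
n_states, n_steps = 5, 20

def episode(theta):
    """Run one episode; return visited states, actions, and total reward."""
    s, states, actions, R = 0, [], [], 0.0
    for _ in range(n_steps):
        p_right = 1.0 / (1.0 + np.exp(-theta[s]))   # logistic policy
        a = int(rng.random() < p_right)
        states.append(s)
        actions.append(a)
        s = min(s + 1, n_states - 1) if a else max(s - 1, 0)
        if s == n_states - 1:
            R += 1.0
    return states, actions, R

# Direct gradient ascent on expected return (REINFORCE with a running
# baseline): no value function is ever represented.
theta = np.zeros(n_states)
baseline = 0.0
for _ in range(1000):
    states, actions, R = episode(theta)
    for s, a in zip(states, actions):
        p = 1.0 / (1.0 + np.exp(-theta[s]))
        theta[s] += 0.1 * (R - baseline) * (a - p)   # d log pi / d theta
    baseline += 0.05 * (R - baseline)
```

The representation is deliberately tiny, one parameter per state, yet it is enough to express the optimal "always move right" policy, which is the sense in which a policy can be easier to represent than its value function.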
We have experimented with ways of approximating the value and policy functions in reinforcement learning using radial basis functions. Gradient descent does not work well for adjusting the basis functions unless they are close to the correct positions and widths a priori. One way of dealing with this is to "restart" the training of a basis function that has become useless: its center and width are reset to values for which the basis function enables the network as a whole to better fit the target function. This is described in:
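A minimal sketch of the restart heuristic, with hypothetical details throughout: the usefulness measure (mean activation times output-weight magnitude), the restart schedule, and the 1-D regression target are illustrative stand-ins, not the published algorithm.

```python
import numpy as np

rng = np.random.default_rng(2)

def rbf(X, centers, widths):
    """Gaussian radial basis activations, one column per basis function."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * widths ** 2))

# Toy 1-D regression target, a stand-in for a value function.
X = np.linspace(-1, 1, 200)[:, None]
y = np.sin(3 * X[:, 0])

n_basis = 8
centers = rng.uniform(-1, 1, size=(n_basis, 1))   # random, possibly poor, placement
widths = np.full(n_basis, 0.3)
w = np.zeros(n_basis)

for it in range(200):
    Phi = rbf(X, centers, widths)
    err = Phi @ w - y
    w -= 0.5 * Phi.T @ err / len(X)               # gradient step on output weights

    # Restart: periodically find the least useful basis function and
    # re-center it at the input with the largest residual error.
    if it % 50 == 49:
        usefulness = Phi.mean(axis=0) * np.abs(w)
        k = np.argmin(usefulness)
        centers[k] = X[np.argmax(np.abs(err))]
        widths[k] = 0.3
        w[k] = 0.0

final_mse = np.mean((rbf(X, centers, widths) @ w - y) ** 2)
```

Restarting only the least useful basis function means little of the current fit is sacrificed, while the moved unit lands where the residual error is largest and can immediately contribute.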
Here is a link to a web site for our NSF-funded project on Robust Reinforcement Learning for HVAC Control.