In this assignment, we make the simulation much more realistic and more general. It is more realistic because the robot only moves to neighboring states. It is more general because a model of the maze is not required. Of course, we are doing everything in simulation, so we must have the maze model, but the robot's decisions about which state to go to next do not use the model.
You will be implementing a form of reinforcement learning. This is a class of machine learning algorithms that work without knowledge of a model. They learn much the way animals do, through rewards and punishments. For this assignment, you will be punishing your robot for every step it takes that does not result in the robot being at the goal position. Technically, the algorithm finds the sequence of actions that minimizes the sum of these punishments. Each punishment has a value of 1, so minimizing the sum of these punishments means finding the shortest path. To learn more about reinforcement learning, check out this site, where you can find some online tutorials.
Reinforcement learning works by maintaining a memory in the robot that predicts how many steps it will take to get to the goal from each state. The predictions are used to decide which action to take from any state. This is done by trying each of the four actions and checking the memory to see what the prediction is for each resulting state. The action that takes the robot to the state with the smallest prediction of remaining steps is the action that is taken.
These predictions are learned by experimentation: taking various actions from a state and updating the prediction for that state by considering the prediction currently stored for the state you arrive in. Let Pred(s) be the prediction of remaining steps to get to the goal from state s. Say the robot is in state s, takes the "up" action, and arrives in state t. The prediction for state s is changed to
Pred(s) = 0.1 * (1 + Pred(t)) + 0.9 * Pred(s)
So, the new prediction for state s is 0.9 of its old value plus 0.1 of our current estimate of what it should be, namely the prediction of remaining steps from state t plus 1, because it takes 1 step to get from s to t. It is somewhat counter-intuitive that this works, because you are updating Pred(s) based only on a current guess at Pred(t), and all predictions are being updated as the robot moves around the maze.
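In code, the update is a single assignment. Here is a minimal sketch (the function and parameter names are mine, chosen for illustration, not required by the assignment):

    // New prediction for state s after one step to state t:
    // move 10% of the way toward the one-step estimate 1 + Pred(t).
    float updatedPrediction(float predS, float predT)
    {
        return 0.1 * (1.0 + predT) + 0.9 * predS;
    }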
We could use an array for the memory that stores the predictions, but to allocate such an array the robot must know how many rows and columns will be needed. If the robot doesn't know this beforehand, an array can't be used. Instead, you will use a hash table to implement the memory. Use Table.h to implement the hash table; it is a modified version of the hash table code described in the book, downloaded from the author's web site. You will have to read the comments and code in Table.h carefully to learn how to use it. Each item to be stored is simply the state (row and column) and the floating-point prediction. Form the key for the hash table by combining the row and column in this fashion:
f(row, column) = k * row + column + 1
where k is some integer. For this assignment, you will evaluate how well the robot learns for different values of k.
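For example, the stored item and the key computation might look like this (a sketch; the struct and function names are my own, not names defined in Table.h):

    // One item in the robot's memory: a state and its prediction.
    struct Item
    {
        int row, col;       // the state in the maze
        float prediction;   // predicted steps remaining to reach the goal
    };

    // Combine row and column into a single integer key, for a given k.
    int key(int row, int col, int k)
    {
        return k * row + col + 1;
    }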
The experiment, in outline:

For k = 5, 6, ..., 20
    Allocate a new hash table of capacity 300.
    Repeat 1000 times (call these 1000 trials):
        Set the current state to be the maze's start state.
        While the robot is not at the goal state AND the number of steps
        taken this trial is less than 10,000, do:
            For each possible action (that does not run into a wall):
                Try the action from the current state.
                Get the next state.
                Retrieve any memory of the next state's prediction from
                the hash table, using the hash table's find method.
                If no memory is found, say the prediction for the next
                state is 0.
                If this prediction is lower than the current lowest
                prediction, remember it.
            The action taken is the one whose next state has the lowest
            prediction.
            Update the prediction for the current state using the
            equation above.
            Move to the chosen next state.
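In C++, the body of the while loop might look roughly like this. This is only a sketch: nextState, retrievePrediction, and storePrediction are hypothetical helpers standing in for your maze simulation and the Table.h find and insert operations.

    // One step of the robot: try all four actions, pick the one whose
    // resulting state has the lowest stored prediction, then update the
    // prediction for the current state and take the step.
    void takeOneStep(int &row, int &col, int &steps)
    {
        float lowest = 1.0e30;          // smallest prediction seen so far
        int bestRow = row, bestCol = col;
        for (int a = 0; a < 4; a++)     // up, down, left, right
        {
            int r, c;
            if (nextState(row, col, a, r, c))       // false if a wall blocks
            {
                float p = retrievePrediction(r, c); // 0 if no memory of (r,c)
                if (p < lowest)
                {
                    lowest = p;
                    bestRow = r;
                    bestCol = c;
                }
            }
        }
        // Update Pred(current state) toward 1 + Pred(best next state).
        float pred = retrievePrediction(row, col);
        storePrediction(row, col, 0.1 * (1.0 + lowest) + 0.9 * pred);
        row = bestRow;                  // move to the chosen next state
        col = bestCol;
        steps++;
    }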
You are not using stacks and queues in this assignment. Remember, the robot is simply taking steps, one at a time, from where it currently is.
Your main function should be declared as

int main(int argc, char *argv[])

You can convert the first command line argument to an integer by doing

int option;
if (argc > 1)
    option = atoi(argv[1]);

Printing the maze with an * at the current state is much more fun to watch if you clear the screen before you draw it each time. You can do this by calling the function

system("clear");

The function system allows you to run any unix command, like clear. To use system, you must

#include <stdlib.h>

To graph the steps on each trial, use C++ iomanipulators to space over a number of spaces that is proportional to the number of steps, and print an *. To calculate this, you should know the maximum number of steps that will result and the width of your screen, so you can place the farthest-right * within the width of the screen. You probably will not know the maximum number of steps, so just guess and adjust it after you have seen some of the output. If your screen is 80 characters wide, and you suspect the maximum number of steps will be 100, then you should scale the number of steps with the expression steps * 80 / 100. Use the iomanipulator setw to do this. Once the goal is reached, print a line using code like
cout << setw(steps * 80 / 20) << "*" << endl;

To use iomanipulators, remember to

#include <iomanip.h>

To print the prediction values with one digit after the decimal point, call

cout.setf(ios::fixed, ios::floatfield);
cout.precision(1);

before printing, and then print each value using

cout << setw(7) << val;
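Putting these pieces together, the whole table of predictions can be printed with two nested loops. A sketch, assuming a maze with 5 rows and 5 columns as in the sample output below, and the same hypothetical retrievePrediction lookup as above:

    cout.setf(ios::fixed, ios::floatfield);
    cout.precision(1);
    for (int r = 0; r < 5; r++)         // one line of the table per maze row
    {
        for (int c = 0; c < 5; c++)
            cout << setw(7) << retrievePrediction(r, c);
        cout << endl;
    }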
Here is some sample output. Run with no command line option, the program prints, for each value of k, a summary line followed by the table of predictions:

a.out < maze.data

k = 5   steps last trial = 8   Table spaces used = 24

    2.0    2.3    2.1    1.0    0.0
    5.0    4.0    3.0    2.0    1.0
    6.0    3.1    2.8    2.4    1.8
    7.0    2.8    2.8    2.8    2.7
    8.0    2.7    2.7    2.7    2.7
... then again for k = 6, k = 7, etc...
a.out 1 < maze.data

[With option 1, the program animates the robot's progress: after each step the screen is cleared and the maze is redrawn, with walls drawn using +, -, and | characters, the goal marked G in the upper-right corner, and the robot's current position marked *. Five successive snapshots appeared here, showing the * moving up from the bottom of the maze toward the goal.]
... and on and on until G is reached ...
a.out 2 < maze.data

[With option 2, the program prints one line per trial: the trial number (0, 1, 2, ..., 150, ...) followed by an * indented in proportion to the number of steps taken on that trial.]
and on and on until 1,000 trials are reached. The position of the asterisk in each line above corresponds to 8 steps.
Describe what you see in the printed tables of prediction values. Discuss how and why the number of places used in the hash table varies with k.
Explain how the algorithm for choosing next actions results in picking short paths, given the final learned predictions.
Explain how Table.h implements a hash table. Does it statically or dynamically allocate the table? How does it handle collisions?
Copyright © 1998: Colorado State University for CS200. All rights reserved.