Assignment 05

In this assignment, you will be implementing two versions of Q-learning for a straightforward navigation task; one version will use the basic algorithm, whereas the other will adjust weights that can then be used to compute Q-values for all state-action pairs.

The download for this assignment contains a single text file, pipe_world.txt, which describes a simple environment in which:

S indicates the starting location for an agent.

G indicates a goal location.

_ indicates an empty location.

M indicates the location of an explosive mine.
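A minimal sketch of loading this map (Python; it assumes one row of characters per line with no separators, and the function and variable names here are illustrative, not part of the assignment):

    # Hypothetical loader for the grid described above.
    def load_world(path="pipe_world.txt"):
        with open(path) as f:
            grid = [list(line.strip()) for line in f if line.strip()]
        start = goal = None
        mines = set()
        for r, row in enumerate(grid):
            for c, ch in enumerate(row):
                if ch == "S":
                    start = (r, c)
                elif ch == "G":
                    goal = (r, c)
                elif ch == "M":
                    mines.add((r, c))
        return grid, start, goal, mines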

In this environment, the agent begins in the starting location, and can move in one of four directions (up, down, left, and right). Any move that attempts to leave the bounds of the map fails (and the agent stays where it is). For any other move, the move succeeds with probability 0.8, and the agent enters the desired square; with probability 0.2, the agent slips to one side or the other as it moves in the expected direction, with each direction of slippage being equally probable. Thus, for instance, if the agent starts in the state marked S in the figure below and attempts to go to the right, it will end up in one of 3 possible locations, each with the probability given.

[Figure: attempting to move right from S, the agent reaches the intended square with probability 0.8, or slips into one of the two adjacent squares with probability 0.1 each.]

As the agent moves, it receives rewards as follows:

If it enters a location containing a mine, it receives reward rM = -100.

If it enters the goal location, it receives reward rG = 0.

Any other result leads to reward rO = -1.
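As a concrete reading of the dynamics and rewards above, a single environment step might be sketched as follows (Python; the reward constants use the signs given above, the slip is interpreted as a one-square move perpendicular to the intended direction, and all names are illustrative):

    import random

    R_MINE, R_GOAL, R_OTHER = -100, 0, -1            # rewards as described above
    ACTIONS = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}

    def step(grid, state, action):
        """One stochastic move: 0.8 intended direction, 0.1 for each sideways slip."""
        dr, dc = ACTIONS[action]
        side_a, side_b = (dc, dr), (-dc, -dr)        # the two perpendicular slip offsets
        roll = random.random()
        if roll < 0.8:
            move = (dr, dc)
        elif roll < 0.9:
            move = side_a                            # adjust here if you read the slip as diagonal
        else:
            move = side_b
        r, c = state[0] + move[0], state[1] + move[1]
        if not (0 <= r < len(grid) and 0 <= c < len(grid[0])):
            r, c = state                             # moves that leave the map fail
        cell = grid[r][c]
        if cell == "M":
            return (r, c), R_MINE, True              # mine: episode ends
        if cell == "G":
            return (r, c), R_GOAL, True              # goal: episode ends
        return (r, c), R_OTHER, False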


For full points, you will write code that does the following:

1. (20 pts.) Your code will first perform Q-learning on the problem. Your code should:

   1. Use a data structure to store explicit values Q(s, a) for each state (location in the grid-world) and action encountered during learning.

   2. Perform 10,000 learning episodes; each episode will continue until the agent:

      a. Reaches the goal location.

      b. Hits a mine.

      c. Takes a number of steps equal to the entire size (width × height) of the environment.

   At the beginning of each episode, the agent will begin from the starting location.

   3. You will use control parameters as follows:

      a. Future discount: γ = 0.9.

      b. Policy randomness: you will start with ε = 0.9; at the end of every 200 learning episodes, you will reduce the rate, setting it to ε = 0.9 / (⌊E/200⌋ + 1), where E is the number of the episode just completed. This means that after E = 200, 400, 600, ..., we will have ε = 0.45, 0.3, 0.225, .... Note: on the final run of learning, the agent should set ε = 0, so that it acts in a purely greedy fashion.

      c. Learning rate: you will start with α = 0.9; at the end of every 1,000 learning episodes, you will reduce the rate, setting it to α = 0.9 / (⌊E/1000⌋ + 1), where E is the number of the episode just completed. This means that after E = 1000, 2000, 3000, ..., we will have α = 0.45, 0.3, 0.225, .... (A code sketch of these decay schedules and the tabular update appears after this list.)

   4. After every 100 learning episodes, your code will pause and do 50 test-runs of its current greedy policy. That is, it will do 50 runs from the starting location, choosing the action a that maximizes Q(s, a) for each state s (location) it encounters; each run will continue until one of the three stopping conditions indicated above is met. During the runs, the program will keep track of the total reward received, and at the end it will write the average reward to a file. (I recommend writing to a CSV file, as this will make graphing your results later easier.) Note: on the last run of learning, this will be a test of the average performance of the final learned policy.
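One possible shape for the schedules and the tabular update in item 1 above (Python sketch; the Q-table layout and helper names are assumptions rather than a required design, and the caller is responsible for forcing ε = 0 on the final run):

    import random
    from collections import defaultdict

    GAMMA = 0.9                       # future discount

    def epsilon_for(episodes_done):
        """Exploration rate: 0.9 at first, then 0.45, 0.3, 0.225, ... every 200 episodes."""
        return 0.9 / (episodes_done // 200 + 1)

    def alpha_for(episodes_done):
        """Learning rate: 0.9 at first, then 0.45, 0.3, 0.225, ... every 1,000 episodes."""
        return 0.9 / (episodes_done // 1000 + 1)

    Q = defaultdict(float)            # tabular Q-values, keyed by (state, action)

    def choose_action(state, eps):
        """Epsilon-greedy over the four moves; eps = 0 gives the greedy policy used in test-runs."""
        if random.random() < eps:
            return random.choice("UDLR")
        return max("UDLR", key=lambda a: Q[(state, a)])

    def q_update(s, a, reward, s_next, done, alpha):
        """One-step Q-learning backup on the table."""
        best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in "UDLR")
        Q[(s, a)] += alpha * (reward + GAMMA * best_next - Q[(s, a)])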

2. (20 pts.) After the above is complete, your code will do a feature-based version of Q-learning. In this version, the number of learning episodes and all control parameters will be exactly as before, and again the code will print out the average reward gained over 50 test-runs of its current greedy policy, every 100 learning episodes. However, in this version, there will be no explicit data structure to store Q-values; instead, the code will do the following:

   1. It will store a weight vector w = (w1, w2); initially, w = (0, 0).

   2. For any state-action pair (s, a), there will be a feature vector f(s, a) = (f1, f2), as follows:

      a. f1 is a normalized Manhattan distance metric, calculated by assuming that the action a occurs with no slippage, and then returning f1 = MD(s')/MD+, where s' is the location that would result from that action, MD(s') is the Manhattan distance from s' to the goal location, and MD+ is the maximum possible such distance.

See: https://en.wikipedia.org/wiki/Taxicab_geometry


As an example, consider the image below. If we suppose the agent is in location s marked with an A, then for each of the four actions s' would be the state indicated by the corresponding arrow. In this smaller version of the grid, MD+ = 4 (from either of the top corners to the goal marked G). Thus, each of the possible directions gives us the f1 value indicated; for instance, for the down direction, we have f1 = 1/4 = 0.25. (A code sketch of this feature, along with f2, appears after the f2 example below.)

[Figure: from the location marked A, the slip-free successor states for up, left, and right each give f1 = 0.75, while the down move, toward the goal G, gives f1 = 0.25.]

      b. f2 is based upon the action a and two inverse-distance metrics, calculated from the distance to the mines to the left and right of the agent. As shown in the image below, each inverse distance is 1/d, where d is the number of moves needed to reach the mine in that direction. For an agent starting in the square marked A, for instance, the nearest mine to the left is 2 squares away, for an inverse distance of 1/2 = 0.5, and to the right the inverse distance is 1/4 = 0.25.

[Figure: the agent A with a mine M to its left at inverse distance 0.5 (2 squares away) and a mine M to its right at inverse distance 0.25 (4 squares away).]

We set f2 dependent upon action a as follows:

If a is a move left or right, then f2 is set to the inverse distance in that direction. If a is a move up or down, then f2 is set to the minimum of the two distances. (One possible implementation of both features is sketched after the example below.)

Thus, for the example just given, we have:

    f2 = 0.5   if a = left
    f2 = 0.25  if a = up
    f2 = 0.25  if a = right
    f2 = 0.5   if a = down
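A sketch of one way to compute the two features (Python; it reuses ACTIONS from the earlier sketch, treats a side of the row with no mine as contributing 0, and reads "minimum of the two distances" as the minimum of the two inverse distances — note that the worked example above assigns different values to up and down, so adjust this to whichever reading your course intends):

    def manhattan(a, b):
        return abs(a[0] - b[0]) + abs(a[1] - b[1])

    def f1(state, action, goal, n_rows, n_cols):
        """Normalized Manhattan distance from the slip-free successor s' to the goal."""
        dr, dc = ACTIONS[action]
        r, c = state[0] + dr, state[1] + dc
        if not (0 <= r < n_rows and 0 <= c < n_cols):
            r, c = state                              # an off-map move leaves the agent in place
        # MD+ is attained at one of the grid corners.
        md_plus = max(manhattan((rr, cc), goal)
                      for rr in (0, n_rows - 1) for cc in (0, n_cols - 1))
        return manhattan((r, c), goal) / md_plus

    def inverse_mine_distance(state, step_dir, mines, n_cols):
        """1/d to the nearest mine in the agent's row on one side; 0.0 if there is none."""
        r, c = state
        d, cc = 0, c + step_dir
        while 0 <= cc < n_cols:
            d += 1
            if (r, cc) in mines:
                return 1.0 / d
            cc += step_dir
        return 0.0                                    # assumption: no mine on that side

    def f2(state, action, mines, n_cols):
        left = inverse_mine_distance(state, -1, mines, n_cols)
        right = inverse_mine_distance(state, +1, mines, n_cols)
        if action == "L":
            return left
        if action == "R":
            return right
        return min(left, right)                       # up/down: one reading of the rule above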

   3. For any state-action pair, the Q-value is calculated as follows: Q(s, a) = w · f(s, a) = w1·f1 + w2·f2.

   4. After each Q-learning iteration, we update the weights as explained in lecture 24 (a code sketch follows the equations):

    δ = r + γ · max_a' Q(s', a') − Q(s, a)

    ∀i: wi ← wi + α · δ · fi(s, a)
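The Q-value and weight update above might be implemented along these lines (Python sketch; features(s, a) stands in for the f1/f2 computation sketched earlier, GAMMA is the discount from the earlier sketch, and zeroing the bootstrap term at terminal states is an assumption):

    w = [0.0, 0.0]                                    # weight vector, initially (0, 0)

    def q_value(f):
        """Q(s, a) = w · f(s, a) = w1*f1 + w2*f2."""
        return w[0] * f[0] + w[1] * f[1]

    def weight_update(f_sa, reward, s_next, done, alpha):
        """delta = r + gamma * max_a' Q(s', a') - Q(s, a); then w_i += alpha * delta * f_i."""
        if done:
            best_next = 0.0                           # assumption: no bootstrapping past terminal states
        else:
            best_next = max(q_value(features(s_next, a2)) for a2 in "UDLR")
        delta = reward + GAMMA * best_next - q_value(f_sa)
        for i in range(len(w)):
            w[i] += alpha * delta * f_sa[i]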


3. (10 pts.) Along with your code (written in a clear, well-organized, and well-documented fashion), you will hand in the following:

A text file containing the final policy learned by each of the two algorithms. The policies should each be labeled to indicate which is which. For each square of the grid, print a single character (row by row, column by column):

- For any square containing a mine or the goal, simply print M or G, respectively, as in the original input.

- For any other square, print the action chosen, using U, D, L, and R for up, down, left, and right, respectively. (A code sketch of this printing step appears after this list.)

A graph containing the average-reward data generated by the two algorithms after every 100 episodes. The graph axes should be clearly labeled, and it should be clear which data series is which.

A text file containing some analysis of the results shown in the graph and the final policies. You should write a couple of paragraphs explaining what you see in the data and why (you believe) you get the results that you do.
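For the policy text file, one possible shape for the printing step mentioned above (Python sketch; greedy_action is a stand-in for whichever learned policy, tabular or feature-based, is being written, and the label argument is just a way to mark which is which):

    def write_policy(grid, greedy_action, out_file, label):
        """Write a label line, then one character per square of the grid."""
        with open(out_file, "a") as out:
            out.write(label + "\n")
            for r, row in enumerate(grid):
                chars = []
                for c, cell in enumerate(row):
                    if cell in ("M", "G"):
                        chars.append(cell)            # mines and the goal are printed as-is
                    else:
                        chars.append(greedy_action((r, c)))   # 'U', 'D', 'L', or 'R'
                out.write("".join(chars) + "\n")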

You will hand in a single compressed archive containing:

Complete source code.

Any instructions necessary to compile and/or execute the code.

The text files and graph described above.
