EE5904 Neural Networks Project 2: Q-Learning for World Grid Navigation

I. Reinforcement Learning
In reinforcement learning (RL), an intelligent agent learns to complete a task by maximizing reward through trials, without knowledge of the desired outputs. In this report, the actions of a robot are simulated; the robot aims to move from the top-left corner of a 10-by-10 grid to the bottom-right corner. The robot adopts Q-learning with the ε-greedy exploration method.
Q-learning: Q-learning is a model-free dynamic programming method in RL, which means it does not require knowledge of the state-transition model. It only uses the rewards received from observed state transitions.
The key update rule in Q-learning is the Q-function, which measures the "worth" of state-action pairs:

Q_{k+1}(s_k, a_k) = Q_k(s_k, a_k) + α_k ( r_{k+1} + γ max_{a'} Q_k(s_{k+1}, a') − Q_k(s_k, a_k) ),

where (s_k, a_k) is the state-action pair, α_k is the learning rate, γ is the discount rate, and r_{k+1} is the reward received after taking action a_k in state s_k. Two strategies for choosing an action at a given state are emphasized, called Exploitation and Exploration. Exploitation selects the currently known best action, a_{k+1} = argmax_{a'} Q_k(s_{k+1}, a'), while Exploration selects one of the other actions. Exploitation clearly benefits the learning process, but Exploration is also essential for finding the optimal policy over the whole state space. To balance this trade-off, ε-greedy exploration is adopted: Exploration is chosen with probability ε_k and Exploitation with probability 1 − ε_k, where ε_k decreases as learning continues.
During the learning process, a series of trials is conducted, each updating the Q-function, until the goal is reached. With the final Q-function, we can determine the optimal action at each state, which is also known as the optimal policy. Using the optimal policy, the agent is able to complete the given task.
In summary, by controlling the parameters α, γ and ε, different Q-functions can be built through trials. From these goal-reaching Q-functions, an optimal policy can be obtained to complete the given task.
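As a minimal sketch of the two building blocks above, written in Python with a tabular Q of shape 100-by-4 (the function names and random-number handling are illustrative, not the original implementation; here Exploration is approximated by a uniformly random action):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy_action(Q, state, epsilon_k):
    """Explore with probability epsilon_k, otherwise exploit the current best action."""
    if rng.random() < epsilon_k:
        return rng.integers(Q.shape[1])      # Exploration: random action (may coincide with the greedy one)
    return int(np.argmax(Q[state]))          # Exploitation: argmax_a Q(state, a)

def q_update(Q, s, a, r, s_next, alpha_k, gamma):
    """One Q-learning update for the observed transition (s, a, r, s_next)."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha_k * (td_target - Q[s, a])
    return Q
```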
II. Task 1: Q-learning with given reward matrix
2.1 Experiment Implementation
In this experiment, the agent is a robot whose actions are to move up, right, down and left, encoded as a = {1, 2, 3, 4} respectively. The initial state of the robot is the top-left corner of a 10-by-10 grid (state 1), and the destination state is the bottom-right corner of the grid (state 100).
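One possible encoding of this grid world is sketched below; the column-wise numbering of the states (1 to 100, top-left to bottom-right) and the choice to keep the robot in place when an action would leave the grid are assumptions for illustration, not necessarily the original implementation:

```python
N = 10  # side length of the grid

def next_state(s, a):
    """Apply action a (1 = up, 2 = right, 3 = down, 4 = left) to state s.

    States are assumed to be numbered 1..100 down each column, so state 1 is
    the top-left corner and state 100 the bottom-right corner.
    """
    row, col = (s - 1) % N, (s - 1) // N          # 0-based row and column
    if a == 1 and row > 0:
        row -= 1                                  # up
    elif a == 2 and col < N - 1:
        col += 1                                  # right
    elif a == 3 and row < N - 1:
        row += 1                                  # down
    elif a == 4 and col > 0:
        col -= 1                                  # left
    return col * N + row + 1                      # off-grid moves leave s unchanged
```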
To explore the influence of different α, γ and ε, 8 sets of experiments are conducted, each containing 10 runs. Within a run, there are 3000 trials to update the Q-function. Each trial consists of a number of time steps k and ends when the robot reaches state 100 or when the learning rate at step k, α_k, falls below a threshold (0.005). The Q-functions of the goal-reached runs are recorded, along with the execution time.
When the execution of a set is finished, the number of goal-reached runs is recorded and the average execution time is computed. After all 8 sets have finished executing, the success counts and the average execution times are compared. A test is also conducted on all Q-functions recorded from the goal-reached runs, to examine whether they are functional and how they perform (i.e., how many steps they need to finish the task).
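The structure of a run described above might look like the following sketch, reusing the helpers from the previous code blocks; the 100-by-4 reward matrix indexed by (state, action), the shared schedule for ε_k and α_k, and the function names are assumptions for illustration only:

```python
def run_trial(Q, reward, gamma, schedule, alpha_min=0.005, start=1, goal=100):
    """One trial: start at state 1, stop at state 100 or once alpha_k < 0.005."""
    s, k = start, 1
    while s != goal:
        alpha_k = epsilon_k = schedule(k)        # e.g. lambda k: 100 / (100 + k)
        if alpha_k < alpha_min:                  # learning rate below threshold: end the trial
            return Q, False
        a = epsilon_greedy_action(Q, s - 1, epsilon_k) + 1
        s_next = next_state(s, a)
        r = reward[s - 1, a - 1]                 # reward for taking action a in state s
        Q = q_update(Q, s - 1, a - 1, r, s_next - 1, alpha_k, gamma)
        s, k = s_next, k + 1
    return Q, True

def run(reward, gamma, schedule, n_trials=3000):
    """One run: up to 3000 trials, all updating the same Q-table."""
    Q = np.zeros((100, 4))
    for _ in range(n_trials):
        Q, reached_goal = run_trial(Q, reward, gamma, schedule)
    return Q
```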
2.2 Experiment Results
The number of goal-reached runs and the execution times of the 8 sets are shown in TABLE I.
All the Q-functions are tested for functionality. As it turns out, not every goal-reached Q-function generates a policy that is able to complete the task. For example, the policy generated with γ = 0.5 and ε_k, α_k = 100/(100+k) contains a dead loop: π*(1) = 3, π*(2) = 1, which means the robot can only move from state 1 to state 2 and from state 2 back to state 1. During Q-learning, the program escaped this loop through Exploration and reached the destination; however, a robot following the greedy policy generated from that Q-function cannot complete the task. As a result, only two sets passed the functionality test: γ = 0.9 with ε_k, α_k = 100/(100+k), and γ = 0.9 with ε_k, α_k = (1+5log(k))/k.
After testing all the Q-functions, I extract the optimal path of every Q-function with the greedy policy and compare them. All the resulting policies are the same, so the total reward and the trajectories are also the same. The total reward is 181.488. The trajectory is shown in Fig. 1.
Figure 1. Trajectory of optimal policy
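Extracting such a trajectory from a learned Q-function amounts to a greedy rollout; a minimal sketch, reusing next_state from above (the step limit is only a safeguard against dead loops):

```python
def greedy_path(Q, start=1, goal=100, max_steps=200):
    """Follow a = argmax_a Q(s, a) from the start state until the goal is reached."""
    path, s = [start], start
    while s != goal and len(path) <= max_steps:
        a = int(np.argmax(Q[s - 1])) + 1          # greedy (optimal-policy) action
        s = next_state(s, a)
        path.append(s)
    return path
```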
TABLE I
PARAMETER VALUES AND PERFORMANCE OF Q-LEARNING FROM TASK 1

ε_k, α_k         | No. of goal-reached runs | Execution time (sec.)
                 | γ = 0.5      γ = 0.9     | γ = 0.5      γ = 0.9
1/k              | 0            0           | NA           NA
100/(100+k)      | 0            10          | NA           2.7955
(1+log(k))/k     | 0            0           | NA           NA
(1+5log(k))/k    | 0            10          | NA           5.9324
2.3 Analysis
2.3.1 Performance of different parameter values
To compare the different choices of ε_k, α_k more intuitively, I plot the value of each function for k in [1, 200].
Figure 2. Values of ε_k and α_k
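A minimal sketch of such a plot (matplotlib is assumed, and log is taken as base 10, which is consistent with the maximum time steps reported in TABLE II below):

```python
import numpy as np
import matplotlib.pyplot as plt

k = np.arange(1, 201)
schedules = {
    "1/k": 1 / k,
    "100/(100+k)": 100 / (100 + k),
    "(1+log(k))/k": (1 + np.log10(k)) / k,
    "(1+5log(k))/k": (1 + 5 * np.log10(k)) / k,
}
for label, values in schedules.items():
    plt.plot(k, values, label=label)
plt.xlabel("time step k")
plt.ylabel("ε_k, α_k")
plt.legend()
plt.show()
```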
From Fig. 2 we can clearly see that 1/k and (1+log(k))/k drop rapidly within the first 20 steps, while 100/(100+k) decreases slowly throughout the 200 steps. It is also apparent that (1+5log(k))/k is greater than 1 during the first 14 steps, which is usually not allowed because ε_k and α_k are both probabilities. Nevertheless, in the experiment (1+5log(k))/k with γ = 0.9 finished the task successfully, which to some extent supports the idea that the learning rate and the exploration rate should be larger at the beginning of the learning process. Specifically, by setting ε_k to 1 or more, Q-learning chooses to explore actions other than the current optimal action with 100% probability, which is useful under the circumstances of this experiment.
Another factor that may influence the performance is the condition for ending a trial. In the implementation, a trial ends either when the agent reaches the destination or when α_k is less than 0.005. Because α_k is given in different functional forms, the time step at which α_k reaches 0.005 differs. The maximum time step for each of the four forms of α_k is given below:
TABLE II Maximum time steps

Function         | Maximum time step
1/k              | 200
100/(100+k)      | 19900
(1+log(k))/k     | 778
(1+5log(k))/k    | 3777
It is possible that Q-learning trials with 1/k and (1+log(k))/k reach the threshold too quickly to come up with a solution.
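These maximum time steps can be reproduced with a few lines of Python, again taking log as base 10 (the 778 and 3777 entries in TABLE II only come out with base-10 logarithms):

```python
import numpy as np

threshold = 0.005
schedules = {
    "1/k": lambda k: 1 / k,
    "100/(100+k)": lambda k: 100 / (100 + k),
    "(1+log(k))/k": lambda k: (1 + np.log10(k)) / k,
    "(1+5log(k))/k": lambda k: (1 + 5 * np.log10(k)) / k,
}
for name, f in schedules.items():
    k = 1
    while f(k + 1) >= threshold:      # advance while the next step is still at or above 0.005
        k += 1
    print(f"{name}: last step with alpha_k >= 0.005 is k = {k}")
```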
On the other hand, the influence of the value of γ is easier to observe, based on the definition of the Q-function update rule. γ is also known as the discount rate. A smaller γ forces the agent to focus on the next few steps, while a Q-function with a bigger γ is more influenced by the rewards of future steps, i.e., it is more farsighted.
2.3.2 Analysis of execution time
Also, the average execution time of (1+5log(k))/k is greater than that of 100/(100+k). One possible reason is that the computational cost of evaluating (1+5log(k))/k is higher, requiring more execution time. To test the influence of this computational cost, I evaluated both functions 2000 times and recorded the execution time.
As shown in TABLE III, there is no big difference in the execution time, even if we multiply the results by 3000 (the maximum number of trials within a run). Thus, computational cost is not a main reason.
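A hedged sketch of how such a timing comparison could be made (the use of timeit, the representative step k and the base-10 logarithm are assumptions; only the 2000-repetition count comes from the text):

```python
import math
import timeit

k = 100  # a representative time step
t_simple = timeit.timeit(lambda: 100 / (100 + k), number=2000)
t_log = timeit.timeit(lambda: (1 + 5 * math.log10(k)) / k, number=2000)
print(f"100/(100+k): {t_simple:.6f} s    (1+5log(k))/k: {t_log:.6f} s")
```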
I also recorded the number of time steps in each trial and compared them. Trials with ε_k, α_k = 100/(100+k) can reach the destination within hundreds of steps after the first 30 trials, whereas trials with ε_k, α_k = (1+5log(k))/k only reach the destination within hundreds of steps after approximately 200-350 trials. The average number of steps per trial is also shown in the table below. It is clear that the number of time steps for (1+5log(k))/k is much larger than that for 100/(100+k), which is most likely the cause of the difference in execution time.
TABLE III Execution time of functions

Function         | Execution time (sec.) | Average time step
100/(100+k)      | 0.000342              | 278.9643
(1+5log(k))/k    | 0.000410              | 654.3141
2.4 Conclusion
From the analysis above, we can conclude that:
1) A discount rate γ with a large value is beneficial.
2) The program performs better when ε_k and α_k follow gently decreasing curves.
3) Of the two goal-reached sets, the ε_k, α_k that decreases the slowest gives the best performance.
III. Task 2: Q-learning with new reward matrix
3.1 Justification of parameter selections
From the results of Task 1, we can conclude that the learning rate α_k and the exploration rate ε_k should decrease slowly during a trial. Considering that α_k and ε_k are probabilities and should be less than 1, I choose the form exp(−mk). By giving different values to the parameter m, α_k and ε_k can take different curves.
After trying different values of m with a simple reward function, I found that α_k, ε_k = exp(−0.015k) has the best performance.
I also choose γ = 0.9 because Task 1 showed it to be a clearly good choice; it also passes all the tests in the experiment using my own reward function.
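As a quick check on this choice (assuming the same 0.005 learning-rate threshold as in Task 1; the arithmetic below is my own, not from the report), exp(−0.015k) never exceeds 1 and stays above the threshold for roughly the first 350 steps, so it decreases gently in the sense favoured by the Task 1 conclusions:

```python
import math

m = 0.015
alpha = lambda k: math.exp(-m * k)          # shared schedule for alpha_k and epsilon_k

print(alpha(1))                             # about 0.985, so the value never exceeds 1
k_max = int(math.log(1 / 0.005) / m)        # last step with alpha_k >= 0.005, about 353
print(k_max, alpha(k_max), alpha(k_max + 1))
```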