Description
I. Reinforcement Learning
In reinforcement learning (RL), an intelligent agent learns to complete a task by
maximizing the reward it receives through trials, without knowledge of the desired
outputs. In this report, the actions of a robot are simulated; the robot aims to move
from the top-left corner of a 10-by-10 grid to the bottom-right corner. The robot adopts
Q-learning with the $\varepsilon$-greedy exploration method.
Q-learning: Q-learning is a model-free dynamic programming method in RL,
which means it does not require knowledge of the state-transition model; it only uses
the rewards received from observed state transitions.
The key update rule in Q-learning concerns the Q-function, which measures the
"worth" of state-action pairs:

$$Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + \alpha_t \left( r_{t+1} + \gamma \max_{a'} Q_t(s_{t+1}, a') - Q_t(s_t, a_t) \right),$$

where $(s_t, a_t)$ is the state-action pair, $\alpha_t$ is the learning rate, $\gamma$ is the discount rate,
and $r_t$ is the reward received at step $t$. Two strategies for choosing the action at a given
state are emphasized, called Exploitation and Exploration. Exploitation selects the
currently known best action, $a_{t+1} = \arg\max_{a'} Q_t(s_{t+1}, a')$, while Exploration selects
other actions. Exploitation clearly benefits the learning process, but Exploration is also
essential for finding the optimal policy throughout the whole process. To balance this
trade-off, $\varepsilon$-greedy exploration is adopted: Exploration is chosen with probability $\varepsilon_t$
and Exploitation with probability $1 - \varepsilon_t$, where $\varepsilon_t$ decreases as learning continues.
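As a concrete illustration, the following minimal Python sketch (my own illustrative code with hypothetical helper names and 0-based indices, not the implementation used in this report) shows one Q-function update and the $\varepsilon$-greedy action choice:

```python
import numpy as np

def q_learning_step(Q, s, a, r, s_next, alpha_k, gamma):
    # Q-function update rule: move Q(s, a) towards the bootstrapped target
    # r + gamma * max_a' Q(s_next, a').
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha_k * (td_target - Q[s, a])

def epsilon_greedy(Q, s, eps_k, rng):
    # Exploration with probability eps_k, Exploitation otherwise.
    # If a schedule gives eps_k >= 1, the agent always explores.
    if rng.random() < eps_k:
        return int(rng.integers(Q.shape[1]))   # explore: random action
    return int(np.argmax(Q[s]))                # exploit: current best action
```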
In the learning process, a series of trials is conducted, each updating the Q-function,
until the goal is reached. With the final Q-function, we can determine the optimal action
at each state, which is also known as the optimal policy. Using the optimal policy, the
agent is able to complete the given task.
In summary, by controlling the parameters $\alpha$, $\gamma$, and $\varepsilon$, different Q-functions can be
built through trials. From these goal-reached Q-functions, an optimal policy can be
obtained to complete the given task.
II. Task 1: Q-learning with given reward matrix
2.1 Experiment Implementation
In this experiment, the agent is a robot with actions to move up, right, down, and
left, defined as $a = 1, 2, 3, 4$ respectively. The initial state of the robot is the top-left
corner of a 10-by-10 grid (state 1), and the destination state is the bottom-right corner
of the grid (state 100).
To explore the influence of different $\alpha$, $\gamma$, and $\varepsilon$, 8 sets of experiments are conducted,
each containing 10 runs. Within a run, there are at most 3000 trials to update the
Q-function. Each trial consists of a number of time steps $k$ and ends either when the
robot reaches state 100 or when the learning rate at step $k$, $\alpha_k$, falls below a threshold
(0.005). The Q-functions of goal-reached runs are recorded, along with the execution time.
When the execution of a set is finished, the number of goal-reached runs is
recorded and the average execution time is computed. After all 8 sets have been executed,
the success counts and the average execution times are compared. A test is also conducted
on all Q-functions recorded from the goal-reached runs, to examine the functionality of
the Q-functions and their performance (i.e., the number of steps needed to finish the task).
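A sketch of how one run could be organised is given below. It reuses `q_learning_step` and `epsilon_greedy` from the earlier sketch, and `step(s, a)` stands for a hypothetical environment function returning the next state and reward on the 10-by-10 grid (the given reward matrix is not reproduced here):

```python
import numpy as np

def run_q_learning(step, schedule, gamma, max_trials=3000, threshold=0.005):
    # One run of Task 1: up to 3000 trials; a trial ends when the robot
    # reaches state 100 (index 99 here) or when alpha_k drops below 0.005.
    # `schedule` maps the time step k to the shared value of eps_k and alpha_k.
    rng = np.random.default_rng()
    Q = np.zeros((100, 4))                     # 100 states x 4 actions
    for trial in range(max_trials):
        s, k = 0, 1                            # start at the top-left corner
        while s != 99 and schedule(k) >= threshold:
            a = epsilon_greedy(Q, s, schedule(k), rng)
            s_next, r = step(s, a)             # hypothetical transition + reward
            q_learning_step(Q, s, a, r, s_next, schedule(k), gamma)
            s, k = s_next, k + 1
    return Q
```

For example, `run_q_learning(step, lambda k: 100 / (100 + k), gamma=0.9)` would correspond to one of the parameter sets listed in TABLE I.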
2.2 Experiment Results
The number of goal-reached runs and the execution time of the 8 sets are shown in TABLE I.
All the Q-functions are tested for functionality. As a result, not all of the goal-reached
Q-functions can generate a policy that is able to complete the task. For example, the
policies generated with $\gamma = 0.5$, $\varepsilon_k, \alpha_k = \frac{100}{100+k}$ show a dead end: $\pi^*(1) = 3$, $\pi^*(2) = 1$.
That means the robot can only move from state 1 to state 2 and from state 2 back to
state 1. In the Q-learning process, the program escaped this loop through Exploration
and reached the destination; however, a robot following the policy generated from that
Q-function cannot complete the task. As a result, only two sets passed the functionality
test: $\gamma = 0.9$, $\varepsilon_k, \alpha_k = \frac{100}{100+k}$ and $\gamma = 0.9$, $\varepsilon_k, \alpha_k = \frac{1+5\log(k)}{k}$.
After obtaining all the Q-functions that passed the test, I extract the optimal path
for every Q-function with the greedy policy and compare them. All the policies obtained
from these Q-functions turn out to be the same, and thus the total reward and the
trajectories are also the same. The total reward is 181.488. The trajectory is shown in Fig. 1.
Figure 1. Trajectory of optimal policy
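For reference, the greedy path extraction can be sketched as follows, assuming column-wise state numbering (consistent with the dead-end example above, in which action 3, down, moves state 1 to state 2), 0-based indices, and a hypothetical 100 x 4 reward matrix `reward`:

```python
import numpy as np

MOVES = {0: -1, 1: +10, 2: +1, 3: -10}    # up, right, down, left on a column-wise grid

def greedy_trajectory(Q, reward, max_steps=1000):
    # Follow the greedy (optimal) policy from state 1 (index 0) to state 100
    # (index 99), accumulating the reward; max_steps guards against the kind
    # of dead-end loop described in the functionality test above.
    s, path, total = 0, [0], 0.0
    for _ in range(max_steps):
        if s == 99:
            break
        a = int(np.argmax(Q[s]))
        total += reward[s, a]
        s += MOVES[a]
        path.append(s)
    return path, total
```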
TABLE I
PARAMETER VALUES AND PERFORMANCE OF Q-LEARNING FROM TASK 1

                        No. of goal-reached runs      Execution time (sec.)
ε_k, α_k                γ = 0.5       γ = 0.9         γ = 0.5       γ = 0.9
1/k                     0             0               NA            NA
100/(100+k)             0             10              NA            2.7955
(1+log(k))/k            0             0               NA            NA
(1+5log(k))/k           0             10              NA            5.9324
2.3 Analysis
2.3.1 Performance of different parameter values
To see the differences among the $\varepsilon_k$ and $\alpha_k$ schedules more intuitively, I plot the
values of the different functions for $k$ in $[1, 200]$.
Figure 2. Value of $\varepsilon_k$ and $\alpha_k$
From Fig. 2 we can clearly see that $1/k$ and $(1+\log(k))/k$ drop rapidly within the first
20 steps, while $100/(100+k)$ decreases slowly throughout the 200 steps. It is also apparent
that $(1+5\log(k))/k$ is greater than 1 during the first 14 steps, which is usually not allowed
because $\varepsilon_k$ and $\alpha_k$ are both probabilities. But in the experiment, $(1+5\log(k))/k$ with
$\gamma = 0.9$ finished the task successfully, which to some extent supports the idea that the
learning rate and exploration rate should be larger at the beginning of the learning process.
Specifically, by setting $\varepsilon_k$ to 1 or more, Q-learning chooses to explore actions other than
the current optimal action with 100% probability, which turns out to be useful under the
circumstances of this experiment.
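A plot along the lines of Fig. 2 can be generated with the short script below; note that the base of "log" is not stated in the assignment, and base 10 is assumed here because it matches the thresholds in TABLE II:

```python
import numpy as np
import matplotlib.pyplot as plt

# Plot the four eps_k / alpha_k schedules for k in [1, 200], as in Fig. 2.
k = np.arange(1, 201)
schedules = {
    "1/k": 1 / k,
    "100/(100+k)": 100 / (100 + k),
    "(1+log(k))/k": (1 + np.log10(k)) / k,        # base-10 log assumed
    "(1+5log(k))/k": (1 + 5 * np.log10(k)) / k,
}
for label, values in schedules.items():
    plt.plot(k, values, label=label)
plt.axhline(1.0, linestyle="--", linewidth=0.8)   # eps_k, alpha_k = 1 reference line
plt.xlabel("time step k")
plt.ylabel("eps_k, alpha_k")
plt.legend()
plt.show()
```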
Another factor that may influence the performance is the condition for ending a trial.
In the implementation, a trial ends either when the agent reaches the destination or
when $\alpha_k$ is less than 0.005. Because $\alpha_k$ is given in different forms, the time step at
which $\alpha_k$ reaches 0.005 differs. The maximum time step for each of the four forms of
$\alpha_k$ is given below:
TABLE II Maximum time steps

Function                Maximum time step
1/k                     200
100/(100+k)             19900
(1+log(k))/k            778
(1+5log(k))/k           3777
It is possible that Q-learning trials with $1/k$ and $(1+\log(k))/k$ reach the threshold
too quickly to come up with a solution.
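The entries of TABLE II can be checked numerically with the sketch below; log is taken as base 10 here because that reproduces the 778 and 3777 entries (with the natural logarithm the last two thresholds would differ):

```python
import numpy as np

# Largest time step k at which alpha_k is still >= 0.005 for each schedule.
schedules = {
    "1/k": lambda k: 1 / k,
    "100/(100+k)": lambda k: 100 / (100 + k),
    "(1+log(k))/k": lambda k: (1 + np.log10(k)) / k,
    "(1+5log(k))/k": lambda k: (1 + 5 * np.log10(k)) / k,
}
for name, f in schedules.items():
    k = 1
    while f(k + 1) >= 0.005:
        k += 1
    print(name, k)    # prints 200, 19900, 778, 3777
```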
2.3.2 Analysis of execution time
On the other hand, the influence of the value of $\gamma$ is easier to observe, based on the
definition of the Q-function update rule. $\gamma$ is also known as the discount rate. A smaller $\gamma$
forces the agent to focus on the next few steps, while a Q-function with a larger $\gamma$ is
more influenced by the rewards from future steps, i.e., it is more farsighted.
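To make "farsighted" concrete with a rough calculation: the shortest path from the start to the goal on the 10-by-10 grid takes 18 moves, so a reward obtained near the goal reaches the start state's Q-values attenuated by roughly

$$\gamma^{18} = 0.5^{18} \approx 3.8 \times 10^{-6} \quad \text{versus} \quad 0.9^{18} \approx 0.15,$$

i.e., with $\gamma = 0.5$ the goal reward is practically invisible from the start state, while with $\gamma = 0.9$ it still noticeably shapes the early Q-values.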
Also, the average execution time of $(1+5\log(k))/k$ is greater than that of $100/(100+k)$.
One possible reason is that the computational complexity of $(1+5\log(k))/k$ is higher,
requiring more execution time. To test the influence of computational complexity, I
evaluated both functions 2000 times and recorded the execution time.
As shown in TABLE III, there is no big difference in the execution time, even if we
further multiply the results by 3000 (the maximum number of trials within a run). Thus,
computational complexity is not the main reason.
I also recorded and compared the time steps in each trial. Trials with
$\varepsilon_k, \alpha_k = \frac{100}{100+k}$ can reach the destination within hundreds of steps after roughly the
first 30 trials, whereas trials with $\varepsilon_k, \alpha_k = \frac{1+5\log(k)}{k}$ reach the destination within
hundreds of steps only after approximately 200-350 trials. The average number of steps
per trial is also shown in TABLE III. It is clear that the number of time steps for
$(1+5\log(k))/k$ is much larger than that for $100/(100+k)$, which is most likely the reason
for the difference in execution time.
TABLE III Execution time of functions

Function                Execution time (sec.)    Average time steps per trial
100/(100+k)             0.000342                 278.9643
(1+5log(k))/k           0.000410                 654.3141
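The micro-benchmark described above can be sketched as follows (my own re-creation, not the original code; absolute timings depend on the machine, and TABLE III lists the values measured for this report):

```python
import timeit
import numpy as np

# Evaluate each schedule 2000 times and record the elapsed time.
t1 = timeit.timeit(lambda: [100 / (100 + k) for k in range(1, 2001)], number=1)
t2 = timeit.timeit(lambda: [(1 + 5 * np.log10(k)) / k for k in range(1, 2001)],
                   number=1)
print(f"100/(100+k): {t1:.6f} s   (1+5log(k))/k: {t2:.6f} s")
```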
2.4 Conclusion
From the analysis above, we can conclude that:
1) A large value of the discount rate $\gamma$ is beneficial.
2) The program performs better when $\varepsilon_k$ and $\alpha_k$ have gently decreasing curves.
3) Of the two goal-reached sets, the $\varepsilon_k$, $\alpha_k$ schedule that decreases the slowest gives
the best performance.
III. Task 2: Q-learning with new reward matrix
3.1 Justification of parameter selections
From the results of Task 1, we can conclude that the learning rate $\alpha_k$ and the exploration
rate $\varepsilon_k$ should decrease slowly during a trial. Considering that $\alpha_k$ and $\varepsilon_k$ are
probabilities and should be less than 1, I choose the form $\exp(-ck)$. By giving
different values to the parameter $c$, $\alpha_k$ and $\varepsilon_k$ can take different curves.
After trying different values of $c$ with a simple reward function, I found that
$\alpha_k, \varepsilon_k = \exp(-0.015k)$ gives the best performance.
I also choose $\gamma = 0.9$ because it is clearly a good choice in Task 1, and it
passes all the tests in the experiment using a reward function of my own.
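A quick numerical check of the chosen schedule (illustrative only):

```python
import numpy as np

# eps_k = alpha_k = exp(-0.015 k): always in (0, 1) and gently decreasing;
# it falls below the 0.005 trial-termination threshold only after a few
# hundred time steps.
k = np.arange(1, 1001)
schedule = np.exp(-0.015 * k)
print("value at k = 1:", schedule[0])                                     # about 0.985
print("first k with exp(-0.015k) < 0.005:", int(k[schedule < 0.005][0]))  # 354
```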