Description
I. Reinforcement Learning
In reinforcement learning (RL), an intelligent agent learns to complete a task by
maximizing the reward it receives through trials, without knowledge of the desired
outputs. In this report, the actions of a robot are simulated; the robot aims to move
from the top-left corner of a 10-by-10 grid to the bottom-right corner. The robot adopts
Q-learning with the $\varepsilon$-greedy exploration method.
Q-learning: Q-learning is a model-free dynamic programming method in RL,
which means it does not require knowledge of the state-transition model; it only uses
the rewards received from observed state transitions.
The key update rule in Q-learning concerns the Q-function, which measures the
"worth" of state-action pairs:

$$Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + \alpha_t \left( r_{t+1} + \gamma \max_{a'} Q_t(s_{t+1}, a') - Q_t(s_t, a_t) \right),$$

where $(s_t, a_t)$ is the state-action pair, $\alpha_t$ is the learning rate, $\gamma$ is the discount rate,
and $r_t$ is the reward received at step $t$. Two strategies for choosing the action at a given
state are emphasized, called Exploitation and Exploration. Exploitation selects the
currently known best action, $a_{t+1} = \arg\max_{a'} Q_t(s_{t+1}, a')$, while Exploration selects
other actions. Exploitation clearly benefits the learning process, but Exploration is also
essential for finding the optimal policy throughout the whole process. To balance this
trade-off, $\varepsilon$-greedy exploration is adopted: Exploration is chosen with probability $\varepsilon_t$
and Exploitation with probability $1 - \varepsilon_t$, where $\varepsilon_t$ decreases as learning continues.
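As a concrete illustration, the following minimal Python sketch (my own illustrative code with hypothetical helper names and 0-based indices, not the implementation used in this report) shows one Q-function update and the $\varepsilon$-greedy action choice:

```python
import numpy as np

def q_learning_step(Q, s, a, r, s_next, alpha_k, gamma):
    # Q-function update rule: move Q(s, a) towards the bootstrapped target
    # r + gamma * max_a' Q(s_next, a').
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha_k * (td_target - Q[s, a])

def epsilon_greedy(Q, s, eps_k, rng):
    # Exploration with probability eps_k, Exploitation otherwise.
    # If a schedule gives eps_k >= 1, the agent always explores.
    if rng.random() < eps_k:
        return int(rng.integers(Q.shape[1]))   # explore: random action
    return int(np.argmax(Q[s]))                # exploit: current best action
```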
In the learning process, a series of trials is conducted, each updating the Q-function,
until the goal is reached. With the final Q-function, we can determine the optimal action
at each state, which is also known as the optimal policy. Using the optimal policy, the
agent is able to complete the given task.
In summary, by controlling the parameters $\alpha$, $\gamma$, and $\varepsilon$, different Q-functions can be
built through trials. From these goal-reached Q-functions, an optimal policy can be
obtained to complete the given task.
II. Task 1: Q-learning with given reward matrix
2.1 Experiment Implementation
In this experiment, the agent is a robot with actions to move up, right, down, and
left, defined as $a = 1, 2, 3, 4$ respectively. The initial state of the robot is the top-left
corner of a 10-by-10 grid (state 1), and the destination state is the bottom-right corner
of the grid (state 100).
To explore the influence of different $\alpha$, $\gamma$, and $\varepsilon$, 8 sets of experiments are conducted,
each containing 10 runs. Within a run, there are at most 3000 trials to update the
Q-function. Each trial consists of a number of time steps $k$ and ends either when the
robot reaches state 100 or when the learning rate at step $k$, $\alpha_k$, falls below a threshold
(0.005). The Q-functions of goal-reached runs are recorded, along with the execution time.
When the execution of a set is finished, the number of goal-reached runs is
recorded and the average execution time is computed. After all 8 sets have been executed,
the success counts and the average execution times are compared. A test is also conducted
on all Q-functions recorded from the goal-reached runs, to examine the functionality of
the Q-functions and their performance (i.e., the number of steps needed to finish the task).
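A sketch of how one run could be organised is given below. It reuses `q_learning_step` and `epsilon_greedy` from the earlier sketch, and `step(s, a)` stands for a hypothetical environment function returning the next state and reward on the 10-by-10 grid (the given reward matrix is not reproduced here):

```python
import numpy as np

def run_q_learning(step, schedule, gamma, max_trials=3000, threshold=0.005):
    # One run of Task 1: up to 3000 trials; a trial ends when the robot
    # reaches state 100 (index 99 here) or when alpha_k drops below 0.005.
    # `schedule` maps the time step k to the shared value of eps_k and alpha_k.
    rng = np.random.default_rng()
    Q = np.zeros((100, 4))                     # 100 states x 4 actions
    for trial in range(max_trials):
        s, k = 0, 1                            # start at the top-left corner
        while s != 99 and schedule(k) >= threshold:
            a = epsilon_greedy(Q, s, schedule(k), rng)
            s_next, r = step(s, a)             # hypothetical transition + reward
            q_learning_step(Q, s, a, r, s_next, schedule(k), gamma)
            s, k = s_next, k + 1
    return Q
```

For example, `run_q_learning(step, lambda k: 100 / (100 + k), gamma=0.9)` would correspond to one of the parameter sets listed in TABLE I.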
2.2 Experiment Results
The number of goal-reached runs and the execution time of the 8 sets are shown in TABLE I.
All the Q-functions are tested for functionality. As a result, not all of the goal-reached
Q-functions can generate a policy that is able to complete the task. For example, the
policies generated with $\gamma = 0.5$, $\varepsilon_k, \alpha_k = \frac{100}{100+k}$ show a dead end: $\pi^*(1) = 3$, $\pi^*(2) = 1$.
That means the robot can only move from state 1 to state 2 and from state 2 back to
state 1. In the Q-learning process, the program escaped this loop through Exploration
and reached the destination; however, a robot following the policy generated from that
Q-function cannot complete the task. As a result, only two sets passed the functionality
test: $\gamma = 0.9$, $\varepsilon_k, \alpha_k = \frac{100}{100+k}$ and $\gamma = 0.9$, $\varepsilon_k, \alpha_k = \frac{1+5\log(k)}{k}$.
After obtaining all the Q-functions that passed the test, I extract the optimal path
for every Q-function with the greedy policy and compare them. All the policies obtained
from these Q-functions turn out to be the same, and thus the total reward and the
trajectories are also the same. The total reward is 181.488. The trajectory is shown in Fig. 1.
Figure 1. Trajectory of optimal policy
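For reference, the greedy path extraction can be sketched as follows, assuming column-wise state numbering (consistent with the dead-end example above, in which action 3, down, moves state 1 to state 2), 0-based indices, and a hypothetical 100 x 4 reward matrix `reward`:

```python
import numpy as np

MOVES = {0: -1, 1: +10, 2: +1, 3: -10}    # up, right, down, left on a column-wise grid

def greedy_trajectory(Q, reward, max_steps=1000):
    # Follow the greedy (optimal) policy from state 1 (index 0) to state 100
    # (index 99), accumulating the reward; max_steps guards against the kind
    # of dead-end loop described in the functionality test above.
    s, path, total = 0, [0], 0.0
    for _ in range(max_steps):
        if s == 99:
            break
        a = int(np.argmax(Q[s]))
        total += reward[s, a]
        s += MOVES[a]
        path.append(s)
    return path, total
```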
TABLE I
PARAMETER VALUES AND PERFORMANCE OF Q-LEARNING FROM TASK 1

                        No. of goal-reached runs      Execution time (sec.)
ε_k, α_k                γ = 0.5       γ = 0.9         γ = 0.5       γ = 0.9
1/k                     0             0               NA            NA
100/(100+k)             0             10              NA            2.7955
(1+log(k))/k            0             0               NA            NA
(1+5log(k))/k           0             10              NA            5.9324
2.3 Analysis
2.3.1 Performance of different parameter values
To see the differences among the $\varepsilon_k$ and $\alpha_k$ schedules more intuitively, I plot the
values of the different functions for $k$ in $[1, 200]$.
Figure 2. Value of $\varepsilon_k$ and $\alpha_k$
From Fig. 2 we can clearly see that $1/k$ and $(1+\log(k))/k$ drop rapidly within the first
20 steps, while $100/(100+k)$ decreases slowly throughout the 200 steps. It is also apparent
that $(1+5\log(k))/k$ is greater than 1 during the first 14 steps, which is usually not allowed
because $\varepsilon_k$ and $\alpha_k$ are both probabilities. But in the experiment, $(1+5\log(k))/k$ with
$\gamma = 0.9$ finished the task successfully, which to some extent supports the idea that the
learning rate and exploration rate should be larger at the beginning of the learning process.
Specifically, by setting $\varepsilon_k$ to 1 or more, Q-learning chooses to explore actions other than
the current optimal action with 100% probability, which turns out to be useful under the
circumstances of this experiment.
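A plot along the lines of Fig. 2 can be generated with the short script below; note that the base of "log" is not stated in the assignment, and base 10 is assumed here because it matches the thresholds in TABLE II:

```python
import numpy as np
import matplotlib.pyplot as plt

# Plot the four eps_k / alpha_k schedules for k in [1, 200], as in Fig. 2.
k = np.arange(1, 201)
schedules = {
    "1/k": 1 / k,
    "100/(100+k)": 100 / (100 + k),
    "(1+log(k))/k": (1 + np.log10(k)) / k,        # base-10 log assumed
    "(1+5log(k))/k": (1 + 5 * np.log10(k)) / k,
}
for label, values in schedules.items():
    plt.plot(k, values, label=label)
plt.axhline(1.0, linestyle="--", linewidth=0.8)   # eps_k, alpha_k = 1 reference line
plt.xlabel("time step k")
plt.ylabel("eps_k, alpha_k")
plt.legend()
plt.show()
```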
Another factor that may influence the performance is the condition for ending a trial.
In the implementation, a trial ends either when the agent reaches the destination or
when $\alpha_k$ is less than 0.005. Because $\alpha_k$ is given in different forms, the time step at
which $\alpha_k$ reaches 0.005 differs. The maximum time step for each of the four forms of
$\alpha_k$ is given below:
TABLE II Maximum time steps

Function                Maximum time step
1/k                     200
100/(100+k)             19900
(1+log(k))/k            778
(1+5log(k))/k           3777
It is possible that Q-learning trials with $1/k$ and $(1+\log(k))/k$ reach the threshold
too quickly to come up with a solution.
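The entries of TABLE II can be checked numerically with the sketch below; log is taken as base 10 here because that reproduces the 778 and 3777 entries (with the natural logarithm the last two thresholds would differ):

```python
import numpy as np

# Largest time step k at which alpha_k is still >= 0.005 for each schedule.
schedules = {
    "1/k": lambda k: 1 / k,
    "100/(100+k)": lambda k: 100 / (100 + k),
    "(1+log(k))/k": lambda k: (1 + np.log10(k)) / k,
    "(1+5log(k))/k": lambda k: (1 + 5 * np.log10(k)) / k,
}
for name, f in schedules.items():
    k = 1
    while f(k + 1) >= 0.005:
        k += 1
    print(name, k)    # prints 200, 19900, 778, 3777
```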
2.3.2 Analysis of execution time
On the other hand, the influence of the value of $\gamma$ is easier to observe, based on the
definition of the Q-function update rule. $\gamma$ is also known as the discount rate. A smaller $\gamma$
forces the agent to focus on the next few steps, while a Q-function with a larger $\gamma$ is
more influenced by the rewards from future steps, i.e., it is more farsighted.
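To make "farsighted" concrete with a rough calculation: the shortest path from the start to the goal on the 10-by-10 grid takes 18 moves, so a reward obtained near the goal reaches the start state's Q-values attenuated by roughly

$$\gamma^{18} = 0.5^{18} \approx 3.8 \times 10^{-6} \quad \text{versus} \quad 0.9^{18} \approx 0.15,$$

i.e., with $\gamma = 0.5$ the goal reward is practically invisible from the start state, while with $\gamma = 0.9$ it still noticeably shapes the early Q-values.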
Also, the average execution time of $(1+5\log(k))/k$ is greater than that of $100/(100+k)$.
One possible reason is that the computational complexity of $(1+5\log(k))/k$ is higher,
requiring more execution time. To test the influence of computational complexity, I
evaluated both functions 2000 times and recorded the execution time.
As shown in TABLE III, there is no big difference in the execution time, even if we
further multiply the results by 3000 (the maximum number of trials within a run). Thus,
computational complexity is not the main reason.
I also recorded and compared the time steps in each trial. Trials with
$\varepsilon_k, \alpha_k = \frac{100}{100+k}$ can reach the destination within hundreds of steps after roughly the
first 30 trials, whereas trials with $\varepsilon_k, \alpha_k = \frac{1+5\log(k)}{k}$ reach the destination within
hundreds of steps only after approximately 200-350 trials. The average number of steps
per trial is also shown in TABLE III. It is clear that the number of time steps for
$(1+5\log(k))/k$ is much larger than that for $100/(100+k)$, which is most likely the reason
for the difference in execution time.
TABLE III Execution time of functions

Function                Execution time (sec.)    Average time steps per trial
100/(100+k)             0.000342                 278.9643
(1+5log(k))/k           0.000410                 654.3141
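The micro-benchmark described above can be sketched as follows (my own re-creation, not the original code; absolute timings depend on the machine, and TABLE III lists the values measured for this report):

```python
import timeit
import numpy as np

# Evaluate each schedule 2000 times and record the elapsed time.
t1 = timeit.timeit(lambda: [100 / (100 + k) for k in range(1, 2001)], number=1)
t2 = timeit.timeit(lambda: [(1 + 5 * np.log10(k)) / k for k in range(1, 2001)],
                   number=1)
print(f"100/(100+k): {t1:.6f} s   (1+5log(k))/k: {t2:.6f} s")
```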
2.4 Conclusion
From the analysis above, we can conclude that:
1) A large value of the discount rate $\gamma$ is beneficial.
2) The program performs better when $\varepsilon_k$ and $\alpha_k$ have gently decreasing curves.
3) Of the two goal-reached sets, the $\varepsilon_k$, $\alpha_k$ schedule that decreases the slowest gives
the best performance.
III. Task 2: Q-learning with new reward matrix
3.1 Justification of parameter selections
From the results of Task 1, we can conclude that the learning rate $\alpha_k$ and the exploration
rate $\varepsilon_k$ should decrease slowly during a trial. Considering that $\alpha_k$ and $\varepsilon_k$ are
probabilities and should be less than 1, I choose the form $\exp(-ck)$. By giving
different values to the parameter $c$, $\alpha_k$ and $\varepsilon_k$ can take different curves.
After trying different values of $c$ with a simple reward function, I found that
$\alpha_k, \varepsilon_k = \exp(-0.015k)$ gives the best performance.
I also choose $\gamma = 0.9$ because it is clearly a good choice in Task 1, and it
passes all the tests in the experiment using a reward function of my own.
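A quick numerical check of the chosen schedule (illustrative only):

```python
import numpy as np

# eps_k = alpha_k = exp(-0.015 k): always in (0, 1) and gently decreasing;
# it falls below the 0.005 trial-termination threshold only after a few
# hundred time steps.
k = np.arange(1, 1001)
schedule = np.exp(-0.015 * k)
print("value at k = 1:", schedule[0])                                     # about 0.985
print("first k with exp(-0.015k) < 0.005:", int(k[schedule < 0.005][0]))  # 354
```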