In this homework you will have the complete RL experience. You will work towards implementing
and evaluating the Q-learning algorithm on a simple domain. Q-learning is a fundamental RL
algorithm and has been successfully used to solve a variety of decision-making problems. For this
homework, you will have to think carefully about algorithm implementation, specially exploration
The domain you will be tackling is called Taxi (Taxi-v2). It is a discrete MDP which has been used
for RL research in the past. This will also be your first opportunity to become familiar with the
OpenAI Gym environment (https://gym.openai.com/). This is a cool and unique platform where
users can test their RL algorithms over a selection of domains.
The Taxi problem was introduced in Dietterich(2000). It is a grid-based domain where the goal of
the agent is to pick up a passenger at one location and drop them off in another. There are 4
fixed locations, each assigned a different letter. The agent has 6 actions; 4 for movement, 1 for
pickup, and 1 for dropoff. The domain has a discrete state space and deterministic transitions.
Implement a basic version of the Q-learning algorithm and use it to solve the taxi domain. The
agent should explore the MDP, collect data to the learn the optimal policy and the optimal
Q-value function. (Be mindful of how you handle terminal states, typically if St is a terminal state,
V(St+1) = 0). Use 𝛾 = 0.90. Also, you will see how Epsilon-Greedy strategy can find the optimal
policy despite of finding sub-optimal q-values. Because we are looking for optimal q-values, you
will have to try different exploration strategies.
You can evaluate your agent offline or by uploading your experiment file to the OpenAI server
using a GitHub account. The latter will generate a learning curve (reward/steps vs episodes)
which is indicative of performance. The OpenAI server will indicate if your implementation has
solved the domain, that is, found an optimal policy. Note that all evaluations uploaded to the
OpenAI server are publicly accessible. However, please do not attach your code as a gist
write-up to these evaluation.
Below are the optimal Q values for 5 (state, action) pairs of the Taxi domain.
● Q(462, 4) = -11.374402515
● Q(398, 3) = 4.348907
● Q(253, 0) = -0.5856821173
● Q(377, 1) = 9.683
● Q(83, 5) = -12.8232660372
The concepts explored in this homework are covered by:
○ Exploring Exploration
○ littman-1996.pdf (chapters 1-2)
You will be evaluated based on optimality of results. This will be assessed by your algorithm’s
optimal Q-values for 10 specific state-action pairs (remember to use 𝛾 = 0.90). You will submit
your results to 10 problems selected for you on the rldm website. The values will be graded on a
0.01 precision threshold.
Optionally, you might want to, with the same implementation, solve the environment under
OpenAI’s criteria. If you accomplish that, you definitely learned something about exploration vs
exploitation, the difference between optimal policy and optimal q-values, and should be proud
Dietterich, T. G. (2000). Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial
Intelligence Research, 13, 227–303.