# Homework #4 CSE 446: Machine Learning

\$30.00

## Expectation Maximization

1. Pandora is a streaming music company like Spotify that was known to buck the collaborative filtering trend1
and instead paid an army of employees to create feature vectors for each song by hand. Suppose you work at a
Pandora clone and have feature vectors x1, . . . , xn ∈ R
d
for all n songs in your database, and a particular user,
for some subset S ⊂ {1, . . . , n}, has listened to song i ∈ S exactly Yi ∈ {1, 2, . . . } times. You would like to
make a playlist for this user so you assume Yi
is Poisson distributed with mean E[Yi
|xi
] =: λi = e
w
T xi
for some
weight vector w ∈ R
d
reflecting the user’s preferences. That is,
p(Yi = y|xi
, w) = λ
y
i
y!
e
−λi =
e
yxT
i w
y!
e
−e
wT xi
.

The maximum likelihood estimator is wb = arg maxw
Q
i∈S p(yi
|xi
, w). The idea is that you would then construct
an m song playlist out of the m songs that maximize x
T
i wb.

a. [3 points] The estimate wb has no closed-form solution. Can the optimization problem be transformed into
a convex optimization problem? If so, suggest a method of solving for wb given {(xi
, yi)}i∈S . (Hint: one
can pose this as a convex optimization problem.).

b. [7 points] You solve for the wb for this user and make a playlist for her. Weeks later you look at her
listening history and observe that sometimes she listens to a particular set of songs and skips over others,
and at some other point she listens to a different set of songs and skips over others. You have the epiphany
that users are human beings whose preferences differ with their mood (e.g., music for workouts, studying,
being sad, etc.). You decide she has k music moods and aim to make k playlists, one for each mood that
could be modeled by a different weight vector w. The problem is that you don’t know which observation
i ∈ S is assigned to which mood. Describe how you would use the EM algorithm to make these k
playlists by introducing additional variables zij that indicate whether song i is suitable for mood j. Your
description should specify in math exactly what computations will be performed in the E step, exactly
what computations will be performed in the M step, how you initialize your parameters (doesn’t have to
be fancy), and what your criterion for convergence is. Make sure that all quantities are defined precisely.
See Murphy Ch. 11 and other references on the course website for a review of EM and ideas.
1Methods like matrix completion can leverage massive user-bases rating lots of items, but suffer from the “cold-start” problem:

you recommend songs based on people’s rating history, but to learn who would like a new song you need lots of people to listen to
that song, but that requires you to suggest it and possibly degrade recommendation performance.
1

## Recommendation System

2. You will build a personalized movie recommender system. There are m = 1682 movies and n = 943 users.
As historical data, every user rated at least 20 movies but some watched many more; the total dataset we
will use has 100, 000 total ratings from all users. The goal is to recommend movies the users haven’t seen.
Namely, consider a matrix R ∈ R
m×n where the entry Ri,j represents the jth user’s rating on movie i on a scale
of {1, . . . , 5}; a higher value represents that the user is more satisfied with the movie. We may think of our
historical data as some observed entries of this matrix while many remain unknown, and we wish to estimate
the unknowns (presumably, we would then suggest the movies that appear most appealing to the user). We will
use the 100K MovieLens dataset available here: https://grouplens.org/datasets/movielens/100k/. The
dataset contains lots of metadata like the titles and genre of the movies, and even information about the users
like their age. While all this metadata can be useful in predicting whether a user will like a movie or not, we
will ignore it and just use the ratings contained in the u.data file. Use this data file and the following python
code to construct a training and test set:
import csv
import numpy as np
data = []
with open(’u.data’) as csvfile:
data.append([int(row[0])-1, int(row[1])-1, int(row[2])])
data = np.array(data)
n = len(data) # n= 100,000
num_users = max(data[:,0])+1 # num_users = 943, indexed 0,…,942
num_items = max(data[:,1])+1 # num_items = 1682 indexed 0,…,1681
np.random.seed(1)
num_train = int(0.8*n)
perm = np.random.permutation(data.shape[0])
train = data[perm[0:num_train],:]
test = data[perm[num_train::],:]

The arrays train and test contain the user-movie-score data representing the training set and the test set,
respectively. Each line takes the form i, j, s, where i is the user index, j is the movie index, and s is the users
score in {1, 2, 3, 4, 5} describing how much they liked the movie (higher is better), which we’re calling Ri,j .
Using train you will train a model that can predict how any user would rate any movie, if such a rating was
made. You will evaluate your model based on the average squared-error on the data in test. Specifically, you
will use train to build a model Rb ∈ R
m×n that hopefully has small test error:
Etest(Rb) = 1
|test|
X
(i,j,Ri,j )∈test
(Rbi,j − Ri,j )
2
.
Note that I am using Rb and R above even though typically it is almost never necessary to allocate an m × n
matrix to perform this calculation; doing so would be incredibly wasteful from a memory perspective since only
about 6.3% of the entries of the matrix R are known in the entire dataset. It is much more efficient to just keep
track of the indices and ratings that are known of than allocating the matrix.

Low-rank matrix factorization is a baseline method for personalized recommendation. It learns a vector representation ui ∈ R
d
for each user and a vector representation vj ∈ R
d
for each movie, such that the inner product
hui
, vj i approximates the rating Ri,j . You will build a simple latent factor model.
You will implement multiple estimators and use the inner product hui
, vj i to predict if user i likes movie j in
the test data. For simplicity, we will put aside best practices and choose hyperparameters by using those that
minimize the test error. You may use fundamental operators from Numpy or PyTorch in your implementation
2
(i.e., numpy.linalg.lstsq, SVD, autograd, etc.) but not any precooked algorithm from a package like Sci-Kit
Learn. If there is a question whether some package crosses the line and is not appropriate for use, it probably
is not appropriate.

a. [5 points] Our first estimator just pools all the users together and, for each movie, outputs as its prediction
the average user rating of that movie in train. That is, if µ ∈ R
m is a vector where µi
is the average
rating of the users that rated the ith movie, write this estimator Rb as a rank-one matrix. What is the
Etest for this simple estimator?

b. [5 points] In this part, we actually will allocate an m × n matrix since our data is relatively small and
it is instructive. Allocate a matrix Rei,j ∈ R
m×n and set its entries equal to the known values in the
training set, and 0 otherwise (that is, Re will be a sparse matrix). For each d = 1, 2, 5, 10, 20, 50 let Rb(d)
be the best rank-d approximation (in terms of squared error) approximation to Re. This is equivalent to
computing the singular value decomposition (SVD) and using just the top d singular values. This learns
a lower dimensional vector representation for users and movies. Refer to the lecture materials and linked
notes on SVD, PCA and dimensionality reduction. You should use an efficient solver; we recommend
scipy.sparse.linalg.svds. For each d = 1, 2, 5, 10, 20, 50 compute the estimator Rb(d) and plot the
average squared error of predictions on the training set and test set on a single plot, as a function of d.
c. [10 points] Repeat the previous problem but instead replace the unknown entries in Re with the average
of the known entries in the training set instead of 0. Briefly, in words, compare your results to that of
part a–is it surprising? One takeaway of this assignment is that there are many ways to deal with missing
data, and some are much better than others.

d. [10 points] Replacing all missing values by a constant is not a completely satisfying solution or starting
point. A more reasonable choice is to minimize the MSE (mean squared error) only on rated movies. Let’s
define a loss function:

L

{ui}
m
i=1, {vj}
n
j=1
:= X
(i,j,Ri,j )∈train
(hui
, vj i − Ri,j )
2 + λ
Xn
i=1
kuik
2
2 + λ
Xm
j=1
kvjk
2
2
,
where λ > 0 is the regularization coefficient. We will implement algorithms to learn vector representations
by minimizing the loss function L({ui}, {vj}).
Note that you may need to tune the hyper-parameter λ to optimize the performance. Also note that this
is a non-convex optimization so your initial starting point may affect the quality of the final solution (since
it may just be a local minimum). Common choices for initializing the {ui}
m
i=1, {vj}
n
j=1 vectors are with
entries drawn from np.random.rand() or np.random.randn() scaled by some scale factor σ > 0 (σ is an
additional hyperparameter). Another popular initialization is to use one of the solutions from part b or c.
Alternating minimization: First, randomly initialize {ui} and {vj}. Then minimize the loss function with
respective to {ui} by treating {vj} as constant vectors, and then minimize the loss function with respect
to {vj} by treating {ui} as constant vectors. Iterate these two steps until both {ui} and {vj} converge.
Note that when one of {ui} or {vj} is given, minimizing the loss function with respect to the other part
has closed-form solutions.

Try d = 1, 2, 5, 10, 20, 50 and plot the mean squared error of train and test as a function of d. You should
never be allocating an m × n matrix for this problem.
e. [10 points] Repeat part d, but instead of using alternating minimization, use batched stochastic gradient
descent.
Stochastic Gradient Descent: First, randomly initialize {ui} and {vj}. Then take a minibatch of random samples from your training set and compute a gradient step, repeat until convergence. This is
non-convex optimization so the initialization matters, and so does the batch size and step size η used (η
3
is a hyperparameter). One strategy is to pick the largest constant value of η such that the loss L still
tends to decrease. Another strategy is to pick a relatively large value of η and then scale it by some factor
β ∈ (0, 1) so that η 7→ βη every time a number of examples are seen that exceeds the size of the training set.

Feel free to modify the loss function to, say, different regularizers if it helps reduce the test error. See
http://www.optimization-online.org/DB_FILE/2011/04/3012.pdf for some ideas.
f. [5 points] Briefly, in words, compare the the results of parts d and e. This is an example where the
loss functions are identical, but the algorithm used has drastic impact on how well the model overfits or
generalizes to new, unseen data.
g. (Extra credit: [5 points] ). Using any algorithm you’d like (as long as you code it up and don’t just call
some pre-cooked machine learning method), find an estimator and clearly describe how to reproduce it,
that achieves a test error of no more than 0.9.

## Deep learning architectures

3. In this problem we will explore different deep learning architectures for a classification task. Go to
http://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html and complete the following
tutorials
• What is PyTorch?
• Neural Networks
• Training a classifier

The final tutorial will leave you with a network for classifying the CIFAR-10 dataset, which is where this problem
starts. Just following these tutorials could take a number of hours but they are excellent, so start early. After
completing them, you should be familiar with tensors, two-dimensional convolutions (nn.Conv2d) and fully
connected layers (nn.Linear), ReLu non-linearities (F.relu), pooling (nn.MaxPool2d), and tensor reshaping
(view); if there is any doubt of their inputs/outputs or whether the layers include an offset or not, consult the
API http://pytorch.org/docs/master/.

### A few preliminaries:

• Using a GPU may considerably speed up computations but it is not necessary for these small networks
(one can get away with using one’s laptop).
• Conceptually, each network maps an image x
in ∈ R
32×32×3
(3 channels for RGB) to an output layer
x
out ∈ R
10 where the image’s class label is predicted as arg maxi=0,1,…,9 x
out
i
. An error occurs if the
predicted label differs from its true label.
• In this problem, the network is trained via cross-entropy loss, the same loss we used for multi-class logistic
regression. Specifically, for an input image and label pair (x
in, c) where c ∈ {0, 1, . . . , 9}, if the network’s
output layer is x
out ∈ R
10, the loss is − log( exp(x
out
c P
)
9
c0=0 exp(x
out
c0
)
).
• For computational efficiency reasons, this particular network considers mini-batches of images per training
step meaning the network actually maps B = 4 images per feed-forward so that xe
in ∈ R
B×32×32×3 and
xe
out ∈ R
B×10. This is ignored in the network descriptions below but it is something to be aware of.

• The cross-entropy loss for a neural network is, in general, non-convex. This means that the optimization
method may converge to different local minima based on different hyperparameters of the optimization
procedure (e.g., stepsize). Usually one can find a good setting for these hyperparameters by just observing
the relative progress of training over the first epoch or two (how fast is it decreasing) but you are warned
that early progress is not necessarily indicative of the final convergence value (you may converge quickly
to a poor local minimum whereas a different step size could have poor early performance but converge to
a better final value).

4
• The training method used in this example uses a form of stochastic gradient descent (SGD) that uses
a technique called momentum which incorporates scaled versions of previous gradients into the current
descent direction2
. Practically speaking, momentum is another optimization hyperparameter in addition
to the step size. If this bothers you, you can obtain all the same results using regular stochastic gradient
descent.
• We will not be using a validation set for this exercise. Hyperparameters like network architecture and
step size should be chosen based on the performance on the test set. This is very bad practice for all the
reasons we have discussed over the quarter, but we aim to make this exercise as simple as possible.
• You should modify the training code such that at the end of each epoch (one pass over the training data)
you compute and print the training and test classification accuracy (you may find the running calculation
that the code initially uses useful to calculate the training accuracy).

• While one would usually train a network for hundreds of epochs for it to converge, this can be prohibitively
time consuming so feel free to train your networks for just a dozen or so epochs.
You will construct a number of different network architectures and compare their performance. For all, it is
highly recommended that you copy and modify the existing (working) network you are left with at the end
of the tutorial Training a classifier. For all of the following perform a hyperparameter selection (manually by
hand, random search, etc.) using the test set, report the hyperparameters you found, and plot the training and
test classification accuracy as a function of iteration (one plot per network). Highly sub-optimal hyperparameter choices that lead to drastically worse error rates than your peers will result in points off
(but don’t over do it).
Here are the network architectures you will construct and compare.

a. [15 points] Fully connected output, 0 hidden layers (logistic regression): we begin with the simplest network
possible that has no hidden layers and simply linearly maps the input layer to the output layer. That is,
conceptually it could be written as
x
out = Wvec(x
in) + b
where x
out ∈ R
10
, x
in ∈ R
32×32×3
, W ∈ R
10×3072
, b ∈ R
10 where 3072 = 32 · 32 · 3. For a tensor
x ∈ R
a×b×c
, we let vec(x) ∈ R
abc be the reshaped form of the tensor into a vector (in an arbitrary but
consistent pattern).

b. [15 points] Fully connected output, 1 fully connected hidden layer: we will have one hidden layer denoted
as x
hidden ∈ RM where M will be a hyperparameter you choose (M could be in the hundreds). The
nonlinearity applied to the hidden layer will be the relu (relu(x) = max{0, x}, elementwise). Conceptually,
one could write this network as
x
out = W2relu(W1vec(x
in) + b1) + b2
where W1 ∈ RM×3072
, b1 ∈ RM, W2 ∈ R
10×M, b2 ∈ R
10
.
c. [15 points] Fully connected output, 1 convolutional layer with max-pool: for a convolutional layer W1
with individual filters of size p × p × 3 and output size M (reasonable choices are M = 100, p = 5)
we have that Conv2d(x
in, W1) ∈ R
(33−p)×(33−p)×M. Each convolution will have its own offset applied
to each of the output pixels of the convolution; we denote this as Conv2d(x
in, W) + b1 where b1 is
parameterized in RM. We will then apply a relu (relu doesn’t change the tensor shape) and pool. If
we use a max-pool of size N (a reasonable choice is N = 14 to pool to 2 × 2 with p = 5) we have that
MaxPool(relu(Conv2d(x
in, W1) + b1)) ∈ R
b
33−p
N c×b 33−p
N c×M. We will then apply a fully connected layer to
the output to get a final network given as
x
output = W2vec(MaxPool(relu(Conv2d(x
input, W1) + b1))) + b2
where W2 ∈ R
10×M(b
33−p
N c)
2
, b2 ∈ R
10. The parameters M, p, N (in addition to the step size and momentum) are all hyperparameters.
2See http://www.cs.toronto.edu/~hinton/absps/momentum.pdf for the deep learning perspective on this method.

5
d. (Extra credit: [5 points] ) Returning to the original network you were left with at the end of the tutorial
Training a classifier, tune the different hyperparameters (number of convolutional filters, filter sizes,
dimensionality of the fully connected layers, stepsize, etc.) and train for many epochs to achieve a test
accuracy of at least 87%.

The number of hyperparameters to tune in the last exercise combined with the slow training times will hopefully
give you a taste of how difficult it is to construct good performing networks. It should be emphasized that the
networks we constructed are tiny; typical networks have dozens of layers, each with hyperparameters to tune.
Additional hyperparameters you are welcome to play with if you are so interested: replacing relu max{0, x} with
a sigmoid 1/(1 + e
−x
), max-pool with average-pool, and experimenting with batch-normalization or dropout.
6