## Description

## Expectation Maximization

1. Pandora is a streaming music company like Spotify that was known to buck the collaborative filtering trend1

and instead paid an army of employees to create feature vectors for each song by hand. Suppose you work at a

Pandora clone and have feature vectors x1, . . . , xn ∈ R

d

for all n songs in your database, and a particular user,

for some subset S ⊂ {1, . . . , n}, has listened to song i ∈ S exactly Yi ∈ {1, 2, . . . } times. You would like to

make a playlist for this user so you assume Yi

is Poisson distributed with mean E[Yi

|xi

] =: λi = e

w

T xi

for some

weight vector w ∈ R

d

reflecting the user’s preferences. That is,

p(Yi = y|xi

, w) = λ

y

i

y!

e

−λi =

e

yxT

i w

y!

e

−e

wT xi

.

The maximum likelihood estimator is wb = arg maxw

Q

i∈S p(yi

|xi

, w). The idea is that you would then construct

an m song playlist out of the m songs that maximize x

T

i wb.

a. [3 points] The estimate wb has no closed-form solution. Can the optimization problem be transformed into

a convex optimization problem? If so, suggest a method of solving for wb given {(xi

, yi)}i∈S . (Hint: one

can pose this as a convex optimization problem.).

b. [7 points] You solve for the wb for this user and make a playlist for her. Weeks later you look at her

listening history and observe that sometimes she listens to a particular set of songs and skips over others,

and at some other point she listens to a different set of songs and skips over others. You have the epiphany

that users are human beings whose preferences differ with their mood (e.g., music for workouts, studying,

being sad, etc.). You decide she has k music moods and aim to make k playlists, one for each mood that

could be modeled by a different weight vector w. The problem is that you don’t know which observation

i ∈ S is assigned to which mood. Describe how you would use the EM algorithm to make these k

playlists by introducing additional variables zij that indicate whether song i is suitable for mood j. Your

description should specify in math exactly what computations will be performed in the E step, exactly

what computations will be performed in the M step, how you initialize your parameters (doesn’t have to

be fancy), and what your criterion for convergence is. Make sure that all quantities are defined precisely.

See Murphy Ch. 11 and other references on the course website for a review of EM and ideas.

1Methods like matrix completion can leverage massive user-bases rating lots of items, but suffer from the “cold-start” problem:

you recommend songs based on people’s rating history, but to learn who would like a new song you need lots of people to listen to

that song, but that requires you to suggest it and possibly degrade recommendation performance.

1

## Recommendation System

2. You will build a personalized movie recommender system. There are m = 1682 movies and n = 943 users.

As historical data, every user rated at least 20 movies but some watched many more; the total dataset we

will use has 100, 000 total ratings from all users. The goal is to recommend movies the users haven’t seen.

Namely, consider a matrix R ∈ R

m×n where the entry Ri,j represents the jth user’s rating on movie i on a scale

of {1, . . . , 5}; a higher value represents that the user is more satisfied with the movie. We may think of our

historical data as some observed entries of this matrix while many remain unknown, and we wish to estimate

the unknowns (presumably, we would then suggest the movies that appear most appealing to the user). We will

use the 100K MovieLens dataset available here: https://grouplens.org/datasets/movielens/100k/. The

dataset contains lots of metadata like the titles and genre of the movies, and even information about the users

like their age. While all this metadata can be useful in predicting whether a user will like a movie or not, we

will ignore it and just use the ratings contained in the u.data file. Use this data file and the following python

code to construct a training and test set:

import csv

import numpy as np

data = []

with open(’u.data’) as csvfile:

spamreader = csv.reader(csvfile, delimiter=’\t’)

for row in spamreader:

data.append([int(row[0])-1, int(row[1])-1, int(row[2])])

data = np.array(data)

n = len(data) # n= 100,000

num_users = max(data[:,0])+1 # num_users = 943, indexed 0,…,942

num_items = max(data[:,1])+1 # num_items = 1682 indexed 0,…,1681

np.random.seed(1)

num_train = int(0.8*n)

perm = np.random.permutation(data.shape[0])

train = data[perm[0:num_train],:]

test = data[perm[num_train::],:]

The arrays train and test contain the user-movie-score data representing the training set and the test set,

respectively. Each line takes the form i, j, s, where i is the user index, j is the movie index, and s is the users

score in {1, 2, 3, 4, 5} describing how much they liked the movie (higher is better), which we’re calling Ri,j .

Using train you will train a model that can predict how any user would rate any movie, if such a rating was

made. You will evaluate your model based on the average squared-error on the data in test. Specifically, you

will use train to build a model Rb ∈ R

m×n that hopefully has small test error:

Etest(Rb) = 1

|test|

X

(i,j,Ri,j )∈test

(Rbi,j − Ri,j )

2

.

Note that I am using Rb and R above even though typically it is almost never necessary to allocate an m × n

matrix to perform this calculation; doing so would be incredibly wasteful from a memory perspective since only

about 6.3% of the entries of the matrix R are known in the entire dataset. It is much more efficient to just keep

track of the indices and ratings that are known of than allocating the matrix.

Low-rank matrix factorization is a baseline method for personalized recommendation. It learns a vector representation ui ∈ R

d

for each user and a vector representation vj ∈ R

d

for each movie, such that the inner product

hui

, vj i approximates the rating Ri,j . You will build a simple latent factor model.

You will implement multiple estimators and use the inner product hui

, vj i to predict if user i likes movie j in

the test data. For simplicity, we will put aside best practices and choose hyperparameters by using those that

minimize the test error. You may use fundamental operators from Numpy or PyTorch in your implementation

2

(i.e., numpy.linalg.lstsq, SVD, autograd, etc.) but not any precooked algorithm from a package like Sci-Kit

Learn. If there is a question whether some package crosses the line and is not appropriate for use, it probably

is not appropriate.

a. [5 points] Our first estimator just pools all the users together and, for each movie, outputs as its prediction

the average user rating of that movie in train. That is, if µ ∈ R

m is a vector where µi

is the average

rating of the users that rated the ith movie, write this estimator Rb as a rank-one matrix. What is the

Etest for this simple estimator?

b. [5 points] In this part, we actually will allocate an m × n matrix since our data is relatively small and

it is instructive. Allocate a matrix Rei,j ∈ R

m×n and set its entries equal to the known values in the

training set, and 0 otherwise (that is, Re will be a sparse matrix). For each d = 1, 2, 5, 10, 20, 50 let Rb(d)

be the best rank-d approximation (in terms of squared error) approximation to Re. This is equivalent to

computing the singular value decomposition (SVD) and using just the top d singular values. This learns

a lower dimensional vector representation for users and movies. Refer to the lecture materials and linked

notes on SVD, PCA and dimensionality reduction. You should use an efficient solver; we recommend

scipy.sparse.linalg.svds. For each d = 1, 2, 5, 10, 20, 50 compute the estimator Rb(d) and plot the

average squared error of predictions on the training set and test set on a single plot, as a function of d.

c. [10 points] Repeat the previous problem but instead replace the unknown entries in Re with the average

of the known entries in the training set instead of 0. Briefly, in words, compare your results to that of

part a–is it surprising? One takeaway of this assignment is that there are many ways to deal with missing

data, and some are much better than others.

d. [10 points] Replacing all missing values by a constant is not a completely satisfying solution or starting

point. A more reasonable choice is to minimize the MSE (mean squared error) only on rated movies. Let’s

define a loss function:

L

{ui}

m

i=1, {vj}

n

j=1

:= X

(i,j,Ri,j )∈train

(hui

, vj i − Ri,j )

2 + λ

Xn

i=1

kuik

2

2 + λ

Xm

j=1

kvjk

2

2

,

where λ > 0 is the regularization coefficient. We will implement algorithms to learn vector representations

by minimizing the loss function L({ui}, {vj}).

Note that you may need to tune the hyper-parameter λ to optimize the performance. Also note that this

is a non-convex optimization so your initial starting point may affect the quality of the final solution (since

it may just be a local minimum). Common choices for initializing the {ui}

m

i=1, {vj}

n

j=1 vectors are with

entries drawn from np.random.rand() or np.random.randn() scaled by some scale factor σ > 0 (σ is an

additional hyperparameter). Another popular initialization is to use one of the solutions from part b or c.

Alternating minimization: First, randomly initialize {ui} and {vj}. Then minimize the loss function with

respective to {ui} by treating {vj} as constant vectors, and then minimize the loss function with respect

to {vj} by treating {ui} as constant vectors. Iterate these two steps until both {ui} and {vj} converge.

Note that when one of {ui} or {vj} is given, minimizing the loss function with respect to the other part

has closed-form solutions.

Try d = 1, 2, 5, 10, 20, 50 and plot the mean squared error of train and test as a function of d. You should

never be allocating an m × n matrix for this problem.

e. [10 points] Repeat part d, but instead of using alternating minimization, use batched stochastic gradient

descent.

Stochastic Gradient Descent: First, randomly initialize {ui} and {vj}. Then take a minibatch of random samples from your training set and compute a gradient step, repeat until convergence. This is

non-convex optimization so the initialization matters, and so does the batch size and step size η used (η

3

is a hyperparameter). One strategy is to pick the largest constant value of η such that the loss L still

tends to decrease. Another strategy is to pick a relatively large value of η and then scale it by some factor

β ∈ (0, 1) so that η 7→ βη every time a number of examples are seen that exceeds the size of the training set.

Feel free to modify the loss function to, say, different regularizers if it helps reduce the test error. See

http://www.optimization-online.org/DB_FILE/2011/04/3012.pdf for some ideas.

f. [5 points] Briefly, in words, compare the the results of parts d and e. This is an example where the

loss functions are identical, but the algorithm used has drastic impact on how well the model overfits or

generalizes to new, unseen data.

g. (Extra credit: [5 points] ). Using any algorithm you’d like (as long as you code it up and don’t just call

some pre-cooked machine learning method), find an estimator and clearly describe how to reproduce it,

that achieves a test error of no more than 0.9.

## Deep learning architectures

3. In this problem we will explore different deep learning architectures for a classification task. Go to

http://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html and complete the following

tutorials

• What is PyTorch?

• Autograd: automatic differentiation

• Neural Networks

• Training a classifier

The final tutorial will leave you with a network for classifying the CIFAR-10 dataset, which is where this problem

starts. Just following these tutorials could take a number of hours but they are excellent, so start early. After

completing them, you should be familiar with tensors, two-dimensional convolutions (nn.Conv2d) and fully

connected layers (nn.Linear), ReLu non-linearities (F.relu), pooling (nn.MaxPool2d), and tensor reshaping

(view); if there is any doubt of their inputs/outputs or whether the layers include an offset or not, consult the

API http://pytorch.org/docs/master/.

### A few preliminaries:

• Using a GPU may considerably speed up computations but it is not necessary for these small networks

(one can get away with using one’s laptop).

• Conceptually, each network maps an image x

in ∈ R

32×32×3

(3 channels for RGB) to an output layer

x

out ∈ R

10 where the image’s class label is predicted as arg maxi=0,1,…,9 x

out

i

. An error occurs if the

predicted label differs from its true label.

• In this problem, the network is trained via cross-entropy loss, the same loss we used for multi-class logistic

regression. Specifically, for an input image and label pair (x

in, c) where c ∈ {0, 1, . . . , 9}, if the network’s

output layer is x

out ∈ R

10, the loss is − log( exp(x

out

c P

)

9

c0=0 exp(x

out

c0

)

).

• For computational efficiency reasons, this particular network considers mini-batches of images per training

step meaning the network actually maps B = 4 images per feed-forward so that xe

in ∈ R

B×32×32×3 and

xe

out ∈ R

B×10. This is ignored in the network descriptions below but it is something to be aware of.

• The cross-entropy loss for a neural network is, in general, non-convex. This means that the optimization

method may converge to different local minima based on different hyperparameters of the optimization

procedure (e.g., stepsize). Usually one can find a good setting for these hyperparameters by just observing

the relative progress of training over the first epoch or two (how fast is it decreasing) but you are warned

that early progress is not necessarily indicative of the final convergence value (you may converge quickly

to a poor local minimum whereas a different step size could have poor early performance but converge to

a better final value).

4

• The training method used in this example uses a form of stochastic gradient descent (SGD) that uses

a technique called momentum which incorporates scaled versions of previous gradients into the current

descent direction2

. Practically speaking, momentum is another optimization hyperparameter in addition

to the step size. If this bothers you, you can obtain all the same results using regular stochastic gradient

descent.

• We will not be using a validation set for this exercise. Hyperparameters like network architecture and

step size should be chosen based on the performance on the test set. This is very bad practice for all the

reasons we have discussed over the quarter, but we aim to make this exercise as simple as possible.

• You should modify the training code such that at the end of each epoch (one pass over the training data)

you compute and print the training and test classification accuracy (you may find the running calculation

that the code initially uses useful to calculate the training accuracy).

• While one would usually train a network for hundreds of epochs for it to converge, this can be prohibitively

time consuming so feel free to train your networks for just a dozen or so epochs.

You will construct a number of different network architectures and compare their performance. For all, it is

highly recommended that you copy and modify the existing (working) network you are left with at the end

of the tutorial Training a classifier. For all of the following perform a hyperparameter selection (manually by

hand, random search, etc.) using the test set, report the hyperparameters you found, and plot the training and

test classification accuracy as a function of iteration (one plot per network). Highly sub-optimal hyperparameter choices that lead to drastically worse error rates than your peers will result in points off

(but don’t over do it).

Here are the network architectures you will construct and compare.

a. [15 points] Fully connected output, 0 hidden layers (logistic regression): we begin with the simplest network

possible that has no hidden layers and simply linearly maps the input layer to the output layer. That is,

conceptually it could be written as

x

out = Wvec(x

in) + b

where x

out ∈ R

10

, x

in ∈ R

32×32×3

, W ∈ R

10×3072

, b ∈ R

10 where 3072 = 32 · 32 · 3. For a tensor

x ∈ R

a×b×c

, we let vec(x) ∈ R

abc be the reshaped form of the tensor into a vector (in an arbitrary but

consistent pattern).

b. [15 points] Fully connected output, 1 fully connected hidden layer: we will have one hidden layer denoted

as x

hidden ∈ RM where M will be a hyperparameter you choose (M could be in the hundreds). The

nonlinearity applied to the hidden layer will be the relu (relu(x) = max{0, x}, elementwise). Conceptually,

one could write this network as

x

out = W2relu(W1vec(x

in) + b1) + b2

where W1 ∈ RM×3072

, b1 ∈ RM, W2 ∈ R

10×M, b2 ∈ R

10

.

c. [15 points] Fully connected output, 1 convolutional layer with max-pool: for a convolutional layer W1

with individual filters of size p × p × 3 and output size M (reasonable choices are M = 100, p = 5)

we have that Conv2d(x

in, W1) ∈ R

(33−p)×(33−p)×M. Each convolution will have its own offset applied

to each of the output pixels of the convolution; we denote this as Conv2d(x

in, W) + b1 where b1 is

parameterized in RM. We will then apply a relu (relu doesn’t change the tensor shape) and pool. If

we use a max-pool of size N (a reasonable choice is N = 14 to pool to 2 × 2 with p = 5) we have that

MaxPool(relu(Conv2d(x

in, W1) + b1)) ∈ R

b

33−p

N c×b 33−p

N c×M. We will then apply a fully connected layer to

the output to get a final network given as

x

output = W2vec(MaxPool(relu(Conv2d(x

input, W1) + b1))) + b2

where W2 ∈ R

10×M(b

33−p

N c)

2

, b2 ∈ R

10. The parameters M, p, N (in addition to the step size and momentum) are all hyperparameters.

2See http://www.cs.toronto.edu/~hinton/absps/momentum.pdf for the deep learning perspective on this method.

5

d. (Extra credit: [5 points] ) Returning to the original network you were left with at the end of the tutorial

Training a classifier, tune the different hyperparameters (number of convolutional filters, filter sizes,

dimensionality of the fully connected layers, stepsize, etc.) and train for many epochs to achieve a test

accuracy of at least 87%.

The number of hyperparameters to tune in the last exercise combined with the slow training times will hopefully

give you a taste of how difficult it is to construct good performing networks. It should be emphasized that the

networks we constructed are tiny; typical networks have dozens of layers, each with hyperparameters to tune.

Additional hyperparameters you are welcome to play with if you are so interested: replacing relu max{0, x} with

a sigmoid 1/(1 + e

−x

), max-pool with average-pool, and experimenting with batch-normalization or dropout.

6