## Description

1 Linear Regression [20 pts]

In class, we derived a closed form solution (normal equation) for linear regression problem: ˆθ = (XT X)

−1XT Y .

A probabilistic interpretation of linear regression tells us that we are relying on an assumption that each

data point is actually sampled from a linear hyperplane, with some noise. The noise follows a zero-mean

Gaussian distribution with constant variance. Specifically,

Y

i = θ

T Xi + ϵ

i

(1)

where ϵ ∼ N (0, σ2

I), θ ∈ R

d

, and {Xi

, Y i} is the i-th data point. In other words, we are assuming that

each every point is independent to each other and that every data point has same variance.

1Christopher M. Bishop, Pattern Recognition and Machine Learning, 2006, Springer.

1

(a) Using the normal equation, and the model (Eqn. 1), derive the expectation E[

ˆθ]. Note

that here X is fixed, and only Y is random, i.e. “fixed design” as in statistics. [5 pts]

(b) Similarly, derive the variance Var[

ˆθ]. [5 pts]

(c) Under the white noise assumption above, someone claims that ˆθ follows Gaussian distribution with mean and variance in (a) and (b), respectively. Do you agree with this claim? Why

or why not? [ 4 pts]

(d) For linear regression, please answer whether each of the statements below is true or false?

And Why? [6 pts]

1. Gradient descent is likely to converge to a local minimum rather than a global minimum.

2. One iteration will not change θ if it is initialized at the global minimum.

3. While performing gradient descent, a larger learning rate α will always speed up the

gradient descent converge process as less number of steps need to be taken to reach the

minimum.

2 Ridge Regression [15 pts]

TA: Zongen Li

For linear regression, it is often assumed that y = θ

⊤x + ϵ where θ, x ∈ R

m by absorbing the constant

term, and ϵ ∼ N (0, σ2

) is a Gaussian random variable. Given n i.i.d samples (x

1

, y1

), …,(x

n, yn), we define

y = (y

1

, …, yn)

⊤ and X = (x

1

, …, x

n)

⊤. Thus, we have y ∼ N (Xθ, σ2

I).

(a) Show that the ridge regression estimate is the mean of the posterior distribution under a

Gaussian prior θ ∼ N (0, τ 2

I). Find the explicit relation between the regularization parameter

λ in the ridge regression estimate of the parameter θ, and the variances σ

2

, τ 2

. [12 pts]

(b) Different values of regularization parameter λ will yield different Ridge Regression model

performances. Please design an algorithm that uses K-Fold Cross-Validation to select the best

value of λ from Λ = (λ

1

, …λn), which yields minimised MSE in Ridge Regression. [3 pts]

3 Bayes Classifier [20 pts]

TA: Janvijay Singh

3.1 Joint Bayes vs Naive Bayes Classifier

Suppose you are given the set of data, as in Table 1, with the three boolean input variables a, b, and c, and

a single Boolean output variable K.

(a) According to the joint Bayes classifier, what is the value of the following expressions: (i) P(K =

1 | a = 1 ∧ b = 1 ∧ c = 0); and (ii) P(K = 0 | a = 1 ∧ b = 1). [2.5 pts]

(b) According to the naive Bayes classifier, what is the value of the expressions in part (a). [2.5 pts]

2

a b c K

1 0 1 1

0 1 1 0

0 1 1 1

1 0 0 1

0 0 1 0

0 1 0 0

1 0 0 1

0 0 1 0

Table 1: Data with boolean inputs and outputs for question 3.1.

3.2 Bayes Classifier with Gaussian Class Conditional Distribution

Let us consider a binary classification problem X → Y such that Y ∈ {−1, +1}. As discussed in Lecture 10,

Bayes classifier (or decision rule) for such a classification problem is given as follows:

f(x) = sign

log

p(x | y = 1)p(y = 1)

p(x | y = −1)p(y = −1)

Furthermore, we discussed how geometric shape of decision boundaries depends on the type of class conditional distribution. In the following sub-problems, given a pair of class conditional distributions, you are

required to formally deduce the geometric shape of decision boundaries.

(a) Suppose the class conditional distributions are Gaussians, with non-identical covariance matrices and

means. Write the Bayes classifier as f(x) = sign(h(x)) and simplify h as much as possible. What is the

geometric shape of the decision boundary? [5 pts]

(b) Repeat (a) but assume the two Gaussians have identical covariance matrices. What is the geometric

shape of the decision boundary? [5 pts]

(c) Repeat (a) but assume now that the two Gaussians have covariance matrix which is equal to the identity

matrix. What is the geometric shape of the decision boundary? [5 pts]

4 Basics of optimization[10 pts]

TA: Chukang Zhong

(a) Show that the logistic loss L(z) = log (1 + exp(−z)) is a convex function. [5 pts]

(b) (Jensen’s inequality) Use the definition of a concave function, f, to show that

f

Xm

i=1

αixi

!

≥

Xm

i=1

αif (xi)

where Pm

i=1 αi = 1, αi ≥ 0. [5 pts]

3

5 Logistic Regression[10 pts + 20 extra pts]

TA: Chukang Zhong

Logistic regression is named after the log-odds of success (the logit of the probability) defined as below:

ln

P[Y = 1|X = x]

P[Y = 0|X = x]

where

P[Y = 1|X = x] = exp(w0 + w

T x)

1 + exp(w0 + wT x)

(a) Show that log-odds of success is a linear function of X. [10 pts]

(b) Let us do multi-class classification: assign input vector x

i

, i = 1, . . . , m into one of classes c, c =

1, 2, . . . , C. Rather than doing ”one against all”, you will use the softmax probability,

P

y

i = c | x

i

, θ1, . . . , θC

=

exp

θ

⊤

c x

i

PC

c

′=1 exp

θ

⊤

c

′x

i

Here c

′

is a possible label and y

i

is the discrete training label ∈ {1, 2, 3, …, C}.

Given all the input data

x

1

, y1

,

x

2

, y2

, . . . ,(x

m, ym), derive the log-likelihood that results from using

the softmax probability, as well as the derivative of the loss with respect to one parameter vector θc. Try to

simplify the derivative as much as possible. [Extra Credit: 20 pts]

Hint: for the gradient you can use an indicator function I (yi = c), which is 1 when yi = c and is 0

otherwise. Note that you can use the definition of the softmax probability to simplify the derivative.

6 Programming: Recommendation System [25 pts]

TA: Rachmat Subagia

Personalized recommendation systems are used in a wide variety of applications such as electronic commerce,

social networks, web search, and more. Machine learning techniques play a key role to extract individual

preference over items. In this assignment, we explore this popular business application of machine learning,

by implementing a simple matrix-factorization-based recommender using gradient descent.

Suppose you are an employee in Netflix. You are given a set of ratings (from one star to five stars)

from users on many movies they have seen. Using this information, your job is implementing a personalized

rating predictor for a given user on unseen movies. That is, a rating predictor can be seen as a function

f : U × I → R, where U and I are the set of users and items, respectively. Typically the range of this

function is restricted to between 1 and 5 (stars), which is the the allowed range of the input.

Now, let’s think about the data representation. Suppose we have m users and n items, and a rating

given by a user on a movie. We can represent this information as a form of matrix, namely rating matrix M.

Suppose rows of M represent users, while columns do movies. Then, the size of matrix will be m × n. Each

cell of the matrix may contain a rating on a movie by a user. In M15,47, for example, it may contain a rating

on the item 47 by user 15. If he gave 4 stars, M15,47 = 4. However, as it is almost impossible for everyone to

watch large portion of movies in the market, this rating matrix should be very sparse in nature. Typically,

only 1% of the cells in the rating matrix are observed in average. All other 99% are missing values, which

means the corresponding user did not see (or just did not provide the rating for) the corresponding movi

Our goal with the rating predictor is estimating those missing values, reflecting the user’s preference learned

from available ratings.

Our approach for this problem is matrix factorization. Specifically, we assume that the rating matrix M

is a low-rank matrix. Intuitively, this reflects our assumption that there is only a small number of factors

(e.g, genre, director, main actor/actress, released year, etc.) that determine like or dislike. Let’s define r as

the number of factors. Then, we learn a user profile U ∈ R

m×r and an item profile V ∈ R

n×r

. (Recall that

m and n are the number of users and items, respectively.) We want to approximate a rating by an inner

product of two length r vectors, one representing user profile and the other item profile. Mathematically, a

rating by user u on movie i is approximated by

Mu,i ≈

Xr

k=1

Uu,kVi,k. (2)

We want to fit each element of U and V by minimizing squared reconstruction error over all training data

points. That is, the objective function we minimize is given by

E(U, V ) = X

(u,i)∈M

(Mu,i − U

T

u Vi)

2 =

X

(u,i)∈M

(Mu,i −

Xr

k=1

Uu,kVi,k)

2

(3)

where Uu is the uth row of U and Vi

is the ith row of V . We observe that this looks very similar to the

linear regression. Recall that we minimize in linear regression:

E(θ) = Xm

i=1

(Y

i − θ

T x

i

)

2 =

Xm

i=1

(Y

i −

Xr

k=1

θkx

i

k

)

2

(4)

where m is the number of training data points. Let’s compare (3) and (4). Mu,i in (3) corresponds to Y

i

in (4), in that both are the observed labels. U

T

u Vi

in (3) corresponds to θ

T x

i

in (4), in that both are our

estimation with our model. The only difference is that both U and V are the parameters to be learned in

(3), while only θ is learned in (4). This is where we personalize our estimation: with linear regression, we

apply the same θ to any input x

i

, but with matrix factorization, a different profile Uu are applied depending

on who is the user u.

As U and V are interrelated in (3), there is no closed form solution, unlike linear regression case. Thus,

we need to use gradient descent:

Uv,k ← Uv,k − µ

∂E(U, V )

∂Uv,k

, Vj,k ← Vj,k − µ

∂E(U, V )

∂Vj,k

, (5)

where µ is a hyper-parameter deciding the update rate. It would be straightforward to take partial derivatives

of E(U, V ) in (3) with respect to each element Uv,k and Vj,k. Then, we update each element of U and V

using the gradient descent formula in (5).

(a) Derive the update formula in (5) by solving the partial derivatives. [5 pts]

(b) To avoid overfitting, we usually add regularization terms, which penalize for large values

in U and V . Redo part (a) using the regularized objective function below. [5 pts]

E(U, V ) = X

(u,i)∈M

(Mu,i −

Xr

k=1

Uu,kVi,k)

2 + λ

X

u,k

U

2

u,k + λ

X

i,k

V

2

i,k

(λ is a hyper-parameter controlling the degree of penalization.)

5

(c) Implement myRecommender.m/myRecommender.py by filling the gradient descent part.

You are given a skeleton code myRecommender.m/myRecommender.py. Using the training data rateMatrix,

you will implement your own recommendation system of rank lowRank. The only file you need to edit and

submit is myRecommender.m. In the gradient descent part, repeat your update formula in (b), observing

the average reconstruction error between your estimation and ground truth in training set. You need to set

a stopping criteria, based on this reconstruction error as well as the maximum number of iterations. You

should play with several different values for µ and λ to make sure that your final prediction is accurate.

Formatting information is here:

Input

• rateMatrix: training data set. Each row represents a user, while each column an item. Observed

values are one of {1, 2, 3, 4, 5}, and missing values are 0.

• lowRank: the number of factors (dimension) of your model. With higher values, you would expect

more accurate prediction.

Output

• U: the user profile matrix of dimension user count × low rank.

• V: the item profile matrix of dimension item count × low rank.

Evaluation [10 pts]

Upload your myRecommender.m/myRecommender.py implementation file. (Do not copy and paste your code

in your report. Be sure to upload your myRecommender.m/myRecommender.py file.)

To test your code, try to run homework3.m/homework3.py. You may have noticed that the code prints

both training and test error, in RMSE (Root Mean Squared Error), defined as follows:

X

(u,i)∈M

(Mu,i − f(u, i))2

where f(u, i) is your estimation, and the summation is over the training set or testing set, respectively. For

the grading, we will use another set-aside testing set, which is not released to you. If you observe your test

error is less than 1.00 without cheating (that is, training on the test set), you may expect to see the similar

performance on the unseen test set as well.

Note that we provide homework3.m/homework3.py just to help you evaluate your code easily. You are

not expected to alter or submit this to us. In other words, we will not use this file when we grade your

submission. The only file we take care of is myRecommender3.m/myRecommender3.py.

Grading criteria:

• Your code should output U and V as specified. The dimension should match to the specification.

• We will test your output on another test dataset, which was not provided to you. The test RMSE on

this dataset should be at least 1.05 to get at least partial credit.

• We will measure elapsed time for learning. If your implementation takes longer than 3 minutes for

rank 5, you should definitely try to make your code faster or adjust parameters. Any code running

more than 5 minutes is not eligible for credit.

• Your code should not crash. Any crashing code will be not credited.

6

Report [5 pts]

In your report, show the performance (RMSE) both on your training set and test set, with varied lowRank.

(The default is set to 1, 3, and 5, but you may want to vary it further.) Discuss what you observe with

varied low rank. Also, briefly discuss how you decided your hyper-parameters (µ, λ).

Note

• Do not print anything in your code (e.g, iteration 1 : err=2.4382) in your final submission.

• Do not alter input and output format of the skeleton file. (E.g, adding a new parameter without

specifying its defalut value) Please make sure that you returned all necessary outputs according to the

given skeleton.

• Please do not use additional file. This task is simple enough that you can fit in just one file.

• Submit your code with the best parameters you found. We will grade without modifying your code.

(Applying cross-validation to find best parameters is fine, though you do not required to do.)

• Please be sure that your program finishes within a fixed number of iterations. Always think of a case

where your stopping criteria is not satisfied forever. This can happen anytime depending on the data,

not because your code is incorrect. For this, we recommend setting a maximum number of iteration in

addition to other stopping criteria.

7