## Description

1 Question 1 – Probability (15 points)

Let $X_1$ and $X_2$ be independent, continuous random variables uniformly distributed on $[0, 1]$. Let $X = \max(X_1, 2X_2)$. Compute

1. (5 points) $E(X)$.

2. (5 points) $\mathrm{Var}(X)$.

3. (5 points) $\mathrm{Cov}(X, X_1)$.

2 Question 2 – Parameter estimation (30 points)

This question uses a discrete probability distribution known as the Poisson distribution. A discrete random variable $X$ follows a Poisson distribution with parameter $\lambda$ if

$$p(X = k \mid \lambda) = \frac{\lambda^k}{k!} e^{-\lambda}, \qquad k \in \{0, 1, 2, \ldots\}$$

The Poisson distribution is a useful discrete distribution for modeling the number of occurrences of an event per unit time. For example, it can be used to model the number of customers arriving at a bank per hour, or the number of calls handled by a telephone operator per minute.
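As a quick numerical illustration of this pmf, here is a short Python sketch (the function name is mine, not part of the assignment) that evaluates $p(X = k \mid \lambda)$ and checks that the probabilities sum to 1:

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """p(X = k | lambda) = lambda^k / k! * e^(-lambda)."""
    return lam ** k / factorial(k) * exp(-lam)

# For lambda = 4 the mass concentrates near k = lambda, and truncating the
# infinite sum at k = 30 already captures essentially all of the probability.
probs = [poisson_pmf(k, 4.0) for k in range(30)]
print(sum(probs))  # very close to 1.0
```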

2.1 Question 2.1 – MLE (10 points)

Consider catching the LIRR train from Stony Brook to Penn Station. Of course, the train is always late; assume the delay in minutes is Poisson distributed (i.i.d.) with parameter $\lambda$. Since you like Times Square so much, you have visited NYC several times and recorded the train delay on each trip.

| Trip | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Delay in minutes | 4 | 5 | 3 | 5 | 6 | 9 | 3 |

Let $X = (X_1, \ldots, X_n)$ be a random vector where $X_i$ is the number of delay minutes for your $i$-th trip:

1. (4 points) Give the log-likelihood function of $X$ given $\lambda$.

2. (4 points) Compute the MLE for $\lambda$ in the general case.

3. (2 points) Compute the MLE for $\lambda$ using the observed $X$.


2.2 Question 2.2 – MAP (10 points)

Now let's be Bayesian and put a prior over the parameter $\lambda$. You talk to your party-goer friends, who tell you about the expected delays that they experience. You plot the distribution of these expected delays, and your extensive experience in statistics tells you that the plot resembles a Gamma distribution pdf. So you believe a good prior distribution for $\lambda$ may be a Gamma distribution. The Gamma distribution has pdf:

$$p(\lambda \mid \alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} \lambda^{\alpha - 1} e^{-\beta\lambda}, \qquad \lambda > 0. \tag{1}$$

$\Gamma(\alpha)$ is the Gamma function, the normalization factor introduced to make this a proper probability density function (i.e., one that integrates to 1). Do not worry if you don't know the explicit form of $\Gamma(\alpha)$; it is not important for this question. Also, if $\lambda \sim \mathrm{Gamma}(\alpha, \beta)$, then its mean is $\alpha/\beta$ and its mode is $(\alpha - 1)/\beta$ for $\alpha > 1$. Assume the prior distribution of $\lambda$ is $\mathrm{Gamma}(\lambda \mid \alpha, \beta)$.
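The stated mean $\alpha/\beta$ and mode $(\alpha - 1)/\beta$ are easy to confirm numerically. A small pure-Python sketch (`math.gamma` supplies $\Gamma(\alpha)$; the grid and parameter values are arbitrary choices of mine):

```python
from math import exp, gamma

def gamma_pdf(lam, alpha, beta):
    """Pdf from Eq. (1): beta^alpha / Gamma(alpha) * lam^(alpha-1) * e^(-beta*lam)."""
    return beta ** alpha / gamma(alpha) * lam ** (alpha - 1) * exp(-beta * lam)

alpha, beta = 3.0, 2.0
grid = [i / 1000.0 for i in range(1, 10000)]  # lambda values in (0, 10)
mode = max(grid, key=lambda l: gamma_pdf(l, alpha, beta))
mean = sum(l * gamma_pdf(l, alpha, beta) * 0.001 for l in grid)  # crude Riemann sum
print(mode, mean)  # mode ~ (alpha-1)/beta = 1.0, mean ~ alpha/beta = 1.5
```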

1. (5 points) Compute the posterior distribution over $\lambda$. Hint:

$$\lambda^{\sum_{i=1}^{n} X_i + \alpha - 1} \, e^{-\lambda(n + \beta)}$$

looks like a Gamma distribution! Is the rest of the expression constant with respect to $\lambda$? Working out a messy integral can lead to the answer, but it is unnecessary.

2. (5 points) Derive an analytic expression for the maximum a posteriori (MAP) estimate of $\lambda$.

2.3 Question 2.3 – Estimator Bias (10 points)

In class, we saw that the maximum likelihood estimator for the variance of a Gaussian distribution:

$$\hat{\sigma}^2_{\mathrm{MLE}} = \frac{1}{N} \sum_{i=1}^{N} (x_i - \hat{\mu})^2$$

is biased, and that an unbiased estimator of the variance is:

$$\hat{\sigma}^2_{\mathrm{unbiased}} = \frac{1}{N - 1} \sum_{i=1}^{N} (x_i - \hat{\mu})^2$$

For the Gaussian distribution, these estimators give similar results for large enough $N$, and it is unclear whether one estimator is preferable to the other. In this problem, we will explore an example in which the maximum likelihood estimate is dramatically superior to any unbiased estimator.

Suppose we are not interested in estimating $\lambda$ itself, but rather a quantity that is a nonlinear function of $\lambda$, namely $\eta = e^{-2\lambda}$. Suppose we want to estimate $\eta$ from a single observation $X \sim \mathrm{Poisson}(\lambda)$.

1. [4pts] Let $\hat{\eta} = e^{-2X}$. Show that $\hat{\eta}$ is the maximum likelihood estimate of $\eta$.

2. [3pts] Show that the bias of $\hat{\eta}$ is $e^{-2\lambda} - e^{\lambda(1/e^2 - 1)}$. The following Taylor expansion may be useful:

$$e^x = \sum_{k=0}^{\infty} \frac{x^k}{k!}$$

3. [3pts] It turns out that $(-1)^X$ is the only unbiased estimate of $\eta$. Prove that it is indeed unbiased and briefly explain why this is a bad estimator to use.
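A Monte Carlo check is no substitute for the proof asked for above, but it can build intuition: the sketch below (pure Python, Knuth's multiplicative Poisson sampler; the sample size and seed are arbitrary choices of mine) estimates $E[(-1)^X]$ and compares it with $\eta = e^{-2\lambda}$. Note that the estimator itself only ever takes the values $\pm 1$, even though $\eta \in (0, 1]$:

```python
import random
from math import exp

def sample_poisson(lam, rng):
    """Knuth's multiplicative method: count uniforms until their product <= e^-lam."""
    threshold, k, p = exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

rng = random.Random(0)  # fixed seed so the run is reproducible
lam, n = 1.0, 200_000
est = sum((-1) ** sample_poisson(lam, rng) for _ in range(n)) / n
print(est, exp(-2 * lam))  # the Monte Carlo average should land near eta ~ 0.135
```

The sample mean hovers near $\eta$, yet every individual estimate is $+1$ or $-1$, which hints at why this unbiased estimator is so bad in practice.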


3 Question 3 – Regression and MLE [15 points]

Suppose the data $\{(x_i, y_i) \mid x_i \in \mathbb{R}^k,\ y_i \in \mathbb{R}\}_{i=1}^{n}$ are generated by a linear regression model:

$$y_i = w^T x_i + \epsilon_i. \tag{2}$$

3.1 Question 3.1 (6 points)

Assume that $\epsilon_1, \cdots, \epsilon_n$ are independent Gaussians with mean 0 but different variances, i.e., $\epsilon_i \sim \mathcal{N}(0, \sigma_i^2)$. Show that the MLE of $w$ can be found by solving the weighted least squares problem, i.e.,

$$\min_{w} \; \sum_{i=1}^{n} s_i (y_i - w^T x_i)^2. \tag{3}$$

Give the weights $s_i$. Derive the closed-form solution for the MLE of $w$.

3.2 Question 3.2 (6 points)

If $\epsilon_1, \cdots, \epsilon_n$ are independent Laplace with mean 0 and the same scale parameter $b$, i.e., the pdf of $\epsilon_i$ is

$$p(\epsilon_i) = \frac{1}{2b} \exp\left(-\frac{|\epsilon_i|}{b}\right),$$

give the formulation for calculating the MLE for $w$ (a closed-form solution is not required).

3.3 Question 3.3 (3 points)

Sometimes the model in Question 3.2 is preferred because its solution tends to be more robust to noise. Explain why this is true.

4 Question 4 – Programming [40 points]

The Lasso is the problem of solving

$$\arg\min_{w,b} \; \sum_{i} (w^T x_i + b - y_i)^2 + \lambda \sum_{j} |w_j| \tag{4}$$

Here $X$ is a $d \times n$ matrix of data, and $x_i$ is the $i$-th column of the matrix. $y$ is an $n \times 1$ vector of response variables, $w$ is a $d$-dimensional weight vector, $b$ is a scalar offset term, and $\lambda$ is a regularization tuning parameter. For the programming part of this homework, you are required to implement the coordinate descent method to solve the Lasso problem.

This question contains three parts: 1) analyze and optimize the time complexity, 2) test your code using toy examples, and 3) try your code on a real-world dataset. The first part involves no programming, but it is crucial for writing an efficient algorithm.

4.1 Time complexity and making your code fast [12 points]

In class, you derived the update rules for coordinate descent, which are shown in Algorithm 1 (including the update term for $b$). In this part of the question, we will analyze the algorithm and discuss how to make it fast. There are two key points: utilizing sparsity and caching your predictions. Assume we are using a sparse matrix representation, so that a dot product takes time proportional to the number of non-zero entries. In the following questions, your answers should take advantage of the sparsity of $X$ when possible.

1. (2 points) Define the residual $r_i = y_i - (w^T x_i + b)$. Write the rule to compute $r$ using matrix notation (no for loop). Let $\|X\|_0$ be the number of nonzero entries in $X$. What is the time complexity to compute $r$?

2. (2 points) Simplify the update rule for $b$ using $r$. Your solution should use matrix notation; there should no longer be a for loop to sum over $i$. What is the time complexity when $r$ is already computed?


Algorithm 1 Coordinate Descent Algorithm for Lasso

```
while not converged do
    b ← (1/n) Σ_{i=1}^n (y_i − wᵀx_i)
    for k ∈ {1, 2, …, d} do
        a_k ← 2 Σ_{i=1}^n x_ik²
        c_k ← 2 Σ_{i=1}^n x_ik (y_i − (b + Σ_{j≠k} w_j x_ij))
        w_k ← (c_k + λ)/a_k   if c_k < −λ
              0               if c_k ∈ [−λ, λ]
              (c_k − λ)/a_k   if c_k > λ
    end for
end while
```
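The three-case assignment to $w_k$ in Algorithm 1 is soft-thresholding: $c_k$ is shrunk toward zero by $\lambda$ and scaled by $a_k$, with the coordinate zeroed whenever $c_k \in [-\lambda, \lambda]$. A minimal Python sketch of just this update (the function name is mine, and $a_k > 0$ is assumed):

```python
def update_wk(c_k, a_k, lam):
    """Soft-thresholding update for w_k from Algorithm 1 (assumes a_k > 0)."""
    if c_k < -lam:
        return (c_k + lam) / a_k
    if c_k > lam:
        return (c_k - lam) / a_k
    return 0.0  # c_k in [-lam, lam]: the coordinate is zeroed, producing sparsity

print(update_wk(5.0, 2.0, 1.0),   # -> 2.0
      update_wk(0.5, 2.0, 1.0),   # -> 0.0
      update_wk(-5.0, 2.0, 1.0))  # -> -2.0
```

The middle case is what makes the Lasso produce exactly-zero coefficients, unlike ridge regression.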

Algorithm 2 Efficient Coordinate Descent Algorithm

```
a_k ← 2 Σ_{i=1}^n x_ik²    % a can be computed without a for loop
while not converged do
    1. Compute the residual vector r (Q1 of 4.1; recalculate r each iteration to avoid numerical drift)
    2. Update b (Q2 of 4.1)
    3. Update r (Q3 of 4.1)
    for k ∈ {1, 2, …, d} do
        4. Calculate c_k (Q4 of 4.1)
        5. Update w_k (see Algorithm 1)
        6. Update the residual r (Q5 of 4.1)
    end for
end while
```

3. (2 points) Once $b$ is updated, find the update rule for the residual vector $r$. Again, use no for loop. What is the time complexity of the update?

4. (2 points) Simplify the computation of $c_k$ in Algorithm 1 using the residual vector $r$. Hint: there should no longer be a sum over $j$. Suppose $z_k$ is the number of nonzeros in the $k$-th row of $X$; what is the time complexity to compute $c_k$ when $r$ is already computed?

5. (2 points) Suppose we update $w_k$ from $w_k^{\mathrm{old}}$ to $w_k^{\mathrm{new}}$. Derive an update rule for $r$. Express the time complexity in terms of $z_k$.

6. (2 points) Putting this all together, you get the efficient coordinate descent algorithm in Algorithm 2. What is the per-iteration complexity of this algorithm (the cost of each round of the while loop)?

4.2 Implement coordinate descent to solve the Lasso

Now we are ready to implement the coordinate descent algorithm in Algorithm 2. Your code must:

• Include an offset term $b$ that is not regularized.

• Take optional initial conditions for $w$.

• Be able to handle sparse $X$. It is OK if your code cannot handle dense $X$; in Python, this means you can always assume the input is a sparse matrix.

• Avoid unnecessary computation (i.e., take advantage of what you learned from Problem 4.1).

You may use any language for your implementation, but we recommend Matlab. Matlab is a very useful language, and you should find that it achieves reasonable performance for this problem. Our implementation takes < 1s for an input matrix $X$ of size $2500 \times 10000$. Matlab has many toolboxes, but you cannot use any existing Lasso solver.

Before you get started, here are some hints that you may find helpful:

• The only matrix operations required are: multiplying a matrix and a vector, adding vectors, computing the dot product of two vectors, point-wise matrix operations (e.g., raising entries to the power of 2), and computing the sum of the columns or rows of a matrix. Try to use as much vector/matrix computation as possible.

• The most important check is to ensure the objective value is non-increasing with each step.

• To ensure that a solution $(\hat{w}, \hat{b})$ is correct, you can also compute the value

$$2X(X^T \hat{w} + \hat{b} - y)$$

This is a $d$-dimensional vector that should take the value $-\lambda\,\mathrm{sign}(\hat{w}_j)$ at index $j$ for each $\hat{w}_j$ that is nonzero. For the zero indices of $\hat{w}$, this vector should take values less than $\lambda$ in magnitude. (This is similar to setting the gradient to zero, though more complicated because the objective function is not differentiable.)

• It is up to you to decide on a suitable stopping condition. A common criterion is to stop when the objective value does not decrease by more than some small $\delta$ during an iteration. If you need your algorithm to run faster, an easy place to start is to loosen this condition.

• For several problems, you will need to solve the Lasso on the same dataset for many values of $\lambda$. This is called a regularization path. One way to do this efficiently is to start at a large $\lambda$, and then for each consecutive solution, initialize the algorithm with the previous solution, decreasing $\lambda$ by a constant ratio until finished.

• The smallest value of $\lambda$ for which the solution $\hat{w}$ is entirely zero is given by

$$\lambda_{\max} = 2 \left\| X \left( y - \frac{1}{n} \sum_i y_i \right) \right\|_{\infty}$$

This is helpful for choosing the first $\lambda$ in a regularization path.
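The last hint's quantity is cheap to compute directly. A Python sketch using plain lists for clarity (the homework's $X$ would be sparse; the toy matrix and function name below are mine):

```python
def lambda_max(X, y):
    """2 * || X (y - mean(y)) ||_inf: the smallest lambda whose Lasso solution is all-zero."""
    ybar = sum(y) / len(y)
    centered = [yi - ybar for yi in y]               # subtract mean(y) from every entry
    Xc = [sum(x * c for x, c in zip(row, centered))  # X stored as d rows of length n
          for row in X]
    return 2 * max(abs(v) for v in Xc)

X = [[1.0, 0.0],   # d = 2 features (rows), n = 2 samples (columns)
     [0.0, 2.0]]
y = [1.0, 3.0]
print(lambda_max(X, y))  # -> 4.0
```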

Finally, here are some pointers toward useful parts of Matlab:

• `[I,J,S] = find(X)` finds the row indices, column indices, and nonzero values of a sparse matrix X.

• Use `sum(A, dim)` to compute the sum of the rows or columns of a matrix A.

4.3 Try out your work on synthetic data [12 points]

We will now try out your solver on some synthetic data. A benefit of the Lasso is that if we believe many features are irrelevant for predicting $y$, the Lasso can be used to enforce a sparse solution, effectively differentiating between the relevant and irrelevant features.

Let's see if it actually works. Suppose that $x \in \mathbb{R}^d$, $y \in \mathbb{R}$, $k < d$, and pairs of data $(x_i, y_i)$ are generated independently according to the model

$$y_i = (w^*)^T x_i + \epsilon_i$$

where $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$ is Gaussian noise. Note that since $k < d$, features $k + 1$ through $d$ are unnecessary (and potentially even harmful) for predicting $y$.

With this model in mind, you are now tasked with the following:


1. (8 points) Let $N = 250$, $d = 80$, $k = 10$, and $\sigma = 1$. Generate some data by drawing each element of $X \in \mathbb{R}^{d \times n}$ from a standard normal distribution $\mathcal{N}(0, 1)$. Let $b^* = 0$ and create a $w^*$ by setting the first $k$ elements to $\pm 10$ (choose any sign pattern) and the remaining elements to 0. Finally, generate a Gaussian noise vector $\epsilon$ with variance $\sigma^2$ and form $y = X^T w^* + b^* + \epsilon$.

With your synthetic data, solve multiple Lasso problems on a regularization path, starting at $\lambda_{\max}$ and decreasing $\lambda$ by a constant ratio of 2 for 10 steps. Compare the sparsity pattern of your Lasso solution $\hat{w}$ to that of the true model parameters $w^*$. Record values for precision (number of correct nonzeros in $\hat{w}$ / total number of nonzeros in $\hat{w}$) and recall (number of correct nonzeros in $\hat{w}$ / $k$).

How well are you able to discover the true nonzeros? Comment on how $\lambda$ affects these results and include plots of precision and recall versus $\lambda$.

2. (4 points) Change $\sigma$ to 10, regenerate the data, and solve the Lasso problem using a value of $\lambda$ that worked well when $\sigma = 1$. How are precision and recall affected? How might you change $\lambda$ in order to achieve better precision or recall?
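The precision and recall defined in part 1 reduce to comparing support sets. A minimal Python sketch (function names, tolerance, and the toy vectors are mine, not part of the assignment):

```python
def support(w, tol=1e-10):
    """Indices of numerically nonzero coefficients."""
    return {j for j, wj in enumerate(w) if abs(wj) > tol}

def precision_recall(w_hat, w_star):
    est, true = support(w_hat), support(w_star)
    correct = len(est & true)
    precision = correct / len(est) if est else 1.0  # convention if w_hat is all-zero
    recall = correct / len(true)
    return precision, recall

w_star = [10, -10, 10, 0, 0, 0]           # k = 3 true nonzeros
w_hat = [9.7, -9.9, 0.0, 0.4, 0.0, 0.0]   # recovers 2 of them plus 1 false positive
print(precision_recall(w_hat, w_star))    # precision = recall = 2/3
```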

4.4 Predicting goodness points of a wine given its review [16 points + 10 bonus]

We'll now put the Lasso to work on some real data. For this homework, the task is to predict the points a wine should receive based on its review text alone.

For many Kaggle competitions (and machine learning methods in general), one of the most important requirements for doing well is the ability to discover good features. We can use our Lasso solver for this as follows: first, generate a large number of features from the data, even if many of them are likely unnecessary; afterward, use the Lasso to reduce the number of features to a more reasonable amount.

The data was scraped from WineEnthusiast during the week of June 15th, 2017, and can be found on the analytics website Kaggle (https://www.kaggle.com/zynicide/wine-reviews). (As a side note, browsing other competitions on the site may also give you some ideas for class projects.) This dataset provides a variety of features: the points, description (review), variety, country, province, etc. For this homework we will be using the points and description features. Points are ratings in the range 1-100; however, the dataset only contains descriptions for wines rated in the range 80-100. We already provide you with the feature vectors for each review. You can download the data for this homework from https://www.kaggle.com/c/hw1-wine-goodness-cse512-spr18/data.

| File | Contents |
| --- | --- |
| trainData.txt | Training data matrix for predicting points |
| trainLabels.txt | Training labels: the points for each review |
| valData.txt | Validation data |
| valLabels.txt | Validation labels |
| testData.txt | Test data |
| featureTypes.txt | List of features used |

The data matrices are stored in Matrix Market Format, a format for sparse matrices. Each line is a tuple of three elements: instance ID, feature ID, and feature value. Meta information for each feature is provided in featureTypes.txt. The features are strings of one, two, or three words (n-grams). The feature values correspond roughly to how often each word appears in the review. All rows have also been normalized.
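If you are working outside Matlab, the triple-per-line layout described above is easy to parse by hand. A hedged Python sketch (it skips `%`-prefixed comment lines, but a real Matrix Market file also carries a size line you would need to skip; a plain dict stands in for a true sparse matrix):

```python
def parse_triples(lines):
    """Parse 'instanceID featureID value' lines into a {(i, j): value} dict."""
    data = {}
    for line in lines:
        parts = line.split()
        if not parts or parts[0].startswith("%"):  # skip blank and comment lines
            continue
        i, j, v = int(parts[0]), int(parts[1]), float(parts[2])
        data[(i, j)] = v
    return data

sample = ["%%header comment", "1 3 0.5", "1 7 0.25", "2 3 1.0"]
mat = parse_triples(sample)
print(len(mat), mat[(1, 3)])  # -> 3 0.5
```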

In Matlab, you can use the following code to load the training data and labels:

```matlab
D = load('trainData.txt');
trX = sparse(D(:,2), D(:,1), D(:,3));
trLb = load('trainLabels.txt');
```


For this part of the problem, you have the following tasks:

1. (8 points) Solve the Lasso to predict the number of points a wine will receive. Starting at $\lambda_{\max}$, run the Lasso on the training set, decreasing $\lambda$ by a factor of 2 and using previous solutions as initial conditions for each problem. Stop when you have considered enough $\lambda$'s that, based on validation error, you can choose a good solution with confidence (for instance, when validation error begins increasing or stops decreasing significantly). At each solution, record the root-mean-squared error (RMSE) on the training and validation data. In addition, record the number of nonzeros in each solution.

Plot the train and validation RMSE values together against $\lambda$. Separately, plot the number of nonzeros as well.

2. (5 points) Find the $\lambda$ that achieves the best validation performance. Compute a model using this $\lambda$ and list the top 10 features with the largest weights and the top 10 features with the lowest weights. Comment on whether the weights make sense intuitively.

3. (3 points) Use your model to predict the points for the reviews in the test data. Save the predicted values to a CSV file predTestLabels.csv (see Section 5.2 for the format). The order of predicted values should correspond to the order of the reviews in testData.txt (counting starts from 1). Submit this predTestLabels.csv file on Kaggle and report the RMSE.

4. (10 bonus points) Submit predTestLabels.csv and enter our Kaggle competition for fame. You must use your Stony Brook email to register on Kaggle. We will maintain a leaderboard, and the top three entries at the end of the competition (the due date) will receive 10 bonus points. The ranking is based on RMSE.

To prevent exploiting the test data, you are allowed to make a maximum of 2 submissions per 24 hours. Each submission will be evaluated immediately and the leaderboard will be updated.

You are allowed to create new features from the existing set of features. You are allowed to perform feature normalization on existing and new features. You can use both training and validation data to train your classifier. You can also use Ridge regression instead of the Lasso; if you use Ridge regression, you must implement it yourself. You cannot use any other classifier types, such as SVM or boosting.

5 What to submit?

5.1 Blackboard submission

You will need to submit both your code and your answers to the questions on Blackboard. Do not submit the provided data. Put the answer file and your code in a folder named SUBID FirstName LastName (e.g., 10947XXXX lionel messi). Zip this folder and submit the zip file on Blackboard. Your submission must be a zip file, i.e., SUBID FirstName LastName.zip. The answer file should be named answers.pdf, and it should contain:

1. Answers to Questions 1, 2, 3, and 4.1.

2. Answers to Question 4.3, including the requested plots.

3. Answers to Questions 4.4.1 and 4.4.2, including the requested plots.

5.2 Kaggle submission

For Questions 4.4.3 and 4.4.4, you must submit a CSV file to get the RMSE from the competition site https://www.kaggle.com/c/hw1-wine-goodness-cse512-spr18. A submission file should contain two columns, instanceId and Prediction. The file should contain a header and have the following format:

```
instanceId,Prediction
1,82
2,91
...
```

A sample submission file is available from the competition site and our handout.
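Producing a file in this shape takes only a few lines; here is a sketch using Python's csv module (the function name and the prediction values are made up for illustration):

```python
import csv
import io

def write_predictions(preds, fh):
    """Write the required header plus one (instanceId, Prediction) row per review."""
    writer = csv.writer(fh, lineterminator="\n")
    writer.writerow(["instanceId", "Prediction"])
    for i, p in enumerate(preds, start=1):  # instance counting starts from 1
        writer.writerow([i, p])

buf = io.StringIO()
write_predictions([82, 91], buf)
print(buf.getvalue())
# instanceId,Prediction
# 1,82
# 2,91
```

In practice you would open predTestLabels.csv for writing instead of an in-memory buffer.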

Cheating warning: Don't cheat. You must do the homework yourself; otherwise you won't learn. You cannot ask for help from or discuss solutions with students from previous years. You cannot look up the solution online. You must use your real name to register on Kaggle. Do not create multiple accounts to bypass the submission limit per 24 hours; doing so will be considered cheating.
