Description
1 Question 1 – Probability (10 points)
Let X1 and X2 be independent, continuous random variables uniformly distributed on [0, 1]. Let X =
min(X1, X2). Compute
1. (3 points) E(X).
2. (3 points) Var(X).
3. (4 points) Cov(X, X1).
2 Question 2 – Parameter estimation (20 points)
This question uses a discrete probability distribution known as the Poisson distribution. A discrete random
variable X follows a Poisson distribution with parameter λ if
p(X = k | λ) = (λ^k / k!) e^{−λ},   k ∈ {0, 1, 2, . . .}
The Poisson distribution is a useful discrete distribution for modeling the number of occurrences of an event per unit time. For example, it can be used to model the number of customers arriving at
a bank per hour, or the number of calls handled by a telephone operator per minute.
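For concreteness, here is a minimal sketch (assuming SciPy is available; the rate λ = 3 is an arbitrary example value, not part of the assignment data) that evaluates this pmf for a few counts:

from scipy.stats import poisson

lam = 3.0                          # arbitrary example rate parameter lambda
for k in range(5):
    # lambda^k / k! * exp(-lambda)
    print(k, poisson.pmf(k, lam))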
2.1 Question 2.1 – MLE (7 points)
Imagine we have a gumball machine which dispenses a random number of gumballs each time you insert
a quarter. Assume that the number of gumballs dispensed is Poisson distributed (i.i.d) with parameter λ.
Curious and sugar-deprived, you insert 8 quarters one at a time and record the number of gumballs dispensed
on each trial:
Trial 1 2 3 4 5 6 7 8
Gumballs 4 1 3 5 5 1 3 8
Let X = (X_1, . . . , X_n) be a random vector where X_i is the number of gumballs dispensed on trial i:
1. (3 points) Give the log-likelihood function of X given λ.
2. (3 points) Compute the MLE for λ in the general case.
3. (1 point) Compute the MLE for λ using the observed X.
2.2 Question 2.2 – MAP (6 points)
Now let’s be Bayesian and put a prior over the parameter λ. Your friend tells you that she is a sweet addict
and has used many gumball machines. She shows you the list of gumball machines around Stony Brook and
their expected gumball returns. You plot the distribution of the expected gumball returns and your extensive
experience in statistics tells you that the plot resembles a Gamma distribution pdf. So you believe a good
prior distribution for λ may be a Gamma distribution. The Gamma distribution has pdf:
p(λ | α, β) = (β^α / Γ(α)) λ^{α−1} e^{−βλ},   λ > 0.   (1)
Γ(α) is the Gamma function; it is the normalization factor introduced to make this a proper probability
density function (i.e., one that integrates to 1). Do not worry if you don't know the explicit form of Γ(α); it is not
important for this question. Also, if λ ∼ Gamma(α, β), then its mean is α/β and its mode is (α − 1)/β for
α > 1. Assume the prior distribution of λ is Gamma(λ | α, β).
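As a quick numerical illustration of the stated mean and mode, here is a sketch (assuming SciPy is available; note that SciPy's gamma distribution uses a shape/scale parameterization, so scale = 1/β, and the values of α and β below are arbitrary):

import numpy as np
from scipy.stats import gamma

alpha, beta = 3.0, 2.0
prior = gamma(a=alpha, scale=1.0 / beta)   # Gamma(alpha, beta) in rate form
print(prior.mean())                        # alpha / beta = 1.5
grid = np.linspace(1e-6, 5.0, 100001)
print(grid[np.argmax(prior.pdf(grid))])    # approx. (alpha - 1) / beta = 1.0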
1. (3 points) Compute the posterior distribution over λ. Hint:
λ^{(Σ_{i=1}^n X_i) + α − 1} e^{−λ(n + β)}
looks like a Gamma distribution! Is the rest of the expression constant with respect to λ? Working out
a messy integral can lead to the answer, but it is unnecessary.
2. (3 points) Derive an analytic expression for the maximum a posteriori (MAP) estimate of λ.
2.3 Question 2.3 – Estimator Bias (7 points)
In class, we saw that the maximum likelihood estimator for the variance of a Gaussian distribution:
σ̂²_MLE = (1/N) Σ_{i=1}^N (x_i − μ̂)²
is biased, and that an unbiased estimator of variance is:
σ̂²_unbiased = (1/(N − 1)) Σ_{i=1}^N (x_i − μ̂)²
For the Gaussian distribution, these estimators give similar results for large enough N, and it is unclear
whether one estimator is preferable to the other. In this problem, we will explore an example in which the
maximum likelihood estimate is dramatically superior to any unbiased estimator.
Suppose we are not interested in estimating λ itself, but rather a quantity that is a nonlinear function of λ, namely η = e^{−2λ}. Suppose we want to estimate η from a single observation
X ∼ Poisson(λ).
1. [3pts] Let η̂ = e^{−2X}. Show that η̂ is the maximum likelihood estimate of η.
2. [2pts] Show that the bias of η̂ is e^{−2λ} − e^{λ(1/e² − 1)}. The following Taylor expansion may be useful:
e^x = Σ_{k=0}^∞ x^k / k!
3. [2pts] It turns out that (−1)^X is the only unbiased estimate of η. Prove that it is indeed unbiased and
briefly explain why this is a bad estimator to use.
3 Question 3 – Regression and MLE [10 points]
Suppose the data {(x_i, y_i) | x_i ∈ R^k, y_i ∈ R}_{i=1}^n are generated by a linear regression model:
y_i = w^T x_i + ε_i.   (2)
3.1 Question 3.1 (4 points)
Assume that ε_1, · · · , ε_n are independent Gaussians with mean 0, but the variances are different, i.e., ε_i ∼ N(0, σ_i²). Show that the MLE of w can be found by solving the weighted least squares problem, i.e.,
minimize_w Σ_{i=1}^n s_i (y_i − w^T x_i)².   (3)
Give the weights s_i. Derive the closed-form solution for the MLE of w.
3.2 Question 3.2 (4 points)
If ε_1, · · · , ε_n are independent Laplace with mean 0 and the same scale parameter b, i.e., the pdf of ε_i is
p(ε_i) = (1/(2b)) exp(−|ε_i|/b), give the formulation for computing the MLE of w (a closed-form solution is not required).
3.3 Question 3.3 (2 points)
Sometimes the model in the second question is preferred because its solution tends to be more robust to
noise. Explain why this is true.
4 Question 4 – Ridge Regression and LOOCV (20 points)
In class, you learned about using cross validation as a way to estimate the true error of a learning algorithm.
The preferred solution is Leave-One-Out Cross Validation (LOOCV), which provides an almost unbiased
estimate of this true error, but it can take a really long time to compute. In this problem, you will derive an
algorithm for efficiently computing the LOOCV error for ridge regression.
Given a set of n data points and associated labels {(x_i, y_i) | x_i ∈ R^k, y_i ∈ R}_{i=1}^n, ridge regression finds the
weight vector w and a bias term b that optimize the following:
minimize_{w,b} λ‖w‖² + Σ_{i=1}^n (w^T x_i + b − y_i)².   (4)
Q4.1. (4 points) Let w = [w; b], X = [X; 1_n^T], Ī = [I_k, 0_k; 0_k^T, 0], C = XX^T + λĪ, and d = Xy. Show
that the solution of ridge regression is:
w = C^{−1}d   (5)
Q4.2. (3 points) Now suppose we remove x_i from the training data, and let C_{(i)}, d_{(i)}, w_{(i)} be the corresponding
matrices after removing x_i. Express C_{(i)} in terms of C and x_i. Express d_{(i)} in terms of d and x_i.
Q4.3. (4 points) Express C_{(i)}^{−1} in terms of C^{−1} and x_i. Hint: use the Sherman–Morrison formula:
(A + uv^T)^{−1} = A^{−1} − (A^{−1} u v^T A^{−1}) / (1 + v^T A^{−1} u)   (6)
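The identity in Eq. (6) is easy to sanity-check numerically; here is a small sketch (assuming NumPy; the random matrices are illustrative only and not part of the assignment):

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5)) + 5.0 * np.eye(5)   # a well-conditioned matrix
u = rng.standard_normal((5, 1))
v = rng.standard_normal((5, 1))

A_inv = np.linalg.inv(A)
lhs = np.linalg.inv(A + u @ v.T)
rhs = A_inv - (A_inv @ u @ v.T @ A_inv) / (1.0 + (v.T @ A_inv @ u).item())
print(np.allclose(lhs, rhs))   # True, up to floating-point error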
Q4.4. (3 points) Show that
w_{(i)} = w + (C^{−1} x_i) (−y_i + x_i^T w) / (1 − x_i^T C^{−1} x_i)   (7)
Algorithm 1: Coordinate Descent Algorithm for Lasso
while not converged do
    b ← (1/n) Σ_{i=1}^n (y_i − w^T x_i)
    for k ∈ {1, 2, · · · , d} do
        a_k ← 2 Σ_{i=1}^n x_{ik}²
        c_k ← 2 Σ_{i=1}^n x_{ik} (y_i − (b + Σ_{j≠k} w_j x_{ij}))
        w_k ←  (c_k + λ)/a_k   if c_k < −λ
               0               if c_k ∈ [−λ, λ]
               (c_k − λ)/a_k   if c_k > λ
    end
end
Q4.5. (3 points) Show that the leave-one-out error for removing the i-th training data point is:
w_{(i)}^T x_i − y_i = (w^T x_i − y_i) / (1 − x_i^T C^{−1} x_i)   (8)
Q4.6. (3 points) The LOOCV error is defined as Σ_{i=1}^n (w_{(i)}^T x_i − y_i)². What is the algorithmic complexity
of computing the LOOCV error using the formula given in Q4.5? How does it compare with the usual way of
computing LOOCV? Note that the complexity of inverting a k × k matrix is O(k³).
5 Question 5 – Programming [40 points]
The Lasso is the problem of solving
arg min_{w,b} Σ_i (w^T x_i + b − y_i)² + λ Σ_j |w_j|   (9)
Here X is a d × n matrix of data, and x_i is the i-th column of the matrix. y is an n × 1 vector of response
variables, w is a d-dimensional weight vector, b is a scalar offset term, and λ is a regularization tuning parameter. For the programming part of this homework, you are required to implement the coordinate descent
method to solve the Lasso problem.
This question contains three parts: 1) analyze and optimize the time complexity, 2) test your code using toy
examples, and 3) try your code on a real-world dataset. The first part involves no programming, but it is crucial for
writing an efficient algorithm.
5.1 Time complexity and making your code fast [12 points]
In class, you derived the update rules for coordinate descent, which are shown in Algorithm 1 (including the
update term for b). In this part of the question, we will analyze the algorithm and discuss how to make it fast.
There are two key points: utilizing sparsity and caching your predictions. Assume we are using a sparse
matrix representation, so that a dot product takes time proportional to the number of non-zero entries. In the
following questions, your answers should take advantage of the sparsity of X when possible.
1. (2 points) Define the residual r_i = y_i − (w^T x_i + b). Write the rule to compute r using matrix notation
(no for loop). Let ‖X‖_0 be the number of nonzero entries in X. What is the time complexity to
compute r?
Algorithm 2: Efficient Coordinate Descent Algorithm
a_k ← 2 Σ_{i=1}^n x_{ik}²   % You can compute a without a for loop
while not converged do
    1. Compute the residual vector r (Q1 of 5.1; recalculate r each iteration to avoid numerical drift)
    2. Update b (Q2 of 5.1)
    3. Update r (Q3 of 5.1)
    for k ∈ {1, 2, · · · , d} do
        4. Calculate c_k (Q4 of 5.1)
        5. Update w_k (see Algorithm 1)
        6. Update the residual r (Q5 of 5.1)
    end
end
2. (2 points) Simplify the update rule for b using r. Your solution should use matrix notation; there
should no longer be a for loop to sum over i. What is the time complexity when r is already computed?
3. (2 points) Once b is updated, find the update rule for the residual vector r. Again, use no for loop.
What is the time complexity for the update?
4. (2 points) Simplify the computation of c_k in Algorithm 1 using the residual vector r. Hint: there
should no longer be a sum over j. Suppose z_k is the number of non-zeros in the k-th row of X; what is
the time complexity to compute c_k when r is already computed?
5. (2 points) Suppose we update w_k from w_k^old to w_k^new. Derive an update rule for r. Express the time
complexity in terms of z_k.
6. (2 points) Putting this all together, you get the efficient coordinate descent algorithm in Algorithm 2.
What is the per-iteration complexity of this algorithm (the cost of each round of the while loop)?
5.2 Implement coordinate descent to solve the Lasso
Now we are ready to implement the coordinate descent algorithm in Algorithm 2. Your code must
• Include an offset term b that is not regularized
• Take optional initial conditions for w.
• Be able to handle sparse X. It is OK if your code cannot handle dense X. In Python, this
means you can always assume the input is a sparse matrix.
• Avoid unnecessary computation (i.e., take advantage of what you learned from Problem 5.1).
You may use any language for your implementation, but we recommend Matlab. Matlab is a very useful
language, and you should find that it achieves reasonable performance for this problem. Our
implementation takes < 1 s for an input matrix X of size 2500 × 10000. Matlab has many toolboxes, but
you cannot use any existing Lasso solver.
Before you get started, here are some hints that you may find helpful:
• The only matrix operations required are: multiplying a matrix and a vector, adding vectors, computing
the dot product of two vectors, point-wise matrix operations (e.g., raising entries to the power of 2), and computing the sum of
the columns or rows of a matrix. Try to use as much vector/matrix computation as possible.
• The most important check is to ensure the objective value is non-increasing with each step.
• To ensure that a solution (ŵ, b̂) is correct, you can also compute the value
2X(X^T ŵ + b̂ − y)
This is a d-dimensional vector that should take the value −λ sign(ŵ_j) at index j for each ŵ_j that is nonzero.
For the zero indices of ŵ, this vector should take values smaller in magnitude than λ. (This is similar to setting the gradient to zero, though more complicated because the objective function is not
differentiable.)
• It is up to you to decide on a suitable stopping condition. A common criterion is to stop when the
objective value does not decrease by more than some small δ during an iteration. If you need your
algorithm to run faster, an easy place to start is to loosen this condition.
• For several problems, you will need to solve the Lasso on the same dataset for many values of λ. This
is called a regularization path. One way to do this efficiently is to start at a large λ, and then for each
consecutive solution, initialize the algorithm with the previous solution, decreasing λ by a constant
ratio until finished.
• The smallest value of λ for which the solution ŵ is entirely zero is given by
λ_max = 2 ‖X(y − (1/n) Σ_i y_i)‖_∞
This is helpful for choosing the first λ in a regularization path.
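In code this is a one-liner. A minimal sketch (assuming X is the d × n scipy.sparse data matrix and y the length-n response vector described above):

import numpy as np

def lasso_lambda_max(X, y):
    # lambda_max = 2 * || X (y - mean(y)) ||_inf
    return 2.0 * np.max(np.abs(X @ (y - y.mean())))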
Finally, here are some pointers toward useful parts of Matlab:
• [I,J,S] = find(X) finds the row indices, column indices, and non-zero values of a sparse matrix X.
• Use sum(A, dim) to compute the sum over the rows or columns of a matrix A.
5.3 Try out your work on synthetic data [12 points]
We will now try out your solver with some synthetic data. A benefit of the Lasso is that if we believe
many features are irrelevant for predicting y, the Lasso can be used to enforce a sparse solution, effectively
differentiating between the relevant and irrelevant features.
Let’s see if it actually works. Suppose that x ∈ R^d, y ∈ R, k < d, and pairs of data (x_i, y_i) are generated
independently according to the model
y_i = (w^*)^T x_i + ε_i
where ε_i ∼ N(0, σ²) is Gaussian noise. Note that since k < d, the features k + 1 through d are
unnecessary (and potentially even harmful) for predicting y.
With this model in mind, you are now tasked with the following:
1. (8 points) Let N = 200, d = 75, k = 10, and σ = 1. Generate some data by drawing each element
of X ∈ R^{d×n} from a standard normal distribution N(0, 1). Let b^* = 0 and create a w^* by setting the
first k elements to ±10 (choose any sign pattern) and the remaining elements to 0. Finally, generate a
Gaussian noise vector ε with variance σ² and form y = X^T w^* + b^* + ε. (A sketch of this data-generation recipe appears at the end of this subsection.)
With your synthetic data, solve multiple Lasso problems on a regularization path, starting at λ_max and
decreasing λ by a constant ratio of 2 for 10 steps. Compare the sparsity pattern of your Lasso solution
ŵ to that of the true model parameters w^*. Record values for precision (number of correct nonzeros
in ŵ / total number of nonzeros in ŵ) and recall (number of correct nonzeros in ŵ / k).
How well are you able to discover the true nonzeros? Comment on how λ affects these results and
include plots of precision and recall versus λ.
2. (4 points) Change σ to 10, regenerate the data, and solve the Lasso problem using a value of λ that
worked well when σ = 1. How are precision and recall affected? How might you change λ in order
to achieve better precision or recall?
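A sketch of the data-generation recipe from part 1 (assuming NumPy; the variable names are my own, and the sign pattern of w^* is chosen at random here, though any pattern is allowed):

import numpy as np

rng = np.random.default_rng(0)
n, d, k, sigma = 200, 75, 10, 1.0

X = rng.standard_normal((d, n))              # each entry drawn from N(0, 1)
w_star = np.zeros(d)
w_star[:k] = 10.0 * rng.choice([-1, 1], k)   # first k entries set to +/- 10
b_star = 0.0
eps = sigma * rng.standard_normal(n)         # Gaussian noise with variance sigma^2
y = X.T @ w_star + b_star + eps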
5.4 Predicting the number of stars of a Yelp review [16 points + 10 bonus]
We’ll now put the Lasso to work on some real data. This data was extracted from Yelp’s recruiting competition on the analytics website Kaggle (www.kaggle.com/c/yelp-recruiting; as a side note,
browsing other competitions on the site may also give you some ideas for class projects). For this competition, the task is to predict the number of useful upvotes a particular review will receive. However, in this
homework, we will look at a more fun task: predicting a review’s number of stars based on the review’s text
alone.
For many Kaggle competitions (and machine learning methods in general), one of the most important
requirements for doing well is the ability to discover great features. We can use our Lasso solver for this as
follows. First, generate a large number of features from the data, even if many of them are likely unnecessary.
Afterward, use the Lasso to reduce the number of features to a more reasonable amount.
Yelp provides a variety of data, such as the review’s text, date, and restaurant, as well as data pertaining
to each business, user, and check-in. For this homework, we will only consider predicting the number of
stars from the review’s text. We have already created a feature vector for each review.
trainData.txt Training data matrix for predicting number of stars
trainLabels.txt Training labels, list of the number of stars for each review
valData.txt Validation data
valLabels.txt Validation labels
testData.txt Test data
featuresTypes.txt List of features used
The data matrices are stored in Matrix Market format, a format for sparse matrices. Each line is a
tuple of three elements: instance ID, feature ID, and feature value. Meta-information for each feature is
provided in featuresTypes.txt. The features are strings of one, two, or three words (n-grams). The
feature values correspond roughly to how often each word appears in the review. All rows have also been
normalized.
In Matlab, you can use the following code to load the training data and labels:
D = load('trainData.txt');
trX = sparse(D(:,2), D(:,1), D(:,3));
trLb = load('trainLabels.txt');
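In Python, an equivalent sketch (assuming NumPy/SciPy and the plain three-column layout that the Matlab snippet above relies on; IDs in the files are 1-based):

import numpy as np
from scipy import sparse

D = np.loadtxt('trainData.txt')
# columns of D: instance ID, feature ID, feature value
trX = sparse.csc_matrix((D[:, 2],
                         (D[:, 1].astype(int) - 1,    # feature ID -> row
                          D[:, 0].astype(int) - 1)))  # instance ID -> column
trLb = np.loadtxt('trainLabels.txt')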
For this part of the problem, you have the following tasks:
1. (8 points) Solve the Lasso to predict the number of stars a Yelp review will receive. Starting at λ_max, run
the Lasso on the training set, decreasing λ by a factor of 2 and using previous solutions as initial conditions
for each problem. Stop when you have considered enough λ’s that, based on validation error, you can
choose a good solution with confidence (for instance, when validation error begins increasing or stops
decreasing significantly). At each solution, record the root-mean-squared error (RMSE) on the training and
validation data. In addition, record the number of nonzeros in each solution.
Plot the training and validation RMSE values together against λ. Separately, plot the number of
nonzeros as well.
2. (5 points) Find the λ that achieves the best validation performance. Compute a model using this λ and list
the top 10 features with the largest weights and the top 10 features with the lowest weights. Comment on
whether the weights make sense intuitively.
3. (3 points) Use your model to predict the number of stars for the reviews in the test data. Save the predicted
values in a CSV file predTestLabels.csv (see Section 6.2 for the format). The order of predicted values
should correspond to the order of the reviews in testData.txt (counting starts from 1). Submit
this predTestLabels.csv file on Kaggle and report the RMSE.
4. (10 bonus points) Submit predTestLabels.csv and enter our Kaggle competition for fame. You
must use your Stony Brook email to register on Kaggle. We will maintain a leaderboard, and the
top three entries at the end of the competition (the due date) will receive 10 bonus points. The ranking is
based on RMSE.
To prevent exploiting the test data, you are allowed to make a maximum of 2 submissions per 24 hours.
Your submissions will be evaluated immediately and the leaderboard will be updated.
You are allowed to create new features from the existing set of features. You are allowed to perform
feature normalization on existing and new features. You can use both training and validation data to
train your classifier. You can also use Ridge regression instead of Lasso. If you use Ridge regression,
you must implement it yourself. You cannot use any other classifier types such as SVM or boosting.
6 What to submit?
6.1 Blackboard submission
You will need to submit both your code and your answers to questions on Blackboard. Do not submit the
provided data. Put the answer file and your code in a folder named: SUBID FirstName LastName (e.g.,
10947XXXX heeyoung kwon). Zip this folder and submit the zip file on Blackboard. Your submission must
be a zip file, i.e., SUBID FirstName LastName.zip. The answer file should be named: answers.pdf, and it
should contain:
1. Answers to Question 1, 2, 3, 4, and 5.1
2. Answers to Question 5.3, including the requested plots.
3. Answers to Question 5.4.1, and 5.4.2, including the requested plots.
6.2 Kaggle submission
For Questions 5.4.3 and 5.4.4, you must submit a CSV file to get the RMSE from the competition site
inclass.kaggle.com/c/hw1-predicting-the-number-of-stars-of-a-yelp-reivew/.
A submission file should contain two columns: instanceId and Prediction. The file should contain a header
and have the following format.
instanceId, Prediction
1, 2.1
2, 4.2
... ...
A sample submission file is available from the competition site and our handout.
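For reference, a sketch of writing such a file in Python (the two prediction values are just the placeholders from the example above):

import csv

preds = [2.1, 4.2]   # placeholder predictions, one per test review
with open('predTestLabels.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['instanceId', 'Prediction'])
    for i, p in enumerate(preds, start=1):   # instance IDs start at 1
        writer.writerow([i, p])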
Cheating warning: Don’t cheat. You must do the homework yourself, otherwise you won’t learn. You
must use your real name to register on Kaggle. Do not create multiple accounts to bypass the 24-hour
submission limit. Doing so will be considered cheating.