## Description

1 Question 1 – Probability (15 points)

Let $X_1$ and $X_2$ be independent, continuous random variables uniformly distributed on $[0, 1]$. Let $X = \max(X_1, 2X_2)$. Compute

1. (5 points) $E(X)$.

2. (5 points) $\mathrm{Var}(X)$.

3. (5 points) $\mathrm{Cov}(X, X_1)$.

2 Question 2 – Parameter estimation (30 points)

This question uses a discrete probability distribution known as the Poisson distribution. A discrete random variable $X$ follows a Poisson distribution with parameter $\lambda$ if

$$p(X = k \mid \lambda) = \frac{\lambda^k}{k!} e^{-\lambda}, \qquad k \in \{0, 1, 2, \ldots\}$$

The Poisson distribution is a useful discrete distribution for modeling the number of occurrences of an event per unit time. For example, it can be used to model the number of customers arriving at a bank per hour, or the number of calls handled by a telephone operator per minute.
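As a quick numerical illustration of this pmf, here is a short Python sketch (the function name is mine, not part of the assignment) that evaluates $p(X = k \mid \lambda)$ and checks that the probabilities sum to 1:

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """p(X = k | lambda) = lambda^k / k! * e^(-lambda)."""
    return lam ** k / factorial(k) * exp(-lam)

# For lambda = 4 the mass concentrates near k = lambda, and truncating the
# infinite sum at k = 30 already captures essentially all of the probability.
probs = [poisson_pmf(k, 4.0) for k in range(30)]
print(sum(probs))  # very close to 1.0
```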

2.1 Question 2.1 – MLE (10 points)

Consider catching the LIRR train from Stony Brook to Penn Station. Of course, the train is always late; assume the delay in minutes is Poisson distributed (i.i.d.) with parameter $\lambda$. Since you like Times Square so much, you have visited NYC several times and recorded the train delay on each trip.

| Trip | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Delay in minutes | 4 | 5 | 3 | 5 | 6 | 9 | 3 |

Let $X = (X_1, \ldots, X_n)$ be a random vector where $X_i$ is the number of delay minutes for your $i$-th trip:

1. (4 points) Give the log-likelihood function of $X$ given $\lambda$.

2. (4 points) Compute the MLE for $\lambda$ in the general case.

3. (2 points) Compute the MLE for $\lambda$ using the observed $X$.


2.2 Question 2.2 – MAP (10 points)

Now let's be Bayesian and put a prior over the parameter $\lambda$. You talk to your party-goer friends, who tell you about the expected delays that they experience. You plot the distribution of these expected delays, and your extensive experience in statistics tells you that the plot resembles a Gamma distribution pdf. So you believe a good prior distribution for $\lambda$ may be a Gamma distribution. The Gamma distribution has pdf:

$$p(\lambda \mid \alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} \lambda^{\alpha - 1} e^{-\beta\lambda}, \qquad \lambda > 0. \tag{1}$$

$\Gamma(\alpha)$ is the Gamma function, the normalization factor introduced to make this a proper probability density function (i.e., one that integrates to 1). Do not worry if you don't know the explicit form of $\Gamma(\alpha)$; it is not important for this question. Also, if $\lambda \sim \mathrm{Gamma}(\alpha, \beta)$, then its mean is $\alpha/\beta$ and its mode is $(\alpha - 1)/\beta$ for $\alpha > 1$. Assume the prior distribution of $\lambda$ is $\mathrm{Gamma}(\lambda \mid \alpha, \beta)$.
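The stated mean $\alpha/\beta$ and mode $(\alpha - 1)/\beta$ are easy to confirm numerically. A small pure-Python sketch (`math.gamma` supplies $\Gamma(\alpha)$; the grid and parameter values are arbitrary choices of mine):

```python
from math import exp, gamma

def gamma_pdf(lam, alpha, beta):
    """Pdf from Eq. (1): beta^alpha / Gamma(alpha) * lam^(alpha-1) * e^(-beta*lam)."""
    return beta ** alpha / gamma(alpha) * lam ** (alpha - 1) * exp(-beta * lam)

alpha, beta = 3.0, 2.0
grid = [i / 1000.0 for i in range(1, 10000)]  # lambda values in (0, 10)
mode = max(grid, key=lambda l: gamma_pdf(l, alpha, beta))
mean = sum(l * gamma_pdf(l, alpha, beta) * 0.001 for l in grid)  # crude Riemann sum
print(mode, mean)  # mode ~ (alpha-1)/beta = 1.0, mean ~ alpha/beta = 1.5
```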

1. (5 points) Compute the posterior distribution over $\lambda$. Hint:

$$\lambda^{\sum_{i=1}^{n} X_i + \alpha - 1} \, e^{-\lambda(n + \beta)}$$

looks like a Gamma distribution! Is the rest of the expression constant with respect to $\lambda$? Working out a messy integral can lead to the answer, but it is unnecessary.

2. (5 points) Derive an analytic expression for the maximum a posteriori (MAP) estimate of $\lambda$.

2.3 Question 2.3 – Estimator Bias (10 points)

In class, we saw that the maximum likelihood estimator for the variance of a Gaussian distribution:

$$\hat{\sigma}^2_{\mathrm{MLE}} = \frac{1}{N} \sum_{i=1}^{N} (x_i - \hat{\mu})^2$$

is biased, and that an unbiased estimator of the variance is:

$$\hat{\sigma}^2_{\mathrm{unbiased}} = \frac{1}{N - 1} \sum_{i=1}^{N} (x_i - \hat{\mu})^2$$

For the Gaussian distribution, these estimators give similar results for large enough $N$, and it is unclear whether one estimator is preferable to the other. In this problem, we will explore an example in which the maximum likelihood estimate is dramatically superior to any unbiased estimator.

Suppose we are not interested in estimating $\lambda$ itself, but rather a quantity that is a nonlinear function of $\lambda$, namely $\eta = e^{-2\lambda}$. Suppose we want to estimate $\eta$ from a single observation $X \sim \mathrm{Poisson}(\lambda)$.

1. [4pts] Let $\hat{\eta} = e^{-2X}$. Show that $\hat{\eta}$ is the maximum likelihood estimate of $\eta$.

2. [3pts] Show that the bias of $\hat{\eta}$ is $e^{-2\lambda} - e^{\lambda(1/e^2 - 1)}$. The following Taylor expansion may be useful:

$$e^x = \sum_{k=0}^{\infty} \frac{x^k}{k!}$$

3. [3pts] It turns out that $(-1)^X$ is the only unbiased estimate of $\eta$. Prove that it is indeed unbiased and briefly explain why this is a bad estimator to use.
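A Monte Carlo check is no substitute for the proof asked for above, but it can build intuition: the sketch below (pure Python, Knuth's multiplicative Poisson sampler; the sample size and seed are arbitrary choices of mine) estimates $E[(-1)^X]$ and compares it with $\eta = e^{-2\lambda}$. Note that the estimator itself only ever takes the values $\pm 1$, even though $\eta \in (0, 1]$:

```python
import random
from math import exp

def sample_poisson(lam, rng):
    """Knuth's multiplicative method: count uniforms until their product <= e^-lam."""
    threshold, k, p = exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

rng = random.Random(0)  # fixed seed so the run is reproducible
lam, n = 1.0, 200_000
est = sum((-1) ** sample_poisson(lam, rng) for _ in range(n)) / n
print(est, exp(-2 * lam))  # the Monte Carlo average should land near eta ~ 0.135
```

The sample mean hovers near $\eta$, yet every individual estimate is $+1$ or $-1$, which hints at why this unbiased estimator is so bad in practice.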


3 Question 3 – Regression and MLE [15 points]

Suppose the data $\{(x_i, y_i) \mid x_i \in \mathbb{R}^k,\ y_i \in \mathbb{R}\}_{i=1}^{n}$ are generated by a linear regression model:

$$y_i = w^T x_i + \epsilon_i. \tag{2}$$

3.1 Question 3.1 (6 points)

Assume that $\epsilon_1, \cdots, \epsilon_n$ are independent Gaussians with mean 0 but different variances, i.e., $\epsilon_i \sim \mathcal{N}(0, \sigma_i^2)$. Show that the MLE of $w$ can be found by solving the weighted least squares problem, i.e.,

$$\min_{w} \; \sum_{i=1}^{n} s_i (y_i - w^T x_i)^2. \tag{3}$$

Give the weights $s_i$. Derive the closed-form solution for the MLE of $w$.

3.2 Question 3.2 (6 points)

If $\epsilon_1, \cdots, \epsilon_n$ are independent Laplace with mean 0 and the same scale parameter $b$, i.e., the pdf of $\epsilon_i$ is

$$p(\epsilon_i) = \frac{1}{2b} \exp\left(-\frac{|\epsilon_i|}{b}\right),$$

give the formulation for calculating the MLE for $w$ (a closed-form solution is not required).

3.3 Question 3.3 (3 points)

Sometimes the model in Question 3.2 is preferred because its solution tends to be more robust to noise. Explain why this is true.

4 Question 4 – Programming [40 points]

The Lasso is the problem of solving

$$\arg\min_{w,b} \; \sum_{i} (w^T x_i + b - y_i)^2 + \lambda \sum_{j} |w_j| \tag{4}$$

Here $X$ is a $d \times n$ matrix of data, and $x_i$ is the $i$-th column of the matrix. $y$ is an $n \times 1$ vector of response variables, $w$ is a $d$-dimensional weight vector, $b$ is a scalar offset term, and $\lambda$ is a regularization tuning parameter. For the programming part of this homework, you are required to implement the coordinate descent method to solve the Lasso problem.

This question contains three parts: 1) analyze and optimize the time complexity, 2) test your code using toy examples, and 3) try your code on a real-world dataset. The first part involves no programming, but it is crucial for writing an efficient algorithm.

4.1 Time complexity and making your code fast [12 points]

In class, you derived the update rules for coordinate descent, which are shown in Algorithm 1 (including the update term for $b$). In this part of the question, we will analyze the algorithm and discuss how to make it fast. There are two key points: utilizing sparsity and caching your predictions. Assume we are using a sparse matrix representation, so that a dot product takes time proportional to the number of non-zero entries. In the following questions, your answers should take advantage of the sparsity of $X$ when possible.

1. (2 points) Define the residual $r_i = y_i - (w^T x_i + b)$. Write the rule to compute $r$ using matrix notation (no for loop). Let $\|X\|_0$ be the number of nonzero entries in $X$. What is the time complexity to compute $r$?

2. (2 points) Simplify the update rule for $b$ using $r$. Your solution should use matrix notation; there should no longer be a for loop to sum over $i$. What is the time complexity when $r$ is already computed?


Algorithm 1 Coordinate Descent Algorithm for Lasso

```
while not converged do
    b ← (1/n) Σ_{i=1}^n (y_i − wᵀx_i)
    for k ∈ {1, 2, …, d} do
        a_k ← 2 Σ_{i=1}^n x_ik²
        c_k ← 2 Σ_{i=1}^n x_ik (y_i − (b + Σ_{j≠k} w_j x_ij))
        w_k ← (c_k + λ)/a_k   if c_k < −λ
              0               if c_k ∈ [−λ, λ]
              (c_k − λ)/a_k   if c_k > λ
    end for
end while
```
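The three-case assignment to $w_k$ in Algorithm 1 is soft-thresholding: $c_k$ is shrunk toward zero by $\lambda$ and scaled by $a_k$, with the coordinate zeroed whenever $c_k \in [-\lambda, \lambda]$. A minimal Python sketch of just this update (the function name is mine, and $a_k > 0$ is assumed):

```python
def update_wk(c_k, a_k, lam):
    """Soft-thresholding update for w_k from Algorithm 1 (assumes a_k > 0)."""
    if c_k < -lam:
        return (c_k + lam) / a_k
    if c_k > lam:
        return (c_k - lam) / a_k
    return 0.0  # c_k in [-lam, lam]: the coordinate is zeroed, producing sparsity

print(update_wk(5.0, 2.0, 1.0),   # -> 2.0
      update_wk(0.5, 2.0, 1.0),   # -> 0.0
      update_wk(-5.0, 2.0, 1.0))  # -> -2.0
```

The middle case is what makes the Lasso produce exactly-zero coefficients, unlike ridge regression.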

Algorithm 2 Efficient Coordinate Descent Algorithm

```
a_k ← 2 Σ_{i=1}^n x_ik²    % a can be computed without a for loop
while not converged do
    1. Compute the residual vector r (Q1 of 4.1; recalculate r each iteration to avoid numerical drift)
    2. Update b (Q2 of 4.1)
    3. Update r (Q3 of 4.1)
    for k ∈ {1, 2, …, d} do
        4. Calculate c_k (Q4 of 4.1)
        5. Update w_k (see Algorithm 1)
        6. Update the residual r (Q5 of 4.1)
    end for
end while
```

3. (2 points) Once $b$ is updated, find the update rule for the residual vector $r$. Again, use no for loop. What is the time complexity of the update?

4. (2 points) Simplify the computation of $c_k$ in Algorithm 1 using the residual vector $r$. Hint: there should no longer be a sum over $j$. Suppose $z_k$ is the number of nonzeros in the $k$-th row of $X$; what is the time complexity to compute $c_k$ when $r$ is already computed?

5. (2 points) Suppose we update $w_k$ from $w_k^{\mathrm{old}}$ to $w_k^{\mathrm{new}}$. Derive an update rule for $r$. Express the time complexity in terms of $z_k$.

6. (2 points) Putting this all together, you get the efficient coordinate descent algorithm in Algorithm 2. What is the per-iteration complexity of this algorithm (the cost of each round of the while loop)?

4.2 Implement coordinate descent to solve the Lasso

Now we are ready to implement the coordinate descent algorithm in Algorithm 2. Your code must:

• Include an offset term $b$ that is not regularized.

• Take optional initial conditions for $w$.

• Be able to handle sparse $X$. It is OK if your code cannot handle dense $X$; in Python, this means you can always assume the input is a sparse matrix.

• Avoid unnecessary computation (i.e., take advantage of what you learned from Problem 4.1).

You may use any language for your implementation, but we recommend Matlab. Matlab is a very useful language, and you should find that it achieves reasonable performance for this problem. Our implementation takes < 1s for an input matrix $X$ of size $2500 \times 10000$. Matlab has many toolboxes, but you cannot use any existing Lasso solver.

Before you get started, here are some hints that you may find helpful:

• The only matrix operations required are: multiplying a matrix and a vector, adding vectors, computing the dot product of two vectors, point-wise matrix operations (e.g., raising entries to the power of 2), and computing the sum of the columns or rows of a matrix. Try to use as much vector/matrix computation as possible.

• The most important check is to ensure the objective value is non-increasing with each step.

• To ensure that a solution $(\hat{w}, \hat{b})$ is correct, you can also compute the value

$$2X(X^T \hat{w} + \hat{b} - y)$$

This is a $d$-dimensional vector that should take the value $-\lambda\,\mathrm{sign}(\hat{w}_j)$ at index $j$ for each $\hat{w}_j$ that is nonzero. For the zero indices of $\hat{w}$, this vector should take values less than $\lambda$ in magnitude. (This is similar to setting the gradient to zero, though more complicated because the objective function is not differentiable.)

• It is up to you to decide on a suitable stopping condition. A common criterion is to stop when the objective value does not decrease by more than some small $\delta$ during an iteration. If you need your algorithm to run faster, an easy place to start is to loosen this condition.

• For several problems, you will need to solve the Lasso on the same dataset for many values of $\lambda$. This is called a regularization path. One way to do this efficiently is to start at a large $\lambda$, and then for each consecutive solution, initialize the algorithm with the previous solution, decreasing $\lambda$ by a constant ratio until finished.

• The smallest value of $\lambda$ for which the solution $\hat{w}$ is entirely zero is given by

$$\lambda_{\max} = 2 \left\| X \left( y - \frac{1}{n} \sum_i y_i \right) \right\|_{\infty}$$

This is helpful for choosing the first $\lambda$ in a regularization path.
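The last hint's quantity is cheap to compute directly. A Python sketch using plain lists for clarity (the homework's $X$ would be sparse; the toy matrix and function name below are mine):

```python
def lambda_max(X, y):
    """2 * || X (y - mean(y)) ||_inf: the smallest lambda whose Lasso solution is all-zero."""
    ybar = sum(y) / len(y)
    centered = [yi - ybar for yi in y]               # subtract mean(y) from every entry
    Xc = [sum(x * c for x, c in zip(row, centered))  # X stored as d rows of length n
          for row in X]
    return 2 * max(abs(v) for v in Xc)

X = [[1.0, 0.0],   # d = 2 features (rows), n = 2 samples (columns)
     [0.0, 2.0]]
y = [1.0, 3.0]
print(lambda_max(X, y))  # -> 4.0
```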

Finally, here are some pointers toward useful parts of Matlab:

• `[I,J,S] = find(X)` finds the row indices, column indices, and nonzero values of a sparse matrix X.

• Use `sum(A, dim)` to compute the sum of the rows or columns of a matrix A.

4.3 Try out your work on synthetic data [12 points]

We will now try out your solver on some synthetic data. A benefit of the Lasso is that if we believe many features are irrelevant for predicting $y$, the Lasso can be used to enforce a sparse solution, effectively differentiating between the relevant and irrelevant features.

Let's see if it actually works. Suppose that $x \in \mathbb{R}^d$, $y \in \mathbb{R}$, $k < d$, and pairs of data $(x_i, y_i)$ are generated independently according to the model

$$y_i = (w^*)^T x_i + \epsilon_i$$

where $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$ is Gaussian noise. Note that since $k < d$, features $k + 1$ through $d$ are unnecessary (and potentially even harmful) for predicting $y$.

With this model in mind, you are now tasked with the following:


1. (8 points) Let $N = 250$, $d = 80$, $k = 10$, and $\sigma = 1$. Generate some data by drawing each element of $X \in \mathbb{R}^{d \times n}$ from a standard normal distribution $\mathcal{N}(0, 1)$. Let $b^* = 0$ and create a $w^*$ by setting the first $k$ elements to $\pm 10$ (choose any sign pattern) and the remaining elements to 0. Finally, generate a Gaussian noise vector $\epsilon$ with variance $\sigma^2$ and form $y = X^T w^* + b^* + \epsilon$.

With your synthetic data, solve multiple Lasso problems on a regularization path, starting at $\lambda_{\max}$ and decreasing $\lambda$ by a constant ratio of 2 for 10 steps. Compare the sparsity pattern of your Lasso solution $\hat{w}$ to that of the true model parameters $w^*$. Record values for precision (number of correct nonzeros in $\hat{w}$ / total number of nonzeros in $\hat{w}$) and recall (number of correct nonzeros in $\hat{w}$ / $k$).

How well are you able to discover the true nonzeros? Comment on how $\lambda$ affects these results and include plots of precision and recall versus $\lambda$.

2. (4 points) Change $\sigma$ to 10, regenerate the data, and solve the Lasso problem using a value of $\lambda$ that worked well when $\sigma = 1$. How are precision and recall affected? How might you change $\lambda$ in order to achieve better precision or recall?
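The precision and recall defined in part 1 reduce to comparing support sets. A minimal Python sketch (function names, tolerance, and the toy vectors are mine, not part of the assignment):

```python
def support(w, tol=1e-10):
    """Indices of numerically nonzero coefficients."""
    return {j for j, wj in enumerate(w) if abs(wj) > tol}

def precision_recall(w_hat, w_star):
    est, true = support(w_hat), support(w_star)
    correct = len(est & true)
    precision = correct / len(est) if est else 1.0  # convention if w_hat is all-zero
    recall = correct / len(true)
    return precision, recall

w_star = [10, -10, 10, 0, 0, 0]           # k = 3 true nonzeros
w_hat = [9.7, -9.9, 0.0, 0.4, 0.0, 0.0]   # recovers 2 of them plus 1 false positive
print(precision_recall(w_hat, w_star))    # precision = recall = 2/3
```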

4.4 Predicting goodness points of a wine given its review [16 points + 10 bonus]

We'll now put the Lasso to work on some real data. For this homework, the task is to predict the points a wine should receive based on its review text alone.

For many Kaggle competitions (and machine learning methods in general), one of the most important requirements for doing well is the ability to discover good features. We can use our Lasso solver for this as follows: first, generate a large number of features from the data, even if many of them are likely unnecessary; afterward, use the Lasso to reduce the number of features to a more reasonable amount.

The data was scraped from WineEnthusiast during the week of June 15th, 2017, and can be found on the analytics website Kaggle (https://www.kaggle.com/zynicide/wine-reviews). (As a side note, browsing other competitions on the site may also give you some ideas for class projects.) This dataset provides a variety of features: the points, description (review), variety, country, province, etc. For this homework we will be using the points and description features. Points are ratings in the range 1-100; however, the dataset only contains descriptions for wines rated in the range 80-100. We already provide you with the feature vectors for each review. You can download the data for this homework from https://www.kaggle.com/c/hw1-wine-goodness-cse512-spr18/data.

| File | Contents |
| --- | --- |
| trainData.txt | Training data matrix for predicting points |
| trainLabels.txt | Training labels: the points for each review |
| valData.txt | Validation data |
| valLabels.txt | Validation labels |
| testData.txt | Test data |
| featureTypes.txt | List of features used |

The data matrices are stored in Matrix Market Format, a format for sparse matrices. Each line is a tuple of three elements: instance ID, feature ID, and feature value. Meta information for each feature is provided in featureTypes.txt. The features are strings of one, two, or three words (n-grams). The feature values correspond roughly to how often each word appears in the review. All rows have also been normalized.
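If you are working outside Matlab, the triple-per-line layout described above is easy to parse by hand. A hedged Python sketch (it skips `%`-prefixed comment lines, but a real Matrix Market file also carries a size line you would need to skip; a plain dict stands in for a true sparse matrix):

```python
def parse_triples(lines):
    """Parse 'instanceID featureID value' lines into a {(i, j): value} dict."""
    data = {}
    for line in lines:
        parts = line.split()
        if not parts or parts[0].startswith("%"):  # skip blank and comment lines
            continue
        i, j, v = int(parts[0]), int(parts[1]), float(parts[2])
        data[(i, j)] = v
    return data

sample = ["%%header comment", "1 3 0.5", "1 7 0.25", "2 3 1.0"]
mat = parse_triples(sample)
print(len(mat), mat[(1, 3)])  # -> 3 0.5
```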

In Matlab, you can use the following code to load the training data and labels:

```matlab
D = load('trainData.txt');
trX = sparse(D(:,2), D(:,1), D(:,3));
trLb = load('trainLabels.txt');
```


For this part of the problem, you have the following tasks:

1. (8 points) Solve the Lasso to predict the number of points a wine will receive. Starting at $\lambda_{\max}$, run the Lasso on the training set, decreasing $\lambda$ by a factor of 2 and using previous solutions as initial conditions for each problem. Stop when you have considered enough $\lambda$'s that, based on validation error, you can choose a good solution with confidence (for instance, when validation error begins increasing or stops decreasing significantly). At each solution, record the root-mean-squared error (RMSE) on the training and validation data. In addition, record the number of nonzeros in each solution.

Plot the train and validation RMSE values together against $\lambda$. Separately, plot the number of nonzeros as well.

2. (5 points) Find the $\lambda$ that achieves the best validation performance. Compute a model using this $\lambda$ and list the top 10 features with the largest weights and the top 10 features with the lowest weights. Comment on whether the weights make sense intuitively.

3. (3 points) Use your model to predict the points for the reviews in the test data. Save the predicted values to a CSV file predTestLabels.csv (see Section 5.2 for the format). The order of predicted values should correspond to the order of the reviews in testData.txt (counting starts from 1). Submit this predTestLabels.csv file on Kaggle and report the RMSE.

4. (10 bonus points) Submit predTestLabels.csv and enter our Kaggle competition for fame. You must use your Stony Brook email to register on Kaggle. We will maintain a leaderboard, and the top three entries at the end of the competition (the due date) will receive 10 bonus points. The ranking is based on RMSE.

To prevent exploiting the test data, you are allowed to make a maximum of 2 submissions per 24 hours. Each submission will be evaluated immediately and the leaderboard will be updated.

You are allowed to create new features from the existing set of features. You are allowed to perform feature normalization on existing and new features. You can use both training and validation data to train your classifier. You can also use Ridge regression instead of the Lasso; if you use Ridge regression, you must implement it yourself. You cannot use any other classifier types, such as SVM or boosting.

5 What to submit?

5.1 Blackboard submission

You will need to submit both your code and your answers to the questions on Blackboard. Do not submit the provided data. Put the answer file and your code in a folder named SUBID FirstName LastName (e.g., 10947XXXX lionel messi). Zip this folder and submit the zip file on Blackboard. Your submission must be a zip file, i.e., SUBID FirstName LastName.zip. The answer file should be named answers.pdf, and it should contain:

1. Answers to Questions 1, 2, 3, and 4.1.

2. Answers to Question 4.3, including the requested plots.

3. Answers to Questions 4.4.1 and 4.4.2, including the requested plots.

5.2 Kaggle submission

For Questions 4.4.3 and 4.4.4, you must submit a CSV file to get the RMSE from the competition site https://www.kaggle.com/c/hw1-wine-goodness-cse512-spr18. A submission file should contain two columns, instanceId and Prediction. The file should contain a header and have the following format:

```
instanceId,Prediction
1,82
2,91
...
```

A sample submission file is available from the competition site and our handout.
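Producing a file in this shape takes only a few lines; here is a sketch using Python's csv module (the function name and the prediction values are made up for illustration):

```python
import csv
import io

def write_predictions(preds, fh):
    """Write the required header plus one (instanceId, Prediction) row per review."""
    writer = csv.writer(fh, lineterminator="\n")
    writer.writerow(["instanceId", "Prediction"])
    for i, p in enumerate(preds, start=1):  # instance counting starts from 1
        writer.writerow([i, p])

buf = io.StringIO()
write_predictions([82, 91], buf)
print(buf.getvalue())
# instanceId,Prediction
# 1,82
# 2,91
```

In practice you would open predTestLabels.csv for writing instead of an in-memory buffer.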

Cheating warning: Don't cheat. You must do the homework yourself; otherwise you won't learn. You cannot ask for help from or discuss solutions with students from previous years. You cannot look up the solution online. You must use your real name to register on Kaggle. Do not create multiple accounts to bypass the submission limit per 24 hours; doing so will be considered cheating.
