$30.00
1 Convexity and norms [30 points]
1. A norm k · k over R
n is defined by the properties: i) nonnegative: kxk ≥ 0 for all x ∈ R
n with equality
if and only if x = 0, ii) absolute scalability: ka xk = a kxk for all a ∈ R and x ∈ R
n, iii) triangle inequality:
kx + yk ≤ kxk + kyk for all x, y ∈ R
n.
a. [3 points] Show that f(x) = (Pn
i=1 xi
) is a norm. (Hint: begin by showing that a + b ≤ a + b for all
a, b ∈ R)
b. [2 points] Show that g(x) = Pn
i=1 xi

1/2
2
is not a norm. (Hint: it suffices to find two points in n = 2
dimensions such that the triangle inequality does not hold)
Context: norms are often used in regularization to encourage specific behaviors of solutions. If we define
kxkp := (Pn
i=1 xi

p
)
1/p then one can show that kxkp is a norm for all p ≥ 1. The important cases of p = 2 and
p = 1 correspond to the penalty for ridge regression and the lasso, respectively.
2. [6 points] For any x ∈ R
n, define the following norms: kxk1 =
Pn
i=1 xi
, kxk2 =
pPn
i=1 xi

2, kxk∞ :=
limp→∞ kxkp = maxi=1,…,n xi
. Show that kxk∞ ≤ kxk2 ≤ kxk1.
3. [3 points] A set A ⊆ R
n is convex if λx + (1 − λ)y ∈ A for all x, y ∈ A and λ ∈ [0, 1].
For each of the greyshaded sets above (IIII), state whether each one is convex, or state why it is not convex
using any of the points a, b, c, d in your answer.
4. [4 points] We say a function f : R
d → R is convex on a set A if f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y) for
all x, y ∈ A and λ ∈ [0, 1].
For each of the greycolored functions above (IIII), state whether each one is convex on the given interval or
state why not with a counterexample using any of the points a, b, c, d in your answer.
a. Function in panel I on [a, c]
b. Function in panel II on [a, c]
c. Function in panel III on [a, d]
d. Function in panel III on [c, d]
5. Use just the definitions above and let k · k be a norm.
a. [3 points] Show that f(x) = kxk is a convex function.
b. [3 points] Show that {x ∈ R
n : kxk ≤ 1} is a convex set.
c. [2 points] Draw a picture of the set {(x1, x2) : g(x1, x2) ≤ 4} where g(x1, x2) =
x1
1/2 + x2
1/2
2
.
(This is the function considered in 1b above specialized to n = 2.) We know g is not a norm. Is the
defined set convex? Why not?
Context: It is a fact that a function f defined over a set A ⊆ R
n is convex if and only if the set {(x, z) ∈ R
n+1 :
z ≥ f(x), x ∈ A} is convex. Draw a picture of this for yourself to be sure you understand it.
6. For i = 1, . . . , n let `i(w) be convex functions over w ∈ R
d
(e.g., `i(w) = (yi − w
>xi)
2
), k · k is any norm,
and λ > 0.
a. [3 points] Show that
Xn
i=1
`i(w) + λkwk
is convex over w ∈ R
d
(Hint: Show that if f, g are convex functions, then f(x) + g(x) is also convex.)
b. [1 points] Explain in one sentence why we prefer to use loss functions and regularized loss functions that
are convex.
Programming: Lasso [40 points]
Given λ > 0 and data (x1, y1), . . . ,(xn, yn), the Lasso is the problem of solving
arg min
w∈Rd,b∈R
Xn
i=1
(x
T
i w + b − yi)
2 + λ
X
d
j=1
wj  (1)
λ is a regularization tuning parameter. For the programming part of this homework, you are required to
implement the coordinate descent method of Algorithm 1 that can solve the Lasso problem.
You may use common computing packages (such as NumPy or SciPy), but do not use an existing Lasso solver
(e.g., of SciKit Learn).
Before you get started, here are some hints that you may find helpful:
• For loops can be slow whereas vector/matrix computation in Numpy is very optimized, exploit this as
much as possible.
• The psuedocode provided has many opportunities to speed up computation by precomputing quantities
like ak before the for loop. These small changes can speed things up considerably.
• As a sanity check, ensure the objective value is nonincreasing with each step.
• It is up to you to decide on a suitable stopping condition. A common criteria is to stop when no element
of w changes by more than some small δ during an iteration. If you need your algorithm to run faster, an
easy place to start is to loosen this condition.
• You will need to solve the Lasso on the same dataset for many values of λ. This is called a regularization
path. One way to do this efficiently is to start at a large λ, and then for each consecutive solution, initialize
the algorithm with the previous solution, decreasing λ by a constant ratio (e.g., by a factor of 2) until
finished.
• The smallest value of λ for which the solution wb is entirely zero is given by
λmax = max
k=1,…,d
2
Xn
i=1
xi,k
yi −
1
n
Xn
j=1
yj
This is helpful for choosing the first λ in a regularization path.
8. We will first try out your solver with some synthetic data. A benefit of the Lasso is that if we believe
many features are irrelevant for predicting y, the Lasso can be used to enforce a sparse solution, effectively
differentiating between the relevant and irrelevant features. Suppose that x ∈ R
d
, y ∈ R, k < d, and pairs of
data (xi
, yi) for i = 1, . . . , n are generated independently according to the model yi = w
T xi + i where
wj =
(
j/k if j ∈ {1, . . . , k}
0 otherwise
where i ∼ N (0, σ2
) is some Gaussian noise (in the model above b = 0). Note that since k < d, the features
k + 1 through d are unnecessary (and potentially even harmful) for predicting y.
With this model in mind, let n = 500, d = 1000, k = 100, and σ = 1. Generate some data by drawing each
xi ∼ N (0, I) so that xi ∈ R
d with yi generated as specified above.
• [10 points] With your synthetic data, solve multiple Lasso problems on a regularization path, starting at
λmax where 0 features are selected and decreasing λ by a constant ratio (e.g., 1.5) until nearly all the
features are chosen. In plot 1, plot the number of nonzeros as a function of λ on the xaxis (Tip: use
plt.xscale(’log’)).
• [10 points] For each value of λ tried, record values for false discovery rate (FDR) (number of incorrect
nonzeros in wb/total number of nonzeros in wb) and true positive rate (TPR) (number of correct nonzeros
in wb/k). In plot 2, plot these values with the xaxis as FDR, and the yaxis as TPR and note that in an
ideal situation we would have an (FDR,TPR) pair in the upper left corner, but that can always trivially
achieve (0, 0) and ( d−k
d
, 1)
Algorithm 1: Coordinate Descent Algorithm for Lasso
while not converged do
b ← 1
n
Pn
i=1
yi −
Pd
j=1 wjxi,j
for k ∈ {1, 2, · · · d} do
ak ← 2
Pn
i=1 x
2
i,k
ck ← 2
Pn
i=1 xi,k
yi − (b +
P
j6=k wjxi,j )
wk ←
(ck + λ)/ak ck < −λ
0 ck ∈ [−λ, λ]
(ck − λ)/ak ck > λ
end
end
3
Comment on the effect of λ in these two plots.
9. We’ll now put the Lasso to work on some real data. Download the training data set “crimetrain.txt” and the
test data set “crimetest.txt” from the website under Homework 1. Store your data in your working directory
and read in the files with:
import pandas as pd
df_train = pd.read_table(“crimetrain.txt”)
df_test = pd.read_table(“crimetest.txt”)
This stores the data as Pandas DataFrame objects. DataFrames are similar to Numpy arrays but more flexible;
unlike Numpy arrays, they store row and column indices along with the values of the data. Each column of
a DataFrame can also, in principle, store data of a different type. For this assignment, however, all data are
floats. Here are a few commands that will get you working with Pandas for this assignment:
df.head() # Print the first few lines of DataFrame df.
df.index # Get the row indices for df.
df.columns # Get the column indices.
df[‘‘foo’’’] # Return the column named ‘‘foo’’’.
df.drop(‘‘foo’’, axis = 1) # Return all columns except ‘‘foo’’.
df.values # Return the values as a Numpy array.
df[‘‘foo’’’].values # Grab column foo and convert to Numpy array.
df.iloc[:3,:3] # Use numerical indices (like Numpy) to get 3 rows and cols.
The data consist of local crime statistics for 1,994 US communities. The response y is the crime rate. The name
of the response variable is ViolentCrimesPerPop, and it is held in the first column of df train and df test.
There are 95 features. These features include possibly relevant variables such as the size of the police force or
the percentage of children that graduate high school. The data have been split for you into a training and test
set with 1,595 and 399 entries, respectively1
.
We’d like to use this training set to fit a model which can predict the crime rate in new communities and
evaluate model performance on the test set. As there are a considerable number of input variables, overfitting
is a serious issue. In order to avoid this, use the coordinate descent LASSO algorithm you just implemented in
the previous problem.
Begin by running the LASSO solver with λ = λmax defined above. For the initial weights, just use 0. Then, cut
λ down by a factor of 2 and run again, but this time pass in the values of ˆw from your λ = λmax solution as
your initial weights. This is faster than initializing with 0 weights each time. Continue the process of cutting λ
by a factor of 2 until the smallest value of λ is less than 0.01. For all plots use a logscale for the λ dimension
(Tip: use plt.xscale(’log’)).
a. [4 points] Plot the number of nonzeros of each solution versus λ.
b. [4 points] Plot the regularization paths (in one plot) for the coefficients for input variables agePct12t29,
pctWSocSec, pctUrban, agePct65up, and householdsize.
c. [4 points] On one plot, plot the squared error on the training and test data versus λ.
d. [4 points] Sometimes a larger value of λ performs nearly as well as a smaller value, but a larger value will
select fewer variables and perhaps be more interpretable. Inspect the weights (on features) for λ = 30.
Which feature variable had the largest (most positive) Lasso coefficient? What about the most negative?
Discuss briefly. A description of the variables in the data set can be found here: http://archive.ics.
uci.edu/ml/machinelearningdatabases/communities/communities.names.
e. [4 points] Suppose there was a large negative weight on agePct65up and upon seeing this result, a politician
suggests policies that encourage people over the age of 65 to move to high crime areas in an effort to reduce
crime. What is the (statistical) flaw in this line of reasoning? (Hint: fire trucks are often seen around
burning buildings, do fire trucks cause fire?)
1The features have been standardized to have mean 0 and variance 1.
4
Programming: Binary Logistic Regression [30 points]
10. Let us again consider the MNIST dataset, but now just binary classification, specifically, recognizing if a
digit is a 2 or 7. Here, let Y = 1 for all the 7’s digits in the dataset, and use Y = −1 for 2. We will use
regularized logistic regression. Given a binary classification dataset {(xi
, yi)}
n
i=1 for xi ∈ R
d and yi ∈ {−1, 1}
we showed in class that the regularized negative log likelihood objective function can be written as
J(w, b) = 1
n
Xn
i=1
log(1 + exp(−yi(b + x
T
i w))) + λw2
2
Note that the offset term b is not regularized. For all experiments, use λ = 10−1
. Let µi(w, b) = 1
1+exp(−yi(b+x
T
i w)) .
a. [8 points] Derive the gradients ∇wJ(w, b), ∇bJ(w, b) and give your answers in terms of µi(w, b) (your
answers should not contain exponentials).
b. [8 points] Implement gradient descent with an initial iterate of all zeros. Try several values of step sizes
to find one that appears to make convergence on the training set as fast as possible. Run until you feel
you are near to convergence.
(i) For both the training set and the test, plot J(w, b) as a function of the iteration number (and show
both curves on the same plot).
(ii) For both the training set and the test, classify the points according to the rule sign(b + x
T
i w) and
plot the misclassification error as a function of the iteration number (and show both curves on the
same plot).
Note that you are only optimizing on the training set. The J(w, b) and misclassification error plots should
be on separate plots.
c. [7 points] Repeat (b) using stochastic gradient descent with a batch size of 1. Note, the expected gradient
with respect to the random selection should be equal to the gradient found in part (a). Take careful note
of how to scale the regularizer.
d. [7 points] Repeat (b) using stochastic gradient descent with batch size of 100. That is, instead of approximating the gradient with a single example, use 100. Note, the expected gradient with respect to the
random selection should be equal to the gradient found in part (a).
5