Description
Implement Lasso using the Coordinate Descent (CD) algorithm and test your algorithm on the Boston Housing data. Check [Rcode W3 VarSel RidgeLasso.html] for relevant background information on this dataset.
• First, load the transformed Boston Housing Data, Coding2 myData.csv.
• Next, write your own function MyLasso to implement CD. It should output estimated Lasso coefficients similar to the ones returned by R's glmnet with the option "standardize = TRUE".
In case you don't know where to start, you can follow the structure given below to prepare your function. In our script, we run a fixed number of iterations, maxit = 100, which seems enough for this assignment. You could set it to a larger number, or change it to a while loop that stops when some convergence criterion is satisfied.
• Test your algorithm with the following lambda sequence.
lam.seq = exp(seq(-1, -8, length.out = 80))
myout = MyLasso(X, y, lam.seq, maxit = 100)
• Produce a path plot for the 13 non-intercept coefficients along the lambda values on the log scale (a sketch of the loading and plotting steps is given after these bullets).
• Check the accuracy of your algorithm against the output from glmnet. The maximum
difference between the two coefficient matrices should be less than 0.005.
lasso.fit = glmnet(X, y, alpha = 1, lambda = lam.seq)
max(abs(coef(lasso.fit) - myout))
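For reference, here is a minimal sketch of the loading and plotting steps referred to above. It assumes the response is stored in the last column of Coding2 myData.csv and the remaining 13 columns are the predictors; adjust the indexing (and the exact file name) to match the actual file.

# Load the transformed Boston Housing data and split into X and y
# (the column layout below is an assumption; check the actual file)
myData = read.csv("Coding2_myData.csv")
X = as.matrix(myData[, -ncol(myData)])   # 13 predictors
y = myData[, ncol(myData)]               # response (assumed last column)

# Run the coordinate-descent Lasso over the lambda sequence
lam.seq = exp(seq(-1, -8, length.out = 80))
myout = MyLasso(X, y, lam.seq, maxit = 100)

# Path plot: the 13 non-intercept coefficients versus log(lambda)
matplot(log(lam.seq), t(myout[-1, ]), type = "l", lty = 1,
        xlab = "log(lambda)", ylab = "Coefficient",
        main = "Lasso coefficient path")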
Students who use Python for this assignment can find the dataset Coding2 myData.csv and the target Lasso coefficients, Coding2 lasso coefs.csv, in the Coding Assignments folder on the Resources page on Piazza.
What you need to submit
An R/Python Markdown file in HTML format, which should contain all code used to
produce your results.
• You are allowed to use two packages: MASS (for the data) and glmnet.
• Name your file starting with Assignment 2 xxxx netID, where "xxxx" is the last 4 digits of your University ID.
One_var_lasso = function(r, x, lam) {
  ##############
  # YOUR CODE
  ##############
}
MyLasso = function(X, y, lam.seq, maxit = 50) {

  # X: n-by-p design matrix without the intercept
  # y: n-by-1 response vector
  # lam.seq: sequence of lambda values
  # maxit: number of updates for each lambda

  n = length(y)
  p = dim(X)[2]
  nlam = length(lam.seq)

  ##############################
  # YOUR CODE
  # Center and scale X
  # Record the corresponding means and scales
  ##############################

  # Initialize coef vector b and residual vector r
  b = XXX
  r = XXX
  B = XXX

  # Triple nested loop
  for (m in 1:nlam) {
    lam = XXX  # assign lambda value
    for (step in 1:maxit) {
      for (j in 1:p) {
        r = r + (X[, j] * b[j])
        b[j] = One_var_lasso(r, X[, j], lam)
        r = r - X[, j] * b[j]
      }
    }
    B[m, -1] = b
  }

  ##############################
  # YOUR CODE
  # Scale back the coefficients and update the intercepts B[, 1]
  ##############################

  return(t(B))
}
Note: You need to write your own function One_var_lasso to solve the one-variable Lasso for $\beta_j$. Check the hints given below.
In the CD algorithm, at each iteration we repeatedly solve a one-dimensional Lasso problem for $\beta_j$ while holding the other $(p-1)$ coefficients at their current values:
$$\min_{\beta_j} \; \sum_{i=1}^{n} \Big( y_i - \sum_{k \ne j} x_{ik}\hat{\beta}_k - x_{ij}\beta_j \Big)^2 + \lambda \sum_{k \ne j} |\hat{\beta}_k| + \lambda |\beta_j|,$$
which is equivalent to solving
$$\min_{\beta_j} \; \sum_{i=1}^{n} ( r_i - x_{ij}\beta_j )^2 + \lambda |\beta_j|, \tag{1}$$
where
$$r_i = y_i - \sum_{k \ne j} x_{ik}\hat{\beta}_k.$$
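For concreteness, this partial residual can be computed in R as follows (an illustrative snippet, assuming X is the design matrix, b holds the current coefficients, and j is the coordinate being updated; the skeleton above instead maintains r incrementally):

r = y - X[, -j, drop = FALSE] %*% b[-j]   # partial residual for coordinate j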
How to solve (1)? In class we have discussed how to find the minimizer of
$$f(x) = (x - a)^2 + \lambda |x|,$$
which is given by
$$x^* = \arg\min_x f(x) = \operatorname{sign}(a)\,(|a| - \lambda/2)_+ =
\begin{cases}
a - \lambda/2, & \text{if } a > \lambda/2;\\
0, & \text{if } |a| \le \lambda/2;\\
a + \lambda/2, & \text{if } a < -\lambda/2.
\end{cases} \tag{2}$$
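In R, the minimizer in (2) is the familiar soft-thresholding operation; a direct transcription is given below (the helper name soft_threshold is ours, not part of the assignment skeleton):

# minimizer of f(x) = (x - a)^2 + lam * |x|, as in (2)
soft_threshold = function(a, lam) {
  sign(a) * max(abs(a) - lam / 2, 0)
}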
We can rewrite (1) in the form of f(x) and then use the solution given above.
Define two $n \times 1$ vectors: $r = (r_1, \dots, r_n)^T$ with $r_i$ as its $i$-th element, and $x_j = (x_{1j}, \dots, x_{nj})^T$ with $x_{ij}$ as its $i$-th element. Then
$$\begin{pmatrix} r_1 - x_{1j}\beta_j \\ r_2 - x_{2j}\beta_j \\ \vdots \\ r_n - x_{nj}\beta_j \end{pmatrix}
= \begin{pmatrix} r_1 \\ r_2 \\ \vdots \\ r_n \end{pmatrix}
- \begin{pmatrix} x_{1j} \\ x_{2j} \\ \vdots \\ x_{nj} \end{pmatrix} \beta_j
= r - x_j \beta_j.$$
Then rewrite (1) as
$$\sum_{i=1}^{n} ( r_i - x_{ij}\beta_j )^2 + \lambda |\beta_j| = \| r - x_j \beta_j \|^2 + \lambda |\beta_j|. \tag{3}$$
The first term above is like the RSS from a regression model with only one predictor (whose coefficient is $\beta_j$) and without an intercept. The corresponding LS estimate is given by
$$\hat{\beta}_j = r^T x_j / \| x_j \|^2.$$
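In R, with r and xj the vectors defined above (hypothetical variable names), this LS estimate is a one-liner:

beta.hat.j = sum(r * xj) / sum(xj^2)   # r' x_j / ||x_j||^2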
Then we have
$$\begin{aligned}
\| r - x_j \beta_j \|^2 &= \| r - x_j \hat{\beta}_j + x_j(\beta_j - \hat{\beta}_j) \|^2 \\
&= \| r - x_j \hat{\beta}_j \|^2 + \| x_j(\beta_j - \hat{\beta}_j) \|^2 + 2\,\big\langle\, r - x_j \hat{\beta}_j,\; x_j(\beta_j - \hat{\beta}_j) \,\big\rangle \\
&= \| r - x_j \hat{\beta}_j \|^2 + \| x_j(\beta_j - \hat{\beta}_j) \|^2,
\end{aligned} \tag{4}$$
where the last equality holds because the inner product term is zero: the vector $(r - x_j\hat{\beta}_j)$ is orthogonal to $x_j$.¹
Note that the first term of (4) has nothing to do with $\beta_j$. So to minimize (1), or equivalently (3), with respect to $\beta_j$, we can ignore the first term and instead minimize
$$\| x_j(\beta_j - \hat{\beta}_j) \|^2 + \lambda |\beta_j|
= \| x_j \|^2 (\beta_j - \hat{\beta}_j)^2 + \lambda |\beta_j|
= \| x_j \|^2 \Big[ (\beta_j - \hat{\beta}_j)^2 + \frac{\lambda}{\| x_j \|^2}\,|\beta_j| \Big]
\propto (\beta_j - \hat{\beta}_j)^2 + \frac{\lambda}{\| x_j \|^2}\,|\beta_j|.$$
Now we can use (2), the solution we derived for $f(x)$, with $a = \hat{\beta}_j = r^T x_j / \| x_j \|^2$ and with $\lambda$ replaced by $\lambda / \| x_j \|^2$.
¹ This is because $(r - x_j\hat{\beta}_j)$ is the residual vector from a regression model with $x_j$ as a column (actually the only column) of the design matrix, so it is orthogonal to $x_j$.
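Putting the pieces together, one possible sketch of One_var_lasso follows the derivation directly: compute a = r'x_j/||x_j||^2 and apply the soft-thresholding rule (2) with lambda replaced by lambda/||x_j||^2. This is only a sketch under the objective as written above; matching glmnet's coefficients may additionally require accounting for how glmnet scales its objective (by 1/(2n)), which is left for you to work out.

One_var_lasso = function(r, x, lam) {
  # r: residual vector with the contribution of x added back in
  # x: the j-th column of the centered-and-scaled design matrix
  # lam: the current lambda value
  xx = sum(x^2)                                # ||x_j||^2
  a  = sum(r * x) / xx                         # one-variable LS estimate
  sign(a) * max(abs(a) - lam / (2 * xx), 0)    # soft-thresholding, as in (2)
}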