## 1 Logistic Regression and Regularization

Let us consider a binary logistic regression model.

(a) Given $n$ training examples $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, write down the negative log-likelihood (as the loss function):

$$L(w) = -\log\left(\prod_{i=1}^{n} P(Y = y_i \mid X = x_i)\right).$$
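As a quick numeric illustration (not part of the required derivation), the loss above can be evaluated directly. The Python/NumPy sketch below assumes labels $y_i \in \{0, 1\}$ and the standard sigmoid model $P(Y = 1 \mid X = x) = \sigma(w^\top x)$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_log_likelihood(w, X, y):
    """L(w) = -sum_i log P(Y = y_i | X = x_i) for binary logistic regression."""
    p = sigmoid(X @ w)                      # P(Y = 1 | X = x_i) for each row
    # probability the model assigns to the observed label y_i
    per_sample = np.where(y == 1, p, 1.0 - p)
    return -np.sum(np.log(per_sample))

# tiny sanity check: with w = 0, every P(Y = y_i | X = x_i) = 0.5
X = np.array([[1.0, 2.0], [-1.0, 0.5], [0.3, -2.0]])
y = np.array([1, 0, 1])
w = np.zeros(2)
print(neg_log_likelihood(w, X, y))   # = 3 * log(2) ≈ 2.079
```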

(b) Is this loss function convex? Provide your reasoning.

(c) Show that the magnitude of the optimal $w$ can go to infinity when the training samples are linearly separable (i.e., the data can be perfectly classified by a linear classifier).
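For intuition on (c) (this is a numeric demonstration, not a proof), here is a small Python check on a separable 1-D dataset we made up: scaling a correctly separating weight vector by ever larger constants keeps decreasing the loss, so no finite minimizer exists.

```python
import numpy as np

def nll(w, X, y):
    """Negative log-likelihood for binary logistic regression, y in {0, 1}."""
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return -np.sum(np.where(y == 1, np.log(p), np.log(1.0 - p)))

# linearly separable 1-D data: negatives at x < 0, positives at x > 0
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0, 0, 1, 1])
w = np.array([1.0])                    # separates the data perfectly

losses = [nll(c * w, X, y) for c in (1.0, 5.0, 10.0)]
print(losses)   # strictly decreasing: growing ||w|| keeps reducing the loss
```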

(d) A convenient way to resolve the problem described in (c) is to add a regularization term to the likelihood function as follows:

$$L(w) = -\log\left(\prod_{i=1}^{n} P(Y = y_i \mid X = x_i)\right) + \lambda \|w\|_2^2, \tag{1}$$

where $\|w\|_2^2 = \sum_i w_i^2$ and $\lambda > 0$. Compute the gradient with respect to $w_i$, i.e., $\frac{\partial L(w)}{\partial w_i}$.
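A derived gradient for (d) can be sanity-checked against finite differences. This Python sketch assumes the sigmoid model with $y \in \{0, 1\}$; the analytic gradient it uses, $X^\top(\sigma(Xw) - y) + 2\lambda w$, is the standard form such a derivation arrives at under that model:

```python
import numpy as np

def loss(w, X, y, lam):
    """Regularized negative log-likelihood, Eq. (1), sigmoid model."""
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return -np.sum(np.where(y == 1, np.log(p), np.log(1 - p))) + lam * np.sum(w ** 2)

def grad(w, X, y, lam):
    """Analytic gradient: X'(sigma(Xw) - y) + 2*lam*w."""
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return X.T @ (p - y) + 2 * lam * w

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = (rng.random(20) < 0.5).astype(float)
w = rng.normal(size=3)
lam = 0.1

# central finite differences, one coordinate at a time
eps = 1e-6
fd = np.array([(loss(w + eps * e, X, y, lam) - loss(w - eps * e, X, y, lam)) / (2 * eps)
               for e in np.eye(3)])
print(np.max(np.abs(fd - grad(w, X, y, lam))))   # tiny: analytic and numeric agree
```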

(e) Show that the problem in Eq. (1) has a unique solution.

## 2 Sparsity via Regularization

Given a set of training data $(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)$ where $x_i \in \mathbb{R}^D$, linear regression learns the weight vector $w$ (assuming the bias term is absorbed into $w$) by solving

$$\min_w \sum_{i=1}^{N} (y_i - w^\top x_i)^2. \tag{2}$$

For many real-world applications (e.g., text and image classification, gene sequence analysis), the dimensionality $D$ can be very high. In practice, however, only a small portion of the features may actually be useful in predicting the target variable $y$. This motivates us to enforce a sparse structure on $w$: many elements of $w$ are zeros.

(a) One way to incorporate sparsity into linear regression is to add a regularization term as follows:

$$\min_w \sum_{i=1}^{N} (y_i - w^\top x_i)^2 + \lambda \|w\|_0, \tag{3}$$

where $\lambda > 0$ and $\|w\|_0 = \#\{i : w_i \neq 0\}$ is called the $\ell_0$-norm of $w$; it equals the number of nonzero elements of $w$. The challenge is that $\|w\|_0$ is not a convex function of $w$. Provide an example to show the nonconvexity of $\|w\|_0$.
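Candidate counterexamples for (a) can be checked numerically: convexity would require $\|\alpha u + (1-\alpha)v\|_0 \leq \alpha\|u\|_0 + (1-\alpha)\|v\|_0$ for all $\alpha \in [0, 1]$. The vectors below are one illustrative choice, not the only valid answer:

```python
import numpy as np

def l0(w):
    """Number of nonzero entries of w (the l0 'norm')."""
    return np.count_nonzero(w)

u = np.array([1.0, 0.0])
v = np.array([0.0, 1.0])
alpha = 0.5
mid = alpha * u + (1 - alpha) * v           # (0.5, 0.5): both entries nonzero

lhs = l0(mid)                               # 2
rhs = alpha * l0(u) + (1 - alpha) * l0(v)   # 1.0
print(lhs, rhs)   # lhs > rhs: the convexity inequality is violated
```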

(b) One way to overcome the nonconvexity of the $\ell_0$-norm is to use the $\ell_1$-norm: $\|w\|_1 = \sum_i |w_i|$, which is the sum of the absolute values of the elements. Although the $\ell_1$-norm is only an approximation of the $\ell_0$-norm, it often leads to sparse solutions in practice. Prove that $\|w\|_1$ is convex in $w$.

(c) Due to the nondifferentiability of $\|w\|_1$, solving

$$\min_w \sum_{i=1}^{N} (y_i - w^\top x_i)^2 + \lambda \|w\|_1 \tag{4}$$

becomes challenging: it no longer has an analytic solution. One way to optimize the objective function is to transform it into a quadratic program (QP), for which fast solvers exist. A QP has the following standard form:

$$\min_u \; \frac{1}{2} u^\top Q u + c^\top u \tag{5}$$
$$\text{subject to} \quad Au \leq b, \tag{6}$$

where $u \in \mathbb{R}^P$ is the vector to be optimized ($P$ can be different from $D$), and $c \in \mathbb{R}^P$, $b \in \mathbb{R}^G$, $A \in \mathbb{R}^{G \times P}$, and $Q \in \mathbb{R}^{P \times P}$ are coefficients. Show that (4) can be converted into a standard QP.

(Hint: introduce $D$ additional variables $t_1, \ldots, t_D$ with $t_d \geq w_d$ and $t_d \geq -w_d$ for $d = 1, \ldots, D$. Then show that minimizing $\|w\|_1$ is equivalent to minimizing $\sum_{d=1}^{D} t_d$, which is an upper bound of $\|w\|_1$. To transform further into a standard QP, stack $w$ and $t$ into a vector $u \in \mathbb{R}^{2D}$, and then form the corresponding $Q \in \mathbb{R}^{2D \times 2D}$, $A \in \mathbb{R}^{2D \times 2D}$, $c \in \mathbb{R}^{2D}$, and $b \in \mathbb{R}^{2D}$.)
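The hint can be checked numerically before writing the proof. Assuming the stacking $u = [w; t]$, one consistent construction (a sketch, not the only valid one) takes $Q$ with $2X^\top X$ in its top-left block and zeros elsewhere, $c = [-2X^\top y;\, \lambda \mathbf{1}]$, and constraint rows encoding $w_d - t_d \leq 0$ and $-w_d - t_d \leq 0$. Up to the constant $y^\top y$, the QP objective then matches (4) whenever $t_d = |w_d|$:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 8, 3
X = rng.normal(size=(N, D))
y = rng.normal(size=N)
lam = 0.5

# QP data for u = [w; t], following the hint
Q = np.zeros((2 * D, 2 * D))
Q[:D, :D] = 2 * X.T @ X
c = np.concatenate([-2 * X.T @ y, lam * np.ones(D)])
I = np.eye(D)
A = np.block([[I, -I],       # encodes  w_d - t_d <= 0
              [-I, -I]])     # encodes -w_d - t_d <= 0
b = np.zeros(2 * D)

# check: at t = |w|, QP objective + y'y equals the lasso objective (4)
w = rng.normal(size=D)
u = np.concatenate([w, np.abs(w)])
qp_obj = 0.5 * u @ Q @ u + c @ u + y @ y
lasso_obj = np.sum((y - X @ w) ** 2) + lam * np.sum(np.abs(w))
print(abs(qp_obj - lasso_obj))       # ~0 (floating-point roundoff)
print(np.all(A @ u <= b + 1e-12))    # True: u is feasible
```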

## 3 Kernel Ridge Regression

In this problem, we will provide guidelines for you to derive kernel ridge regression, a nonlinear

extension of linear ridge regression. You will be asked to implement it in the programming part.

Given a set of training data $(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)$ where $x_i \in \mathbb{R}^D$, linear ridge regression learns the weight vector $w$ (assuming the bias term is absorbed into $w$) by optimizing the following objective function:

$$\min_w \sum_{i=1}^{N} (y_i - w^\top x_i)^2 + \lambda \|w\|_2^2, \tag{7}$$

where $\lambda \geq 0$ is the regularization coefficient.

(a) Let $X \in \mathbb{R}^{N \times D}$ be the matrix whose $n$th row is $x_n^\top$, and let $y \in \mathbb{R}^N$ be the vector whose $n$th element is $y_n$. Show that the optimal $w$ can be written as

$$w^* = (X^\top X + \lambda I_D)^{-1} X^\top y, \tag{8}$$

where $I_D$ denotes the $D \times D$ identity matrix.
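The closed form in (8) can be sanity-checked numerically: at $w^*$, the gradient of objective (7), $2X^\top(Xw - y) + 2\lambda w$, should vanish. A small Python/NumPy check (for illustration only; the programming part itself must be done in MATLAB):

```python
import numpy as np

rng = np.random.default_rng(2)
N, D = 50, 4
X = rng.normal(size=(N, D))
y = rng.normal(size=N)
lam = 0.3

w_star = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)   # Eq. (8)

grad = 2 * X.T @ (X @ w_star - y) + 2 * lam * w_star           # gradient of (7)
print(np.max(np.abs(grad)))   # ~0: w_star is a stationary point
```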

(b) Now apply a nonlinear feature mapping to each sample: $x_i \to \Phi_i = \phi(x_i) \in \mathbb{R}^T$, where the dimensionality $T$ is much larger than $D$. Define $\Phi \in \mathbb{R}^{N \times T}$ as the matrix whose $i$th row is $\Phi_i^\top$. Show that $w^*$ can be written as

$$w^* = \Phi^\top (\Phi \Phi^\top + \lambda I_N)^{-1} y. \tag{9}$$

Hint: you may use the following matrix identity. For any $P \in \mathbb{R}^{p \times p}$, $B \in \mathbb{R}^{q \times p}$, and $R \in \mathbb{R}^{q \times q}$, assuming the matrix inversions are valid,

$$(P^{-1} + B^\top R^{-1} B)^{-1} B^\top R^{-1} = P B^\top (B P B^\top + R)^{-1}.$$

(c) Given a test sample $\phi(x)$, show that the prediction $\hat{y} = w^{*\top} \phi(x)$ can be written as

$$\hat{y} = y^\top (K + \lambda I_N)^{-1} \kappa(x), \tag{10}$$

where $K \in \mathbb{R}^{N \times N}$ is the kernel matrix defined by $K_{ij} = \Phi_i^\top \Phi_j$, and $\kappa(x) \in \mathbb{R}^N$ is the vector with $n$th element $\kappa(x)_n = \phi(x_n)^\top \phi(x)$. Now you can see that $\hat{y}$ depends only on dot products (kernel values) of the $\{\Phi_i\}$.

(d) Compare the computational complexity between linear ridge regression and kernel ridge regression.

## 4 Kernel Construction

The Mercer theorem can be used to construct valid kernel functions. The theorem states that a bivariate function $k(\cdot, \cdot)$ is a positive definite kernel function if and only if, for any $N$ and any $x_1, x_2, \ldots, x_N$, the matrix $K$ with $K_{ij} = k(x_i, x_j)$ is positive semidefinite; that is, all the eigenvalues of the matrix are non-negative. An alternative (but equivalent) definition states that for every positive semidefinite matrix $A \in \mathbb{R}^{n \times n}$ and arbitrary vector $x \in \mathbb{R}^{n \times 1}$, we have $x^\top A x \geq 0$.

Suppose $k_1(\cdot, \cdot)$ and $k_2(\cdot, \cdot)$ are kernel functions, and $x$ and $x'$ are any two samples. Prove that $k_3$, $k_4$, $k_5$, $k_6$, and $k_7$ below are valid kernel functions using the Mercer theorem:

(a) $k_3(x, x') = a_1 k_1(x, x') + a_2 k_2(x, x')$ where $a_1, a_2 \geq 0$.

(b) $k_4(x, x') = f(x) f(x')$ where $f(\cdot)$ is a real-valued function.

(c) $k_5(x, x') = g(k_1(x, x'))$ where $g(\cdot)$ is a polynomial function with positive coefficients.

(d) $k_6(x, x') = k_1(x, x') \, k_2(x, x')$.

(e) $k_7(x, x') = \exp(k_1(x, x'))$.
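 
Before writing the proofs, the claims can be spot-checked numerically: build Gram matrices for $k_3$ through $k_7$ from two known-valid base kernels and confirm all eigenvalues are non-negative up to roundoff. The base kernels, the function $f$, the polynomial $g$, and the sample points below are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
Xs = rng.normal(size=(6, 3))            # six arbitrary sample points

def gram(k):
    """Gram matrix K_ij = k(x_i, x_j) over the sample points."""
    return np.array([[k(a, b) for b in Xs] for a in Xs])

k1 = lambda a, b: np.exp(-np.sum((a - b) ** 2))   # RBF: a valid kernel
k2 = lambda a, b: a @ b                           # linear: a valid kernel
f = lambda a: np.sin(a[0])                        # arbitrary real-valued function
g = lambda t: 2 * t ** 3 + t + 0.5                # polynomial, positive coefficients

candidates = {
    "k3": lambda a, b: 0.4 * k1(a, b) + 1.5 * k2(a, b),
    "k4": lambda a, b: f(a) * f(b),
    "k5": lambda a, b: g(k1(a, b)),
    "k6": lambda a, b: k1(a, b) * k2(a, b),
    "k7": lambda a, b: np.exp(k1(a, b)),
}
min_eigs = {name: np.linalg.eigvalsh(gram(k)).min() for name, k in candidates.items()}
print(min_eigs)   # all non-negative up to roundoff
```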

## 5 Programming – Bias/Variance Tradeoff

Let $f(x) = 2x^2 + \epsilon$, where $\epsilon$ is Gaussian noise drawn from $\mathcal{N}(0, 0.1)$. We are also given the following functions:

- $g_1(x) = 1$
- $g_2(x) = w_0$
- $g_3(x) = w_0 + w_1 x$
- $g_4(x) = w_0 + w_1 x + w_2 x^2$
- $g_5(x) = w_0 + w_1 x + w_2 x^2 + w_3 x^3$
- $g_6(x) = w_0 + w_1 x + w_2 x^2 + w_3 x^3 + w_4 x^4$

(a) Randomly generate 100 datasets. Each dataset contains 10 samples $(x_i, y_i)$, where $x_i$ is uniformly sampled from $[-1, 1]$ and $y_i = f(x_i)$. For each $g$ function, estimate its parameters using linear regression and compute the mean squared error on every dataset; then plot the histogram of these errors. Repeat this for each $g$ function, and provide the six histogram plots in the report. For each $g$ function, estimate its bias$^2$ and variance, and report the results.
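One compact way to organize the experiment in (a) is sketched below in Python (the assignment itself must be implemented in MATLAB; names like `fit_poly` are ours, and we read $\mathcal{N}(0, 0.1)$ as variance 0.1). It runs the loop for one model, $g_4$, and estimates bias$^2$ and variance on an evaluation grid:

```python
import numpy as np

rng = np.random.default_rng(5)
n_datasets, n_samples = 100, 10
xs_eval = np.linspace(-1, 1, 200)          # grid for bias/variance estimates

def f_noisy(x):
    # assuming N(0, 0.1) denotes mean 0 and variance 0.1
    return 2 * x ** 2 + rng.normal(0.0, np.sqrt(0.1), size=x.shape)

def fit_poly(x, y, degree):
    """Least-squares fit of w0 + w1*x + ... + w_degree*x^degree."""
    A = np.vander(x, degree + 1, increasing=True)
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w

degree = 2                                  # g4: quadratic model
preds = np.empty((n_datasets, xs_eval.size))
for d in range(n_datasets):
    x = rng.uniform(-1, 1, n_samples)
    y = f_noisy(x)
    w = fit_poly(x, y, degree)
    preds[d] = np.vander(xs_eval, degree + 1, increasing=True) @ w

true_mean = 2 * xs_eval ** 2                # E[f(x)], i.e. f without the noise
bias2 = np.mean((preds.mean(axis=0) - true_mean) ** 2)
variance = np.mean(preds.var(axis=0))
print(bias2, variance)   # g4 matches the true model, so bias^2 is small
```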

(b) In (a), change the sample size in each dataset to 100, and repeat the procedures. Plot the

resulting histograms in the report. For each g function, estimate its bias2 and variance, and

report the results.

(c) Given your results in (a) and (b), discuss how the model complexity and sample size affect

the bias2 and variance.

(d) Consider the function $h(x) = w_0 + w_1 x + w_2 x^2 + \lambda (w_0^2 + w_1^2 + w_2^2)$, where $\lambda \geq 0$ is a regularization parameter. Following (b) (i.e., 100 samples per dataset), estimate the bias$^2$ and variance of $h(x)$ with $\lambda$ set to 0.01, 0.1, 1, and 10, respectively. Discuss how $\lambda$ affects the bias$^2$ and variance.

## 6 Programming – Regression

In this problem, you will implement linear ridge regression and kernel ridge regression, and work

on a regression dataset.

All programming should be done in MATLAB. You must implement certain functions which we

specify. Otherwise, you may use built-in MATLAB functions. Do not use code from the internet

or from other students. Record answers to questions in your written LaTeX report.

Dataset It contains 3,107 observations on U.S. county votes cast in the 1980 presidential election. Each observation contains 6 features covering the population structure, economic conditions, and geographic information of the county. The goal is to predict the proportion of votes cast. Details of the dataset can be found at http://lib.stat.cmu.edu/datasets/space_ga.

Preprocessing Randomly split the dataset into the training (80% samples) and test sets (20%

samples). Normalize each feature to have zero mean and unit variance (note that you should only

use the mean and variance information in the training data). When experimenting with any model

described below, do 10 random splits, and report the average test error.
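The normalization step is a common source of leakage bugs: test features must be scaled with statistics computed from the training split only. A Python sketch of the idea (for illustration; the assignment's implementation must be in MATLAB, and `normalize_train_test` is our name):

```python
import numpy as np

def normalize_train_test(X_train, X_test):
    """Zero-mean, unit-variance scaling using training statistics only."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma[sigma == 0] = 1.0          # guard against constant features
    return (X_train - mu) / sigma, (X_test - mu) / sigma

rng = np.random.default_rng(6)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 6))
idx = rng.permutation(100)           # one random 80/20 split
X_tr, X_te = X[idx[:80]], X[idx[80:]]

X_tr_n, X_te_n = normalize_train_test(X_tr, X_te)
print(np.max(np.abs(X_tr_n.mean(axis=0))))      # ~0: training features centered
print(np.max(np.abs(X_tr_n.std(axis=0) - 1)))   # ~0: unit variance on training set
```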

Linear Ridge Regression Implement the linear ridge regression described in Problem 3 in MATLAB. For the regularization parameter $\lambda$, use 5-fold cross-validation on the training set to pick the optimal value from $[0, 10^{-4}, 10^{-3}, 10^{-2}, \ldots, 10^{2}, 10^{3}]$. Note that for each random split of the data, you need to run cross-validation to select $\lambda$. Report the optimal $\lambda$ for each random split. Additionally, report the average test error.

Kernel Ridge Regression Implement the kernel ridge regression described in Problem 3 in MATLAB. You need to try three types of kernels:

(a) Linear kernel: $k(x_i, x_j) = x_i^\top x_j$.

(b) Polynomial kernel: $k(x_i, x_j) = (x_i^\top x_j + a)^b$, where $a$ and $b$ are additional parameters. You need to choose $a$ from $[-1, -0.5, 0, 0.5, 1]$ and $b$ from $[1, 2, 3, 4]$.

(c) Gaussian RBF kernel: $k(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|_2^2}{\sigma^2}\right)$, where $\sigma$ is an additional parameter. You need to choose $\sigma^2$ from $[0.125, 0.25, 0.5, 1, 2, 4, 8]$.

For each kernel type, use 5-fold cross-validation on the training set to pick the optimal $\lambda$ from $[0, 10^{-4}, 10^{-3}, 10^{-2}, \ldots, 10^{2}, 10^{3}]$ together with the other kernel parameters. Report the optimal parameters for each random split. Additionally, report the average test error.

Comparison Does kernel ridge regression with the linear kernel give the same results as linear ridge regression? If not, why? Among the three kernel types, which one performs best?

Submission Instructions: You need to provide the following:

• Provide your answers to Problems 1-6 in a PDF file named CSCI567 hw3 fall15.pdf. You need to submit the homework both in hard copy (at Locker #19 at PHE, in the box labeled CSCI567-homework, by 6 pm on the deadline date) and electronically as a PDF file on Blackboard. If you handwrite the answers instead of typing them, 40% of the points will be deducted.

• Submit ALL the code and the report via Blackboard by 6 pm on the deadline date. The only acceptable language is MATLAB. Your program MUST include a main function called CSCI567 hw3 fall15.m in the root of your folder. After running this main file, your program should generate all of the results needed for this programming assignment, either as plots or console outputs. You can have multiple files (i.e., your subfunctions); the only requirement is that once we unzip your folder and execute your main file, your program executes correctly. Please double-check your program before submitting. You should submit only one .zip file; no other formats are allowed. Also, please name it [lastname] [firstname] hw3 fall15.zip.

Collaboration: You may collaborate. However, collaboration must be limited to discussion only; you need to write your own solution and submit it separately. You also need to list everyone with whom you discussed.