1 Perceptron [10 pts]

Design (specify θ for) a two-input perceptron (with an additional bias or offset term) that computes the following boolean functions. Assume T = 1 and F = −1. If a valid perceptron exists, show that it is not unique by designing another valid perceptron (with a different hyperplane, not simply through normalization). If no perceptron exists, state why.

(a) AND (b) XOR
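A candidate parameter vector can be sanity-checked against the full truth table. The sketch below uses two illustrative choices of θ = (bias, w1, w2) for AND under the T = 1, F = −1 encoding; these particular values are one possible answer, not the unique one the problem asks you to justify:

```python
def perceptron(theta, x):
    """Predict +1/-1 with theta = (bias, w1, w2) on input x = (x1, x2)."""
    return 1 if theta[0] + theta[1] * x[0] + theta[2] * x[1] >= 0 else -1

# Truth table for AND with T = 1 and F = -1.
inputs = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
and_labels = [-1, -1, -1, 1]

# Two candidate hyperplanes (bias, w1, w2); many others work too.
for theta in [(-1.0, 1.0, 1.0), (-1.5, 1.0, 1.0)]:
    preds = [perceptron(theta, x) for x in inputs]
    assert preds == and_labels
```

The same loop can be pointed at any truth table, which is a quick way to test whether a proposed θ for part (b) actually works.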

2 Logistic Regression [10 pts]

Consider the objective function that we minimize in logistic regression:

J(θ) = − Σ_{n=1}^{N} [yn log hθ(xn) + (1 − yn) log (1 − hθ(xn))]

(a) Find the partial derivatives ∂J/∂θj.
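A derived gradient can be verified numerically before trusting it. The sketch below assumes the standard sigmoid hypothesis hθ(x) = 1/(1 + e^{−θᵀx}) with labels yn ∈ {0, 1}, and checks an analytic gradient of the form Σn (hθ(xn) − yn) xn,j against central finite differences; the data here is random and purely illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def J(theta, X, y):
    """Negative log-likelihood objective from the problem statement."""
    h = sigmoid(X @ theta)
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

def grad_J(theta, X, y):
    # dJ/dtheta_j = sum_n (h_theta(x_n) - y_n) x_{n,j}
    return X.T @ (sigmoid(X @ theta) - y)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = np.array([0.0, 1.0, 1.0, 0.0, 1.0])
theta = rng.normal(size=3)

# Central finite differences agree with the analytic gradient.
eps = 1e-6
num = np.array([(J(theta + eps * e, X, y) - J(theta - eps * e, X, y)) / (2 * eps)
                for e in np.eye(3)])
assert np.allclose(num, grad_J(theta, X, y), atol=1e-5)
```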

3 Locally Weighted Linear Regression [10 pts]

Consider a linear regression problem in which we want to “weight” different training instances differently because some of the instances are more important than others. Specifically, suppose we want to minimize

J(θ0, θ1) = Σ_{n=1}^{N} wn (θ0 + θ1 xn,1 − yn)².

Here wn > 0. In class, we worked out what happens for the case where all the weights (the wn’s) are the same. In this problem, we will generalize some of those ideas to the weighted setting.

(a) Calculate the gradient by computing the partial derivatives of J with respect to each of the parameters (θ0, θ1).

(b) Set each partial derivative to 0 and solve for θ0 and θ1 to obtain the values of (θ0, θ1) that minimize J.
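One way to check an answer to (b) is to solve the two normal equations numerically and confirm that both partial derivatives vanish at the solution. The sketch below uses synthetic data; `weighted_fit` is an illustrative helper, not part of the assignment:

```python
import numpy as np

def weighted_fit(x, y, w):
    """Solve the 2x2 normal equations for J = sum_n w_n (t0 + t1 x_n - y_n)^2.

    Setting both partial derivatives to zero gives the linear system below.
    """
    A = np.array([[np.sum(w),     np.sum(w * x)],
                  [np.sum(w * x), np.sum(w * x * x)]])
    b = np.array([np.sum(w * y), np.sum(w * x * y)])
    return np.linalg.solve(A, b)  # (theta0, theta1)

rng = np.random.default_rng(1)
x = rng.normal(size=20)
y = 2.0 + 3.0 * x + rng.normal(scale=0.1, size=20)
w = rng.uniform(0.5, 2.0, size=20)

theta0, theta1 = weighted_fit(x, y, w)
r = theta0 + theta1 * x - y
# At the minimizer, both partial derivatives of J vanish.
assert abs(np.sum(w * r)) < 1e-8 and abs(np.sum(w * x * r)) < 1e-8
```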


4 Understanding Linear Separability [25 pts]

Definition 1 (Linear Program) A linear program can be stated as follows: Let A be an m × n real-valued matrix, ~b ∈ R^m, and ~c ∈ R^n. Find a ~t ∈ R^n that minimizes the linear function

    z(~t) = ~c^T ~t
    subject to A~t ≥ ~b.

In the linear programming terminology, ~c is often referred to as the cost vector and z(~t) is referred to as the objective function.¹ We can use this framework to define the problem of learning a linear discriminant function.²

The Learning Problem:³ Let ~x1, ~x2, . . . , ~xm represent m samples, where each sample ~xi ∈ R^n is an n-dimensional vector, and ~y ∈ {−1, 1}^m is an m × 1 vector representing the respective labels of each of the m samples. Let ~w ∈ R^n be an n × 1 vector representing the weights of the linear discriminant function, and θ be the threshold value.

We predict ~xi to be a positive example if ~w^T ~xi + θ ≥ 0. On the other hand, we predict ~xi to be a negative example if ~w^T ~xi + θ < 0.

We hope that the learned linear function can separate the data set. That is,

    yi =  1   if ~w^T ~xi + θ ≥ 0
         −1   if ~w^T ~xi + θ < 0.        (1)

In order to find a good linear separator, we propose the following linear program:

    min δ
    subject to  yi(~w^T ~xi + θ) ≥ 1 − δ,  ∀(~xi, yi) ∈ D
                δ ≥ 0                                        (2)

where D is the data set of all training examples.
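Although the assignment uses Matlab to solve the LP, formulation (2) can also be sketched with SciPy's `linprog`. The stacking of (~w, θ, δ) into a single decision vector, and the small data set used here, are illustrative choices:

```python
import numpy as np
from scipy.optimize import linprog

def solve_lp(X, y):
    """LP (2): minimize delta s.t. y_i (w . x_i + theta) >= 1 - delta, delta >= 0.

    Decision vector t = [w_1, ..., w_n, theta, delta].
    """
    m, n = X.shape
    c = np.zeros(n + 2)
    c[-1] = 1.0                                      # objective: minimize delta
    # Rewrite each constraint as -(y_i x_i) . w - y_i theta - delta <= -1
    # to match linprog's A_ub @ t <= b_ub convention.
    A_ub = np.hstack([-y[:, None] * X, -y[:, None], -np.ones((m, 1))])
    b_ub = -np.ones(m)
    bounds = [(None, None)] * (n + 1) + [(0, None)]  # w, theta free; delta >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.x[:n], res.x[n], res.x[n + 1]

# A linearly separable two-point data set, so the optimal delta is 0.
X = np.array([[1.0, 1.0, 1.0], [-1.0, -1.0, -1.0]])
y = np.array([1.0, -1.0])
w, theta, delta = solve_lp(X, y)
assert delta < 1e-7 and np.all(y * (X @ w + theta) >= 1 - 1e-6)
```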

¹ Note that you don’t really need to know how to solve a linear program, since we will use Matlab to obtain the solution (although you may wish to brush up on linear algebra). Also see the appendix for more information on linear programming.

² This discussion closely parallels the linear programming representation found in Pattern Classification, by Duda, Hart, and Stork.

³ Note that the notation used in the Learning Problem is unrelated to the one used in the Linear Program definition. You may want to figure out the correspondence.


(a) A data set D = {(~xi, yi)}_{i=1}^m that satisfies condition (1) above is called linearly separable. Show that if D is linearly separable, there is an optimal solution to the linear program (2) with δ = 0.

(b) Now show that if there is an optimal solution with δ = 0, then D is linearly separable.

(c) What can we say about the linear separability of the data set if there exists a hyperplane that satisfies condition (2) with δ > 0?

(d) An alternative LP formulation to (2) may be

    min δ
    subject to  yi(~w^T ~xi + θ) ≥ −δ,  ∀(~xi, yi) ∈ D
                δ ≥ 0

Find the optimal solution to this formulation (independent of D) to illustrate the issue with such a formulation.

(e) Let ~x1 ∈ R^3, ~x1^T = (1 1 1), and y1 = 1. Let ~x2 ∈ R^3, ~x2^T = (−1 −1 −1), and y2 = −1. The data set D is defined as

    D = {(~x1, y1), (~x2, y2)}.

Consider the formulation in (2) applied to D. What are possible optimal solutions?
