## Description

## Question 1: Neural Network [100 points, evenly among parts]

We will build a neural network to perform binary classification. Each input item has two real-valued features

x = (x1, x2), and the class label y is either 0 or 1. Our neural network has a very simple structure:

ReLU A

C Sigmoid

B ReLU

x1 x2

1 1

1

w8 w9

w2

w5 w3

w6

w1 w4

w7

There is one hidden layer with two hidden units A, B, and one output layer with a single output unit C.

The input layer is fully connected to the hidden layer, and the hidden layer is fully connected to the output

layer. Each unit also has a constant bias 1 input with the corresponding weight. See figure below. Units A

are B are ReLU, namely

fA(z) = fB(z) = max(z, 0).

Unit C is a sigmoid,

fC (z) = σ(z) = 1

1 + e−z

.

Write a program Neural.java with the following command line format:

$java Neural FLAG [args]

Where the optional arguments are real valued.

1. This neural network is fully defined by the nine weights w1, . . . , w9. We will first focus on predictions

given fixed weights.

Remark: we will use both single index and double index to refer to a weight. Single index corresponds

to the figure above. Double index, on the other hand, is used to describe the algorithm and denotes

the “from → to” nodes that the edge is connecting. For example, w8 is the same as wA,C , w2 is the

same as wx1,A, and w1 is the same as w1,A where we used “1” to denote the constant bias input of

one. These should be clear from the context.

Recall in a neural network for any unit j, it first collects input from lower units:

uj =

X

i:i→j

wijvi

where vi

is the output of lower unit i. Specifically, if i is an input unit then vi = xi

; if i is the bias

then vi = 1. The unit j then passes uj through its nonlinear function fj () to produce its output vj :

vj = fj (uj ).

When FLAG=100, arg1 … arg9 are the weights w1, . . . , w9, and arg10=x1, arg11=x2. Print uA, vA, uB, vB, uC , vC

on the same line separated by space. When printing real values, show 5 digits after decimal point by

rounding (but do not round the actual variables). For example,

$java Neural 100 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 -1

0.00000 0.00000 0.30000 0.30000 0.97000 0.72512

$java Neural 100 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 -0.2 1.7

2.18000 2.18000 1.43000 1.43000 1.34000 0.79249

$java Neural 100 4 3 2 1 0 -1 -2 -3 -4 -4 1

-6.00000 0.00000 0.00000 0.00000 -2.00000 0.11920

2. Given a training item x = (x1, x2) and its label y, the error made by the neural network on the item

is defined as

E =

1

2

(vC − y)

2

.

The partial derivative with respect to the output layer variable vC is

∂E

∂vC

= vC − y.

The partial derivative with respect to the intermediate variable uC is

∂E

∂uC

=

∂E

∂vC

f

0

C (uC ).

Recall f

0

C (uC ) = σ

0

(uC ) = σ(uC )(1 − σ(uC )) = vC (1 − vC ).

When FLAG=200, arg1 … arg9 are the weights w1, . . . , w9, arg10=x1, arg11=x2, and arg12=y. Print

E,

∂E

∂vC

, and ∂E

∂uC

on the same line separated by space. For example,

$java Neural 200 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 -1 1

0.03778 -0.27488 -0.05479

$java Neural 200 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 -0.2 1.7 0

0.31402 0.79249 0.13032

$java Neural 200 4 3 2 1 0 -1 -2 -3 -4 -4 1 0

0.00710 0.11920 0.01252

3. The partial derivative with respect to hidden layer variable vj is

∂E

∂vj

=

X

k:j→k

wjk

∂E

∂uk

.

And

∂E

∂uj

=

∂E

∂vj

∂vj

∂uj

.

Recall our hidden layer units are ReLU, for which

∂vj

∂uj

=

∂ max(uj , 0)

∂uj

=

1, uj ≥ 0

0, uj < 0

Note we define the derivative to be 1 when uj = 0.

When FLAG=300, arg1 … arg9 are the weights w1, . . . , w9, arg10=x1, arg11=x2, and arg12=y. Print

∂E

∂vA

,

∂E

∂uA

,

∂E

∂vB

, and ∂E

∂uB

on the same line separated by space.

$java Neural 300 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 -1 1

-0.04383 -0.04383 -0.04931 -0.04931

$java Neural 300 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 -0.2 1.7 0

0.03910 0.03910 0.02606 0.02606

$java Neural 300 4 3 2 1 0 -1 -2 -3 -4 -4 1 0

-0.03755 0.00000 -0.05006 -0.05006

4. Now we can compute the partial derivative with respect to the edge weights:

∂E

∂wij

= vi

∂E

∂uj

.

When FLAG=400, arg1 … arg9 are the weights w1, . . . , w9, arg10=x1, arg11=x2, and arg12=y. Print

∂E

∂w1

, . . .

∂E

∂w9

on the same line separated by space. For example,

$java Neural 400 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 -1 1

-0.04383 -0.04383 0.04383 -0.04931 -0.04931 0.04931 -0.05479 0.00000 -0.01644

$java Neural 400 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 -0.2 1.7 0

0.03910 -0.00782 0.06647 0.02606 -0.00521 0.04431 0.13032 0.28411 0.18636

$java Neural 400 4 3 2 1 0 -1 -2 -3 -4 -4 1 0

0.00000 0.00000 0.00000 -0.05006 0.20025 -0.05006 0.01252 0.00000 0.00000

5. Now we perform one step of stochastic gradient descent. With step size η, we update the weights:

wi = wi − η

∂E

∂wi

, i = 1 . . . 9.

When FLAG=500, arg1 … arg9 are the weights w1, . . . , w9, arg10=x1, arg11=x2, arg12=y, and

arg13=η. Print four lines:

(a) the old w1 . . . w9

(b) the error E under the old w

(c) the updated w1 . . . w9

(d) the error E after the update

For example,

$java Neural 500 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 -1 1 0.1

0.10000 0.20000 0.30000 0.40000 0.50000 0.60000 0.70000 0.80000 0.90000

0.03778

0.10438 0.20438 0.29562 0.40493 0.50493 0.59507 0.70548 0.80000 0.90164

0.03617

$java Neural 500 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 -0.2 1.7 0 0.1

1.00000 0.90000 0.80000 0.70000 0.60000 0.50000 0.40000 0.30000 0.20000

0.31402

0.99609 0.90078 0.79335 0.69739 0.60052 0.49557 0.38697 0.27159 0.18136

0.29972

$java Neural 500 4 3 2 1 0 -1 -2 -3 -4 -4 1 0 0.1

4.00000 3.00000 2.00000 1.00000 0.00000 -1.00000 -2.00000 -3.00000 -4.00000

0.00710

4.00000 3.00000 2.00000 1.00501 -0.02002 -0.99499 -2.00125 -3.00000 -4.00000

0.00371

6. Long long ago, in a galaxy far far away, 117 students took an AI class. We have a data set where x1

is each student’s score on homework 2, x2 is their midterm score (both scaled to between 0 and 1),

and the binary label y is whether they received an A at the end of semester (y = 1) or not (y = 0).

We already randomly split the data set into a training set of n = 67 items, an evaluation set with 25

items, and a test set with 25 items. These files are denoted by the file names. We will train the neural

network to predict if a student will receive an A at the end of semester.

We first define one “epoch” as going over the training items once in the natural order:

(a) FOR j = 1 TO n

(b) wi = wi − η

∂Ej

∂wi

, i = 1 . . . 9

(c) PRINT w’s evaluation set error

(d) END

where Ej is the error on the j-th training item. In other words, this is stochastic gradient descent

(except that we go over training items in order rather than randomly, but we shall ignore the difference

for now). Step (c) computes the evaluation set error as follows:

X

25

k=1

1

2

(vC (xk, w) − yk)

2

where (xk, yk) is an evaluation set item, and we used the notation vC (xk, w) to mean the neural network

output when the input is xk and the weights are w.

When FLAG=600, arg1 … arg9 are the initial weights w1, . . . , w9, arg10=η. Run one epoch (i.e. going

over the 67 training items once). For each training item, print three lines:

(a) x1, x2, y

(b) the updated w1 . . . w9

(c) the evaluation set error under the updated w

Therefore, you will print 67 ∗ 3 = 201 lines in total. For example,

$java Neural 600 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.1

0.98000 0.87000 1.00000

0.10049 0.20048 0.30043 0.40055 0.50054 0.60048 0.70062 0.80034 0.90087

7.18915

0.92000 0.88000 0.00000

0.09486 0.19530 0.29548 0.39422 0.49471 0.59491 0.69358 0.79648 0.89110

7.12401

…

0.81000 0.75000 0.00000

-0.13841 0.01138 0.16820 0.06903 0.24590 0.41273 0.22295 0.73651 0.52301

3.93300

0.94000 0.69000 0.00000

-0.13841 0.01138 0.16820 0.06135 0.23868 0.40744 0.20827 0.73651 0.51442

3.87761

7. We now run T epochs. For clarity, we will only print evaluation set error at the end of each epoch.

When FLAG=700, arg1 … arg9 are the initial weights w1, . . . , w9, arg10=η, arg11=T. At the end of

each epoch, print w, then on a separate line the evaluation set error.

$java Neural 700 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.1 3

-0.13841 0.01138 0.16820 0.06135 0.23868 0.40744 0.20827 0.73651 0.51442

3.87761

-0.15422 0.01050 0.18620 -0.10760 0.12065 0.33611 -0.14930 0.73903 0.41698

2.99174

-0.18275 0.01188 0.22380 -0.19706 0.06861 0.31792 -0.35328 0.74892 0.41936

2.71163

8. FLAG=800 has the same input as FLAG=700. But we will stop as soon as evaluation set error starts

to increase after completing a whole epoch. If evaluation set error decreases or stays the same, we

will step after T epochs. Print the number of epochs that actually happened, the final w, and the

evaluation set error on different lines. Now you have a well-trained neural network. For a test item

x, you can make a binary classification prediction by thresholding the network output at 1/2 (predict

positive class if the output is exactly 1/2). Compute the test set classification accuracy (note: not the

squared error) and print it, too.

$java Neural 800 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.1 10000

788

-3.51459 1.20379 4.45346 -2.72310 0.93258 3.44859 -3.99435 2.11219 1.63381

0.87178

0.96000

$java Neural 800 -0.8 0.5 0 -1 0 -0.7 -0.7 0.9 0.1 0.1 10000

2

-0.80000 0.50000 0.00000 -1.00000 0.00000 -0.70000 -0.65318 0.90000 0.10000

2.56847

0.72000