## Description

Question 1

1-dimensional linear regression: Assume a 1-dimensional linear regression model y = w0 +

w1x. The residual sum of squares (RSS) of the training data Dtrain = {(x1, y1), . . . ,(xN , yN )}

can be written as:

RSS(w0, w1) = X

N

n=1

(yn − w0 − w1xn)

2

We estimate the weights w0, w1 by minimizing the above error.

(a) Show that minimizing RSS results in the following closed-form expression:

w

∗

1 =

P

N

n=1

xnyn − N

1

N

P

N

n=1

xn

1

N

P

N

n=1

yn

P

N

n=1

x

2

n − N

1

N

P

N

n=1

xn

2

w

∗

0 =

1

N

X

N

n=1

yn

!

− w1

1

N

X

N

n=1

xn

!

Tip: Set the partial derivatives ϑRSS(w0,w1)

ϑw0

and ϑRSS(w0,w1)

ϑw1

equal to 0. Then solve a 2 × 2

system of linear equations with respect to w0 and w1.

(b) Show that the above expressions for w

∗

0

and w

∗

1

are equivalent to the following:

w

∗

1 =

P

N

n=1

(xi − x¯)(yi − y¯)

P

N

i=1

(xi − x¯)

2

w

∗

0 = ¯y − w1x¯

where ¯x =

1

N

PN

n=1 xn and ¯y =

1

N

PN

n=1 yn are the sample means of input features and outcome

values, respectively.

(c) How would you interpret the above expression in terms of the descriptive statistics (e.g.

sample mean, variance, co-variance) of populations {xn}

N

n=1 and {yn}

N

n=1?

1

Question 2

Principled method for learning the step size in gradient descent: In class we discussed

that when we use gradient descent to minimize target function J(w) with respect to w, the step

size α(k) at iteration k is a crucial hyperparameter. We further said that we can experimentally

determine α(k) through cross-validation. There is actually a principled way for computing the

optimal α(k) in each iteration and we are going to derive the expression for that.

(a) According to Taylor series expansion, a differentiable function f(x) can be written around

x0 as follows:

f(x) = f(x0) + (∇f|x=x0

)

T

· (x − x0) + 1

2

(x − x0)

T

· Hf |x=x0

· (x − x0) + . . .

where ∇f are the gradient vector and Hf Hessian matrix of f evaluated at x0.

Let w(k) be the value of w at the k

th iteration of gradient descent. Show that the second order

Taylor expansion of the target function J(w) around w(k) is the following:

J(w) ≈ J(w(k)) + (∇J|w=w(k)

)

T

· (w − w(k)) + 1

2

(w − w(k))T

· HJ |w=w(k)

· (w − w(k))

where ∇J are the gradient vector and HJ Hessian matrix of J evaluated at w(k).

(b) Show that the above expression of J(w) evaluated at w(k + 1) (i.e. at the (k + 1)th gradient

descent iteration) can be written as:

J(w(k + 1)) ≈J(w(k)) −

∇J|w=w(k)

2

2

· α(k)+

+

1

2

∇J|w=w(k)

T HJ |w=w(k)

∇J|w=w(k)

· α

2

(k)

Tip: Take into account the gradient descent update rule w(k + 1) = w(k) − α(k) · ∇J|w=w(k)

(c) Show that minimizing the above expression with respect to the step size α(k) results in:

α(k) =

∇J|w=w(k)

2

2

∇J|w=w(k)

T HJ |w=w(k)

∇J|w=w(k)

The above expression gives a closed-form solution of the step size at iteration k (i.e. a(k)) that

minimizes the target function at the next iteration.

(d) What is the cost of computing a(k) at each iteration k using the above expression?

Question 3

Predicting forest fires: Forest fires are a major environmental issue endangering human

lives. This renders their fast detection a key element for controlling them and potentially

preventing them. Since it is hard for humans to monitor all forests, we can use automatic tools

based on local sensors to do that. Through these sensors we can get information regarding the

meteorological conditions, such as temperature, wind, relative humidity (RH), and amount of

rain. We can also compute several fire hazard indexes, such as the forest fire weather index

(FWI), fine fuel moisture code (FFMC), duff moisture code (DMC), drought code (DC), and

initial spread index (ISI). Using these measures, we can predict whether fire is going to occur in

the forest, as well as to estimate the amount of burned area. Such data are part of the “Forest

Fires Data Set” of the UCI Machine Learning Repository and their description can be found

here: http://archive.ics.uci.edu/ml/datasets/Forest+Fires

Inside “Homework 1” folder on Piazza you can find two files including the train and test

data (named “train.csv” and “test.csv”) for our experiments. The rows of these files refer to the

data samples, while the columns denote the features (columns 1-12) and the outcome variable

(column 13), as describe bellow:

1. X: x-axis spatial coordinate of the forest: 1 to 9

2. Y: y-axis spatial coordinate of the forest: 2 to 9

3. month: month of the year: 1 to 12 to denote ”jan” to ”dec”

4. day: day of the week: 1 to 7 to denote ”mon” to ”sun”

5. FFMC: FFMC index from the FWI system

6. DMC: DMC index from the FWI system

7. DC: DC index from the FWI system

8. ISI: ISI index from the FWI system

9. temp: temperature in Celsius degrees

10. RH: relative humidity

11. wind: wind speed in km/h

12. rain: outside rain in mm/m2

13. area: the burned area of the forest (this is the outcome variable)

(a) Data exploration: Inspect the input features (e.g. you can plot histograms, scatter plots,

etc.). Which of the features are continuous and which categorical?

(b) Classification: From data exploration, we can notice that the the outcome value (i.e. the

burned area) is zero for many samples, meaning that the corresponding forests are not affected

by fire. Therefore we can dichotomize the outcome variable, based on whether its corresponding

value is zero or greater than zero. This creates the following two classes:

Class 0: Forests not affected by the fire, i.e. area = 0

Class 1: Forests affected by the fire, i.e. area > 0

After dichotomizing the outcome variable, we can run a classification task to predict whether or

not fire will occur in a certain forest based on the input features.

(b.i) Implement a K-Nearest Neighbor classifier (K-NN) using the euclidean distance as a distance measure to perform the above binary classification task. Reminder: Don’t forget to

normalize the features.

(b.ii) Explore different values of K through cross-validation on the training set. Plot the classification accuracy, i.e. (#samples correctly classified) / (total #samples), against the different

values of K.

(b.iii) Report the classification accuracy on the test set using the best K from cross-validation.

3

(b.iv) Bonus: Instead of using the euclidean distance for all features, experiment with different

types of distances or distance combinations, i.e. Hamming distance for categorical features.

Report your findings.

(c) Linear Regression: Among the forests that were affected by the fire, we can use linear

regression to predict the actual amount of area that was burned. For this task, we will only

use the samples of the train and test set with burned area (column 13) greater than

zero, i.e. area > 0.

(c.i) Plot the histogram of the outcome variable. What do you observe? Plot the histogram of

the logarithm of the outcome value, i.e. log(area). What do you observe now?

(c.ii) Implement a linear regression model to fit the outcome data using the ordinary least

squares (OLS) solution.

(c.iii) Test your model on the test data and compute the residual sum of squares error (RSS)

and the correlation between the actual and predicted outcome variable.

(c.iii) Bonus: Experiment with different non-linear functions of the input features. Report

your findings on the train and test sets.

4