DDA3020 Homework 1


1 Written Problems (50 points)

1.1. (Matrix Differential, 20 points) Deriving the derivatives of common functions is a fundamental skill in machine learning. Nine basic cases are summarized in Table 1. Nevertheless, only three of them (the cells with a gray background) are common in our course. In this problem, you will follow the steps below to derive the derivatives for these cases.

(1) Layout of Derivatives
Table 1: Vector/Matrix derivatives

|                        | $f$ (scalar)              | $\mathbf{f}$ (vector)               | $F$ (matrix)              |
|------------------------|---------------------------|-------------------------------------|---------------------------|
| $w$ (scalar)           | $\frac{df}{dw}$           | $\frac{d\mathbf{f}}{dw}$            | $\frac{dF}{dw}$           |
| $\mathbf{w}$ (vector)  | $\frac{df}{d\mathbf{w}}$  | $\frac{d\mathbf{f}}{d\mathbf{w}}$   | $\frac{dF}{d\mathbf{w}}$  |
| $W$ (matrix)           | $\frac{df}{dW}$           | $\frac{d\mathbf{f}}{dW}$            | $\frac{dF}{dW}$           |

• Differentiation of a scalar function w.r.t. a vector: If $f: \mathbb{R}^n \mapsto \mathbb{R}$ is a scalar function of $n$ variables and $w \in \mathbb{R}^n$ is an $n \times 1$ vector, then differentiation of $f(w)$ w.r.t. $w$ results in an $n \times 1$ vector:

$$\frac{df}{dw} = \begin{bmatrix} \frac{\partial f}{\partial w_1} \\ \frac{\partial f}{\partial w_2} \\ \vdots \\ \frac{\partial f}{\partial w_n} \end{bmatrix}$$
• Differentiation of a scalar function w.r.t. a matrix: If $f: \mathbb{R}^{m \times n} \mapsto \mathbb{R}$ is a scalar function of $mn$ variables and $W \in \mathbb{R}^{m \times n}$ is an $m \times n$ matrix, then differentiation of $f(W)$ w.r.t. $W$ results in an $m \times n$ matrix:

$$\frac{df}{dW} = \begin{bmatrix} \frac{\partial f}{\partial W_{11}} & \cdots & \frac{\partial f}{\partial W_{1n}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial W_{m1}} & \cdots & \frac{\partial f}{\partial W_{mn}} \end{bmatrix}$$
• Differentiation of a vector function w.r.t. a vector: If $f: \mathbb{R}^n \mapsto \mathbb{R}^m$ is a vector function of size $m \times 1$ and $w$ is an $n \times 1$ vector, then differentiation of $f(w)$ w.r.t. $w$ results in an $n \times m$ matrix:

$$\frac{df^{\mathsf T}}{dw} = \begin{bmatrix} \frac{\partial f_1}{\partial w_1} & \cdots & \frac{\partial f_m}{\partial w_1} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_1}{\partial w_n} & \cdots & \frac{\partial f_m}{\partial w_n} \end{bmatrix}$$

We also call this matrix the gradient matrix of $f$ w.r.t. $w$. The transpose of this matrix is exactly the Jacobian matrix $\frac{df}{dw^{\mathsf T}}$ (often written simply as $\frac{df}{dw}$ by convention):

$$\frac{df}{dw^{\mathsf T}} = \begin{bmatrix} \frac{\partial f_1}{\partial w_1} & \cdots & \frac{\partial f_1}{\partial w_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial w_1} & \cdots & \frac{\partial f_m}{\partial w_n} \end{bmatrix}$$
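As a quick sanity check of these layout conventions, the sketch below (an illustration only, not part of the required derivations) builds a finite-difference Jacobian with NumPy; the test function `f`, the point `w0`, and the step size `eps` are hypothetical choices.

```python
import numpy as np

def numerical_jacobian(f, w, eps=1e-6):
    """Finite-difference Jacobian df/dw^T of shape (m, n)."""
    w = np.asarray(w, dtype=float)
    m, n = f(w).shape[0], w.shape[0]
    J = np.zeros((m, n))
    for i in range(n):
        e = np.zeros(n); e[i] = eps
        J[:, i] = (f(w + e) - f(w - e)) / (2 * eps)  # column i: partials w.r.t. w_i
    return J

# Example: f(w) = [w_1^2, w_1 * w_2, sin(w_2)] maps R^2 -> R^3
f = lambda w: np.array([w[0]**2, w[0]*w[1], np.sin(w[1])])
w0 = np.array([1.0, 2.0])
J = numerical_jacobian(f, w0)   # Jacobian df/dw^T, shape (3, 2)
G = J.T                         # gradient matrix df^T/dw, shape (2, 3)
print(J.shape, G.shape)
```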
(2) (10 points) Derivation by definition: Use the definition of multivariate calculus to derive the derivatives.

Example 1: For $f(w) = a^{\mathsf T} w$ with $w \in \mathbb{R}^n$, we have $\frac{df}{dw} = a$, since for each component

$$\left(\frac{df}{dw}\right)_i = \frac{\partial \sum_{j=1}^{n} a_j w_j}{\partial w_i} = \frac{\partial a_i w_i}{\partial w_i} = a_i.$$
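A minimal numerical check of Example 1 (illustrative only; the vector `a` and the point `w0` below are arbitrary test values) can be done with finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
a = rng.normal(size=n)
w0 = rng.normal(size=n)

f = lambda w: a @ w                      # f(w) = a^T w
eps = 1e-6
grad = np.array([(f(w0 + eps*np.eye(n)[i]) - f(w0 - eps*np.eye(n)[i])) / (2*eps)
                 for i in range(n)])
print(np.allclose(grad, a))              # should print True: df/dw = a
```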
Derive the following derivatives by definition.

• (5 pts) $f(w) = w^{\mathsf T} A w$ with $w \in \mathbb{R}^n$ and $A \in \mathbb{R}^{n \times n}$;
• (5 pts) $f(w) = Aw$ with $A \in \mathbb{R}^{m \times n}$ and $w \in \mathbb{R}^n$.

(3) (Bonus task: 5 points) Derivation by differentiation: derive the derivatives based on the relation between differentiation and gradient:

$$f: \mathbb{R}^{n} \mapsto \mathbb{R}: \quad df = \sum_{i=1}^{n} \frac{\partial f}{\partial w_i}\, dw_i = \left(\frac{df}{dw}\right)^{\mathsf T} dw = \mathrm{tr}\!\left(\left(\frac{df}{dw}\right)^{\mathsf T} dw\right)$$

$$f: \mathbb{R}^{m \times n} \mapsto \mathbb{R}: \quad df = \sum_{i,j} \frac{\partial f}{\partial W_{ij}}\, dW_{ij} = \mathrm{tr}\!\left(\left(\frac{df}{dW}\right)^{\mathsf T} dW\right)$$

Example 2: For $f(W) = a^{\mathsf T} W b$ with $W \in \mathbb{R}^{m \times n}$, $a \in \mathbb{R}^m$, and $b \in \mathbb{R}^n$, we have $\frac{df}{dW} = ab^{\mathsf T}$ since

$$df = d(a^{\mathsf T} W b) = d(a^{\mathsf T}) W b + a^{\mathsf T} d(W) b + a^{\mathsf T} W\, d(b) = a^{\mathsf T} d(W) b = \mathrm{tr}\!\left(a^{\mathsf T} d(W) b\right) = \mathrm{tr}\!\left(b a^{\mathsf T}\, dW\right) = \mathrm{tr}\!\left((a b^{\mathsf T})^{\mathsf T} dW\right)$$
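As with Example 1, a quick numerical verification of Example 2 is possible (illustrative only; the dimensions and random test values below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 4, 3
a, b, W = rng.normal(size=m), rng.normal(size=n), rng.normal(size=(m, n))

f = lambda W: a @ W @ b                   # f(W) = a^T W b
eps = 1e-6
grad = np.zeros((m, n))
for i in range(m):
    for j in range(n):
        E = np.zeros((m, n)); E[i, j] = eps
        grad[i, j] = (f(W + E) - f(W - E)) / (2 * eps)
print(np.allclose(grad, np.outer(a, b)))  # should print True: df/dW = a b^T
```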

Hint: Some useful formulas:

$$d(X + Y) = dX + dY, \qquad d(XY) = (dX)Y + X\, d(Y), \qquad d(X^{\mathsf T}) = (dX)^{\mathsf T},$$
$$\mathrm{tr}(X^{\mathsf T}) = \mathrm{tr}(X), \qquad \mathrm{tr}(XY) = \mathrm{tr}(YX), \qquad d(\mathrm{tr}(X)) = \mathrm{tr}(dX).$$
Derive the following derivatives by differentiation.

• (2 pts) $f(w) = w^{\mathsf T} A w$ with $A \in \mathbb{R}^{n \times n}$ and $w \in \mathbb{R}^n$;
• (3 pts) $f(W) = \mathrm{tr}(W^{\mathsf T} A W)$ with $A \in \mathbb{R}^{m \times m}$ and $W \in \mathbb{R}^{m \times n}$.

(4) (10 points) Derivation by chain rule: In machine learning, the input is commonly a feature vector while the loss is almost always a scalar objective function. Therefore, the chain rule is useful for gradient updates.

(a) (10 pts) Vector chain ending with a scalar: For $x \to y_1 \to y_2 \to \cdots \to y_n \to z$, the chain rule is

$$\frac{dz}{dx^{\mathsf T}} = \frac{dz}{dy_n^{\mathsf T}} \frac{\partial y_n}{\partial y_{n-1}^{\mathsf T}} \cdots \frac{\partial y_2}{\partial y_1^{\mathsf T}} \frac{\partial y_1}{\partial x^{\mathsf T}}$$
Assume $\ell = \|Xw - y\|_2^2$ and let $z = Xw - y$; then $\ell = \|z\|_2^2 = z^{\mathsf T} z$. Consequently, there is a chain $w \to z \to \ell$. Following the chain rule, the derivative is $\frac{\partial \ell}{\partial w^{\mathsf T}} = \frac{\partial \ell}{\partial z^{\mathsf T}} \frac{\partial z}{\partial w^{\mathsf T}}$, which implies that

$$\frac{\partial \ell}{\partial w} = \left(\frac{\partial z}{\partial w^{\mathsf T}}\right)^{\mathsf T} \frac{\partial \ell}{\partial z}$$

Determine the closed-form solution of $\frac{\partial \ell}{\partial w}$.
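Once you have your closed-form expression, a finite-difference comparison like the sketch below can help verify it (an optional check, not required by the assignment; `X`, `y`, `w0`, and `my_grad` are hypothetical placeholders, with `my_grad` standing in for whatever closed form you derive):

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 6, 4
X, y, w0 = rng.normal(size=(N, d)), rng.normal(size=N), rng.normal(size=d)

loss = lambda w: np.sum((X @ w - y) ** 2)        # l = ||Xw - y||_2^2

# Finite-difference gradient of the loss at w0.
eps = 1e-6
fd_grad = np.array([(loss(w0 + eps*np.eye(d)[i]) - loss(w0 - eps*np.eye(d)[i])) / (2*eps)
                    for i in range(d)])

# Compare against your derived expression, e.g.:
# print(np.allclose(fd_grad, my_grad(X, y, w0)))
print(fd_grad)
```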

(b) (Bonus Task, 5 points) Matrix chain ending with a scalar: For $X \to Y \to \ell$, the chain rule is

$$\frac{\partial \ell}{\partial X_{ij}} = \sum_{k,l} \frac{\partial \ell}{\partial Y_{kl}} \frac{\partial Y_{kl}}{\partial X_{ij}}$$

Assume $\ell = f(Y)$ where $Y = AX + B$; derive the derivative $\frac{\partial \ell}{\partial X}$.

Hint: First, we have

$$\frac{\partial Y_{kl}}{\partial X_{ij}} = \frac{\partial \sum_{s} A_{ks} X_{sl}}{\partial X_{ij}} = \frac{\partial A_{ki} X_{il}}{\partial X_{ij}} = A_{ki}\,\delta_{lj},$$

where $\delta_{lj} = 1$ for $l = j$; otherwise, $\delta_{lj} = 0$. Second, we have

$$\frac{\partial \ell}{\partial X_{ij}} = \sum_{k,l} \frac{\partial \ell}{\partial Y_{kl}} A_{ki}\,\delta_{lj} = \sum_{k} \frac{\partial \ell}{\partial Y_{kj}} A_{ki} = (A_{:,i})^{\mathsf T} \left(\frac{\partial \ell}{\partial Y}\right)_{:,j}.$$

Lastly, you can derive the form of $\frac{\partial \ell}{\partial X}$.

1.2. (Convexity of functions, 10 points) Prove that: (1) $f(x) = \mathrm{ReLU}(x) = (x)_+ = \max(0, x)$ is convex for $x \in \mathbb{R}$; (2) $f(x) = |x|$ is convex for $x \in \mathbb{R}$; (3) $f(x) = \|Ax - b\|_2^2$ is convex, where $A \in \mathbb{R}^{m \times n}$ and $x \in \mathbb{R}^n$.

1.3. (Gradient Descent, 10 points) Suppose we have training data $\{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, where $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}^k$, $i = 1, 2, \ldots, N$. (1) Find the closed-form solution of the following problem:

$$\min_{W, b} \sum_{i=1}^{N} \alpha_i \|y_i - W x_i - b\|_2^2,$$

where the diagonal of the diagonal matrix $A$, $\mathrm{diag}(A) = (\alpha_1, \alpha_2, \ldots, \alpha_N)^{\mathsf T}$, contains the weights for the different samples; (2) Show how to use gradient descent to solve the problem.

Hint: You can use either the definition or the differentiation method to derive the derivatives. If you use the differentiation method, please note that

$$\sum_{i=1}^{N} \alpha_i \|y_i - W x_i - b\|_2^2 = \mathrm{tr}\!\left[(Y - XW)^{\mathsf T} A (Y - XW)\right]$$

where $Y = (y_1, y_2, \ldots, y_N)^{\mathsf T} \in \mathbb{R}^{N \times k}$, $X = [(x_1^{\mathsf T}, 1)^{\mathsf T}, (x_2^{\mathsf T}, 1)^{\mathsf T}, \ldots, (x_N^{\mathsf T}, 1)^{\mathsf T}]^{\mathsf T} \in \mathbb{R}^{N \times (d+1)}$, $W = (W, b)^{\mathsf T} \in \mathbb{R}^{(d+1) \times k}$, and $A = \mathrm{diag}(\alpha_1, \alpha_2, \ldots, \alpha_N)$.
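For part (2), a generic gradient-descent loop might look like the sketch below (illustrative only; `grad_W` is a hypothetical placeholder for whatever gradient expression you derive, and the learning rate and iteration count are arbitrary):

```python
import numpy as np

def gradient_descent(grad_W, W0, lr=1e-3, num_iters=1000):
    """Generic gradient descent: W <- W - lr * grad(W), repeated num_iters times."""
    W = W0.copy()
    for _ in range(num_iters):
        W -= lr * grad_W(W)
    return W

# Usage sketch (shapes follow the hint): W has shape (d+1, k) after absorbing the bias.
# W_hat = gradient_descent(lambda W: my_gradient(X, Y, A, W), np.zeros((d + 1, k)))
```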

1.4. (Maximum Likelihood Estimation, 10 points) Suppose $x_1, x_2, \ldots, x_N$ are drawn from $\mathcal{N}(\mu, \sigma^2)$. Show that the maximum likelihood estimate (MLE) of $\sigma^2$ is

$$\hat{\sigma}^2_{\mathrm{MLE}} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \hat{\mu}_{\mathrm{MLE}})^2.$$
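The claim can be sanity-checked numerically (this is not a substitute for the proof; the distribution parameters and sample size below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma2, N = 2.0, 4.0, 100_000
x = rng.normal(mu, np.sqrt(sigma2), size=N)

mu_mle = x.mean()
sigma2_mle = np.mean((x - mu_mle) ** 2)   # the 1/N formula from the problem
print(sigma2_mle)                          # close to sigma2 = 4.0 for large N
```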

2 Programming (50 points)

2.1. (Polynomial regression, 20 points) In this exercise, we will try to fit a non-linear function $g$ with polynomial regression on the feasible space $X = [0, 11]$:

Unknown: $g(x) = \,?$

Construct: $f(x) = \sum_{i=0}^{n} \alpha_i x^i \iff f(x) = w^{\mathsf T} x'$, where

$$x' = \begin{bmatrix} 1 \\ x \\ x^2 \\ \vdots \\ x^n \end{bmatrix}, \quad \text{s.t. } \forall x \in X,\ f(x) \approx g(x),$$

where $n$ is the polynomial degree of freedom and is chosen manually.

Follow the instructions given in the Jupyter notebook. At the end of the exercise, you will obtain an estimate of the desired function and comment on this method.
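A minimal sketch of the feature construction described above (illustrative only; the notebook's actual scaffolding may differ, and the placeholder `g`, the sample points, and the degree `n` here are made-up stand-ins):

```python
import numpy as np

n = 5                                        # polynomial degree, chosen manually
xs = np.linspace(0, 11, 50)                  # sample points in the feasible space [0, 11]
g = lambda x: np.sin(x) + 0.1 * x            # placeholder for the unknown g(x)
ys = g(xs)

Phi = np.vander(xs, n + 1, increasing=True)  # rows are x' = (1, x, x^2, ..., x^n)
w, *_ = np.linalg.lstsq(Phi, ys, rcond=None) # least-squares fit of f(x) = w^T x'
print(w)
```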

2.2. (Linear regression, 30 points) The CSV or XLS file contains a dataset for regression. There are 7750 samples with 25 features (described in the doc file). The data is intended for bias correction of the next-day maximum and minimum air temperature forecasts of the LDAPS model operated by the Korea Meteorological Administration over Seoul, South Korea.

The dataset consists of summer data from 2013 to 2017. The input data is largely composed of the LDAPS model's next-day forecast data, the present-day in-situ maximum and minimum temperatures, and geographic auxiliary variables. There are two outputs (i.e., next-day maximum and minimum air temperatures) in this data. Hindcast validation was conducted for the period from 2015 to 2017.

You need to delete the first two attributes (station and date) and use attributes 3-23 to predict attributes 24 and 25. Randomly split the data into two parts, one containing 80% of the samples and the other containing 20%. Use the first part as training data to train a linear regression model, then make predictions on the second part. Report the training error and testing error in terms of RMSE.

Repeat the splitting, training, and testing 10 times. Use a loop and print the RMSEs of each trial.

Note that you need to write the code for learning the parameters yourself. Do not use the classification or regression packages of Sklearn. You may use its tools to shuffle the data randomly for splitting.
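A possible skeleton for this exercise (a sketch only, under the assumption that the data has been exported to a CSV whose first two columns are station and date; the file name, the least-squares solver, and the 80/20 split logic below are illustrative choices, not the required implementation):

```python
import numpy as np
import pandas as pd

df = pd.read_csv("Bias_correction_ucl.csv").dropna()   # hypothetical file name
data = df.iloc[:, 2:].to_numpy()                        # drop station and date
X_all, Y_all = data[:, :-2], data[:, -2:]               # attributes 3-23 vs. 24 and 25

def fit_linear(X, Y):
    """Ordinary least squares with a bias column, solved via least squares."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return np.linalg.lstsq(Xb, Y, rcond=None)[0]

def rmse(X, Y, W):
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return np.sqrt(np.mean((Xb @ W - Y) ** 2))

rng = np.random.default_rng(0)
for trial in range(10):                                 # repeat split/train/test 10 times
    idx = rng.permutation(len(X_all))
    cut = int(0.8 * len(idx))
    tr, te = idx[:cut], idx[cut:]
    W = fit_linear(X_all[tr], Y_all[tr])
    print(f"trial {trial}: train RMSE = {rmse(X_all[tr], Y_all[tr], W):.4f}, "
          f"test RMSE = {rmse(X_all[te], Y_all[te], W):.4f}")
```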