Homework 2 – Deep Learning CS/DS 541


1. XOR problem [10 points, on paper]: Show (by deriving the gradient, setting to 0, and solving
mathematically, not in Python) that the values for w = (w1, w2) and b that minimize the function
J(w, b) in Equation 6.1 (in the Deep Learning textbook) are: w1 = 0, w2 = 0, and b = 0.5.
2. L2-regularized Linear Regression via Stochastic Gradient Descent [20 points, in Python]:
Train a 2-layer neural network (i.e., linear regression) for age regression using the same data as in
homework 1. Your prediction model should be ŷ = xᵀw + b. You should regularize w but not b. Note
that, in contrast to Homework 1, this model includes a bias term.
Instead of optimizing the weights of the network with the closed formula, use stochastic gradient
descent (SGD). There are several different hyperparameters that you will need to choose:
• Mini-batch size ñ.
• Learning rate ε.
• Number of epochs.
• L2 regularization strength α.
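The SGD update itself can be sketched in a few lines of NumPy. This is a minimal illustration, not the required submission: the function name, default hyperparameter values, and data shapes are all illustrative assumptions.

```python
import numpy as np

def sgd_epoch(X, y, w, b, lr=1e-3, alpha=1e-2, batch_size=32, rng=None):
    """One epoch of SGD on the L2-regularized MSE cost.

    Mini-batch gradients (batch of size n_b):
      dJ/dw = (1/n_b) * X_b.T @ (X_b @ w + b - y_b) + alpha * w
      dJ/db = (1/n_b) * sum(X_b @ w + b - y_b)   # b is NOT regularized
    """
    rng = rng or np.random.default_rng(0)
    n = X.shape[0]
    order = rng.permutation(n)          # shuffle examples each epoch
    for start in range(0, n, batch_size):
        idx = order[start:start + batch_size]
        resid = X[idx] @ w + b - y[idx]
        grad_w = X[idx].T @ resid / len(idx) + alpha * w
        grad_b = resid.mean()
        w = w - lr * grad_w
        b = b - lr * grad_b
    return w, b
```

Note that the regularization term α·w appears only in the gradient for w, so the bias b is free to absorb the mean of the targets.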
In order not to cheat (in the machine learning sense) – and thus overestimate the performance of the
network – it is crucial to optimize the hyperparameters only on a validation set. (The training set
would also be acceptable but typically leads to worse performance.) To create a validation set, simply
set aside a fraction (e.g., 20%) of age_regression_Xtr.npy and age_regression_ytr.npy to be
the validation set; the remainder (80%) of these data files will constitute the “actual” training data.
While there are fancier strategies (e.g., Bayesian optimization – another probabilistic method, by the
way!) that can be used for hyperparameter optimization, it’s common to just use a grid search over a
few values for each hyperparameter. In this problem, you are required to explore systematically (e.g.,
using nested for loops) at least 4 different values for each hyperparameter.
Performance evaluation: Once you have tuned the hyperparameters and optimized the weights so
as to minimize the cost on the validation set, then: (1) stop training the network and (2) evaluate the
network on the test set. Report the performance in terms of unregularized MSE.
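The nested-for-loop grid search described above might be organized as follows. This is only a sketch: the particular grid values and function names are illustrative choices, and the data are assumed to be already loaded as NumPy arrays and split 80/20 into training and validation sets.

```python
import numpy as np

def train(X, y, lr, alpha, batch_size, epochs, rng):
    """Plain SGD on the L2-regularized MSE cost (w regularized, b not)."""
    w, b = np.zeros(X.shape[1]), 0.0
    n = X.shape[0]
    for _ in range(epochs):
        order = rng.permutation(n)
        for s in range(0, n, batch_size):
            idx = order[s:s + batch_size]
            resid = X[idx] @ w + b - y[idx]
            w -= lr * (X[idx].T @ resid / len(idx) + alpha * w)
            b -= lr * resid.mean()
    return w, b

def grid_search(Xtr, ytr, Xval, yval):
    """Try 4 values per hyperparameter; keep the model with lowest
    UNREGULARIZED validation MSE (the grids below are illustrative)."""
    best = (np.inf, None, None)
    rng = np.random.default_rng(0)
    for lr in (0.001, 0.005, 0.01, 0.05):
        for alpha in (0.0, 0.01, 0.1, 1.0):
            for batch_size in (16, 32, 64, 128):
                for epochs in (10, 20, 50, 100):
                    w, b = train(Xtr, ytr, lr, alpha, batch_size, epochs, rng)
                    val_mse = ((Xval @ w + b - yval) ** 2).mean()
                    if val_mse < best[0]:
                        best = (val_mse, w, b)
    return best
```

The model selection criterion is the unregularized MSE on the validation set, matching the metric you are asked to report on the test set; the regularizer belongs in the training cost only.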
3. Regularization to encourage symmetry [10 points, on paper]: Faces (and some other kinds of
data) tend to be left-right symmetric. How can you use L2 regularization to discourage the weights
from becoming too asymmetric? For simplicity, consider the case of a tiny 1×2 “image”. Hint: instead
of using (α/2) wᵀw = (α/2) wᵀIw as the L2 penalty term (where α is the regularization strength), consider a
different matrix in the middle. Your answer should consist of a 2×2 matrix S as well as an explanation
of why it works.
4. Recursive state estimation in Hidden Markov Models [10 points, on paper]: Teachers try
to monitor their students’ knowledge of the subject matter, but teachers cannot directly peer inside
students’ brains. Hence, they must make inferences about what the student knows based on students’
observable behavior, i.e., how they perform on tests, their facial expressions during class, etc. Let
random variable (RV) Xt represent the student’s state, and let RV Yt represent the student’s observable
behavior, at time t. We can model the student as a Hidden Markov Model (HMM):
(a) Xt depends only on the previous state Xt−1, not on any states prior to that (Markov property),
i.e.
P(xt | x1, . . . , xt−1) = P(xt | xt−1)
(b) The student’s behavior Yt depends only on his/her current state Xt, i.e.:
P(yt | xt, y1, . . . , yt−1) = P(yt | xt)
(c) Xt cannot be observed directly (it is hidden).
A probabilistic graphical model for the HMM is shown below, where only the observed RVs are shaded
(the latent ones are transparent):

[Figure: a chain of hidden states X1 → · · · → Xt−1 → Xt, where each state Xi emits an observation Yi.]

Suppose that the teacher already knows:
• P(yt | xt) (observation likelihood), i.e., the probability distribution of the student’s behaviors
given the student’s state.
• P(xt | xt−1) (transition dynamics), i.e., the probability distribution of the student’s current state
given the student’s previous state.
The goal of the teacher is to estimate the student’s current state Xt given the entire history of observations Y1, . . . , Yt he/she has made so far. Show that the teacher can, at each time t, update his/her
belief recursively:
P(xt | y1, . . . , yt) ∝ P(yt | xt) Σ_{xt−1} P(xt | xt−1) P(xt−1 | y1, . . . , yt−1)
where P(xt−1 | y1, . . . , yt−1) is the teacher’s belief of the student’s state from time t − 1, and the
summation is over every possible value of the previous state xt−1. Hint: You will need to use Bayes’
rule, i.e., for any RVs A, B, and C:
P(a | b, c) = P(b | a, c) P(a | c) / P(b | c)
However, since the denominator in the right-hand side does not depend on a, this can also be rewritten
as:
P(a | b, c) ∝ P(b | a, c)P(a | c)
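As a numerical sanity check (not a substitute for the proof), the recursive update can be simulated for a tiny HMM. All probabilities below are made-up illustrative numbers: two hidden states (“confused”, “understands”) and two observable behaviors (“wrong answer”, “right answer”).

```python
import numpy as np

T = np.array([[0.7, 0.3],   # T[i, j] = P(x_t = j | x_{t-1} = i), transition dynamics
              [0.1, 0.9]])
E = np.array([[0.8, 0.2],   # E[j, k] = P(y_t = k | x_t = j), observation likelihood
              [0.3, 0.7]])

def update_belief(belief, y):
    """One step of the recursion:
    P(x_t | y_1..y_t) ∝ P(y_t | x_t) * Σ_{x_{t-1}} P(x_t | x_{t-1}) P(x_{t-1} | y_1..y_{t-1})
    """
    predicted = T.T @ belief       # sum over the previous state x_{t-1}
    unnorm = E[:, y] * predicted   # weight by the observation likelihood
    return unnorm / unnorm.sum()   # normalize (this is the ∝ step)

belief = np.array([0.5, 0.5])      # uniform prior over the initial state
for y in [1, 1, 0, 1]:             # observed behaviors over four time steps
    belief = update_belief(belief, y)
```

After three correct answers out of four, the posterior mass shifts toward the “understands” state; normalizing at each step is what replaces the omitted denominator P(yt | y1, . . . , yt−1).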
5. Linear-Gaussian prediction model [15 points, on paper]:
Probabilistic prediction models enable us to estimate not just the “most likely” or “expected” value
of the target y (see figure above, right), but rather an entire probability distribution about which
target values are more likely than others, given input x (see figure above, left). In particular, a
linear-Gaussian model is a Gaussian distribution whose expected value (mean µ) is a linear function
of the input features x, and whose variance is σ²:

P(y | x) = N(µ = xᵀw, σ²) = (1 / √(2πσ²)) exp( −(y − xᵀw)² / (2σ²) )

Note that, in general, σ² can also be a function of x (heteroscedastic case). Moreover, non-linear
Gaussian models are also completely possible, e.g., the mean (and possibly the variance) of the Gaussian
distribution is output by a deep neural network. However, in this problem, we will assume that µ is
linear in x, and that σ² is the same for all x (homoscedastic case).
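The density above can be evaluated directly. Here is a minimal sketch of the log-likelihood of a dataset under the homoscedastic linear-Gaussian model (the function name is an illustrative assumption); taking logs turns the product of per-example densities into a sum.

```python
import numpy as np

def gaussian_loglik(X, y, w, sigma2):
    """Log-likelihood of a homoscedastic linear-Gaussian model:
    log P(y | x) = -0.5 * log(2*pi*sigma2) - (y - x^T w)^2 / (2*sigma2),
    summed over all n examples."""
    resid = y - X @ w
    n = len(y)
    return -0.5 * n * np.log(2 * np.pi * sigma2) - (resid ** 2).sum() / (2 * sigma2)
```

This is the quantity that MLE (below) maximizes over w and σ²; note that only the residual term depends on w, which is why the MLE for w coincides with the least-squares solution.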
MLE: The parameters of probabilistic models are commonly optimized by maximum likelihood estimation (MLE). (Another common approach is maximum a posteriori estimation, which allows the
practitioner to incorporate a “prior belief” about the parameters’ values.) Suppose the training dataset
D = {(x^(i), y^(i))}_{i=1}^n. Let the parameters/weights of the linear-Gaussian model be w, such that the
mean µ = xᵀw. Prove that the MLE of w and σ² given D is:
w = ( Σ_{i=1}^n x^(i) x^(i)ᵀ )⁻¹ ( Σ_{i=1}^n x^(i) y^(i) )

σ² = (1/n) Σ_{i=1}^n ( x^(i)ᵀ w − y^(i) )²
Note that this solution – derived based on maximizing probability – is exactly the same as the optimal
weights of a 2-layer neural network optimized to minimize MSE.
Hint: Follow the same strategy as the MLE derivation for a biased coin in Class2.pdf. For a linear-Gaussian
model, the argmax of the likelihood equals the argmax of the log-likelihood. The log of the
Gaussian likelihood simplifies beautifully.
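As a numerical sanity check of the claimed equivalence (again, not a substitute for the on-paper proof), the closed-form MLE can be compared against NumPy's least-squares solver on synthetic data; the function name below is an illustrative choice.

```python
import numpy as np

def mle_linear_gaussian(X, y):
    """Closed-form MLE from the problem statement, written with matrices:
    sum_i x^(i) x^(i)^T == X.T @ X  and  sum_i x^(i) y^(i) == X.T @ y,
    so w solves the normal equations; sigma^2 is the mean squared residual."""
    w = np.linalg.solve(X.T @ X, X.T @ y)
    sigma2 = ((X @ w - y) ** 2).mean()
    return w, sigma2
```

The w returned here should match `np.linalg.lstsq(X, y, rcond=None)[0]`, i.e., the MSE-minimizing weights of the 2-layer network mentioned above.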
Put your code in a Python file called homework2_WPIUSERNAME1.py
(or homework2_WPIUSERNAME1_WPIUSERNAME2.py for teams). For the proofs, please create a PDF called
homework2_WPIUSERNAME1.pdf
(or homework2_WPIUSERNAME1_WPIUSERNAME2.pdf for teams). Create a Zip file containing both your Python
and PDF files, and then submit on Canvas.