CS689: Machine Learning Homework 3


Questions:
1. (70 points) Mixture Models for Mixed Data: A common problem in real-world data analysis is learning models from mixed data. In this problem, you will implement a general probabilistic mixture model for learning from mixed data including real values, binary values, categorical values, and counts. We let Z be the mixture indicator random variable. We allow K mixture components. We define four blocks of variables X = [X^R, X^B, X^{Ca}, X^{Co}]. The corresponding data values are denoted by x = [x^R, x^B, x^{Ca}, x^{Co}]. We will assume the dimensionality of the variables in each block is D_R, D_B, D_{Ca}, D_{Co}. The number of categories for categorical variable d is C_d. The joint distribution of the mixture indicator variable and the data variables is given by:
P(X = x, Z = z \mid \theta) = P(Z = z \mid \theta^M) \prod_{T \in \{R, B, Ca, Co\}} P(X^T = x^T \mid Z = z, \theta^T_z)   (1)

P(Z = z \mid \theta^M) = \theta^M_z   (2)

P(X^R = x^R \mid Z = z, \theta^R) = \mathcal{N}(x^R; \mu_z, \Sigma_z)   (3)

P(X^B = x^B \mid Z = z, \theta^B) = \prod_{d=1}^{D_B} (\theta^B_{dz})^{[x^B_d]} (1 - \theta^B_{dz})^{[1 - x^B_d]}   (4)

P(X^{Ca} = x^{Ca} \mid Z = z, \theta^{Ca}) = \prod_{d=1}^{D_{Ca}} \prod_{c=1}^{C_d} (\theta^{Ca}_{cdz})^{[x^{Ca}_d = c]}   (5)

P(X^{Co} = x^{Co} \mid Z = z, \theta^{Co}) = \prod_{d=1}^{D_{Co}} \mathrm{Poisson}(x^{Co}_d; \theta^{Co}_{dz})   (6)
We specify the following prior distribution P(θ) (up to normalization constants) on the model parameters,
which we will use to form regularized/penalized/MAP estimates of the parameters during learning.
P(\mu_z) = \mathcal{N}(\mu_z; 0, 100^2 I)   (7)

P(\Sigma_z) \propto \frac{1}{|\Sigma_z|^{1/2}} \exp\left(-\frac{1}{2} \mathrm{trace}(0.01 \cdot I \cdot \Sigma_z^{-1})\right)   (8)

P(\theta^B_{dz}) \propto (\theta^B_{dz})^{1/100} (1 - \theta^B_{dz})^{1/100}   (9)

P(\theta^{Ca}_{dz}) \propto \prod_{c=1}^{C_d} (\theta^{Ca}_{cdz})^{1/100}   (10)

P(\theta^{Co}_{dz}) \propto (\theta^{Co}_{dz})^{1/100} \exp(-\theta^{Co}_{dz})   (11)
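As a non-authoritative illustration of how the block densities above combine, the per-component log joint density for one data case can be accumulated block by block. The parameter layout and the use of scipy.stats here are assumptions for the sketch, not the interface of the provided template.

import numpy as np
from scipy.stats import multivariate_normal, poisson

def log_joint_per_component(x_r, x_b, x_ca, x_co, params, z):
    """Sketch: log P(X = x, Z = z | theta) for a single case and component z.
    params is an assumed dict of arrays; x_ca holds zero-based category indices."""
    lp = np.log(params["mix"][z])                                   # log P(Z = z | theta^M)
    # Real block: multivariate Gaussian with component mean/covariance
    lp += multivariate_normal.logpdf(x_r, mean=params["mu"][z], cov=params["Sigma"][z])
    # Binary block: independent Bernoullis, Eq. (4)
    theta_b = params["theta_B"][:, z]
    lp += np.sum(x_b * np.log(theta_b) + (1 - x_b) * np.log(1 - theta_b))
    # Categorical block: one probability vector per dimension, Eq. (5)
    for d, c in enumerate(x_ca):
        lp += np.log(params["theta_Ca"][d][int(c), z])
    # Count block: independent Poissons, Eq. (6)
    lp += np.sum(poisson.logpmf(x_co, params["theta_Co"][:, z]))
    return lp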
a. (5 pts) Derive an expression for the marginal probability P(X = x|θ).
b. (5 pts) Derive an expression for the posterior distribution P(Z = z|X = x, θ).
c. (5 pts) Explain how the log marginal likelihood \sum_{n=1}^{N} \log P(X = x_n \mid \theta) and the log posterior \log P(Z = z \mid X = x, \theta) can be computed in numerically stable ways using the log-sum-exp trick.
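For reference, the trick subtracts the largest per-component log joint before exponentiating; a minimal numpy sketch (scipy.special.logsumexp would serve equally well) is:

import numpy as np

def log_marginal_and_log_posterior(log_joint):
    """log_joint: shape (K,), holding log P(X = x, Z = z | theta) for z = 1..K.
    Returns log P(X = x | theta) and log P(Z = z | X = x, theta) stably."""
    m = np.max(log_joint)
    log_marginal = m + np.log(np.sum(np.exp(log_joint - m)))   # log-sum-exp
    log_posterior = log_joint - log_marginal                    # normalize in log space
    return log_marginal, log_posterior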
d. (10 pts) Starting from the regularized lower bound on the log marginal likelihood given by the expression below, derive the EM algorithm for this model.

Q(\theta) = \sum_{n=1}^{N} \left( E_{q_n(z)}[\log P(X = x_n, Z = z \mid \theta)] + H(q_n(z)) \right) + \log P(\theta)   (12)

q_n(z) = P(Z = z \mid X = x_n, \theta)   (13)
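One way to organize the resulting updates is the standard E-step/M-step loop sketched below. The helper names (init_params, log_joint_all, map_update_blocks) are placeholders standing in for whatever per-block MAP updates the derivation in this part produces; they are not functions from the provided template.

import numpy as np

def em_fit(data, K, iterations=50):
    """Illustrative EM skeleton only."""
    theta = init_params(data, K)                      # assumed initializer
    history = []
    for it in range(iterations):
        # E-step: responsibilities q_n(z) = P(Z = z | x_n, theta), via log-sum-exp
        log_joint = log_joint_all(data, theta)        # assumed helper, shape (N, K)
        m = log_joint.max(axis=1, keepdims=True)
        log_marginal = m[:, 0] + np.log(np.exp(log_joint - m).sum(axis=1))
        q = np.exp(log_joint - log_marginal[:, None])
        # M-step: maximize Q(theta) block by block using the MAP updates from part d
        theta = map_update_blocks(data, q, theta)     # assumed per-block updates
        history.append(log_marginal.sum())            # track the log marginal likelihood
    return theta, history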
e. (10 pts) Let X_i be a variable in X and X_{-i} be the vector of all variables except for X_i. Derive an expression for the posterior predictive distribution P(X_i \mid X_{-i} = x_{-i}, \theta). (Hint: you will need a different expression for each type of variable.)
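A useful starting point (a sketch only, not the full per-type answer asked for) is the law of total probability over the mixture indicator: the predictive distribution mixes component conditionals with posterior weights computed from the observed variables. Within a component, the second factor simplifies for binary, categorical, and count variables, while a real X_i requires conditioning the component Gaussian on the remaining real variables.

P(X_i = x_i \mid X_{-i} = x_{-i}, \theta) = \sum_{z=1}^{K} P(Z = z \mid X_{-i} = x_{-i}, \theta)\, P(X_i = x_i \mid Z = z, X_{-i} = x_{-i}, \theta)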
f. (20 pts) Starting from the provided template (mixture.py), implement a class for this model including the functions fit, posterior, log_posterior, log_maginal_likelihood, predict,
set_model_params, and get_model_params. Note that the predict function will only be tested
for cases where the missing dimension is real-valued. For full credit, all of your implementations must use
numerically stable computations.
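A bare skeleton of the class surface described in this part might look like the following. The constructor arguments are assumptions, and the method names simply mirror those listed above (including the template's spelling of log_maginal_likelihood); the graded implementation must follow the provided mixture.py.

import numpy as np

class MixtureModel:
    """Illustrative skeleton only, not the provided template."""
    def __init__(self, K=3):
        self.K = K
        self.params = None

    def fit(self, X, iterations=50):
        ...  # run EM (part d) for the given iteration budget

    def log_posterior(self, X):
        ...  # log P(Z = z | x_n, theta) for every case, via log-sum-exp

    def posterior(self, X):
        return np.exp(self.log_posterior(X))

    def log_maginal_likelihood(self, X):
        ...  # sum_n log P(x_n | theta), computed stably

    def predict(self, X):
        ...  # posterior-mean imputation of the missing real dimension (parts e/h)

    def set_model_params(self, params):
        self.params = params

    def get_model_params(self):
        return self.params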
g. (5 pts) Use your implementation of the EM algorithm to learn optimal model parameters for K = 3
using the Q1 training data. Use 50 EM iterations. Provide a plot of the log marginal likelihood as a
function of the EM iteration.
h. (10 pts) For each test case in Q1 test data, exactly one real variable is missing (specified by a nan
value). Devise a method for choosing the number of mixture components, learn a mixture model, and use
it to make predictions for the missing data values. The value you should predict is the posterior mean
of the missing observation: E_{P(X_i \mid x_{-i}, \theta)}[x_i]. Use the provided code to save your predictions to the file
q1_prediction.npy and upload this file to Gradescope for scoring. As your answer to this question,
report your final prediction error and explain how you chose the number of mixture components. Note that
MSE will be used as the prediction error metric, and that you will likely need to increase the number of
learning iterations when increasing K.
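One reasonable (but not prescribed) way to choose the number of components is to hold out part of the training data and compare validation log marginal likelihoods across candidate K values; a sketch, assuming the class skeleton above:

import numpy as np

def select_K(train, candidates=(2, 3, 4, 5, 8), val_fraction=0.2, seed=0):
    """Illustrative model selection by held-out log marginal likelihood."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(train))
    n_val = int(val_fraction * len(train))
    val, fit_data = train[idx[:n_val]], train[idx[n_val:]]
    scores = {}
    for K in candidates:
        model = MixtureModel(K=K)
        model.fit(fit_data, iterations=50 + 25 * K)   # larger K gets more iterations
        scores[K] = model.log_maginal_likelihood(val)
    return max(scores, key=scores.get), scores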
2. (30 points) Semi-Supervised Learning with Neural Networks: In this problem, you will experiment with semi-supervised learning using neural networks. The network architecture is specified below. The hidden layer h will use 32 units with ReLU non-linearity and is fully connected to the inputs x \in \mathbb{R}^D. There are two output layers. The first output layer, \hat{x}, has the same size as the input x and will use linear units. The x \to h \to \hat{x} path in the model is a basic auto-encoder network. We denote the function corresponding to this half of the network by \hat{x} = f^a(x, \theta). The second, parallel output layer is a multi-class classification layer that is fully connected to the hidden units and will use a softmax non-linearity to produce a probabilistic class output \hat{y}. The x \to h \to \hat{y} path in the model is thus a standard feed-forward neural network classifier. We denote the function corresponding to this half of the network by \hat{y} = f^c(x, \theta). We represent the class labels using a one-hot encoding such that y_c = 1 indicates that the example belongs to class c.
[Figure: network architecture]
Input Layer (784): x
Hidden Layer (32): h (w1: 784×32, b1: 32)
Class Probability Output (10): ŷ (w3: 32×10, b3: 10)
Reconstruction Layer (784): x̂ (w2: 32×784, b2: 784)
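For concreteness, the two-headed architecture in the figure can be written as a small module. This sketch uses a current PyTorch API rather than the 0.2.0 release named below, and the class and attribute names are assumptions, not those of the provided nn.py template.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AutoEncoderClassifier(nn.Module):
    """x -> h (32, ReLU) -> {x_hat (784, linear), y_logits (10, softmax in the loss)}."""
    def __init__(self, d_in=784, d_hidden=32, n_classes=10):
        super().__init__()
        self.enc = nn.Linear(d_in, d_hidden)       # w1: 784x32, b1: 32
        self.dec = nn.Linear(d_hidden, d_in)        # w2: 32x784, b2: 784
        self.cls = nn.Linear(d_hidden, n_classes)   # w3: 32x10,  b3: 10

    def forward(self, x):
        h = F.relu(self.enc(x))
        x_hat = self.dec(h)                         # linear reconstruction head
        y_logits = self.cls(h)                      # softmax deferred to the loss
        return x_hat, y_logits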
In semi-supervised learning, the main idea is that the available data include a mix of both labeled and
un-labeled examples. To learn the model, we define a composite loss that includes both an auto-encoder
component and a classification loss component. Each data case consists of a feature vector xn and a one-hot
encoded class label yn. When a class label is not available for a given data case, ycn = 0 for all c. The
learning problem is given below, where α is a parameter that trades off between the auto-encoder loss (mean squared error) and the classification loss (cross-entropy). In the questions below, you can implement this
network using your choice of pytorch (version 0.2.0) or tensorflow (version 1.3).
\theta^* = \arg\min_\theta \frac{1}{N} \sum_{n=1}^{N} \left( -\alpha \sum_{c=0}^{9} y_{cn} \log \hat{y}_{cn} + (1 - \alpha) \frac{1}{D} \sum_{d=1}^{D} (x_{dn} - \hat{x}_{dn})^2 \right)
a. (5 pts) Explain in theory how to compute this objective function in a numerically stable way, giving
equations to support your approach. Next, explain how your approach can be implemented in your chosen
framework.
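A common way to keep the cross-entropy term stable is to work with the log-softmax of the logits rather than softmax followed by log. The masking of unlabeled cases via all-zero one-hot rows comes from the problem statement above; everything else in this sketch is an implementation assumption.

import torch
import torch.nn.functional as F

def composite_loss(x, y_onehot, x_hat, y_logits, alpha):
    """(1/N) sum_n [ -alpha * sum_c y_cn log y_hat_cn + (1-alpha) * (1/D) ||x_n - x_hat_n||^2 ].
    Unlabeled cases have all-zero rows in y_onehot, so their cross-entropy term vanishes."""
    log_probs = F.log_softmax(y_logits, dim=1)              # stable: avoids log(softmax(.))
    ce = -(y_onehot * log_probs).sum(dim=1)                  # zero for unlabeled rows
    mse = ((x - x_hat) ** 2).mean(dim=1)                     # (1/D) * squared error per case
    return (alpha * ce + (1.0 - alpha) * mse).mean()         # average over the batch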
b. (10 pts) Starting from the provided template (nn.py), implement a Scikit-Learn compatible class for the
model shown above including objective, fit, predict_x, predict_y, set_model_params,
and get_model_params functions. As your answer to this question, describe your approach to learning
in detail (parameter initialization, optimization algorithm, stepsize selection and convergence rules used,
acceleration techniques, etc.), and submit your commented code for auto grading as described above.
c. (5 pts) Using the provided Q2 training data set, learn the model using a range of values of α between
0 and 1. Provide plots of the classification error rate and the auto-encoder loss on the training data as a
function of α.
d. (10 pts) Select a cross-validation approach and use it with the provided training data to optimize as many
of the network and learning hyper-parameters as you can (this can include the number of hidden units as well
as the number of hidden layers). Use the provided Q2 test data to make classification predictions for all of
the test data cases. Use the provided code to save your predictions to the file q2_prediction.npy
and upload this file to Gradescope for scoring. As your answer to this question, report your best prediction
error, describe which hyper-parameters you tried to optimize, and what optimal values you found for them.
Support your explanations with appropriately chosen plots.
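A simple grid search with K-fold cross-validation over a few hyper-parameters could be organized as below. SemiSupervisedNN and its constructor arguments are placeholders for whatever wrapper your implementation exposes, not the template's actual interface; predict_y is the method name listed in part b.

import numpy as np
from itertools import product
from sklearn.model_selection import KFold

def grid_search(X, Y, alphas=(0.1, 0.5, 0.9), hidden_sizes=(32, 64, 128), n_splits=5):
    """Illustrative grid search over alpha and hidden layer width."""
    labeled = Y.sum(axis=1) > 0                      # error rate is only defined on labeled cases
    best, results = None, {}
    for alpha, h in product(alphas, hidden_sizes):
        errors = []
        for train_idx, val_idx in KFold(n_splits=n_splits, shuffle=True, random_state=0).split(X):
            model = SemiSupervisedNN(hidden=h, alpha=alpha)   # assumed wrapper class
            model.fit(X[train_idx], Y[train_idx])
            val = val_idx[labeled[val_idx]]                   # labeled validation cases only
            pred = model.predict_y(X[val]).argmax(axis=1)
            errors.append(np.mean(pred != Y[val].argmax(axis=1)))
        results[(alpha, h)] = np.mean(errors)
        if best is None or results[(alpha, h)] < results[best]:
            best = (alpha, h)
    return best, results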