CS689: Machine Learning – Homework 4

Introduction: Bayesian modeling is often most useful when the amount of data is low relative to the number of parameters, resulting in significant uncertainty about the optimal parameter values. In this assignment, you will experiment with a basic rejection sampling method for drawing posterior samples for a binary logistic regression model with a spherical multivariate normal prior on the parameters. The data vectors x = [x1, x2] are two-dimensional. The weight vector is thus w = [w1, w2] and the bias is b, so the complete parameter vector is θ = [w1, w2, b]. The prior has zero mean (µ0 = [0, 0, 0]) and a spherical covariance matrix Σ0 = 100 · I. The model is defined below.
P(Y = y | X = x, θ) = [1 / (1 + exp(−(wxᵀ + b)))]^[y=1] · [exp(−(wxᵀ + b)) / (1 + exp(−(wxᵀ + b)))]^[y=0]   (1)

P(θ | µ0, Σ0) = N(θ; µ0, Σ0)   (2)

Here [y = 1] and [y = 0] are indicator exponents (equal to 1 when the condition holds and 0 otherwise), so Equation (1) simply selects the sigmoid probability of the observed class.
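For reference, Equation (1) is the standard sigmoid likelihood and can be evaluated in a few lines. The sketch below is illustrative only (the function name and argument layout are assumptions, not part of the required blr.py interface):

    import numpy as np

    def likelihood(y, x, theta):
        """Evaluate P(Y = y | X = x, theta) from Equation (1).

        theta = [w1, w2, b], x = [x1, x2], y is 0 or 1.
        """
        w, b = np.asarray(theta[:2]), theta[2]
        p1 = 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))  # P(Y = 1 | x, theta)
        return p1 if y == 1 else 1.0 - p1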
The rejection sampling algorithm that you will implement is described in MLPP Section 23.3.3. Pseudocode for the sampler is listed below. This algorithm takes as input the number of desired samples S, a training data set D = {(xn, yn)}_{n=1:N}, and the prior parameters µ0 and Σ0, and produces samples from the distribution P(θ | D, µ0, Σ0). The sampler continues running until the desired number of samples S is obtained.
Algorithm 1 Smith and Gelfand Rejection Sampling
Inputs: S, D, µ0, Σ0
  θMLE ← argmaxθ L(D, θ)
  PMLE ← P(D | θMLE)
  for s from 1 to S do
    Accept ← False
    while Accept = False do
      θs ← mvnrnd(µ0, Σ0)
      Ps ← P(D | θs)
      u ← unirnd(0, 1)
      if Ps / PMLE ≥ u then
        Accept ← True
      end if
    end while
  end for
  ΘRS ← [θ1, …, θS]
  return ΘRS
In this algorithm, θMLE is the maximum likelihood estimate for the parameters θ. PMLE = P(D | θMLE) = ∏_{n=1}^{N} P(Y = yn | X = xn, θMLE) is the probability of the data under the parameters θMLE. mvnrnd(µ, Σ) is a function that returns a sample from a multivariate normal distribution with parameters µ and Σ. θs is a parameter vector sampled from the prior. Ps = P(D | θs) is the probability of the data under the parameters θs. unirnd(a, b) is a function that draws a sample from the continuous uniform distribution on [a, b]. ΘRS = [θ1, …, θS] is the set of samples produced by the algorithm.
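As a point of reference, Algorithm 1 translates almost line for line into Python using numpy for mvnrnd and unirnd. The sketch below is a naive, direct translation (helper and variable names are assumptions); it computes P(D | θ) as an explicit product, which is exactly the computation whose numerical behaviour Question 1a asks you to examine:

    import numpy as np

    def data_likelihood(X, y, theta):
        """P(D | theta): product of per-case probabilities from Equation (1)."""
        w, b = theta[:2], theta[2]
        p1 = 1.0 / (1.0 + np.exp(-(X @ w + b)))          # P(Y = 1 | x_n, theta) for every case
        return float(np.prod(np.where(y == 1, p1, 1.0 - p1)))

    def rejection_sampler_naive(S, X, y, mu0, Sigma0, theta_mle):
        """Direct translation of Algorithm 1 (no numerical safeguards).

        X: (N, 2) inputs, y: (N,) labels in {0, 1}, theta_mle = [w1, w2, b].
        Returns an (S, 3) array of posterior samples.
        """
        p_mle = data_likelihood(X, y, theta_mle)          # P(D | theta_MLE)
        samples = []
        for _ in range(S):
            while True:
                theta_s = np.random.multivariate_normal(mu0, Sigma0)   # mvnrnd(mu0, Sigma0)
                p_s = data_likelihood(X, y, theta_s)                   # P(D | theta_s)
                u = np.random.uniform(0.0, 1.0)                        # unirnd(0, 1)
                if p_s / p_mle >= u:                                   # rejection step
                    samples.append(theta_s)
                    break
        return np.array(samples)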
Questions:
1. (30 points) Implementation: In this question, you will implement the rejection sampler and several
other methods that will be used to evaluate the sampler.
a. (10 pts) As with many machine learning algorithms, a straightforward implementation of Algorithm 1 may be numerically unstable. In particular, when N is large, a naive computation of P(D|θMLE) and P(D|θs) may underflow. Explain why, and describe a numerically stable computation for the ratio Ps/PMLE that is required in the rejection step of the algorithm.
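One standard way to avoid this kind of underflow (sketched here under the same assumed data layout as above; the written explanation is still expected as your answer) is to work with log likelihoods and exponentiate only their difference:

    import numpy as np

    def log_data_likelihood(X, y, theta):
        """log P(D | theta) as a sum of per-case log probabilities."""
        w, b = theta[:2], theta[2]
        z = X @ w + b
        log_p1 = -np.logaddexp(0.0, -z)        # log P(Y = 1) = -log(1 + exp(-z))
        log_p0 = -z - np.logaddexp(0.0, -z)    # log P(Y = 0) = -z - log(1 + exp(-z))
        return float(np.sum(np.where(y == 1, log_p1, log_p0)))

    # Ps / PMLE >= u  is equivalent to  exp(log Ps - log PMLE) >= u,
    # and the difference of log likelihoods stays in a safe numerical range.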
b. (10 pts) Using your numerically stable implementation of Ps/PMLE, implement Algorithm 1. You should use scikit-learn's logistic regression code to obtain θMLE (be careful to set the estimator's C parameter properly during learning to obtain the MLE, and use a convergence tolerance of 1e-2). You should also use methods from numpy.random or scipy.stats to draw the required random samples. As your answer to this question, include the corresponding code in blr.py in a method named rejection_sampler.
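For the θMLE step, one way to approximate an unpenalized maximum likelihood fit with scikit-learn is to make the L2 penalty negligible via a very large C (the exact value below is an assumption, not a prescribed setting):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def fit_mle(X, y):
        """Approximate theta_MLE = [w1, w2, b] with an (effectively) unregularized fit."""
        clf = LogisticRegression(C=1e10, tol=1e-2)   # huge C ~ negligible penalty; tol per the assignment
        clf.fit(X, y)
        return np.concatenate([clf.coef_.ravel(), clf.intercept_])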
c. (10 pts) To help visualize the output of the sampling process, write a function that takes a set of parameters θ and plots the corresponding decision boundary. To accomplish this, your code should convert the implicit function representation of the decision boundary w1·x1 + w2·x2 + b = 0 to an explicit equation x2 = a·x1 + c for the line corresponding to the boundary, and then plot this line. As your answer to this question, derive and report the formulas for a and c in the equation x2 = a·x1 + c given w1, w2, and b. Include the corresponding code in your code submission in blr.py in a method named plot_boundary.
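A minimal plotting sketch is shown below; it rearranges w1·x1 + w2·x2 + b = 0 for x2 (which assumes w2 ≠ 0), but the derivation of a and c is still expected as part of your written answer:

    import numpy as np
    import matplotlib.pyplot as plt

    def plot_boundary(theta, ax=None, **line_kwargs):
        """Plot the decision boundary w1*x1 + w2*x2 + b = 0 as the line x2 = a*x1 + c."""
        w1, w2, b = theta
        a = -w1 / w2                       # slope (assumes w2 != 0)
        c = -b / w2                        # intercept
        ax = ax if ax is not None else plt.gca()
        x1 = np.linspace(-1.0, 1.0, 2)     # two endpoints suffice for a straight line
        ax.plot(x1, a * x1 + c, **line_kwargs)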
2. (40 points) Learning: The training set DTr contains 50 data cases. Let DTr_M be a data set consisting of the first M data cases in DTr. For M ∈ {10, 20, 30, 40, 50}, use your sampler implementation to draw a set of 100 samples ΘRS_M from the posterior P(θ | DTr_M, µ0, Σ0). Use the five sets of 100 samples ΘRS_10, …, ΘRS_50 to answer the following questions.¹

¹ For each of parts (a)–(d), the answer requires a total of five plots, one each for M ∈ {10, 20, 30, 40, 50}. The horizontal axis of each of these plots should correspond to the value of x1, and the vertical axis should correspond to x2. The x and y axis limits for these plots should be fixed to [−1, 1]. For part (e), the answer requires a total of five plots, one each for M ∈ {10, 20, 30, 40, 50}. The horizontal axis of each plot should correspond to the value of w1, and the vertical axis should correspond to w2. All five plots should have the same x axis limits as well as the same y axis limits.
a. (5 pts) For each M ∈ {10, 20, 30, 40, 50}, produce one scatter plot showing the data set DTr_M with points colored by class. Overlay the decision boundaries corresponding to the samples in ΘRS_M returned by your sampler.
b. (5 pts) For each M ∈ {10, 20, 30, 40, 50}, produce a scatter plot showing the data set DTr_M with points colored by class. Overlay the decision boundary corresponding to the posterior mean of the parameters, estimated as θMean_M = (1/S) Σ_{θ ∈ ΘRS_M} θ.
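The posterior mean is simply the average of the S sampled parameter vectors; for example, if the samples for a given M are stacked in an (S, 3) array (an assumed layout):

    theta_mean_M = samples_M.mean(axis=0)   # average of the 100 sampled [w1, w2, b] vectors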
c. (5 pts) For each M ∈ {10, 20, 30, 40, 50}, produce a scatter plot showing the data set DTr_M with points colored by class. Overlay the decision boundary corresponding to the posterior mode (the MAP estimate of the parameters) θMAP_M computed using DTr_M. The posterior mode θMAP can be obtained by calling scikit-learn's logistic regression code with an appropriate value of C derived from the prior. As your answer to this question, provide the requested plots along with the formula for C in terms of µ0 and Σ0.
d. (5 pts) For each M ∈ {10, 20, 30, 40, 50}, produce a scatter plot showing the data set DTr_M with points colored by class. Overlay the decision boundary corresponding to the MLE θMLE_M computed using DTr_M. The MLE θMLE can also be obtained by calling scikit-learn's logistic regression code with an appropriate value of C (use a convergence tolerance of 1e-2).
e. (5 pts) The above plots show estimated and sampled parameters using their decision boundaries in the space of the data. Another useful visualization is to look at the estimated and sampled parameters in parameter space. In this question we will focus on plotting the weight parameters in weight space. For each M ∈ {10, 20, 30, 40, 50}, produce one scatter plot showing the sampled values of the weights contained in ΘRS_M. In the same plot, also include points that correspond to the estimated weights in θMean_M, θMAP_M, and θMLE_M computed using DTr_M. Make sure to color code the points or use different marker styles. Also include a legend.
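One possible shape for the part (e) plot, using only standard matplotlib calls (the estimate variable names are assumptions carried over from the earlier parts):

    import matplotlib.pyplot as plt

    def plot_weight_space(samples_M, theta_mean, theta_map, theta_mle):
        """Scatter the sampled weights and the three point estimates in (w1, w2) space."""
        plt.scatter(samples_M[:, 0], samples_M[:, 1], s=10, alpha=0.5, label="posterior samples")
        plt.scatter(theta_mean[0], theta_mean[1], marker="s", label="posterior mean")
        plt.scatter(theta_map[0], theta_map[1], marker="^", label="MAP")
        plt.scatter(theta_mle[0], theta_mle[1], marker="x", label="MLE")
        plt.xlabel("w1")
        plt.ylabel("w2")
        plt.legend()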
f. (5 pts) Referring to your visualizations, explain in what way θMLE_M differs from θMean_M when M is small. Explain why this is the case.
g. (5 pts) Referring to your visualizations, explain what effect increasing the size M of the training data set has on the set of samples produced in ΘRS_M. Explain why this is the case.
h. (5 pts) Explain what effect increasing the amount of training data M should have on the relationship between θMLE_M, θMAP_M, and θMean_M according to theory. Referring to your visualizations, do you see this effect? Explain why or why not.
3. (30 points) Prediction: In this question, we will look at the properties of the Bayesian logistic regression posterior predictive distribution relative to making predictions using different plug-in estimators.
a. (5 pts) In order to use the S samples in ΘRS_M to compute the posterior predictive distribution for test data, you will implement a simple Monte Carlo approximation as shown below. As your answer to this question, include the corresponding code in blr.py in a method named predictive_distribution.

P(Y = y | X = x, DTr_M, µ0, Σ0) = ∫ P(Y = y | X = x, θ) P(θ | DTr_M, µ0, Σ0) dθ   (3)

≈ (1/S) Σ_{θ ∈ ΘRS_M} P(Y = y | X = x, θ)   (4)
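Equation (4) translates directly into an average of per-sample predicted probabilities. A vectorized sketch (array shapes and the function signature are assumptions):

    import numpy as np

    def predictive_distribution(X, samples):
        """Monte Carlo estimate of P(Y = 1 | x, D, mu0, Sigma0) for each row of X.

        X: (N, 2) test inputs; samples: (S, 3) posterior samples [w1, w2, b].
        Returns an (N,) array of predictive probabilities for class 1.
        """
        W, b = samples[:, :2], samples[:, 2]   # (S, 2) weights and (S,) biases
        Z = X @ W.T + b                        # (N, S) logits, one column per posterior sample
        P1 = 1.0 / (1.0 + np.exp(-Z))          # per-sample P(Y = 1 | x, theta_s)
        return P1.mean(axis=1)                 # average over the S samples, Equation (4)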
b. (20 pts) Next, for M ∈ {10, 20, 30, 40, 50}, compute the following estimates of the test set log likelihood (a sketch of one way to assemble these quantities follows the list). Provide a single plot showing the test set log likelihood versus the number of training cases for all of the methods. The result will be a single plot with four trend lines. Color code the lines and provide a legend.

• Bayesian inference: LB_M = Σ_{n=1}^{N} log P(Y = yn | X = xn, DTr_M, µ0, Σ0)
• Plug-in Mean: LMean_M = Σ_{n=1}^{N} log P(Y = yn | X = xn, θMean_M)
• Plug-in MAP: LMAP_M = Σ_{n=1}^{N} log P(Y = yn | X = xn, θMAP_M)
• Plug-in MLE: LMLE_M = Σ_{n=1}^{N} log P(Y = yn | X = xn, θMLE_M)
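A sketch of how the four curves might be assembled before plotting (it reuses the predictive_distribution helper sketched above; the plug-in estimate names are assumptions):

    import numpy as np

    def plugin_log_lik(X_test, y_test, theta):
        """Test set log likelihood using a single plug-in parameter vector."""
        w, b = theta[:2], theta[2]
        p1 = 1.0 / (1.0 + np.exp(-(X_test @ w + b)))
        return float(np.sum(np.where(y_test == 1, np.log(p1), np.log(1.0 - p1))))

    def bayes_log_lik(X_test, y_test, samples):
        """Test set log likelihood under the Monte Carlo posterior predictive (Equation 4)."""
        p1 = predictive_distribution(X_test, samples)
        return float(np.sum(np.where(y_test == 1, np.log(p1), np.log(1.0 - p1))))

    # For each M, collect bayes_log_lik plus plugin_log_lik evaluated at theta_mean_M,
    # theta_map_M, and theta_mle_M, then plot the four trend lines against M with a legend.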
c. (5 pts) Which method appears to perform best in terms of test set log likelihood? Explain the effect of
increasing the amount of training data M on the relative performance of this collection of methods.