## Description

1. Nature’s rejection sampler, 30 points. A light source emits photons whose angle x is normally distributed around θ:

$$p(x \mid \theta) = \mathcal{N}(x \mid \mu = \theta,\ \sigma^2 = 4^2)$$

Then the photons pass through a grating, which absorbs photons with a probability that depends on their angle. We denote the event that a photon makes it through the grating by saying that the Bernoulli variable g = 1. For this particular grating, p(g|x, θ) = Ber(g|f(x, θ)) where

$$f(x, \theta) = \frac{\sin^2\!\big(5(x - \theta)\big)}{25 \sin^2(x - \theta)} \tag{1.1}$$

The resulting unnormalized probability p(x, g = 1 | θ = 0) is shown below as a function of x.

(a) [4 points] Use numerical integration (i.e. a brute-force sum over many values) to estimate the fraction of photons that get absorbed on average, p(g = 0 | θ = 0). That is, estimate $\int p(x, g = 0 \mid \theta = 0)\,dx$ by summing over 10,000 values of x, equally spaced from −20 to 20.
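A minimal sketch of one way to do this, assuming NumPy/SciPy; the helper name f, the grid construction, and the explicit handling of the removable singularity at x = θ (where the limit of f is 1) are our own choices:

```python
import numpy as np
from scipy.stats import norm

def f(x, theta=0.0):
    # Grating transmission probability (1.1). The ratio has a removable
    # singularity wherever sin(x - theta) = 0; its limit there is 1.
    u = x - theta
    with np.errstate(divide="ignore", invalid="ignore"):
        vals = np.sin(5 * u) ** 2 / (25 * np.sin(u) ** 2)
    return np.where(np.isclose(np.sin(u), 0.0), 1.0, vals)

xs = np.linspace(-20, 20, 10_000)        # 10,000 equally spaced values of x
dx = xs[1] - xs[0]
p_x = norm.pdf(xs, loc=0.0, scale=4.0)   # p(x | theta = 0), sigma = 4
p_absorbed = np.sum(p_x * (1.0 - f(xs)) * dx)  # Riemann-sum estimate
print(p_absorbed)
```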

(b) [4 points] Implement a rejection sampler to sample from p(x | g = 1, θ = 0). Hint: since it’s a pmf, p(g | x, θ = 0) ≤ 1. Another hint: doing ancestral sampling in the above model forms a rejection sampler. Plot a histogram of 10,000 accepted samples, and print the fraction of accepted samples.
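A possible ancestral-sampling implementation, reusing f from the sketch above; the function name and loop structure are ours:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

def rejection_sample(n_target, theta=0.0):
    # Ancestral sampling: draw x ~ p(x|theta), then g ~ Ber(f(x, theta)).
    # Keeping only the x's with g = 1 yields samples from p(x | g = 1, theta).
    accepted, n_proposed = [], 0
    while len(accepted) < n_target:
        x = rng.normal(theta, 4.0)
        n_proposed += 1
        if rng.uniform() < f(x, theta):  # g = 1 with probability f(x, theta)
            accepted.append(x)
    return np.array(accepted), n_proposed

samples, n_proposed = rejection_sample(10_000)
plt.hist(samples, bins=100)
plt.show()
print(10_000 / n_proposed)  # fraction of proposals accepted
```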

(c) [4 points] Implement a self-normalized importance sampler to estimate the fraction of all photons that get absorbed, p(g = 0 | θ = 0). Use p(x | θ = 0) as a proposal distribution. As a reminder, the formula for a self-normalized importance sampler for the expectation of a function f over an (unnormalized) target distribution p̃(z) and proposal distribution q(z) is:

$$\hat{e}(x_1, x_2, \ldots, x_K) = \frac{\frac{1}{K}\sum_{i=1}^{K} f(x_i)\,\frac{\tilde{p}(x_i)}{q(x_i)}}{\frac{1}{K}\sum_{j=1}^{K} \frac{\tilde{p}(x_j)}{q(x_j)}}, \quad \text{where } x_i \overset{\text{iid}}{\sim} q(x) \tag{1.2}$$

If you can derive a simpler expression by canceling terms, you don’t have to implement the entire formula. Print the estimate of the fraction of photons that are absorbed, using 1000 samples.
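Since the proposal q(x) = p(x | θ = 0) here equals the normalized distribution over x, every weight p̃(xᵢ)/q(xᵢ) cancels to 1 and (1.2) collapses to a plain Monte Carlo average. A sketch under that simplification, reusing f and rng from above:

```python
K = 1000
xs = rng.normal(0.0, 4.0, size=K)   # x_i ~ q(x) = p(x | theta = 0)
# With q equal to the normalized target over x, every weight is 1, so the
# self-normalized estimator reduces to the sample mean of
# p(g = 0 | x, theta = 0) = 1 - f(x, 0).
print(np.mean(1.0 - f(xs, 0.0)))    # estimate of p(g = 0 | theta = 0)
```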

(d) [4 points] A physicist is trying to estimate the location θ of the center of a light source. She has a Cauchy prior p(θ) = 1 / (10π(1 + (θ/10)²)). She observes a photon emitted from the source at position x = 1.7. Plot the unnormalized density p(x = 1.7, g = 1, θ) as a function of θ.
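A plotting sketch, reusing f from part (a); the θ range and grid resolution are arbitrary choices:

```python
from scipy.stats import norm
import matplotlib.pyplot as plt

def joint(theta, x=1.7):
    # p(x, g = 1, theta) = p(theta) * p(x | theta) * p(g = 1 | x, theta)
    prior = 1.0 / (10 * np.pi * (1 + (theta / 10) ** 2))  # Cauchy(0, 10) prior
    return prior * norm.pdf(x, loc=theta, scale=4.0) * f(x, theta)

thetas = np.linspace(-30, 30, 5000)
plt.plot(thetas, joint(thetas))
plt.xlabel("theta"); plt.ylabel("p(x = 1.7, g = 1, theta)")
plt.show()
```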

(e) [10 points] Write a Metropolis-Hastings sampler to sample from p(θ | x = 1.7, g = 1). For the proposal distribution, use a Gaussian centered around the current sample. Fiddle with the proposal distribution and the number of iterations of your sampler until the histogram of samples looks roughly like the true unnormalized density. Plot the histogram of samples from your chain.
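A random-walk Metropolis sketch, reusing joint and rng from above; the proposal width, chain length, initialization, and burn-in are exactly the knobs the question asks you to fiddle with:

```python
def mh_chain(n_iters=100_000, prop_std=5.0, theta0=0.0):
    # Metropolis-Hastings with a symmetric Gaussian proposal, so the
    # acceptance probability is just the ratio of unnormalized densities.
    theta, samples = theta0, np.empty(n_iters)
    for i in range(n_iters):
        proposal = theta + prop_std * rng.normal()
        if rng.uniform() < joint(proposal) / joint(theta):
            theta = proposal
        samples[i] = theta
    return samples

chain = mh_chain()
plt.hist(chain[1000:], bins=200)  # drop some burn-in before plotting
plt.show()
```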

(f) [4 points] The scientist wants to know how confident she should be, based on her observation of the photon, that her instrument is calibrated to within 3 degrees around zero. Use samples from your MH chain to estimate the posterior probability p(−3 < θ < 3 | x = 1.7, g = 1).
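Given a chain from the previous part, the posterior probability is just the fraction of (post-burn-in) samples landing in the interval:

```python
post = chain[1000:]                        # discard burn-in
print(np.mean((post > -3) & (post < 3)))   # estimate of p(-3 < theta < 3 | x, g = 1)
```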

2. Gradient estimators, 20 points. All else being equal, it’s useful for a gradient estimator to be unbiased. The unbiasedness of a gradient estimator guarantees that, if we decay the step size and run stochastic gradient descent for long enough (see Robbins & Monro), it will converge to a local optimum.

The standard REINFORCE, or score-function, estimator is defined as:

$$\hat{g}_{\text{SF}}[f] = f(b)\,\frac{\partial}{\partial\theta}\log p(b \mid \theta), \quad b \sim p(b \mid \theta) \tag{2.1}$$

(a) [5 points] First, let’s warm up with the score function. Prove that the score function has zero expectation, i.e. $\mathbb{E}_{p(x \mid \theta)}\left[\nabla_\theta \log p(x \mid \theta)\right] = 0$. Assume that you can swap the derivative and integral operators.
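For orientation, one standard route (a sketch, using exactly the allowed swap of derivative and integral) is:

$$\mathbb{E}_{p(x \mid \theta)}\left[\nabla_\theta \log p(x \mid \theta)\right] = \int p(x \mid \theta)\,\frac{\nabla_\theta\, p(x \mid \theta)}{p(x \mid \theta)}\,dx = \nabla_\theta \int p(x \mid \theta)\,dx = \nabla_\theta\, 1 = 0$$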

(b) [5 points] Show that REINFORCE is unbiased: $\mathbb{E}_{p(b \mid \theta)}\left[f(b)\,\frac{\partial}{\partial\theta}\log p(b \mid \theta)\right] = \frac{\partial}{\partial\theta}\,\mathbb{E}_{p(b \mid \theta)}\left[f(b)\right]$.

(c) [5 points] Show that REINFORCE with a fixed baseline is still unbiased, i.e. show that $\mathbb{E}_{p(b \mid \theta)}\left[(f(b) - c)\,\frac{\partial}{\partial\theta}\log p(b \mid \theta)\right] = \frac{\partial}{\partial\theta}\,\mathbb{E}_{p(b \mid \theta)}\left[f(b)\right]$ for any fixed c.

(d) [5 points] If the baseline depends on b, then REINFORCE will in general give biased gradient estimates. Give an example where $\mathbb{E}_{p(b \mid \theta)}\left[(f(b) - c(b))\,\frac{\partial}{\partial\theta}\log p(b \mid \theta)\right] \neq \frac{\partial}{\partial\theta}\,\mathbb{E}_{p(b \mid \theta)}\left[f(b)\right]$ for some function c(b), and show that it is biased.

The takeaway is that you can use a baseline to reduce the variance of REINFORCE, but not one that depends on the current action.
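The takeaway can also be checked empirically. A small sketch with a concrete model of our own choosing (Bernoulli(θ) with f(b) = b, for which the true gradient is exactly 1): the plain and fixed-baseline estimators average to the true gradient, while a b-dependent baseline does not:

```python
import numpy as np

rng = np.random.default_rng(0)
theta, K = 0.3, 1_000_000

b = (rng.uniform(size=K) < theta).astype(float)  # b ~ Ber(theta)
score = b / theta - (1 - b) / (1 - theta)        # d/dtheta log p(b|theta)
f_b = b                                          # f(b) = b, so dE[f]/dtheta = 1

print(np.mean(f_b * score))              # plain REINFORCE: ~1 (unbiased)
print(np.mean((f_b - 0.5) * score))      # fixed baseline c = 0.5: still ~1
print(np.mean((f_b - b) * score))        # baseline c(b) = b: exactly 0, biased
```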

3. Comparing variances of gradient estimators, 25 points. If we restrict ourselves to consider only unbiased gradient estimators, then the main (and perhaps only) property we need to worry about is the variance of our estimators. In general, optimizing with a lower-variance unbiased estimator will converge faster than with a high-variance unbiased estimator. However, which estimator has the lowest variance can depend on the function being optimized. In this question, we’ll look at which gradient estimators scale to large numbers of parameters, by computing their variance as a function of dimension.

For simplicity, we’ll consider a toy problem. The goal will be to estimate the gradients of the expectation of a sum of D independent one-dimensional Gaussians. Each Gaussian has unit variance, and its mean is given by an element of the D-dimensional parameter vector θ:

$$f(x) = \sum_{d=1}^{D} x_d \tag{3.1}$$

$$L(\theta) = \mathbb{E}_{x \sim p(x \mid \theta)}\left[f(x)\right] = \mathbb{E}_{x \sim \mathcal{N}(\theta, I)}\left[\sum_{d=1}^{D} x_d\right] \tag{3.2}$$

(a) [4 points] As a warm-up, compute the variance of a single-sample simple Monte Carlo estimator of the objective L(θ):

$$\hat{L}_{\text{MC}} = \sum_{d=1}^{D} x_d, \quad \text{where each } x_d \overset{\text{iid}}{\sim} \mathcal{N}(\theta_d, 1) \tag{3.3}$$

That is, compute $\mathbb{V}\big[\hat{L}_{\text{MC}}\big]$ as a function of D.
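A quick numerical sanity check of whatever closed form you derive, assuming NumPy; the values of D and the number of replications are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
n_rep = 200_000
for D in (1, 10, 100):
    x = rng.normal(np.zeros(D), 1.0, size=(n_rep, D))  # x_d ~ N(theta_d = 0, 1)
    L_hat = x.sum(axis=1)                              # single-sample MC estimate
    print(D, L_hat.var())                              # compare with your formula
```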

(b) [5 points] The score-function, or REINFORCE, estimator with a baseline has the form:

$$\hat{g}_{\text{SF}}[f] = \left[f(x) - c(\theta)\right]\frac{\partial}{\partial\theta}\log p(x \mid \theta), \quad x \sim p(x \mid \theta) \tag{3.4}$$

Derive a closed form for this gradient estimator for the objective defined above as a deterministic function of ε, a D-dimensional vector of standard normals. Set the baseline to $c(\theta) = \sum_{d=1}^{D} \theta_d$. Hint: when simplifying $\frac{\partial}{\partial\theta}\log p(x \mid \theta)$, you shouldn’t take the derivative through x, even if it depends on θ. To help keep track of what is being differentiated, you can use the notation $\frac{\partial}{\partial\theta} g(\bar{\theta}, \theta)$ to denote taking the derivative only w.r.t. the second θ.

(c) [8 points] Derive the variance of the above gradient estimator. Because gradients are D-dimensional vectors, their covariance is a D × D matrix. To make things easier, we’ll consider only the variance of the gradient with respect to the first element of the parameter vector, θ₁. That is, derive the scalar value $\mathbb{V}\big[\hat{g}^{\text{SF}}_1\big]$ as a function of D. Hint: the third moment of a standard normal is 0, and the fourth moment is 3. As a sanity check, consider the case where D = 1.
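The estimator can also be simulated directly from definition (3.4) to check your variance formula; this sketch assumes θ = 0 and uses the Gaussian score $\frac{\partial}{\partial\theta_1}\log \mathcal{N}(x \mid \theta, I) = x_1 - \theta_1$:

```python
import numpy as np

rng = np.random.default_rng(0)
D, n_rep = 10, 500_000
theta = np.zeros(D)

x = rng.normal(theta, 1.0, size=(n_rep, D))  # x ~ N(theta, I), n_rep draws
f_x = x.sum(axis=1)                          # f(x) = sum_d x_d
c = theta.sum()                              # baseline c(theta) = sum_d theta_d
score_1 = x[:, 0] - theta[0]                 # d/dtheta_1 log N(x | theta, I)
g1 = (f_x - c) * score_1                     # first element of g_SF
print(g1.mean(), g1.var())                   # mean ~ true gradient; check var vs D
```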

(d) [8 points] Next, let’s look at the gold standard of gradient estimators, the reparameterization gradient estimator, where we reparameterize x = T(θ, ε):

$$\hat{g}_{\text{REPARAM}}[f] = \frac{\partial f}{\partial x}\frac{\partial x}{\partial\theta}, \quad \varepsilon \sim p(\varepsilon) \tag{3.5}$$

In this case, we can use the reparameterization x = θ + ε, with ε ∼ N(0, I). Derive this gradient estimator for ∇θL(θ), and give $\mathbb{V}\big[\hat{g}^{\text{REPARAM}}_1\big]$ as a function of D.
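A matching check for the reparameterization estimator, under the reparameterization x = θ + ε given above (fair warning: because f is linear in x, the code spells out most of the derivation):

```python
import numpy as np

rng = np.random.default_rng(0)
D, n_rep = 10, 500_000
theta = np.zeros(D)

eps = rng.normal(size=(n_rep, D))     # eps ~ N(0, I)
x = theta + eps                       # reparameterization x = T(theta, eps)
dfdx = np.ones_like(x)                # df/dx_d = 1, since f(x) = sum_d x_d
g_rep_1 = dfdx[:, 0]                  # times dx_1/dtheta_1 = 1 under this map
print(g_rep_1.mean(), g_rep_1.var())  # compare with your derivation
```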

4. Complete course quiz, 5 points. We’re planning to update the prerequisites for this course next year. To help us plan, go to this course’s page on Quercus, and fill out the short quiz on which courses you took before this one, and how well they prepared you for the different parts of this course.
