Description
1 Generator: real inference
The model has the following form:
$$Y = f(Z; W) + \epsilon, \tag{1}$$
$$Z \sim \mathrm{N}(0, I_d), \quad \epsilon \sim \mathrm{N}(0, \sigma^2 I_D), \quad d < D. \tag{2}$$
Here $f(Z; W)$ maps the latent factors $Z$ to the image $Y$, and $W$ collects all the connection weights
and bias terms of the ConvNet.
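As a minimal sketch of equations (1) and (2), the snippet below draws samples from the model with a toy linear map standing in for the ConvNet. The linear form f(Z; W) = WZ, the function name `generate`, and the sizes D = 16, d = 2 are illustrative assumptions, not part of the assignment code.

```python
import numpy as np

def generate(W, n, sigma, rng):
    """Draw n samples Y = f(Z; W) + eps, with a toy linear f(Z; W) = W @ Z (assumption)."""
    D, d = W.shape
    Z = rng.standard_normal((n, d))            # Z ~ N(0, I_d), one row per example
    eps = sigma * rng.standard_normal((n, D))  # eps ~ N(0, sigma^2 I_D)
    Y = Z @ W.T + eps                          # row i is W @ Z_i + eps_i
    return Y, Z

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 2))               # D = 16, d = 2, so d < D
Y, Z = generate(W, n=5, sigma=0.3, rng=rng)
```

In the assignment itself, f is a top-down ConvNet, so only the sampling structure carries over.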
Adopting the language of the EM algorithm, the complete data model is given by
$$\log p(Y, Z; W) = \log\left[p(Z)\, p(Y \mid Z, W)\right] \tag{3}$$
$$= -\frac{1}{2\sigma^2}\|Y - f(Z; W)\|^2 - \frac{1}{2}\|Z\|^2 + \text{const}. \tag{4}$$
The observed-data model is obtained by integrating out $Z$: $p(Y; W) = \int p(Z)\, p(Y \mid Z, W)\, dZ$.
The posterior distribution of $Z$ is given by $p(Z \mid Y, W) = p(Y, Z; W)/p(Y; W) \propto p(Z)\, p(Y \mid Z, W)$
as a function of $Z$.
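Because the posterior is proportional to the complete-data density as a function of Z, the complete-data log-likelihood (up to its additive constant) is all that sampling needs. A minimal sketch, again assuming a toy linear generator f(Z; W) = WZ and a hypothetical function name:

```python
import numpy as np

def complete_data_loglik(Y, Z, W, sigma):
    """log p(Y, Z; W) up to an additive constant, for a toy linear f(Z; W) = W @ Z."""
    resid = Y - W @ Z                        # Y - f(Z; W)
    recon_term = -0.5 * resid @ resid / sigma**2
    prior_term = -0.5 * Z @ Z                # from Z ~ N(0, I_d)
    return recon_term + prior_term
```

Viewed as a function of Z with Y and W fixed, this is also the unnormalized log-posterior of Z.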
We want to maximize the observed-data log-likelihood, which is
$$L(W) = \sum_{i=1}^n \log p(Y_i; W) = \sum_{i=1}^n \log \int p(Y_i, Z_i; W)\, dZ_i.$$
The gradient of $L(W)$ can be calculated according to the following well-known fact that underlies the EM algorithm:
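$$\frac{\partial}{\partial W}\log p(Y; W) = \frac{1}{p(Y; W)}\frac{\partial}{\partial W}\int p(Y, Z; W)\, dZ \tag{5}$$
$$= \mathrm{E}_{p(Z \mid Y, W)}\left[\frac{\partial}{\partial W}\log p(Y, Z; W)\right]. \tag{6}$$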
The expectation with respect to $p(Z \mid Y, W)$ can be approximated by drawing samples
from $p(Z \mid Y, W)$ and then computing the Monte Carlo average.
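The step from (5) to (6) can be spelled out as follows, assuming differentiation and integration can be interchanged and using the identity $\frac{\partial}{\partial W} p = p \, \frac{\partial}{\partial W} \log p$:

```latex
\begin{aligned}
\frac{1}{p(Y; W)} \frac{\partial}{\partial W} \int p(Y, Z; W)\, dZ
&= \int \frac{1}{p(Y; W)} \frac{\partial}{\partial W} p(Y, Z; W)\, dZ \\
&= \int \frac{p(Y, Z; W)}{p(Y; W)} \frac{\partial}{\partial W} \log p(Y, Z; W)\, dZ \\
&= \int p(Z \mid Y, W)\, \frac{\partial}{\partial W} \log p(Y, Z; W)\, dZ \\
&= \mathrm{E}_{p(Z \mid Y, W)}\left[\frac{\partial}{\partial W} \log p(Y, Z; W)\right].
\end{aligned}
```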
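The Langevin dynamics for sampling $Z \sim p(Z \mid Y, W)$ iterates
$$Z_{\tau+1} = Z_\tau + \delta U_\tau + \frac{\delta^2}{2}\left[\frac{1}{\sigma^2}\left(Y - f(Z_\tau; W)\right)\frac{\partial}{\partial Z} f(Z_\tau; W) - Z_\tau\right], \tag{7}$$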
where τ denotes the time step for the Langevin sampling, δ is the step size, and Uτ
denotes a random vector that follows N(0, Id).
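One Langevin update of equation (7) might look like the sketch below. It assumes a toy linear generator f(Z; W) = WZ, for which the bracketed term reduces to W^T (Y - WZ) / sigma^2 - Z; with a real ConvNet that gradient would come from backpropagation instead. The function name is hypothetical.

```python
import numpy as np

def langevin_step_z(Z, Y, W, sigma, delta, rng):
    """One step of equation (7) for a toy linear generator f(Z; W) = W @ Z (assumption)."""
    U = rng.standard_normal(Z.shape)             # U_tau ~ N(0, I_d)
    resid = Y - W @ Z                            # Y - f(Z_tau; W)
    grad_log_post = W.T @ resid / sigma**2 - Z   # d/dZ log p(Y, Z; W)
    return Z + delta * U + 0.5 * delta**2 * grad_log_post
```

The drift term moves Z toward higher posterior density, while the noise term keeps the chain from collapsing onto a single mode.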
The stochastic gradient algorithm can be used for learning, where in each iteration,
for each $Z_i$, only a single copy of $Z_i$ is sampled from $p(Z_i \mid Y_i, W)$ by running a finite
number of steps of Langevin dynamics starting from the current value of $Z_i$, i.e., the
warm start. With $\{Z_i\}$ sampled in this manner, we can update the parameter $W$ based
on the gradient $L'(W)$, whose Monte Carlo approximation is:
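$$L'(W) \approx \sum_{i=1}^n \frac{\partial}{\partial W}\log p(Y_i, Z_i; W) \tag{8}$$
$$= -\sum_{i=1}^n \frac{\partial}{\partial W}\frac{1}{2\sigma^2}\|Y_i - f(Z_i; W)\|^2 \tag{9}$$
$$= \sum_{i=1}^n \frac{1}{\sigma^2}\left(Y_i - f(Z_i; W)\right)\frac{\partial}{\partial W} f(Z_i; W). \tag{10}$$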
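For a toy linear generator f(Z; W) = WZ (an assumption made here so the gradient is available in closed form), equation (10) becomes a sum of outer products of residuals and latent vectors. The sketch below computes it for a whole batch; the function names are hypothetical, and `recon_loglik` is included only so the gradient can be checked numerically.

```python
import numpy as np

def recon_loglik(Y, Z, W, sigma):
    """The W-dependent part of sum_i log p(Y_i, Z_i; W), as in equation (9)."""
    resid = Y - Z @ W.T                    # rows are Y_i - W @ Z_i
    return -0.5 * np.sum(resid**2) / sigma**2

def grad_W(Y, Z, W, sigma):
    """Equation (10) for f(Z; W) = W @ Z: sum_i (Y_i - W Z_i) Z_i^T / sigma^2."""
    resid = Y - Z @ W.T
    return resid.T @ Z / sigma**2          # shape (D, d), same as W
```

In a deep-learning framework the same quantity would normally be obtained by automatic differentiation of the reconstruction loss.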
Algorithm 1 describes the details of the learning and sampling algorithm.
Algorithm 1 Generator: real inference
Input:
(1) training examples $\{Y_i, i = 1, ..., n\}$,
(2) number of Langevin steps $l$,
(3) number of learning iterations $T$.
Output:
(1) learned parameters $W$,
(2) inferred latent factors $\{Z_i, i = 1, ..., n\}$.
1: Let $t \leftarrow 0$, initialize $W$.
2: Initialize $Z_i$, for $i = 1, ..., n$.
3: repeat
4: Inference step: For each $i$, run $l$ steps of Langevin dynamics to sample $Z_i \sim p(Z_i \mid Y_i, W)$ with warm start, i.e., starting from the current $Z_i$; each step follows equation (7).
5: Learning step: Update $W \leftarrow W + \gamma_t L'(W)$, where $L'(W)$ is computed according to equation (10), with learning rate $\gamma_t$.
6: Let $t \leftarrow t + 1$.
7: until $t = T$
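The alternation in Algorithm 1 can be sketched end to end on synthetic data. This is not the ConvNet implementation expected in GenNet.py: it assumes a toy linear generator f(Z; W) = WZ, a constant learning rate, and hypothetical names and hyperparameter values chosen only to keep the example small and stable.

```python
import numpy as np

def train_generator(Y, d, n_langevin, n_iters, sigma=1.0, delta=0.1, lr=0.002, seed=0):
    """Toy version of Algorithm 1 with a linear generator f(Z; W) = W @ Z (assumption)."""
    rng = np.random.default_rng(seed)
    n, D = Y.shape
    W = 0.1 * rng.standard_normal((D, d))    # step 1: initialize W
    Z = rng.standard_normal((n, d))          # step 2: initialize every Z_i
    losses = []
    for t in range(n_iters):
        # Inference step: warm-start Langevin updates of all Z_i, equation (7).
        for _ in range(n_langevin):
            resid = Y - Z @ W.T
            grad_z = resid @ W / sigma**2 - Z
            Z = Z + delta * rng.standard_normal(Z.shape) + 0.5 * delta**2 * grad_z
        # Learning step: gradient ascent on W, equation (10).
        resid = Y - Z @ W.T
        losses.append(np.mean(np.sum(resid**2, axis=1)))  # reconstruction loss
        W = W + lr * resid.T @ Z / sigma**2
    return W, Z, losses

# Synthetic data drawn from a linear model with d = 2 latent factors.
rng = np.random.default_rng(1)
W_true = rng.standard_normal((8, 2))
Y = rng.standard_normal((50, 2)) @ W_true.T + rng.standard_normal((50, 8))
W, Z, losses = train_generator(Y, d=2, n_langevin=10, n_iters=300)
```

Note that Z is never re-initialized inside the loop, which is what the warm start in step 4 amounts to. The recorded `losses` list is the kind of quantity a loss-over-iteration plot would show.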
1.1 TO DO
For the lion-tiger category, learn a model with a 2-dimensional latent factor vector. Fill in the blank
part of ./GenNet/GenNet.py. Show:
(1) Reconstructed images of training images, using the inferred z from training images.
(2) Randomly generated images, using randomly sampled z.
(3) Generated images with linearly interpolated latent factors, with each dimension of z ranging over (−2, 2).
For example, interpolate 8 points in (−2, 2) for each dimension of z. You will then
get an 8 × 8 panel of images. You should be able to see the tigers gradually change into
lions.
(4) Plot of loss over iteration.
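For item (3), the 8 × 8 grid of latent vectors can be built as in the sketch below. The function name `latent_grid` is hypothetical, and including the endpoints −2 and 2 is an assumption about how the interpolation is meant.

```python
import numpy as np

def latent_grid(low=-2.0, high=2.0, steps=8):
    """Return a (steps * steps, 2) array of 2-dim latent vectors on a regular grid."""
    axis = np.linspace(low, high, steps)              # endpoints included
    z1, z2 = np.meshgrid(axis, axis, indexing="ij")   # z1 varies along rows, z2 along columns
    return np.stack([z1.ravel(), z2.ravel()], axis=1)

grid = latent_grid()
```

Each row of `grid` would be passed through the learned generator to produce one image, and the 64 images arranged row by row form the 8 × 8 panel.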
2 Descriptor: real sampling
The descriptor model is as follows:
$$p_\theta(Y) = \frac{1}{Z(\theta)}\exp\left[f_\theta(Y)\right] p_0(Y), \tag{11}$$
where p0(Y ) is the reference distribution such as Gaussian white noise
$$p_0(Y) \propto \exp\left(-\|Y\|^2 / 2\sigma^2\right). \tag{12}$$
The scoring function $f_\theta(Y)$ is defined by a bottom-up ConvNet whose parameters are
denoted by $\theta$. The normalizing constant $Z(\theta) = \int \exp\left[f_\theta(Y)\right] p_0(Y)\, dY$ is analytically
intractable. The energy function is
$$E_\theta(Y) = \frac{1}{2\sigma^2}\|Y\|^2 - f_\theta(Y). \tag{13}$$
pθ(Y ) is an exponential tilting of p0.
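To make equation (13) concrete, the sketch below evaluates the energy for a toy scoring function f_theta(Y) = sum_k theta_k tanh(Y_k). That scorer is an assumption chosen because its gradients are available in closed form; in the assignment f_theta is a bottom-up ConvNet. The function names are hypothetical.

```python
import numpy as np

def score(Y, theta):
    """Toy scoring function f_theta(Y) = sum_k theta_k * tanh(Y_k) (assumption)."""
    return np.sum(theta * np.tanh(Y))

def energy(Y, theta, sigma):
    """Equation (13): E_theta(Y) = ||Y||^2 / (2 sigma^2) - f_theta(Y)."""
    return 0.5 * np.sum(Y**2) / sigma**2 - score(Y, theta)
```

With theta = 0 the energy reduces to the quadratic term of the reference distribution, which is the sense in which the model tilts p0.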
Suppose we observe training examples $\{Y_i, i = 1, ..., n\}$ from an unknown data distribution $P_{\text{data}}(Y)$. Maximum likelihood learning seeks to maximize the log-likelihood
function
$$L(\theta) = \frac{1}{n}\sum_{i=1}^n \log p_\theta(Y_i). \tag{14}$$
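If the sample size $n$ is large, the maximum likelihood estimator minimizes the Kullback-Leibler divergence $\mathrm{KL}(P_{\text{data}} \| p_\theta)$ from the data distribution $P_{\text{data}}$ to the model distribution $p_\theta$. The gradient of $L(\theta)$ is
$$L'(\theta) = \frac{1}{n}\sum_{i=1}^n \frac{\partial}{\partial\theta} f_\theta(Y_i) - \mathrm{E}_\theta\left[\frac{\partial}{\partial\theta} f_\theta(Y)\right], \tag{15}$$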
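where $\mathrm{E}_\theta$ denotes the expectation with respect to $p_\theta(Y)$. The key to the above identity
is that $\frac{\partial}{\partial\theta}\log Z(\theta) = \mathrm{E}_\theta\left[\frac{\partial}{\partial\theta} f_\theta(Y)\right]$.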
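This identity follows directly from the definition of $Z(\theta)$, assuming differentiation and integration can be interchanged:

```latex
\begin{aligned}
\frac{\partial}{\partial\theta} \log Z(\theta)
&= \frac{1}{Z(\theta)} \int \frac{\partial}{\partial\theta} \exp\left[f_\theta(Y)\right] p_0(Y)\, dY \\
&= \int \frac{\partial}{\partial\theta} f_\theta(Y)\, \frac{1}{Z(\theta)} \exp\left[f_\theta(Y)\right] p_0(Y)\, dY \\
&= \mathrm{E}_\theta\left[\frac{\partial}{\partial\theta} f_\theta(Y)\right].
\end{aligned}
```

Since $\log p_\theta(Y_i) = f_\theta(Y_i) - \log Z(\theta) + \log p_0(Y_i)$ and $p_0$ does not depend on $\theta$, differentiating and averaging over $i$ gives equation (15).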
The expectation in equation (15) is analytically intractable and has to be approximated by MCMC, such as Langevin dynamics, which iterates the following step:
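$$Y_{\tau+1} = Y_\tau - \frac{\delta^2}{2}\frac{\partial}{\partial Y} E_\theta(Y_\tau) + \delta U_\tau$$
$$= Y_\tau - \frac{\delta^2}{2}\left[\frac{Y_\tau}{\sigma^2} - \frac{\partial}{\partial Y} f_\theta(Y_\tau)\right] + \delta U_\tau, \tag{16}$$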
where τ indexes the time steps of the Langevin dynamics, δ is the step size, and Uτ ∼
N(0, I) is Gaussian white noise. The Langevin dynamics relaxes Yτ to a low energy
region, while the noise term provides randomness and variability. A Metropolis-Hastings
step may be added to correct for the finite step size δ. We can also use Hamiltonian
Monte Carlo for sampling the generative ConvNet.
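A single update of equation (16) could be sketched as follows, without the optional Metropolis-Hastings correction. The toy scorer f_theta(Y) = sum_k theta_k tanh(Y_k) and the function name are assumptions; for a ConvNet scorer, the gradient with respect to Y would come from backpropagation.

```python
import numpy as np

def langevin_step_y(Y, theta, sigma, delta, rng):
    """One step of equation (16) for the toy scorer f_theta(Y) = sum(theta * tanh(Y))."""
    U = rng.standard_normal(Y.shape)             # U_tau ~ N(0, I)
    grad_f = theta * (1.0 - np.tanh(Y)**2)       # d f_theta / d Y
    grad_energy = Y / sigma**2 - grad_f          # d E_theta / d Y
    return Y - 0.5 * delta**2 * grad_energy + delta * U
```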
We can run $\tilde{n}$ parallel chains of Langevin dynamics according to (16) to obtain the
synthesized examples $\{\tilde{Y}_i, i = 1, ..., \tilde{n}\}$. The Monte Carlo approximation to $L'(\theta)$ is
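$$L'(\theta) \approx \frac{1}{n}\sum_{i=1}^n \frac{\partial}{\partial\theta} f_\theta(Y_i) - \frac{1}{\tilde{n}}\sum_{i=1}^{\tilde{n}} \frac{\partial}{\partial\theta} f_\theta(\tilde{Y}_i) \tag{17}$$
$$= \frac{\partial}{\partial\theta}\left[\frac{1}{\tilde{n}}\sum_{i=1}^{\tilde{n}} E_\theta(\tilde{Y}_i) - \frac{1}{n}\sum_{i=1}^n E_\theta(Y_i)\right],$$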
which is used to update θ.
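The first line of equation (17) is a difference of two sample averages. A minimal sketch, assuming the toy scorer f_theta(Y) = sum_k theta_k tanh(Y_k), whose derivative with respect to theta is simply tanh(Y); the function name is hypothetical:

```python
import numpy as np

def grad_theta(Y_obs, Y_syn):
    """Equation (17) for the toy scorer f_theta(Y) = sum(theta * tanh(Y)).

    Y_obs has shape (n, D) and Y_syn has shape (n_tilde, D); the result has shape (D,).
    """
    obs_term = np.tanh(Y_obs).mean(axis=0)   # (1/n) sum_i d f_theta(Y_i) / d theta
    syn_term = np.tanh(Y_syn).mean(axis=0)   # (1/n~) sum_i d f_theta(Y~_i) / d theta
    return obs_term - syn_term
```

The gradient vanishes when the synthesized examples match the observed ones in these statistics, which is the fixed point the learning aims for.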
To make Langevin sampling easier, we use the mean images of the training images as the
starting point for sampling. That is, we down-sample each training image to a 1×1 patch
and up-sample this patch to the size of the training image. We use cold start for Langevin
sampling, i.e., at each iteration, we start sampling from the mean images.
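One way to compute such a mean image is sketched below. It assumes an image stored as a height × width × channels array, down-sampling by averaging over all pixels of each channel, and up-sampling by repeating the resulting 1×1 patch; the handout does not say which resampling method is intended, and the function name is hypothetical.

```python
import numpy as np

def mean_image(img):
    """Replace every pixel by the per-channel mean of the image (H x W x C array)."""
    patch = img.mean(axis=(0, 1), keepdims=True)       # down-sample to a 1 x 1 x C patch
    return np.broadcast_to(patch, img.shape).copy()    # up-sample back to H x W x C
```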
Algorithm 2 describes the details of the learning and sampling algorithm.
Algorithm 2 Descriptor: real sampling
Input:
(1) training examples $\{Y_i, i = 1, ..., n\}$,
(2) number of Langevin steps $l$,
(3) number of learning iterations $T$.
Output:
(1) estimated parameters $\theta$,
(2) synthesized examples $\{\tilde{Y}_i, i = 1, ..., n\}$.
1: Let $t \leftarrow 0$, initialize $\theta$.
2: repeat
3: For $i = 1, ..., n$, initialize $\tilde{Y}_i$ to be the mean image of $Y_i$.
4: Run $l$ steps of Langevin dynamics to evolve $\tilde{Y}_i$, each step following equation (16).
5: Update $\theta_{t+1} = \theta_t + \gamma_t L'(\theta_t)$, with step size $\gamma_t$, where $L'(\theta_t)$ is computed according to equation (17).
6: Let $t \leftarrow t + 1$.
7: until $t = T$
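The loop structure of Algorithm 2 can be sketched on synthetic vectors. This is not the ConvNet implementation expected in DesNet.py: it assumes the toy scorer f_theta(Y) = sum_k theta_k tanh(Y_k), flattened "images" whose mean image is the vector filled with the example's mean value, a constant step size, and hypothetical names and hyperparameter values.

```python
import numpy as np

def train_descriptor(Y, n_langevin, n_iters, sigma=1.0, delta=0.1, lr=0.5, seed=0):
    """Toy version of Algorithm 2 with f_theta(Y) = sum(theta * tanh(Y)) (assumption)."""
    rng = np.random.default_rng(seed)
    n, D = Y.shape
    theta = np.zeros(D)                                   # step 1: initialize theta
    # Mean "image" of each flattened example: every entry set to the example mean.
    Y_mean = np.repeat(Y.mean(axis=1, keepdims=True), D, axis=1)
    for t in range(n_iters):
        Y_syn = Y_mean.copy()                             # step 3: cold start
        for _ in range(n_langevin):                       # step 4: equation (16)
            grad_f = theta * (1.0 - np.tanh(Y_syn)**2)
            grad_energy = Y_syn / sigma**2 - grad_f
            noise = rng.standard_normal(Y_syn.shape)
            Y_syn = Y_syn - 0.5 * delta**2 * grad_energy + delta * noise
        # Step 5: gradient ascent on theta, equation (17).
        grad = np.tanh(Y).mean(axis=0) - np.tanh(Y_syn).mean(axis=0)
        theta = theta + lr * grad
    return theta, Y_syn

# Synthetic "training images": 64 vectors of length 4 concentrated around 1.
rng = np.random.default_rng(1)
Y = 1.0 + 0.1 * rng.standard_normal((64, 4))
theta, Y_syn = train_descriptor(Y, n_langevin=20, n_iters=100)
```

Unlike the warm start of Algorithm 1, the synthesized examples are reset to the mean images at the top of every iteration.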
2.1 TO DO
For the egret category, learn a descriptor model. Fill in the blank part of ./DesNet/DesNet.py.
Show:
(1) Synthesized images.
(2) Plot of training loss over iteration.
3 What to submit
Write a report to show your results. And zip the report with your code.