1 Generator: real inference
The model has the following form:
$$Y = f(Z; W) + \epsilon, \tag{1}$$

$$Z \sim \mathrm{N}(0, I_d), \quad \epsilon \sim \mathrm{N}(0, \sigma^2 I_D), \quad d < D. \tag{2}$$

Here $f(Z; W)$ maps the latent factors $Z$ to the image $Y$, where $W$ collects all the connection weights and bias terms of the ConvNet. Adopting the language of the EM algorithm, the complete-data model is given by

$$\log p(Y, Z; W) = \log\left[p(Z)\, p(Y \mid Z, W)\right] \tag{3}$$

$$= -\frac{1}{2\sigma^2} \|Y - f(Z; W)\|^2 - \frac{1}{2} \|Z\|^2 + \text{const}. \tag{4}$$

The observed-data model is obtained by integrating out $Z$: $p(Y; W) = \int p(Z)\, p(Y \mid Z, W)\, dZ$. The posterior distribution of $Z$ is given by $p(Z \mid Y, W) = p(Y, Z; W) / p(Y; W) \propto p(Z)\, p(Y \mid Z, W)$ as a function of $Z$.

We want to maximize the observed-data log-likelihood, which is $L(W) = \sum_{i=1}^{n} \log p(Y_i; W) = \sum_{i=1}^{n} \log \int p(Y_i, Z_i; W)\, dZ_i$. The gradient of $L(W)$ can be calculated according to the following well-known fact that underlies the EM algorithm:

$$\frac{\partial}{\partial W} \log p(Y; W) = \frac{1}{p(Y; W)} \frac{\partial}{\partial W} \int p(Y, Z; W)\, dZ \tag{5}$$

$$= \mathrm{E}_{p(Z \mid Y, W)}\left[\frac{\partial}{\partial W} \log p(Y, Z; W)\right]. \tag{6}$$

The expectation with respect to $p(Z \mid Y, W)$ can be approximated by drawing samples from $p(Z \mid Y, W)$ and then computing the Monte Carlo average.

The Langevin dynamics for sampling $Z \sim p(Z \mid Y, W)$ iterates

$$Z_{\tau+1} = Z_\tau + \delta U_\tau + \frac{\delta^2}{2}\left[\frac{1}{\sigma^2}\left(Y - f(Z_\tau; W)\right) \frac{\partial}{\partial Z} f(Z_\tau; W) - Z_\tau\right], \tag{7}$$

where $\tau$ denotes the time step for the Langevin sampling, $\delta$ is the step size, and $U_\tau$ denotes a random vector that follows $\mathrm{N}(0, I_d)$.

The stochastic gradient algorithm can be used for learning. In each iteration, for each $i$, only a single copy of $Z_i$ is sampled from $p(Z_i \mid Y_i, W)$ by running a finite number of steps of Langevin dynamics starting from the current value of $Z_i$, i.e., the warm start. With $\{Z_i\}$ sampled in this manner, we can update the parameter $W$ based on the gradient $L'(W)$, whose Monte Carlo approximation is:

$$L'(W) \approx \sum_{i=1}^{n} \frac{\partial}{\partial W} \log p(Y_i, Z_i; W) \tag{8}$$

$$= -\sum_{i=1}^{n} \frac{\partial}{\partial W} \frac{1}{2\sigma^2} \|Y_i - f(Z_i; W)\|^2 \tag{9}$$

$$= \sum_{i=1}^{n} \frac{1}{\sigma^2} \left(Y_i - f(Z_i; W)\right) \frac{\partial}{\partial W} f(Z_i; W). \tag{10}$$

Algorithm 1 describes the details of the learning and sampling algorithm.
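Before turning to the full algorithm, the Langevin update in equation (7) and the gradient in equation (10) can be sketched on a toy problem. The sketch below is not the ConvNet generator of this project: it assumes a linear generator $f(Z; W) = WZ$ so that both derivatives can be written by hand, and the function names (`make_toy_data`, `langevin_step`, `grad_W`, `train`) and all default hyperparameters are hypothetical choices made for illustration. The loss it records is the mean squared reconstruction error, which is one plausible reading of "loss over iteration".

```python
import numpy as np

def make_toy_data(n, D, d, sigma, seed=0):
    """Draw Y = W_true Z + eps from a linear generator (toy assumption)."""
    rng = np.random.default_rng(seed)
    W_true = rng.standard_normal((D, d))
    Z_true = rng.standard_normal((n, d))
    return Z_true @ W_true.T + sigma * rng.standard_normal((n, D))

def langevin_step(Z, Y, W, sigma, delta, rng):
    """One step of equation (7) for all rows of Z, with f(Z; W) = Z @ W.T."""
    residual = Y - Z @ W.T                         # Y - f(Z; W), shape (n, D)
    grad_log_post = residual @ W / sigma**2 - Z    # bracket in equation (7)
    noise = rng.standard_normal(Z.shape)           # U_tau ~ N(0, I_d)
    return Z + delta * noise + 0.5 * delta**2 * grad_log_post

def grad_W(Z, Y, W, sigma):
    """Equation (10) for the linear toy model, summed over examples."""
    residual = Y - Z @ W.T
    return residual.T @ Z / sigma**2               # shape (D, d)

def train(Y, d, sigma=0.5, delta=0.1, n_langevin=10, n_iters=300,
          lr=5e-5, seed=0):
    """Alternate warm-start Langevin inference and a gradient step on W."""
    rng = np.random.default_rng(seed)
    n, D = Y.shape
    W = 0.1 * rng.standard_normal((D, d))
    Z = rng.standard_normal((n, d))                # persistent chains (warm start)
    losses = []
    for _ in range(n_iters):
        for _ in range(n_langevin):                # inference step
            Z = langevin_step(Z, Y, W, sigma, delta, rng)
        W = W + lr * grad_W(Z, Y, W, sigma)        # learning step
        recon_error = np.sum((Y - Z @ W.T) ** 2, axis=1)
        losses.append(float(np.mean(recon_error)))
    return W, Z, losses
```

A call such as `W, Z, losses = train(make_toy_data(200, 10, 2, 0.5), d=2)` returns the learned weights, the inferred latent factors, and the loss curve. For the actual project, $f$ is a ConvNet, and the two derivatives would normally come from automatic differentiation rather than from closed-form expressions.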
Algorithm 1 Generator: real inference
Input: (1) training examples $\{Y_i, i = 1, ..., n\}$, (2) number of Langevin steps $l$, (3) number of learning iterations $T$.
Output: (1) learned parameters $W$, (2) inferred latent factors $\{Z_i, i = 1, ..., n\}$.
1: Let $t \leftarrow 0$, initialize $W$.
2: Initialize $Z_i$, for $i = 1, ..., n$.
3: repeat
4: Inference step: For each $i$, run $l$ steps of Langevin dynamics to sample $Z_i \sim p(Z_i \mid Y_i, W)$ with warm start, i.e., starting from the current $Z_i$; each step follows equation (7).
5: Learning step: Update $W \leftarrow W + \gamma_t L'(W)$, where $L'(W)$ is computed according to equation (10), with learning rate $\gamma_t$.
6: Let $t \leftarrow t + 1$.
7: until $t = T$

1.1 TO DO
For the lion-tiger category, learn a model with a 2-dim latent factor vector. Fill in the blank part of ./GenNet/
Show:
(1) Reconstructed images of training images, using the inferred z from training images.
(2) Randomly generated images, using randomly sampled z.
(3) Generated images with linearly interpolated latent factors, each dimension of z running over (−2, 2). For example, interpolate 8 points over (−2, 2) for each dimension of z, which gives an 8 × 8 panel of images. You should be able to see tigers gradually change into lions.
(4) Plot of loss over iteration.

2 Descriptor: real sampling
The descriptor model is as follows:

$$p_\theta(Y) = \frac{1}{Z(\theta)} \exp\left[f_\theta(Y)\right] p_0(Y), \tag{11}$$

where $p_0(Y)$ is the reference distribution, such as Gaussian white noise

$$p_0(Y) \propto \exp\left(-\|Y\|^2 / (2\sigma^2)\right). \tag{12}$$

The scoring function $f_\theta(Y)$ is defined by a bottom-up ConvNet whose parameters are denoted by $\theta$. The normalizing constant $Z(\theta) = \int \exp\left[f_\theta(Y)\right] p_0(Y)\, dY$ is analytically intractable. The energy function is

$$E_\theta(Y) = \frac{1}{2\sigma^2} \|Y\|^2 - f_\theta(Y). \tag{13}$$

$p_\theta(Y)$ is an exponential tilting of $p_0$.

Suppose we observe training examples $\{Y_i, i = 1, ..., n\}$ from an unknown data distribution $P_{\text{data}}(Y)$. Maximum likelihood learning seeks to maximize the log-likelihood function

$$L(\theta) = \frac{1}{n} \sum_{i=1}^{n} \log p_\theta(Y_i). \tag{14}$$

If the sample size $n$ is large, the maximum likelihood estimator minimizes the Kullback-Leibler divergence $\mathrm{KL}(P_{\text{data}} \| p_\theta)$ from the data distribution $P_{\text{data}}$ to the model distribution $p_\theta$. The gradient of $L(\theta)$ is

$$L'(\theta) = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial}{\partial \theta} f_\theta(Y_i) - \mathrm{E}_\theta\left[\frac{\partial}{\partial \theta} f_\theta(Y)\right], \tag{15}$$

where $\mathrm{E}_\theta$ denotes the expectation with respect to $p_\theta(Y)$. The key to the above identity is that $\frac{\partial}{\partial \theta} \log Z(\theta) = \mathrm{E}_\theta\left[\frac{\partial}{\partial \theta} f_\theta(Y)\right]$.

The expectation in equation (15) is analytically intractable and has to be approximated by MCMC, such as Langevin dynamics, which iterates the following step:

$$Y_{\tau+1} = Y_\tau - \frac{\delta^2}{2} \frac{\partial}{\partial Y} E_\theta(Y_\tau) + \delta U_\tau = Y_\tau - \frac{\delta^2}{2}\left[\frac{Y_\tau}{\sigma^2} - \frac{\partial}{\partial Y} f_\theta(Y_\tau)\right] + \delta U_\tau, \tag{16}$$

where $\tau$ indexes the time steps of the Langevin dynamics, $\delta$ is the step size, and $U_\tau \sim \mathrm{N}(0, I)$ is Gaussian white noise. The Langevin dynamics relaxes $Y_\tau$ to a low-energy region, while the noise term provides randomness and variability. A Metropolis-Hastings step may be added to correct for the finite step size $\delta$. We can also use Hamiltonian Monte Carlo for sampling the generative ConvNet.

We can run $\tilde{n}$ parallel chains of Langevin dynamics according to (16) to obtain the synthesized examples $\{\tilde{Y}_i, i = 1, ..., \tilde{n}\}$. The Monte Carlo approximation to $L'(\theta)$ is

$$L'(\theta) \approx \frac{1}{n} \sum_{i=1}^{n} \frac{\partial}{\partial \theta} f_\theta(Y_i) - \frac{1}{\tilde{n}} \sum_{i=1}^{\tilde{n}} \frac{\partial}{\partial \theta} f_\theta(\tilde{Y}_i) \tag{17}$$

$$= \frac{\partial}{\partial \theta}\left[\frac{1}{\tilde{n}} \sum_{i=1}^{\tilde{n}} E_\theta(\tilde{Y}_i) - \frac{1}{n} \sum_{i=1}^{n} E_\theta(Y_i)\right],$$

which is used to update $\theta$.

To make Langevin sampling easier, we use the mean images of the training images as the sampling starting point. That is, we down-sample each training image to a 1×1 patch and up-sample this patch to the size of the training image. We use cold start for Langevin sampling, i.e., at each iteration, we start sampling from the mean images.

Algorithm 2 describes the details of the learning and sampling algorithm.

Algorithm 2 Descriptor: real sampling
Input: (1) training examples $\{Y_i, i = 1, ..., n\}$, (2) number of Langevin steps $l$, (3) number of learning iterations $T$.
Output: (1) estimated parameters $\theta$, (2) synthesized examples $\{\tilde{Y}_i, i = 1, ..., n\}$.
1: Let $t \leftarrow 0$, initialize $\theta$.
2: repeat
3: For $i = 1, ..., n$, initialize $\tilde{Y}_i$ to be the mean image of $Y_i$.
4: Run $l$ steps of Langevin dynamics to evolve $\tilde{Y}_i$, each step following equation (16).
5: Update $\theta_{t+1} = \theta_t + \gamma_t L'(\theta_t)$, with step size $\gamma_t$, where $L'(\theta_t)$ is computed according to equation (17).
6: Let $t \leftarrow t + 1$.
7: until $t = T$

2.1 TO DO
For the egret category, learn a descriptor model. Fill in the blank part of ./DesNet/
Show:
(1) Synthesized images.
(2) Plot of training loss over iteration.

3 What to submit
Write a report to show your results, and zip the report together with your code.
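
As a reference point for Section 2, the Langevin update in equation (16), the parameter update in equation (17), and the cold start from mean images can be sketched on a toy descriptor. The sketch below is not the ConvNet descriptor of this project: it assumes a linear scoring function $f_\theta(Y) = \theta^\top Y$, for which $\partial f_\theta / \partial Y = \theta$ and $\partial f_\theta / \partial \theta = Y$, and it treats each example as a flat vector. The function names (`mean_image`, `langevin_sample`, `train_descriptor`), the default hyperparameters, and the use of the gradient norm as the quantity tracked over iterations are all hypothetical choices made for illustration.

```python
import numpy as np

def mean_image(Y):
    """Cold-start initialization: replace each example by its own mean value,
    the vector analogue of down-sampling to 1x1 and up-sampling back."""
    return np.repeat(Y.mean(axis=1, keepdims=True), Y.shape[1], axis=1)

def langevin_sample(Y0, theta, sigma, delta, n_steps, rng):
    """Run n_steps of equation (16) for the toy score f_theta(Y) = Y @ theta."""
    Y = Y0.copy()
    for _ in range(n_steps):
        grad_energy = Y / sigma**2 - theta       # dE/dY, since df/dY = theta
        noise = rng.standard_normal(Y.shape)     # U_tau ~ N(0, I)
        Y = Y - 0.5 * delta**2 * grad_energy + delta * noise
    return Y

def train_descriptor(Y, sigma=1.0, delta=0.3, n_steps=30, n_iters=100,
                     lr=0.5, seed=0):
    """Follow the steps of Algorithm 2 with one chain per training example."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(Y.shape[1])
    grad_norms = []
    for _ in range(n_iters):
        # Cold start: every iteration restarts the chains from the mean images.
        Y_syn = langevin_sample(mean_image(Y), theta, sigma, delta, n_steps, rng)
        # Equation (17) with df/dtheta = Y: observed mean minus synthesized mean.
        grad = Y.mean(axis=0) - Y_syn.mean(axis=0)
        theta = theta + lr * grad
        grad_norms.append(float(np.linalg.norm(grad)))
    return theta, Y_syn, grad_norms
```

In this linear toy case the update simply drives the mean of the synthesized examples toward the mean of the training examples. With a ConvNet $f_\theta$, both derivatives would normally come from automatic differentiation, and the matched statistics are the learned features rather than raw means.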