STA414/STA2014 and CSC412/CSC2506 Assignment 2 version 2: Stochastic Variational Inference in the TrueSkill Model

$30.00

Category: You will Instantly receive a download link for .zip solution file upon Payment || To Order Original Work Click Custom Order?

Description

5/5 - (5 votes)

The goal of this assignment is to get you familiar with the basics of Bayesian inference in large models
with continuous latent variables, and the basics of stochastic variational inference.
Background We’ll implement a variant of the TrueSkill model, a player ranking system for competitive
games originally developed for Halo 2. It is a generalization of the Elo rating system in Chess. For the
curious, the original 2007 NIPS paper introducing the trueskill paper can be found here: http://papers.
nips.cc/paper/3079-trueskilltm-a-bayesian-skill-rating-system.pdf
This assignment is based on one developed by Carl Rasmussen at Cambridge for his course on probabilistic
machine learning: http://mlg.eng.cam.ac.uk/teaching/4f13/1920/
0.1 Model definition
We’ll consider a slightly simplified version of the original trueskill model. We assume that each player has a
true, but unknown skill zi 2 R. We use N to denote the number of players.
The prior. The prior over each player’s skill is a standard normal distribution, and all player’s skills are
a prior independent.
The likelihood. For each observed game, the probability that player i beats player j, given the player’s
skills zA and zB, is:
p(A beat B|zA, zB) = (zi zj )
where
(y) = 1
1 + exp(y)
There can be more than one game played between a pair of players, and in this case the outcome of each
game is independent given the players’ skills. We use M to denote the number of games.
The data. The data will be an array of game outcomes. Each row contains a pair of player indices. The
first index in each pair is the winner of the game, the second index is the loser. If there were M games
played, then the array has shape M ⇥ 2.
1
1 Implementing the model [10 points]
(a) [2 points] Implement a function log prior that computes the log of the prior over all player’s skills.
Specifically, given a K ⇥ N array where each row is a setting of the skills for all N players, it returns
a K ⇥ 1 array, where each row contains a scalar giving the log-prior for that set of skills.
(b) [3 points] Implement a function logp a beats b that, given a pair of skills za and zb evaluates the
log-likelihood that player with skill za beat player with skill zb under the model detailed above. To
ensure numerical stability, use the function log1pexp that computes log(1 + exp(x)) in a numerically
stable way. This function is provided by StatsFuns.jl and imported already, and also by Python’s
numpy.
(c) [3 points] Assuming all game outcomes are i.i.d. conditioned on all players’ skills, implement a
function all games log likelihood that takes a batch of player skills zs and a collection of observed
games games and gives a batch of log-likelihoods for those observations. Specifically, given a K ⇥ N
array where each row is a setting of the skills for all N players, and an M ⇥ 2 array of game outcomes,
it returns a K ⇥ 1 array, where each row contains a scalar giving the log-likelihood of all games for
that set of skills. Hint: You should be able to write this function without using for loops, although you
might want to start that way to make sure what you’ve written is correct. If A is an array of integers,
you can index the corresponding entries of another matrix B for every entry in A by writing B[A].
(d) [2 points] Implement a function joint log density which combines the log-prior and log-likelihood
of the observations to give p(z1, z2,…,zN , all game outcomes)
2 Examining the posterior for only two players and toy data [10
points]
To get a feel for this model, we’ll first consider the case where we only have 2 players, A and B. We’ll
examine how the prior and likelihood interact when conditioning on di↵erent sets of games.
Provided in the starter code is a function skillcontour! which evaluates a provided function on a grid of
zA and zB’s and plots the isocontours of that function. As well there is a function plot line equal skill!.
We have included an example for how you can use these functions.
We also provided a function two player toy games which produces toy data for two players. I.e.
two player toy games(5,3) produces a dataset where player A wins 5 games and player B wins 3 games.
(a) [2 points] For two players A and B, plot the isocontours of the joint prior over their skills. Also plot
the line of equal skill, zA = zB. Hint: you’ve already implemented the log of the likelihood function.
(b) [2 points] Plot the isocontours of the likelihood function. Also plot the line of equal skill, zA = zB.
(c) [2 points] Plot isocountours of the joint posterior over zA and zB given that player A beat player B
in one match. Since the contours don’t depend on the normalization constant, you can simply plot
the isocontours of the log of joint distribution of p(zA, zB, A beat B) Also plot the line of equal skill,
zA = zB.
(d) [2 points] Plot isocountours of the joint posterior over zA and zB given that 10 matches were played,
and player A beat player B all 10 times. Also plot the line of equal skill, zA = zB.
(e) [2 points] Plot isocountours of the joint posterior over zA and zB given that 20 matches were played,
and each player beat the other 10 times. Also plot the line of equal skill, zA = zB.
For all plots, label both axes.
2
3 Stochastic Variational Inference on Two Players and Toy Data
[18 points]
One nice thing about a Bayesian approach is that it separates the model specification from the approximate inference strategy. The original Trueskill paper from 2007 used message passing. Carl Rasmussen’s
assignment uses Gibbs sampling, a form of Markov Chain Monte Carlo. We’ll use gradient-based stochastic
variational inference, which wasn’t invented until around 2014.
In this question we will optimize an approximate posterior distribution with stochastic variational inference to approximate the true posterior.
(a) [5 points] Implement a function elbo which computes an unbiased estimate of the evidence lower
bound. As discussed in class, the ELBO is equal to the KL divergence between the true posterior
p(z|data), and an approximate posterior, q(z|data), plus an unknown constant. Use a fully-factorized
Gaussian distribution for q(z|data). This estimator takes the following arguments:
• params, the parameters of the approximate posterior q(z|data).
• A function logp, which is equal to the true posterior plus a constant. This function must take a
batch of samples of z. If we have N players, we can consider B-many samples from the joint over
all players’ skills. This batch of samples zs will be an array with dimensions (N,B).
• num samples, the number of samples to take.
This function should return a single scalar. Hint: You will need to use the reparamterization trick
when sampling zs.
(b) [2 points] Write a loss function called neg toy elbo that takes variational distribution parameters
and an array of game outcomes, and returns the negative elbo estimate with 100 samples.
(c) [5 points] Write an optimization function called fit toy variational dist which takes initial variational parameters, and the evidence. Inside it will perform a number of iterations of gradient descent
where for each iteration :
(a) Compute the gradient of the loss with respect to the parameters using automatic di↵erentiation.
(b) Update the parameters by taking an lr-scaled step in the direction of the descending gradient.
(c) Report the loss with the new parameters (using @info or print statements)
(d) On the same set of axes plot the target distribution in red and the variational approximation in
blue.
Return the parameters resulting from training.
(d) [2 points] Initialize a variational distribution parameters and optimize them to approximate the joint
where we observe player A winning 1 game. Report the final loss. Also plot the optimized variational
approximation contours (in blue) aand the target distribution (in red) on the same axes.
(e) [2 points] Initialize a variational distribution parameters and optimize them to approximate the joint
where we observe player A winning 10 games. Report the final loss. Also plot the optimized variational
approximation contours (in blue) aand the target distribution (in red) on the same axes.
(f) [2 points] Initialize a variational distribution parameters and optimize them to approximate the joint
where we observe player A winning 10 games and player B winning 10 games. Report the final loss.
Also plot the optimized variational approximation contours (in blue) aand the target distribution (in
red) on the same axes.
For all plots, label both axes.
3
4 Approximate inference conditioned on real data [24 points]
Load the dataset from tennis data.mat containing two matrices:
• W is a 107 by 1 matrix, whose i’th entry is the name of player i.
• G is a 1801 by 2 matrix of game outcomes (actually tennis matches), one row per game. The first
column contains the indices of the players who won. The second column contains the indices of the
player who lost.
Compute the following using your code from the earlier questions in the assignment, but conditioning on
the tennis match outcomes:
(a) [1 point] For any two players i and j, p(zi, zj |all games) is always proportional to p(zi, zj , all games). In
general, are the isocontours of p(zi, zj |all games) the same as those of p(zi, zj |games between i and j)?
That is, do the games between other players besides i and j provide information about the skill of
players i and j? A simple yes or no suces.
Hint: One way to answer this is to draw the graphical model for three players, i, j, and k, and the
results of games between all three pairs, and then examine conditional independencies. If you do this,
there’s no need to include the graphical models in your assignment.
(b) [5 points] Write a new optimization function fit variational dist like the one from the previous
question except it does not plot anything. Initialize a variational distribution and fit it to the joint
distribution with all the observed tennis games from the dataset. Report the final negative ELBO
estimate after optimization.
(c) [2 points] Plot the approximate mean and variance of all players, sorted by skill. For example, in
Julia, you can use: perm = sortperm(means); plot(means[perm], yerror=exp.(logstd[perm]))
There’s no need to include the names of the players.
(d) [2 points] List the names of the 10 players with the highest mean skill under the variational model.
(e) [3 points] Plot the joint approximate posterior over the skills of Roger Federer and Rafael Nadal. Use
the approximate posterior that you fit in question 4 part b.
(f) [5 points] Derive the exact probability under a factorized Guassian over two players’ skills that one
has higher skill than the other, as a function of the two means and variances over their skills. Express
your answer in terms of the cumulative distribution function of a one-dimensional Gaussian random
variable.
• Hint 1: Use a linear change of variables yA, yB = zA zB, zB. What does the line of equal skill
look like after this transformation?
• Hint 2: If X ⇠ N (µ, ⌃), then AX ⇠ N (Aµ, A⌃AT ) where A is a linear transformation.
• Hint 3: Marginalization in Gaussians is easy: if X ⇠ N (µ, ⌃), then the ith element of X has a
marginal distribution Xi ⇠ N (µi, ⌃ii)
(g) [2 points] Using the formula from part c, compute the exact probability under your approximate
posterior that Roger Federer has higher skill than Rafael Nadal. Then, estimate it using simple Monte
Carlo with 10000 examples, again using your approximate posterior.
(h) [2 points] Using the formula from part c, compute the probability that Roger Federer is better than
the player with the lowest mean skill. Compute this quantity exactly, and then estimate it using simple
Monte Carlo with 10000 examples, again using your approximate posterior.
(i) [2 points] Imagine that we knew ahead of time that we were examining the skills of top tennis players,
and so changed our prior on all players to Normal(10, 1). Which answers in this section would this
change? No need to show your work, just list the letters of the questions whose answers would be
di↵erent in expectation.
4