Confidence bounds
1.1 (Wald's identity) Let $X_1, X_2, \ldots$ be a sequence of iid random variables. For $j \in \{0, 1\}$, under $H_j$ we have that $X_i \sim p_j$. Let $P_j(\cdot)$, $E_j(\cdot)$ denote the probability and expectation under $H_j$. Assume that the supports of $p_0$ and $p_1$ are equal and, furthermore, that $\sup_{x \in \mathrm{support}(p_0)} \frac{p_1(x)}{p_0(x)} \leq \kappa$. Fix some $\delta \in (0, 1)$. If $L_t = \prod_{i=1}^{t} \frac{p_1(X_i)}{p_0(X_i)}$ and $\tau = \min\{t : L_t > 1/\delta\}$, we showed in class that the false alarm probability satisfies $P_0(L_\tau > 1/\delta) \leq \delta$. Assume that $E_1[\tau] < \infty$. Show that
$$\frac{\log(1/\delta)}{KL(p_1\|p_0)} \leq E_1[\tau] \leq \frac{\log(\kappa/\delta)}{KL(p_1\|p_0)},$$
where $KL(p_1\|p_0) = \int p_1(x) \log\big(\frac{p_1(x)}{p_0(x)}\big)\,dx$ is the Kullback–Leibler divergence between $p_1$ and $p_0$.
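As a quick numerical sanity check (not required for the problem), here is a minimal simulation sketch that estimates $E_1[\tau]$ and compares it to the two bounds; the Bernoulli pair $p_0 = \mathrm{Bern}(0.4)$, $p_1 = \mathrm{Bern}(0.6)$ and all constants are illustrative assumptions, chosen only because they make $\kappa$ finite.

import numpy as np

rng = np.random.default_rng(0)
q0, q1, delta = 0.4, 0.6, 0.05                              # illustrative Bernoulli parameters
kl = q1*np.log(q1/q0) + (1 - q1)*np.log((1 - q1)/(1 - q0))  # KL(p1 || p0)
kappa = max(q1/q0, (1 - q1)/(1 - q0))                       # sup_x p1(x)/p0(x)

def stopping_time():
    # run the likelihood-ratio test under H1 until L_t > 1/delta and return tau
    log_L, t = 0.0, 0
    while log_L <= np.log(1/delta):
        x = rng.random() < q1                               # X_t ~ p1
        log_L += np.log(q1/q0) if x else np.log((1 - q1)/(1 - q0))
        t += 1
    return t

taus = [stopping_time() for _ in range(2000)]
# lower bound, Monte Carlo estimate of E_1[tau], upper bound
print(np.log(1/delta)/kl, np.mean(taus), np.log(kappa/delta)/kl)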
1.2 (Method of Mixtures) Let $X_1, X_2, \ldots$ be a sequence of iid random variables where $X_1 \sim \mathcal{N}(\mu, 1)$. Under $H_0$ we have $\mu = 0$ and under $H_1$ assume $\mu = \theta$.
• Define $L_t(\theta) := \exp(S_t\theta - t\theta^2/2)$ where $S_t = \sum_{s=1}^{t} X_s$. Show that $\prod_{i=1}^{t} \frac{p_1(X_i)}{p_0(X_i)} = L_t(\theta)$.
• Now suppose the sequence $X_1, X_2, \ldots$ is still iid with $E[X_1] = \mu$ in the above notation, but now the distribution of $X_1$ is unknown other than the knowledge that $E[\exp(\lambda(X_1 - \mu))] \leq e^{\lambda^2/2}$. Show that $L_t(\theta)$ defined above is a super-martingale under $H_0$.
• Assume the setting of the previous step. Define $\bar{L}_t = \int_\theta L_t(\theta)h(\theta)\,d\theta$ where $h(\theta) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{\theta^2}{2\sigma^2}}$ and $\sigma > 0$. Show that $\bar{L}_t$ is a super-martingale under $H_0$.
• Fix $\delta \in (0, 1)$. Show that $P_0(\exists t \in \mathbb{N} : \bar{L}_t > 1/\delta) \leq \delta$.
• Conclude that if $Z_1, Z_2, \ldots$ are iid random variables with $E[\exp(\lambda(Z_1 - E[Z_1]))] \leq \exp(\lambda^2/2)$, then for any $\sigma > 0$
$$P\left(\exists t \in \mathbb{N} : \Big|\frac{1}{t}\sum_{s=1}^{t}(Z_s - E[Z_s])\Big| > \sqrt{1 + \frac{1}{t\sigma^2}}\,\sqrt{\frac{2\log(1/\delta) + \log(t\sigma^2 + 1)}{t}}\right) \leq \delta. \quad (1)$$
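For later use (e.g., in the experiments of Problem 5.1), a direct transcription of the bound in (1) as a Python helper might look like the following sketch; the function name is an illustrative choice, not part of the assignment.

import numpy as np

def anytime_bound(t, sigma2, delta):
    # anytime confidence radius from (1) for the running mean of 1-sub-Gaussian variables
    return np.sqrt(1 + 1/(t*sigma2)) * np.sqrt((2*np.log(1/delta) + np.log(t*sigma2 + 1))/t)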
1.3 (Best arm identification) Consider an $n$-armed multi-armed bandit problem where the $j$th pull of the $i$th arm yields a random variable $X_{i,j} \sim \mathcal{N}(\theta_i, 1)$. The objective of the player is to strategically pull arms until they reach a point where they can predict the index of the arm with the highest mean, at which time they stop and output this estimated arm. Suppose an oracle told the player that $\theta$, the vector of true means they are playing against, is equal to $\Delta e_j$ for some $j = 1, \ldots, n$, where $e_j$ is a vector of all zeros except a 1 in the $j$th location. Note that while $\Delta > 0$ is known to the player, the true $j$ is unknown.
• Due to the symmetry of the known problem setup, it is conceivable that the following algorithm is optimal: play every arm the same number of times (say, $\tau$ times), and then declare that the arm with the highest empirical mean is best. Provide a sufficient condition on $\tau$ such that this procedure correctly identifies the location of the true index $j$ with probability at least $1 - \delta$ (don't forget the union bound!).
• Argue that if each arm is pulled the same number of times, this value of $\tau$ is necessary up to a constant factor (ignoring dependence on $\delta$). Hint: see footnote 1 below.
• The previous two parts suggest that any algorithm that pulls every arm the same number of times and identifies the best arm requires essentially $n\Delta^{-2}\log(n/\delta)$ total pulls. We will beat this with an adaptive procedure that requires just $O(n\Delta^{-2}\log(1/\delta))$ pulls (a simulation sketch of this procedure appears after this list).
Algorithm: Initialize $S_1 = [n]$. At each round $t \geq 1$, while $|S_t| > 1$, pull every arm in $S_t$ and then set
$$S_{t+1} = \Big\{i \in S_t : \sum_{s=1}^{t}(X_{i,s} - \Delta) \geq -\frac{\Delta t}{2} - \frac{\log(1/\delta)}{\Delta}\Big\}.$$
Footnote 1 (hint): If $Z_i \sim \mathcal{N}(0, \sigma^2)$ for $i = 1, \ldots, n$, show that $P\big(\max_{i=1,\ldots,n} Z_i \geq \sqrt{\sigma^2\log(n)}\big) > c$ for some absolute constant $c$.
We showed in class that if $Z_s \sim \mathcal{N}(0, 1)$ then for any $\alpha > 0$ and $\rho \in (0, 1)$ we have
$$\max\left\{ P\left(\bigcup_{t=1}^{\infty}\Big\{\sum_{s=1}^{t} Z_s < -\frac{\alpha t}{2} - \frac{\log(1/\rho)}{\alpha}\Big\}\right),\; P\left(\bigcup_{t=1}^{\infty}\Big\{\sum_{s=1}^{t} Z_s > \frac{\alpha t}{2} + \frac{\log(1/\rho)}{\alpha}\Big\}\right)\right\} \leq \rho. \quad (2)$$
Conclude that if $j$ is the index of the best arm (so that $X_{j,s} \sim \mathcal{N}(\Delta, 1)$) then with probability at least $1 - \delta$, arm $j$ remains in $S_t$ for all $t \geq 1$.
• For $i \neq j$ (so that $X_{i,s} \sim \mathcal{N}(0, 1)$) define the random variables
$$\rho_i := \sup\left\{\rho \in (0, 1) : \bigcap_{t=1}^{\infty}\Big\{\sum_{s=1}^{t} X_{i,s} \leq \frac{\Delta t}{4} + \frac{\log(1/\rho)}{\Delta/2}\Big\}\right\}.$$
If $T_i = \max\{t : i \in S_t\}$, show that $T_i \leq 4\Delta^{-2}\log(1/\delta) + 8\Delta^{-2}\log(1/\rho_i)$.
• Note that by (2) we have $P(\rho_i \leq \rho) \leq \rho$ for any $\rho \in (0, 1)$. Use this fact to show that $E\big[\sum_{i=1}^{n} T_i\big] \leq c\,n\Delta^{-2}\log(1/\delta)$ for some absolute constant $c > 0$. While not necessary for this problem, it is also possible to show that the right-hand side holds with probability at least $1 - \delta$ (see sub-Gamma random variables).
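To make the adaptive procedure above concrete, here is a minimal simulation sketch of the elimination rule; the values of $n$, $\Delta$, $\delta$ and the helper name are illustrative assumptions only.

import numpy as np

def eliminate(n=20, Delta=0.5, delta=0.05, seed=0, max_rounds=10**6):
    # keep arm i while sum_{s<=t}(X_{i,s} - Delta) >= -Delta*t/2 - log(1/delta)/Delta
    rng = np.random.default_rng(seed)
    j = int(rng.integers(n))                     # unknown location of the best arm
    mu = np.zeros(n); mu[j] = Delta
    active = np.ones(n, dtype=bool)              # the current set S_t
    sums = np.zeros(n)                           # running sums of X_{i,s} for arms still in S_t
    t, pulls = 0, 0
    while active.sum() > 1 and t < max_rounds:
        t += 1
        x = mu + rng.standard_normal(n)          # one pull of every arm still in S_t
        sums[active] += x[active]
        pulls += int(active.sum())
        active &= (sums - Delta*t >= -(Delta*t/2 + np.log(1/delta)/Delta))
    return int(np.argmax(active)), j, pulls      # (declared arm, true arm, total pulls)

print(eliminate())

The total pull count returned here can be compared against the $O(n\Delta^{-2}\log(1/\delta))$ scaling derived in the last two parts.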
In general, when the means are unknown, curved boundaries with a UCB-like algorithm [1] can be used to identify the best arm using just $O\big(\log(1/\delta)\sum_{i=2}^{n}\Delta_i^{-2}\log(\log(\Delta_i^{-2}))\big)$ total pulls, where $\Delta_i = \max_j \theta_j - \theta_i$, using a similar analysis.
Linear regression and experimental design
2.1 Exercise 20.2 of [SzepesvariLattimore]
Non-parametric bandits
4.1 Let FLip be a set of functions defined over [0, 1] such that for each f ∈ FLip we have f : [0, 1] → [0, 1] and
for every x, y ∈ [0, 1] we have |f(y)−f(x)| ≤ L|y−x| for some known L > 0. At each round t the player chooses
an xt ∈ [0, 1] and observes a random variable yt ∈ [0, 1] such that E[yt] = f∗(xt) where f∗ ∈ FLip. Define
the regret of an algorithm after T steps as RT = E
hPT
t=1 f∗(x?) − f∗(xt)
i
where x? = arg maxx∈[0,1] f∗(x).
• Propose an algorithm, that perhaps uses knowledge of the time horizon T, that achieves RT ≤ O(T
2/3
)
regret (Okay to ignore constant, log factors).
• Argue that this is minimax optimal (i.e., improvable in general through the use of an explicit example,
with math, but no formal proof necessary).
Experiments
5.1 Suppose we have random variables $Z_1, Z_2, \ldots$ that are iid with $E[\exp(\lambda(Z_1 - E[Z_1]))] \leq \exp(\lambda^2/2)$. Then for any fixed $t \in \mathbb{N}$ we have the standard tail bound
$$P\left(\Big|\frac{1}{t}\sum_{s=1}^{t}(Z_s - E[Z_s])\Big| > \sqrt{\frac{2\log(2/\delta)}{t}}\right) \leq \delta. \quad (3)$$
The bound in (1) holds for all $t \in \mathbb{N}$ simultaneously (i.e., not just for a fixed $t$) at the cost of a slightly inflated bound. Let $\delta = 0.05$. Plot the ratio of the confidence bound of (1) to that of (3) as a function of $t$ for values of $\sigma^2 \in \{10^{-6}, 10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 10^{0}\}$. What do you notice about where the ratio is smallest with respect to $\sigma^2$ (if the smallest value is at the edge, your $t$ is not big enough; it should be in the many millions)? Suppose you are observing a stream of iid Gaussian random variables $Z_1, Z_2, \ldots$ and you are trying to determine whether their mean is positive or negative. If someone tells you they think the absolute value of the mean is about $0.01$, how would you choose $\sigma^2$ and use the above confidence bound to perform this test using as few total observations as possible?
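A minimal plotting sketch for the ratio described above, assuming matplotlib is available; the grid of $t$ values is an illustrative choice.

import numpy as np
import matplotlib.pyplot as plt

delta = 0.05
t = np.logspace(0, 9, 500)                        # illustrative grid of sample sizes

def anytime_bound(t, sigma2, delta):              # bound (1)
    return np.sqrt(1 + 1/(t*sigma2)) * np.sqrt((2*np.log(1/delta) + np.log(t*sigma2 + 1))/t)

def fixed_bound(t, delta):                        # bound (3)
    return np.sqrt(2*np.log(2/delta)/t)

for sigma2 in [1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1e0]:
    plt.plot(t, anytime_bound(t, sigma2, delta)/fixed_bound(t, delta), label="sigma^2 = %g" % sigma2)
plt.xscale("log"); plt.xlabel("t"); plt.ylabel("ratio of bound (1) to bound (3)")
plt.legend(); plt.title("Anytime vs. fixed-t confidence bound"); plt.show()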
5.2 G-optimal design This problem addresses finding a G-optimal design. Fix $(x_1, \ldots, x_n) \subset \mathbb{R}^d$. Let $\triangle_n = \{p \in \mathbb{R}^n : \sum_{i=1}^{n} p_i = 1,\ p_i \geq 0\ \forall i \in [n]\}$ and $A(\lambda) = \sum_{j=1}^{n} \lambda_j x_j x_j^\top$. For some $\lambda \in \triangle_n$ let $f(\lambda) = \max_{i=1,\ldots,n} \|x_i\|^2_{A(\lambda)^{-1}}$ and $g(\lambda) = -\log(|A(\lambda)|)$ (these are the G- and D-optimal design objectives, respectively). We wish to find a discrete allocation of size $N$ for G-optimal design. Below are two strategies to output an allocation $(I_1, \ldots, I_N) \in [n]^N$. Choose one strategy and one $h \in \{f, g\}$ to obtain samples from a G-optimal design.
1. Greedy For $i = 1, \ldots, 2d$ select $I_i$ uniformly at random with replacement from $[n]$. For $t = 2d + 1, \ldots, N$ select $I_t = \arg\min_{k\in[n]} h\big(x_k x_k^\top + \sum_{j=1}^{t-1} x_{I_j} x_{I_j}^\top\big)$. Return $(I_1, \ldots, I_N)$.
2. Frank–Wolfe For $i = 1, \ldots, 2d$ select $I_i$ uniformly at random with replacement from $[n]$ and let $\lambda^{(2d)} = \frac{1}{2d}\sum_{i=1}^{2d} e_{I_i}$. For $k = 2d + 1, \ldots, N$ select $I_k = \arg\min_{j\in[n]} \frac{\partial h(\lambda)}{\partial\lambda_j}\big|_{\lambda=\lambda^{(k-1)}}$ and set $\lambda^{(k)} = (1 - \eta_k)\lambda^{(k-1)} + \eta_k e_{I_k}$ where $\eta_k = 2/(k + 1)$. Return $(I_1, \ldots, I_N)$. You will need to derive the partial gradients $\frac{\partial h(\lambda)}{\partial\lambda_i}$; see footnote 2 for a hint. A sketch of this strategy appears after this list.
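As a concrete reference point, here is a minimal Frank–Wolfe sketch using $h = g$ (the D-optimal objective, whose continuous-design minimizer also solves the G-optimal problem by the Kiefer–Wolfowitz equivalence theorem). It returns the final iterate $\lambda^{(N)}$, usable as sampling weights, rather than the index sequence; the function name G_optimal matches the call in the listing of Problem 5.3 but is otherwise an assumption of this sketch.

import numpy as np

def G_optimal(X, N=1000, seed=0):
    # Frank-Wolfe on the rows of X (an n x d array) with h = g(lambda) = -log|A(lambda)|
    rng = np.random.default_rng(seed)
    n, d = X.shape
    init = rng.integers(0, n, size=2*d)                      # 2d indices, uniform with replacement
    lbda = np.bincount(init, minlength=n) / (2*d)
    for k in range(2*d + 1, N + 1):
        A = X.T @ (lbda[:, None] * X)                        # A(lambda) = sum_i lambda_i x_i x_i^T
        grad = -np.sum((X @ np.linalg.inv(A)) * X, axis=1)   # dg/dlambda_j = -x_j^T A^{-1} x_j
        j = int(np.argmin(grad))
        e_j = np.zeros(n); e_j[j] = 1.0
        eta = 2.0/(k + 1)
        lbda = (1 - eta)*lbda + eta*e_j
    return lbda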
Let $\hat\lambda = \frac{1}{N}\sum_{i=1}^{N} e_{I_i}$. We are going to evaluate $f(\hat\lambda)$ in a variety of settings. For $a \geq 0$ and $n \in \mathbb{N}$ let $x_i \sim \mathcal{N}(0, \mathrm{diag}(\sigma^2))$ with $\sigma^2_j = j^{-a}$ for $j = 1, \ldots, d$ with $d = 10$. On a single plot with $n \in \{10 + 2i\}_{i=1}^{10}$ on the x-axis, plot your chosen method for $a \in \{0, 0.5, 1, 2\}$ as separate lines, with $N = 1000$.
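Here is a small sketch of this evaluation loop, using a uniform allocation as a stand-in for the $\hat\lambda$ produced by your chosen strategy; all names and choices here are illustrative scaffolding, not a required implementation.

import numpy as np

rng = np.random.default_rng(0)
d = 10

def f_value(X, lbda):
    # G-optimal objective f(lambda) = max_i ||x_i||^2_{A(lambda)^{-1}} over the rows of X
    A = X.T @ (lbda[:, None] * X)
    return float(np.max(np.sum((X @ np.linalg.inv(A)) * X, axis=1)))

for a in [0, 0.5, 1, 2]:
    for n in [10 + 2*i for i in range(1, 11)]:
        sigma2 = np.arange(1, d + 1) ** (-float(a))
        X = rng.normal(size=(n, d)) * np.sqrt(sigma2)     # rows x_i ~ N(0, diag(sigma^2))
        lbda_hat = np.ones(n)/n                           # replace with the allocation from your strategy
        print(a, n, f_value(X, lbda_hat))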
5.3 Experimental design for function estimation.
Let $f : [0, 1]^d \to \mathbb{R}$ be some unknown function and fix some locations of interest $\mathcal{X} = \{x_i\}_{i=1}^{n}$. We wish to output an estimate $\hat f$ of $f$ that is uniformly good over $\mathcal{X}$ in the sense that $\max_{x\in\mathcal{X}} E[(\hat f(x) - f(x))^2]$ is small. To build such an estimate, we can make $N = 1000$ measurements. If we measure the function at location $X \in \mathcal{X}$ we assume we observe $Y = f(X) + \eta$ where $\eta \sim \mathcal{N}(0, 1)$. While you can choose any $N$ measurement locations in $\mathcal{X}$ you'd like (with repeats), you have to choose all locations before the observations are revealed. This problem emphasizes that even though we're studying linear functions and linear bandits in class, these can encode incredibly rich functions using non-linear transformations.
1. Linear functions Assume $f(x) = \langle x, \theta_*\rangle$ for some $\theta_* \in \mathbb{R}^d$. Propose a strategy to select your $N$ measurements and fit $\hat f$. What guarantees can you make about $\max_{x\in\mathcal{X}} E[(\hat f(x) - f(x))^2]$? What about the random quantity $\max_{x\in\mathcal{X}} (\hat f(x) - f(x))^2$?
2. Sobolev functions For simplicity, let $d = 1$ and assume $f(x)$ is absolutely continuous and $\int_0^1 |f'(x)|^2\,dx < \infty$. This class of functions is known as the first-order Sobolev space (see footnote 3). Remarkably, this set of functions is equivalent to a reproducing kernel Hilbert space (RKHS) for the kernel $k(x, x') = 1 + \min\{x, x'\}$. While a discussion of all its properties is beyond the scope of this class (see [2, Ch. 12] for an excellent treatment), it follows that there exists a sequence of orthogonal functions $\phi(x) := (\phi_1(x), \phi_2(x), \phi_3(x), \ldots)$ such that each $\phi_k : [0, 1] \to \mathbb{R}$ and
• for all $i, j \in \mathbb{N}$ we have $\int_0^1 \phi_i(x)\phi_j(x)\,dx = \beta_i \mathbf{1}\{i = j\}$ for a non-increasing sequence $\{\beta_i\}_i$,
• $k(x, x') = \langle\phi(x), \phi(x')\rangle := \sum_{k=1}^{\infty} \phi_k(x)\phi_k(x')$,
• and any function $f$ in this class can be written as $f(x) = \sum_{k=1}^{\infty} \alpha_k\phi_k(x)$ with $\sum_{k=1}^{\infty} \alpha_k^2/\beta_k < \infty$.
Because $\phi(x)$ could be infinite dimensional, it's not immediately clear why this is convenient. For finite datasets, though (i.e., $n < \infty$), there always exists a finite-dimensional $\phi : \mathcal{X} \to \mathbb{R}^{|\mathcal{X}|}$ where $k(x_i, x_j) = \langle\phi(x_i), \phi(x_j)\rangle$. Precisely, if one computes the matrix $[K]_{i,j} = k(x_i, x_j)$ and $K = \sum_{k=1}^{n} e_k v_k v_k^\top$ is the resulting eigenvalue decomposition, where $\{e_k\}_k$ is a non-increasing sequence, then for all $i \in [n]$ we have $[\phi_i]_j := [\phi(x_i)]_j := v_{i,j}\sqrt{e_j}$ and there exists a $\theta \in \mathbb{R}^n$ such that
$$f(x_i) = \langle\theta, \phi_i\rangle = \sum_{j=1}^{n} \theta_j v_{i,j}\sqrt{e_j}.$$
Footnote 2 (hint for the Frank–Wolfe gradients in 5.2): $I = A(\lambda)A(\lambda)^{-1}$. Thus, $0 = \frac{\partial}{\partial\lambda_i} I = \big(\frac{\partial}{\partial\lambda_i} A(\lambda)\big) A(\lambda)^{-1} + A(\lambda)\,\frac{\partial}{\partial\lambda_i} A(\lambda)^{-1}$. Also, $\frac{d}{dX}\log(|X|) = (X^{-1})^\top$.
Footnote 3: This class is quite rich, including functions as diverse as all polynomials $a + bx + cx^8$, trigonometric functions $\sin(ax) + \cos(bx)$, exponential functions $e^{ax} + b^x$, and even non-smooth functions like $\max\{1 - x, 3x\}$. However, it does not include functions like $1/x$ or $\sqrt{x}$ because their derivatives blow up at $x = 0$.
Because the $e_j$ are rapidly going to zero, we usually do not include all $n$ of them, but cut the sum off at some $d \ll n$ so that $\phi_i \in \mathbb{R}^d$ and $f(x_i) \approx \sum_{j=1}^{d} \theta_j v_{i,j}\sqrt{e_j}$. To summarize, for each $i \in [n]$ we have defined a map $x_i \to \phi_i$ with $\phi_i \in \mathbb{R}^d$ and $f(x_i) \approx \langle\phi_i, \theta_*\rangle$ for some unknown $\theta_*$. Here is a piece of code that generates our set $\mathcal{X} = \{x_i\}_{i=1}^{n} \subset [0, 1]$ and these features $\Phi = \{\phi_i\}_{i=1}^{n} \subset \mathbb{R}^d$:
import numpy as np

n = 300
X = np.concatenate((np.linspace(0, 1, 50), 0.25 + 0.01*np.random.randn(250)), 0)
X = np.sort(X)

K = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        K[i, j] = 1 + min(X[i], X[j])
e, v = np.linalg.eigh(K)  # eigenvalues are returned in increasing order
d = 30
Phi = np.real(v @ np.diag(np.sqrt(np.abs(e))))[:, (n - d):]
Let’s define some arbitrary function and see how well this works!
def f(x):
    return -x**2 + x*np.cos(8*x) + np.sin(15*x)

f_star = f(X)

theta = np.linalg.lstsq(Phi, f_star, rcond=None)[0]
f_hat = Phi @ theta
Observe that there is nearly no error in the reconstruction even though we're using a linear model in $\mathbb{R}^{30}$. This is because we defined a very good basis $\Phi \subset \mathbb{R}^d$. We could have achieved the same result by learning in kernel space and adding regularization (this is known as Bayesian experimental design). Instead of choosing $d$ we would have to choose the amount of regularization, so there is still always a hyperparameter to choose.
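For comparison, here is a minimal kernel ridge regression sketch of the alternative just mentioned, reusing K, f_star, and n from the listings above; the regularization level lam is an arbitrary illustrative choice.

# kernel ridge regression: f_hat_krr(x_i) = sum_j alpha_j * k(x_i, x_j)
lam = 1e-6                                           # illustrative regularization strength
alpha = np.linalg.solve(K + lam*np.eye(n), f_star)   # fit in kernel space
f_hat_krr = K @ alpha                                # predictions at the points in X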
Let’s get back to estimating f from noisy samples. Now that we have made this a linear estimation
problem, we are exactly in the setting of the first part of this problem. Consider the below listing:
def observe(idx):
    return f(X[idx]) + np.random.randn(len(idx))

def sample_and_estimate(X, lbda, tau):
    n, d = X.shape
    reg = 1e-6  # we can add a bit of regularization to avoid divide by 0
    idx = np.random.choice(np.arange(n), size=tau, p=lbda)
    y = observe(idx)

    XtX = X[idx].T @ X[idx]
    XtY = X[idx].T @ y

    theta = np.linalg.lstsq(XtX + reg*np.eye(d), XtY, rcond=None)[0]
    return Phi @ theta, XtX

T = 1000

lbda = G_optimal(Phi)
f_G_Phi, A = sample_and_estimate(Phi, lbda, T)
conf_G = np.sqrt(np.sum(Phi @ np.linalg.inv(A) * Phi, axis=1))

lbda = np.ones(n)/n
f_unif_Phi, A = sample_and_estimate(Phi, lbda, T)
conf_unif = np.sqrt(np.sum(Phi @ np.linalg.inv(A) * Phi, axis=1))
Use your implementation of G-optimal design from the previous problem and the sample_and_estimate function given above. Your tasks (create a legend for all curves, label all axes, and provide a title for all plots):
(a) Plot 1: the x-axis should be the x locations in [0, 1]. The first line is the CDF of the uniform distribution over $\mathcal{X}$. The second line is the G-optimal allocation over $\mathcal{X}$. Comment on the relative shapes, and how this relates to the distribution over the x's shown in the left-most plot above.
(b) Plot 2: the x-axis should be the x locations in [0, 1]. Plot f_star, f_G_Phi, and f_unif_Phi.
(c) Plot 3: the x-axis should be the x locations in [0, 1]. The first line is the absolute value of f_G_Phi minus f_star, the second line is f_unif_Phi minus f_star, the third line is $\sqrt{d/n}$, the fourth line is conf_G, and the fifth line is conf_unif. Comment on what these lines have to do with each other.
5.4 Linear bandits Implement the elimination algorithm (use your implementation of G-optimal design), UCB, and Thompson sampling (see the listings below). Use the exact setup of the previous problem and set $\mathcal{X} \leftarrow \{\phi_i\}_{i=1}^{n}$. For $T = 40{,}000$, on a single plot with $t \in [T]$ on the x-axis, plot $R_t = \max_{x\in\mathcal{X}}\sum_{s=1}^{t}\big(f(x) - y_s\big)$ with a line for each algorithm. Comment on the results: which algorithm would you recommend to minimize regret?
Elimination algorithm
Input: $T \in \mathbb{N}$
Initialize $\tau = 100$, $\delta = 1/T$, $\gamma = 1$, $U = 1$ (supposed to be an upper bound on $\|\theta_*\|_2$), $V_0 = \gamma I$, $S_0 = 0$, $\widehat{\mathcal{X}}_1 = \mathcal{X}$
for $k = 1, \ldots, \lfloor T/\tau\rfloor$:
    $\lambda^{(k)} = \arg\min_{\lambda\in\triangle_{\widehat{\mathcal{X}}_k}} \max_{x\in\widehat{\mathcal{X}}_k}\; x^\top\Big(\sum_{x'\in\widehat{\mathcal{X}}_k} \lambda_{x'}\, x' x'^\top\Big)^{-1} x$
    Draw $x_{(k-1)\tau+1}, \ldots, x_{k\tau} \sim \lambda^{(k)}$
    For $t = (k-1)\tau + 1, \ldots, k\tau$, pull arm $x_t$ and observe $y_t = f(x_t) + \eta_t$ where $\eta_t \sim \mathcal{N}(0, 1)$
    $V_k = V_{k-1} + \sum_{t=(k-1)\tau+1}^{k\tau} x_t x_t^\top$, $\quad S_k = S_{k-1} + \sum_{t=(k-1)\tau+1}^{k\tau} x_t y_t$, $\quad \theta_k = V_k^{-1} S_k$
    $\beta_k = \sqrt{\gamma}\,U + \sqrt{2\log(1/\delta) + \log(|V_k|/|V_0|)}$
    $\widehat{x}_k = \arg\max_{x'\in\widehat{\mathcal{X}}_k} \langle x', \theta_k\rangle$
    $\widehat{\mathcal{X}}_{k+1} = \widehat{\mathcal{X}}_k \setminus \{x \in \widehat{\mathcal{X}}_k : \langle\widehat{x}_k - x, \theta_k\rangle \geq \beta_k\|\widehat{x}_k - x\|_{V_k^{-1}}\}$
UCB algorithm
Input: $T \in \mathbb{N}$
Initialize $\delta = 1/T$, $\gamma = 1$, $U = 1$ (supposed to be an upper bound on $\|\theta_*\|_2$), $V_0 = \gamma I$, $S_0 = 0$
for $t = 0, 1, 2, \ldots, T - 1$:
    $\beta_t = \sqrt{\gamma}\,U + \sqrt{2\log(1/\delta) + \log(|V_t|/|V_0|)}$
    $\theta_t = V_t^{-1} S_t$
    $x_t = \arg\max_{x\in\mathcal{X}} \langle x, \theta_t\rangle + \beta_t\|x\|_{V_t^{-1}}$
    Pull arm $x_t$ and observe $y_t = f(x_t) + \eta_t$ where $\eta_t \sim \mathcal{N}(0, 1)$
    $V_{t+1} = V_t + x_t x_t^\top$, $\quad S_{t+1} = S_t + x_t y_t$
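A direct Python transcription of the UCB pseudocode above might look like the following sketch; here X is an (n, d) array whose rows are the arms (e.g., Phi from Problem 5.3), observe follows the earlier listing, and all names are illustrative.

import numpy as np

def ucb(X, observe, T, gamma=1.0, U=1.0):
    # sketch of the UCB pseudocode above; returns chosen arm indices and observed rewards
    n, d = X.shape
    delta = 1.0/T
    V, S = gamma*np.eye(d), np.zeros(d)
    logdet0 = np.linalg.slogdet(V)[1]                       # log|V_0|
    pulls, rewards = [], []
    for t in range(T):
        beta = np.sqrt(gamma)*U + np.sqrt(2*np.log(1/delta) + np.linalg.slogdet(V)[1] - logdet0)
        theta = np.linalg.solve(V, S)
        Vinv = np.linalg.inv(V)
        scores = X @ theta + beta*np.sqrt(np.sum((X @ Vinv) * X, axis=1))
        i = int(np.argmax(scores))
        y = observe(np.array([i]))[0]                       # y_t = f(x_t) + eta_t
        V += np.outer(X[i], X[i]); S += X[i]*y
        pulls.append(i); rewards.append(y)
    return np.array(pulls), np.array(rewards)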
Thompson sampling algorithm
Input: $T \in \mathbb{N}$
Initialize $\gamma = 1$, $V_0 = \gamma I$, $S_0 = 0$
for $t = 0, 1, 2, \ldots, T - 1$:
    $\theta_t = V_t^{-1} S_t$
    $\widetilde{\theta}_t \sim \mathcal{N}(\theta_t, V_t^{-1})$
    $x_t = \arg\max_{x\in\mathcal{X}} \langle x, \widetilde{\theta}_t\rangle$
    Pull arm $x_t$ and observe $y_t = f(x_t) + \eta_t$ where $\eta_t \sim \mathcal{N}(0, 1)$
    $V_{t+1} = V_t + x_t x_t^\top$, $\quad S_{t+1} = S_t + x_t y_t$
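Similarly, a minimal transcription of the Thompson sampling pseudocode above; again X is an (n, d) array of arms, observe follows the earlier listing, and the names are illustrative.

import numpy as np

def thompson(X, observe, T, gamma=1.0, seed=0):
    # sketch of the Thompson sampling pseudocode above
    rng = np.random.default_rng(seed)
    n, d = X.shape
    V, S = gamma*np.eye(d), np.zeros(d)
    pulls, rewards = [], []
    for t in range(T):
        Vinv = np.linalg.inv(V)
        theta = Vinv @ S
        theta_tilde = rng.multivariate_normal(theta, Vinv)  # sample from N(theta_t, V_t^{-1})
        i = int(np.argmax(X @ theta_tilde))
        y = observe(np.array([i]))[0]
        V += np.outer(X[i], X[i]); S += X[i]*y
        pulls.append(i); rewards.append(y)
    return np.array(pulls), np.array(rewards)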
References
[1] Kevin Jamieson, Matthew Malloy, Robert Nowak, and Sébastien Bubeck. lil' UCB: An optimal exploration algorithm for multi-armed bandits. In Conference on Learning Theory, pages 423–439, 2014.
[2] Martin J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint, volume 48. Cambridge University Press, 2019.