Homework 1 CSE 542: Interactive Learning


Probability
Concentration inequalities are at the heart of most arguments in statistical learning theory and bandits.
Refer to [1] for more details.
1.1 (Markov's Inequality) Let $X$ be a positive random variable. Prove that $P(X > \lambda) \le \frac{E[X]}{\lambda}$ for any $\lambda > 0$.
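As a quick illustration (not part of the problem): if $X \sim \mathrm{Exp}(1)$, so that $E[X] = 1$, Markov's inequality gives $P(X > 10) \le 1/10$, whereas the exact probability is $e^{-10} \approx 4.5 \times 10^{-5}$; the bound always holds but can be quite loose.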
1.2 (Jensen's Inequality) Let $X$ be a random vector in $\mathbb{R}^d$ and let $\varphi : \mathbb{R}^d \to \mathbb{R}$ be convex. Then $\varphi(E[X]) \le E[\varphi(X)]$. Show this inequality for the special case when $X$ has discrete support. That is, for $p_i \ge 0$ with $\sum_{i=1}^n p_i = 1$ and $(x_1, \ldots, x_n) \in \mathbb{R}^n$, show that $\varphi\left(\sum_{i=1}^n p_i x_i\right) \le \sum_{i=1}^n p_i \varphi(x_i)$.
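As a sanity check of the discrete statement (an illustration, not a substitute for the proof), take $\varphi(x) = x^2$ with $d = 1$: the claim becomes $\left(\sum_{i=1}^n p_i x_i\right)^2 \le \sum_{i=1}^n p_i x_i^2$, which is exactly $E[X]^2 \le E[X^2]$, i.e., $\mathrm{Var}(X) \ge 0$.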
1.3 (Sub-additivity of sub-Gaussian) For $i = 1, \ldots, n$ assume $X_i$ is an independent random variable with $E[\exp(\lambda(X_i - E[X_i]))] \le \exp(\lambda^2 \sigma_i^2 / 2)$. If $Z = \sum_{i=1}^n X_i$, find $a \in \mathbb{R}$ and $b \ge 0$ such that $E[\exp(\lambda(Z - a))] \le \exp(\lambda^2 b / 2)$.
1.4 (Maximal inequality) For $i = 1, \ldots, n$ let each $X_i$ be an independent random variable that satisfies $E[\exp(\lambda X_i)] \le \exp(\sigma_i^2 \lambda^2 / 2)$ for all $\lambda > 0$. Show that $E[\max_{i=1,\ldots,n} X_i] \le \sqrt{8 \max_{i=1,\ldots,n} \sigma_i^2 \log(n)}$. Hint$^1$. If $\sigma_1 = \sigma_2 = \cdots = \sigma_n$, how would you expect $E[\max_{i=1,\ldots,n} X_i]$ to behave (an intuitive justification is enough)?
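For intuition about the $\sqrt{\log n}$ growth in the bound (and for the follow-up question), here is a small simulation sketch; the number of trials and the values of $n$ are arbitrary choices, not part of the assignment.

```python
import numpy as np

# Empirically estimate E[max_{i<=n} X_i] for i.i.d. standard normals and
# compare against sqrt(2 log n), which matches the sqrt(8 sigma^2 log n)
# upper bound above up to a constant.
rng = np.random.default_rng(0)
for n in (10, 100, 1000, 10_000):
    emp = rng.standard_normal((1000, n)).max(axis=1).mean()
    print(f"n={n:6d}  empirical E[max] ~ {emp:.3f}   sqrt(2 log n) = {np.sqrt(2 * np.log(n)):.3f}")
```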
The Upper Confidence Bound Algorithm.
Consider the following algorithm for the multi-armed bandit problem.
Algorithm 1: UCB
Input: Time horizon $T$, 1-sub-Gaussian arm distributions $P_1, \ldots, P_n$ with unknown means $\mu_1, \ldots, \mu_n$.
Initialize: Let $T_i(t)$ denote the number of times arm $i$ has been pulled up to (and including) time $t$, and let $T_i = T_i(T)$. Pull each arm once.
for: $t = n+1, \ldots, T$
    Pull arm $I_t = \arg\max_{i=1,\ldots,n} \hat{\mu}_{i,T_i(t-1)} + \sqrt{\frac{2\log(2nT^2)}{T_i(t-1)}}$ and observe a draw from $P_{I_t}$.
    Let $\hat{\mu}_{i,T_i(t)}$ be the empirical mean of the first $T_i(t)$ pulls of arm $i$.
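The pseudocode above maps directly to code. Below is a minimal sketch, assuming Gaussian rewards $\mathcal{N}(\mu_i, 1)$ as in the Empirical Experiments section; the function and variable names are illustrative, not prescribed by the assignment.

```python
import numpy as np

def ucb(means, T, rng=None):
    """Minimal UCB sketch: pull each arm once, then pull the arm maximizing
    the empirical mean plus the sqrt(2 log(2 n T^2) / T_i(t-1)) bonus."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(means)
    pulls = np.zeros(n)        # T_i(t): number of pulls of arm i so far
    sums = np.zeros(n)         # running sum of observed rewards per arm
    history = []
    for t in range(1, T + 1):
        if t <= n:
            i = t - 1          # initialization: pull each arm once
        else:
            bonus = np.sqrt(2 * np.log(2 * n * T**2) / pulls)
            i = int(np.argmax(sums / pulls + bonus))
        reward = rng.normal(means[i], 1.0)   # observe a draw from the pulled arm
        pulls[i] += 1
        sums[i] += reward
        history.append(i)
    return history
```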
In the following exercises, we will compute the regret of the UCB algorithm and show that it matches the regret bound from lecture. Without loss of generality, assume that arm 1 is the best arm, so that $\mu^* = \mu_1$. For any $i \in [n]$, define the sub-optimality gap $\Delta_i = \mu_1 - \mu_i$. Define the regret at time $T$ as $R_T = E\left[\sum_{t=1}^T \mu^* - \mu_{I_t}\right] = \sum_{i=1}^n \Delta_i E[T_i]$.
2.1 Consider the event
$$E = \bigcap_{i \in [n]} \bigcap_{s \le T} \left\{ |\hat{\mu}_{i,s} - \mu_i| \le \sqrt{\frac{2\log(2nT^2)}{s}} \right\}.$$
Show that $P(E) \ge 1 - \frac{1}{T}$.
2.2 On the event $E$, show that $T_i \le 1 + \frac{8\log(2nT^2)}{\Delta_i^2}$ for $i \ne 1$.
2.3 Show that $E[T_i] \le \frac{8\log(2nT^2)}{\Delta_i^2} + 2$. When $n \le T$, conclude by showing that $R_T \le \sum_{i=2}^n \left( \frac{24\log(2T)}{\Delta_i} + 2\Delta_i \right)$.
$^1$Apply Jensen's inequality to the identity $E[\max_i X_i] = \frac{1}{\lambda}\log(\exp(\lambda E[\max_i X_i]))$ for any $\lambda > 0$.
Thompson Sampling.
We consider the following Bayesian setting. Consider $n$ arms and let $p_0$ be an $n$-dimensional prior distribution over $[-1, 1]^n$ such that $\theta^* \sim p_0$ is drawn before the start of the game (e.g., $p_0$ is uniform over $[-1, 1]^n$). At any time $t$, when we pull arm $i \in [n]$ we observe a random variable $X_{i,t} \in [-1, 1]$ with $E[X_{i,t}] = \theta^*_i$.
Algorithm 2: Thompson Sampling
Input: Time horizon $T$, arm distributions $\nu_1, \ldots, \nu_n$.
Assume the prior distribution $p_0$ over $\mathbb{R}^n$ is known and that $\theta^* \sim p_0$ (so that $\theta^* \in \mathbb{R}^n$). Let $p_t(\cdot \mid I_1, X_{I_1,1}, \ldots, I_t, X_{I_t,t})$ be the posterior distribution on $\theta^*$ at time $t$.
for: $t = 1, \ldots, T$
    Sample $\theta^{(t)} \sim p_{t-1}$ (note: $\theta^{(t)} \in \mathbb{R}^n$).
    Pull arm $I_t = \arg\max_{i \le n} \theta^{(t)}_i$ and observe $X_{I_t,t}$.
    Compute the exact posterior update $p_t$.
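For the Gaussian model used later in the Empirical Experiments section (prior $\mathcal{N}(0,1)$ on each $\theta_i$, rewards $\mathcal{N}(\theta_i, 1)$), the posterior update has a closed form, and the algorithm can be sketched as below. This is an illustrative sketch under that Gaussian assumption, not the general algorithm for an arbitrary prior.

```python
import numpy as np

def thompson_sampling_gaussian(means, T, rng=None):
    """Thompson Sampling sketch for N(0, 1) priors and N(theta_i, 1) rewards.
    With this conjugate pair, after s pulls of arm i with reward sum S_i the
    posterior on theta_i is N(S_i / (s + 1), 1 / (s + 1))."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(means)
    pulls = np.zeros(n)    # number of pulls of each arm
    sums = np.zeros(n)     # sum of observed rewards per arm
    history = []
    for t in range(1, T + 1):
        post_mean = sums / (pulls + 1.0)
        post_std = np.sqrt(1.0 / (pulls + 1.0))
        theta = rng.normal(post_mean, post_std)   # sample theta^(t) ~ p_{t-1}
        i = int(np.argmax(theta))                 # pull I_t = argmax_i theta_i^(t)
        reward = rng.normal(means[i], 1.0)        # observe X_{I_t, t}
        pulls[i] += 1
        sums[i] += reward                         # sufficient statistics = exact posterior update
        history.append(i)
    return history
```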
Denote the $\sigma$-algebra generated by the observations up to time $t$ by $\mathcal{F}_t = \sigma(I_1, X_{I_1,1}, \ldots, I_t, X_{I_t,t})$ (if you are unfamiliar with $\sigma$-algebras, don't worry too much; conditioning on the $\sigma$-algebra just means conditioning on the choices of arms and the rewards observed). The Bayesian regret of an algorithm is
$$BR_T = E_{\theta^* \sim p_0}\left[\sum_{t=1}^T \max_{i=1,\ldots,n} \theta^*_i - \theta^*_{I_t}\right] = E_{\theta^* \sim p_0}\left[E\left[\sum_{t=1}^T \max_{i=1,\ldots,n} \theta^*_i - \theta^*_{I_t} \;\middle|\; \theta^*\right]\right].$$
Assume that expectations, if not explicitly specified, are taken with respect to all randomness, including $\theta^* \sim p_0$, $I_1, \ldots, I_T$, and the observations.
3.1 On a given run of the algorithm, let $\hat{\theta}_{i,s}$ denote the empirical mean of the first $s$ pulls of arm $i$, and note that $E[\hat{\theta}_{i,s}] = \theta^*_i$. Let the good event be
$$E = \bigcap_{i \in [n]} \bigcap_{t \le T} \left\{ |\hat{\theta}_{i,t} - \theta^*_i| \le \sqrt{\frac{2\log(2/\delta)}{t}} \right\}.$$
Show that $P(E^c) \le nT\delta$.
3.2 (Key idea.) Argue that $P(i^* = \cdot \mid \mathcal{F}_{t-1}) = P(I_t = \cdot \mid \mathcal{F}_{t-1})$.
3.3 Define $U_t(i) = \min\left\{1, \hat{\theta}_{i,T_i(t)} + \sqrt{\frac{2\log(2/\delta)}{T_i(t)}}\right\}$. If $i^* = \arg\max_i \theta^*_i$, show that
$$E_{\theta^* \sim p_0}\left[E_{I_t}\left[\theta^*_{i^*} - \theta^*_{I_t} \mid \mathcal{F}_{t-1}\right]\right] = E_{\theta^* \sim p_0}\left[\theta^*_{i^*} - U_t(i^*)\right] + E_{\theta^* \sim p_0}\left[E_{I_t}\left[U_t(I_t) - \theta^*_{I_t} \mid \mathcal{F}_{t-1}\right]\right].$$
Conclude that $BR_T = E_{\theta^* \sim p_0}\left[\sum_{t=1}^T \theta^*_{i^*} - U_t(i^*) + \sum_{t=1}^T E_{I_t}\left[U_t(I_t) - \theta^*_{I_t} \mid \mathcal{F}_{t-1}\right]\right]$. Hint$^2$.
3.4 Show that $BR_T \le 4\delta n T^2 + E\left[E\left[\mathbf{1}\{E\} \sum_{t=1}^T U_t(I_t) - \theta^*_{I_t} \;\middle|\; \theta^*\right]\right] \le O\left(\delta n T^2 + \sqrt{Tn\log(1/\delta)}\right)$. Hint$^3$.
3.5 Make an appropriate choice of δ and state a final regret bound.
In general, giving frequentist bounds on the regret is significantly more difficult. We refer the interested reader to [2] and the tutorial [3] for more details. This exercise is motivated by [4].
$^2$Tower rule of expectation.
$^3$Apply Jensen's inequality to $\sum_{i=1}^n \sqrt{T_i}$.
Algorithm 3: Explore-then-Commit
Input: Time horizon $T$, $m \in \mathbb{N}$, 1-sub-Gaussian arm distributions $P_1, \ldots, P_n$ with unknown means $\mu_1, \ldots, \mu_n$.
for: $t = 1, \ldots, T$
    If $t \le mn$, choose $I_t = (t \bmod n) + 1$.
    Else, pull $I_t = \arg\max_i \hat{\mu}_{i,m}$.
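A minimal sketch of the above, again assuming Gaussian rewards $\mathcal{N}(\mu_i, 1)$; the round-robin indexing is zero-based here, but each arm is still pulled $m$ times before committing.

```python
import numpy as np

def explore_then_commit(means, T, m, rng=None):
    """Explore-then-Commit sketch: round-robin for the first m*n rounds,
    then commit to the arm with the best empirical mean."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(means)
    pulls = np.zeros(n)
    sums = np.zeros(n)
    history = []
    for t in range(1, T + 1):
        if t <= m * n:
            i = (t - 1) % n                      # exploration phase
        else:
            i = int(np.argmax(sums / pulls))     # commit to best empirical mean
        reward = rng.normal(means[i], 1.0)
        pulls[i] += 1
        sums[i] += reward
        history.append(i)
    return history
```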
Empirical Experiments
Implement UCB, Thompson Sampling (TS), and Explore-then-Commit (ETC). Let $P_i = \mathcal{N}(\mu_i, 1)$ for $i = 1, \ldots, n$. For Thompson Sampling, define the prior for the $i$th arm as $\mathcal{N}(0, 1)$.
4.1 Let $n = 10$, $\mu_1 = 0.1$, and $\mu_i = 0$ for $i > 1$. On a single plot, for a $T$ large enough to see the expected effects, plot the regret of UCB, TS, and ETC for several values of $m$.
4.2 Let $n = 40$, $\mu_1 = 1$, and $\mu_i = 1 - 1/\sqrt{i - 1}$ for $i > 1$. On a single plot, for a $T$ large enough to see the expected effects, plot the regret of UCB, TS, and ETC for several values of $m$.
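One possible way to run 4.1 and 4.2, assuming the `ucb`, `thompson_sampling_gaussian`, and `explore_then_commit` sketches given earlier in this document; the horizon, number of runs, and values of $m$ below are placeholders to be tuned until the expected effects are visible.

```python
import numpy as np
import matplotlib.pyplot as plt

def cumulative_regret(means, history):
    """Cumulative pseudo-regret: running sum of the gaps of the pulled arms."""
    gaps = np.max(means) - np.asarray(means)
    return np.cumsum(gaps[np.asarray(history)])

T, n_runs = 10_000, 20
means = np.array([0.1] + [0.0] * 9)            # setting of problem 4.1

algorithms = {"UCB": lambda rng: ucb(means, T, rng),
              "TS": lambda rng: thompson_sampling_gaussian(means, T, rng)}
for m in (10, 50, 250):                        # illustrative choices of m
    algorithms[f"ETC (m={m})"] = lambda rng, m=m: explore_then_commit(means, T, m, rng)

for label, run in algorithms.items():
    avg = np.mean([cumulative_regret(means, run(np.random.default_rng(seed)))
                   for seed in range(n_runs)], axis=0)
    plt.plot(avg, label=label)
plt.xlabel("t")
plt.ylabel("cumulative regret")
plt.legend()
plt.show()
```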
Lower Bounds on Hypothesis Testing
Consider $n$ samples $X_1, \ldots, X_n \sim P$ where $P \in \{P_0, P_1\}$ (assume for simplicity that these are probability distributions on $\mathbb{R}$). A hypothesis test for $H_0 : P = P_0$ versus $H_1 : P = P_1$ is a function $\phi(x_1, \ldots, x_n) : \mathbb{R}^n \to \{0, 1\}$ that takes the data as input and returns the null or the alternative. Assume that $dP_i = p_i(x)\,dx$, so that the probability density function exists (think: $p_i(x) = \frac{1}{\sqrt{2\pi}} e^{-(x - \mu_i)^2/2}$). If $x \in \mathbb{R}^n$ is the vector of $n$ observations, define $p_i(x) := \prod_{j=1}^n p_i(x_j)$. In this problem, we will lower bound the number of samples needed by any hypothesis test. Convince yourself, at least intuitively, that any best-arm identification algorithm for two arms must take at least as many samples as this hypothesis test does.
5.1 Show that $\inf_\phi \max\{P_0(\phi = 1), P_1(\phi = 0)\} \ge \frac{1}{2}\int_{\mathbb{R}^n} \min(p_0(x), p_1(x))\,dx$. Hint$^4$.
5.2 Let's continue on. Show that $\frac{1}{2}\int_{x \in \mathbb{R}^n} \min(p_0(x), p_1(x))\,dx \ge \frac{1}{4}\left(\int_{x \in \mathbb{R}^n} \sqrt{p_0(x)p_1(x)}\,dx\right)^2$. Hints$^{5,6}$.
5.3 One more step. Show that $\left(\int_{x \in \mathbb{R}^n} \sqrt{p_0(x)p_1(x)}\,dx\right)^2 \ge \exp\left(-\int_{x \in \mathbb{R}^n} \log\left(\frac{p_1(x)}{p_0(x)}\right) p_1(x)\,dx\right)$. Hint$^7$.
5.4 The integral $\int \log\left(\frac{p_1(x)}{p_0(x)}\right) p_1(x)\,dx$ appearing above is known as the KL-divergence between the two distributions. Now assume that $P_0 = \mathcal{N}(\mu_0 \mathbf{1}_n, I_n)$ and $P_1 = \mathcal{N}(\mu_1 \mathbf{1}_n, I_n)$, where $I_n$ is the $n \times n$ identity matrix and $\mathbf{1}_n \in \mathbb{R}^n$ is the all-ones vector. Show (or look up) $KL(P_0 \| P_1)$.
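As a reminder (a standard fact that can be looked up rather than derived here): for two Gaussians sharing a covariance matrix,
$$KL\big(\mathcal{N}(\mu_a, \Sigma) \,\|\, \mathcal{N}(\mu_b, \Sigma)\big) = \tfrac{1}{2}(\mu_a - \mu_b)^\top \Sigma^{-1} (\mu_a - \mu_b).$$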
5.5 Conclude that for a test to accurately determine whether the sample of size $n$ came from $P_0$ or $P_1$ with probability of error less than $\delta$, we must necessarily have $n \ge 2\Delta^{-2}\log\left(\frac{1}{4\delta}\right)$, where $\Delta = \mu_1 - \mu_0$.
Remark: The art of lower bounds is well established and extensive in statistics. See [5] for more details in
the hypothesis testing setting. In the bandit setting, see [6].
References
[1] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press, 2013.
$^4$Bound the max from below by the average.
$^5$Note that for $a, b > 0$ we have $ab = \min\{a, b\}\max\{a, b\}$.
$^6$Apply Cauchy-Schwarz.
$^7$Jensen's inequality.
[2] Shipra Agrawal and Navin Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. In Conference on Learning Theory, pages 39–1, 2012.
[3] Daniel J. Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, Zheng Wen, et al. A tutorial on Thompson sampling. Foundations and Trends® in Machine Learning, 11(1):1–96, 2018.
[4] Daniel Russo and Benjamin Van Roy. Learning to optimize via posterior sampling. Mathematics of Operations
Research, 39(4):1221–1243, 2014.
[5] Alexandre B Tsybakov. Introduction to Nonparametric Estimation. Springer New York, 2009.
[6] Emilie Kaufmann, Olivier Cappé, and Aurélien Garivier. On the complexity of best-arm identification in multi-armed bandit models. The Journal of Machine Learning Research, 17(1):1–42, 2016.