Description
1. (1 pts, 0.25 pts each) Linearity. Are the following functions linear? Justify your answer.
(a) f(x) = kxk
2
2
(b) f(x) = kxk1
(c) f(x) = 1
2
x
T Qx + p
T x + r
(d) f(x) = c
T x + b
T Ax
2. (1 pt, 0.25 each) Using the properties of norms, verify that the following are norms, or prove that they are not
norms
(a) Direct sum. f : R
d → R, f(x) = P
k
x[k]
(b) Sum of square roots, squared. f : R
d → R, f(x) = Pd
k=1 p
|x[k]|
2
(c) (Shifted) entropy function f : R
d → R, f(x) = −
Pd
i=1(x[i] + 1
2
) log2
(x[i] + 1
2
) where −1/2 ≤ x[i] ≤ 1/2 for all
i = 1, …, d
(d) Weighted 2-norm. f : R
d → R, f(x) = qPd
k=1
|x[k]|
2
k
3. (1 pt, 0.25 each) Independent or not independent Variables A and B are random variables for two distributions.
Decide if A and B are independent. Justify your answer.
(a) A and B are discrete random variables and have the following p.m.f.s
pA(a) =
0.25, a = red
0.25, a = blue
0.5, a = green
, pB(b) =
0.3, b = hat
0.3, b = T-shirt
0.2, b = skirt
0.2, b = shoes
and pA,B(a, b) are defined by the table below
a = red a = blue a = green
b = hat 0.075 0.075 0.15
b = T-shirt 0.075 0.075 0.15
b = skirt 0.05 0.05 0.1
b = shoes 0.05 0.05 0.1
(b) A and B are uniform distributions, where
fA(a) = (
1 −1 ≤ a ≤ 0
0 else,
fB(b) = (
1 0 ≤ b ≤ 1
0 else,
,
fA,B(a, b) = (
4/3 |a + b| ≤ 1/2, −1 ≤ a ≤ 0, 0 ≤ b ≤ 1
0 else,
(c) A follows the p.m.f.
pA(a) = (
0.5, a = 1
0.5, a = −1
and B = A · C where
pC (c) = (
0.9, c = 1
0.1, c = −1
1
(d) A and B are Gaussian distributions, with the following properties:
E[A] = 0, E[B] = 1, E[A
2
] = 1, E[(B − 1)2
] = 1/2, E[A(B − 1)] = −1.
Writing in terms of the usual Gaussian distribution form, if we form a random vector as X =
A
B
, then
µ =
E[A]
E[B]
, Σ =
E[(A − E[A])2
] E[(A − E[A])(B − E[B])]
E[(A − E[A])(B − E[B])] E[(B − E[B])2
]
4. (2 pts) Probability and statistics. I have 4 children, Alexa, Siri, Googs, and Zuckie. Every morning I tell them to
put on their socks.
Alexa only listens to me on Mondays and Thursdays and puts on her socks. The rest of the days, she puts on
her socks only half of the time. She either puts on both her socks or none of her socks.
Siri always runs and gets her socks, but only puts one sock on.
Googs tells me all this random trivia about socks, but never puts on his socks.
Zuckie wears both his socks 4/7 of the time and sells the rest of them to CambridgeAnalytica.
Assume the children all act independently. Round all answers to at least 3 significant digits.
(a) (0.5 pts) What are the chances that either Alexa or Zuckie is wearing a sock?
(b) (0.5 pts) On a random day, a girl is wearing a sock. What are the chances that it’s Alexa?
(c) (0.5 pts) What is the expected number of socks being worn by each child?
(d) (0.5 pts) What is the variance in the number of socks being work by each child?
5. (2 pts, 0.5 points each) Conditional independence vs independence. Tom is a blue-gray cat with a bushy tail,
and Jerry is a brown mouse with a rope-like tail. After many years of fighting, they both decided to settle down,
and now have thriving families. Tom has 10 kids and Jerry has 40 kids. Tom’s kids are all cats like him, with bushy
tails. Half of Tom’s kids are blue, while the other half is gray. Jerry’s kids are all brown mice, with rope-like tails.
(a) I pick up a baby animal at random. What is the probability that … (fill in the table)
fur \ tail furry rope-like
blue
gray
brown
(b) Are the features “fur color” and “tail texture” independent or dependent, without knowing the type of animal?
(Show mathematically.)
(c) Now Tom comes over and says, “I’m very proud of my baby girl, of whom you are holding.” What is the
probability that (fill in the table)
fur \ tail furry rope-like
blue
gray
brown
2
(d) Are the features “fur color” and “tail texture” independent or dependent, now that I know the animal is Tom’s
cherished baby daughter? (Show mathematically.)
6. (3 pts) Exponential distribution. Wait time is often modeled as an exponential distribution, e.g.
Pr(I wait less than x hours at the DMV) = (
1 − e
−λx, x > 0
0, x ≤ 0,
and this cumulative density function is parametrized by some constant λ > 0. A random variable X distributed
according to this CDF is denoted as X ∼ exp[λ].
(a) (0.25 pts) In terms of λ, give the probability distribution function for the exponential distribution.
(b) (0.25 pts) Show that if X ∼ exp(λ), then the mean of X is 1/λ and the variance is 1/λ2
.
(You may use a symbolic integration tool such as Wolfram Alpha. If you do wish to do the integral by hand,
my hint is to review integration by parts.)
(c) (0.5 pts) Now suppose I run a huge server farm, and I am monitoring the server’s ability to respond to web
requests. I have m observations of delay times, x1, …, xm, which I assume are i.i.d., distributed according to
exp[λ] for some λ. Given these m observations, what is the maximum likelihood estimate λˆ of λ?
(d) (1 pt) Given the estimate of λˆ in your previous question, is 1/λˆ an unbiased estimate of the mean wait time?
Is 1/λˆ2 an unbiased estimate of the variance in wait time?
(e) (1 pt) Now let’s consider x1, …, xm drawn i.i.d. from a truncated exponential distribution, e.g.
pλ,c(x) =
0 if x > c or x < 0
λ exp(−λx)
1 − exp(−λc)
else.
Using Hoefdding’s inequality, give a range of values that account for the uncertainty in your guess. That is,
as a function of xi
, m and δ, give a range of values [λˆmin, λˆmax] such that
Pr(λˆmin ≤ E[X] ≤ λˆmax) ≥ 1 − δ.
3
Challenge!
Generalizing the Cauchy Schwartz inequality
We saw in lecture the Cauchy Schwartz inequality, which says that for any two vectors x ∈ R
n, y ∈ R
n,
x
T
y ≤ kxk2kyk2, xT
y = kxk2kyk2 ⇐⇒ x = cy
for some positive scalar c.
1. Holder’s inequality generalizes this to dual norms. That is, for any p and q where 1
p +
1
q = 1, we have
x
T
y ≤ kxkpkykq
where equality is reachable for a specific choice of x and y.
(a) When p = 1, the corresponding q is q = +∞, and kxk∞ is the max-norm, e.g. kxk∞ = maxi
|xi
|. Prove
Holder’s inequality for this choice of p and q. That is, prove that for any x and any y,
x
T
y ≤ kxk1kyk∞.
(b) For p = 1, q = +∞, given x, list the entire set of possible y where x
T y = kxk1kyk∞. Hint: The rule for
x = [0, 0, …, 0]T
is different than for x if all the values are nonzero.
2. We can further generalize Holder’s inequality to include conic constraints. For example, suppose x is restricted to
the nonnegative orthant, e.g. xi ≥ 0 for all i. Then
x
T
y ≤
X
i
xi
!
max
i
yi
. (∗)
(a) Prove (∗).
(b) List the set of y such that, given x, (∗) is true with equality.
3. The singular value decomposition of a matrix X ∈ R
m×n decomposes X to its singular values and vectors. It is
usually written as
X =
Xr
i=1
siuiv
T
i
where ui ∈ R
m are the left singular values, vi ∈ R
n are the right singular values, and si are positive scalars. Here,
r is the rank of X. Each singular vector are orthonormal, e.g.
u
T
i uj =
(
1 if i = j
0 else,
v
T
i
vj =
(
1 if i = j
0 else.
The trace norm of X is the sum of its singular values (denoted kXk∗). The spectral norm of X is its largest singular
value (denoted kXk2).
(a) Prove the following generalization of the Cauchy Schwartz inequality for matrices X ∈ R
m×n and Y ∈ R
m×n
tr(XT Y ) ≤ kXk∗kY k2.
(b) Given X, describe the set of Y such that tr(XT Y ) = kXk∗kY k2.
4