Description
COMS 4771 HW3
1 Inconsistency of the fairness definitions
Recall the notation and definitions of group-based fairness conditions:
Notation:
Denote X 2 Rd, A 2 {0, 1} and Y 2 {0, 1} to be three random variables: non-sensitive features of an instance, the instance’s sensitive feature and the target label of the instance respectively,
such that (X, A, Y ) ⇠ D. Denote a classifier f : Rd ! {0, 1} and denote Yˆ := f(X).
For simplicity, we also use the following abbreviations:
P := P(X,A,Y )⇠D and Pa := P(X,a,Y )⇠D
Group based fairness definitions:
– Demographic Parity (DP)
P0[Yˆ = ˆy] = P1[Yˆ = ˆy] 8yˆ 2 {0, 1}
(equal positive rate across the sensitive attribute)
– Equalized Odds (EO)
P0[Yˆ = ˆy | Y = y] = P1[Yˆ = ˆy | Y = y] 8y, y ˆ 2 {0, 1}
(equal true positive- and true negative-rates across the sensitive attribute)
– Predictive Parity (PP)
P0[Y = y | Yˆ = ˆy] = P1[Y = y | Yˆ = ˆy] 8y, y ˆ 2 {0, 1}
(equal positive predictive- and negative predictive-value across the sensitive attribute)
Unfortunately, achieving all three fairness conditions simultaneously is not possible. An impossibility theorem for group-based fairness is stated as follows.
• If A is dependent on Y , then Demographic Parity and Predictive Parity cannot hold at the
same time.
• If A is dependent on Y and Yˆ is dependent on Y , then Demographic Parity and Equalized
Odds cannot hold at the same time.
• If A is dependent on Y , then Equalized Odds holds and Predictive Parity cannot hold at the
same time.
These three results collectively show that it is impossible to simultaneously satisfy the fairness definitions except in some trivial cases.
(i) State a scenario where all three fairness definitions are satisfied simultaneously.
(ii) Prove the first statement.
(iii) Prove the second statement.
(iv) Prove the third statement.
Hint: First observe that
P0[Y = y|Yˆ = ˆy] = P1[Y = y|Yˆ = ˆy] 8y, y ˆ 2 {0, 1}
is equivalent to:
P0[Y = 1|Yˆ = ˆy] = P1[Y = 1|Yˆ = ˆy] 8yˆ 2 {0, 1}.
A necessary condition for PP is the equality of positive predictive value (PPV):
P0[Y = 1|Yˆ = 1] = P1[Y = 1|Yˆ = 1]
To prove the third statement, it is enough to prove a stronger statement: if A is dependent on
Y , Equalized Odds and equality of Positive Predictive Value cannot hold at the same time.
Next, try to express the relationship between FPRa (= Pa[Yˆ = 1|Y = 0]) and FNRa (=
Pa[Yˆ = 0|Y = 1]) using pa (= P[Y = 1 | A = a]) and PPVa (= Pa[Y = 1|Yˆ = 1]),
8a 2 {0, 1} and finish the proof.
2 Combining multiple classifiers
The concept of “wisdom-of-the-crowd” posits that collective knowledge of a group as expressed
through their aggregated actions or opinions is superior to the decision of any one individual in the
group. Here we will study a version of the “wisdom-of-the-crowd” for binary classifiers: how can
one combine prediction outputs from multiple possibly low-quality binary classifiers to achieve an
aggregate high-quality final output? Consider the following iterative procedure to combine classifier
results.
Input:
– S – a set of training samples: S = {(x1, y1),…,(xm, ym)}, where each yi 2 {1, +1}
– T – number of iterations (also, number of classifiers to combine)
– F – a set of (possibly low-quality) classifiers. Each f 2 F, is of the form f : X ! {1, +1}
Output:
– F – a set of selected classifiers {f1,…,fT }, where each fi 2 F.
– A – a set of combination weights {↵1,…, ↵T }
Iterative Combination Procedure:
– Initialize distribution weights D1(i) = 1
m [for i = 1,…,m]
– for t = 1,…,T do
– // ✏j is weighted error of j-th classifier w.r.t. Dt
– Define ✏j := Pm
i=1 Dt(i) · 1[yi 6= fj (xi)] [for each fj 2 F]
– // select the classifier with the smallest (weighted) error
– ft = arg minfj2F ✏j
– ✏t = minfj2F ✏j
– // recompute weights w.r.t. performance of ft
– Compute classifier weight ↵t = 1
2 ln 1✏t
✏t
– Compute distribution weight Dt+1(i) = Dt(i) exp(↵tyift(xi))
– Normalize distribution weights Dt+1(i) = P
Dt+1(i)
i Dt+1(i)
– endfor
– return weights ↵t, and classifiers ft for t = 1,…,T.
Final Combined Prediction:
– For any test input x, define the aggregation function as: g(x) := P
t ↵tft(x), and return the prediction as sign(g(x)).
We’ll prove the following statement: If for each iteration t there is some t > 0 such that
✏t = 1
2 t (that is, assuming that at each iteration the error of the classifier ft is just t better than
random guessing), then error of the aggregate classifier
err(g) := 1
m
X
i
1[yi 6= sign(g(xi))] exp(2
X
T
t=1
2
t ).
That is, the error of the aggregate classifier g decreases exponentially fast with the number of combinations T!
(i) Let Zt := P
i Dt+1(i) (i.e., Zt denotes the normalization constant for the weighted distribution Dt+1). Show that
DT +1(i) = 1
m
1
Q
t Zt
exp(yig(xi)).
(ii) Show that error of the aggregate classifier
Q
g is upper bounded by the product of Zt: err(g)
t Zt.
(hint: use the fact that 0-1 loss is upper bounded by exponential loss)
(iii) Show that Zt = 2p✏t(1 ✏t).
(hint: noting Zt = P
i Dt(i) exp(↵tyift(xi)), separate the expression for correctly and incorrectly classified cases and express it in terms of ✏t)
(iv) By combining results from (ii) and (iii), we have that err(g) Q
t 2
p✏t(1 ✏t), now show
that:
Y
t
2
p✏t(1 ✏t) = Y
t
q
1 42
t exp(2
X
t
2
t ).
Thus establishing that err(g) exp(2
P
t 2
t ).
3 1-Norm Support Vector Machine
(i) Recall the standard support vector machine formulation:
minimize kwk2
2
subject to yi(w · xi + w0) 1, i = 1, . . . , m,
where m is the number of points and n is the number of dimensions, is a quadratic program
because the objective function is quadratic and the constraints are affine. A linear program
on the other hand uses only affine objective function and constraints, and is generally easier
to solve than a quadratic program.
By replacing the 2-norm in the objective function with the
1-norm (kxk1 = Pn
j=1 |xj |), we get
minimize kwk1
subject to yi(w · xi + w0) 1, i = 1, . . . , m.
Note that the objective function here is not linear because there are absolute values involved.
Show that this problem is equivalent to a linear program with 2n variables and m + 2n constraints.
(ii) The Chebyshev (`1) distance between two points x and y is defined as maxi |xi yi|. Show
that the 1-norm SVM maximizes the Chebyshev distance between the two separating hyperplanes w · x + w0 = ±1. (Hint: Show that the vector (sign(w1),…sign(wn)) minimizes the
l1 distance from the origin to the plane w · x = 2.)
(iii) When the input data are not perfectly separable, we can apply a soft-margin approach (this is
an alternative to the usual slack-variables approach discussed in class):
minimize kwk1 +Xm
i=1
[1 yi(w · xi + w0)]+ , (1)
where [·]+ is the hinge loss function given by max(0, ·). Note that we’ve replaced the constraints with a penalty in the objective function.
·
Using the fact that strong duality always applies for linear programs, show that (1) can be
expressed as
maximize k⇡k1
subject to
Xm
j=1
yixij⇡i
1 j = 1, . . . , n,
Xm
i=1
yi⇡i = 0,
0 ⇡i 1 i = 1, . . . , n,
where ⇡ 2 Rm. (Hint: First express (1) as a linear program and then find its dual.)
(iv) Suppose we know that the output y depends only on a few input variables (i.e. the optimal w
is sparse). Would the 1-norm or 2-norm SVM make more sense? Justify your answer.
4 Estimating Model Parameters for Regression
Let P be the probability distribution on Rd⇥R for the random pair(X, Y )(where X = (X1,…,Xd))
such that
X1,…,Xd ⇠iid N(0, 1), and Y |X = x ⇠ N(xT, kxk2), x 2 Rd
Here, = (1,…, d) 2 Rd are the parameters of P, and N(µ, 2) denotes the Gaussian distribution with mean µ and variance 2.
(i) Let (x1, y1),…,(xn, yn) 2 Rd ⇥ R be a given sample, and assume xi 6= 0 for all i =
1,…,n. Let f be the probability density for P as defined above. Define Q : Rd ! R by
Q() := 1
n
Xn
i=1
ln f(xi, yi), 2 Rd.
Write a convex optimization problem over the variables = (1,…, d) 2 Rd such that its
optimal solutions are maximizers of Q over all vector of Euclidean length at most one.
(ii) Let (x1, y1),…,(xn, yn) 2 Rd ⇥ R be a given sample, and assume xi 6= 0 for all i =
1,…,n. Let f be the probability density for P as defined above. Define Q : Rd ! R by
Q() := 1
n
Xn
i=1
ln f(xi, yi), 2 Rd.
Find a system of linear equations A = b over variables = (1,…, d) 2 Rd such that the
solutions are maximizers of Q over all vectors in Rd.