# CSC311 Homework 2 Expected Loss and Bayes Optimality

\$30.00

## Description

1. [9pts] Expected Loss and Bayes Optimality
You are running an email service, and one of your key features is a spam filter. Every email
is either spam or non-spam, which we represent with the target t ∈ {Spam, NonSpam}. You
need to decide whether to keep it in the inbox or remove it to the spam folder. We represent
this with the decision variable y ∈ {Keep, Remove}. We’d like to remove spam emails and
keep non-spam ones, but the customers will be much more unhappy if we remove a non-spam
email than if we keep a spam email. We can represent this with the following loss function
J (y, t):
NonSpam Spam
Keep 0 1
Remove 500 0
Your studies indicate that 20% of the emails are spam, i.e. Pr(t = Spam) = 0.2.
(a) [2pts] Evaluate the expected loss E[J (y, t)] for the policy that keeps every email (y =
Keep), and for the policy that removes every email (y = Remove).
(b) [2pts] Now suppose you get to observe a feature vector x for each email, and using
your knowledge of the joint distribution p(x, t), you infer p(t| x). Determine how you
will make Bayes optimal decision y∗ ∈ {Keep, Remove} given the conditional probability
Pr(t = Spam | x).
1
https://markus.teach.cs.toronto.edu/2022-09
2
http://www.cs.toronto.edu/~rahulgk/courses/csc311_f22/index.html
1
CSC311 Fall 2022 Homework 2
(c) [4pts] After some analysis, you found two words that are indicative of an email being
spam: “Viagra” and “Nigeria”. You define two input features:
x1 = 1 if the email contains the word “Viagra” and x1 = 0 otherwise, and
x2 = 1 if the email contains the word “Nigeria” and x2 = 0 otherwise.
For a spam email, the two features have the following joint distribution:
x2 = 0 x2 = 1
x1 = 0 0.45 0.25
x1 = 1 0.18 0.12
For a non-spam email, the two features have the following joint distribution:
x2 = 0 x2 = 1
x1 = 0 0.996 0.002
x1 = 1 0.002 0
Determine how you will make Bayes optimal decision y∗ ∈ {Keep, Remove} given the
(d) [1pts] What is the expected loss E[J (y∗, t)] for the Bayes optimal decision rule from
part (c)?
2
CSC311 Fall 2022 Homework 2
2. [3pts] Feature Maps.
Suppose we have the following 2-D data-set for binary classification:
x1 x2 y
-2 -1 1
1 2 0
2 3 1
(a) [3pts] Explain why this data-set is NOT linearly separable in a few sentences. Hint:
(b) [0pts] Suppose you are interested in studying if the above data-set can be separable by
a quadratic functions y = w1x1 + w2x
2
2
. Write all the constraints that w1, w2 have to
satisfy. You do not need to solve for w1, w2 using the constraints.
Solution (provided to help prep for future tests). The constraint on w1 and w2
are as follows:
−2w1 + w2 ≥ 0
w1 + 4w2 < 0
2w1 + 9w2 ≥ 0
3
CSC311 Fall 2022 Homework 2
3. [16pts] kNN vs. Logistic Regression. In this problem, you will compare the performance
and characteristics of two classifiers: k-Nearest Neighbors and Logistic Regression. You will
complete the provided code in q2/ and experiment with the completed code. You should
understand the code instead of using it as a black box.
3.1. k-Nearest Neighbors. Use the supplied kNN implementation to predict labels for
mnist_valid, using the training set mnist_train.
(a) [3pts] Implement the function run_knn in run_knn.py that runs kNN for different
values of k ∈ {1, 3, 5, 7, 9} and plots the classification accuracy on the validation set
as a function of k. The classification accuracy is the number of correctly predicted
data points divided by the total number of data points. Include the plot in the
write-up.
(b) [2pts] Choose a value of k (strictly positive integer) and justify your choice. Let’s
denote the chosen value by k

. Report the validation and test accuracies of KNN
for k

. Also, report the validation and test accuracies of KNN for k
∗ + 2 and k
∗ − 2
(report the latter if your value of k

is greater than two).
How do the test and validation accuracies change as we change the value of k?
In general, you shouldn’t peek at the test set multiple times, but we do this for this
question as an illustrative exercise.
3.2. Logistic Regression. Read the provided code in run_logistic_regression.py and
logistic.py. You need to implement the logistic regression model, where the cost is
defined as:
J =
X
N
i=1
LCE(y
(i)
, t(i)
) = X
N
i=1

−t
(i)
log y
(i) − (1 − t
(i)
) log(1 − y
(i)
)

,
where N is the total number of data. Note that the cost function is the total loss instead
of the average loss.
(a) [4pts] Implement functions logistic_predict, evaluate, and logistic in the file
logistic.py.
(b) [5pts] Complete the missing parts in the function run_logistic_regression. Run
the code on both mnist_train and mnist_train_small. Check whether the value
returned by run_check_grad is small to make sure your implementation in part (a)
is correct.
Experiment with the hyper-parameters for the learning rate, the number of iterations, and the initial values of the weights. If you have a smaller learning rate, your
model will take longer to converge.
If you get NaN/Inf errors, you may try to reduce your learning rate or initialize
with smaller weights. Report the best hyper-parameter settings you found and the
final cross entropy and classification accuracy on the training, validation, and test
sets. You should only compute the test error once you have selected your best
hyper-parameter settings using the validation set.
4
CSC311 Fall 2022 Homework 2
(c) [2pts] Examine how the cross entropy changes as training progresses. Generate and
report 2 plots, one for each of mnist_train and mnist_train_small. In each plot,
you need show two curves: one for the training set and one for the validation set.
Run your code several times and observe if the results change. If they do, how would
you choose the best parameter settings?
5
CSC311 Fall 2022 Homework 2
4. [8pts] Locally Weighted Regression.
(a) [2pts] Let the training set be denoted by {(x
(1), y(1)), ..,(x
(N)
, y(N)
)}. Consider positive
weights a
(1), …, a(N)
. We will define the weighted least squares problem as follows.
w∗ = arg min
1
2
X
N
i=1
a
(i)
(y
(i) − wT x
(i)
)
2 +
λ
2
||w||2
Show that the closed-form solution to the weighted least squares problem is given by the
formula below.
w∗ =

XT AX + λI
−1XT Ay
where X is the N by D design matrix and A is a diagonal matrix where Aii = a
(i)
.
Note that the loss function is defined to be the total loss (not the average loss) over all
the training examples.
(b) [2pts] Locally reweighted least squares combines ideas from k-NN and linear regression.
For each new test example x, we
(1) compute distance-based weights for each training example
a
(i) =
exp

||x − x
(i)
||2

2
!
X
j
exp

||x − x
(j)
||2

2
!,
(2) computes the optimal weights
w∗ = arg min
1
2
X
N
i=1
a
(i)
(y
(i) − wT x
(i)
)
2 +
λ
2
||w||2
,
(3) and predicts
yˆ = x
T w∗
.
Complete the implementation of locally reweighted least squares by filling in the missing
parts in q4.py.
(c) [2pts] Randomly hold out 30% of the dataset as a validation set. Compute the average
loss for different values of τ in the range [10,1000] on both the training set and the
validation set. Plot the training and validation losses as a function of τ (using a log
scale for τ ).
(d) [1pt] How would this algorithm behave as τ → ∞? How would the algorithm behave
when τ → 0?
(e) [1pt] Compared to the traditional linear regression, what is one advantage and one