## Description

1. [4pts] Nearest Neighbours and the Curse of Dimensionality. In this question, you

will verify the claim from lecture that “most” points in a high-dimensional space are far away

from each other, and also approximately the same distance. There is a very neat proof of this

fact which uses the properties of expectation and variance. If it’s been a long time since you’ve

studied these, you may wish to review the Tutorial 1 slides, or the Metacademy resources3

.

(a) [1pts] For each choice of dimension d ∈ [20

, 2

1

, 2

2

, …, 2

10], sample 100 points from the

unit cube, and record the following average distances between all pairs of points, as well

as the standard deviation of the distances.

i. Squared Euclidean or `2 distance = |x − yk

2

2 =

P

j

(xj − yj )

2

ii. `1 distance = kx − yk1 =

P

j

|xj − yj |

Plot both the average and the standard deviation as a function of d.

(You may wish to use np.mean and np.std to compute the statistics, and matplotlib

for plotting. You may find numpy.random.rand helpful in sampling from the unit cube.)

Include the output figure in your solution PDF (hw1_writeup.pdf).

(b) [1pts] In this question, we aim to verify our simulations in (a) by deriving the analytical

form of averaged Euclidean distance and variance of Euclidean distance. Suppose we

sample two points X and Y independently from a unit cube in d dimensions and define

the squared Euclidean distance R = Z1+· · ·+Zd with Zi = (Xi−Yi)

2

. Given that E[Zi

] =

1

6

and Var[Zi

] = 7

180 , determine E[R] and Var[R] using the properties of expectation and

variance. You may give your answer in terms of the dimension d.

Basic rule of expectation and variance: E[Zi+Zj ] = E[Zi

]+E[Zj ] and if we further

know Zi and Zj are independent, then Var[Zi + Zj ] = Var[Zi

] + Var[Zj ].

1

https://markus.teach.cs.toronto.edu/csc311-2021-09

2

http://www.cs.toronto.edu/~rahulgk/courses/csc311_f22/index.html

3

https://metacademy.org/graphs/concepts/expectation_and_variance

1

CSC311 Fall 2022 Homework 1

(c) [2pts] In probability theory, one can derive that P(|Z − E[Z]| ≥ a) ≤

Var[Z]

a

2 for any

random variable Z. (This fact is known as Markov’s Inequality.) Based on your answer

to part (b), explain why does this support the claim that in high dimensions, “most

points are approximately the same distance”? Let’s justify this step-by-step:

i. We want to bound the probability that any given distance is at most d distance

away from its expectation. How would you mathematically write down this event,

E?

ii. Use Markov’s Inequality to bound P(E).

iii. Apply the result in part (b) and note what happens to P(E) as d goes to ∞.

2. [8pts] Decision Trees. This question is taken from a project by Lisa Zhang and Michael

Guerzhoy.

In this question, you will use the scikit-learn decision tree classifier to classify real vs. fake

news headlines. The aim of this question is for you to read the scikit-learn API and get

comfortable with training/validation splits.

We will use a dataset of 1298 “fake news” headlines (which mostly include headlines of articles

classified as biased, etc.) and 1968 “real” news headlines, where the “fake news” headlines

are from https://www.kaggle.com/mrisdal/fake-news/data and “real news” headlines are

from https://www.kaggle.com/therohk/million-headlines. The data were cleaned by

removing words from fake news titles that are not a part of the headline, removing special

characters from the headlines, and restricting real news headlines to those after October

2016 containing the word “trump”. For your interest, the cleaning script is available as

clean_script.py on the course web page, but you do not need to run it. The cleaned-up

data are available as clean_real.txt and clean_fake.txt on the course web page.

Each headline appears as a single line in the data file. Words in the headline are separated

by spaces, so just use str.split() in Python to split the headlines into words.

You will build a decision tree to classify real vs. fake news headlines. Instead of coding

the decision trees yourself, you will do what we normally do in practice — use an existing

implementation. You should use the DecisionTreeClassifier included in sklearn. Note

that figuring out how to use this implementation is a part of the assignment.

Here’s a link to the documentation of sklearn.tree.DecisionTreeClassifier: http://

scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.

html

All code should be included in the file hw1_code.py which you submit through MarkUs.

(a) [2pt] Write a function load_data which loads the data, preprocesses it using a vectorizer

(http://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_

extraction.text), and splits the entire dataset randomly into 70% training, 15% validation, and 15% test examples.

(b) [2pt] Write a function select_model which trains the decision tree classifier using at

least 5 different values of max_depth, as well as three different split criteria (information gain, log loss and Gini coefficient), evaluates the performance of each one on

the validation set, and prints the resulting accuracies of each model. You should use

DecisionTreeClassifier, but you should write the validation code yourself. In your

solution PDF (hw1_writeup.pdf), include the output of this function as well as a plot

of the validation accuracy vs. max_depth.

2

CSC311 Fall 2022 Homework 1

(c) [1pt] Now let’s stick with the hyperparameters which achieved the highest validation

accuracy. Extract and visualize the first two layers of the tree. Your visualization may

look something like what is shown below, but it does not have to be an image: it is

perfectly fine to display text. It may be hand-drawn. Include your visualization in your

solution PDF (hw1_writeup.pdf).

(d) [3pts] Write a function compute_information_gain which computes the information

gain of a split on the training data. That is, compute I(Y, xi), where Y is the random

variable signifying whether the headline is real or fake, and xi

is the keyword chosen for

the split.

Report the outputs of this function for the topmost split from the previous part, and for

several other keywords.

3. [8pts] Regularized Linear Regression. For this problem, we will use the linear regression

model from the lecture:

y =

X

D

j=1

wjxj + b.

In lecture, we saw that regression models with too much capacity can overfit the training data

and fail to generalize. We also saw that one way to improve generalization is regularization:

adding a term to the cost function which favors some explanations over others. For instance,

we might prefer that weights not grow too large in magnitude. Elastic Net regularization

combines their `1 and `2 norms and encourages them to stay small. It adds the following

penalty:

R(w) = λ1

2

||w||1 +

λ2

2

w>w = λ1

X

D

j=1

|wj | +

λ2

2

X

D

j=1

w

2

j

3

CSC311 Fall 2022 Homework 1

to the cost function, for some λ1, λ2 ≥ 0. It is also possible to apply different regularization

penalties in each dimension. The formulation would be:

J

αβ

reg (w) = 1

2N

X

N

i=1

y

(i) − t

(i)

2

| {z }

=J

+

X

D

j=1

αj |wj | +

1

2

X

D

j=1

βjw

2

j

| {z }

=R

,

where i indexes the data points, αj , βj ≥ 0 for all j, and J is the same squared error cost

function from lecture. Note that in this formulation, there is no regularization penalty on the

bias parameter. Also note that when αj = βj = 0, you don’t apply any regularization on

j-th dimension. For this question, show your work in detail as most points are allocated in

showing how you obtained your answer.

(a) [3pts] Determine the gradient descent update rules for the regularized cost function

J

αβ

reg . You may notice that the absolute value function is not differentiable everywhere,

in particular at 0. For the purpose of this question, let us assume that the gradient at

0 is 0. Your answer should have the form:

If wj > 0:

wj ← · · ·

b ← · · ·

If wj = 0:

wj ← · · ·

b ← · · ·

If wj < 0:

wj ← · · ·

b ← · · ·

This form of regularization is a version of what is sometimes called “weight decay”.

Based on this update rule, why do you suppose that is?

Hint: Try writing the `1 term as a piecewise functions and determine the gradient for

each piece separately.

(b) [3pts] For the remaining part of the question, consider the special case where λ1 = 0.

In other words, we only apply the `2 penalty. It is possible to solve this regularized regression problem, also called Ridge Regression, directly by setting the partial derivatives

equal to zero. In this part, for simplicity, we will drop the bias term from the model, so

our model is:

y =

X

D

j=1

wjxj .

It is possible to derive a system of linear equations of the following form for J

β

reg:

∂J

β

reg

∂wj

=

X

D

j

0=1

Ajj0wj

0 − cj = 0.

Determine formulas for Ajj0 and cj .

4

CSC311 Fall 2022 Homework 1

(c) [2pts] Based on your answer to part (b), determine formulas for A and c, and derive a

closed-form solution for the parameter w. Note that, as usual, the inputs are organized

into a design matrix X with one row per training example.

5