## Description

## Problems:

1. (From Russell & Norvig) Suppose I pick some decision tree on functions going from 5

Boolean variables to a Boolean output. Now, I generate all possible inputs xi

, and run them

through the decision tree to produce the corresponding classes yi

. Finally, I take this generated dataset of all (xi

, yi) pairs, and run a greedy decision tree learning algorithm using

information gain as the splitting criterion. Am I guaranteed to get back the same tree that

generated the data? If not, what kind of guarantee can I give about the tree that is returned?

2. Consider the following dataset: the class variable is whether or not a car gets good mileage,

and the features are Cylinders (either 4 or 8), Displacement (High, Medium, or Low), and

Horsepower (High, Medium, or Low).

Cylinders Displacement Horsepower GoodMPG?

4 Low Low No

8 High High No

8 High Medium Yes

8 High High No

4 Low Medium Yes

4 Medium Medium No

Give a decision tree that classifies this dataset perfectly. If you were to split this dataset using

information gain, what would the first feature chosen to split on be?

3. Consider a training set of size n, where each example has two continuous features, no two

examples have the exact same value for any of the two features, and the problem is a twoclass problem. Suppose the training set is linearly separable. Can a decision tree correctly

separate the data? What if the dataset is not linearly separable? In either case, if a decision

tree can correctly separate the data, give the tightest bound that you can on the depth of that

tree.

1

4. Suppose you apply bagging and boosting to a hypothesis space of linear separators. Will the

hypothesis space of the ensemble still be linear for boosting? For bagging?

5. (From Russell & Norvig) Construct an SVM that computes the XOR function. Use values of

+1 and -1 for the inputs and outputs. Map inputs (x1, x2) into a space consisting of x1 and

x1x2. Draw the four input points in this space and the maximal margin separator. What is

the margin? Now draw the separating line back in the original input space.

6. The key point of the so-called “kernel trick” in SVMs is to learn a classifier that effectively separates the training data in a higher dimensional space without having to explicitly compute the representation Φ(x) of every point x in the original input space. Instead,

all the work is done through the kernel function that computes dot products K(xi

, xj) =

Φ(xi) · Φ(xj).

Show how to compute the squared Euclidean distance in the projected space between any

two points xi

, xj

in the original space without explicitly computing the Φ mapping, instead

using the kernel function K.

### Bonus Problem (25 extra points):

Bagging reduces the variance of a classifier by averaging over several classifiers trained on subsets

of the original training data. The subsets are obtained by uniform subsampling with replacement.

I.e. if your data is S = {(~x1, y1), . . . ,(~xn, yn)}, at each iteration you create a new data set S

0 with n

random picks, picking each example pair with probability 1

n

each time. As a result you could end

up with multiple identical pairs, or some not present at all.

Let pn(m, k) be the probability that you have drawn m unique examples after k picks with |S| = n.

So clearly pn(m, k) = 0 whenever m > k (because you cannot end up with more unique elements

m than you have drawn), and also pn(m, k) = 0 whenever m > n.

(a) What are the base-case values of pn(1, 1), pn(m, 1), pn(1, k)?

(b) Assume you are have already picked k−1 elements. What is the probability that the k

th pick

will not increase your number of unique elements m? What is the probability that it will?

(c) Can you express pn(m, k) in terms of pn(m, k − 1) and pn(m − 1, k − 1)?

(d) Write out the formula for Ek=n[

m

n

], the expected ratio of unique elements (m) and the total

number of elements (n) after n picks (i.e. k = n) from set S with |S| = n.

(e) Write a little recursive function (in the programming language of your choice) that evaluates

Ek=n[

m

n

]. Plot its value as n increases. What value does it converge to?

(f) If you average over M classifiers, trained on sub-sets S

0

1

, . . . , S0

M where |S

0

i

| = n, what is the

probability that at least one input pair is never picked in any of the training data sets S

0

i

?

Plot this function as M increases. (Assuming that n is large enough for the convergence as

observed in (c).)

2