## Description

Problem 1: Model Selection and Validation [20 points]

(a) Consider the training data Strain shown in Figure 1, and the two test sets Stest1 and

Stest2 shown in Figures 2 and 3. o’s indicate positive cases, x’s indicate negative.

What are the training and testing accuracies (for both test sets), expressed in terms

of the fraction of points classified correctly, for a linear model which classifies a case

as positive if x2 ≥ x1?

(b) Give the same for the decision tree shown in Figure 5.

(c) Give the χ

2

statistic for the McNemar test comparing these two models for Stest1.

With what confidence can we say that one model generalizes better than the other

for this validation set?

(d) Repeat part c for Stest2.

2-1

(e) Did your answers change for the two test sets in parts a and b? What about parts c

and d? Why might one stay the same, while the other changes?

Figure 1: Training data Strain

Figure 2: Test data Stest1 Figure 3: Test data Stest2

2-2

Figure 4: Decision tree for Problem 1

2-3

Problem 2: Model Averaging with Decision Trees [40 points]

In this problem we will look at a general machine learning technique of model averaging,

applied to decision trees. We will compare how a collection of multiple decision trees

fares against a single tree, in a simple binary classification task on synthetic data.

In parts a and b of this problem, you are given synthetic data and you are asked to

experimentally investigate the effect of combing multiple decision trees into a single

classifier.

In parts c and d of this problem, you will investigate this idea from a mathematical

standpoint, and in part e, you will combine your empirical and theoretical results.

For parts a and b you are given a small toy dataset with 290 training instances and binary

labels (circle.train – in the usual SVMLight format). The instances xtrain ∈ R

2

(in the

SVMLight file, features 0 and 1 correspond to x and y respectively) were generated by

sampling a 10×10 grid, such that:

• Exactly 145 points fall inside a circle of radius 3 at the center of the square. These

instances belong to the positive class.

• Exactly 145 points fall outside a circle of radius 3 at the center of the square.

These instances belong to the negative class.

–

+

Classes

10

10

R=3

(0, 0)

(5,5)

Figure 5: Synthetic dataset for Problem 2

(a) Individal decision trees Implement and train a TDIDT decision tree on the provided

training set to obtain 0 training error. Use information gain as a criterion for splitting, with intermediate nodes of the tree splitting according to the attribute ≥

threshold where threshold is an integer (as in HW1).

Generate a plot that shows the decision boundary (region) of this decision tree.

The easiest way to do it, is to produce a fine grid of points (test instances) in the

10×10 square, query your decision tree for the label at each point, and then color the

2-4

respective point with the color of the corresponding label. With a fine enough grid,

this will generate regions nicely shaded according to their predicted class.

(b) Averaging decision trees We will now explore the idea of combining miltiple decision

trees for producing a single classifier. We will be training 101 decision trees as follows:

For j ∈ [1, 101]

• Randomly select 20% of your training data into a smaller training set Xj

.

• Train a decision tree Tj to obtain 0 training error on the training set Xj (same

splitting criteria as above).

End For

Note that the same training instance may appear multiple times in different subsets

Xj

Generate a plot that shows a decision boundary (region) of 2 such decision trees.

Select two trees that are particularly illustrative examples of overfitting. Plot their

decision boundaries (regions) as you did in part a.

Now you will combine all 101 decision trees into a single classifier. For a test instance

xtest, this classifier will predict +1 if the majority of the 101 decision trees predict

+1 on xtest, and −1 otherwise.

Generate a plot that shows a decision boundary (region) of this combined classifier.

Comment on how the boundary of the combined classifier compares to the boundaries of the individual classifiers, and to the true boundary that was used to generate

the data. How does the combined classifier compare to the individual classifiers in

terms of overfitting?

(c) Show that a prediction ˆy ∈ {−1, 1} of a decision tree T on test instance xtest can be

expressed as follows:

yˆ = sign Xn

i

yiK(xtest, xi)

!

where ˆy = T(xtest), n is the number of training instances and K(xtest, xi) is a similarity measure between a test instance xtest and a training instance xi

, and yi

is the label

(binary) of the training instance xi

. The similarity measure K(xtest, xi) = 1/k if the

test instance xtest is in the same leaf with the training instance xi and 0 otherwise,

and k is the total number of training instances in that leaf.

(d) Now consider the case of averaging predictions from M different trees:

2-5

yˆ = sign

1

M

X

M

j

Xn

i

yiKj (xtest, xi)

!

where Kj (·, ·) is the similarity measure corresponding to tree Tj

. Show that this

predicion can also be written in the form:

yˆ = sign Xn

i

yiK˜ (xtest, xi)

!

where K˜ is a new similarity measure. Give an expression for K˜ in terms of the

variables above, and an intuitive interpretation of the similarity measure for the

decision tree model average. Assume K(xtest, xi) = 0 if xi 6∈ Xj

.

2-6

Problem 3: Text Categorization with Decision Trees [40 points]

In this problem, we will be performing text categorization with a decision tree. We

will be classifying online newsgroup posts into 1 of 4 categories: AUTO, COMPUTERS,

RELIGION, SPORTS. We will use the same bag-of-words feature representation as used

in HW1. Three files are supplied for this problem: groups.train, groups.test, group.vocab.

groups.train and groups.test are in the familiar SVMLight format and each include a total

of 2000 documents (instances), and a vocabulary (feature) size of 2000. Feature-word

mappings are provided in the groups.vocab file, where the first number on every line of

the groups.vocab file specifies the index of the feature.

(a) Decision Tree I Implement and train a TDIDT decision tree with the information

gain criterion, with splits of the form feature x > 0 (i.e. test if a word corresponding

to feature x is present in the document) to obtain 0 training error. Report test

error that this tree obtains on the test set groups.test.

(b) Informative words List the words (from groups.vocab) corresponding to the splits

at the top 2 levels of the tree. List the words corresponding to the splits at the

bottom 2 levels of the tree. What difference do you observe between the words at

the top and the words at the bottom of the tree. What can you hypothesize about

the generalization ability of the complete tree you just trained?

(c) Early stopping Modify the decision tree you learned in part a by restricting its depth

to 10 (i.e. cut off all nodes below the 10th level). Report training and test accuracy

of this classifier on the test set group.test.

(d) Comparing classifiers We will now compare the generalization accuracy of the decision tree you trained in part a with the early-stopped tree in part c. Use the

McNemar test and explain whether you can conclude with greater than 95% confidence whether one of the two classifiers has lower generalization accuracy than the

other. Show all work that gets you to your conclusion.

(e) Comparing learning algorithms Suppose that you hypothesize that simply checking

for the appearance of the word in the document is not enough. You decide to modify your decision tree learning algorithm to additionally check if the word had also

appeared more than once (i.e. your tree can now split according to the rule feature

x > {0, 1}. Modify your decision tree to accommodate for this change, and compare

the generalization accuracy of the decision trees produced by the training algorithm

in part c (one threshold, 10 levels) and the modified algorithm that includes the

additional splitting threshold (two thresholds, 10 levels). Concatenate training and

test sets supplied, and perform 5-fold cross-validation over the combined sets. Use

the paired t-test to establish whether the two algorithms have different classification

accuracies at the 95% confidence level (you can ignore that the different folds are

not statistically independent when you apply the test, but always keep in mind that

2-7

the test is only approximate in this situation). Show all work that gets you to your

conclusion.

(f) Model selection We will now explore the technique of cross-validation for picking an

optimal parameter (maximum depth of the decision tree) for our classifier with 1

threshold (feature x > 0). Perform 5-fold cross-validation over the combined set

from part e once for each value of maximum depth in the set {2, 3, 5, 10, 50, 80}.

Plot the average accuracy for each run on a linear scale. What tree depth results in

the highest average accuracy?

2-8