CS 4780/5780 Assignment 2: Decision Trees and Hypothesis Testing


Problem 1: Model Selection and Validation [20 points]
(a) Consider the training data Strain shown in Figure 1, and the two test sets Stest1 and
Stest2 shown in Figures 2 and 3. o’s indicate positive cases, x’s indicate negative.
What are the training and testing accuracies (for both test sets), expressed in terms
of the fraction of points classified correctly, for a linear model which classifies a case
as positive if x2 ≥ x1?
(b) Give the same for the decision tree shown in Figure 4.
(c) Give the χ² statistic for the McNemar test comparing these two models for Stest1.
With what confidence can we say that one model generalizes better than the other
for this validation set?
(d) Repeat part c for Stest2.
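As a reminder of the computation involved, here is a minimal sketch of the McNemar χ² statistic, assuming the continuity-corrected form with one degree of freedom; n01 and n10 are the two off-diagonal disagreement counts (how many test points one model gets right and the other wrong). Whether your course uses the corrected or uncorrected form may differ, so treat this only as a reference.

# Hedged sketch: McNemar chi-squared statistic from the two disagreement counts.
# n01 and n10 are placeholders; fill them in from your own contingency table.
def mcnemar_chi2(n01, n10):
    if n01 + n10 == 0:
        return 0.0
    return (abs(n01 - n10) - 1) ** 2 / (n01 + n10)

# Compare the statistic against the chi-squared critical value with 1 degree of
# freedom (about 3.841 at the 95% confidence level).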
(e) Did your answers change for the two test sets in parts a and b? What about parts c
and d? Why might one stay the same, while the other changes?
Figure 1: Training data Strain
Figure 2: Test data Stest1
Figure 3: Test data Stest2
Figure 4: Decision tree for Problem 1
Problem 2: Model Averaging with Decision Trees [40 points]
In this problem we will look at the general machine learning technique of model averaging,
applied to decision trees. We will compare how a collection of multiple decision trees
fares against a single tree, in a simple binary classification task on synthetic data.
In parts a and b of this problem, you are given synthetic data and you are asked to
experimentally investigate the effect of combining multiple decision trees into a single
classifier.
In parts c and d of this problem, you will investigate this idea from a mathematical
standpoint, and in part e, you will combine your empirical and theoretical results.
For parts a and b you are given a small toy dataset with 290 training instances and binary
labels (circle.train, in the usual SVMLight format). The instances xtrain ∈ R² (in the
SVMLight file, features 0 and 1 correspond to x and y respectively) were generated by
sampling a 10×10 grid, such that:
• Exactly 145 points fall inside a circle of radius 3 at the center of the square. These
instances belong to the positive class.
• Exactly 145 points fall outside a circle of radius 3 at the center of the square.
These instances belong to the negative class.

Figure 5: Synthetic dataset for Problem 2 (a 10×10 square with corner at (0, 0); a circle of radius R = 3 centered at (5, 5) separates the positive class inside from the negative class outside)
(a) Individual decision trees Implement and train a TDIDT decision tree on the provided
training set to obtain 0 training error. Use information gain as the splitting criterion, with
intermediate nodes of the tree splitting according to rules of the form attribute ≥ threshold,
where threshold is an integer (as in HW1).
Generate a plot that shows the decision boundary (region) of this decision tree.
The easiest way to do this is to produce a fine grid of points (test instances) in the
10×10 square, query your decision tree for the label at each point, and then color the
respective point with the color of the corresponding label. With a fine enough grid,
this will produce regions nicely shaded according to their predicted class.
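One possible way to realize this grid-query-and-color idea is sketched below. It assumes a hypothetical trained tree object exposing a predict(x, y) method that returns +1 or −1; only the plotting idea is illustrated, not a specific tree implementation.

import numpy as np
import matplotlib.pyplot as plt

# Hedged sketch: color a fine grid of test points by the tree's predicted label.
# `tree.predict(x, y)` is a hypothetical method returning +1 or -1.
xs = np.linspace(0, 10, 200)
ys = np.linspace(0, 10, 200)
labels = np.array([[tree.predict(x, y) for x in xs] for y in ys])

plt.pcolormesh(xs, ys, labels, cmap="coolwarm", shading="auto")  # shaded decision regions
plt.xlabel("x (feature 0)")
plt.ylabel("y (feature 1)")
plt.title("Decision regions of a single tree")
plt.show()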
(b) Averaging decision trees We will now explore the idea of combining multiple decision
trees to produce a single classifier. We will train 101 decision trees as follows:
For j = 1, ..., 101:
• Randomly select 20% of your training data into a smaller training set Xj.
• Train a decision tree Tj to obtain 0 training error on the training set Xj (same
splitting criteria as above).
End For
Note that the same training instance may appear multiple times in different subsets Xj.
Generate plots that show the decision boundaries (regions) of 2 such decision trees:
select two trees that are particularly illustrative examples of overfitting, and plot their
decision boundaries (regions) as you did in part a.
Now you will combine all 101 decision trees into a single classifier. For a test instance
xtest, this classifier will predict +1 if the majority of the 101 decision trees predict
+1 on xtest, and −1 otherwise.
Generate a plot that shows a decision boundary (region) of this combined classifier.
Comment on how the boundary of the combined classifier compares to the boundaries of the individual classifiers, and to the true boundary that was used to generate
the data. How does the combined classifier compare to the individual classifiers in
terms of overfitting?
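For reference, here is a minimal sketch of the subsample-train-vote procedure described above. It assumes a hypothetical train_tree(X, y) helper that grows a zero-training-error tree and returns an object with a predict(x) method, and arrays X and y holding the circle.train instances and labels.

import numpy as np

# Hedged sketch of part (b): train 101 trees on random 20% subsets, then majority-vote.
# `train_tree` is a hypothetical helper; X is an (n, 2) array, y an (n,) array of +1/-1 labels.
rng = np.random.default_rng(0)
n = len(y)
trees = []
for j in range(101):
    idx = rng.choice(n, size=int(0.2 * n), replace=False)  # random 20% subset X_j
    trees.append(train_tree(X[idx], y[idx]))                # tree T_j fit to zero error on X_j

def ensemble_predict(x_test):
    # Predict +1 if the majority of the 101 trees predict +1, else -1 (101 is odd, so no ties).
    votes = sum(t.predict(x_test) for t in trees)
    return 1 if votes > 0 else -1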
(c) Show that a prediction ŷ ∈ {−1, 1} of a decision tree T on a test instance xtest can be
expressed as follows:

\hat{y} = \operatorname{sign}\left( \sum_{i=1}^{n} y_i \, K(x_{\mathrm{test}}, x_i) \right)

where ŷ = T(xtest), n is the number of training instances, K(xtest, xi) is a similarity
measure between a test instance xtest and a training instance xi, and yi is the (binary)
label of the training instance xi. The similarity measure K(xtest, xi) = 1/k if the
test instance xtest is in the same leaf as the training instance xi and 0 otherwise,
where k is the total number of training instances in that leaf.
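To make the similarity measure concrete, here is a small sketch that computes K(xtest, ·) empirically from a trained tree. It assumes a hypothetical tree.leaf_id(x) method that returns an identifier of the leaf an instance falls into, and X_train holding the n training instances the tree was grown on.

# Hedged sketch of the leaf-membership similarity from part (c).
# `tree.leaf_id(x)` is a hypothetical method; X_train holds the n training instances.
def leaf_similarity(tree, x_test, X_train):
    test_leaf = tree.leaf_id(x_test)
    leaf_ids = [tree.leaf_id(x_i) for x_i in X_train]
    k = sum(lid == test_leaf for lid in leaf_ids)  # training points sharing x_test's leaf (k >= 1)
    return [1.0 / k if lid == test_leaf else 0.0 for lid in leaf_ids]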
(d) Now consider the case of averaging predictions from M different trees:

\hat{y} = \operatorname{sign}\left( \frac{1}{M} \sum_{j=1}^{M} \sum_{i=1}^{n} y_i \, K_j(x_{\mathrm{test}}, x_i) \right)

where Kj(·, ·) is the similarity measure corresponding to tree Tj. Show that this
prediction can also be written in the form:

\hat{y} = \operatorname{sign}\left( \sum_{i=1}^{n} y_i \, \tilde{K}(x_{\mathrm{test}}, x_i) \right)

where K̃ is a new similarity measure. Give an expression for K̃ in terms of the
variables above, and an intuitive interpretation of the similarity measure for the
decision tree model average. Assume Kj(xtest, xi) = 0 if xi ∉ Xj.
Problem 3: Text Categorization with Decision Trees [40 points]
In this problem, we will be performing text categorization with a decision tree. We
will be classifying online newsgroup posts into 1 of 4 categories: AUTO, COMPUTERS,
RELIGION, SPORTS. We will use the same bag-of-words feature representation as used
in HW1. Three files are supplied for this problem: groups.train, groups.test, and groups.vocab.
groups.train and groups.test are in the familiar SVMLight format; each includes a total
of 2000 documents (instances), with a vocabulary (feature) size of 2000. Feature-word
mappings are provided in groups.vocab, where the first number on every line specifies
the index of the feature.
(a) Decision Tree I Implement and train a TDIDT decision tree with the information
gain criterion, with splits of the form feature x > 0 (i.e. test if a word corresponding
to feature x is present in the document) to obtain 0 training error. Report the test
error that this tree obtains on the test set groups.test.
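For reference, here is a minimal sketch of how the information gain of a single word-presence split (feature x > 0) could be computed. The names labels (a list of the four class labels) and present (a parallel list of booleans indicating whether the word occurs in each document) are hypothetical placeholders.

import math
from collections import Counter

# Hedged sketch: information gain of splitting on whether a word is present.
def entropy(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, present):
    yes = [l for l, p in zip(labels, present) if p]      # documents containing the word
    no = [l for l, p in zip(labels, present) if not p]   # documents without the word
    n = len(labels)
    return entropy(labels) - (len(yes) / n) * entropy(yes) - (len(no) / n) * entropy(no)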
(b) Informative words List the words (from groups.vocab) corresponding to the splits
at the top 2 levels of the tree. List the words corresponding to the splits at the
bottom 2 levels of the tree. What difference do you observe between the words at
the top and the words at the bottom of the tree? What can you hypothesize about
the generalization ability of the complete tree you just trained?
(c) Early stopping Modify the decision tree you learned in part a by restricting its depth
to 10 (i.e. cut off all nodes below the 10th level). Report the training accuracy and the
test accuracy (on the test set groups.test) of this classifier.
(d) Comparing classifiers We will now compare the generalization accuracy of the decision tree you trained in part a with the early-stopped tree in part c. Use the
McNemar test and explain whether you can conclude, with greater than 95% confidence, that one of the two classifiers has lower generalization accuracy than the
other. Show all work that gets you to your conclusion.
(e) Comparing learning algorithms Suppose that you hypothesize that simply checking
for the appearance of a word in the document is not enough. You decide to modify your
decision tree learning algorithm to additionally check whether the word appears more
than once (i.e. your tree can now split according to rules of the form feature x > t with
t ∈ {0, 1}). Modify your decision tree to accommodate this change, and compare
the generalization accuracy of the decision trees produced by the training algorithm
in part c (one threshold, 10 levels) and the modified algorithm that includes the
additional splitting threshold (two thresholds, 10 levels). Concatenate the supplied
training and test sets, and perform 5-fold cross-validation over the combined set. Use
the paired t-test to establish whether the two algorithms have different classification
accuracies at the 95% confidence level (you can ignore that the different folds are
not statistically independent when you apply the test, but always keep in mind that
the test is only approximate in this situation). Show all work that gets you to your
conclusion.
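For reference, here is a minimal sketch of the paired t statistic over the five per-fold accuracies; acc_a and acc_b are hypothetical length-5 lists of fold accuracies for the two algorithms.

import math

# Hedged sketch: paired t statistic over per-fold accuracy differences (k - 1 degrees of freedom).
def paired_t_statistic(acc_a, acc_b):
    k = len(acc_a)
    d = [a - b for a, b in zip(acc_a, acc_b)]
    mean_d = sum(d) / k
    var_d = sum((x - mean_d) ** 2 for x in d) / (k - 1)  # sample variance of the differences
    return mean_d / math.sqrt(var_d / k)

# Compare |t| against the two-sided t critical value with 4 degrees of freedom
# (about 2.776 at the 95% confidence level).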
(f) Model selection We will now explore the technique of cross-validation for picking an
optimal parameter (the maximum depth of the decision tree) for our classifier with one
threshold (feature x > 0). Perform 5-fold cross-validation over the combined set
from part e once for each value of maximum depth in the set {2, 3, 5, 10, 50, 80}.
Plot the average accuracy for each run on a linear scale. What tree depth results in
the highest average accuracy?
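One possible way to organize this experiment is sketched below, assuming a hypothetical cross_val_accuracy(X, y, max_depth, folds) helper that returns the average 5-fold accuracy, with X_all and y_all holding the combined set from part e.

import matplotlib.pyplot as plt

# Hedged sketch: 5-fold cross-validation over candidate maximum depths, plotted on a linear scale.
depths = [2, 3, 5, 10, 50, 80]
avg_acc = [cross_val_accuracy(X_all, y_all, max_depth=d, folds=5) for d in depths]

plt.plot(depths, avg_acc, marker="o")
plt.xlabel("maximum tree depth")
plt.ylabel("average 5-fold cross-validation accuracy")
plt.show()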