Problem 1: VC Dimension (35 pts)
1. Consider a binary classification problem for data points in ℝ² with a hypothesis space consisting of pairs of parallel lines such that any point between the pair is classified as '+' and points outside of the pair are classified as '−'. What is the VC dimension of this hypothesis space? Prove it. How many samples would be sufficient to guarantee that an optimal learning algorithm will attain an accuracy of .8 with probability at least .95?
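For the sample-complexity question, one standard sufficient bound (the form from Blumer et al., 1989; your course notes may use a different bound with different constants, so treat these as an assumption) expresses the required sample size m in terms of the VC dimension d, the error tolerance ε, and the failure probability δ:

    m ≥ max( (4/ε) log₂(2/δ), (8d/ε) log₂(13/ε) ).

Here accuracy .8 with probability at least .95 means ε = 0.2 and δ = 0.05, giving m ≥ max( 20 log₂ 40, 40d log₂ 65 ) ≈ max(107, 241d), with d the VC dimension established in the first part.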
2. Consider a binary classification problem for data points in ℝ. Let H be the hypothesis space of all intervals in ℝ. Given an interval in H, points inside the interval are classified as '+' and the remaining points are classified as '−'. Consider the boosted hypothesis space H′ that takes a pair of hypotheses from H and takes the sign of their weighted combination (similar to what would be produced by two rounds of boosting). Specifically,

H′ = {f | f(x) = sign(α₁h₁(x) + α₂h₂(x)) for some h₁, h₂ ∈ H and α₁, α₂ ∈ ℝ}.

To break ties, if α₁h₁(x) + α₂h₂(x) = 0, the hypothesis should return a '+'. What is VC(H′)? Prove it.
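As a concrete check of the tie-breaking rule, here is a minimal Python sketch (all function names hypothetical, not part of the assignment) that evaluates a member of H′; note that with equal positive weights, the tie case makes f behave like the union of the two intervals:

    # Interval hypothesis: +1 inside [a, b], -1 outside (hypothetical helper).
    def interval(a, b):
        return lambda x: 1 if a <= x <= b else -1

    # Member of H': sign of the weighted combination, ties broken toward '+'.
    def h_prime(h1, a1, h2, a2):
        return lambda x: 1 if a1 * h1(x) + a2 * h2(x) >= 0 else -1

    # With equal weights, points inside exactly one interval hit the tie case:
    f = h_prime(interval(0, 1), 1.0, interval(2, 3), 1.0)
    print([f(x) for x in (0.5, 1.5, 2.5)])  # [1, -1, 1]: the union of the intervals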
Problem 2: Medical Diagnostics (65 pts)
For this problem, you will use the data set provided with this problem set. The data has been divided into two pieces, heart train.data and heart test.data. These data sets were generated from the UCI SPECT heart data set (follow the link for information about the format of the data). Note that the class label is the first column in each data set.
1. Suppose that the hypothesis space consists of all decision trees with exactly two attribute
splits (repetition along the same path is allowed) for this data set.
(a) Run the AdaBoost algorithm for five rounds to train a classifier for this data set. Draw the five selected trees in the order that they occur and report the ε and α generated by AdaBoost for each (a minimal sketch of the boosting loop follows part (b)).
(b) Run the AdaBoost algorithm for 10 rounds of boosting. Plot the accuracy on the training and test sets versus iteration number.
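By way of illustration for parts (a) and (b), here is a minimal AdaBoost sketch in Python. It assumes the data files are comma-separated with the 0/1 label in the first column, and it uses a depth-2 sklearn tree as the weak learner; sklearn's trees greedily minimize Gini impurity rather than weighted error, so the trees it selects may differ from those an exact weak-learner search would produce.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    # Assumed file name and format: comma-separated, label in column 0.
    data = np.loadtxt("heart_train.data", delimiter=",")
    X, y = data[:, 1:], 2 * data[:, 0] - 1      # map labels {0, 1} -> {-1, +1}

    w = np.full(len(y), 1.0 / len(y))           # uniform initial weights
    trees, rounds = [], []
    for t in range(5):                          # five rounds, as in part (a)
        h = DecisionTreeClassifier(max_depth=2).fit(X, y, sample_weight=w)
        pred = h.predict(X)
        eps = w[pred != y].sum()                # weighted training error
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))
        w *= np.exp(-alpha * y * pred)          # up-weight the mistakes
        w /= w.sum()
        trees.append(h)
        rounds.append((eps, alpha))

Extending the loop to 10 rounds and recording the sign of the accumulated weighted vote on the training and test sets at each iteration gives the two curves asked for in part (b).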
2. Now, suppose that the hypothesis space consists only of height-1 decision trees for this data set (only one attribute split).
(a) Use coordinate descent to minimize the exponential loss function for this hypothesis space over the training set. You can use any initialization and iteration order that you would like other than the one selected by AdaBoost. What is the optimal value of α that you arrived at? What is the corresponding value of the exponential loss on the training set? (A minimal coordinate-descent sketch appears after part (e).)
(b) What is the accuracy of the resulting classifier on the test data?
(c) What is the accuracy of AdaBoost after 20 rounds for this hypothesis space on the test data? How does the α learned by AdaBoost compare to the one learned by coordinate descent?
(d) Use bagging, with 20 bootstrap samples, to produce an average classifier for this data set. How does it compare to the previous classifiers in terms of accuracy on the test set? (A bagging sketch appears after part (e).)
(e) Which of these three methods should be preferred for this data set, and why?
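For part 2(a), a hedged coordinate-descent sketch in Python (names and iteration order illustrative only). It assumes the attributes are binary 0/1, so every non-constant height-1 tree reduces to a stump of the form "predict '+' iff attribute j equals v"; each coordinate update below is the exact minimizer of the exponential loss in that coordinate, holding the others fixed.

    import numpy as np

    data = np.loadtxt("heart_train.data", delimiter=",")   # assumed format
    X, y = data[:, 1:], 2 * data[:, 0] - 1

    # Enumerate the stumps: +1 iff attribute j equals v, for v in {0, 1}.
    H = np.array([np.where(X[:, j] == v, 1, -1)
                  for j in range(X.shape[1]) for v in (0, 1)])
    alpha = np.zeros(len(H))                    # initialize all weights at 0

    for sweep in range(50):                     # fixed cyclic order (not AdaBoost's)
        for j in range(len(H)):
            margin = y * (alpha @ H)            # current margins y_i * f(x_i)
            w = np.exp(-(margin - alpha[j] * y * H[j]))  # loss with stump j removed
            wp, wm = w[H[j] == y].sum(), w[H[j] != y].sum()
            if wp > 0 and wm > 0:
                alpha[j] = 0.5 * np.log(wp / wm)  # closed-form coordinate minimizer

    print("training exponential loss:", np.exp(-y * (alpha @ H)).sum())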
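For part 2(d), a minimal bagging sketch under the same file-format assumptions, with height-1 sklearn trees as the base learner and a majority vote over 20 bootstrap resamples:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    train = np.loadtxt("heart_train.data", delimiter=",")  # assumed file names
    test = np.loadtxt("heart_test.data", delimiter=",")
    Xtr, ytr = train[:, 1:], train[:, 0]
    Xte, yte = test[:, 1:], test[:, 0]

    rng = np.random.default_rng(0)
    votes = np.zeros(len(yte))
    for _ in range(20):                          # 20 bootstrap samples
        idx = rng.integers(0, len(ytr), len(ytr))  # draw with replacement
        h = DecisionTreeClassifier(max_depth=1).fit(Xtr[idx], ytr[idx])
        votes += 2 * h.predict(Xte) - 1          # accumulate +/-1 votes
    pred = (votes >= 0).astype(float)            # majority vote, ties -> class 1
    print("bagged test accuracy:", (pred == yte).mean())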