Description
You will find within it: boosting.py, boosting.sh, boosting check.py, boosting test.py, clas
sifier.py, data loader.py, decision stump.py, decision tree.py, decision tree.sh, deci
sion tree check.py, decision tree test.py, pegasos.py, pegasos.sh, requirements.txt
Download the data Please download mnist subset.json from Piazza/Resource/Homework. Please DO
NOT push it into your repository when you submit your results. Otherwise, you will get 20% deduction of
your score of this assignment.
High-level descriptions
Dataset We will use mnist subset (images of handwritten digits from 0 to 9). This is the same subset
of the full MNIST that we used for Homework 1 and Homework 2. As before, the dataset is stored in a
JSON-formated file mnist subset.json. You can access its training, validation, and test splits using the keys
‘train’, ‘valid’, and ‘test’, respectively. For example, suppose we load mnist subset.json to the variable x.
Then, x[
0
train0
] refers to the training set of mnist subset. This set is a list with two elements: x[
0
train0
][0]
containing the features of size N (samples) ×D (dimension of features), and x[
0
train0
][1] containing the
corresponding labels of size N.
Tasks
1. You will be asked to implement the linear support vector machine (SVM) for binary classification
(Sect. 1). Specifically, you will
• finish implementing the following three functions—objective function, pegasos train,
and pegasos test—in pegasos.py. Refer to pegasos.py and Sect. 1 for more information.
• run the script pegasos.sh after you finish your implementation. This will output pegasos.json.
• add, commit, and push (1) the pegasos.py, and (2) the pegasos.json file that you have
created.
2. You will be asked to implement the boosting algorithms with decision stump (Sect. 2). Specifically,
you will
• finish implementing the following classes — boosting, decision stump and AdaBoost. Refer to boosting.py, decision stump.py and Sect. 2 for more information.
• test and debug your code using the provided boosting check.py.
• run the script boosting.sh (which executes boosting test.py). This will output boost
ing.json.
• add, commit, and push (1) the boosting.py, (2) the decision stump.py, and (3) the boost
ing.json.
3. You will be asked to implement the decision tree classifier (Sect. 3). Specifically, you will
1
• finish implementing the following classes — DecisionTree, TreeNode. Refer to decision tree.py
and Sect. 3 for more information.
• test and debug your code using the provided decision tree check.py.
• run the script decision tree.sh (which executes decision tree test.py). This will output decision tree.json.
• add, commit, and push (1) the decision tree.py, (2) the decision tree.json
As in the previous homework, you are not responsible for loading/pre-processing data; we have done
that for you (e.g., we have pre-processed the data so it has two classes: digits 0–4 and digits 5–9.). For
specific instructions, please refer to text in Sect. 1 and 2 and the instructions in pegasos.py and boost
ing.py.
Cautions Please do not import packages that are not listed in the provided code. Follow the instructions
in each section strictly to code up your solutions. Do not change the output format. Do not modify the
code unless we instruct you to do so. A homework solution that does not match the provided setup,
such as format, name, initializations, etc., will not be graded. It is your responsibility to make sure that
your code runs with the provided commands and scripts on the VM. Finally, make sure that you git add,
commit, and push all the required files, including your code and generated output files.
2
Problem 1 Pegasos: a stochastic gradient based solver for linear SVM
In this question, you will build a linear SVM (i.e. without kernel) classifier using the Pegasos algorithm [1].
Given a training set {(xn ∈ RD, yn ∈ {1, −1})}
N
n=1
, the primal formulation of linear SVM can be written as
L2-regularized hinge loss as shown in the class:
min
w
λ
2
||w||2
2 +
1
N ∑n
max{0, 1 − ynwT
xn}. (1)
Note that here we include the bias term b into parameter w by appending x with 1, which is slightly different
from what was discussed in the class.
Instead of turning it into dual formulation, we are going to solve the primal formulation directly with
a gradient-base algorithm. Recall for (batch) gradient descent, at each iteration of parameter update, we
compute the gradients for all data points and take the average (or sum). When the training set is large
(i.e., N is a large number), this is often too computationally expensive. Stochastic gradient descent with
mini-batch alleviates this issue by computing the gradient on a subset of the data at each iteration.
Pegasos, a representative solver of eq. (1), is exactly applying stochastic gradient descent with minibatch to this problem, with an extra step of projection described below (this is also called projected stochastic gradient descent). The pseudocode of Pegasos is given in Algorithm 1. At the t-th iteration of parameter
update, we first take a mini-batch of data At of size K [step (3)], and define A
+
t ⊂ At to be the subset
of samples for which wt suffers a non-zero loss [step (4)]. Next we set the learning rate ηt = 1/(λt)
[step (5)]. We then perform a two-step parameter update as follows. We first compute (1 − ηtλ)wt
and for all samples (x, y) ∈ A
+
t we add the vector
yηt
K
x to (1 − ηtλ)wt
. We denote the resulting vector by
wt+ 1
2
[step (6)]. This step is in fact exactly stochastic gradient descent: wt+ 1
2
= wt − ηt∇t where
∇t = λwt −
1
K ∑
(x,y)∈A
+
t
yx
is the (sub)gradient of the objective function on the mini-batch At at wt
. Last, we set wt+1 to be the projection of wt+ 1
2
onto the set
B = {w : ||w||2 ≤ 1/√
λ}
to control its L2 norm. This can be obtained simply by scaling wt+1 by min (
1,
1/√
λ
||wt+ 1
2
||)
[step (e)].
For more details of Pegasos algorithm you may refer to the original paper [1].
Now you are going to implement Pegasos and train a binary linear SVM classifier on the dataset
mnist subset.json. You are going to implement three functions—objective function, pegasos train,
and pegasos test—in a script named pegasos.py. You will find detailed programming instructions in
the script.
1.1 Finish the implementation of the function objective function, which corresponds to the objective
function in the primal formulation of SVM.
1.2 Finish the implementation of the function pegasos train, which corresponds to Algorithm 1.
3
Algorithm 1 The Pegasos algorithm
Require: a training set S = {(xn ∈ RD, yn ∈ {1, −1})}
N
n=1
, the total number of iterations T, the batch size
K, and the regularization parameter λ.
Output: the last weight wT+1.
1: Initialization: w1 = 0
2: for t = 1, 2, · · · , T do
3: Randomly choose a subset of data At ∈ S (with replacement) such that |At
| = K
4: Set A
+
t = {(x, y) ∈ At
: ywT
t
xt < 1}
5: Set ηt =
1
λt
6: Set wt+ 1
2
= (1 − ηtλ)wt +
ηt
K
∑(x,y)∈A
+
t
yx
7: Set wt+1 = min (
1,
1/√
λ
||wt+ 1
2
||)
wt+ 1
2
8: end for
1.3 After you train your model, run your classifier on the test set and report the accuracy, which is defined
as:
# of correctly classified test samples
# of test samples
Finish the implementation of the function pegasos test.
1.4 After you complete above steps, run pegasos.sh, which will run the Pegasos algorithm for 500 iterations with 6 settings (mini-batch size K = 100 with different λ ∈ {0.01, 0.1, 1} and λ = 0.1 with different
K ∈ {1, 10, 1000}), and output a pegasos.json that records the test accuracy and the value of objective
function at each iteration during the training process.
What to do and submit: run script pegasos.sh. It will take mnist subset.json as input and generate
pegasos.json. Add, commit, and push both pegasos.py and pegasos.json before the due date.
1.5 Based on pegasos.json, you are encouraged to make plots of the objective function value versus the
number of iterations (i.e., a convergence curve) in different settings of λ and K, but you are not required to
submit these plots.
References
[1] Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, Andrew Cotter Mathematical Programming, 2011,
Volume 127, Number 1, Page 3. Pegasos: primal estimated sub-gradient solver for SVM.
4
Problem 2 Boosting
In the lectures, we discussed boosting algorithms, which construct a strong (binary) classifier based on iteratively adding one weak (binary) classifier into it. A weak classifier is learned to (approximately) maximize
the weighted training accuracy at the corresponding iteration, and the weight of each training example is
updated after every iteration.
In this question, you will implement decision stumps as weak classifiers and AdaBoost. For simplicity,
instead of implementing an actual base algorithm that learns decision stumps, we fix a finite set of decision
stumps H in advance and pick the one with smallest weighted error at each round of boosting.
2.1 General boosting framework A boosting algorithm sequentially learns βt and ht ∈ H for t = 0, . . . , T −
1 and finally constructs a strong classifier as H(x) = sign h
∑
T−1
t=0
βtht(x)
i
.
Please read the class “Boosting” defined in boosting.py, and then complete the class by implementing
the “TODO” part(s) as indicated in boosting.py.
2.2 Decision stump A decision stump h ∈ H is a classifier characterized by a triplet (s ∈ {+1, −1}, b ∈
R, d ∈ {0, 1, · · · , D − 1}) such that
h(s,b,d)
(x) = (
s, if xd > b,
−s, otherwise.
(2)
That is, each decision stump function only looks at a single dimension xd of the input vector x, and check
whether xd
is larger than b. Then s decides which label to predict if xd > b.
Please first read classifier.py and decision stump.py, and then complete the class “DecisionStump” by implementing the “TODO” part(s) as indicated in decision stump.py.
2.3 AdaBoost AdaBoost is a powerful and popular boosting method, summarized in Algorithm 2.
Algorithm 2 The AdaBoost algorithm
Require: H: A set of classifiers, where h ∈ H takes a D-dimensional vector as input and outputs either +1
or −1, a training set {(xn ∈ RD, yn ∈ {+1, −1})}
N−1
n=0
, the total number of iterations T.
Ensure: Learn H(x) = sign h
∑
T−1
t=0
βtht(x)
i
, where ht ∈ H and βt ∈ R.
1: Initialization D0(n) = 1
N
, ∀n ∈ {0, 1, · · · , N − 1}.
2: for t = 0, 1, · · · , T − 1 do
3: find ht = argminh∈H ∑n Dt(n)I[yn 6= h(xn)]
4: compute et = ∑n Dt(n)I[yn 6= ht(xn)]
5: compute βt =
1
2
log
1 − et
et
6: compute Dt+1(n) = (
Dt(n) exp(−βt), if yn = ht(xn)
Dt(n) exp(βt), otherwise
, ∀n ∈ {0, 1, · · · , N − 1}
7: normalize Dt+1(n) ←
Dt+1(n)
∑n
0 Dt+1(n
0)
, ∀n ∈ {0, 1, · · · , N − 1}
8: end for
Please read the class ”AdaBoost” defined in boosting.py, and then complete the class by implementing the ”TODO” part(s) as indicated in boosting.py.
2.4 Final submission Please run boosting.sh, which will generate boosting.json. Add, commit,
and push boosting.py, decision stump.py and boosting.json before the due date.
5
Problem 3 Decision Tree
In this question, you will implement a simple decision tree algorithm. For simplicity, only discrete features
are considered here.
3.1 Conditional Entropy Recall that the quality of a split can be measured by the conditional entropy.
Please read decision tree.py and complete the function “conditional entropy” (where we use base 2
for logarithm).
3.2 Tree construction The building of a decision tree involves growing and pruning. For simplicity, in
this programming set you are only asked to grow a tree.
As discussed a tree can be constructed by recursively splitting the nodes if needed, as summarized in
Algorithm 3 (whose interface is slightly different from the pseudocode discussed in the class). Please read
decision tree.py and complete the function “split”.
Algorithm 3 The recursive procedure of decision tree algorithm
1: function TREENODE.SPLIT(self)
2: find the best attribute to split the node
3: split the node into child nodes
4: for child node in child nodes do
5: if child node.splittable then . See Slide 27 of Lec 7 for the three “non-splittable” cases
6: child node.split()
7: end if
8: end for
9: end function
3.3 Final submission Please run decision tree.sh, which will generate decision tree.json. Add,
commit, and push decision tree.py, decision tree.json before the due date.
6