1 Theory
I: A “warm up” problem 5 points
Consider instances of X drawn from the uniform distribution D on [−1, 1]. Let f denote the actual
labeling function mapping each instance to its label y ∈ {−1, 1} with probabilities
Pr(y = 1 | x > 0) = 0.9
Pr(y = −1 | x > 0) = 0.1
Pr(y = 1 | x ≤ 0) = 0.1
Pr(y = −1 | x ≤ 0) = 0.9
The hypothesis h predicts the label for each instance as defined below:

h(x) = 1 if x > 0, and h(x) = −1 otherwise.

Measure the success of this predictor by calculating the error L_{D,f}(h) of h.
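As a reminder of the quantity in question, this is just the definition of the error unfolded by the law of total probability (the final number is for you to compute):

L_{D,f}(h) = Pr_{x∼D}[h(x) ≠ y]
           = Pr(x > 0) · Pr(y ≠ h(x) | x > 0) + Pr(x ≤ 0) · Pr(y ≠ h(x) | x ≤ 0).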
II: Bayes Optimal Predictor 5 points
Show that for every probability distribution D, the Bayes optimal predictor f_D is, in fact, optimal. That is, show that for every classifier g : X → {0, 1},

L_D(f_D) ≤ L_D(g).
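(For reference, assuming the definition used in class matches the standard one: the Bayes optimal predictor of a distribution D over X × {0, 1} is

f_D(x) = 1 if Pr[y = 1 | x] ≥ 1/2, and f_D(x) = 0 otherwise.)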
III: Perceptron with a Learning Rate 5 points
Let us modify the perceptron algorithm as follows:
In the update step, instead of setting w^(t+1) = w^(t) + y_i x_i every time there is a misclassification, we set

w^(t+1) = w^(t) + η y_i x_i

for some 0 < η < 1. Show that this modified perceptron will (a) perform the same number of iterations as the original, and (b) converge to a vector that points in the same direction as the output of the original perceptron.

IV: Unidentical Distributions 5 points

Let X be a domain and let D_1, D_2, …, D_m be a sequence of distributions over X. Let H be a finite class of binary classifiers over X and let f ∈ H. Suppose we get a sample S of m examples such that the instances are independent but not identically distributed: the i-th instance x_i is sampled from D_i, and then y_i is set to f(x_i). Let D̄_m denote the average distribution, i.e., D̄_m = (D_1 + … + D_m)/m. Fix an accuracy parameter ε ∈ (0, 1). Show that

Pr[ ∃ h ∈ H s.t. L_{(D̄_m, f)}(h) > ε and L_{(S, f)}(h) = 0 ] ≤ |H| e^{−εm}.
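(Hint, one possible route: for a single fixed h with L_{(D̄_m, f)}(h) > ε, independence gives Pr[L_{(S,f)}(h) = 0] = ∏_{i=1}^m (1 − L_{(D_i, f)}(h)); the standard inequality 1 − a ≤ e^{−a} turns this product into e^{−m·L_{(D̄_m, f)}(h)} ≤ e^{−εm}, and a union bound over H finishes the argument.)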
2 Programming
In this section, you will be working with a Breast Cancer Prediction dataset. Diagnosis of breast
cancer is performed when an abnormal lump is found (from self-examination or x-ray) or a tiny
speck of calcium is seen (on an x-ray). In this dataset, there are 569 observations, with each
observation consisting of 5 features (mean radius, mean texture, mean perimeter, mean area, mean
smoothness). The last column is the diagnosis – 0 indicates that the finding was benign, and 1
indicates that it was malignant.
I: Perceptron 25 points
The first task is to write your own code in Java or Python to implement the perceptron learning
algorithm. Your goal is to minimize the empirical risk (i.e., there is no need to do validation or structural risk minimization) for the perceptron to learn the diagnosis.
As you can tell, this is a binary classification task. For this task, submit your code for the perceptron
learning algorithm. This code should be able to read the dataset, and eventually provide the learned
weight vector as its final output.
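For concreteness, a minimal sketch of such a submission in Python (standard library only) might look like the following. The CSV layout (header row, five feature columns, 0/1 diagnosis in the last column) and the file-path argument are assumptions; adapt them to the actual dataset.

import csv
import sys

def load_data(path):
    # Prepend a bias coordinate of 1 and map the 0/1 diagnosis to -1/+1.
    X, y = [], []
    with open(path) as f:
        reader = csv.reader(f)
        next(reader)  # assumed header row; drop this line if there is none
        for row in reader:
            X.append([1.0] + [float(v) for v in row[:-1]])
            y.append(1 if row[-1].strip() == "1" else -1)
    return X, y

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def perceptron(X, y, max_iter=100000):
    # Classic perceptron: sweep the sample, update on every misclassification.
    w = [0.0] * len(X[0])
    for _ in range(max_iter):
        updated = False
        for xi, yi in zip(X, y):
            if yi * dot(w, xi) <= 0:  # misclassified (or on the boundary)
                w = [wj + yi * xj for wj, xj in zip(w, xi)]
                updated = True
        if not updated:  # no mistakes in a full pass: the data is separated
            break
    return w

if __name__ == "__main__":
    X, y = load_data(sys.argv[1])
    print(perceptron(X, y))  # the learned weight vector is the final output

Note that if the data is not linearly separable, this loop only stops because of max_iter, which is exactly the behavior the report asks you to discuss.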
II: A modified Perceptron algorithm 15 points
The second task is to modify the perceptron algorithm to exploit the observations that do not lead to weight updates. This modified perceptron algorithm does the following: for each weight
vector w, (i) it keeps count of the number of consecutive correctly classified observations until the
next update, and (ii) it always keeps the weight vector w that had the longest such streak. This
variation is called the pocket algorithm because it keeps the best solution seen so far “in the
pocket”, and returns this rather than the last solution (see Alg. 1).
Your submission for this task should be your code for the pocket algorithm. Just like the original
perceptron code, this code too should be able to read the dataset, and eventually provide the
learned weight vector as its final output. You may write a totally separate code for the pocket
algorithm, or you may choose to incorporate it as a ‘version’ of the perceptron algorithm. An
example of the latter approach could be something that looks like
$ python perceptron.py --version naive --dataset /path/to/data/filename.csv
$ python perceptron.py --version pocket --dataset /path/to/data/filename.csv
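Continuing the sketch from the previous task, the pocket bookkeeping of Alg. 1 could be grafted on as such a ‘version’ roughly as follows; eta and the cyclic pass order are choices rather than requirements, and dot() is the helper from the earlier sketch.

def pocket(X, y, eta=0.5, max_iter=100000):
    # Pocket perceptron (Alg. 1): remember the weight vector with the longest
    # streak of consecutive correct classifications seen so far.
    w = [0.0] * len(X[0])
    w_pocket = list(w)
    run = run_pocket = 0
    n = len(X)
    for t in range(max_iter):
        xi, yi = X[t % n], y[t % n]  # cycle through the sample
        s = dot(w, xi)
        if (s >= 0 and yi == -1) or (s < 0 and yi == 1):  # misclassified
            if run > run_pocket:  # the streak just ended; pocket this w
                w_pocket, run_pocket = list(w), run
            w = [wj + eta * yi * xj for wj, xj in zip(w, xi)]
            run = 0
        else:
            run += 1
    if run > run_pocket:  # the final streak may be the longest one
        w_pocket = list(w)
    return w_pocket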
III: Linear Regression for Classification 25 points
The third and final programming task is to use linear regression for classification. In class, we
saw how logistic regression has a nice probabilistic interpretation that can be used for binary
classification. Linear regression, however, does not.
What you are required to do, instead, is divide the dataset into two sub-datasets according to
whether the diagnosis is 0 or 1. Next, run linear regression on each of these two sub-datasets to
obtain two linear subspaces that fit the benign findings and the malignant findings, respectively.
Once you have obtained these two subspaces (both through ERM), the actual binary classification
will be done by computing the Euclidean distance of each observation from the two subspaces, and
assigning the label of the closer fit. That is, once you obtain the two vectors w0 (for benign) and
w1 (for malignant) as output of the two linear regressions, an observation will be classified as 0 or 1 depending on which of the two fitted subspaces it is closer to.
Your submission for this task should be your code for linear regression. This code should be able
to perform the whole task defined above, and not require the user to manually divide the dataset
and call the ‘usual’ linear regression code multiple times. For example, the user should be able to
do something like
$ python linearregressionclassifier.py --dataset /path/to/data/filename.csv
and obtain the classifier’s performance.
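One plausible shape for this pipeline is sketched below, using numpy only for the matrix algebra the Notes permit. Regressing the last feature on the other four within each class is one reading of “fitting a linear subspace”; the header row, comma delimiter, and command-line argument are likewise assumptions.

import sys
import numpy as np

def fit_hyperplane(X):
    # Least-squares (ERM) fit of the last feature as an affine function of
    # the rest; the fitted subspace is {x : x_d = w0 + w . x_(1..d-1)}.
    A = np.hstack([np.ones((X.shape[0], 1)), X[:, :-1]])
    coef, *_ = np.linalg.lstsq(A, X[:, -1], rcond=None)
    return coef[0], coef[1:]

def distance(x, w0, w):
    # Euclidean distance from x to the hyperplane w0 + w . x' - x_d = 0.
    return abs(w0 + w @ x[:-1] - x[-1]) / np.sqrt(w @ w + 1.0)

if __name__ == "__main__":
    data = np.loadtxt(sys.argv[1], delimiter=",", skiprows=1)  # header assumed
    X, y = data[:, :-1], data[:, -1]
    planes = {c: fit_hyperplane(X[y == c]) for c in (0, 1)}  # one fit per class
    preds = np.array([min((0, 1), key=lambda c: distance(x, *planes[c]))
                      for x in X])
    print("empirical risk:", np.mean(preds != y))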
IV: Final Report 15 points
Along with your code, you are also required to submit a short report (no more than 2 pages)¹. The report must contain the following:
• The performance of each model (i.e., the empirical risk).
• A brief discussion (at most a half-page) about what you observed regarding the runtime and
termination of the original perceptron algorithm, and what you did if/when the algorithm did
not stop after some reasonable amount of time. In this discussion, also investigate why the
algorithm did or did not stop and justify your reasoning.
• A brief discussion (at most a half-page) about what you observed regarding the runtime and
termination of the pocket algorithm, especially when compared to the original perceptron
algorithm.
• A brief discussion (at most a half-page) about using linear regression in this way and what
might be the potential benefits and/or drawbacks of the approach.
• A README section explaining how to run your code. Keep in mind that the grader will only run the code, and will not do any manual work (e.g., placing the dataset in the same folder as your code, or dividing the dataset into two), so make sure that your code runs “as is”!
The report will be graded based on (a) performance details, (b) replicability of experiments, (c)
explanation of how to run your code, and (d) the quality of the discussions about each algorithm.
Notes
What programming languages are allowed?
1. Java (JDK 1.8 or above), Python (3.x).
2. You are NOT allowed to use any data science and/or machine learning library for this assignment. If you have doubts about whether the use of a library is acceptable, please ask before
assuming that it can be used!
3. You can use libraries for the matrix multiplication and eigenvalue decomposition steps required in linear regression. Beyond that, though, the ERM code for linear regression must be
your own original code.
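For reference, the ERM solution for linear regression is the least-squares minimizer of ||Xw − y||², given by the normal equations

w = (XᵀX)⁻¹ Xᵀ y

(or, when XᵀX is singular, by the pseudo-inverse X⁺y); this is the computation your own code must carry out.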
What should we submit, and how?
Submit a single .zip archive containing (i) one folder simply called “code”, containing all
your code, (ii) a PDF document for your report, and (iii) a PDF document for the theory
part of your submission. Please do NOT handwrite the solutions and scan or take pictures.
For the PDF documents, use either LaTeX or MS Word to write the solutions, and export to
PDF. Anything else will not be graded.
¹ For the report, please stick to single line spacing.
Input:
    Training data: S = {(x_i, y_i)}_{i=1}^n, X = {1} × R^d, Y = {−1, 1}.
    Learning rate parameter: 0 < η < 1
    Maximum number of steps: maxiter
Initialization:
    w ← 0; w_pocket ← 0; run ← 0; run_pocket ← 0; t ← 0
while w has updated ∧ t < maxiter do
    draw the next labeled example (x, y) from S;
    t ← t + 1;
    if (⟨w, x⟩ ≥ 0 ∧ y = −1) ∨ (⟨w, x⟩ < 0 ∧ y = 1) then
        if run > run_pocket then
            w_pocket ← w;
            run_pocket ← run;
        end
        w ← w + η y x;
        run ← 0;
    else
        run ← run + 1;
    end
end
Output: w_pocket ∈ R^{d+1}.

Algorithm 1: Pocket Algorithm