## Description

1 Theory

I: A “warm up” problem 5 points

Consider instances of X drawn from the uniform distribution D on [−1, 1]. Let f denote the actual

labeling function mapping each instance to its label y ∈ {−1, 1} with probabilities

P r(y = 1|x > 0) = 0.9

P r(y = −1|x > 0) = 0.1

P r(y = 1|x ≤ 0) = 0.1

P r(y = −1|x ≤ 0) = 0.9

The hypothesis h predicts the label for each instance as defined below:

h(x) = (

1 if x > 0

0 otherwise

Measure the success of this predictor by calculating the training error for h.

II: Bayes Optimal Predictor 5 points

Show that for every probability distribution D, the Bayes optimal predictor fD is, in fact, optimal.

That is, show that for every classifier g : X → {0, 1},

LD(fD) ≤ LD(g).

III: Perceptron with a Learning Rate 5 points

Let us modify the perceptron algorithm as follows:

In the update step, instead of setting w(t+1) = w(t) + yixi every time there is a misclassification,

we set w(t+1) = w(t) + ηyixi

instead, for some 0 < η < 1. Show that this modified perceptron will
(a) perform the same number of iterations as the original, and
(b) converge to a vector that points to the same direction as the output of the original perceptron.
IV: Unidentical Distributions 5 points
Let X be a domain and let D1, D2, ..., Dm be a sequence of distributions over X . Let H be a finite
class of binary classifiers over X and let f ∈ H. Suppose we are getting a sample S of m examples
such that the instances are independent but not identically distributed, the i
th instance is sampled
from Di and then yi
is set to be f(xi). Let Dm denote the average, i.e., Dm = (D1+ . . . +Dm)/m.
1
CSE 353 Homework I Due by Mar 29, 2019
Fix an accuracy parameter ∈ (0, 1). Show that
P r h
∃ h ∈ H s.t. L(Dm,f)
(h) > and L(S,f)

(h) = 0i

≤ |H|e

−m

2 Programming

In this section, you will be working with a Breast Cancer Prediction dataset. Diagnosis of breast

cancer is performed when an abnormal lump is found (from self-examination or x-ray) or a tiny

speck of calcium is seen (on an x-ray). In this dataset, there are 569 observations, with each

observation consisting of 5 features (mean radius, mean texture, mean perimeter, mean area, mean

smoothness). The last column is the diagnosis – 0 indicates that the finding was benign, and 1

indicates that it was malignant.

I: Perceptron 25 points

The first task is to write your own code in Java or Python to implement the perceptron learning

algorithm. Your goal is minimize the empirical risk (i.e., no need to do validation or structural risk

minimization) for the perceptron to learn the diagnosis.

As you can tell, this is a binary classification task. For this task, submit your code for the perceptron

learning algorithm. This code should be able to read the dataset, and eventually provide the learned

weight vector as its final output.

II: A modified Perceptron algorithm 15 points

The second task is to modify the perceptron algorithm to exploit the observations that do not

lead to updated weights. This modified perceptron algorithm does the following: for each weight

vector w, (i) it keeps count of the number of consecutive correctly classified observations until the

next update, and (ii) it always keeps the weight vector w that had the longest such streak. This

variation is called the pocket algorithm because it keeps the best solution seen so far “in the

pocket”, and returns this rather than the last solution (see Alg. 1).

Your submission for this task should be your code for the pocket algorithm. Just like the original

perceptron code, this code too should be able to read the dataset, and eventually provide the

learned weight vector as its final output. You may write a totally separate code for the pocket

algorithm, or you may choose to incoroporate it as a ‘version’ of the perceptron algorithm. An

example of the latter approach could be something that looks like

$ python perceptron.py –version naive –dataset /path/to/data/filename.csv

$ python perceptron.py –version pocket –dataset /path/to/data/filename.csv

III: Linear Regression for Classification 25 points

The third and final programming task is to use linear regression for classification. In class, we

saw how logistic regression has a nice probabilistic interpretation that can be used for binary

classification. Linear regression, however, does not.

What you are required to do, instead, is divide the dataset into two sub-datasets according to

whether the diagnosis is 0 or 1. Next, run linear regression on each of these two sub-datasets to

obtain two linear subspaces that fit the benign findings and the malignant findings, respectively.

Once you have obtained these two subspaces (both through ERM), the actual binary classfication

will be done by computing the Euclidean distance of each observation from the two subspaces, and

assigning the label of the closer fit. That is, once you obtain the two vectors w0 (for benign) and

CSE 353 Homework I Due by Mar 29, 2019

w1 (for malignant) as output of the two linear regressions, an observation will be classified as 0 or

1 depending on which of these two weight vectors it is closer to.

Your submission for this task should be your code for linear regression. This code should be able

to perform the whole task defined above, and not require the user to manually divide the dataset

and call the ‘usual’ linear regression code multiple times. For example, the user should be able to

do something like

$ python linearregressionclassifier.py –dataset /path/to/data/filename.csv

and obtain the classifier’s performance.

IV: Final Report 15 points

Along with your code, you are also required to submit a short report (no more than 2 pages)1

. The

report must contain the following:

• The performance of each model (i.e., the empirical risk).

• A brief discussion (at most a half-page) about what you observed regarding the runtime and

termination of the original perceptron algorithm, and what you did if/when the algorithm did

not stop after some reasonable amount of time. In this discussion, also investigate why the

algorithm did or did not stop and justify your reasoning.

• A brief discussion (at most a half-page) about what you observed regarding the runtime and

termination of the pocket algorithm, especially when compared to the original perceptron

algorithm.

• A brief discussion (at most a half-page) about using linear regression in this way and what

might be the potential benefits and/or drawbacks of the approach.

• A README section explaining how to run your code. Keep in mind that the grader will only

run the code, and not do any manual work (e.g., placing the dataset in the same folder as

your code, or divide the dataset into two, etc. etc.) to make sure that your code runs “as is”!

The report will be graded based on (a) performance details, (b) replicability of experiments, (c)

explanation of how to run your code, and (d) the quality of the discussions about each algorithm.

Notes

What programming languages are allowed?

1. Java (JDK 1.8 or above), Python (3.x).

2. You are NOT allowed to use any data science and/or machine learning library for this assignment. If you have doubts about whether the use of a library is acceptable, please ask before

assuming that it can be used!

3. You can use libraries for the matrix multiplication and eigenvalue decomposition steps required in linear regression. Beyond that, though, the ERM code for linear regression must be

your own original code.

What should we submit, and how?

Submit a single .zip archive containing (i) one folder simply called “code”, containing all

your code, (ii) a PDF document for your report, and (iii) a PDF document for the theory

part of your submission. Please do NOT handwrite the solutions and scan or take pictures.

For the PDF documents, use either LATEXor MS Word to write the solutions, and export to

PDF. Anything else will not be graded.

1For the report, please stick to single line spacing.

CSE 353 Homework I Due by Mar 29, 2019

Input:

Training data: S = {(xi

, yi)}

n

i=1, X = 1 × R

d

, Y = {−1, 1}.

Learning rate parameter: 0 < η < 1
Maximum number of steps: maxiter
Initialization:
w ← 0
wpocket ← 0
run ← 0
runpocket ← 0
while w has updated ∧ run < maxiter do
draw the next labeled example (x, y) from S;
if (hw, xi ≥ 0 ∧ y = −1) ∨ (hw, xi < 0 ∧ y = 1) then
if run > runpocket then

wpocket ← w;

runpocket ← run;

end

w ← w + ηx;

run ← 0;

else

run ← run +1;

end

end

Output: wpocket ∈ R

d+1

.

Algorithm 1: Pocket Algorithm