CSE 353– Homework II

$30.00

Category: You will Instantly receive a download link for .zip solution file upon Payment

Description

5/5 - (2 votes)

1 Theory
I: Logistic regression and na¨ıve Bayes 12 points
Suppose in a binary classification problem, the input variable x is n-dimensional and the ouput is
a binary class label y ∈ Y = {0, 1}. In this situation, there is an interesting connection between
two learners: logistic regression and na¨ıve Bayes classifier.
(a) Write down the expressions for the class conditional probability for each class, i.e., P(y = 1|x)
and P(y = 0|x), for logistic regression. (2 points)
(b) Using Bayes’ rule, derive the posterior probabilities for each class, i.e., P(y = 1|x) and P(y =
0|x), for na¨ıve Bayes. (2 points)
(c) Assuming a Gaussian likelihood function in each of the n dimensions, write down the full
likelihood function f(x|d) for na¨ıve Bayes. (2 points)
(d) Assuming a uniform prior on the two classes and using the results from parts (b) and (c)
above, derive a full expression for P(y = 1|x) for na¨ıve Bayes. (3 points)
(e) Show that with appropriate manipulation and parameterization, P(y = 1|x) in na¨ıve Bayes
from part (d) is equivalent to P(y = 1|x) for logistic regression in part (a). (3 points)
II: Failure of Cross-Validation 4 points
Cross-validation works well in practice, but there are some pathological cases where it might fail.
Suppose that the label is chosen at random according to P[y = 1] = P[y = 0] = 1/2. Consider a
learning algorithm that outputs the constant predictor h(x) = 1 if the number of 1s in the training
set labels is odd, and h(x) = 0 otherwise. Prove that the difference between the leave-one-out
estimate and the true error in such a case is always 1/2.
III: Decision Tree 4 points
Show that any binary classifier h : [0, 1]d → {0, 1} can be implemented as a decision tree of height
at most d + 1, with internal nodes of the type “Is xi = 0?” for some i ∈ {1, 2, . . . , d}.
2 Experiment
I: Decision Tree 40 points
In this section, you will be working with a somewhat morbid dataset. This is a dataset of passengers
on the Titanic. Each row in this dataset has 12 columns, where the second column, “Survived”, is
what we would like to estimate using a machine learning algorithm.
1
CSE 353 Homework II Due by Apr 24, 2019
Some of the features are obvious from their names (i.e., column headers), the others are:
• pclass – ticket class (1st, 2nd, or 3rd).
• sibsp – the number of siblings/spouses aboard the Titanic.
• parch – the number of parents/children aboard the Titanic (some children traveled with a
nanny, so parch = 0 for them).
• ticket and cabin are just the ticket number and the cabin number, respectively.
• age – age in years (fractional age indicates that age was less than 1 year).
• embarked – three different ports of embarkation were (S)outhampton, (Q)ueenstown, and
(C)herbourg.
Your task is write your own code in Java or Python to implement the decision tree algorithm we
learnt in class (ID3). Instead of training and testing on the same data, however, you must split
the data in a 60-40 ratio to train on the 60% sub-dataset and test on the remaining 40%. You
must also calculate the accuracy of prediction on the training data and the test data, and plot the
accuracy numbers as the tree depth increases. This should lead to a plot that looks like the one in
the slide on overfitting when we covered “Decision Trees” in class.
Note that in this code, you will have to
• handle continuous variables, and
• avoid overfitting
Your submission for this task should be your code for learning and testing the decision tree. This
code should be able to perform the training and testing tasks separately, and the user should be
simply able to train on the 60% sub-dataset by doing something like
$ python id3.py –dataset /path/to/data/filename.csv –train
and test on the remaining 40% by
$ python id3.py –dataset /path/to/data/filename.csv –test
You may assume that the number 60 and 40 are fixed, and the user will not change them around.
II: Final Report 20 points
Along with your code, you are also required to submit a short report (no more than 1 page)1
. The
report must contain the following:
• The accuracy plot of the decision tree model (the tree depth and accuracy on the x- and
y-axes, respectively. The plot must be clear enough the reader to distinguish between, say,
0.85 and 0.9. Otherwise, include a table with the accuracy numbers in addition to the plot.
• A brief discussion (at most a half-page) about what you observed regarding (i) the depth at
which the tree started overfitting, (ii) the most discriminatory features, and (iii) the least
discriminatory features.
• Also include a brief discussion about whether some features that you expected to be completely
useless (e.g., ticket or cabin number) surprised you by exhibiting predictive power (or vice
versa . . . i.e., some features that you thought would be very predictive actually turned out to
be useless).
1For the report, please stick to single line spacing.
CSE 353 Homework II Due by Apr 24, 2019
• A README section explaining how to run your code. Keep in mind that the grader will
only run the code, and not do any manual work (e.g., placing the dataset in the same folder,
etc.). So, make sure that your code runs “as is” once the path to the entire dataset2 has been
specified.
The report will be graded based on (a) performance details, (b) replicability of experiments, (c)
explanation of how to run your code, and (d) the quality of the discussions about each algorithm.
Notes
What programming languages are allowed?
1. Java (JDK 1.8 or above, but don’t use Java 11), Python (3.x).
2. You are NOT allowed to use any data science and/or machine learning library for this assignment. If you have doubts about whether the use of a library is acceptable, please ask before
assuming that it can be used!
3. Calculations for entropy, information gain, etc. must be your own code.
4. You are NOT allowed to use any libraries that provide decision tree data structures off-theshelf.
5. For the accuracy plot, you may use any library you want as long as the use of that library
(e.g., pandas) is strictly restricted to the generation of the plot as an image file, and has
nothing to with the rest of your code.
What should we submit, and how?
Submit a single .zip archive containing (i) one folder simply called “code”, containing all
your code, (ii) a PDF document for your report, and (iii) a PDF document for the theory
part of your submission. Please do NOT handwrite the solutions and scan or take pictures.
For the PDF documents, use either LATEXor MS Word to write the solutions, and export to
PDF. Anything else will not be graded.
2Remember that just like the previous assignment, the grader will also not perform the 60-40 split in any way;
this should be done by your code.