## Description

Background

In this miniproject you will implement two classification techniques—logistic regression and naive Bayes—and compare these two algorithms on four distinct datasets. The goal is to gain experience implementing these algorithms from

scratch and to get hands-on experience comparing performance of different models.

Task 1: Acquire, preprocess, and analyze the data

Your first task is to acquire the data, analyze it, and clean it (if necessary). We will use two fixed and two open datasets

in this project, outlined below.

• Dataset 1 (Ionosphere): this is a dataset where the goal is to predict whether a radar return from ionosphere is

‘good’ or ‘bad’. This radar data was collected by a system in Goose Bay, Labrador. Get it from:

https://archive.ics.uci.edu/ml/datasets/ionosphere

• Dataset 2 (Adult Data Set): also known as “Census Income” dataset, this is a dataset where the goal is to

predict the whether income exceeds $50K/yr based on census data. Get it from:

https://archive.ics.uci.edu/ml/datasets/Adult

• Dataset 3 and 4: pick two additional datasets for classification from the UCI datasets at:

https://archive.ics.uci.edu/ml/datasets.php

The essential subtasks for this part of the project are:

1

1. Download the datasets (noting the correct subsets to use, as discussed above).

2. Load the datasets into NumPy objects (i.e., arrays or matrices) in Python. Remember to convert the wine dataset

to a binary task, as discussed above.

3. Clean the data. Are there any missing or malformed features? Are there are other data oddities that need to be

dealt with? You should remove any examples with missing or malformed features and note this in your

report. For categorical variables you can use one-hot encoding.

4. Compute basic statistics on the data to understand it better. E.g., what are the distributions of the positive vs.

negative classes, what are the distributions of some of the numerical features? what are the correlations between

the features? how does the scatter plots of pair-wise features look-like for some subset of features?

Task 2: Implement the models

You are free to implement these models as you see fit, but you should follow the equations that are presented in

the lecture slides, and you must implement the models from scratch (i.e., you cannot use SciKit Learn or any other

pre-existing implementations of these methods).

In particular, your two main tasks in the part are to:

1. Implement logistic regression, and use (full batch) gradient descent for optimization.

2. Implement naive Bayes, using the appropriate type of likelihood for features.

You are free to implement these models in any way you want, but you must use Python and you must implement

the models from scratch (i.e., you cannot use SciKit Learn or similar libraries). Using the numpy package,

however, is allowed and encouraged. Regarding the implementation, we recommend the following approach (but

again, you are free to do what you want):

• Implement both models as Python classes. You should use the constructor for the class to initialize the model

parameters as attributes, as well as to define other important properties of the model.

• Each of your models classes should have (at least) two functions:

– Define a fit function, which takes the training data (i.e., X and y)—as well as other hyperparameters (e.g.,

the learning rate and/or number of gradient descent iterations)—as input. This function should train your

model by modifying the model parameters.

– Define a predict function, which takes a set of input points (i.e., X) as input and outputs predictions (i.e.,

yˆ) for these points. Note that you need to convert probabilities to binary 0-1 predictions by thresholding the

output at 0.5!

• In addition to the model classes, you should also define a functions evaluate_acc to evaluate the model accuracy.

This function should take the true labels (i.e., y), and target labels (i.e., yˆ) as input, and it should output the accuracy

score.

• Lastly, you should implement a script to run k-fold cross validation.

Task 3: Run experiments

The goal of this project is to have you explore linear classification and compare different features and models. Use

5-fold cross validation to estimate performance in all of the experiments. Evaluate the performance using accuracy.

You are welcome to perform any experiments and analyses you see fit (e.g., to compare different features), but at a

minimum you must complete the following experiments in the order stated below:

1. Compare the accuracy of naive Bayes and logistic regression on the four datasets.

2

2. Test different learning rates for gradient descent applied to logistic regression. Use a threshold for change in the

value of the cost function as termination criteria, and plot the accuracy on train/validation set as a function of

iterations of gradient descent.

3. Compare the accuracy of the two models as a function of the size of dataset (by controlling the training size).

As an example, see Figure 1 here 1

.

Note: The above experiments are the minimum requirements that you must complete; however, this project

is open-ended. For example, you might investigate different stopping criteria for the gradient descent in logistic

regression, develop an automated approach to select a good subset of features. You do not need to do all of these

things, but you should demonstrate creativity, rigour, and an understanding of the course material in how you run your

chosen experiments and how you report on them in your write-up.

Deliverables

You must submit two separate files to MyCourses (using the exact filenames and file types outlined below):

1. code.zip: Your data processing, classification and evaluation code (as some combination of .py and .ipynb files).

2. writeup.pdf: Your (max 5-page) project write-up as a pdf (details below).

Project write-up

Your team must submit a project write-up that is a maximum of five pages (single-spaced, 11pt font or larger; minimum

0.5 inch margins, an extra page for references/bibliographical content can be used). We highly recommend that students use LaTeX to complete their write-ups. This first mini-project report has relatively strict requirements, but

as the course progresses your project write-ups will become more and more open-ended. You have some flexibility in how you report your results, but you must adhere to the following structure and minimum requirements:

Abstract (100-250 words) Summarize the project task and your most important findings. For example, include

sentences like “In this project we investigated the performance of linear classification models on two benchmark

datasets”, “We found that the logistic regression approach was achieved worse/better accuracy than naive Bayes and

was significantly faster/slower to train.”

Introduction (5+ sentences) Summarize the project task, the two datasest, and your most important findings. This

should be similar to the abstract but more detailed. You should include background information and citations to

relevant work (e.g., other papers analyzing these datasets).

Datasets (5+ sentences) Very briefly describe the and how you processed them. Describe the new features you come

up with in detail. Present the exploratory analysis you have done to understand the data, e.g. class distribution.

Results (7+ sentences, possibly with figures or tables) Describe the results of all the experiments mentioned in

Task 3 (at a minimum) as well as any other interesting results you find. At a minimum you must report:

1. A discussion of how the logistic regression performance (e.g., convergence speed) depends on the learning rate.

(Note: a figure would be an ideal way to report these results).

2. A comparison of the accuracy of naive Bayes and logistic regression on both datasets.

3. Results demonstrating that the feature subset and/or new features you used improved performance.

1Ng AY, Jordan MI. On discriminative vs. generative classifiers: A comparison of logistic regression and naive bayes. In Advances in neural

information processing systems 2002 (pp. 841-848).

3

Discussion and Conclusion (5+ sentences) Summarize the key takeaways from the project and possibly directions

for future investigation.

Statement of Contributions (1-3 sentences) State the breakdown of the workload across the team members.

Evaluation

The mini-project is out of 100 points, and the evaluation breakdown is as follows:

• Completeness (20 points)

– Did you submit all the materials?

– Did you run all the required experiments?

– Did you follow the guidelines for the project write-up?

• Correctness (40 points)

– Are your models implemented correctly?

– Are your reported accuracies close to the reference solutions?

– Do your proposed features actually improve performance, or do you adequately demonstrate that it was

not possible to improve performance?

– Do you observe the correct trends in the experiments (e.g., comparing learning rates)?

• Writing quality (25 points)

– Is your report clear and free of grammatical errors and typos?

– Did you go beyond the bare minimum requirements for the write-up (e.g., by including a discussion of

related work in the introduction)?

– Do you effectively present numerical results (e.g., via tables or figures)?

• Originality / creativity (15 points)

– Did you go beyond the bare minimum requirements for the experiments? For example, you could investigate different stopping criteria for logistic regression, investigate which features are the most useful (e.g.,

using correlation metrics), or propose an automated approach to select a good subset of features.

– Note: Simply adding in a random new experiment will not guarantee a high grade on this section! You

should be thoughtful and organized in your report.

Final remarks

You are expected to display initiative, creativity, scientific rigour, critical thinking, and good communication skills.

You don’t need to restrict yourself to the requirements listed above – feel free to go beyond, and explore further.

You can discuss methods and technical issues with members of other teams, but you cannot share any code or data

with other teams.

4