Description
Questions
1. Given n boolean variables (x1, …, xn), we define our target classification function f(·)
as f(·) = 1 if at least 3 variables are active. For n=5 show how this function can be
represented as (1) Boolean function and a (2) Linear function.
2. Let CONB be the set of all different monotone conjunctions defined over n boolean
variables. What is the size of CONB?
3. Let CONL be the set of all linear functions defined over n boolean variables that are
consistent with the functions in CONB . We say that two functions fb ∈ CONB and
fl ∈ CONL are consistent if ∀x; fb(x) = fl(x). What is the size of CONL?
1
2
4. Define in one sentence: Mistake bound (your answer should include all the components
described in class).
5. Suggest a mistake bound algorithm for learning Boolean conjunctions (hint: recall
the elimination algorithm for monotone conjunctions). Show that your algorithm is a
mistake bound algorithm for Boolean conjunctions.
6. Given a linearly separable dataset consisting of 1000 positive examples and 1000 negative examples, we train two linear classifiers using the perceptron algorithm. We
provide the first classifier with a sorted dataset in which all the positive examples
appear first, and then the negative examples appear. The second classifier is trained
by randomly selecting examples at each training iteration. (1) Will both classifiers
converge? (2) what will be the training error of each one of the classifiers?
7. We showed in class that using the Boolean kernel function K(x, y) = 2same(x,y) we can
express all conjunctions over the attributes of x and y (with both the positive and
negative literals). This representation can be too expressive. Suggest a kernel function
K(x, y) that can express all conjunctions of size at most k (where k is a parameter,
k < n).
Programming Assignment
In this part of the assignment you are required to implement the perceptron and logistic
function and observe their performance on a binary sentiment classification task.
1. Dataset/Task The task will be a binary sentiment classification task where your models will take as input a document of text and then produce a binary output saying it’s
either positive (1) or negative (0). The dataset is essentially a table of two rows where
the first row represents some text and the second row corresponds to its sentiment.
Ideally, you want your trained pmodel to accurately differentiate between positive and
negative sentiment by looking at words that are used in the text.
You will be given a single csv file that contains all the data corresponding to this task
called data.csv. One component of this task is to split the data into a train and test
data sets. The splits are determined by you. It is required that you use 5-fold cross
validation when training your models. You should not use the test set in picking the
optimal hyperparameter values. Simply report the performance of your models with
most optimal hyperparameter values on the test set after observing the performance
on the training datasets.
To program the splitting of the dataset, take a look at the function split dataset in
Execution.py. You will have to program how the original dataset will be split into
train and test datasets. It is imperative that you maintain the input and outputs of
the function as described in the program file. Additionally, each one of the dataset
outputs must have column called ‘Label’ which has either a 0 or 1 to label negative
or positive sentiment respectfully. For example, look at Fig. 1 below to see a sample
format. It’s important to keep this format because when your programs will be graded,
3
Figure 1: Sample pandas dataframe that contains the column ‘Label’ which represents the
sentiment for each row in column ‘Text’.
the datasets will be passed in with this format and the ground-truth labels for every
dataset will be extracted by simply calling the column ‘Label’ (look at Execution.py
for more information).
2. Feature Selection An important task within this homework is to find an effective
way to represent text into a meaningful quantitative representation to be used by the
perceptron. One example way is to use a representation called bag-of-words. This
representation simply represents a piece of text by the frequency of the words that
appear. For example, take the two sentences below:
• John likes to play basketball.
• Mary prefers to play soccer.
Now, we build a vocabulary of words. So, our vocabulary would be the following set:
{John, M ary, likes, prefer, to, play, soccer, basketball}. Since we have 8 distinct words
in our vocabulary, we can use 8-dimensional vectors where each dimension corresponds
to a word and the value in each dimension represents the frequency. For example,
• (1, 0, 1, 0, 1, 1, 0, 1)
• (0, 1, 0, 1, 1, 1, 1, 0).
If we take the first vector, we see that there is a 1 in the 1st, 3rd, 5th, 6th, and 8th
dimensions which corresponds to 1 word of each John, likes, to, play, and basketball.
This is one such representation to translate text into a vector representation.
While bag-of-words is a pretty simple approach, there are several drawbacks such that
it doesn’t preserve sentence structure and syntax that could be useful as features in
a model. In fact, if the features extracted from the text are not meaningful, then the
performance of the machine learning model can suffer greatly. Therefore, it is highly
important to construct and preserve relevant features that will be used as input to
the perceptron model. For example, another widely used technique is called n-gram
modeling which you can learn more from reading section 3.1 in the following pdf:
https://web.stanford.edu/~jurafsky/slp3/3.pdf.
4
You are encouraged to explore and try out several different feature engineering processes and find the one that maximizes the performance of your model. It is highly
encouraged if you want to pursue other feature processing techniques that are not listed
as examples here.
3. Implementation In this assignment, you will be implementing two different models
and comparing their performance on the sentiment classification task. The program
Perceptron.py is where you will be implementing the perceptron algorithm. The program Logistic.py is where you will be implementing and training the logistic function.
Both Perceptron.py and Logistic.py are setup to do the following things:
• Train the given model on the training data.
• Predict labels on some test set.
You will have to train and implement both models separately using the programs
listed above. Program the perceptron and its training algorithm on the binary sentiment classification task. The perceptron contains two hyperparameters that you must
experiment with: the learning rate and maximum number of iterations. The learning
rate controls how large each update affects the weight vector, and maximum iterations
controls how many examples the training algorithm cycles through. For the logistic
function, you will train it via stochastic gradient descent using logistic loss (same as
binary cross-entropy). You will have two hyperparameters when training the logistic
function: the learning rate and max number of epochs. The latter hyperparameter
controls how many epochs stochastic gradient descent should run for before terminating.
You have the freedom to modify most of the starter code and implement it how you
feel is optimal. However, there are certain requirements, which are essentially the same
for both programs, that you must follow that are detailed explicitly in the comments
of the starter code. As a high level overview, you may not change the inputs to the
init function and the input parameters and return statements of either train() or
predict() may not change. You can customize what code to place inside these functions
but these requirements must hold. The reason for this is to streamline the process of
evaluating your programs for test time on datasets that your model hasn’t seen before.
It is highly important that the features you use are consistent for both the perceptron
and the logistic function to yield a fair comparison.
The starter code and the comments contain explicit instructions on the requirements
and what pieces of the program you can modify.
4. Evaluation There is another program called Execution.py that is given to you to
evaluate the performance of your model on the datasets. This program creates both
models, trains them on the train data set, produces predicted labels, and evaluates
their accuracy compared to the ground-truth.
In this program, there is one function which you must program which is split dataset.
The function split dataset takes in the full csv dataset file and you must specify how
to split it into train and test sets.
5
The primary way you will be evaluating the performance of both your models will be
through classification accuracy which is already implemented in Execution.py. When
you run Execution.py, it will produce these accuracies on the train and test sets. Please
look at the code for further information.
You can simply run Execution.py from the command line in the following way:
python3 Execution.py
It is important to note that the python command assumes that a python3 interpreter is
being used. Additionally, it is important that Execution.py, Perceptron.py, Logistic.py,
and the .csv files stay in the same folder because Execution.py will be load the data
files assuming it’s in the same directory.
It is important to not alter the code for specific sections in this program because
these exact functions will be used to evaluate your code and model. More detailed
information is contained in the comments of the starter code.
5. Report You will need to submit a report detailing the results of your algorithm. Please
have the following sections labeled in the order specified. Below are the following
requirements for the report:
(a) Explain how you split the dataset and why you selected those splits for the training
and test sets.
(b) Explain your feature engineering processes, the different strategies you used, and
which method you ultimately used. Also, give an explanation and a simple example of how your strategy worked in processing the raw dataset. Explain how
you handled unseen features.
(c) It’s important to show how you selected these hyperparameter values. For each
model, plot the values of learning rate (x-axis) vs. accuracy (y-axis) for a fixed
number of max iterations (or number of max epochs for the logistic function).
Since you will be conducting 5-fold cross validation, average the accuracy across
the 5 folds and plot that across different learning rates you have picked. You
should have multiple of such plots for different number of max iterations (or
number of max epochs for the logistic function).
(d) Report the optimal values for the hyperparameters and explain how you arrived
at those values. Report the accuracy of your perceptron and logistic function
on the test set with those selected hyperparameter values. Finally, compare and
evaluate the training and performance of both models against each other.