Description

5/5 - (4 votes)

1. (28 points) Text Classification
In this problem, we will use the Na¨ıve Bayes algorithm for text classification. The dataset for this problem
is a subset of the Yelp dataset and has been obtained from this website. Given a users review, task is to
predict the stars given by the reviewer. Read the website for more details about the dataset. You have
been provided with separate training and test files containing 534K reviews (samples) and 133K reviews
respectively. Data is available at this link. A review comes from one of the five categories (class label).
Here, class label represents stars given by the user along with the review. Please refer to README in data
directory for more details.
(a) (10 points) Implement the Na¨ıve Bayes algorithm to classify each of the articles into one of the given
categories. Report the accuracy over the training as well as the test set. In the remaining parts below,
we will only worry about test accuracy.
Notes:
• Make sure to use the Laplace smoothing for Na¨ıve Bayes (as discussed in class) to avoid any zero
probabilities. Use c = 1.
1
• You should implement your algorithm using logarithms to avoid underflow issues.
• You should implement Na¨ıve Bayes from the first principles and not use any existing Matlab/Python modules.
(b) (2 points) What is the test set accuracy that you would obtain by randomly guessing one of the
categories as the target class for each of the review (random prediction). What accuracy would you
obtain if you simply predicted the class which occurs most of the times in the training data (majority
prediction)? How much improvement does your algorithm give over the random/majority baseline?
(c) (3 points) Read about the confusion matrix. Draw the confusion matrix for your results in the part
(a) above (for the test data only). Which category has the highest value of the diagonal entry? What
does that mean? What other observations can you draw from the confusion matrix? Include the
confusion matrix in your submission and explain your observations.
(d) (4 points) The dataset provided to is in the raw format i.e., it has all the words appearing in
the original set of articles. This includes words such as ‘of’, ‘the’, ‘and’ etc. (called stopwords).
Presumably, these words may not be relevant for classification. In fact, their presence can sometimes
hurt the performance of the classifier by introducing noise in the data. Similarly, the raw data treats
different forms of the same word separately, e.g., ‘eating’ and ‘eat’ would be treated as separate words.
Merging such variations into a single word is called stemming.
• Read about stopword removal and stemming (for text classification) online.
• Use the script provided with the data to you to perform stemming and remove the stop-words in
the training as well as the test data. You are free to use other tools as well.
• Learn a new model on the transformed data. Again, report the accuracy.
• How does your accuracy change over test set? Comment on your observations.
(e) (5 points) Feature engineering is an essential component of Machine Learning. It refers to the
process of manipulating existing features/constructing new features in order to help improve the
overall accuracy on the prediction task. For example, instead of using each word as a feature, you may
treat bi-grams (two consecutive words) as a feature. Come up with at least two alternative features
and learn a new model based on those features. Add them on top of your model obtained in part
(d) above. Compare with the test set accuracy that you obtained in parts (a) and parts (d). Which
features help you improve the overall accuracy? Comment on your observations.
(f) (2 points) Receiver Operating Characteristic (ROC) metric is generally used to evaluate classifier
output quality. ROC curves typically feature true positive rate on the Y axis, and false positive
rate on the X axis. ROC curves are typically used in binary classification to study the output of a
classifier. In order to extend ROC curve and ROC area to multi-label classification, it is necessary
to binarize the output. One ROC curve can be drawn per label, but one can also draw a ROC curve
by considering each element of the label indicator matrix as a binary prediction (micro-averaging).
Another evaluation measure for multi-label classification is macro-averaging, which gives equal weight
to the classification of each label. Read more about ROC curves from this link. Plot the suitable ROC
curve as per the problem requirement and comment on your observation. You can use any available
libraries of your choice for the same.
2. (30 points) Fashion MNIST Article Classification
In this problem, we will use Support Vector Machines (SVMs) to build an image classifier, to classify
whether an image contains a sneaker, shirt, bag, etc. We will be solving the SVM optimization problem
using a general purpose convex optimization package as well as using a customized solver from sklearn.
The dataset that we would be using can be downloaded from the course website. Note that we have given
you a subset of the original Fashion MNIST Dataset. You can read more about the original dataset from
this link. Every column except the last represents a feature where the feature value denotes the grayscale
value (0-255) of the corresponding pixel in the image. The last column gives the corresponding label. Since
the features represent grayscale values, we may not want to normalize the data to zero mean and unit
variance as described in the class. But for this problem, you may find it helpful to simply scale all the
values to the range [0, 1] (down from [0 255]).
(a) Binary Classification:
Let d be the last digit of your entry number. Then, we would start with binary classification problem
over the images of article classes (d vs (d + 1) mod 10) in this Section. In particular, you should take
the subset of images for the article classes d and (d + 1) mod 10 from the train/validation/test data
2
provided to you and perform the following experiments 1
. The article classes and their corresponding
labels can be found here.
i. (8 points) Download and install the CVXOPT package. Express the SVM dual problem (with
a linear kernel) in the a form that the CVXOPT package can take. You will have to think about
how to express the SVM dual objective in the form α
T P α + q
T α + c matrix where P is an m × m
matrix (m being the number of training examples), q is an m-sized column vector and c is a
constant. For your optimization problem, remember to use the constraints on αi
’s in the dual.
Use the SVM formulation which can handle noise and use C = 1.0 (i.e. C in the expression
1
2w
T w + C ∗
P
i
ξi). You can refer this link to get a working overview of cvxopt module and it’s
formulation. Obtain the set of support vectors from your optimization for the binary classification
problem. Furthermore, calculate the weight vector w and the intercept term b and classify each
of the examples in the validation and test file into one of the two labels. Report the validation
and test set accuracy obtained. You will need to carefully think about how to represent w and b
in this case.
ii. (6 points) Use CVXOPT package to solve the dual SVM problem using a Gaussian kernel. Think
about how the P matrix will be represented. What are the set of support vectors in this case?
Note that you may not be able to explicitly store the weight vector (w) or the intercept term (b)
in this case. Use your learned model to classify the validation and test examples and report the
accuracies obtained. Use C = 1.0 and γ = 0.05 (i.e. γ in K(x, z) = exp−γ∗kx−zk
2
) for this part.
How do these compare with the ones obtained with in the linear kernel?
(b) Multi-Class Classification:
In this section, we will work with the entire subset of the data provided to you focusing on a multi-class
classification problem. We will work with a Gaussian kernel for this section.
i. (4 points) In class, we described the SVM formulation for a binary classification problem. In
order to extend this to the multi-class setting, we train a model on each pair of classes to get
k
2

classifiers, k being the number of classes. During prediction time, we output the class which has
the maximum number of votes from all the
k
2

classifiers. You can read more about one-vs-one
classifier setting at the following link. Using your solver from previous section, implement one-vsone multi-class SVM. Use a Gaussian Kernel with C = 1.0 and γ = 0.05. Classify given dataset
and report validation and test set accuracy. In case of ties, choose the label with the highest score.
ii. (4 points) Now train a multi-class SVM on this dataset using the Scikit SVM library . Repeat
part (a) using a a Gaussian kernel with γ = 0.05. Use C = 1.0 as earlier. Report the validation as
well as test set accuracies. How do your results compare with those obtained in part (a) above. As
earlier, compare the computational cost (training time) of the two implementations? Comment.
iii. (4 points) Draw the confusion matrix as done in the first Part (Naive Bayes) of this assignment
for both of the above parts (2(a) and 2(b)). What do you observe? Which articles are mis-classified
into which ones most often? Do the results make sense?
iv. (4 points) Validation set is typically used to estimate the best value of the model parameters
(e.g., C in our problem with linear kernel) by randomly selecting small subset of data as validation
set, training on train set (minus the validation set) and making predictions on validation set. For a
detailed introduction, you can refer to this video. You can check the correctness of your intuition
by trying this test. K-fold cross validation is another such techniques in this regard that we use in
practice. In this technique we divide our training data into K-folds or parts and then treat each
part as our validation set once and train on the remaining K-1 parts. You can read more about
cross validation here 2
(see Section 1). for more details. This process is repeated for a range of
model parameter values and the parameters which give best K-fold cross validation accuracy are
reported as the best parameters. We will use Scikit for this part.
For this problem, we will do a 5-fold cross validation to estimate the best value of the C parameter
for the Gaussian kernel case. Test data should not be touched. Fix γ as 0.05 and vary the value
of C in the set {10−5
, 10−3
, 1, 5, 10} and compute the 5-fold cross validation accuracy for each
value of C. Also, compute the corresponding accuracy on the test set. Now, plot both the 5-fold
cross validation accuracy as well as the test set accuracy on a graph as you vary the value of C on
x-axis (you may use log scale on x-axis). What do you observe? Which value of C gives the best
1Note that different students will be experimenting with different class pairs depending on their entry number
2These are from Andrew Ng notes posted on the course website, and the link is available only from the internal IIT Delhi network.
Feel free to read additional material online about this topic
5-fold cross validation accuracy? Does this value of the C also give the best test set accuracy?
Comment on your observations.
3. (12 points) Large scale text classification
Machine learning algorithms, loss functions and optimizer have their advantages and disadvantages in terms
of accuracy and training time. For this exercise, we’ll use the same Yelp dataset as used in Q1.
Notes:
• You are free to use the the scikit-learn implementation of the algorithms for this question.
• You should use a validation set instead of K-fold cross validation for this question.
• The complexity of the algorithm depends on the number of features as well. Please keep that in mind
while deciding on the vocabulary/features.
(a) (6 points) Report and compare the training time and accuracy of the Na¨ıve Bayes and SVM algorithm
(with LIBLINEAR). Comment on the accuracy and training time of both algorithms.
(b) (6 points) SVM objective can also be optimized using the SGD algorithm. Report the training time
and accuracy on the given dataset. How does the SGD solver fair against LIBLINEAR?
(c) [Extra Fun – no credits!]
• You can take a look at full version of Yelp dataset and compare your algorithms at even larger
scale. Please note that training algorithms at such scale can be expensive in terms of computation
and memory.
• Multi-class problem can also be tackled using one-vs-rest strategy. Compare one-vs-rest and onevs-one in terms of training time and accuracy. You can read more about Multiclass SVMs in Sec.
7.1.3 of PRML book [by Christopher Bishop].
4

COL 774: Assignment 2

Description

Related products

COL 774: Assignment 3

COL 774: Machine Learning. Assignment 1