Description
Problem 1 (Features; 10 points). It is common to pre-process the feature vectors in $\mathbb{R}^d$ before passing them to a learning algorithm. Two simple and generic ways to pre-process are as follows.
• Centering: Subtract the mean $\hat{\mu} := \frac{1}{|S|} \sum_{(x,y) \in S} x$ (of the training data) from every feature vector:
\[ x \mapsto x - \hat{\mu} . \]
• Standardization: Perform centering, and then divide every feature by the per-feature standard deviation $\hat{\sigma}_i = \sqrt{\frac{1}{|S|} \sum_{(x,y) \in S} (x_i - \hat{\mu}_i)^2}$:
\[ (x_1, x_2, \ldots, x_d) \mapsto \left( \frac{x_1 - \hat{\mu}_1}{\hat{\sigma}_1}, \frac{x_2 - \hat{\mu}_2}{\hat{\sigma}_2}, \ldots, \frac{x_d - \hat{\mu}_d}{\hat{\sigma}_d} \right) . \]
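For concreteness, here is a minimal NumPy sketch of the two transformations applied to a training matrix X with one example per row; the function names and the array layout are illustrative assumptions, not part of the problem.

```python
import numpy as np

def center(X):
    """Subtract the per-feature mean (computed from the rows of X) from every feature vector."""
    return X - X.mean(axis=0)

def standardize(X):
    """Center, then divide each feature by its per-feature standard deviation (the 1/|S| version)."""
    mu = X.mean(axis=0)
    sigma = np.sqrt(((X - mu) ** 2).mean(axis=0))
    return (X - mu) / sigma
```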
For each of the following learning algorithms, and each of the above pre-processing transformations,
does the transformation affect the learning algorithm?
(a) The classifier based on the generative model where class conditional distributions are multivariate Gaussian distributions with a fixed covariance equal to the identity matrix I. Assume
MLE is used for parameter estimation.
(b) The 1-NN classifier using Euclidean distance.
(c) The greedy decision tree learning algorithm with axis-aligned splits. (For concreteness, assume
Gini index is used as the uncertainty measure, and the algorithm stops after 20 leaf nodes.)
(d) Empirical Risk Minimization: the (intractable) algorithm that finds the linear classifier (both
the weight vector and threshold) that has the smallest training error rate.
To make this more precise, consider any training data set $S$ from $\mathcal{X} \times \mathcal{Y}$, and let $\hat{f}_S \colon \mathcal{X} \to \mathcal{Y}$ be the classifier obtained from the learning algorithm with training set $S$. Let $\phi \colon \mathcal{X} \to \mathcal{X}'$ be the pre-processing transformation, and let $\tilde{f}_{S'} \colon \mathcal{X}' \to \mathcal{Y}$ be the classifier obtained from the learning algorithm with training set $S'$, where $S'$ is the data set from $\mathcal{X}' \times \mathcal{Y}$ containing $(\phi(x), y)$ for each $(x, y)$ in $S$. We say $\phi$ affects the learning algorithm if there is a set of training data $S$ such that the classifier $\hat{f}_S$ is not the same as $x \mapsto \tilde{f}_{S'}(\phi(x))$.
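As a worked instance of this definition (only to illustrate the notation): under the centering transformation, $\phi(x) = x - \hat{\mu}$ with $\hat{\mu}$ computed from $S$, so for a given $S$ the question is whether
\[ \hat{f}_S(x) = \tilde{f}_{S'}(x - \hat{\mu}) \quad \text{for every } x \in \mathcal{X} . \]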
You should assume the following: (i) the per-feature standard deviations are never zero; (ii)
there are never any “ties” whenever you compute an arg max or an arg min; (iii) there are no issues
with numerical precision or computational efficiency.
What to submit: “yes” or “no” for each pre-processing and learning algorithm pair, along with
a brief statement of justification.
Problem 2 (More features; 30 points). Download the review data set reviews_tr.csv (training data) and reviews_te.csv (test data) from Courseworks. This data set consists of reviews of restaurants in Pittsburgh; the label indicates whether or not the reviewer-assigned rating is at
least four (on a five-point scale). The data are in CSV format (where the first line is the header);
the first column is the label (“label”), and the second column is the review text (“text”). The text
has been processed to remove non-alphanumeric symbols and to make all letters lowercase.
Write a script that, using only the training data, tries different methods of data representation
and different methods for learning a linear classifier, and ultimately selects—via five-fold cross
validation—and uses one combination of these methods to train a final classifier. (You can think
of the choice of these methods as “hyperparameters”.)
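A minimal sketch of the selection loop is given below, assuming scikit-learn's KFold for the fold splits; the helpers make_features and train_and_eval are hypothetical placeholders standing for one representation/learner combination, not functions provided with the assignment. The combination with the lowest cross-validation error rate is then retrained on the full training set.

```python
import numpy as np
from sklearn.model_selection import KFold

def cross_val_error(texts, labels, make_features, train_and_eval, n_splits=5):
    """Average held-out error rate of one representation/learner combination over n_splits folds."""
    labels = np.asarray(labels)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    fold_errors = []
    for train_idx, val_idx in kf.split(texts):
        train_texts = [texts[i] for i in train_idx]
        val_texts = [texts[i] for i in val_idx]
        # Fit the representation on the training fold only, then apply it to the validation fold.
        X_train, X_val = make_features(train_texts, val_texts)
        fold_errors.append(train_and_eval(X_train, labels[train_idx], X_val, labels[val_idx]))
    return float(np.mean(fold_errors))
```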
The data representations to try are the following.
1. Unigram representation.
In this representation, there is a feature for every word $w$, and the feature value associated with a word $w$ in a document $d$ is
\[ \mathrm{tf}(w; d) := \text{number of times word $w$ appears in document $d$} . \]
(tf is short for term frequency.)
2. Term frequency-inverse document frequency (tf-idf ) weighting.
This is like the unigram representation, except the feature associated with a word w in a
document d from a collection of documents D (e.g., training data) is
\[ \mathrm{tf}(w; d) \times \log_{10}(\mathrm{idf}(w; D)) , \]
where $\mathrm{tf}(w; d)$ is as defined above, and
\[ \mathrm{idf}(w; D) := \frac{|D|}{\text{number of documents in $D$ that contain word $w$}} . \]
This representation puts more emphasis on rare words and less emphasis on common words.
(There are many variants of tf-idf that are unfortunately all referred to by the same name.)
Important: When you apply this representation to a new document (e.g., a document in the
test set), you should still use the idf defined with respect to D. This, however, becomes
problematic if a word w appears in a new document but did not appear in any document in
D: in this case, idf(w; D) = |D|/0 = ∞. It is ambiguous what should be done in these cases,
but a safe recourse, which you should use in this assignment, is to simply ignore words w that
do not appear in any document in D.
3. Bigram representation.
In addition to the unigram features, there is a feature for every pair of words $(w_1, w_2)$ (called a bigram), and the feature value associated with a bigram $(w_1, w_2)$ in a given document $d$ is
\[ \mathrm{tf}((w_1, w_2); d) := \text{number of times bigram $(w_1, w_2)$ appears consecutively in document $d$} . \]
In the sequence of words "a rose is a rose", the bigrams that appear are: (a, rose), which appears twice; (rose, is); and (is, a). (Minimal code sketches of the unigram, tf-idf, and bigram features appear after this list.)
4. Another data representation of your own choosing or design.
Some examples:
• Extend the bigram representation to $n$-gram representations (for any positive integer $n$) to consider $n$-tuples of words appearing consecutively in documents.
• Extend to $k$-skip-$n$-grams: instead of just counting the number of times the $n$-tuple $(w_1, w_2, \ldots, w_n)$ appears consecutively in the document, also count occurrences where $w_i$ and $w_{i+1}$ are at most $k$ words apart.
• Use variants of tf-idf weighting, such as one that replaces $\mathrm{tf}(w; d)$ with $1 + \log_{10}(\mathrm{tf}(w; d))$. This is better for documents where words often appear in bursts.
• Select specific words or n-grams you think are informative for the classification task.
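For concreteness, minimal sketches of the count-based features above follow. Whitespace tokenization, the function names, and the dictionary-based feature encoding are assumptions made only for illustration; they are not requirements of the assignment.

```python
import math
from collections import Counter

def term_frequencies(document):
    """Unigram features: map each word w to tf(w; d)."""
    return Counter(document.split())

def bigram_frequencies(document):
    """Bigram features: map each consecutive word pair (w1, w2) to its count in the document."""
    words = document.split()
    return Counter(zip(words, words[1:]))

def idf_table(train_docs):
    """Map each word w to idf(w; D) = |D| / (number of documents in D that contain w)."""
    doc_freq = Counter()
    for doc in train_docs:
        doc_freq.update(set(doc.split()))
    return {w: len(train_docs) / df for w, df in doc_freq.items()}

def tfidf(document, idf):
    """tf-idf features: tf(w; d) * log10(idf(w; D)); words unseen in training are ignored."""
    return {w: c * math.log10(idf[w])
            for w, c in term_frequencies(document).items() if w in idf}

def sublinear_tf(tf_counts):
    """Variant mentioned above: replace each raw count tf(w; d) > 0 with 1 + log10(tf(w; d))."""
    return {w: 1.0 + math.log10(c) for w, c in tf_counts.items() if c > 0}

# Example: term_frequencies("a rose is a rose") == Counter({'a': 2, 'rose': 2, 'is': 1})
# Example: bigram_frequencies("a rose is a rose")
#   == Counter({('a', 'rose'): 2, ('rose', 'is'): 1, ('is', 'a'): 1})
```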
The learning methods to try are the following.
1. Averaged-Perceptron (with some modifications described below).
Averaged-Perceptron is like Voted-Perceptron (which uses Online Perceptron), except instead of forming the final classifier as a weighted-majority vote over the various linear classifiers, you simply form a single linear classifier by taking a weighted average of the weight vectors and thresholds of all the linear classifiers used by Online Perceptron (an illustrative sketch appears after this list of learning methods).
Use the following two modifications:
• Run Online Perceptron to make two passes through the data. Before each pass, randomly shuffle the order of the training examples. Note that a total of $2n + 1$ linear classifiers are considered during the run of the algorithm (where $n$ is the number of training examples).
• Instead of averaging the weights and thresholds for all $2n + 1$ linear classifiers, just average these things for the final $n + 1$ linear classifiers.
You should use Averaged-Perceptron with all four data representations.
2. Naïve Bayes classifier with parameter estimation as in the previous homework assignment.
Note that with this learning method, a word that appears more than once in a document is
treated the same as appearing exactly once.
You only need to use Naïve Bayes with the unigram representation.
3. (Optional.) Any other learning method you like.
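Since you must implement Averaged-Perceptron yourself, the following is only an illustrative sketch of the bookkeeping implied by the two modifications, not a reference solution. Labels in {-1, +1}, a bias term b in place of an explicit threshold, and all names are assumptions.

```python
import numpy as np

def averaged_perceptron(X, y, passes=2, seed=0):
    """Sketch: Online Perceptron with shuffling before each pass, averaging the
    final n + 1 weight vectors and biases. X has shape (n, d); y has entries in {-1, +1}."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    w_sum, b_sum, count = np.zeros(d), 0.0, 0
    for p in range(passes):
        last_pass = (p == passes - 1)
        if last_pass:  # the classifier entering the final pass is one of the final n + 1
            w_sum, b_sum, count = w_sum + w, b_sum + b, count + 1
        for i in rng.permutation(n):  # shuffle the order of examples before each pass
            if y[i] * (X[i] @ w + b) <= 0:  # mistake: Perceptron update
                w = w + y[i] * X[i]
                b = b + y[i]
            if last_pass:
                w_sum, b_sum, count = w_sum + w, b_sum + b, count + 1
    return w_sum / count, b_sum / count  # predict with sign(w_avg @ x + b_avg)
```

Averaging these per-step snapshots uniformly is the same as weighting each distinct linear classifier by the number of steps it survives, which corresponds to the weighted average described above.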
You must write your own code to implement Averaged-Perceptron. The code should be easy
to understand (e.g., by using sensible variable names and comments). For other things
(e.g., Naïve Bayes, cross validation, feature processing), you can use your own or existing library
implementations in MATLAB (+Statistics/ML library) and Python (+numpy, scipy, sklearn), but
you are responsible for the correctness of these implementations as per the specifications from the
course lectures. Provide references for any such third-party implementations you use (e.g., specific
scikit-learn or MATLAB functions that perform cross validation).
What to submit:
1. A concise and unambiguous description of the fourth data representation you try (and also
of any additional learning methods you try).
2. Cross-validation error rates for all data representation / learning method combinations.
3. Name of the method ultimately selected using the cross validation procedure.
4. Training and test error rates of the classifier learned by the selected method.
5. Source code and scripts that you write yourself (in separate files).
Problem 3 (MLE; 10 points).
(a) Consider the statistical model $\mathcal{P} = \{P_{\mu,\sigma^2} : \mu \in \mathbb{R}^d, \sigma^2 > 0\}$, where $P_{\mu,\sigma^2}$ is the multivariate Gaussian distribution with mean $\mu$ and covariance matrix $\sigma^2 I$. Give a formula for the MLE of $\sigma^2$ given the data $\{x_i\}_{i=1}^{n}$ from $\mathbb{R}^d$, which is regarded as an iid sample. Also give a clear derivation of the formula in which you briefly justify each step.
(b) Consider the statistical model $\mathcal{P} = \{P_\theta : \theta \in \mathbb{N}\}$, where $P_\theta$ is the distribution of the random pair $(X, Y)$ in $\mathbb{N} \times \{0, 1\}$ such that:
• $X$ is uniformly distributed on $\{1, 2, \ldots, \theta\}$;
• for any $x \in \{1, 2, \ldots, \theta\}$, the conditional distribution of $Y$ given $X = x$ is specified by $P_\theta(Y = 1 \mid X = x) = x/\theta$.
Give a formula for the MLE of $\theta$ given a single observation $(x, y) \in \mathbb{N} \times \{0, 1\}$. Also give a clear derivation of the formula in which you briefly justify each step.
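For reference only, the model specifications above determine the likelihood functions whose maximization you are asked to carry out and justify. In part (a), writing out the density of the $\mathcal{N}(\mu, \sigma^2 I)$ distribution gives
\[ \log L(\mu, \sigma^2) = \sum_{i=1}^{n} \log p_{\mu,\sigma^2}(x_i) = -\frac{nd}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} \|x_i - \mu\|_2^2 . \]
In part (b), the likelihood of the single observation $(x, y)$ is
\[ L(\theta) = P_\theta(X = x)\, P_\theta(Y = y \mid X = x) = \frac{1}{\theta} \left(\frac{x}{\theta}\right)^{y} \left(1 - \frac{x}{\theta}\right)^{1 - y} \quad \text{if } x \le \theta, \qquad \text{and } L(\theta) = 0 \text{ otherwise.} \]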