EECS445: Introduction to Machine Learning Homework #4

$30.00

Category: You will Instantly receive a download link for .zip solution file upon Payment

Description

5/5 - (6 votes)

1 [24 points] K-means for image compression
In this problem, we will apply the K-means algorithm to lossy image compression, by reducing the number
of colors used in an image. CTools contains a 512×512 image of a mandrill represented in 24-bit color.
This means that, for each of the 262144 pixels in the image, there are three 8-bit numbers (each ranging
from 0 to 255) that represent the red, green, and blue intensity values for that pixel. The straightforward
representation of this image therefore takes about 262144 ⇥ 3 = 786432 bytes (a byte being 8 bits). To
compress the image, we will use K-means to reduce the image to K = 16 colors. More specifically, each pixel
in the image is considered a point in the three-dimensional (r, g, b)-space. To compress the image, we will
cluster these points in color-space into 16 clusters, and replace each pixel with the closest cluster centroid.
Follow the instructions below. Be warned that some of these operations can take a while. (several minutes
even on a fast computer!)
(a) Load mandrill-large.tiff into matlab by typing A = double(imread(’mandrill-large.tiff’));.
Now, A is a “three dimensional matrix,” and A(:,:,1), A(:,:,2) and A(:,:,3) are 512⇥512 arrays that
respectively contain the red, green, and blue values for each pixel. Enter imshow(uint8(round(A)));
to display the image.
(b) Since the large image has 262144 pixels and would take a while to cluster, we will instead run vector
quantization on a smaller image. Repeat part (a) with mandrill-small.tiff. Treating each pixel’s
(r, g, b) values as an element of R3, run K-means with 16 clusters on the pixel data from this smaller
image, iterating (preferably) to convergence, but in no case for less than 30 iterations. For initialization,
set each cluster centroid to the (r, g, b)-values of a randomly chosen pixel in the image.
(c) Take the matrix A from mandrill-large.tiff, and replace each pixel’s (r, g, b) values with the value of
the closest cluster centroid. Display the new image, and compare it visually to the original image. Hand
in all your code and a printout of your compressed image. Comment on what regions of the original
image are better preserved and what regions are not preserved well.
1
(d) If we represent the image with these reduced (16) colors as described in (c), by (approximately) what
factor have we compressed the image?
compression factor = space used to store original image
space used to store compressed image
2 [20 points] Cross-Validation on Hyper-Parameters of SVM
In the previous assignment (Homework3), you have implemented an SVM classifier using two di↵erent
optimization methods, Batch Gradient Descent and Stochastic Gradient Descent (SGD). In this question,
we will only consider SGD since it converges faster and requires fewer iterations over the training set. SGD
solves this minimization:
min
w,b
E(w, b) = X
N
i=1
E(i)
(w, b)
where, E(i)
(w, b) = 1
2N ||w||2 + C max ⇣
0, 1 t
(i)
(wT x(i) + b)

The pseudo-code for SGD minimization is summarized in Homework3 Q3, in that question, you were
given the values for C and ⌘0 to use. Your task in this question is find good values for C and ⌘0 (a.k.a
hyper-paremeters1) using cross validation. Load the matlab file q2 data.mat, which loads the variables
q2x train, q2t train, q2x test and q2t test.
[Note: the dataset svm data.mat is di↵erent than dataset of Homework3 Q3, therefore the C and ⌘0 may
be di↵erent than what was recommended for use in Homework3. Nonetheless, you may reuse parts of your
code from Homework3]
(a) [3 points] Write down the expressions for rwE(w, b) and @
@bE(w, b), as well as the SGD update rules
for both parameters.
(b) [8 points] Implement “hold-out cross validation” using a (50% : 50%) split. In other words, train on
the first half of the training examples, and validate on the second half of the training examples to choose
a good configuration of C and ⌘0. Try all pairs of C and ⌘0 where C 2 {0.01, 0.1, 1, 10, 100, 1000} and
⌘0 2 {0.01, 0.5, 1.0, 10, 100}. Run 200 iterations of SGD for each combination of C and ⌘0. Report the C
and ⌘0 that minimize the validation error. If there were multiple pairs (C, ⌘0) that achieve the minimum
error, choose the pair for which ⌘0 is smallest. Finally, using those cross validated hyper-parameters,
train your SVM using the entire training set then compute and report the test error (i.e. number of
mis-classified test examples).
(c) [9 points] Implement “k-fold cross validation” using k = 10, known as 10-fold validation. Try all pairs
of C and ⌘0 where C 2 {0.01, 0.1, 1, 10, 100, 1000} and ⌘0 2 {0.01, 0.5, 1.0, 10, 100}. Run 200 iterations
of SGD for each combination of C and ⌘0. Report the C and ⌘0 that minimize the validation error.
Finally, using those cross validated hyper-parameters, train your SVM using the entire training set then
compute and report the test error (i.e. number of mis-classified test examples).
3 [20 points] Intuitions behind Kernels in classification
In the previous assignment, you have proven some useful properties of kernels. In addition, you have
Kernelized an algorithm. In this (short) question, you will explore why Kernels are useful. For this question,
load the matlab file q3 data.mat, which loads the variables q3x train, q3t train, q3x test and q3t test.
1The parameters of SVM are w and b, which are fit to the training data. The hyper-parameters are chosen before training
begins.
2
(a) [6 points] Train an SVM over the training data using matlab’s built-in svmtrain using a linear kernel
(i.e. no kernel). Plot the training data and the separating hyperplane. Use svmclassify to classify all
test examples. Calculate and report the classification accuracy.
[hint: svmtrain accepts a ’showplot’ argument which plots the training points and the separating hyperplane]
(b) [5 points] Train an SVM over the training data using a Radial Basis Function (RBF) kernel, also known
as Gaussian Kernel. Plot the training data and the separating hyperplane. Report the test classification
accuracy using the RBF kernel.
(c) [3 points] Which Kernel did better? RBF or linear (/ no kernel)? Why did this happen?
(d) [6 points] Use 5-fold cross validation to determine the best hyper-parameter, for the SVM with RBF
kernel. Try 2 {0.2, 0.5, 1, 1.5, 2, 2.5, 3} [Note: you can use the help command to find out how to setup
hyper parameters when using the svmtrain command
(e) [Extra credit 8 points] Repeat step a through d using LibSVM instead of Matlab’s sum toolkit. You
can find information about how to download and use LibSVM by Google libsvm and download libsvm.
4 [16 points] Update rules for a 2-layer Neural Network
In this question, you will derive the update rules for a 2-layer neural network. Our small Neural Network is
depicted in this diagram:
• The first (input) layer takes a training example x that is 3-dimensional (M = 3).
• Given an example x on the input layer, the second layer (a.k.a first hidden layer) computes a tanh(.)
transformation of the input layer2. In particular, each of the 3 depicted hidden neurons compute the
the following:
hk = tanh
0
@bk +X
3
j=1
wkjxj
1
A = tanh
bk + wk
T x

; for k = 1, 2, 3
Where wkj is the weight of the parameter (or edge, as depicted in the graph) connecting xj and hk,
and the vector wk
T = [wk1, wk2, wk3] As discussed in class, we can define a W matrix to contain all
wk vectors like:
W =
2
4
wT
1
wT
2
wT
3
3
5
2the lecture slides define zk = bk + wkT x and hk = tanh(zk)
3
• Using the matrix W, we can define the vector h to be:
h = tanh (b + Wx)
Where h = [h1, h2, h3] and b = [b1, b2, b3]
• Finally, the output layer (which contains only one, output neuron) is a logistic layer. In particular, the
output neuron is computes the following:
t
ˆ= (b4 + ✓T h)
where (.) is the logistic function, (a) = 1
1+ea , ✓ is the vector of containing the weights of edges
connecting hidden neurons to the output neuron, and b4 is the bias for the output neuron.
Given a training example x and its label t, we define the error on the example as the cross-entropy loss:
E(x, t) =
tlog(t
ˆ) + (1 t) log(1 t
ˆ)

If we want to optimize the network parameters (✓, b4,W, b) via SGD, we would first need to find expressions for the derivatives of the error function above with respect to each of the parameters.
(a) [6 points] Given a training example (x, t), find the derivatives of the error function with respect to
parameters of the output neuron. In particular, calculate expressions for r✓E(x, t) and @
@b4 E(x, t).
(b) [10 points] Given a training example (x, t), find the derivatives of the error function with respect to
parameters of the hidden layer. In particular, calculate expressions for rWE(x, t) and rbE(x, t).
4