Description
P1: Kernel PCA [5 points]
1. concetric.mat contains 152 data points, each of which is a 2-dimensional column vector. If you scatter-plot them in 2D space, you’ll see that they form the concentric circles you saw in class. (A loading-and-plotting sketch appears after this list.)
2. Do kernel PCA on this data set to transform the points into a new 3-dimensional feature space, i.e., X ∈ R^{2×152} ⇒ Kernel PCA ⇒ Z ∈ R^{3×152}. (A kernel PCA sketch appears after this list.)
3. In theory, you have to do the centering and normalization procedures in the feature space, but for this particular data set, you can skip them.
4. You remember what the transformed 3D features look like, right? Now your data set is linearly separable. Train a perceptron that does this linear classification. In other words, your perceptron will take the transformed version of your data (a 3-dimensional vector) and do a linear combination with three weights w = [w_1, w_2, w_3]^⊤ plus the bias term b:

    y_i = σ(w^⊤ Z_{:,i} + b),        (1)

where y_i is the scalar output of your perceptron and σ is the activation function you define. If you choose the logistic function as your activation, then you’d better label your samples with 0 or 1. You know, the first 51 samples are for the inner circle, so their label will be 0, while I’d label the other, outer-ring samples with 1.
5. You’ll need to write a backpropagation algorithm to estimate your parameters w and b, which minimize the error between y_i and the target t_i. There are a lot of choices, but you can use the sum of squared errors: ∑_i ½(y_i − t_i)². Note that this dummy perceptron doesn’t have a hidden layer, because the nonlinearity is taken care of by the kernel method. Also note that you may want to initialize your parameters with some small (uniform or Gaussian) random numbers centered around zero. (A training-loop sketch appears after this list.)
6. Your training procedure will be sensitive to the choice of the learning rate and the initialization
scheme (the range of your random numbers). So, you may want to try out a few different
choices. Good luck!
7. Do NOT use TensorFlow or PyTorch’s automatic gradient computation feature for this. You
have to write your own backpropagation routine in Numpy.
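
For item 1 above, here is a minimal loading-and-plotting sketch. The variable key stored inside concetric.mat (“X” below) is an assumption; inspect the file with scipy.io.whosmat to find the actual key.

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.io import loadmat

    data = loadmat("concetric.mat")
    X = data["X"]                 # assumed key; expected shape (2, 152)

    # First 51 columns: inner circle; remaining 101 columns: outer ring.
    plt.scatter(X[0, :51], X[1, :51], label="inner circle (t = 0)")
    plt.scatter(X[0, 51:], X[1, 51:], label="outer ring (t = 1)")
    plt.axis("equal")
    plt.legend()
    plt.show()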
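
For item 2, a kernel PCA sketch under two assumptions: an RBF kernel, and the kernel width gamma below. Both are guesses that may need tuning for this data set. It continues from the previous sketch, where X was loaded.

    import numpy as np

    def rbf_kernel_pca(X, n_components=3, gamma=2.0):
        """X: (2, N), samples as columns. Returns Z: (n_components, N)."""
        sq = np.sum(X ** 2, axis=0)
        d2 = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)  # squared distances
        K = np.exp(-gamma * d2)                           # (N, N) Gram matrix
        # No centering in feature space -- the assignment lets us skip it here.
        eigvals, eigvecs = np.linalg.eigh(K)              # ascending order
        idx = np.argsort(eigvals)[::-1][:n_components]    # top components
        # Projections of the training points onto the leading components.
        return (eigvecs[:, idx] * np.sqrt(eigvals[idx])).T

    Z = rbf_kernel_pca(X)   # Z in R^{3x152}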
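
For items 4-6, a sketch of the dummy perceptron trained with a hand-written gradient of the sum-of-squared-errors loss through Eq. (1). The learning rate, epoch count, and initialization scale are assumptions to be tuned; Z comes from the kernel PCA sketch above.

    import numpy as np

    rng = np.random.default_rng(0)
    t = np.concatenate([np.zeros(51), np.ones(101)])  # inner = 0, outer = 1

    w = rng.normal(0.0, 0.1, size=3)  # small random init centered at zero
    b = rng.normal(0.0, 0.1)
    lr = 0.1                          # learning rate: expect to tune this

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    for epoch in range(2000):
        y = sigmoid(w @ Z + b)            # forward pass, Eq. (1); shape (152,)
        delta = (y - t) * y * (1.0 - y)   # dE/da through the logistic
        w -= lr * (Z @ delta)             # dE/dw = sum_i delta_i Z_{:,i}
        b -= lr * np.sum(delta)           # dE/db = sum_i delta_i

    loss = 0.5 * np.sum((y - t) ** 2)
    acc = np.mean((y > 0.5) == (t == 1))
    print(f"final loss {loss:.4f}, accuracy {acc:.3f}")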
P2: Neural Networks [5 points]
1. Build a neural network that has a single hidden layer for the concentric circles problem. Instead of taking the kernel PCA results as the input, your neural network will take the raw X matrix as the input. Therefore, instead of a perceptron with 3 input units as in the previous problem, your neural network will have 2 input units, each of which will take one of the coordinates of a sample.
2. As we skip kernel PCA, we need to implement the nonlinear part within the network. To this end, we will convert each 2D data point into a 3D feature vector first, a procedure that corresponds to the first layer of your neural network:

    X^{(2)}_{:,i} = σ(W^{(1)} X^{(1)}_{:,i} + b^{(1)}),        (2)

where X^{(1)}_{:,i} ∈ R^{2×1} is the input to the network (the i-th column vector of X), while X^{(2)}_{:,i} ∈ R^{3×1} is the output of the hidden units. What that means is that the weight matrix W^{(1)} ∈ R^{3×2} and the bias vector b^{(1)} ∈ R^{3×1}.
3. In the next (i.e., final) layer, you’ll do the usual linear classification on X^{(2)}_{:,i}, which will be similar to the perceptron you developed in Problem 1, although this time the input to the perceptron is X^{(2)}_{:,i}, not the kernel PCA results:

    y_i = σ(w^{(2)⊤} X^{(2)}_{:,i} + b^{(2)}),        (3)

where w^{(2)} and b^{(2)} correspond to w and b in Problem 1, respectively.
4. Note that you train these two layers altogether. Your backpropagation should work from the final layer back to the first layer. Eventually, you want to estimate these parameters with your backpropagation: W^{(1)} ∈ R^{3×2}, w^{(2)} ∈ R^{3×1}, b^{(1)} ∈ R^{3×1}, and b^{(2)} ∈ R. (A full training sketch appears after this list.)
5. Again, do NOT use any of the deep learning frameworks, such as TensorFlow or PyTorch,
to get around the differentiation. Write up your own backpropagation routine and update
algorithms.
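
Here is a sketch of the whole two-layer training loop, with backprop running from the output layer back to W^{(1)}. It assumes X is the raw 2×152 matrix loaded as in the P1 sketch and that both layers use the logistic activation; the learning rate, epoch count, and init scale are again guesses to tune.

    import numpy as np

    rng = np.random.default_rng(0)
    t = np.concatenate([np.zeros(51), np.ones(101)])  # inner = 0, outer = 1

    W1 = rng.normal(0.0, 0.5, size=(3, 2))  # W^(1) in R^{3x2}
    b1 = rng.normal(0.0, 0.5, size=(3, 1))  # b^(1) in R^{3x1}
    w2 = rng.normal(0.0, 0.5, size=(3, 1))  # w^(2) in R^{3x1}
    b2 = rng.normal(0.0, 0.5)               # b^(2), a scalar
    lr = 0.05

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    for epoch in range(20000):
        # Forward pass over all 152 columns at once.
        H = sigmoid(W1 @ X + b1)              # X^(2), shape (3, 152)
        y = sigmoid(w2.T @ H + b2).ravel()    # outputs, shape (152,)

        # Backward pass for the sum-of-squared-errors loss.
        d2 = (y - t) * y * (1.0 - y)             # output delta, (152,)
        dW2 = H @ d2[:, None]                    # gradient for w^(2), (3, 1)
        db2 = np.sum(d2)                         # gradient for b^(2)
        d1 = (w2 @ d2[None, :]) * H * (1.0 - H)  # hidden delta, (3, 152)
        dW1 = d1 @ X.T                           # gradient for W^(1), (3, 2)
        db1 = np.sum(d1, axis=1, keepdims=True)  # gradient for b^(1), (3, 1)

        w2 -= lr * dW2
        b2 -= lr * db2
        W1 -= lr * dW1
        b1 -= lr * db1

    print("accuracy:", np.mean((y > 0.5) == (t == 1)))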
P3: Spoken MNIST [5 points]
1. Download the spoken digit dataset from here: https://zenodo.org/record/1342401#.YjyUfiB0UE.
2. In the recordings subfolder, you will see a bunch of .wav files. Their naming scheme is like this: [digit class]_[subject name]_[example number].wav. For example, 7_jackson_30.wav means the 30th recording of Jackson pronouncing the number 7. There are four subjects, and each speaks each digit 50 times. Today, we will focus on Jackson’s recordings. Take a listen to them to hear how they sound.
3. Divide Jackson’s recordings into training and testing sets. Since there are 50 recordings per digit, use 45 of them for training and 5 of them for testing. Then, there will be 450 training examples in total across all 10 digits and 50 for testing.
4. Train 10 HMM models, one for each digit class. You will use 45 recordings of one digit class
to train one HMM model and so on.
5. This time, I won’t ask you to implement the whole HMM learning algorithm, as it is quite complicated. But I want you to know the basic mechanism of how a small-scale speech recognition system works. So, let’s use a Python HMM toolbox, which you can find here: https://hmmlearn.readthedocs.io/en/latest/.
6. There are various HMM versions you can choose from, but let’s use GMMHMM, which might be the most popular choice anyway. I briefly mentioned it at the end of the HMM module; it’s basically a variant of HMM whose emission probabilities are defined by a GMM. If you think of an HMM as an elaborate GMM with a complicated prior (i.e., the transition probabilities), a GMM-HMM can be seen as a GMM model working on top of another GMM model, necessitating nested EM-like algorithms. Anyway, you don’t have to worry about it today.
7. First, for digit class C, let’s convert each of the 45 utterances into an MFCC spectrogram. You will use a toolbox for this: librosa.feature.mfcc. As a result, you produce X^{(C)}_i ∈ R^{20×T^{(C)}_i}, where T^{(C)}_i is the number of MFCC vectors of the i-th utterance in the digit class C. Store both the MFCC matrix X^{(C)}_i and the length information T^{(C)}_i. Repeat this for all 45 utterances. (An extraction sketch appears after this list.)
8. In theory, you can feed the 45 utterances one by one to your HMM training algorithm, but the HMM toolbox you use has a more elegant way to deal with the situation. You will concatenate all the MFCC matrices from the 45 utterances into one big matrix and feed it just once: X^{(C)} = [X^{(C)}_1, X^{(C)}_2, . . . , X^{(C)}_{45}]. The only thing is, the algorithm doesn’t know the boundaries between different utterances. Once again, the toolbox provides a nice way to inform the algorithm of this, by feeding a vector of per-utterance length information, i.e., T^{(C)} = [T^{(C)}_1, T^{(C)}_2, . . . , T^{(C)}_{45}]. Take a look at the manual of the fit function of the GMMHMM class here: https://hmmlearn.readthedocs.io/en/latest/api.html#hmmlearn.hmm.GMMHMM.
9. So, you first need to initialize a GMMHMM class and call the fit function to train it. When you initialize it, you can choose to change a few hyperparameters, such as the number of iterations, the tolerance threshold (for the stopping criterion), the number of hidden states and GMM latent components (small numbers work just fine for these), etc. I adjusted them so that the learning algorithm runs long enough to converge. (A fitting sketch appears after this list.)
10. Repeat this process for all digit classes: C ∈ {0, 1, . . . , 9}. Now you have 10 HMM models, one trained on each digit class.
11. Convert your test utterances into MFCC matrices in the same way as above. Now you have 5 utterances per class to test with. Feed all five of them (as one big matrix, along with the length information) to each of the 10 HMM instances for testing. For this, you will use the score function, which will give you the “log-likelihood” score. For each digit-specific test set, you get 10 numbers from the 10 HMM instances. The one that gives you the largest number tells you the digit class the test set belongs to. Report these numbers. (A scoring sketch appears after this list.)
12. Draw a confusion matrix whose x-axis is the ground-truth digit class label and whose y-axis is the prediction. Hence, for a given row of the confusion matrix, the brightest pixel should be on the diagonal element if the classification is accurate. Submit this confusion matrix plot, too. (A plotting sketch appears after this list.)
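
For item 7, an MFCC-extraction sketch with librosa. The recordings/ path and the assumption that example numbers run from 0 to 49 (so 0-44 for training) reflect how the files appear to be laid out; double-check against your download.

    import librosa
    import numpy as np

    def mfcc_for_files(paths, n_mfcc=20):
        """Return a list of (20, T_i) MFCC matrices and their lengths T_i."""
        mats, lengths = [], []
        for p in paths:
            y, sr = librosa.load(p, sr=None)  # keep the native sampling rate
            M = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
            mats.append(M)
            lengths.append(M.shape[1])
        return mats, lengths

    # e.g., the 45 training utterances of digit class C spoken by Jackson:
    C = 7
    train_paths = [f"recordings/{C}_jackson_{i}.wav" for i in range(45)]
    train_mats, train_lens = mfcc_for_files(train_paths)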
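
For items 8-9, a fitting sketch. hmmlearn expects frames as rows, so each (20, T_i) matrix is transposed before concatenation; the state count, mixture count, and stopping settings below are assumptions to adjust until training converges.

    import numpy as np
    from hmmlearn.hmm import GMMHMM

    def fit_digit_hmm(mats, lengths):
        # X^(C): stack the 45 utterances into one (sum_i T_i, 20) matrix.
        X_cat = np.concatenate([M.T for M in mats], axis=0)
        model = GMMHMM(n_components=4, n_mix=3, covariance_type="diag",
                       n_iter=200, tol=1e-4, random_state=0)
        model.fit(X_cat, lengths)   # lengths marks the utterance boundaries
        return model

    models = {}
    for C in range(10):
        paths = [f"recordings/{C}_jackson_{i}.wav" for i in range(45)]
        mats, lens = mfcc_for_files(paths)   # from the previous sketch
        models[C] = fit_digit_hmm(mats, lens)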
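
For item 11, a scoring sketch: each digit’s 5-utterance test set is fed as one big matrix, along with its length vector, to all 10 models, and the largest log-likelihood wins. The 45-49 test indices are the same assumption as before.

    import numpy as np

    for C in range(10):
        paths = [f"recordings/{C}_jackson_{i}.wav" for i in range(45, 50)]
        mats, lens = mfcc_for_files(paths)
        X_cat = np.concatenate([M.T for M in mats], axis=0)
        scores = [models[k].score(X_cat, lens) for k in range(10)]
        pred = int(np.argmax(scores))
        print(f"true digit {C}: predicted {pred}, scores {np.round(scores, 1)}")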
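
For item 12, a plotting sketch. So that the matrix gets one count per utterance, this version scores each test utterance individually, a slight departure (my assumption) from the set-level scoring of item 11; columns are ground truth (x-axis) and rows are predictions (y-axis), as specified.

    import numpy as np
    import matplotlib.pyplot as plt

    conf = np.zeros((10, 10))
    for C in range(10):
        paths = [f"recordings/{C}_jackson_{i}.wav" for i in range(45, 50)]
        mats, _ = mfcc_for_files(paths)
        for M in mats:
            scores = [models[k].score(M.T) for k in range(10)]
            conf[int(np.argmax(scores)), C] += 1  # row: prediction, col: truth

    plt.imshow(conf)
    plt.xlabel("ground-truth digit")
    plt.ylabel("predicted digit")
    plt.colorbar(label="count")
    plt.savefig("confusion.png")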