P1: Neural Network for Source Separation [4 points]
1. When you were attending IUB, you took a course taught by Prof. K. Since you really liked his
lectures, you decided to record them without the professor’s permission. You felt awkward,
but you did it anyway because you really wanted to review his lectures later.
2. Although you meant to review the lecture every time, it turned out that you never listened
to it. After you graduated, you realized that a lot of concepts you face at work were actually
covered by Prof. K’s class. So, you decided to revisit the lectures and study the materials
once again using the recordings.
3. You should have reviewed your recordings earlier. It turned out that a fellow student who
used to sit next to you always ate chips in the middle of class, right beside your microphone.
As a result, Prof. K’s beautiful deep voice was contaminated by the annoying chip-eating
noise. So, you decided to build a simple NN-based speech denoiser that takes a noisy speech
spectrum (speech plus chip-eating noise) and produces a cleaned-up speech spectrum.
4. trs.wav and trn.wav are the speech and noise signals you are going to use for training the
network. Load them and call the variables s and n. Add them up and call the noisy signal x.
Each of the three must be a 403,255-dimensional column vector.
5. Transform the three vectors using the STFT (frame size 1024, hop size 512, Hann windowing).
This gives you three complex-valued matrices, S, N, and X, each of which has about
800 spectra. Each spectrum should have 513 Fourier coefficients (after discarding the complex
conjugates as usual). |X| is your input matrix (each column vector is one input sample).
6. Define an Ideal Binary Mask (IBM) M by comparing S and N:
$$M_{f,t} = \begin{cases} 1 & \text{if } |S_{f,t}| > |N_{f,t}| \\ 0 & \text{otherwise,} \end{cases}$$
whose column vectors are the target samples.
7. Train a shallow neural network with a single hidden layer of 50 hidden units. For
the hidden layer, you can use tanh (or whatever activation function you prefer, e.g. rectified
linear units). For the output layer, however, you have to apply a logistic function to each of
your 513 output units rather than any other activation function, because you want your network
output to range between 0 and 1 (remember, you’re predicting a binary mask!). Feel
free to investigate other training options, such as weight decay and early stopping (by
holding out 10% of the spectra for validation). But you don’t have to.
8. Your baseline shallow tanh network should work to some degree, and once the performance
is above my criterion, you’ll get a full score.
9. tex.wav and tes.wav are the noisy test signal and its corresponding ground-truth clean
speech. Load them and apply the STFT as before. Feed the magnitude spectra of the test
mixture $|X_{test}|$ to your network and predict their masks $M_{test}$ (ranging between 0 and 1).
Then, you can recover the (complex-valued) speech spectrogram of the test signal in this way:
$$\hat{S}_{test} = X_{test} \odot M_{test},$$
where $\odot$ denotes element-wise multiplication.
10. Recover the time-domain speech signal by applying an inverse STFT to $\hat{S}_{test}$. Let’s
call this cleaned-up test speech signal $\hat{s}$. From tes.wav, you can load the ground-truth clean
test speech signal $s$. Report their Signal-to-Noise Ratio (SNR):
$$\mathrm{SNR} = 10 \log_{10} \frac{s^\top s}{(s - \hat{s})^\top (s - \hat{s})}$$
11. Note: My shallow network implementation converges in 5000 epochs, which never takes more
than 5 minutes on my laptop CPU. Don’t bother learning GPU computing for this problem.
Your network should give at least 6 dB SNR.
12. Note: DO NOT use TensorFlow, PyTorch, or any other package that calculates
gradients for you. You need to come up with your own backpropagation algorithm. It’s okay to
use the one you wrote up in the previous homework.
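The signal-processing part of P1 (STFT, IBM, masking, inverse STFT, SNR) can be sketched roughly as follows. This is a minimal NumPy sketch, not a reference solution: the function names `stft`, `istft`, `ideal_binary_mask`, and `snr_db` are hypothetical, the synthetic tone and white noise merely stand in for the wav files, the SNR is assumed to be the ratio of clean-signal energy to error energy, and the oracle IBM is used where your trained network’s predicted mask would go.

```python
import numpy as np

def stft(x, frame=1024, hop=512):
    """Hann-windowed STFT. Returns a (frame // 2 + 1, n_frames) complex matrix."""
    # Periodic Hann window: with hop = frame / 2 the shifted windows sum to 1.
    window = 0.5 - 0.5 * np.cos(2.0 * np.pi * np.arange(frame) / frame)
    n_frames = 1 + (len(x) - frame) // hop
    frames = np.stack(
        [x[t * hop:t * hop + frame] * window for t in range(n_frames)], axis=1
    )
    return np.fft.rfft(frames, axis=0)  # keeps only the non-redundant half

def istft(X, frame=1024, hop=512):
    """Inverse STFT by plain overlap-add (valid for the window and hop above)."""
    frames = np.fft.irfft(X, n=frame, axis=0)
    out = np.zeros(frame + hop * (X.shape[1] - 1))
    for t in range(X.shape[1]):
        out[t * hop:t * hop + frame] += frames[:, t]
    return out

def ideal_binary_mask(S, N):
    """1 where the speech magnitude exceeds the noise magnitude, else 0."""
    return (np.abs(S) > np.abs(N)).astype(float)

def snr_db(s, s_hat):
    """Assumed SNR definition: clean-signal energy over error energy, in dB."""
    err = s - s_hat
    return 10.0 * np.log10(np.dot(s, s) / np.dot(err, err))

# Synthetic stand-ins for the speech and noise recordings (assumption).
rng = np.random.default_rng(0)
t = np.arange(16384) / 16000.0
s = np.sin(2.0 * np.pi * 220.0 * t)        # "speech": a pure tone
n = 0.3 * rng.standard_normal(t.size)      # "noise": white noise
x = s + n

S, N, X = stft(s), stft(n), stft(x)
M = ideal_binary_mask(S, N)                # oracle mask; a network would predict this
s_hat = istft(X * M)                       # mask the complex mixture, then invert
print(S.shape, snr_db(s, s_hat))
```

Note that the first and last half-frames are covered by only one window in this sketch, so the reconstruction is exact only in the interior of the signal.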
P2: Stereo Matching (revisited) [4 points]
1. im0.ppm (left) and im8.ppm (right) are pictures taken from two different camera positions.
If you load the images, each will be a three-dimensional array of size 381 × 430 × 3, whose third
dimension holds the three color channels (RGB). Let’s call them $X^L$ and $X^R$. For the (i, j)-th
pixel in the right image, $X^R_{(i,j,:)}$, which is a 3-d vector of RGB intensities, we can scan and
find the most similar pixel in the left image in the i-th row (using a metric of your choice). For
example, I did the search from $X^L_{(i,j,:)}$ to $X^L_{(i,j+39,:)}$ to see which of the 40 pixels is
the closest, and I record the index-distance of the closest pixel. Let’s say that $X^L_{(i,j+19,:)}$ is the
most similar one to $X^R_{(i,j,:)}$. Then, the index-distance is 19. I record this index-distance (to
the closest pixel in the left image) for all pixels in my right image to create a matrix called the
“disparity map”, D, whose (i, j)-th element is the index-distance between the (i, j)-th pixel
of the right image and its closest pixel in the left image. For an object in the right image, if
its pixels are associated with an object in the left image but are shifted far away, that means
the object is close to the cameras, and vice versa.
2. Calculate the disparity map D from im0.ppm and im8.ppm, which will be a matrix of 381×390
(since we search within only 40 pixels). Vectorize the disparity matrix and draw a histogram.
How many clusters do you see?
3. Write up your own GMM clustering code, and cluster the disparity values in D (you can of
course use your own code from the previous homework, but not the ones from the toolboxes).
Each value will belong to (only) one of the clusters. The number of clusters says the number
of depth levels. If you replace the disparity values with the cluster means, you can recover
the depth map with k levels. Plot your depth map (the disparity map replaced by the mean
disparities as in the image quantization examples) in grayscale: pixels of the frontal objects
should be bright, while the ones in the back get darker.
4. Extend your implementation with the MRF’s smoothing priors using an eight-neighborhood
system, e.g.
$$\mathcal{N}_{i,j} = \{(i-1,j-1), (i-1,j), (i-1,j+1), (i,j-1), (i,j+1), (i+1,j-1), (i+1,j), (i+1,j+1)\}.$$
Feel free to choose either ICM or Gibbs sampling. Show me the smoothed
results. You can use the Gaussian-kernel-looking prior probability equations discussed in class.
5. Submit your estimated depth maps from both the naïve GMM and the MRF.
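The disparity search and the GMM quantization steps of P2 might look roughly like the sketch below. It makes several assumptions: squared RGB distance is used as the similarity metric, the GMM means are initialized from quantiles, and a random synthetic stereo pair stands in for im0.ppm and im8.ppm. The names `disparity_map` and `gmm_1d` are hypothetical, and the MRF smoothing step is not covered.

```python
import numpy as np

def disparity_map(XL, XR, max_shift=40):
    """For each right-image pixel (i, j), return the shift d in [0, max_shift)
    whose left-image pixel (i, j + d) is closest in squared RGB distance."""
    H, W, _ = XR.shape
    W_out = W - max_shift                    # e.g. 430 - 40 = 390 columns
    XL, XR = XL.astype(float), XR.astype(float)
    costs = np.empty((max_shift, H, W_out))
    for d in range(max_shift):
        diff = XL[:, d:d + W_out, :] - XR[:, :W_out, :]
        costs[d] = np.sum(diff ** 2, axis=2)
    return np.argmin(costs, axis=0)

def gmm_1d(x, k, n_iter=100):
    """EM for a 1-D Gaussian mixture. Returns means, variances, priors, labels."""
    x = np.asarray(x, dtype=float)
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)   # spread-out initial means
    var = np.full(k, np.var(x) + 1e-6)
    prior = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: posterior responsibilities, computed in the log domain.
        log_p = (-0.5 * (x[:, None] - mu) ** 2 / var
                 - 0.5 * np.log(2.0 * np.pi * var) + np.log(prior))
        log_p -= log_p.max(axis=1, keepdims=True)
        r = np.exp(log_p)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate the parameters from the soft counts.
        nk = r.sum(axis=0) + 1e-12
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
        prior = nk / nk.sum()
    return mu, var, prior, r.argmax(axis=1)

# Synthetic stereo pair (assumption): top rows shifted by 5, bottom rows by 20.
rng = np.random.default_rng(0)
XL = rng.random((8, 70, 3))
XR = np.empty_like(XL)
XR[:4] = np.roll(XL[:4], -5, axis=1)     # far object: small shift
XR[4:] = np.roll(XL[4:], -20, axis=1)    # near object: large shift

D = disparity_map(XL, XR)                # shape (8, 30)
mu, var, prior, labels = gmm_1d(D.ravel(), k=2)
depth = mu[labels].reshape(D.shape)      # disparities replaced by cluster means
print(D.shape, np.sort(mu))
```

Displaying `depth` in grayscale then makes larger mean disparities (closer objects) brighter.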
P3: Rock or Metal [4 points]
1. trX.mat contains a matrix of size 2 × 160. Each of the column vectors holds “loudness” and
“noisiness” features that describe a song. If the song is louder and noisier, it belongs to the
“metal” class, and vice versa. trY.mat holds the labeling information of the songs: -1 for
“rock”, +1 for “metal”.
2. Implement your own AdaBoost training algorithm. Train your model by adding weak learners.
For your m-th weak learner, train a perceptron (no hidden layer) with the weighted error
$$E(y_t \,\|\, \hat{y}_t) = w_t (y_t - \hat{y}_t)^2,$$
where $w_t$ is the weight applied to the t-th example after the (m − 1)-th step. Note that $\hat{y}_t$ is the
output of your perceptron, whose activation function is tanh.
3. Implementation note: make sure that the m-th weak learner $\phi_m(x)$ is the sign of the
perceptron output, i.e. $\mathrm{sgn}(\hat{y}_t)$. That is, while training the m-th perceptron, you use
$\hat{y}_t$ as the output to calculate the backpropagation error, but once the perceptron training is done,
$\phi_m(x_t) = \mathrm{sgn}(\hat{y}_t)$, not $\phi_m(x_t) = \hat{y}_t$.
4. Don’t worry about testing the model on the test set. Instead, report a figure that shows the
final weights over the examples (by changing the size of the markers), as well as the prediction
of the models (giving different colors to the area). I’m expecting something similar to the
ones in M12 S26.
5. Report your classification accuracy on the training samples, too.
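One possible shape for the P3 training loop is sketched below. It assumes the standard discrete AdaBoost weight update (the exact variant from the lectures may differ), a weighted squared error for each tanh perceptron, and two synthetic Gaussian blobs in place of trX.mat and trY.mat. All function names are hypothetical.

```python
import numpy as np

def perceptron_output(X, a, b):
    """tanh perceptron without a hidden layer; X is (2, T)."""
    return np.tanh(a @ X + b)

def train_perceptron(X, y, w, epochs=200, lr=0.5, seed=0):
    """Gradient descent on sum_t w_t * (y_t - y_hat_t)^2 (assumed error)."""
    rng = np.random.default_rng(seed)
    a, b = 0.1 * rng.standard_normal(X.shape[0]), 0.0
    for _ in range(epochs):
        y_hat = perceptron_output(X, a, b)
        # Derivative of the weighted error w.r.t. the pre-activation.
        g = -2.0 * w * (y - y_hat) * (1.0 - y_hat ** 2)
        a -= lr * (X @ g)
        b -= lr * g.sum()
    return a, b

def weak_learner(X, a, b):
    """phi_m(x) = sgn(y_hat), with ties mapped to +1."""
    return np.where(perceptron_output(X, a, b) >= 0.0, 1.0, -1.0)

def adaboost(X, y, n_learners=10):
    T = X.shape[1]
    w = np.full(T, 1.0 / T)                  # start from uniform example weights
    learners, betas = [], []
    for m in range(n_learners):
        a, b = train_perceptron(X, y, w, seed=m)
        phi = weak_learner(X, a, b)
        err = np.clip(np.sum(w * (phi != y)), 1e-10, 1.0 - 1e-10)
        beta = 0.5 * np.log((1.0 - err) / err)
        w = w * np.exp(-beta * y * phi)      # up-weight misclassified examples
        w /= w.sum()
        learners.append((a, b))
        betas.append(beta)
    return learners, np.array(betas), w

def predict(X, learners, betas):
    votes = sum(beta * weak_learner(X, a, b)
                for (a, b), beta in zip(learners, betas))
    return np.where(votes >= 0.0, 1.0, -1.0)

# Synthetic "rock" (-1) and "metal" (+1) songs (assumption).
rng = np.random.default_rng(0)
X = np.hstack([rng.normal(-2.0, 0.5, (2, 80)), rng.normal(2.0, 0.5, (2, 80))])
y = np.concatenate([-np.ones(80), np.ones(80)])

learners, betas, w = adaboost(X, y)
accuracy = np.mean(predict(X, learners, betas) == y)
print(accuracy)
```

The returned `w` holds the final example weights, which could drive the marker sizes in the requested figure, while `predict` evaluated on a grid would give the colored decision regions.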
P4: PLSI for Analyzing Twitter Stream [3 points]
1. twitter.mat holds two Term-Frequency (TF) matrices, Xtr and Xte. It also contains YtrMat
and YteMat, the target variables in the one-hot vector format.
2. Each column of the TF matrix Xtr can be either “positive”, “negative”, or “neutral”, which
are represented numerically as 1, 2, and 3 in YtrMat. They are the sentiment classes of the tweets.
3. Learn 50 PLSI topics $B \in \mathbb{R}^{891 \times 50}$ and their weights $\Theta_{tr} \in \mathbb{R}^{50 \times 773}$ from the training data
Xtr, using the ordinary PLSI update rules.
4. Reduce the dimension of Xte down to 50 by learning the weight matrix $\Theta_{te}$, which has 50
rows. This can be done by running another PLSI on the test data Xte, but this time reusing the topic
matrix B you learned from the training set. So, you skip the update rule for B and only
update $\Theta_{te}$.
5. Define a perceptron layer for the softmax classification. This part is similar to the case with
kernel PCA with a perceptron as you did in Homework #4 Problem 3. Instead of the kernel
PCA results as the input to the perceptron, you use Θtr for training, and Θte for testing.
This time the number of output units is 3 as there are three classes, which is why each target
vector in YtrMat has three elements. Review M6 S37-39 to recall what softmax is.
6. Report your classification accuracy.
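The P4 pipeline could be sketched along the following lines. The multiplicative, column-normalized form of the PLSI updates used here is one common matrix formulation and is an assumption about what “ordinary PLSI update rules” refers to. Training the softmax layer with full-batch gradient descent on the cross-entropy is also an assumption. The names `plsi`, `train_softmax`, and `predict_classes` are hypothetical, and small random count matrices stand in for twitter.mat.

```python
import numpy as np

def plsi(X, k, n_iter=100, B=None, seed=0, eps=1e-12):
    """Multiplicative PLSI updates. If B is given, it is kept fixed and only
    the weights Theta are learned (as for the test set)."""
    rng = np.random.default_rng(seed)
    n_terms, n_docs = X.shape
    learn_B = B is None
    if learn_B:
        B = rng.random((n_terms, k)) + eps
        B /= B.sum(axis=0, keepdims=True)
    Theta = rng.random((k, n_docs)) + eps
    Theta /= Theta.sum(axis=0, keepdims=True)
    for _ in range(n_iter):
        if learn_B:
            R = X / (B @ Theta + eps)
            B = B * (R @ Theta.T)
            B /= B.sum(axis=0, keepdims=True) + eps          # topics sum to 1
        R = X / (B @ Theta + eps)
        Theta = Theta * (B.T @ R)
        Theta /= Theta.sum(axis=0, keepdims=True) + eps      # weights sum to 1
    return B, Theta

def train_softmax(Theta, Y, epochs=300, lr=0.5):
    """Single softmax layer; Theta is (k, n), Y is one-hot (c, n)."""
    W = np.zeros((Y.shape[0], Theta.shape[0]))
    b = np.zeros((Y.shape[0], 1))
    for _ in range(epochs):
        Z = W @ Theta + b
        Z -= Z.max(axis=0, keepdims=True)                    # numerical stability
        P = np.exp(Z)
        P /= P.sum(axis=0, keepdims=True)
        G = (P - Y) / Theta.shape[1]                         # cross-entropy gradient
        W -= lr * (G @ Theta.T)
        b -= lr * G.sum(axis=1, keepdims=True)
    return W, b

def predict_classes(Theta, W, b):
    return np.argmax(W @ Theta + b, axis=0)

# Tiny synthetic stand-ins for Xtr, Xte, and the training labels (assumption).
rng = np.random.default_rng(0)
Xtr = rng.poisson(2.0, (30, 40)) + 1.0
Xte = rng.poisson(2.0, (30, 10)) + 1.0
labels_tr = rng.integers(0, 3, 40)
Ytr = np.eye(3)[:, labels_tr]                 # one-hot targets, shape (3, 40)

B, Theta_tr = plsi(Xtr, k=5)
B_te, Theta_te = plsi(Xte, k=5, B=B)          # B reused, only Theta_te learned
W, b = train_softmax(Theta_tr, Ytr)
pred_te = predict_classes(Theta_te, W, b)
print(Theta_tr.shape, Theta_te.shape, pred_te)
```

With the real data, comparing the predicted classes against the labels stored in YteMat would give the test accuracy to report.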