P1: When to applaud? [4 points]
1. Piano Clap.wav is an audio signal simulating a music concert, particularly at the end of a
song. As some of the audience haven’t heard of the song before, they started to applaud
before the end of the song (at around 1.5 seconds). Check out the audio.
2. Since I’m so kind, I did the MFCC conversion for you, which you can find in mfcc.mat. If you
load it, you’ll see a matrix X, which holds 962 MFCC vectors, each of which has 12 coefficients.
3. You can find the mean vectors and covariance matrices in MuSigma.mat that I kindly provide,
too. Once you load it, you’ll see the matrix mX, which has two column vectors as the mean
vectors. The first column vector of mX is a 12-dimensional mean vector of MFCCs of the
piano-only frames. The second vector holds another 12-dimensional mean vector of MFCCs
of the claps. In addition to that, Sigma is a 3D array (12×12×2), whose first 2D slice
(12×12) is the covariance matrix of the piano part, and the second 2D slice is for the claps.
4. Since you have all the parameters you need, you can calculate the p.d.f. of an MFCC vector
in X for the two multivariate normal (Gaussian) distributions you can define from the two sets
of means and cov matrices. Go ahead and calculate them. Put them in a 2 × 962 matrix:
$$\begin{bmatrix} P(X_1|C_1) & P(X_2|C_1) & \cdots & P(X_{962}|C_1) \\ P(X_1|C_2) & P(X_2|C_2) & \cdots & P(X_{962}|C_2) \end{bmatrix}$$
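Note that C1 is for the piano frames while C2 is for the applause. Normalize this matrix,
so that you can recover the posterior probabilities of belonging to the two classes given an
MFCC vector (this treats the two classes as equally likely a priori):
$$\tilde{P}_{c,t} = \frac{P(X_t|C_c)}{P(X_t|C_1) + P(X_t|C_2)}, \quad c \in \{1, 2\}$$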
Plot this 2 × 962 matrix as an image (e.g. using the imagesc function in MATLAB). This
is your detection result. If you see a frame at $t_0$ where $\tilde{P}_{1,t_0} < \tilde{P}_{2,t_0}$, it could be the right
moment to start clapping.
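The likelihood and normalization steps of this item can be sketched in Python with NumPy. This is only a minimal sketch under assumptions the handout does not state: X is arranged as 12 × 962 (one MFCC frame per column; transpose it if it is stored the other way), the two classes are equally likely a priori, and the function names `gaussian_likelihoods` and `posteriors` are made up for illustration.

```python
import numpy as np

def gaussian_likelihoods(X, mean, cov):
    """Evaluate a multivariate normal p.d.f. at every column of X.

    X: (d, n) array, one feature vector per column (assumed layout).
    mean: (d,) mean vector. cov: (d, d) covariance matrix.
    """
    d = X.shape[0]
    diff = X - mean[:, None]                 # center every frame
    z = np.linalg.solve(cov, diff)           # cov^{-1} diff without an explicit inverse
    mahal = np.sum(diff * z, axis=0)         # squared Mahalanobis distance per frame
    _, logdet = np.linalg.slogdet(cov)       # log-determinant of cov
    log_pdf = -0.5 * (d * np.log(2.0 * np.pi) + logdet + mahal)
    return np.exp(log_pdf)

def posteriors(X, means, covs):
    """means: (d, K), covs: (d, d, K). Returns (likelihoods, posteriors), each (K, n)."""
    K = means.shape[1]
    lik = np.vstack([gaussian_likelihoods(X, means[:, k], covs[:, :, k])
                     for k in range(K)])
    # Column-wise normalization; assumes equal class priors.
    post = lik / lik.sum(axis=0, keepdims=True)
    return lik, post
```

With the provided data, the call would presumably look like `lik, post = posteriors(X, mX, Sigma)`, and `post` is the 2 × 962 matrix to plot. If a frame is so unlikely under both classes that both likelihoods underflow to zero, the division yields NaN; staying in the log domain until the normalization would be a more robust variant.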
5. You might not like this result, because it is sensitive to the wrong claps in the middle.
You want to smooth them out. For this, you may want to come up with a transition matrix
with some dominant diagonal elements:
$$T = \begin{bmatrix} 0.9 & 0.1 \\ 0 & 1 \end{bmatrix}$$
What it means is that if you see a C1 frame, you’ll want to stay in that state (no clap) in the
next frame with a probability of 0.9, while you want to move to C2 (clap) with a probability
of 0.1. On the other hand, you absolutely want to stay at C2 once you observe a frame with
that label (you don’t want to go back once you start clapping).
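Apply this matrix to your $\tilde{P}$ matrix in a recursive way:
$$\bar{P}_{:,1} = \tilde{P}_{:,1}$$
$$b = \arg\max_{c} \bar{P}_{c,t}$$
$$\bar{P}_{:,t+1} = T_{b,:}^{\top} \odot \tilde{P}_{:,t+1}$$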
First, you initialize the first column vector of your new posterior prob matrix $\bar{P}$ (you have
no previous frame to work with at that moment). For a given time frame of interest, t + 1,
you first need to see its previous frame to find which class the frame belongs to (by using
the simple max operation on the posterior probabilities at t). This class index b is going to
be used to pick up the corresponding transition probabilities (one of the row vectors of the
T matrix). Then, the transition probabilities will be multiplied element-wise with your existing
posterior probabilities at t + 1.
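In the end, you may want to normalize them so that they can serve as the posterior probabilities:
$$\bar{P}_{1,t+1} = \frac{\bar{P}_{1,t+1}}{\bar{P}_{1,t+1} + \bar{P}_{2,t+1}}$$
$$\bar{P}_{2,t+1} = \frac{\bar{P}_{2,t+1}}{\bar{P}_{1,t+1} + \bar{P}_{2,t+1}}$$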
Repeat this procedure for all the 962 frames.
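The recursion of this item can be sketched in Python with NumPy as below. It is a sketch rather than a reference solution: the function name `smooth_posteriors` is hypothetical, the class of the previous frame is assumed to be picked from the already smoothed column, and the example matrix `T` is simply read off the description above (stay in C1 with 0.9, move to C2 with 0.1, never leave C2).

```python
import numpy as np

def smooth_posteriors(P_tilde, T):
    """Smooth frame-wise posteriors with a transition matrix.

    P_tilde: (K, n) posteriors, one frame per column.
    T: (K, K) matrix whose row b holds the transition probabilities out of class b.
    """
    P_bar = np.empty_like(P_tilde, dtype=float)
    P_bar[:, 0] = P_tilde[:, 0]                # no previous frame for the first column
    for t in range(P_tilde.shape[1] - 1):
        b = np.argmax(P_bar[:, t])             # class of the previous (smoothed) frame
        unnorm = T[b, :] * P_tilde[:, t + 1]   # element-wise product with row b of T
        # Renormalize so the column sums to one. If unnorm is all zeros
        # (e.g. stuck in C2 while the clap posterior is exactly 0), this is NaN.
        P_bar[:, t + 1] = unnorm / unnorm.sum()
    return P_bar

# Transition matrix as described in the text (rows: from C1, from C2).
T = np.array([[0.9, 0.1],
              [0.0, 1.0]])
```

As a small worked case, take two frames with posteriors (0.9, 0.1) and (0.4, 0.6). On its own, the second frame would be labeled as a clap. After smoothing, its unnormalized values are 0.9 · 0.4 = 0.36 and 0.1 · 0.6 = 0.06, so the piano posterior becomes 0.36 / 0.42 ≈ 0.857 and the isolated clap is suppressed.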
6. Plot your new smoothed post prob matrix $\bar{P}$. Do you like it?