Machine Learning for Signal Processing (ENGR-E 511) Homework 4


P1: When to applaud? [4 points]

1. Piano Clap.wav is an audio signal simulating a music concert, specifically the end of a song. Because some audience members have not heard the song before, they start to applaud before the song ends (at around 1.5 seconds). Check out the audio.

2. Since I’m so kind, I did the MFCC conversion for you, which you can find in mfcc.mat. If you load it, you’ll see a matrix X, which holds 962 MFCC vectors, each of which has 12 coefficients.

3. You can also find the mean vectors and covariance matrices in MuSigma.mat, which I kindly provide. Once you load it, you’ll see the matrix mX, whose two column vectors are the mean vectors. The first column of mX is the 12-dimensional mean vector of the MFCCs of the piano-only frames. The second column is the 12-dimensional mean vector of the MFCCs of the claps. In addition, Sigma is a 3D array (12×12×2) whose first 2D slice (12×12) is the covariance matrix of the piano part and whose second 2D slice is the covariance matrix of the clap sound.

4. Since you have all the parameters you need, you can evaluate the p.d.f. of each MFCC vector in X under the two multivariate normal (Gaussian) distributions defined by the two sets of means and covariance matrices. Go ahead and calculate them. Put them in a 2 × 962 matrix:

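$$
P = \begin{bmatrix}
P(X_1 \mid C_1) & P(X_2 \mid C_1) & \cdots & P(X_{962} \mid C_1) \\
P(X_1 \mid C_2) & P(X_2 \mid C_2) & \cdots & P(X_{962} \mid C_2)
\end{bmatrix}. \tag{1}
$$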

Note that $C_1$ is for the piano frames while $C_2$ is for the applause. Normalize each column of this matrix to recover the posterior probabilities of the two classes given an MFCC frame (this implicitly treats the two classes as equally likely a priori):

$$
\tilde{P}_{c,t} = \frac{P(X_t \mid C_c)}{\sum_{c'=1}^{2} P(X_t \mid C_{c'})} \tag{2}
$$

Plot this 2 × 962 matrix as an image (e.g. using the imagesc function in MATLAB). This is your detection result. If you see a frame $t_0$ where $\tilde{P}_{1,t_0} < \tilde{P}_{2,t_0}$, it could be the right moment to start clapping.
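The computation in step 4 can be sketched in Python with NumPy (the assignment itself points to MATLAB, so this is only an illustration). The sketch assumes frames are stored as columns, i.e. X is 12 × 962, which the text does not state explicitly. It runs on small synthetic stand-ins for X, mX, and Sigma rather than the real files, and the helper names `gaussian_log_pdf` and `class_posteriors` are hypothetical. Densities are evaluated in the log domain through a Cholesky factor, which is a common choice for numerical stability and not something the assignment requires.

```python
import numpy as np

def gaussian_log_pdf(X, mean, cov):
    """Log-density of every column of X (d x N) under N(mean, cov)."""
    d = X.shape[0]
    diff = X - mean[:, None]                     # d x N residuals
    L = np.linalg.cholesky(cov)                  # cov = L @ L.T
    z = np.linalg.solve(L, diff)                 # whitened residuals
    maha = np.sum(z ** 2, axis=0)                # squared Mahalanobis distances
    log_det = 2.0 * np.sum(np.log(np.diag(L)))   # log |cov|
    return -0.5 * (d * np.log(2.0 * np.pi) + log_det + maha)

def class_posteriors(X, means, covs):
    """means: d x C, covs: d x d x C. Returns P (Eq. 1) and P_tilde (Eq. 2)."""
    n_classes = means.shape[1]
    log_P = np.stack([
        gaussian_log_pdf(X, means[:, c], covs[:, :, c])
        for c in range(n_classes)
    ])                                           # C x N
    P = np.exp(log_P)                            # class-conditional densities
    # Normalise each column; subtracting the column max first avoids underflow.
    shifted = np.exp(log_P - log_P.max(axis=0, keepdims=True))
    P_tilde = shifted / shifted.sum(axis=0, keepdims=True)
    return P, P_tilde

# Synthetic stand-in data (NOT the real mfcc.mat / MuSigma.mat contents).
rng = np.random.default_rng(0)
d, n_half = 12, 50
means = np.stack([np.zeros(d), np.full(d, 3.0)], axis=1)   # plays the role of mX
covs = np.stack([np.eye(d), 2.0 * np.eye(d)], axis=2)      # plays the role of Sigma
X = np.concatenate([
    rng.normal(0.0, 1.0, size=(d, n_half)),                # "piano" frames
    rng.normal(3.0, np.sqrt(2.0), size=(d, n_half)),       # "clap" frames
], axis=1)

P, P_tilde = class_posteriors(X, means, covs)
print(P_tilde.shape)   # (2, 100)
```

With the real files, the arrays could presumably be read with `scipy.io.loadmat`, which returns a dictionary keyed by variable name (here X, mX, and Sigma, as named in the text). For the plot, matplotlib's `imshow` with `aspect="auto"` is a rough counterpart of MATLAB's imagesc.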

5. You might not like this result, because it is sensitive to the mistaken claps in the middle of the song. You want to smooth them out. For this, you may want to come up with a transition matrix with dominant diagonal elements:

$$
T = \begin{bmatrix} 0.9 & 0.1 \\ 0 & 1 \end{bmatrix}. \tag{3}
$$

What it means is that if you see a $C_1$ frame, you’ll want to stay in that state (no clap) in the next frame with probability 0.9, while you want to transition to $C_2$ (clap) with probability 0.1. On the other hand, you absolutely want to stay in $C_2$ once you observe a frame with that label (you don’t want to go back once you start clapping).

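Apply this matrix to your $\tilde{P}$ matrix in a recursive way:

\begin{align}
\bar{P}_{:,1} &= \tilde{P}_{:,1} \tag{4} \\
b &= \arg\max_{c} \bar{P}_{c,t} \tag{5} \\
\bar{P}_{:,t+1} &= T_{b,:}^{\top} \odot \tilde{P}_{:,t+1} \tag{6}
\end{align}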

First, you initialize the first column of your new posterior probability matrix $\bar{P}$ (you have no previous frame to work with at that moment). For a given time frame of interest, $t+1$, you first look at the previous frame to find which class it belongs to (by applying the simple max operation to the smoothed posterior probabilities at $t$). This class index $b$ is then used to pick the corresponding transition probabilities (one of the row vectors of the $T$ matrix). Then the transition probabilities are multiplied element-wise with your existing posterior probabilities at $t+1$.

In the end, you may want to normalize them so that they can serve as the posterior probabilities:

$$
\bar{P}_{1,t+1} = \frac{\bar{P}_{1,t+1}}{\bar{P}_{1,t+1} + \bar{P}_{2,t+1}}, \qquad
\bar{P}_{2,t+1} = \frac{\bar{P}_{2,t+1}}{\bar{P}_{1,t+1} + \bar{P}_{2,t+1}}. \tag{8}
$$

Repeat this procedure for all the 962 frames.
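The recursion in step 5 can be sketched as follows, again in Python with NumPy purely as an illustration. The function name `smooth_posteriors` and the seven-frame toy posterior matrix are made up for this example; class indices are 0-based in the code (index 0 is C1, piano; index 1 is C2, clap). The transition matrix is the one from Eq. (3).

```python
import numpy as np

def smooth_posteriors(P_tilde, T):
    """Greedy smoothing of a C x N posterior matrix following Eqs. (4)-(8)."""
    n_classes, n_frames = P_tilde.shape
    P_bar = np.empty((n_classes, n_frames))
    P_bar[:, 0] = P_tilde[:, 0]                    # Eq. (4): no previous frame yet
    for t in range(n_frames - 1):
        b = np.argmax(P_bar[:, t])                 # Eq. (5): class of previous frame
        unnorm = T[b, :] * P_tilde[:, t + 1]       # Eq. (6): element-wise product
        # Eq. (8): renormalise. This would divide by zero if unnorm were all
        # zeros, which does not happen in this toy example.
        P_bar[:, t + 1] = unnorm / unnorm.sum()
    return P_bar

T = np.array([[0.9, 0.1],
              [0.0, 1.0]])                         # Eq. (3)

# Toy posteriors: one spurious clap at frame 1, real applause from frame 4 on,
# and a frame (6) whose raw posterior falls back to piano.
clap = np.array([0.1, 0.6, 0.2, 0.1, 0.95, 0.99, 0.4])
P_tilde = np.vstack([1.0 - clap, clap])

P_bar = smooth_posteriors(P_tilde, T)
print(np.argmax(P_tilde, axis=0).tolist())   # [0, 1, 0, 0, 1, 1, 0]
print(np.argmax(P_bar, axis=0).tolist())     # [0, 0, 0, 0, 1, 1, 1]
```

In this toy run the isolated clap at frame 1 is suppressed, because 0.9 × 0.4 still exceeds 0.1 × 0.6, and once the clap class wins at frame 4 the second row of T keeps it there for all later frames.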

6. Plot your new smoothed posterior probability matrix $\bar{P}$. Do you like it?
