P1: Instantaneous Source Separation [4 points]
1. As you might have noticed from my long hair, I’ve got a rock spirit. However, for this
homework I dabbled in composing jazz music. The title of the song is boring: Homework 3.
2. From x ica 1.wav to x ica 20.wav are 20 recordings of my song, Homework 3. Each recording
has N time-domain samples. In this music, an unknown number K of musical sources are
played at the same time. In other words, it simulates the situation where 20 of my students
come to my gig and record my band’s performance from 20 different locations (sounds unethical,
so I wouldn’t invite you guys, no worries). This can be seen as a situation where the K sources
S (a K × N matrix) were mixed with a 20 × K mixing matrix A to create the 20-channel
mixture: X = AS.
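The dimensions of this mixing model can be sanity-checked with a toy example. Note that the real K, the sources, and the mixing matrix are all unknown; the values below are placeholders for illustration only.

```python
import numpy as np

# Toy illustration of the mixing model X = A S (shapes only).
# K = 4 and N = 1000 are hypothetical placeholder values.
K, N = 4, 1000
S = np.random.randn(K, N)      # K source signals, N samples each
A = np.random.randn(20, K)     # 20 x K mixing matrix, one row per microphone
X = A @ S                      # 20 x N mixture, one row per recorded .wav file
print(X.shape)                 # (20, 1000)
```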
3. As you’ve learned how to do source separation using ICA, you should be able to separate
them out into K clean musical sources.
4. First, you don’t like the fact that there are too many recordings for this separation problem,
because you have a feeling that the number of sources is a lot smaller than 20. So, you decided
to do a dimension reduction first, before you actually go ahead and do ICA. For this, you
choose to perform PCA with the whitening option. Apply your PCA algorithm on your data
matrix X, a 20 × N matrix. Don’t forget to whiten the data. Make a decision as to how
many dimensions to keep, which will correspond to your K. Hint: take a very close look at
the eigenvalues.
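One way to implement whitened PCA is via an eigendecomposition of the sample covariance. This is a minimal sketch, not the required implementation; the function name and the synthetic check data are my own.

```python
import numpy as np

def pca_whiten(X, K):
    """Project D x N data X onto its top-K principal components and whiten,
    so the output Z (K x N) has (approximately) identity covariance."""
    Xc = X - X.mean(axis=1, keepdims=True)          # center each channel
    C = Xc @ Xc.T / Xc.shape[1]                     # D x D sample covariance
    evals, evecs = np.linalg.eigh(C)                # eigenvalues, ascending
    idx = np.argsort(evals)[::-1][:K]               # indices of top-K components
    # Whitening matrix: scale each kept eigenvector by 1/sqrt(eigenvalue)
    W_pca = evecs[:, idx].T / np.sqrt(evals[idx])[:, None]
    return W_pca @ Xc

# Sanity check on synthetic data: covariance of Z should be ~ identity
X = np.random.randn(20, 5000)
Z = pca_whiten(X, 4)
C_Z = Z @ Z.T / Z.shape[1]
print(np.allclose(C_Z, np.eye(4), atol=1e-6))       # True
```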
5. On your whitened/dimension reduced data matrix Z (K × N), apply ICA. At every iteration
of the ICA algorithm, use these as your update rules:
∆W = (NI − g(Y)f(Y)^T) W
W ← W + ρ∆W
Y ← W Z
W : The ICA unmixing matrix you’re estimating
Y : The K × N source matrix you’re estimating
Z : Whitened/dim reduced version of your input (using PCA)
g(x) : tanh(x)
f(x) : x
ρ : learning rate
N : number of samples
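The update rules above can be sketched as the following loop. This is one plausible reading of the rules, not the reference solution; the learning rate, iteration count, and identity initialization of W are assumptions you should tune yourself.

```python
import numpy as np

def ica(Z, rho=1e-6, n_iter=1000):
    """Natural-gradient-style ICA on whitened data Z (K x N).
    Returns the estimated sources Y, the unmixing matrix W, and the
    per-iteration norms of dW for the convergence graph."""
    K, N = Z.shape
    W = np.eye(K)                           # assumed starting point for W
    errs = []
    for _ in range(n_iter):
        Y = W @ Z                           # Y <- W Z
        g, f = np.tanh(Y), Y                # g(x) = tanh(x), f(x) = x
        dW = (N * np.eye(K) - g @ f.T) @ W  # dW = (N I - g(Y) f(Y)^T) W
        W = W + rho * dW                    # W <- W + rho * dW
        errs.append(np.linalg.norm(dW))     # track for the convergence graph
    return W @ Z, W, errs

Z = np.random.randn(3, 2000)                # stand-in for the real whitened data
Y, W, errs = ica(Z)
print(Y.shape)                              # (3, 2000)
```

Plotting `errs` against the iteration index gives the convergence graph asked for in item 6.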
6. Enjoy your separated music. Submit your separated .wav files, source code, and the convergence graph.
7. Implementation notes: the convergence of the ICA algorithm varies depending on the choice
of the learning rate, but I always see convergence within 5 to 90 seconds on my iMac.
P2: Ideal Masks [3 points]
1. piano.wav and ocean.wav are two sources you’re interested in. Load them separately and
apply STFT with 1024 point frames and 50% overlap. Use Hann windows. Let’s call these
two spectrograms S and N, respectively. Discard the complex conjugate part, so eventually
each will be a 513 × 158 matrix. Later on in this problem, when you recover the time-domain
signal, you can easily reconstruct the discarded half from the existing half so that
you can do an inverse DFT on full 1024-point column vectors. Hint: why 513, not 512?
Create a very short random signal with 16 samples, and apply the DFT to convert it
into a spectrum of 16 complex values. Check the complex coefficients to see why you
need N/2 + 1 bins, not N/2.
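The hint can be verified in a few lines. For a real-valued signal the DFT is conjugate-symmetric, so only the first N/2 + 1 bins carry unique information, which for a 1024-point frame gives 513 rows:

```python
import numpy as np

x = np.random.randn(16)                  # a very short random real signal
X_full = np.fft.fft(x)                   # 16 complex DFT coefficients
# Bins 9..15 are the complex conjugates of bins 7..1 (mirrored about bin 8):
print(np.allclose(X_full[9:], np.conj(X_full[1:8][::-1])))   # True
# So only N/2 + 1 = 9 bins are non-redundant, which is what rfft keeps:
print(np.fft.rfft(x).shape)              # (9,)
```

Bin 0 (DC) and bin N/2 (Nyquist) have no mirror partner, which is why the count is N/2 + 1 rather than N/2.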
2. Now you build a mixture spectrogram by simply adding the two source spectrograms: X =
S + N.
I’ll allow you to use a toolbox for STFT, but I encourage you to use your own implementation.