## Description

In this lab, you will write code for a simple signal processing and for a musical instrument simulation, parallelize it using CUDA, and write a report summarizing your experimental results. We will use PNG images as the test signal for the first part (convolution).

## A. Convolution: (30 points)

Convolution is a slightly more complicated operation than rectification and pooling from Lab 0, but it is still highly parallelizable. For each pixel in the input image, a 3×3 convolution computes the corresponding pixel in the output image using a weighted sum of the input pixel and its neighbors.

That is: ππ’π‘ππ’π‘[π][π] = ΰ· ΰ· ππππ’π‘[π + ππ β 1][π + ππ β 1]π€[ππ][ππ] ΰ¬Ά ΰ―ΰ―ΰΰ¬΄ ΰ¬Ά ΰ―ΰ―ΰΰ¬΄ , πππ 1 β€ π β€ π β 1,1 β€ π β€ π β 1 where m is the number of rows in the input image, and n is the number of columns in the input image. Since we are using square weight matrices and the βvalid paddingβ definition of convolution (1 β€ i β€ m β 1, 1 β€ i β€ n β 1), the output image will be of size m β 2 by n β 2. Figure 1 illustrates the convolution operation for a certain weight matrix, W. (Note: You donβt have to manipulate the Alpha channel.) Write the code for convolution.

For this lab, you only need to implement convolution using 3×3 weight matrices for the 3 test images provided. Before you convert the output to unsigned chars and save the file, you should clamp the output of the convolution between 0 and 255 (i.e., if a value in the output is less than 0, you should set it equal to 0, and if a value is greater than 255, you should set it equal to 255). The header file βwm.hβ (on myCourses) contains the weight matrix you should use.

(15 points) Please provide the 3 output images after convolution for the given 3 test images. (15 points) The grader will run the following command: ./convolve < # threads> When the input test image is βTest_1.pngβ, the output of your code should be identical to βTest_1_convolve.pngβ. These are available on myCourses. You can use the βtest_equality.cβ code provided in Lab 0 to check if two images are identical. The grader should be able to run your code with different numbers of threads, and the output of your code should be correct for any of those thread counts.

Figure 1: Convolution Example Report Deliverables: (20 points) ο· Measure the runtime when the number of threads used is equal to {1, 4, 8, 16, 64, 128, 256, 512, 1024}. ο· Plot the speedup as a function of the number of threads for each of the 3 test images provided. (10 points) ο· Discuss your parallelization scheme and your speedup plots and make reference to the architecture on which you run your experiments. (10 points)

## B. Finite Element Music Synthesis (30 points)

In a βfinite elementβ model, a complex physical object is modelled as a collection of simple objects (finite elements) that interact with each other and behave according to physical laws. It is possible to simulate electromagnetic fields, calculate stresses in the structure of a building, or synthesize the sounds of a musical instrument using finite elements.

In this part of the lab, you will synthesize drum sounds using a two-dimensional grid of finite elements. The finite element method we will be using is like the ocean simulator examined in the class. In this method, each interior element performs the following update at every iteration: π’(π,π) = π[π’1(π β 1,π) + π’1(π + 1,π) + π’1(π,π β 1) + π’1(π,π + 1) β 4π’1(π,π)] + 2π’1(π,π) β (1 β π)π’2(π,π) 1 + π , πππ 1 β€ π β€ π β 2, 1 β€ π β€ π β 2

where u is the displacement of the element at position (i, j), u1 is the displacement of that element at the previous time step, u2 is the displacement of that element at the previous previous time step, and Ξ· (0.0002) and Ο (0.5) are constants related to the size of the drum, sampling rate, damping, etc. Then, we ensure that the boundary conditions are met by updating the side elements: π’(0, π) βΆ= πΊπ’(1,π) π’(π β 1,π) βΆ= πΊπ’(π β 2,π) π’(π, 0) βΆ= πΊπ’(π, 1) π’(π, π β 1) βΆ= πΊπ’(π, π β 2) 1 β€ π β€ π β 2 and then the corner elements: π’(0, 0) βΆ= πΊπ’(1, 0) π’(π β 1, 0) βΆ= πΊπ’(π β 2, 0) π’(0, π β 1) βΆ= πΊπ’(0, π β 2) π’(π β 1, π β 1) βΆ= πΊπ’(π β 1, π β 2) where G (0.75) is the boundary gain.

Finally, at the end of the iteration, we set u2 equal to u1 and u1 equal to u. 4 | P a g e To simulate a hit on the drum, simply add 1 to u1 at some position (only once before the first iteration). To record the sound of the drum, collect the value of u at some position for each iteration in an array of length N, where T is the number of iterations to perform.

For this lab, you should use (N/2, N/2) as the position on the drum for both the hit and the recording.

1. Write code that implements a 4 by 4 finite element grid sequentially. The user will provide a single command line argument, T, which specifies the number of iterations to run the simulation. Your code should print u(N/2, N/2) (which in this case is u(2, 2) ) to the terminal at each iteration. (5 points)

2. Parallelize your implementation in Part 1 by assigning each node of this grid to a single thread using CUDA. The user will provide a single command line argument, T, which specifies the number of iterations to run the simulation. Your code should print u(N/2, N/2) (which in this case is u(2, 2) ) to the terminal at each iteration. (The output must match the output from Part 1, otherwise please debug). (10 points) The grader will run the following commands: ./grid_4_4

3. Using one thread per finite element can require an enormous amount of communication and GPU overhead, and even in large clusters there may not be enough hardware available to fully parallelize the computation. It is usually better to use a decomposition that assigns multiple elements to each processor. Parallelize your code using such a decomposition (with rows, columns, blocks, etc.) and simulate a 512 by 512 grid.

Try to parallelize with at least 3 different combinations of threads, blocks and finite elements per thread. Ex β 1024 threads/block and 16 blocks can allow each thread to handle 16 finite elements (Total elements = 512×512 = 1024x16x16). (15 points) The grader will run the following commands: ./grid_512_512

Report Deliverables: (20 points) ο· Provide your printed outputs for each part (1, 2 & 3). (5 points) ο· Provide a table comparing the execution times for each combination of threads, blocks and finite elements per thread. (5 points) ο· Discuss and analyze your parallelization schemes and experimental results. (10 points) 5 | P a g e To help debug your code, Figure 2: u for first 3 iterationsFigure 2 and Figure 3 shows the entire u grid (4×4) for the first three iterations: Figure 2: u for first 3 iterations for 4×4 grid Figure 3: u(N/2, N/2) for the first 12 iterations for 512×512 grid Submission Instructions: Each group should submit a single zip file with the filename Group_ _Lab_2.zip. (Ex β Group_09_Lab_2.zip).

The zip file should contain the following:

1. A lab report answering all the questions in the lab (please use a PDF format). (40 points) a. Must be named Group__Lab_2_Report.pdf. (Ex β Group_03_Lab_2_Report.pdf). b. Must have a cover page (Include Group number, membersβ names and ID). c. Must follow the logical order for the lab discussions and provide all the items asked for as mentioned above.

2. Your own source code. Add the whole project solution. (45 points)

3. The output test images (raw .png) for the grader to validate. (15 points) Late Submission Penalty: ο· 5% per day of late submission. ο· Lab will be accepted up to 1 week after the deadline (14% max penalty).