ECE786 Programming Assignment 3 Part B

$30.00

Category: You will Instantly receive a download link for .zip solution file upon Payment

Description

5/5 - (1 vote)

1 Objective

Optimize the quantum circuit simulation process of applying six single-qubit gates to six different
qubits in an n-qubit quantum circuit.

3 Description

In PA1, we have written the Cuda code for the quantum circuit simulation of applying a
single-qubit quantum gate to one qubit in an n-qubit quantum circuit. In PA3, we want to
write the Cuda code to do the quantum circuit simulation of applying six single-qubit
quantum gates to six different qubits in an n-qubit quantum circuit. And we have provided
the format of the input file and the expected output as shown in the next section.

Task1. Write the code to read in the input file and generate the expected output result
of applying six single-qubit quantum gates to six different qubits in an n-qubit
quantum circuit.

You can divide the simulation process of PA3 into six steps and apply one single-qubit gate
at one step, and in this way, each step will be the same as the simulation process of PA1.

Task2. Use shared memory to optimize the above code you have written.
This optimization method is called ShareMem Method and comes from
https://dl.acm.org/doi/pdf/10.1145/3447818.3460357. You can read section 3.1 from this
paper to learn more about this ShareMem Method. Be careful that in this paper, the least
significant bit is on the left most of the index.

However, in PA1’s simulation example, the least
significant bit is on the rightmost of the index. And for the simulation process of PA3, we will
keep the assumption that the least significant bit is on the rightmost of the index.

The basic idea of this ShareMem Method is can be divided into 3 steps as shown below:
● For an n-qubit quantum circuit that has an initial quantum state of 2^n values, split the
2^n values into several independent fragments of size 2^6, then map each fragment to
one GPU thread block. For each thread block, load the 2^6 values of the fragment from
global memory to shared memory. (The reason for using 2^6 as the fragment size is
because there are six single-qubit gates applied to six different qubits.)

● Apply the six single-qubit gates on each of the fragments and get the results. Similar to
applying 6 gates on a 6-qubit circuit, you can split this process into 6 steps: applying the
first gate to qubit 0, applying the second gate to qubit 1, applying the third gate to qubit
2, applying the fourth gate to qubit 3, applying the fifth gate to qubit 4, and applying the
sixth gate to qubit 5.

And for each of the steps, the computation is the same as in PA1,
and here since we have 2^6 values that need to be calculated in each thread block,
there will be 2^5 threads in each thread block, one thread will do one matrix
multiplication as shown in the figure below.

● After finishing the above computation and getting results, the 2^6 values of each
fragment in each of the thread blocks are stored back in global memory.

Task3. (Optional, bonus points) Thread coarsening optimization
Use thread coarsening optimization to optimize the code in task2.

4 Input format and expected output

1. The submitted code will be tested using a randomly generated input file. The input format
is very similar to the format of PA1. After you download the example of the input file from
moodle, you can see that in the input file, there are three parts as shown below:

● There are a total of 6 matrices representing the 6 gates and they are separated by a
blank line. The first 2×2 matrix represents the first single-qubit quantum gate, and
second 2×2 matrix represents the second single-qubit quantum gate, the third 2×2 matrix
represents the third single-qubit quantum gate, and so on. Each matrix row is in a
separate line (i.e. ends with a linefeed “\n”), and matrix elements in the same row are
separated by a single space.

● Below the 6 matrices, there is a 128×1 vector that represents a 7-qubit (from qubit 0 to
qubit 6) quantum state. This vector size may change based on the circuit size, but you
can assume that the circuit size will not be larger than 30 qubits, so this vector size will
not be larger than 2^30.

● The last six numbers 0, 2, 3, 4, 5, 6 in the last six lines represent that the first gate is
applied to qubit 0, the second gate is applied to qubit 2, the third gate is applied to qubit
3, the fourth gate is applied to qubit 4, the fifth gate is applied to qubit 5, and the sixth
gate is applied to qubit 6. And these six numbers may change which means we are
applying the six gates to different qubits of an n-qubit quantum circuit.

2. The expected output format is the same as PA1. You should output a vector of length N =
2^n, and n equals the size of the quantum circuit.

3. For grading, your program will be compiled and run with commands like “./quamsim
./inputfile_name.txt”

5 Submission Requirements
1. Submit the code for the task1 and the code for task2. And submit the code for task3 if you
have done task3. Your code should work on both the hydra cluster and the GPGPU-SIM
simulator.

2. In your report, for task1, briefly explain your CUDA kernel functions; for task2, explain how
you optimize the code using the ShareMem method; and explain how you do the thread
coarsening optimization for task3 if you have done the task3.

3. Use GPGPU-SIM to count the number of global memory accesses of task1 and task2, report
the count number of task1 and task2 and analyze their differences in your report.
4. zip the code, Makefile, and the report as unity id.zip.

6 Grading
1. Task1: 35 points, 30 points for the code to work correctly, and 5 points for the discussion in
your report.
2. Task2: 65 points, 50 points for the code to work correctly, and 15 points for the discussion in
your report.
3. Task3: 15 points, 10 points for the code to work correctly, and 5 points for the discussion in
your report.