Here, you are to re-use your serial Carry Lookahead Adder and adapt it to use 16 bit blocks
and extend it to work in parallel for a 1M (e.g., 1,048,576) bit CLA adder using up to 16
MPI ranks on the class server, kratos.cs.rpi.edu.
This CLA adder can be constructed from the following. Note, the description has been
updated to consider that the i-th carry at any level is really the carry-out from that bit position
and that i − 1 is the carry-in to the i-th bit, group or section, etc bit position.
Note, to make the CLA program run in serial and parallel using 2, 4, 8 and 16 MPI
ranks, you can think of “slicing” the CLA adder into 8 equal size chunks. So, you will need
to exchange messages at the s
cm level. That is each rank will have only one “bit” of carry
information at that level.
1. Calculate gi and pi
for all 1,048,576 i.
2. Calculate ggj and gpj
for all 65,536 j using gi and pi
3. Calculate sgk and spk for all 4,096 k using ggj and gpj
4. Calculate s
gl and s
for all 256 l using sgk and spk.
5. Calculate s
gm and s
3pm for all 16 m using s
gl and s
6. Calculate s
cm using s
gm and s
3pm for all m and correct s
co. Note, for 16 MPI
ranks, Rank 0 will send carry message updates to rank 1, and rank 1 to rank 2 and so
on up to 16. If you have fewer than 16 ranks, then each rank will perform the necessary
calculations multiple s
cm. For example at 8 ranks, there are 2 s
cm, at 4 ranks there
are 4 s
cm and at 2 ranks there are 8 s
cm and the serial processor case computes all
7. Calculate s
cl using s
gl and s
for all l and replacing each s
cl−1 when l mod 16 = 0
using correct s
cm−1, m = l/16 as super sectional carry-in for all l in that block of 16.
8. Calculate sck using sgk and spk for all k and replacing each sck−1 when k mod 16 = 0
using correct s
cl−1, l = k div 16 as sectional carry-in for all k in that block of 16.
Note, make sure you set sck−1 = s
cl−1 when k mod 16 = 0 and l = k/16.
9. Calculate gcj using ggj
for all j and replacing each gcj−1 when j mod 16 = 0 using
correct sck−1, k = j/16 as sectional carry-in for all j in that block of 16. Note, make
sure you set gcj−1 = sck−1 when j mod 16 = 0 and j = i/16.
10. Calculate ci using gi
for all i and replacing each ci−1 when i mod 16 = 0 using
correct gcj−1. Note, make sure you set ci−1 = gcj−1 when i mod 16 = 0 and j = i/16
for the final sum step below.
11. Calculate sumi using ai ⊕ bi ⊕ ci−1 for all i.
More specifically, you will do the following:
1. Like before represent the 1,048,576 bit input numbers as 262,144 hex digits.
2. Have MPI Rank 0 perform the conversion from hex to binary and reverse the binary
input numbers such that the most significant bit is in the highest position within the
3. Have MPI Rank 0 distribute the input binary arrays to each rank in the correct order
where (for a 16 ranks configuration) MPI rank 1 has bits 16,384 thru 32,767 and MPI
rank 2 has bits 32,768 thru 49,151 and so on. This should be done using MPI Scatter.
4. Like before, write functions for each step in the above algorithm. You can do it using
for loops and do not have to perform the equation substitutions by hand as we did in
5. Between each algorithm step perform an MPI Barrier collective operation to keep all
ranks in step with each other. Make the barrier operation conditional as you will want
to turn it own and off as part of your performance study.
6. Use MPI Isend and MPI Irecv routines to exchange the nearest-neighbor carry bit
information at the s
cm level. Again, Rank 0 will send to rank 1 and rank 1 will
receive from 0 and send to rank 2 and so on. Note that Rank 0 will only send and rank
15 will only receive carry information. Note, all ranks except Rank 0, should
post an MPI Irecv message before the cla calculations start well in advance
of any MPI Isend messages being scheduled.
7. A master “cla” routine will run thru all the steps.
8. You main routine will invoke your input reading function, followed by your master
“cla” function. You can check your answer by having Rank 0 read in the full data set
hex input data (convert it to binary) and like before using the ripple carry adder based
on computing the ci = gi + pi ∗ ci−1 for all i and then computing the final sum, based
on sumi = ai ⊕ bi ⊕ ci−1.
9. Have each rank send there part of the final sumi solution to Rank 0. This should
be done using MPI Gather. Rank 0 will then re-reverse the sum and output the final
result. (Yes, this output result will consume several pages of text).
10. For the performance study, measure the execution of serial and parallel runs using
MPI Wtime which returns a double. Here you’ll have a start time and finish time
and their difference is the execution. Performance testing should be done on the class
11. Next, create the following three graphs: First, plot the execution time of 2, 4, 8 and 16
rank runs as function of their number of ranks. Next, plot of the speedup relative to
the execution time of the serial MPI CLA adder of the 2, 4, and 8 rank runs and finally
plot the speedup of the relative to the execution time of the serial ripple carry adder
to the MPI CLA adder running in parallel on 2 thru 16 ranks. Each graph should
have two plots line – one with and one without the MPI Barrier operations between
each step. Explain why your performance graphs turned out they way they did. This
report must be written using LaTeX, MSWord other document preparation software
and handed in in PDF format using submitty along with your code.
12. Test case will be posted on the website and run using file redirection to standard input.
E.g., the command line that will be used is:
mpirun -np X ./cla < test1.txt
Note, that X is the number of ranks that will be used. We will run our tests on 4 or
8 ranks, depending on the test but again your performance tests should run on 2, 4, 8
and 16 MPI ranks using kratos.cs.rpi.edu.
2 HAND-IN INSTRUCTIONS
Submit both your code and PDF report to submitty.cs.rpi.edu.
3 LATE DAYS
This assignment will be eligible for you to use your “late days”. Each student has up to 3
late days they can use on individual programming assignments for the semester.