## Description

Jacobi’s method is an iterative, numerical method for solving a system of linear equations. Formally, given a full rank n × n matrix A ∈ R

n×n and a vector b ∈ R

n

, Jacobi’s method iteratively

approximates x ∈ R

n

for:

Ax = b

Given the matrix A,

A =

a11 a12 · · · a1n

a21 a22 · · · a2n

.

.

.

.

.

.

.

.

.

.

.

.

an1 an2 · · · ann

we first separate A into its diagonal elements D and the remaining elements R such that A = D+R

with:

D =

a11 0 · · · 0

0 a22 · · · 0

.

.

.

.

.

.

.

.

.

.

.

.

0 0 · · · ann

, R =

0 a12 · · · a1n

a21 0 · · · a2n

.

.

.

.

.

.

.

.

.

.

.

.

an1 an2 · · · 0

Jacobi’s method then follows the following steps till convergence or termination:

1. Initialize x: x ←

h

0 0 . . . 0

i

2. D = diag(A)

3. R = A − D

4. while ||Ax − b|| > l do

(a) update x ← D−1

(b − Rx)

where ||x|| is the L2-norm, and l is a parameter for the termination accuracy.

A matrix A is diagonally dominant, if its diagonal elements are larger than the sum of absolutes

of the corresponding rows, i.e., iff:

|aii| >

X

j6=i

|aij |.

It can be shown that for diagonally dominant matrices, Jacobi’s method is guaranteed to converge.

1

Parallel Algorithm

Data distribution We use a 2-dimensional grid as the communication network for this problem.

We assume that the number of processors is a perfect square: p = q ×q with q =

√p, arranged into

a mesh of size q × q.

The inputs are distributed on the grid as follows:

• The n-by-n matrix A is block distributed onto the grid. The rows of A are distributed onto

the rows of the processor-grid such that either d

n

q

e or b

n

q

c rows of A are distributed onto

each row of the grid. More specifically, the first (n mod q) rows of the grid contain d

n

q

e rows

of A, and the remaining rows of the grid contain b

n

q

c rows of A. The same applies to the

distribution of columns. A processor with coordinates (i, j) thus has the following size local

matrix:

d

n

q

e

2

if i < (n mod q) and j < (n mod q)

d

n

q

e × bn

q

c if i < (n mod q) and j ≥ (n mod q)

b

n

q

c × dn

q

e if i ≥ (n mod q) and j < (n mod q)

b

n

q

c

2

if i ≥ (n mod q) and j ≥ (n mod q)

• The size n vector b and x are equally block distributed only along the first column of the

processor-grid, i.e., only among processors with indexes (i, 0). Processor (i, 0) will thus have

the following local size:

d

n

q

e if i < (n mod q)

b

n

q

c if i ≥ (n mod q)

Parallel Matrix-Vector Multiplication Let A be a n-by-n matrix and let x and y be n

dimensional vectors. We wish to compute y = Ax. Assume that the square matrix A and the

vector x are distributed among the processor grid as explained above. The final answer y will again

be distributed in the same fashion as the vector x was. To do so, we will first “transpose” the

vector x on the grid, such that a processor with index (i, j) will end up with the elements that were

on processor (j, 0) according to the above distribution. We do this by first sending the elements

from a processor (i, 0) to its corresponding diagonal processor (i, i), using a single MPI Send and

the sub-communicator for the row of processors. We then broadcast the received elements from

(i, i) along each column of the grid using a sub-communicator for the column of processors.

Now each processor has the elements it needs for its local matrix vector multiplication. We

multiply the local vector with the local matrix and then use a parallel reduction to sum up the

resulting vectors along the rows of the grid back onto the processors of the first column. The final

result of the multiplication thus ends up distributed among the first column in the same way that

the input x was.

Parallel Jacobi For parallelizing Jacobi’s method, we distribute A and R as described above and

use parallel matrix-vector multiplication for calculating Rx. The vectors x, and b, and the diagonal

elements D of A are distributed among the first column of the processor grid. We can thus update

x by xi ← 1

di

(bi − (Rx)i) using only local operations.

In order to detect termination, we calculate Ax using parallel matrix vector multiplication, and

then subtract b locally on the first column of processors (where the result of the multiplication and

b are located). Next, we calculate the L2-norm ||Ax − b|| by first calculating the sum of squares

locally and then performing a parallel reduction along the first column of processors. We can now

determine whether to terminate or not by checking the L2-norm against the termination criteria and

then broadcasting the result along rows. Now every processor knows whether or not to continue

with further iterations. In your implementation, you should not only check for the termination

criteria, but also terminate after a pre-set maximum number of iterations.

Assignment

In this assignment, you will implement Jacobi’s method using MPI. A framework is provided

with this programming assignment, which declares all functions that you should implement. It

furthermore provides a testing framework, which you can use to test your code for correctness. See

the README file in the framework for more details regarding the framework.

The framework is available on Georgia Tech’s Github installation at: https://github.gatech.

edu/pflick3/cse6220-prog2.

First, you should implement matrix-vector multiplication and Jacobi’s method sequentially.

You will later use this as the comparison for measuring speedup and as reference implementation

for testing your parallel code.

Implement the sequential code in the file jacobi.cpp according to the function declarations in

jacobi.h. Unit Tests for the sequential code are implemented in seq tests.cpp. Add your own

test cases in this file. To compile and run the tests, run make test either locally on your own

machine, or in an interactive job. DO NOT run the tests on the login node of the cluster.

Next, you should implement the parallel algorithm as described above. For this, you’ll have

to implement functions to distribute the data among the grid of processors and gather results

back to the processor with rank (0, 0). Then, implement the parallel algorithms for matrix-vector

multiplication and Jacobi’s method. Implement your parallel code in the file mpi jacobi.cpp

according to the function declarations in mpi jacobi.h. A few test cases are already provided

in the file mpi tests.cpp. You should add your own test cases in this file as well. Some utility

functions for block distributions are provided in utils.h and utils.cpp. You can make use of

these function, and implement your own utility functions in these files.

Finally, test the performance of your parallel implementation against your sequential implementation for various input sizes. Use the executable jacobi for this and generate the input

on-the-fly via the -n and -d parameters. Report your achieved speedups for different input sizes

and difficulties. Increased difficulty increases the number of iterations needed for Jacobi’s Method

to converge.

Submission guidelines

The assignment is due on March 31st, 2015 on T-Square. A framework is provided along with

the assignment for you to work with. Please refer to the README file provided in the framework

for information on how to use it. You are expected to work in teams of two. One member from

each team should submit a zip/tar file containing the following

1. A text file containing the names of all team members and their contribution in terms of

percentage of work done.

2. All source files. Your program should be well commented and easy to read.

3. A report in PDF format containing the following:

• Short design description of your algorithms.

• Runtime and speedup plots by varying problem size. Give observations on how your

algorithm behaves on varying these parameters.