Description
Questions
1. (18 marks) Chapter 7, Memory access, Anderson Lock
The following benchmark is designed to measure the average time for reading array A
containing L elements from memory, when the array elements are s words apart:
    for stride s from 4 Bytes (1 word) to L/2, doubling each time (s = 1, 2, 4, …, L/2)
        time the following loop (repeat many times and average)
            for i from 0 to L-1
                load A[i*(s+1)] from memory (4 Bytes)
So the benchmark reads L words from memory into the cache, where L is a constant; however,
these L words are not at consecutive memory locations (they are s words apart).
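For concreteness, a minimal Java rendering of this benchmark could look as follows; the array length L, the repeat count, the volatile sink and the nanoTime-based timing are illustrative assumptions rather than part of the assignment, and the index i*(s+1) is taken directly from the pseudocode above.

    // Minimal Java sketch of the stride benchmark described above.
    public final class StrideBenchmark {

        static final int L = 4096;          // elements read per pass (assumed value)
        static final int REPEATS = 1000;    // passes averaged per stride (assumed value)

        static volatile int sink;           // volatile sink so the JIT cannot drop the loads

        public static void main(String[] args) {
            // Backing array sized for the largest index used, (L-1)*(s+1) with s = L/2.
            int[] a = new int[(L - 1) * (L / 2 + 1) + 1];

            for (int s = 1; s <= L / 2; s *= 2) {              // stride doubles: 1, 2, 4, ..., L/2
                long start = System.nanoTime();
                for (int r = 0; r < REPEATS; r++) {
                    for (int i = 0; i < L; i++) {
                        sink = a[i * (s + 1)];                 // one 4-byte load, as in the pseudocode
                    }
                }
                long perAccessNs = (System.nanoTime() - start) / ((long) REPEATS * L);
                System.out.println("stride " + s + ": ~" + perAccessNs + " ns per access");
            }
        }
    }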
The results obtained from the benchmark are shown in Fig. 1:
[Figure 1. Average time for reading each array element]
Assuming only one level of cache exists, the cache line size is 4 words, and a single
processor is running the benchmark, answer the following questions:
1.1 In Fig. 1, when the total number of elements of the array, L, is smaller than L’, the
average time per access remains constant. What does the value of L’ represent? What
does the time t0 show?
1.2 When L is larger than L’, what does time t1 indicate in Fig. 1?
1.3 For each part of the graph (parts 1, 2 and 3), justify its behaviour.
We know that a phenomenon called false sharing can cause unnecessary invalidations in the
ALock (Anderson Lock), and one way to avoid false sharing is to pad the array elements so
that distinct elements map to distinct cache lines.
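As an illustration, the padding technique might be sketched in Java as below; the class and field names are illustrative rather than the textbook's exact ALock code, and the padding of 4 ints simply matches the 4-word (16-byte) cache line assumed in this question.

    import java.util.concurrent.atomic.AtomicInteger;
    import java.util.concurrent.atomic.AtomicIntegerArray;

    // Sketch of an Anderson queue lock whose flag slots are padded so that
    // distinct slots fall on distinct cache lines (here: 4-word, i.e. 16-byte, lines).
    public class PaddedALock {
        private static final int PADDING = 4;                    // 4 ints = 16 bytes = one assumed cache line
        private final AtomicInteger tail = new AtomicInteger(0);
        private final ThreadLocal<Integer> mySlot = ThreadLocal.withInitial(() -> 0);
        private final AtomicIntegerArray flag;                   // 1 = "your turn"; only every PADDING-th index is used
        private final int capacity;

        public PaddedALock(int capacity) {
            this.capacity = capacity;
            flag = new AtomicIntegerArray(capacity * PADDING);   // one padded slot per waiting thread
            flag.set(0, 1);                                      // slot 0 starts as the lock holder
        }

        public void lock() {
            int slot = tail.getAndIncrement() % capacity;
            mySlot.set(slot);
            while (flag.get(slot * PADDING) == 0) { }            // spin on a line no other waiter spins on
        }

        public void unlock() {
            int slot = mySlot.get();
            flag.set(slot * PADDING, 0);                         // clear my slot
            flag.set(((slot + 1) % capacity) * PADDING, 1);      // hand the lock to the next slot
        }
    }

Note that the flag array is now PADDING times larger than the number of slots, so the waiting flags are spread over many more cache lines; question 1.4 asks you to relate this to the behaviour shown in Fig. 1.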
1.4 Considering the results from Fig. 1, how could the padding technique used in ALock
degrade the overall performance of the lock?
2. (18 marks) Chapter 9, examining the fine-grained algorithm
2.1. Provide the code for the contains() method missing from the fine-grained list algorithm
described in Chapter 9.
2.2. Write a test to verify the contains() method and explain why your implementation is
correct.
3. (12 marks) Chapter 10, designing a bounded lock-based queue
3.1. Design a bounded lock-based queue implementation using an array instead of a linked
list. Allow parallelism by using two separate locks for the head and the tail. (A sketch of one
possible structure follows question 3.2.)
3.2. Try to transform your algorithm to be lock-free. Where do you run into difficulty?
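For orientation, one possible shape for the queue of 3.1 is sketched below in Java, loosely following the two-lock pattern of Chapter 10's bounded queue but backed by a circular array; the names (enq, deq, notFull, notEmpty) and the use of an AtomicInteger size counter are illustrative assumptions.

    import java.util.concurrent.atomic.AtomicInteger;
    import java.util.concurrent.locks.Condition;
    import java.util.concurrent.locks.ReentrantLock;

    // Bounded queue over a circular array, with one lock for enqueuers and one for dequeuers.
    public class BoundedArrayQueue<T> {
        private final T[] items;
        private int head = 0, tail = 0;                      // head guarded by deqLock, tail by enqLock
        private final AtomicInteger size = new AtomicInteger(0);

        private final ReentrantLock enqLock = new ReentrantLock();
        private final Condition notFull = enqLock.newCondition();
        private final ReentrantLock deqLock = new ReentrantLock();
        private final Condition notEmpty = deqLock.newCondition();

        @SuppressWarnings("unchecked")
        public BoundedArrayQueue(int capacity) {
            items = (T[]) new Object[capacity];
        }

        public void enq(T x) throws InterruptedException {
            boolean mustWakeDequeuers = false;
            enqLock.lock();
            try {
                while (size.get() == items.length) {
                    notFull.await();                         // queue full: wait for a dequeuer
                }
                items[tail] = x;
                tail = (tail + 1) % items.length;
                if (size.getAndIncrement() == 0) {
                    mustWakeDequeuers = true;                // queue was empty: dequeuers may be waiting
                }
            } finally {
                enqLock.unlock();
            }
            if (mustWakeDequeuers) {
                deqLock.lock();
                try { notEmpty.signalAll(); } finally { deqLock.unlock(); }
            }
        }

        public T deq() throws InterruptedException {
            T result;
            boolean mustWakeEnqueuers = false;
            deqLock.lock();
            try {
                while (size.get() == 0) {
                    notEmpty.await();                        // queue empty: wait for an enqueuer
                }
                result = items[head];
                items[head] = null;                          // help the garbage collector
                head = (head + 1) % items.length;
                if (size.getAndDecrement() == items.length) {
                    mustWakeEnqueuers = true;                // queue was full: enqueuers may be waiting
                }
            } finally {
                deqLock.unlock();
            }
            if (mustWakeEnqueuers) {
                enqLock.lock();
                try { notFull.signalAll(); } finally { enqLock.unlock(); }
            }
            return result;
        }
    }

The shared AtomicInteger size is what lets an enqueuer and a dequeuer proceed concurrently under their separate locks while still detecting the full and empty cases.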
4. (32 marks) Chapter 16, matrix-vector multiplication
4.1. Implement a sequential matrix-vector multiplication.
4.2. Give an efficient and highly parallel multithreaded algorithm for multiplying an n × n
matrix A by a length-n vector x that achieves work Θ(n²) and critical path Θ(log n).
[Hint: if you use a cached thread pool with the Future interface, you can achieve the Θ(log n)
critical path by following the algorithm for matrix multiplication in Chapter 16. Otherwise,
depending on how you divide the problem, you may obtain a critical path different from
Θ(log n). This is acceptable; just be sure to show your work and justify the critical path of
your implementation in your report. A skeleton illustrating one possible split is sketched
after question 4.4.]
4.3. Write a test program that measures the execution time for multiplying a 2,000 × 2,000
matrix by a corresponding 2,000-element vector using the parallel method and the sequential
method, respectively. Discuss the execution times of the two methods and compute the
speedup. Be sure to indicate how many threads are being used.
4.4. Analyze and discuss the work and critical-path length of your implementation, and give
the parallelism.
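For reference, the skeleton below uses a cached thread pool with Futures, as suggested in the hint for 4.2, and includes a small timing harness in the spirit of 4.3; the class name, the leaf cutoff, the test matrix and the timing code are illustrative assumptions. Because each leaf still computes whole rows sequentially, this particular split has work Θ(n²) but a critical path of Θ(n) rather than Θ(log n); reaching Θ(log n) additionally requires splitting the inner dot product into a parallel reduction, as in the Chapter 16 matrix-multiplication algorithm.

    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class MatVec {

        private static final ExecutorService POOL = Executors.newCachedThreadPool();
        private static final int CUTOFF = 64;                // rows handled sequentially at a leaf (assumed value)

        // 4.1: sequential reference implementation, y = A * x.
        public static double[] multiplySeq(double[][] a, double[] x) {
            int n = x.length;
            double[] y = new double[n];
            for (int i = 0; i < n; i++) {
                double sum = 0.0;
                for (int j = 0; j < n; j++) {
                    sum += a[i][j] * x[j];
                }
                y[i] = sum;
            }
            return y;
        }

        // 4.2: recursively split the row range in half; the lower half becomes a Future.
        public static double[] multiplyPar(double[][] a, double[] x) throws Exception {
            double[] y = new double[x.length];
            multiplyRange(a, x, y, 0, x.length);
            return y;
        }

        private static void multiplyRange(double[][] a, double[] x, double[] y,
                                          int lo, int hi) throws Exception {
            if (hi - lo <= CUTOFF) {
                for (int i = lo; i < hi; i++) {              // leaf: rows computed sequentially
                    double sum = 0.0;
                    for (int j = 0; j < x.length; j++) {
                        sum += a[i][j] * x[j];
                    }
                    y[i] = sum;
                }
                return;
            }
            int mid = lo + (hi - lo) / 2;
            Future<?> lower = POOL.submit((Callable<Void>) () -> {
                multiplyRange(a, x, y, lo, mid);             // spawn the lower half
                return null;
            });
            multiplyRange(a, x, y, mid, hi);                 // do the upper half ourselves
            lower.get();                                     // join before returning
        }

        // 4.3: a small timing harness for the 2,000 x 2,000 case.
        public static void main(String[] args) throws Exception {
            int n = 2000;
            double[][] a = new double[n][n];
            double[] x = new double[n];
            for (int i = 0; i < n; i++) {
                x[i] = 1.0;
                for (int j = 0; j < n; j++) {
                    a[i][j] = (i == j) ? 2.0 : 1.0;
                }
            }
            long t0 = System.nanoTime();
            double[] ySeq = multiplySeq(a, x);
            long t1 = System.nanoTime();
            double[] yPar = multiplyPar(a, x);
            long t2 = System.nanoTime();
            System.out.printf("sequential %.1f ms, parallel %.1f ms, y[0] = %.1f / %.1f%n",
                    (t1 - t0) / 1e6, (t2 - t1) / 1e6, ySeq[0], yPar[0]);
            POOL.shutdown();
        }
    }

With CUTOFF = 64 and n = 2,000 the recursion submits roughly n/CUTOFF Futures, so the cached thread pool stays small; that thread count is what question 4.3 asks you to report.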
Total: 80 marks