Assignment 2 - Prefix Computations in OpenMP

CS 475: Parallel Programming <br> GRAD 510: Fundamentals of HPC

CS 475: Parallel Programming
GRAD 510: Fundamentals of HPC
Computer Science Department
Fall 2019
Assignment 2 - Prefix Computations in OpenMP

Objectives (Deadine extended)

The objective of this assignment is to write a sequence of sequential and parallel programs to compute the prefix-op or the scan of a sequence of inputs. You will write the prorams measure their performance on the ski-resort machines in the CSB-225 lab, and determine the gains you get with up to 16 threads, and validate analytical predictions.

Download and untar PA2.tar. (updated Thursday Sept 19 at 8:00 pm, in calse you want the earlier version, it's here). You will need to update your makefile for each parallel program you write. Do not set the number of threads in the program, set it at the Linux OS level (using the export or setenv command). Otherwise we cannot run your program for different numbers of threads. Write your report in pdf format, preferably using LaTeX. The provided code can generate many different executables, such as MMScanSEQ, and MMScanSEQ.verb from the sigle source file MMScan-wrapper.c using conditional complation--the Makefile has rules that compile it with different flags, and the file has code wrapped in #ifdef-#endif macros. We expect you to adopt a similar strategy for this assignment.

You will write the following programs (the name in bold is the name of the executable that should be produced).

[MMScanDNC] Your first task is to write a (sequential) divide and conquer function (its signature and a dummy implementation has been pfovided for you) that exposes all available parallelism in the scan by exploiting associativity. This program is sequential, and should exhibit $n \lg n$ complexity.
[MMScanDNCP1] This is the first parallelization of MMScanDNC. It should use only task parallelism, and should include a strategy to throttle parallelism at a predetermined number of tasks, rather than for a predetermined size of task. Your code should take the value of p as an additional command-line parameter, and you should experiment and report for the optimal values of this parameter
[MMScanDNCP2] This is the second parallelization of MMScanDNC, and should only parallelize the for loop that updates the second half of the array after the recursive children have returned. It should also include a strategy to throttle parallelism appropriate for for parallelization, and you should tune the schedule and any other clauses you deem necessary to get the best performance. Your submitted code should have these parameters set to their optimal values.
[MMScanDNCP3] Now experiment with the hybrid of both parallelization, experiment with the one that gives the best performance and submit that (with the tunable parameters set to their optimal values in the code). As before, the total work should remain $n \lg n$ .
[MMScanEFF] This is the efficient (sequential) program that has $\Theta(n \lg p)$ complexity, where the DnC recursion stops at the level with p nodes. Rather, the algorithm executes a simple, sequential $\Theta(n/p)$ loop to compute the (partial) answer.
[MMScanEFFPAR] This is your best parallelization of the above MMScanEFF. Make sure that any tunable parameters are set to their optimal values when you submit the code, and the work remains $\Theta(n \lg p)$ Again, the value of p should be an additional commandline parameter.
[MMScanOPT] This is the optimal algorithm wthat only does a constant factor extra work (explained in class on Thursday, Sept 19). Make sure that work now becomes $\Theta(n)$ . Please tune the program to provide the best performance.
[MMScanOPTEC] The extra credit version, with p+1 chunks on p threads. This should provide linear speeedup, beating the sequential version on any number of threads.

Grading Rubric

Report: 10 points
Functional correctness of DNC, (and its three parallel versions). The testing of the parallel versions will first make sure that the results produced are correct, and second, verufy that the parallelization is as per the specifications. Points: 25+5+5+5=40.
EFF (and its parallelization). Correctness points: 15.
OPT (only parallel version). Correctness points: 21.
Performance. We will test (even for the sequential ones) that the performance is close to our solutions, and the parallelization yields the expected speedups. Performance points: 2*7=14.

Submission

Submit a PA2.tar file through Checkin. More details coming soon. Good luck and have fun!