Objectives

The objective of the first part of this programming assignment is to write an MPI program and an OpenMP program for the 1-D Jacobi problem, similar to the stencil-1D code from PA1: where stencil-1D was a 5-point stencil, Jacobi is a 3-point stencil. In MPI you will use a blocked (halo) approach: each of the p MPI processes owns an n/p-sized block of the array(s) and computes that block's result. To get data from neighboring blocks (in neighboring processes), the blocks have variable-sized ghost element buffers, as described in Quinn Section 13.3.5 and the slides from that chapter. At the end, each block's result must be sent back to process 0 using "MPI_Gather" or "MPI_Gatherv" function calls. The timer should start after the data is initialized and stop after the results have been gathered from all processes. You will write the MPI code, jacMPI.c, and analyze its performance on department fish machines, see these machines and grep 325. Then you will write the OpenMP version of the code, jacOMP.c, and compare the performance of the MPI and OpenMP versions of your code on one multi-core fish processor.
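For reference, one sweep of a 3-point Jacobi update looks roughly like the sketch below. This is a minimal sketch only: the exact stencil weights and boundary treatment are whatever the provided jac.c uses; a plain 3-point average with fixed endpoints is assumed here.

    /* One Jacobi sweep from prev into next: a sketch. The actual
     * weights and boundary handling come from the provided jac.c;
     * a 3-point average with fixed endpoints is assumed. */
    void sweep(double *prev, double *next, int n)
    {
        for (int i = 1; i < n - 1; i++)
            next[i] = (prev[i-1] + prev[i] + prev[i+1]) / 3.0;
        next[0]   = prev[0];       /* endpoints held fixed (assumption) */
        next[n-1] = prev[n-1];
    }

Each iteration computes into next and then swaps the roles of prev and next.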

The provided tar file PA6.tar contains the sequential Jacobi code, an initial Makefile, and timer code. jac.c takes 2 or 3 command line parameters: data size, number of iterations, and an optional third parameter that turns on debugging (verbose mode). If the third parameter is provided, the sequential code prints the result array and timing information:

anchovy:~/cs/475/from12/PAs/PA6$ make
gcc -O3 -o jac jac.c timer.o
anchovy:~/cs/475/from12/PAs/PA6$ jac 8 4 0
     
produces
0.000000 1.000000 2.000000 3.000000 4.000000 5.000000 6.000000 7.000000 
Data size : 8  , #iterations : 4 , time : 0.000001 sec
In non-verbose mode, it prints the first, middle, and last result elements and the timing information:
anchovy:~/cs/475/from12/PAs/PA6$ jac 12 6 
first, mid, last: 0.000000 5.000000 11.000000
Data size: 12, #iterations: 6, time : 0.000001 sec

Turn jac.c into an OpenMP program, jacOMP.c, and exercise it on a fish machine for 1 to 6 threads. In non-verbose mode, jacOMP produces the first, middle, and last result elements, and problem and timing information:

anchovy:~/cs/475/from12/PAs/PA6$ export OMP_NUM_THREADS=6
anchovy:~/cs/475/from12/PAs/PA6$ jacOMP 12 6 
first, mid, last: 0.000000 5.000000 11.000000
Data size: 12, #iterations: 6 , time: 0.000191 sec
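The core of jacOMP can be as simple as an OpenMP parallel-for on the sweep loop. Below is a minimal sketch under the same 3-point-average assumption as above; the implicit barrier at the end of the parallel for keeps the sweeps in step before the buffers are swapped. The thread count is taken from OMP_NUM_THREADS.

    /* Sketch of the jacOMP core (3-point average assumed, endpoints
     * fixed). Returns the buffer holding the result after m sweeps. */
    double *jacobi_omp(double *prev, double *next, int n, int m)
    {
        for (int it = 0; it < m; it++) {
            #pragma omp parallel for
            for (int i = 1; i < n - 1; i++)
                next[i] = (prev[i-1] + prev[i] + prev[i+1]) / 3.0;
            next[0]   = prev[0];
            next[n-1] = prev[n-1];
            double *tmp = prev; prev = next; next = tmp;  /* swap buffers */
        }
        return prev;   /* prev now points at the most recent sweep */
    }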

Then work on your MPI program, jacMPI.c. The arguments to your MPI program are:
  arg1: n: problem (array) size
  arg2: m: number of iterations
  arg3: k: buffer size
  optional arg4: vp: 
           if not present: non-verbose, i.e. no debug info
           if present: the id of the verbose process, i.e. the process providing debug info

The vp parameter allows you to debug each process independently. Here is an example use of v (a flag that is true when we are in verbose mode) and vp (the id of the verbose process):

if (v && vp == id) {
    for ( . . . ) {                /* loop over this process's elements */
        printf("%f ", prev[i]);
    }
    printf("\n");
}
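A minimal way to pick up these arguments is sketched below (the names v and vp match the snippet above, and id is this process's rank; atoi needs <stdlib.h>):

    int n  = atoi(argv[1]);           /* problem (array) size      */
    int m  = atoi(argv[2]);           /* number of iterations      */
    int k  = atoi(argv[3]);           /* ghost buffer size         */
    int v  = (argc > 4);              /* verbose iff arg4 present  */
    int vp = v ? atoi(argv[4]) : -1;  /* id of the verbose process */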

Update the Makefile to work for all your codes: jac, jacOMP, and jacMPI.
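A minimal sketch of such a Makefile follows. The gcc -O3 line for jac is taken from the transcript above; -fopenmp for jacOMP and mpicc for jacMPI are the standard GCC and Open MPI choices, and linking timer.o into jacMPI is unnecessary if you time with MPI_Wtime instead.

    # recipe lines must begin with a tab
    all: jac jacOMP jacMPI

    jac: jac.c timer.o
    	gcc -O3 -o jac jac.c timer.o

    jacOMP: jacOMP.c timer.o
    	gcc -O3 -fopenmp -o jacOMP jacOMP.c timer.o

    jacMPI: jacMPI.c
    	mpicc -O3 -o jacMPI jacMPI.c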

One-processor / multicore comparison

Compare jacOMP and jacMPI on one fish machine for 1 to 6 threads / processes, with data size n=120000, number of iterations m=12000, and (for jacMPI) buffer size k=1. Report the results of your comparisons using tables and graphs. What is your conclusion?
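For example, a matched pair of runs at 3 threads / processes would look like this (command forms follow the transcripts above; the host name is just whichever fish machine you use):

anchovy:~/cs/475/from12/PAs/PA6$ export OMP_NUM_THREADS=3
anchovy:~/cs/475/from12/PAs/PA6$ jacOMP 120000 12000
anchovy:~/cs/475/from12/PAs/PA6$ mpirun -np 3 jacMPI 120000 12000 1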

As stated above, at the end of the MPI code each block's result must be sent back to process 0 using "MPI_Gather" or "MPI_Gatherv", with the timer starting after data initialization and stopping after the gather completes. The output of your jacMPI program in non-verbose mode consists of 2 lines: the first prints the first, middle, and last elements of the entire array, with one space between elements, using '%f' to print each element; the second prints the problem size and process 0's execution time. Example:

anchovy:~/cs/475/from12/PAs/PA6$ mpirun -np 2 jacMPI 12 6 1
first, mid, last: 0.000000 5.000000 11.000000
proc 0 complete, time: 0.000064
anchovy:~/cs/475/from12/PAs/PA6$ mpirun -np 3 jacMPI 120 60 1
first, mid, last: 0.000000 59.000000 119.000000
Data size: 120, #iterations: 60, time: 0.000141 
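Since n is divisible by p, the blocks are all the same size and plain MPI_Gather suffices; MPI_Gatherv is only needed if block sizes vary. Below is a sketch of the gather-and-report step. It assumes result is allocated on rank 0 only, block points at this process's n/p owned elements, and id is the rank; the middle index (n-1)/2 matches the sample outputs above.

    double t0 = MPI_Wtime();     /* start: after data initialization */
    /* . . . m Jacobi iterations on this process's block . . . */
    MPI_Gather(block, n/p, MPI_DOUBLE, result, n/p, MPI_DOUBLE,
               0, MPI_COMM_WORLD);
    double t1 = MPI_Wtime();     /* stop: after the gather */
    if (id == 0) {
        printf("first, mid, last: %f %f %f\n",
               result[0], result[(n-1)/2], result[n-1]);
        printf("Data size: %d, #iterations: %d, time: %f\n", n, m, t1 - t0);
    }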

Multi-processor / multicore comparison

For the multiprocessor jacMPI experiment, run a larger problem, n=480000 and m=12000, with 6 cores on one processor (-np 6) and with 12 cores on 2 processors (-np 12). Experiment to find a good (near-optimum) buffer size.
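The point of the buffer size is that with k ghost elements on each side, a process only needs to exchange with its neighbors every k iterations, recomputing a few border elements redundantly in between, so communication happens m/k times instead of m times. Here is a sketch of that exchange, assuming each process stores its size = n/p owned elements in a[k .. k+size-1] with k ghost cells on either side, it is the iteration counter, and left and right are the neighbor ranks (MPI_PROC_NULL at the ends):

    if (it % k == 0) {
        /* send my leftmost k owned elements left; receive right ghosts */
        MPI_Sendrecv(&a[k],        k, MPI_DOUBLE, left,  0,
                     &a[k + size], k, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* send my rightmost k owned elements right; receive left ghosts */
        MPI_Sendrecv(&a[size],     k, MPI_DOUBLE, right, 1,
                     &a[0],        k, MPI_DOUBLE, left,  1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

Larger k means fewer, larger messages but more redundant computation; that tradeoff is what your buffer-size experiment should expose.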

Report the results of your experiments using tables and graphs. What are your conclusions about speedup and buffer size? To run on multiple processors, you need a host file (I call mine hn for n hosts), e.g.:

anchovy:~/cs/475/from12/PAs/PA6$ more h2
tuna  slots=6
wahoo slots=6

anchovy:~/cs/475/from12/PAs/PA6$ mpirun -np 12 --hostfile h2 --mca btl_tcp_if_include eno1 jacMPI 480000 12000 30
first, mid, last: 0.000000 239999.000000 479999.000000
Data size: 480000, #iterations: 12000, time: 1.020789 

Submit your Makefile, jacOMP.c, jacMPI.c, and your report.pdf. We will execute your programs in non-verbose mode. You can assume that n is divisible by p and that k < n/p.