Introduction

The objective of this assignment is to review background material and ensure that you have the prerequisites necessary to succeed in this class: prior experience in parallel programming (especially in CUDA) and in complexity analysis of algorithms. The assignment has four parts: three are programming, and one is a paper-and-pencil analysis. You have three weeks for the assignment, but the individual pieces are due over the next three Wednesdays. Also note that the add-drop deadline falls before all of HW0 is done.

Work in a stepwise fashion for debugging purposes. In 2014, the department set up 50 new machines (named after US state capitals), and that is the platform we will use for the OpenMP programming assignments. When studying performance, run your codes on a quiet machine.

  • HW0A: Sequential and OpenMP DPKP [DUE WEDNESDAY 28 JANUARY 2015 @ 23:55]
    Start from the code skeleton knap.c provided in KnapSeq.tar to create a sequential knapsack solver.

    Inputs are in knnxcc.txt files, where nn is N, the number of objects, and cc is the approximate total capacity; e.g., k120x4M.txt is the input file for 120 objects and a total capacity of ~4M.

    The file format is:

       //  first line:   # of objects, knapsack capacity
       //  next lines:   weight and profit of one object (1 object per line)
     

    The input file k50x25K.txt has the solution given in k50x25K.res.

    You can also use the simpler case, k10x25.txt (provided in KnapSeq.tar), to debug your codes. knapSeq k10x25.txt should produce:

    platte 70 # knapSeq k10x25.txt 
    The number of objects is 10, and the capacity is 25.
    The optimal profit is 874.
    Time taken: 0.000003
    Solution vector is: 
    0 1 0 0 0 1 1 1 0 1 
    

    Using OpenMP, parallelize the loop that goes through all capacities for one object, as well as other easily parallelizable loops (e.g., initializations).
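    Since row i of the table reads only row i-1, the iterations of the capacity loop are independent. The following minimal sketch (our own function and table layout, not the provided knap.c skeleton) shows the kind of annotation intended:

```c
#include <stdlib.h>
#include <assert.h>

/* Illustrative sketch, not the knap.c skeleton: T[i][c] holds the best
   profit using the first i objects with capacity c.  Row i depends only
   on row i-1, so the capacity loop can be parallelized directly. */
long knap_table(long n, long cap, const long *weight, const long *profit)
{
    long **T = malloc((n + 1) * sizeof *T);
    for (long i = 0; i <= n; i++)
        T[i] = calloc(cap + 1, sizeof **T);   /* row 0 starts at all zeros */

    for (long i = 1; i <= n; i++) {
        /* The loop to parallelize: its iterations are independent. */
        #pragma omp parallel for schedule(static)
        for (long c = 0; c <= cap; c++) {
            long without = T[i - 1][c];
            long with = (weight[i - 1] <= c)
                      ? T[i - 1][c - weight[i - 1]] + profit[i - 1]
                      : without;
            T[i][c] = (with > without) ? with : without;
        }
    }
    long best = T[n][cap];
    for (long i = 0; i <= n; i++) free(T[i]);
    free(T);
    return best;
}
```

    Compile with -fopenmp (gcc) for the pragma to take effect; without it, the same code runs serially.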

    Record times for the part of your code that computes the table, and speedups for 1 through 20 threads (pick a few points, e.g., 1, 2, 4, 8, 12, 16, and 20 threads). Report performance results for the input data files k120x4M.txt, k240x2M.txt, and k500x1M.txt.

    Write down your observations:

    • How does the performance of parallel code compare to the sequential?
    • At what number of threads does the speedup stop increasing?
    • What is your maximum speedup?
    • Your code allocates memory of asymptotic size N*C, so there is a limit to how large a problem you can solve -- the table must fit in memory, otherwise your malloc will fail. For N=2500, what is the largest capacity that your program can handle?
    • Run your program with input file k1200x38M.txt. You should get some error. State your observations.

    Submission Instructions: Submit only your C code files, not the Makefile/kxx.txt files. To ensure that our testing script works, make sure the C files are named knapSeq.c and knapPar.c. Submission file name: HW0A.tar
    Submit your report as a file named R0A.pdf through the Checkin system.

  • HW0B: Memory-efficient DPKP [DUE WEDNESDAY 4 FEBRUARY 2015 @ 23:55]
    The main limitation of this program is that it is memory inefficient: it maintains a table of size O(nc) (which is also the time complexity of the algorithm). This has two drawbacks. First, large problem sizes (e.g., 10,000 objects with a knapsack capacity of 10^9) simply cannot be handled. Furthermore, the number of distinct memory locations touched by the program is asymptotically the same as the number of operations the program performs, which means that performance is limited by memory bandwidth, not compute speed: each memory location is written exactly once and read exactly twice during the execution of the program, so there is no data reuse. As a result, the performance of your code in PA3 of CS475 left you unhappy.

    In a few weeks you are going to see how to overcome this problem, but for the purpose of this first assignment, we are going to brush many things under the carpet. We will completely ignore the backtracking problem and just focus on how to compute the DP table in a memory efficient manner. If necessary, we will recompute any entries that are needed.

    The first observation is that, in order to compute any row (each row corresponds to a new object) in the DP table, only the entries in the previous row are needed. So we can allocate memory for only two rows (total memory is now O(c)). Basically, we will be throwing away (overwriting) all but the two most recent rows of the table.
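    The two-row idea can be sketched as follows (a minimal illustration with our own names, not the required knapMEseq structure): keep only a "prev" and a "curr" row, and swap the two pointers after each object instead of copying.

```c
#include <stdlib.h>
#include <assert.h>

/* Illustrative two-row variant: memory drops from O(n*c) to O(c) because
   only rows i-1 ("prev") and i ("curr") of the DP table are retained. */
long knap_tworow(long n, long cap, const long *weight, const long *profit)
{
    long *prev = calloc(cap + 1, sizeof *prev);     /* row for 0 objects */
    long *curr = malloc((cap + 1) * sizeof *curr);

    for (long i = 0; i < n; i++) {
        for (long c = 0; c <= cap; c++) {           /* parallelizable as in HW0A */
            long with = (weight[i] <= c) ? prev[c - weight[i]] + profit[i]
                                         : prev[c];
            curr[c] = (with > prev[c]) ? with : prev[c];
        }
        long *tmp = prev; prev = curr; curr = tmp;  /* swap, don't copy */
    }
    long best = prev[cap];   /* after the final swap, "prev" is the last row */
    free(prev); free(curr);
    return best;
}
```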

    Write a program knapMEseq that does this, parallelize this in OpenMP just like HW0A (call it knapMEpar) and compare the execution time and quality of parallelization with that of the full table version. Report your observations:

    • How does the performance of parallel code compare to the sequential?
    • At what number of threads does the speedup stop increasing?
    • What is your maximum speedup?
    • Your code now allocates memory of asymptotic size only O(C) rather than N*C. For N=2500, what is the largest capacity that your program can handle now?

    A sample grading script (script) and its sample output (output) are provided.

    Submission Instructions: Files: knapMEseq.c and knapMEpar.c Submission file name: HW0B.tar Report: R0B.pdf

  • HW0C: Complexity analysis [DUE WEDNESDAY 4 FEBRUARY 2015 @ 23:55]
    Now we consider the backtracking part. Since we have only retained the last two rows of the table, we are only able to do the first step of the backtracking, i.e., we are only able to decide whether the last object contributes to the optimal solution. An obvious way to compute the entire solution using the program that you have already written is to then go back and recompute (again in the same, memory-efficient manner) the remainder of the solution. Write the recurrence equation that describes the execution time of this approach, solve it (you may use any standard technique for solving recurrences), and discuss the asymptotic complexity of the algorithm.

    Our ultimate goal is to retain the same asymptotic complexity O(nc) that the original, memory inefficient algorithm had. What should be the nature of the recurrence equation for this to happen?

    Submission file name: R0C.pdf

  • HW0D: CUDA [DUE WEDNESDAY 11 FEBRUARY 2015 @ 23:55]
  • Write a CUDA program knapMEcuda that implements the memory-efficient DP algorithm on the GPU machines in the department. Test your program on the fruit machines in the department (final performance data on apples, oranges, bananas, and coconuts; use the other fruit machines for development). A sample Makefile is provided. Your Makefile must have a "clean" command like the one provided. Create a tar file using "make tar" and submit the generated HW0D.tar file.

    Submit your report as a single PDF file of no more than 3 pages (we are setting up another submission under Checkin). Most likely one of us will read the reports and the other will evaluate the code, so make sure that your report is self-contained.

    For this program, we are not really interested in speedup as a function of the number of processors (threads, thread blocks, whatever). We just want you to get the best performance you can achieve on "large enough" inputs. In particular, we will be testing with n in the range 50-500 and c going to the limit of the memory on the GPU. For the purposes of the report, there are two metrics of performance:
    1. The number of operations per second. Since the knapsack code does not do any floating-point ops, we will use the number of iterations per second. This is simply the product of n and c divided by the execution time (reported in GITERs/sec rather than GFLOPs/sec).
    2. The bandwidth achieved by your program. Count the total number of data transfers (from global memory) made by your program. Note that we can't give you this number, because it may depend on how exactly you write the program (but it will be some function of n and c). Divide the total number of transfers by the execution time, and report that number (in MBytes/sec).
    Both of these metrics are to be compared to the theoretical peak of the machine. For compute performance, the Tesla 1060 GPU is capable of delivering about 640 GFLOPs/sec, but this uses a fused multiply-add unit that does two operations per clock cycle; for integer operations, therefore, take the machine peak as 320 GOPs/sec. For the bandwidth metric, the reported machine peak is about 100 GB/sec (we have not been able to attain this on the fruit machines, but we will know if and when your code does a good job). Don't expect to get 100% of peak on either metric.
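    As a concrete sketch of how the two metrics can be computed from your measured run time (the per-iteration transfer count in the example below is only a hypothetical accounting; yours depends on how you write your kernel):

```c
#include <assert.h>

/* Iterations per second, in units of 1e9 (GITERs/sec): n*c iterations total. */
double giters_per_sec(double n, double c, double secs)
{
    return n * c / secs / 1e9;
}

/* Achieved bandwidth in GB/sec, given an assumed number of bytes moved
   to/from global memory per iteration (this count is program-dependent). */
double gbytes_per_sec(double n, double c, double secs, double bytes_per_iter)
{
    return n * c * bytes_per_iter / secs / 1e9;
}
```

    For example, if k240x2M.txt (n = 240, c = 2e6) were to run in 0.2 s, that would be 2.4 GITERs/sec; under a hypothetical accounting of two 8-byte reads and one 8-byte write per iteration (24 bytes), it would be 57.6 GB/sec.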

    Report your observations:

    • For the provided test file k240x2M.txt what was the performance of your code on each of the two metrics?
    • What fraction of the machine peak in operations per second did your code achieve? To convert iterations to operations: if you wrote your code cleanly, the loop body should be doing two compare-add operations.
    • Based on the above two questions, was the code memory-bound or compute-bound? Why?
    • What was the best performance you achieved and for what values of n and c was this?
    • Was your code optimized so that memory accesses were coalesced? If so, describe in your own words (to someone who does not want to read your code) how you achieved this optimization, and report the loss in performance without this optimization. If you didn't do this optimization, explain why not.

Expected output for ./knapMEcuda k10.txt 1:

The number of objects is 10, and the capacity is 25.
The optimal profit is 874.
Time taken: 0.090000.
The last row of the solution is:
          0          0        200        200        200        390        400        400        400        590
        590        590        590        605        605        619        749        749        749        749
        764        764        778        849        874        874

Submission Instructions: Files: knapMEcuda.cu knapMEkernel.cu knapMEkernel.h Makefile (Irrespective of your Makefile modifications, after compiling, the executable's name should be knapMEcuda.) Submission file name: HW0D.tar Report: R0D.pdf