Introduction

The objective of this assignment is to review background material and ensure that you have the prerequisites necessary to succeed in this class: prior experience in parallel programming (especially in CUDA) and in complexity analysis of algorithms. The assignment has four parts: three are programming, and one is a paper-and-pencil analysis. You have three weeks for the assignment, but the individual pieces are due over the next three Wednesdays. Also note that the add-drop deadline falls before all of HW0 is done.

Work in a stepwise fashion for debugging purposes. In 2014, the department set up 50 new machines (named after US state capitals), and that is the platform we will use for the OpenMP programming assignments. When studying performance, run your codes on a quiet machine.

  • HW0A: Sequential and OpenMP DPKP [DUE WEDNESDAY 28 JANUARY 2015 @ 23:55]
    Start from the code skeleton knap.c provided in KnapSeq.tar to create a sequential knapsack solver.

    Inputs are in knnxcc.txt files, where nn is N, the number of objects, and cc is the approximate total capacity; e.g., k120x4M.txt is the input file for 120 objects and a total capacity of ~4M.

    The file format is:

       //  first line:   # of objects, knapsack capacity
       //  next lines:   weight and profit of one object (1 object per line)
     

    The input file k50x25K.txt has the solution given in k50x25K.res.

    You can also use the simpler case, k10x25.txt (provided in KnapSeq.tar), to debug your codes. knapSeq k10x25.txt should produce:

    platte 70 # knapSeq k10x25.txt 
    The number of objects is 10, and the capacity is 25.
    The optimal profit is 874.
    Time taken: 0.000003
    Solution vector is: 
    0 1 0 0 0 1 1 1 0 1 
    

    Using OpenMP, parallelize the loop that goes through all capacities for one object, as well as other easily parallelizable loops (e.g., initializations).
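    Since row i of the table reads only row i-1, the iterations of the capacity loop are independent. The following minimal sketch (our own function and table layout, not the provided knap.c skeleton) shows the kind of annotation intended:

```c
#include <stdlib.h>
#include <assert.h>

/* Illustrative sketch, not the knap.c skeleton: T[i][c] holds the best
   profit using the first i objects with capacity c.  Row i depends only
   on row i-1, so the capacity loop can be parallelized directly. */
long knap_table(long n, long cap, const long *weight, const long *profit)
{
    long **T = malloc((n + 1) * sizeof *T);
    for (long i = 0; i <= n; i++)
        T[i] = calloc(cap + 1, sizeof **T);   /* row 0 starts at all zeros */

    for (long i = 1; i <= n; i++) {
        /* The loop to parallelize: its iterations are independent. */
        #pragma omp parallel for schedule(static)
        for (long c = 0; c <= cap; c++) {
            long without = T[i - 1][c];
            long with = (weight[i - 1] <= c)
                      ? T[i - 1][c - weight[i - 1]] + profit[i - 1]
                      : without;
            T[i][c] = (with > without) ? with : without;
        }
    }
    long best = T[n][cap];
    for (long i = 0; i <= n; i++) free(T[i]);
    free(T);
    return best;
}
```

    Compile with -fopenmp (gcc) for the pragma to take effect; without it, the same code runs serially.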

    Record times for the part of your code that computes the table, and speedups for 1 through 20 threads (pick a few points, e.g., 1, 2, 4, 8, 12, 16, and 20 threads). Report performance results for the input data files k120x4M.txt, k240x2M.txt, and k500x1M.txt.

    Write down your observations:

    • How does the performance of parallel code compare to the sequential?
    • At what number of threads does the speedup stop increasing?
    • What is your maximum speedup?
    • Your code allocates memory of asymptotic size N*C, so there is a limit to how large a problem you can solve -- the table must fit in memory, otherwise your malloc will fail. For N=2500, what is the largest capacity that your program can handle?
    • Run your program with input file k1200x38M.txt. You should get some error. State your observations.

    Submission Instructions: Submit only your C code files, not the Makefile/kxx.txt files. To ensure that our testing script works, make sure the C files are named knapSeq.c and knapPar.c. Submission file name: HW0A.tar
    Submit your report as a file named R0A.pdf through the Checkin system.

  • HW0B: Memory-efficient DPKP [DUE WEDNESDAY 4 FEBRUARY 2015 @ 23:55]
    The main limitation of this program is that it is memory inefficient: it maintains a table of size O(nc) (which is also the time complexity of the algorithm). This has two drawbacks. First, large problem sizes (e.g., 10,000 objects with a knapsack capacity of 10^9) simply cannot be handled. Furthermore, the number of distinct memory locations touched by the program is asymptotically the same as the number of operations the program performs, which means that performance is limited by memory bandwidth, not compute speed: each memory location is written exactly once and read exactly twice during the execution of the program, so there is no data reuse. As a result, the performance of your code in PA3 of CS475 left you unhappy.

    In a few weeks you are going to see how to overcome this problem, but for the purpose of this first assignment, we are going to brush many things under the carpet. We will completely ignore the backtracking problem and just focus on how to compute the DP table in a memory efficient manner. If necessary, we will recompute any entries that are needed.

    The first observation is that, in order to compute any row (each row corresponds to a new object) in the DP table, only the entries in the previous row are needed. So we can allocate memory for only two rows (total memory is now O(c)). Basically, we will be throwing away (overwriting) all but the two most recent rows of the table.
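    The two-row idea can be sketched as follows (a minimal illustration with our own names, not the required knapMEseq structure): keep only a "prev" and a "curr" row, and swap the two pointers after each object instead of copying.

```c
#include <stdlib.h>
#include <assert.h>

/* Illustrative two-row variant: memory drops from O(n*c) to O(c) because
   only rows i-1 ("prev") and i ("curr") of the DP table are retained. */
long knap_tworow(long n, long cap, const long *weight, const long *profit)
{
    long *prev = calloc(cap + 1, sizeof *prev);     /* row for 0 objects */
    long *curr = malloc((cap + 1) * sizeof *curr);

    for (long i = 0; i < n; i++) {
        for (long c = 0; c <= cap; c++) {           /* parallelizable as in HW0A */
            long with = (weight[i] <= c) ? prev[c - weight[i]] + profit[i]
                                         : prev[c];
            curr[c] = (with > prev[c]) ? with : prev[c];
        }
        long *tmp = prev; prev = curr; curr = tmp;  /* swap, don't copy */
    }
    long best = prev[cap];   /* after the final swap, "prev" is the last row */
    free(prev); free(curr);
    return best;
}
```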

    Write a program knapMEseq that does this, parallelize this in OpenMP just like HW0A (call it knapMEpar) and compare the execution time and quality of parallelization with that of the full table version. Report your observations:

    • How does the performance of parallel code compare to the sequential?
    • At what number of threads does the speedup stop increasing?
    • What is your maximum speedup?
    • Your code now allocates memory of asymptotic size only O(C) rather than N*C. For N=2500, what is the largest capacity that your program can handle now?

    A sample grading script (script) and its sample output (output) are provided.

    Submission Instructions: Files: knapMEseq.c and knapMEpar.c Submission file name: HW0B.tar Report: R0B.pdf

  • HW0C: Complexity analysis [DUE WEDNESDAY 4 FEBRUARY 2015 @ 23:55]
    Now we consider the backtracking part. Since we have only retained the last two rows of the table, we are only able to do the first step of the backtracking, i.e., we are only able to decide whether the last object contributes to the optimal solution. An obvious way to compute the entire solution using the program that you have already written is to then go back and recompute (again in the same, memory-efficient manner) the remainder of the solution. Write the recurrence equation that describes the execution time of this approach, solve it (you may use any standard technique for solving recurrences), and discuss the asymptotic complexity of the algorithm.

    Our ultimate goal is to retain the same asymptotic complexity O(nc) that the original, memory inefficient algorithm had. What should be the nature of the recurrence equation for this to happen?

    Submission file name: R0C.pdf

  • HW0D: CUDA [DUE WEDNESDAY 11 FEBRUARY 2015 @ 23:55]
  • Write a CUDA program knapMEcuda that implements the memory-efficient DP algorithm on the GPU machines in the department. Test your program on the fruit machines in the department (final performance data on apples, oranges, bananas, and coconuts; use the other fruit machines for development). A sample Makefile is provided. Your Makefile must have a "clean" command like the one provided. Create a tar file using "make tar" and submit the generated HW0D.tar file.

    Submit your report as a single PDF file of no more than 3 pages (we are setting up another submission under Checkin). Most likely one of us will read the reports and the other will evaluate the code, so make sure that your report is self-contained.

    For this program, we are not really interested in speedup as a function of the number of processors (threads, thread blocks, whatever). We just want you to get the best performance you can achieve on "large enough" inputs. In particular, we will be testing with n in the range 50-500 and c going to the limit of the memory on the GPU. For the purposes of the report, there are two metrics of performance:
    1. The number of operations per second. Since the knapsack code does not do any floating-point ops, we will use the number of iterations per second. This is simply the product of n and c divided by the execution time (reported in GITERs/sec rather than GFLOPs/sec).
    2. The bandwidth achieved by your program. Count the total number of data transfers (from global memory) made by your program. Note that we can't give you this number, because it may depend on how exactly you write the program (but it will be some function of n and c). Divide the total number of transfers by the execution time, and report that number (in MBytes/sec).
    Both of these metrics are to be compared to the theoretical peak of the machine. For compute performance, the Tesla 1060 GPU is capable of delivering about 640 GFLOPs/sec, but this uses a fused multiply-add unit that does two operations per clock cycle; for integer operations, therefore, take the machine peak as 320 GOPs/sec. For the bandwidth metric, the reported machine peak is about 100 GB/sec (we have not been able to attain this on the fruit machines, but we will know if and when your code does a good job). Don't expect to get 100% of peak on either metric.
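    As a concrete sketch of how the two metrics can be computed from your measured run time (the per-iteration transfer count in the example below is only a hypothetical accounting; yours depends on how you write your kernel):

```c
#include <assert.h>

/* Iterations per second, in units of 1e9 (GITERs/sec): n*c iterations total. */
double giters_per_sec(double n, double c, double secs)
{
    return n * c / secs / 1e9;
}

/* Achieved bandwidth in GB/sec, given an assumed number of bytes moved
   to/from global memory per iteration (this count is program-dependent). */
double gbytes_per_sec(double n, double c, double secs, double bytes_per_iter)
{
    return n * c * bytes_per_iter / secs / 1e9;
}
```

    For example, if k240x2M.txt (n = 240, c = 2e6) were to run in 0.2 s, that would be 2.4 GITERs/sec; under a hypothetical accounting of two 8-byte reads and one 8-byte write per iteration (24 bytes), it would be 57.6 GB/sec.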

    Report your observations:

    • For the provided test file k240x2M.txt what was the performance of your code on each of the two metrics?
    • What fraction of the machine peak in operations per second did your code achieve? To convert iterations to operations: if you wrote your code cleanly, the loop body should be doing two compare-add operations.
    • Based on the above two questions, was the code memory-bound or compute-bound? Why?
    • What was the best performance you achieved and for what values of n and c was this?
    • Was your code optimized so that memory accesses were coalesced? If so, describe in your own words (to someone who does not want to read your code) how you achieved this optimization, and report the loss in performance without this optimization. If you didn't do this optimization, explain why not.

Expected output for ./knapMEcuda k10.txt 1:

The number of objects is 10, and the capacity is 25.
The optimal profit is 874.
Time taken: 0.090000.
The last row of the solution is:
          0          0        200        200        200        390        400        400        400        590
        590        590        590        605        605        619        749        749        749        749
        764        764        778        849        874        874

Submission Instructions: Files: knapMEcuda.cu knapMEkernel.cu knapMEkernel.h Makefile (Irrespective of your Makefile modifications, after compiling, the executable's name should be knapMEcuda.) Submission file name: HW0D.tar Report: R0D.pdf