CUDA parallelization of the Divide-and-Conquer memory-efficient algorithm for the 0/1 Knapsack problem

Version 1.0

The objective of this assignment is to implement the Divide-and-Conquer Dynamic Programming algorithm for the 0/1 knapsack problem, parallelize it in CUDA, and report your performance on the department's Fermi-based machines (the pomegranates). The assignment has three coding parts, each building incrementally on the previous one, plus a report. By the end you should be able to make the code compute bound and achieve a good Gops-per-second figure.

  • HW4A: knapAlgo [15 points]
    Start from the sequential code you wrote for HW3 and implement the fill_table function in CUDA (you will essentially reuse the code you wrote for HW0, which makes a sequence of kernel calls, one for each row of the table). Make sure that your program now produces correct answers for large values of the capacity. Then modify the program so that beyond a certain depth the fill_table function is executed on the host rather than the GPU, and study and report the depth at which the overhead of GPU parallelization is no longer worth the parallelization gains.
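As a starting point, the HW0-style structure (one kernel call per table row) can be sketched as follows. This is a hedged sketch, not the handout's code: all names (fill_row_kernel, d_prev, d_cur) are illustrative, and error checking is omitted.

```cuda
// One kernel launch per table row: each thread computes one capacity w.
__global__ void fill_row_kernel(const long *d_prev, long *d_cur,
                                int C, int wi, long vi)
{
    int w = blockIdx.x * blockDim.x + threadIdx.x;
    if (w > C) return;
    long best = d_prev[w];                     // leave item i out
    if (w >= wi && d_prev[w - wi] + vi > best)
        best = d_prev[w - wi] + vi;            // take item i
    d_cur[w] = best;
}

// Host loop: one kernel call per item (i.e., per row), swapping row buffers.
void fill_table_gpu(long *d_prev, long *d_cur,
                    const int *wt, const long *val, int N, int C)
{
    const int threads = 256;
    const int blocks  = (C + threads) / threads;   // covers w = 0..C
    for (int i = 0; i < N; i++) {
        fill_row_kernel<<<blocks, threads>>>(d_prev, d_cur, C, wt[i], val[i]);
        long *t = d_prev; d_prev = d_cur; d_cur = t;
    }
    cudaDeviceSynchronize();
}
```

Below the cutoff depth described above, you would run the host version of fill_table instead of this loop; each kernel launch carries fixed overhead, which is exactly what makes small subproblems cheaper on the CPU.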

  • HW4B: knapAlgoSmallN (Making the CUDA compute bound) [40 points]
  • Using the techniques described in lecture, modify the CUDA code so that the fill_table function is evaluated in a single kernel call. We suggest that you proceed in the following steps.

    1. First write a simple CUDA function that accomplishes the correct synchronization between threadblocks. Consider writing a toy program similar to Waruna's syncthreads example (except that that example synchronized the threads within a single threadblock, whereas this one must synchronize multiple threadblocks).
    2. Next, embellish this with the code that correctly updates (with only one thread per threadblock active) the section of the array allocated to that threadblock. Make sure that the correct values are written to and read from global memory between the synchronizations. For this to work there must be enough global memory for each threadblock to store its entire "output" (i.e., N*WMAX or N*sum(Wi) memory per threadblock). Hence, in order to maximize the number of active threadblocks, this scheme only works for relatively small values of N.
    3. After ensuring that this code produces correct answers, parallelize the computation within a threadblock. This is the part that was not completely detailed in class, since there are a few different options that you could pursue.
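For step 1, one common pattern for a grid-wide barrier on this class of hardware is an atomic arrival counter with a generation flag. The sketch below is one possible design, not the only correct one, and all names are illustrative. Note the residency caveat: if the launch has more threadblocks than the GPU can keep resident at once, the non-resident blocks never arrive and the barrier deadlocks.

```cuda
// A toy grid-wide barrier: blocks atomically count arrivals, and the last
// block to arrive bumps a generation counter to release the spinners.
// CAUTION: deadlocks unless every threadblock is resident simultaneously.
__device__ unsigned int arrive_count = 0;          // blocks arrived so far
__device__ volatile unsigned int generation = 0;   // bumped by last arriver

__device__ void global_barrier(unsigned int num_blocks, unsigned int *my_gen)
{
    __syncthreads();                       // whole block reaches the barrier
    if (threadIdx.x == 0) {
        unsigned int gen = *my_gen;
        __threadfence();                   // publish this block's writes
        if (atomicAdd(&arrive_count, 1) == num_blocks - 1) {
            arrive_count = 0;              // last arriver resets the counter
            __threadfence();
            generation = gen + 1;          // ...then releases everyone
        } else {
            while (generation == gen) { }  // spin until released
        }
        *my_gen = gen + 1;                 // per-block sense for next round
    }
    __syncthreads();                       // hold the block until released
}
```

The __threadfence() before arriving is what makes step 2 work: it guarantees that the section of the array this threadblock wrote to global memory is visible to every other block once the barrier releases.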

  • HW4C: knapAlgoLargeN [25 points]
  • In 4B, the total amount of global memory required is (number of blocks)*N*WMAX, which restricts the number of objects your code can handle. Now modify your program so that an arbitrary value of N can be handled: repeatedly invoke the code of Part B in a sequence of kernel calls.
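The host-side driver for this might look like the following sketch. Here knap_kernel stands in for your single-kernel Part B code, and n_tile is a tile size chosen so that num_blocks * n_tile * WMAX fits in global memory; both names are assumptions for illustration.

```cuda
// Host-side driver for HW4C: process the N items in tiles of n_tile rows,
// so the scratch-memory footprint is independent of the total N.
void fill_table_large_n(long *d_row, const int *d_wt, const long *d_val,
                        int N, int C, int n_tile, int num_blocks, int threads)
{
    for (int i0 = 0; i0 < N; i0 += n_tile) {
        int rows = (N - i0 < n_tile) ? N - i0 : n_tile;  // last tile may be short
        knap_kernel<<<num_blocks, threads>>>(d_row, d_wt + i0, d_val + i0,
                                             rows, C);
        // d_row now holds the DP row after items 0..i0+rows-1;
        // the next kernel call continues from it.
    }
    cudaDeviceSynchronize();
}
```

Each iteration reuses the same global scratch buffers, so only the current tile of items constrains memory, not N itself.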

  • HW4D: Report [20 points]
  • Your report should present the performance, i.e., the total number of Gops per second achieved for large values of N and C. Use k50kx1M.txt for reporting performance. If you need a smaller input file for debugging, use k5kx100k.txt or any input file from previous assignments. You can also use generateInput.c to generate your own input files.

    Please be sure to report times for the complete application, including any code executed on the host as well as the time for data transfer to and from the host. Measure each of the different components and present an Amdahl's law analysis.
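As a reminder, with p the fraction of total runtime that is accelerated and s the speedup of that fraction, Amdahl's law bounds the whole-application speedup by

```latex
S(s) = \frac{1}{(1-p) + p/s},
\qquad \lim_{s \to \infty} S(s) = \frac{1}{1-p}
```

so the host-side code and the host-device transfers (the 1-p term) ultimately cap the achievable speedup no matter how fast the kernels get; your component measurements let you fit p and s.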

    Analyze your code's performance with respect to the system peak performance.

    Using the Checkin tab, submit your codes, your time and speedup statistics, and your observations and analysis in a report. For all three codes, the verbose options should be similar to HW3.