Objectives

The objective of this assignment is to explore implementations of Polynomial Multiplication. Record the performance of your programs on a Fish machine (Anchovy to Wahoo in the machines list), and experimentally determine the gains you get with 1-8 threads.

Download and untar PA3.tar. You should not need to update the Makefile. Do not set the number of threads in the program, set it at the Linux OS level (using the export or setenv command). Otherwise we cannot run your program for different numbers of threads. Write your report in pdf format.

The following two sequential programs have been implemented for you.

  • [PolyMultGSQ] (Grade School Quadratic) This is the naive implementation of Polynomial Multiplication.
  • [PolyMultGold] This (sequential) implementation includes all of the optimizations up to PolyMultBLQ. It is only provided as a binary. Use it as a benchmark against which you may compare your programs.
You will need to implement the following programs.
  • [PolyMultINQ] (Inverted Quadratic)
    • Invert the inner and outer loops of both computational loops from the PolyMultGSQ implementation.
    • Hint: once the loops have been inverted you can condense the computation into a single double for loop.
    • Explore loop parallelism. Report the optimal strategy and number of threads.
  • [PolyMultOPQ] (Outer Product Quadratic)
    • Rewrite PolyMultGSQ to use a square iteration space (as shown in lecture).
    • Explore loop parallelism. Report the optimal strategy and number of threads.
  • [PolyMultBLQ] (Blocked Quadratic)
    • Block both the inner and outer loops from PolyMultOPQ.
    • Use the same parameter (tune1) to set the block size for each loop.
    • Explore loop parallelism. Report the optimal strategy, number of threads, and block size.
  • [PolyMultDCQ] (Divide and Conquer Quadratic)
    • Rewrite polyMultOPQ to use the divide and conquer strategy presented in lecture.
    • Use tune2 to set the minimum leaf size.
    • Explore task parallelism and test the use of different non recursive algorithms in the leafs (e.g. PolyMult{INQ,OPQ,BLQ}). Determine and report which configuration gives you the best performance.
  • [PolyMultDCNew] (Divide and Conquer New)
    • The divide and conquer strategy used in PolyMultDCQ is a preliminary step to writing Karatsuba's algorithm (which combines the upper right and lower left recursive calls).
      However, PolyMultDCQ does not parallelize well because only the upper right and lower left recursive calls can be executed in parallel (without allocating more memory).
    • Use PolyMultINQ as a starting point to write a divide and conquer algorithm which does parallelize well.
    • Divide relative to the i index, so that both recursive calls are independent.
    • You will need to write your own leaf case because the leaf domain will not be rectangular.
    • Any divide and conquer strategy which achieves the same (or better) speedup as this strategy will recieve credit.
    • Add this task to your project by edditing the Makefile and PolyMult-wrapper.c to produce PolyMultDCNew.{time,time-par,check,check-rand,check-par}

Report

  • Include the details from the explore steps in each of the programs above.
  • When measuring speedup and finding optimal tuning parameters use a degree of 65535 (2^16-1)
  • Justify the optimal design and configuration of each program with either experimental evidence or design principles from this class (for some programs you may need to use both).
  • For each parallelization, report speedup on 1-8 threads, using the speedup equation from lecture 2.
  • As always, report the details of the machine where you are running your tests.

Submission

Submit a PA3.tar file through Checkin. Create this file by using the rule for PA3.tar in the provided Makefile. (This will ensure that your project has the correct directory structure.)

Testing

  • Your executables must be named PolyMult{INQ,OPQ,BLQ,DCQ}.{time,time-par,check,check-rand,check-par} (you should not need to edit the Makefile).
  • Do not change the format and the order of printing statements. Example:
    $ ./bin/PolyMultGSQ.time 65536
    Execution time for Current:   3.266627 sec.
    Execution time for Gold:  1.038377 sec.
    $ ./bin/PolyMultGSQ.time-par 65536
    Execution time for Current:   3.273981 sec.
    Execution time for Gold:  1.040758 sec.
    $ ./bin/PolyMultGSQ.check 65536
    The total number of errors is 0
    $ ./bin/PolyMultGSQ.check-rand 65536
    The total number of errors is 0
    $ ./bin/PolyMultGSQ.check-par 65536
    The total number of errors is 0
        

Good luck and have fun!