CS 553 Programming Assignment #2 — CSE in LLVM, Loop transformations in Pluto and Polly

(Due Monday September 29th at 2:30pm, Now due Friday October 3rd at 11:59pm)

Introduction

The goal of this assignment is to learn about a couple of tools that perform loop transformations and gain experience with data-flow analysis based optimization for a micro-benchmark relevant to Computational Fluid Dynamics (CFD) simulations our colleagues in Mechanical Engineering use for combustion simulation. There are multiple parts of this assignment, and each part requires some reporting. Keep a repository and check in your report and code after each part is complete so that you always have a report that is submittable.

Data Flow Analysis and Common Subexpression Elimination

Common Subexpression Elimination (CSE) is an optimization to remove redundant computations from 3-address code. For an example, download the cfd-mini-c.c file, which is a benchmark we are using in our research with the Mechanical Engineers who simulate combustion. Create a bitcode and llvm assembly version of the program AFTER you have applied the mem2reg transformation.

    clang -emit-llvm -O0 -c cfd-mini-c.c -o cfd-mini-c.bc
    opt -mem2reg cfd-mini-c.bc > cfd-mini-c-mem2reg.bc
    llvm-dis cfd-mini-c-mem2reg.bc

Now do a search in the file for the call to gettimeofday and find the really large basic block named for.body279. This basic block (see below) has many common subexpressions including add nsw i32 %iz.1, %nGhost. By hand replace the left hand side of the instances of add nsw i32 %iz.1, %nGhost in that basic block with %add283, which is the first value register where the common subexpression is stored. For example, before common subexpression elimination you would see


// BEFORE common subexpression elimination
for.body279:                                      ; preds = %for.cond275
  %mul280 = mul nsw i32 0, %add3
  %idx.ext281 = sext i32 %mul280 to i64
  %add.ptr282 = getelementptr inbounds double* %7, i64 %idx.ext281
  %add283 = add nsw i32 %iz.1, %nGhost          // example common subexpression
  %mul284 = mul nsw i32 %add283, %mul1
  %idx.ext285 = sext i32 %mul284 to i64
  ...
  %add298 = add nsw i32 %iz.1, %nGhost
  %mul299 = mul nsw i32 %add298, %mul1
  ...

and after you would see the following:

// AFTER common subexpression elimination
for.body279:                                      ; preds = %for.cond275
  %mul280 = mul nsw i32 0, %add3
  %idx.ext281 = sext i32 %mul280 to i64
  %add.ptr282 = getelementptr inbounds double* %7, i64 %idx.ext281
  %add283 = add nsw i32 %iz.1, %nGhost          // example common subexpression
  %mul284 = mul nsw i32 %add283, %mul1
  %idx.ext285 = sext i32 %mul284 to i64
  ...
  %mul299 = mul nsw i32 %add283, %mul1
  ...

To prototype this by hand, use the following commands:

    lli cfd-mini-c-mem2reg.ll  // interpret program before doing hand opt
    cp cfd-mini-c-mem2reg.ll cfd-mini-c-mem2reg-handopt.ll
    // edit the cfd-mini-c-mem2reg-handopt.ll
    lli cfd-mini-c-mem2reg-handopt.ll   // interpret program after

Note that all of the instances of add nsw i32 %iz.1, %nGhost after the first one in a single basic block are dominated by that first instance. Another way to think of it is that the first instance of that subexpression is available throughout the rest of the same basic block. We will talk about why this is the case in class.

CSE uses the results of the available expressions data-flow analysis to determine where CSE can be performed. Implement the available expression data-flow analysis and use it to implement CSE. Call this pass -cse. Figure out how to store available in and out sets for each instruction and how to map available expressions to the temporary value to which they are assigned (e.g. %add283 above). Delete redundant instructions that recompute the available expression. Then use replaceAllUsesWith to replace later instance of the value computed in the redundant instruction with the temporary value that contains the relevant available expression.

Note that when performing the data-flow analysis, at merge points in the control-flow graph the same available expression may be assigned to multiple temporary values. You will want to insert a phi expression so that each available expression is only associated with one temporary value.

In your PA2 report, use one paragraph to describe your implementation of the available expressions data-flow analysis. Then provide a table showing the change in static instruction counts and change in performance for both LULESH and cfd-mini-c after performing your -cse pass.

Common Subexpression Elimination (CSE) and SSA

Already when performing the data-flow analysis based CSE, you had to consider the fact that the code is represented in SSA. There are ways to implement CSE and other optimizations without using data-flow analysis when the program is in SSA representation. Andre Platzer describes one such way in a great set of lecture notes on the topic.

Implement a second pass called -csessa using the algorithm described by Andre. In your report, compare the complexity of the implementations and again report the static instruction count before and after -csessa. Does -csessa result in more or fewer instructions than -cse? Why is that the case?

Loop Transformations in Pluto

Pluto is a compiler that can automatically parallelize and tile C codes. Pragmas need to be placed around the loops that should be transformed. I have installed pluto in the cs553 account. Try out some of the tests and examples using the following command:


    ~cs553/public/pluto-0.11.0/polycc ~cs553/public/pluto-0.11.0/test/seidel.c --tile --parallel

Pick one of the test or examples programs in ~cs553/public/pluto-0.11.0/ and create at least 3 different variants of the course code using various combinations of pluto command-line options (e.g. tiled and parallel, just tiled, etc). For you PA2 report, create an execution time graph (execution time vs number of threads) that compares the performance of the original code and the 3 variants. Describe what the graph is showing. Which variant did you expect to execute faster and why? Which variant actually executed faster? Why was your hypothesis wrong or correct?

Loop Transformations in Polly

Polly is a loop analysis and transformation tool that works with llvm. Work through the matrix multiply example. NOTE that you need to use the llvm binaries installed with polly. They can be found in ~cs553/public/polly/llvm_build/bin, and the pass shared libraries are in ~cs553/public/polly/llvm_build/lib/. See below for an example of their usage.

In your PA2 report, for ~cs553/public/pluto-0.11.0/test/seidel.c show the SCoPs detected by Polly, the polyhedral representation of those SCoPs, and the dependencies for the SCoPs. Note that in the below I have taken out the "-basicaa" flag that the Polly folks are using in their tutorial. It was giving me errors.


    setenv POLLYBIN ~cs553/public/polly/llvm_build/bin

    $POLLYBIN/clang -emit-llvm -c ~cs553/public/pluto-0.11.0/test/seidel.c -o seidel.bc

    $POLLYBIN/opt -load ~cs553/public/polly/llvm_build/lib/LLVMPolly.so -polly-canonicalize seidel.bc > seidel-polly.bc
    
    $POLLYBIN/opt -load ~cs553/public/polly/llvm_build/lib/LLVMPolly.so -polly-cloog -analyze -q seidel-polly.bc
    
    ...

Submitting the Assignment

Create a pdf file of your report and submit it via RamCT. You will also need to submit your PA2.cpp file.

The late policy in the syllabus applies. However, if you show a revision control log and versions of PA2.cpp and your report satisfying the following constraints, then you can have 2 extra days to complete your assignment with no late penalty.

You can use any revision control tool with which you are familiar, but if you use something like github make sure your code is not publicly available. See the PA1 writeup for quick start instructions for using subversion.


mstrout@cs.colostate.edu, 9/29/14