The goal of this assignment is to learn about a couple of tools that perform loop transformations and gain experience with data-flow analysis based optimization for a micro-benchmark relevant to Computational Fluid Dynamics (CFD) simulations our colleagues in Mechanical Engineering use for combustion simulation. There are multiple parts of this assignment, and each part requires some reporting. Keep a repository and check in your report and code after each part is complete so that you always have a report that is submittable.
Common Subexpression Elimination (CSE) is an optimization to remove
redundant computations from 3-address code. For an example,
file, which is a benchmark we are using in our research with the
Mechanical Engineers who simulate combustion.
Create a bitcode and llvm assembly version of the program AFTER
you have applied the
clang -emit-llvm -O0 -c cfd-mini-c.c -o cfd-mini-c.bc opt -mem2reg cfd-mini-c.bc > cfd-mini-c-mem2reg.bc llvm-dis cfd-mini-c-mem2reg.bc
Now do a search in the file for the call to
and find the really large basic block named
This basic block (see below) has many common subexpressions
add nsw i32 %iz.1, %nGhost.
By hand replace the left hand side of the instances of
add nsw i32 %iz.1, %nGhost
in that basic block with
%add283, which is the first
value register where the common subexpression is stored.
For example, before common subexpression elimination you would see
and after you would see the following:
// BEFORE common subexpression elimination for.body279: ; preds = %for.cond275 %mul280 = mul nsw i32 0, %add3 %idx.ext281 = sext i32 %mul280 to i64 %add.ptr282 = getelementptr inbounds double* %7, i64 %idx.ext281 %add283 = add nsw i32 %iz.1, %nGhost // example common subexpression %mul284 = mul nsw i32 %add283, %mul1 %idx.ext285 = sext i32 %mul284 to i64 ... %add298 = add nsw i32 %iz.1, %nGhost %mul299 = mul nsw i32 %add298, %mul1 ...
To prototype this by hand, use the following commands:
// AFTER common subexpression elimination for.body279: ; preds = %for.cond275 %mul280 = mul nsw i32 0, %add3 %idx.ext281 = sext i32 %mul280 to i64 %add.ptr282 = getelementptr inbounds double* %7, i64 %idx.ext281 %add283 = add nsw i32 %iz.1, %nGhost // example common subexpression %mul284 = mul nsw i32 %add283, %mul1 %idx.ext285 = sext i32 %mul284 to i64 ... %mul299 = mul nsw i32 %add283, %mul1 ...
Note that all of the instances of
lli cfd-mini-c-mem2reg.ll // interpret program before doing hand opt cp cfd-mini-c-mem2reg.ll cfd-mini-c-mem2reg-handopt.ll // edit the cfd-mini-c-mem2reg-handopt.ll lli cfd-mini-c-mem2reg-handopt.ll // interpret program after
add nsw i32 %iz.1, %nGhostafter the first one in a single basic block are dominated by that first instance. Another way to think of it is that the first instance of that subexpression is available throughout the rest of the same basic block. We will talk about why this is the case in class.
CSE uses the results of the available expressions data-flow analysis
to determine where CSE can be performed.
Implement the available expression data-flow analysis
and use it to implement CSE. Call this pass
Figure out how to store available in and out sets for each
instruction and how to map available expressions to the
temporary value to which they are assigned (e.g.
Delete redundant instructions that recompute the available expression.
replaceAllUsesWith to replace later instance of
the value computed in the redundant instruction with the temporary
value that contains the relevant available expression.
Note that when performing the data-flow analysis, at
merge points in the control-flow graph the same available
expression may be assigned to multiple temporary values.
You will want to insert a
so that each available expression is only associated with one
In your PA2 report, use one paragraph to describe your implementation
of the available expressions data-flow analysis.
Then provide a table showing the change in static instruction counts
and change in performance for both LULESH and cfd-mini-c after
Already when performing the data-flow analysis based CSE, you had to consider the fact that the code is represented in SSA. There are ways to implement CSE and other optimizations without using data-flow analysis when the program is in SSA representation. Andre Platzer describes one such way in a great set of lecture notes on the topic.
Implement a second pass called
the algorithm described by Andre. In your report, compare the
complexity of the implementations and again report the static instruction
count before and after
-csessa result in more or fewer instructions than
-cse? Why is that the case?
Pluto is a compiler that can automatically parallelize and tile C codes. Pragmas need to be placed around the loops that should be transformed. I have installed pluto in the cs553 account. Try out some of the tests and examples using the following command:
~cs553/public/pluto-0.11.0/polycc ~cs553/public/pluto-0.11.0/test/seidel.c --tile --parallel
Pick one of the test or examples programs in
~cs553/public/pluto-0.11.0/ and create at least
3 different variants of the course code using various combinations
of pluto command-line options (e.g. tiled and parallel, just tiled, etc).
For you PA2 report, create an execution time graph
(execution time vs number of threads) that compares the performance of
the original code and
the 3 variants.
Describe what the graph is showing. Which variant did you expect to
execute faster and why? Which variant actually executed faster?
Why was your hypothesis wrong or correct?
Polly is a loop analysis and transformation
tool that works with llvm. Work through
the matrix multiply
example. NOTE that you need to use the llvm binaries installed with polly.
They can be found in
and the pass shared libraries are in
See below for an example of their usage.
In your PA2 report, for
show the SCoPs detected by Polly,
the polyhedral representation of those SCoPs,
and the dependencies for the SCoPs.
Note that in the below I have taken out the "-basicaa" flag that
the Polly folks are using in their tutorial. It was giving me errors.
setenv POLLYBIN ~cs553/public/polly/llvm_build/bin $POLLYBIN/clang -emit-llvm -c ~cs553/public/pluto-0.11.0/test/seidel.c -o seidel.bc $POLLYBIN/opt -load ~cs553/public/polly/llvm_build/lib/LLVMPolly.so -polly-canonicalize seidel.bc > seidel-polly.bc $POLLYBIN/opt -load ~cs553/public/polly/llvm_build/lib/LLVMPolly.so -polly-cloog -analyze -q seidel-polly.bc ...
Create a pdf file of your report and submit it via RamCT.
You will also need to submit your
The late policy in the syllabus applies. However, if you show a revision control log and versions of PA2.cpp and your report satisfying the following constraints, then you can have 2 extra days to complete your assignment with no late penalty.
PA2Passes/subdirectory with a working
csehashpass by September 19th and daily progress on the other portions of the assignment.
PA2Passes/just to simplify revision control, it needs to be in latex and be generating a pdf that reports on the status of all aspects of the project.
You can use any revision control tool with which you are familiar, but if you use something like github make sure your code is not publicly available. See the PA1 writeup for quick start instructions for using subversion.