CS 675 Advanced Parallel Computing (Spring 2010)
Wim Bohm, Email: bohm at cs dot colostate.
- Textbook: there is no required textbook for this course; we will read papers and notes.
- Meeting Time: TR 3:30-4:45 Room: HPC lab (CSB 315)
- Read Vassili Volkov's CUDA BLAS paper; in particular, we will discuss the matrix multiply algorithm and its complexity.
- We have started creating a wiki; please register. The wiki will send you a password so you can log in. Next week we will shut registration down.
- Thursday Feb 11: Discussion of your optimized CUDA matrix multiply algorithm. Write up a short (< two-page) discussion paper describing the algorithm's structure and its performance on a Tesla (apples, oranges, coconuts, bananas, tuttifrutti).
In this course we will compare programming models and evaluate the performance of GPUs and multi-core CPUs. Parallel systems have finally become commodity hardware, and we will have to learn how to program them effectively. The "La-Z-Boy" programming era of getting better performance by waiting for our processors to take the next jump on Moore's curve is over. We must learn how to use the new parallel systems effectively and do honest, scientifically valid performance comparisons.
A more detailed description of what we want to achieve can be found here. A comparison of CS560 and CS675 can be found here.
Our approach will be, in Bailey's terms, not to "fool the masses". A common way to fool the masses is to compare highly optimized code on one system against unoptimized code on another. Bailey also states that "Scientists must be willing to provide all details of the experimental environment, so others can reproduce their results." In particular, we want to compare highly optimized GPU implementations to tiled, multithreaded, and vectorized CPU implementations. We will work in a "competitive, collaborative" style: teams try to come up with a best solution, then learn from each other and find out whether aspects of the various solutions can be combined to create an even better one.
- GPU vs. multi-core architecture comparison
- Expression of parallelism: CUDA, SSE and OpenMP
- CUDA thread grids and blocks
- CUDA Coalescing
- SSE vectorization
- OpenMP multithreading
- Memory models: global, shared, local
- Banks, avoiding bank conflicts and the effect on performance
- Tiling for improved locality
- Micro benchmarks: very simple programs that teach us how to get
performance information (peak compute, peak throughput, effect of
coalescing, effect of (avoiding) bank conflicts, effect of vectorization).
- Kernels: transpose, Mat*Vec, Mat*Mat, more Dwarfs (Virginia
paper), stencil codes, convolution codes, scans.
- "Bottleneck"-ology. How to figure out how what is causing the
deviation from ideal behavior?
- Is it an Amdahl/Gustafson effect? If the sequential fraction is
diminishing as the problem size increases, then a larger problem
size will achieve better speedup.
- Is it parallelization overhead? Then we need to improve the
communication/synchronization behavior. Is the overhead in the
computation/parallelization part or in the communication, and at what
level of the communication hierarchy?
- Is there a general methodology to discover it? So far we don't know.
Maybe we will by the end of the semester.
- Repeating performance studies from papers. Do we get the same, or
better, results? Did the authors compare apples and oranges?
- Studying Rizk and Rajopadhye's RNA stepwise development "saga". Can we
improve on it?
- Projects: to be decided and discussed.
Your participation will make or break the course.
CUDA Install fest
The CUDA nvcc compiler is available on our Linux systems. You should have your
own copy of the CUDA SDK to experiment with and extend. Elliot Forney has created a
page that explains how to install your CUDA SDK.
Papers to read
- weeks 1 and 2: NVidia Programming and Best Practices Guide
- weeks 3 and 4: Vassili Volkov's SC08b CUDA BLAS papers
- week 5: NVidia Matrix Transpose
Links and refs
Last updated Feb 3, 2010