CS553 Colorado State University
===============================================
Parallelism and Data Locality
===============================================
10/27/09
--------

-----------------
The Problem (slide 2)

How can compilers map programs to modern computer architectures so that
the programs best utilize architectural features and run as efficiently
as possible?

This mapping is difficult because of the latency between memory and the
core and because of the latency between cores.

Memory hierarchies are the HW approach to improving performance when
there is data reuse, but they require that the compiler generate code
with good data locality.

Quick review of caching
    L1
        -cache lines of memory kept in sets
        -set associativity tells you how many lines per set
        -cache size / (line size * associativity) indicates how many sets
        -address / (line size) indicates what line an address belongs to
        -(line num) mod (numsets) indicates what set each line in memory maps to
    L2
        -if miss in L1, then look in L2
        -variations in terms of whether L2 has superset of L1 or not
    TLB
        -buffers translations of virtual addresses to physical addresses
        -caches may be addressed by either (virtual or physical addresses)
        -buffers top bits of virtual address to page, page to physical address bits
        -worst case: miss in TLB, go get entry from page table (which is
         in memory), page being accessed is not in memory so go fetch it

Big picture
    Want to schedule accesses to neighbors in memory and reuse of the
    same data item as close in time as possible.

Subproblems we are going to focus on ...
    Place parallel computations on separate cores.
    Want consecutive computations on a single processor to access memory
    sequentially.

Approaches we will take ...
    -how do we generate code for loops with affine loop bounds and affine
     memory accesses so that they have good data locality?
    -how do we automatically parallelize the same class of loops?
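
A minimal sketch of the cache arithmetic from the quick review of caching above. The function name cache_geometry and the cache parameters (32 KB, 64-byte lines, 8-way) are assumptions made up for illustration, not taken from the slides. The set index is the line number modulo the number of sets.

```python
def cache_geometry(address, cache_size, line_size, associativity):
    """Return (num_sets, line_num, set_index) for a byte address.

    Hypothetical helper for illustration; all sizes are in bytes.
    """
    # cache size / (line size * associativity) gives the number of sets
    num_sets = cache_size // (line_size * associativity)
    # address / line size (integer division) gives the memory line number
    line_num = address // line_size
    # line number mod number of sets gives the set the line maps to
    set_index = line_num % num_sets
    return num_sets, line_num, set_index


# Assumed example cache: 32 KB, 64-byte lines, 8-way set associative.
CACHE_SIZE, LINE_SIZE, ASSOC = 32 * 1024, 64, 8

num_sets, line_num, set_index = cache_geometry(0x12345, CACHE_SIZE, LINE_SIZE, ASSOC)
print(num_sets, line_num, set_index)

# Addresses that are (num_sets * line_size) bytes apart land in the same set,
# so a stride of that size competes for only `associativity` lines.
stride = num_sets * LINE_SIZE
same_set = cache_geometry(0, CACHE_SIZE, LINE_SIZE, ASSOC)[2] == \
           cache_geometry(stride, CACHE_SIZE, LINE_SIZE, ASSOC)[2]
print(same_set)
```

The last check is the reason large power-of-two strides can have poor locality: every access in such a stride maps to the same set.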
-----------------
Locality example (slide 3)

Column major versus row major
--> draw some pictures showing what this layout looks like in memory

--------------------
Parallelization Example (slide 4)

--> How do we determine that parallelization and/or loop transformations
    are legal?

    #pragma omp for

----------------------
Concepts needed for automating loop transformations (slide 7)

Data dependence analysis problem: How do we determine that a
transformation is legal?

Expressibility problem: How do we represent loops? How do we represent
transformations and parallelization?

Code generation problem: How do we generate the transformed code?

Guidance problem: How do we determine that a transformation is
beneficial? (reuse analysis in book looks at this)

Today we are looking at how to represent loops and dependences. Also
looking at the start of legality checking.

-------------------------
Dependences and Loops (slide 8)

--> have two students write out three iterations of each loop and draw
    arrows between dependences.

Loop independent: These are easy to find and require no new techniques.

Loop carried: These are dependences that depend on the loop structure
because they go from one loop iteration to another loop iteration. If
you just look at a single iteration, there appears to be no dependence.

Can fully parallelize the loop if there are no loop-carried dependences!

Can perform any loop transformation that maintains the same number of
iterations if there are no loop-carried dependences!

Do these two example loops need any transformations?
    - how good is the data locality in these loops?
    - are these loops parallelizable?

----------------------------------
Data dependence testing in general (slide 9)

-DDA enables the compiler to determine whether parallelization and/or
 various loop transformations are legal.

-Show the general DDA setup.

Possible questions:
    - is there a dependence? NP-Complete, integer linear programming
    - what dependences exist?
    - how do we represent those dependences? If we solve for deltas,
      then we can represent them as distance or direction vectors.

-If there is no dependence, then we can reorder the iterations in the
 loop in any manner. Parallelization is also a possibility.

-If there are dependences, what is possible is restricted a bit more,
 but transformations and parallelization might still be possible.

------------------
Algorithms for Solving the Dependence Problem (slide 10)

The accuracy and cost increase as we go down the list.

We are going to cover the GCD test and the Banerjee test, and you will
need to be able to do simple examples by hand. Those are exponential in
the worst case, but the ones we do by hand will be simple.

The book covers the independent variables test, which you will also need
to learn.

------------------
Dependence Testing (slide 11)

--> how do we set up the dependence problem for the example loop?
    Write loop on the board and go back to slide 9.

    1 <= i,i' <= 5
    3i' + 2 = 2i + 1
    3i' - 2i = -1
    i' = 1, i = 2

------------------
Summarizing Data Dependence Testing (slides 16 and 17)

Two of the examples we started with can trivially be shown to have only
loop-independent dependences. The other one will result in a "maybe
dependence" answer, because there really is a loop-carried dependence.

Often we can perform loop transformations and parallelize even when
there are some loop-carried dependences. To detect those cases we need
a finer-grained abstraction instead of just NO loop-carried dependence
and MAYBE loop-carried dependence.

We are going to first look at the distance vector abstraction, which is
based on abstracting the loop as an iteration space.

-------------------------
Iteration spaces (slide 18)

- popular way to represent loops
- show tuples associated with points in iteration space
- figuring out dependences initially can be done by unrolling the loop
  a bit
      A(1,1) = A(0,0) ...
- show how an iteration space can be described mathematically
      (i,j) such that (i,j) in Z^2 and 1 <= i <= 6 and 1 <= j <= 5
  This mathematical representation is a polyhedron.

-------------------------
Distance Vectors (slide 19)

- distance between iterations not data items

----------------------------
Distance Vectors and Loop transformations (slides 20-21)

- show how computation traverses A

Got this far in 1 hr 15 minutes

-------------------------
Data Dependence Terminology (slides 22-24)

The distance vector does not contain enough information by itself to
indicate whether it is a flow, anti, or output dependence. Therefore we
need to determine that information and mark each vector as one of those.

Computing Distance Vectors for an Example with dim>1

    do i = 1,6
      do j = 1,5
        A(i,j) = A(i-1, j-1) + 1

    Calculating distance vector
        write: A(i,j)
        read:  A(i'-1,j'-1)

        1 <= i,i' <= 6, 1 <= j,j' <= 5
        i = i'-1, j = j'-1

        assume anti dep
            source: read
            target: write
            delta_i = target iteration - source iteration = i - i' = -1
            delta_j = j - j' = -1
            distance vector = (delta_i, delta_j) = (-1, -1)
            not an anti dep!

        distance vector = (-delta_i, -delta_j) = (1,1)

-------------------
Example (slides 25-26)

- goal is to show that dependence can't change its type
--> compute distance vector with technique we just covered

--------------------
Protein String Matching Example (slide 28)

- what is the iteration space for the first assignments to h[] and f[]?
    - can pull those out into a separate loop
    - or can code sink them into the main loop
- let's focus on main body of 2D loop
- what distance vectors are in the main loop?

--------------------
Loop-Carried Dependences (slides 29-30)

We can parallelize any loop that does NOT carry a dependence.

--------------------
Loop-Carried Storage-Related Dependences (slide 31)

--> Attempt to compute the distance vectors for the example.

Due to write and read to t.
    read access function:  g(j) = 1, source = j
    write access function: f(i) = 1, target = i

Writes in all iterations greater than j experience an anti-dependence
with the read in iteration j.

    delta_i = i - j
    j < i, therefore delta_i > 0

--------------
Direction vector (slide 32)

Example

    do i = 1,6
      do j = 1,5
        A(i,j) = A(i-1, j-1) + 1

    Calculating distance vector
        iteration (i,j),   write: A(i,j)
        iteration (i',j'), read:  A(i'-1,j'-1)

        1 <= i,i' <= 6, 1 <= j,j' <= 5
        i = i'-1, j = j'-1

        assume anti dep
            source: read
            target: write
            delta_i = target iteration - source iteration = i - i' = -1
            delta_j = j - j' = -1
            distance vector = (delta_i, delta_j) = (-1, -1)
            not an anti dep!

        assume flow dep
            source: write
            target: read
            delta_i = target iteration - source iteration = i' - i = 1
            delta_j = j' - j = 1
            distance vector = (delta_i, delta_j) = (1, 1)

full specification for direction vector
    if delta > 0, then < or +
    if delta = 0, then = or 0
    if delta < 0, then > or -
    if anything but zero, then <>
    if delta any int, then *

Example where a direction vector is needed.

    do i=1,6
      do j=1,5
        A(i) = B(i) + sin(x)

    assume output dependence between instances of A(i)
        source iteration (i,j):   write A(i)
        target iteration (i',j'): write A(i')

        i = i'
        1 <= i <= 6, 1 <= i' <= 6
        1 <= j <= 5, 1 <= j' <= 5

        (i',j') lexicographically after (i,j)
        i = i', so delta_i = i' - i = 0
        then j < j', so delta_j = j' - j > 0

        so v can be represented with a direction vector
        v = (0,<)

Another example that requires a direction vector: LUD

    do k=1,N-1
      do i=k+1,N
        A(i,k) = A(i,k)/A(k,k)                // s1
      do j=k+1,N
        do i=k+1,N
          A(i,j) = A(i,j) - A(i,k) * A(k,j)   // s2

-Need to take all pairs of A references to get full data dependence
 information. This example only looks at one pair.
    iteration (k,i),      write: A(i,k)
    iteration (k',j',i'), write: A(i',j')

    iteration space for first statement, (k,i)
    for second, (k',j',i')

    i = i'
    k = j'
    1 <= k,k' <= N-1
    k+1 <= i <= N
    k'+1 <= j',i' <= N

    possible solution:
        j' = k = k'+1
        i = i' = k'+2 = k+1

    shared iter space, (k), would like distance vector

    s1 source, s2 target
        k'+1 <= j' = k, so k'-k <= -1
        delta_k = k'-k <= -1, not lexicographically positive

    s2 source, s1 target
        delta_k = k-k' >= 1, lexicographically positive but no single
        value, so we need a direction vector: (<)

--------------------
Scalar Expansion (slides 33-34)

There is also an array expansion technique.

When we apply array expansion to the following loop, what are the
resulting dependences?

    do i=1,6
      do j=1,5
        A(i) = B(i) + sin(x)

--------------------
mstrout@cs.colostate.edu, 10/27/09