CS553 Colorado State University
===============================================
Synchronization-Free Parallelization
===============================================
11/12/09

----------------------------
Two models

  SPMD with MPI and OpenMP

  For both models we want to create outer loops that are fully parallel.
    - but what about ILP?

-----------------------------
Wavefront parallelization example (slide 3)

  - Which iterations should be assigned to each processor if we don't want
    processors to communicate or synchronize?

  - We are ignoring false sharing for the moment; it could be fixed by
    providing an appropriate data mapping.

-----------------------------
Step through the synchronization-free affine partition algorithm.
  - use code with a simple diagonal dependence and only one statement
    (the book has a more complicated example)

  The space-partition constraints from the book (pg. 832):

    For each pair of accesses (subscripts 1 and 2) sharing a dependence:
      a) B_1 i_1 + b_1 >= 0                  // loop bounds for first access
      b) B_2 i_2 + b_2 >= 0                  // loop bounds for second access
      c) F_1 i_1 + f_1 = F_2 i_2 + f_2       // access same memory

    select a space partition such that dependent accesses are mapped to
    the same processor:
      C_1 i_1 + c_1 = C_2 i_2 + c_2

---------------------------
Space-Partition Constraints (slide 4)

  loop bounds
    1 <= i,i' <= 6
    1 <= j,j' <= 5

  dependence equality constraints
    i = i'-1
    j = j'-1

  space partition constraints

    [ C_11 C_12 ] [ i ] + [ c1 ]  =  [ C_11 C_12 ] [ i' ] + [ c1 ]
                  [ j ]                            [ j' ]

-----------------------------
Solving the space partition constraints

  Ad hoc approach

    t1 = i = i' - 1
    t2 = j = j' - 1

    [ C_11 C_12 ] [ t1 ] + [ c1 ]  =  [ C_11 C_12 ] [ t1 + 1 ] + [ c1 ]
                  [ t2 ]                            [ t2 + 1 ]

    [ C_11-C_11  C_12-C_12 ] [ t1 ] + [ c1 - c1 - C_11 - C_12 ]  =  0
                             [ t2 ]

    simplified
      c1 - c1 - C_11 - C_12 = 0
      C_11 = - C_12      // one independent choice

    Let C_11 = 1, C_12 = -1, and c1 = 4

    why c1 = 4?
      figure out the iteration with the smallest space partition and
      map it to zero

      have
        [ C_11 C_12 ] [ i ] + c1
                      [ j ]

      given C_11 = 1 and C_12 = -1, i - j + c1 computes the space partition
      smallest i=1, largest j=5:  1 - 5 + c1 = 0

      min( C_11 i + C_12 j + c1 ) = min( i - j + c1 ) = 1 - 5 + c1 = 0
        c1 = 4
      max( C_11 i + C_12 j + c1 ) = max( i - j + c1 ) = 6 - 1 + c1 = 9

  Algorithmic approach (algorithm 11.43 in the book)

    Rewrite dependence equality constraints as
      F1 i1 + f1 = F2 i2 + f2

      [1 0] [i] + [0]  =  [1 0] [i'] + [-1]
      [0 1] [j]   [0]     [0 1] [j']   [-1]

    Then as
      [ F1  -F2  (f1-f2) ] [ i1 ]  =  0
                           [ i2 ]
                           [ 1  ]

      [ 1  0  -1   0  1 ] [ i  ]  =  0
      [ 0  1   0  -1  1 ] [ j  ]
                          [ i' ]
                          [ j' ]
                          [ 1  ]

    -> continue this using algorithm 11.43 in the book, pg 836

-----------------------------
Simple Code (slide 6)
  - for example, only have one statement

  loop bounds
    1 <= i,i' <= 6
    1 <= j,j' <= 5

  dependence equality constraints
    i = i'-1
    j = j'-1

  space partition constraints
    i - j + 4 = p

  1) Use loop bounds and inequalities due to space partition constraints
       1 <= i <= 6
       1 <= j <= 5
       i - j + 4 <= p
       i - j + 4 >= p

     Project out j:
       Lower
         1 <= j
         i+4-p <= j
       Upper
         j <= 5
         j <= i+4-p
       New bounds
         1 <= i+4-p
         i+4-p <= 5

     Current bounds
       1 <= i <= 6
       1 <= i+4-p
       i+4-p <= 5

     Project out i:
       Lower
         1 <= i
         -3+p <= i
       Upper
         i <= 6
         i <= p+1
       New bounds
         1 <= p+1,   so 0 <= p
         -3+p <= 6,  so p <= 9

     Current bounds
       0 <= p
       p <= 9

     Loop bounds for p: 0 <= p <= 9

  Don't have to do anything for the middle two steps of the algorithm for
  this example.

  Insert space partition predicate before each statement.

    for 0 <= p <= 9
      for i=1,6
        for j=1,5    //min(5,7-i)
          if (p = i-j+4)
            A(i,j) = A(i-1,j-1)+1

----------------------------
Eliminating Empty Iterations (slide 7)

  Use above projection results

    do p = 0 to 9
      do i = max(1,-3+p), min(6,p+1)
        do j = max(1,i+4-p), min(5,i+4-p)

    p   -3+p   p+1   i   i+4-p   j
    0    -3     1    1     5     5
    1    -2     2    1     4     4
    1    -2     2    2     5     5

  yeah!!
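
  A quick way to sanity-check the projection results is to enumerate both
  loop nests and compare them. The following is only a minimal sketch in
  Python (chosen here for illustration; the notes themselves use
  pseudocode). The function names partition, guarded, tightened, and
  dependences_stay_local are made up for this sketch and do not come from
  the slides or the book. The bounds and the partition p = i - j + 4 are
  the ones derived above.

```python
# Sketch only: helper names are hypothetical, bounds come from the notes.

def partition(i, j):
    # Space partition chosen in the notes: p = i - j + 4
    return i - j + 4


def guarded():
    """Loop nest with the space-partition predicate guarding the statement."""
    visited = []
    for p in range(0, 10):            # 0 <= p <= 9
        for i in range(1, 7):         # 1 <= i <= 6
            for j in range(1, 6):     # 1 <= j <= 5
                if p == partition(i, j):
                    visited.append((p, i, j))
    return visited


def tightened():
    """Loop nest using the bounds obtained by projecting out j, then i."""
    visited = []
    for p in range(0, 10):
        # max(1, -3+p) <= i <= min(6, p+1)
        for i in range(max(1, p - 3), min(6, p + 1) + 1):
            # max(1, i+4-p) <= j <= min(5, i+4-p)
            for j in range(max(1, i + 4 - p), min(5, i + 4 - p) + 1):
                visited.append((p, i, j))
    return visited


def dependences_stay_local():
    """Each dependence (i,j) -> (i+1,j+1) must map to a single partition."""
    return all(
        partition(i, j) == partition(i + 1, j + 1)
        for i in range(1, 6)          # keep i+1 within 1..6
        for j in range(1, 5)          # keep j+1 within 1..5
    )


if __name__ == "__main__":
    print(guarded() == tightened())       # same iterations, same order
    print(len(tightened()))               # 6 * 5 = 30 iterations in total
    print(dependences_stay_local())
```

  Under these assumptions the two nests should produce the same list of
  (p, i, j) triples, beginning with (0,1,5), (1,1,4), (1,2,5) as in the
  table above, and no dependence should cross a partition boundary.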
----------------------------
Eliminating Tests from Innermost Loop (Slide 8)

  - For our example with one statement, just use the constraint in the
    test to compute i and j bounds. We had already done that when
    eliminating empty iterations, so we can just remove the guard.

-----------------------------
OpenMP and MPI for the example (slide 9)

-----------------------------
All the primitive affine transformations can be derived by applying the
synchronization-free affine partition algorithm. (Slide 10)

  REINDEXING

    loop bounds:           1 <= i,i' <= N
    data dep equalities:   i = i'-1

    space partition constraints
      C_11 i + c_1 = C_21 i' + c_2

    reduce the number of unknowns
      t1 = i = i'-1

    simplify
      C_11 t1 + c_1 = C_21 (t1+1) + c_2
      (C_11 - C_21) t1 + c_1 - c_2 - C_21 = 0

      C_11 - C_21 = 0
      c_1 - c_2 = C_21 = C_11

    Let C_11 = 1 then C_21 = 1
    Let c_2 = -1 then c_1 = 0

    s1: p = C_11 i + c_1 = i
    s2: p = C_21 i + c_2 = i - 1

  simple code gen

    Find p bounds for each statement

      statement 1:
        p <= i
        p >= i
        1 <= i <= N
        project out i
          p <= N
          1 <= p

      statement 2:
        p <= i-1
        i-1 <= p
        1 <= i <= N
        project out i
          p+1 <= N,  p <= N-1
          1 <= p+1,  0 <= p

      Union across all p's
        0 <= p <= N

    Put space partition predicate before each statement

      for 0 <= p <= N
        for 1 <= i <= N
          if (p=i) s1
          if (p=i-1) s2

    Eliminate empty iterations

      For each statement, use unioned p bounds, statement iteration space
      and space partition constraints to determine bounds for the
      iteration space.
      statement 1:
        0 <= p <= N
        1 <= i <= N
        p <= i
        p >= i
        bounds for i: max(1,p) <= i <= min(N,p)

      statement 2:
        0 <= p <= N
        1 <= i <= N
        p <= i-1
        p >= i-1
        bounds for i: max(1,p+1) <= i <= min(N,p+1)

      union for statements in same loop
        max(1,p) <= i <= min(N,p+1)

      for 0 <= p <= N
        for max(1,p) <= i <= min(N,p+1)
          if (p=i) s1
          if (p=i-1) s2

    Eliminate tests from innermost loops

      three iteration spaces for this example
        split p=0, p=1 to N:    s1 only in one of these
        split p=0 to N-1, p=N:  s2 only in one of these

        p=0
          max(1,p) <= i <= min(N,p+1), if N>=1, 1 <= i <= 1
            S2
        1<=p<=N-1
          max(1,p) <= i <= min(N,p+1)
            S1 (at i=p)  S2 (at i=p+1)
        p=N
          max(1,p) <= i <= min(N,p+1), if N>=1, N <= i <= N
            S1

-------------------
Suggested exercises

  - Be able to derive the synchronization-free affine partitioning for
    Example 11.41 in the book.

  - Show how the other primitive affine transformations are derived.

--------------------
mstrout@cs.colostate.edu, 11/12/09