CS553 Colorado State University
===============================================
Tiling
===============================================
11/16/09
--------

---------------------
Tiling Overview

  Schedule: How tiling changes the schedule of computation.

  Data locality and parallelism: Tiling can improve data locality by
  bringing reuses closer together within the schedule. Tiling can also
  expose larger-granularity parallelism.

  Tiling and frameworks: ways of expressing tiling, and how it fits and
  does not fit into the polyhedral framework.

  Legality: when is tiling legal? Rectangular tiling is legal when the
  loops are permutable.

  Unroll and Jam: How to do unroll and jam by hand?

  Tiling: How to specify tiling in the K&P framework.

  Code gen: using outset and inset

---------------------
Today: the story

- Unroll and jam was one of the first instantiations of tiling. It was
  developed to improve the balance between floating point operations and
  memory accesses within loops. Tiling is more generally applicable and
  arguably subsumes unroll and jam.

---------------------
Loop Unrolling

- Recall loop unrolling ...

----------------------
Loop Unrolling (cont)

- We can't really specify this in the transformation frameworks we have
  presented so far, because unrolling does not change the schedule of
  the loop. Can consider this a lower-level transformation. It is best
  done after other loop transformations and before the representation is
  lowered to 3-address code. Don't want to have to re-derive loop bounds
  information.

----------------------
Loop Balance

- notice that the inner loop uses the same value of A(j) for all m
  iterations
- loop unrolling doesn't help with this by itself.

----------------------
Unroll and Jam

- ... per iteration of the innermost loop
- notice we don't need cleanup code because of handy initial loop bounds
- ok, unroll and jam helps with loop balance; what is tiling all about?
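
Sketch (not from the slides): a small runnable illustration of unroll and jam. The loop nest here is an assumption made for illustration: the outer loop runs over 2*n values of j, the inner loop over m values of i, and the body accumulates B[i] into A[j]. The function names original and unroll_and_jam are hypothetical. The outer loop is unrolled by 2 and the two copies of the inner loop are fused, so one load of B[i] serves two additions. Because the outer trip count is a multiple of 2, no cleanup loop is needed.

```python
def original(A, B, n, m):
    """Hypothetical loop nest: A[j] is reused across all m inner iterations."""
    A = list(A)  # work on a copy so the caller's list is untouched
    for j in range(2 * n):
        for i in range(m):
            A[j] = A[j] + B[i]
    return A


def unroll_and_jam(A, B, n, m):
    """Outer loop unrolled by 2, then the two inner-loop copies jammed (fused)."""
    A = list(A)
    for j in range(0, 2 * n, 2):      # step 2: each iteration covers j and j+1
        for i in range(m):
            b = B[i]                  # loaded once, used by both updates
            A[j] = A[j] + b
            A[j + 1] = A[j + 1] + b
    return A
```

Each A[j] still receives its B[i] values in the same order as before, so under these assumptions the two versions compute identical results.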
----------------------
Unroll and Jam legality (slide 7)

- Unrolling is always legal, if you guard the trip count
- Fusion is legal when there are no deps from the 2nd body to the 1st
  body. Is it possible to get such a dep after just unrolling the outer
  loop? If writes are involved in false dependences in the inner loop,
  then that might be a problem
      B(i) = B(i-1)
- unrolling size limits
    - register pressure
    - code size
    - overhead due to handling special cases

----------------------
Tiling (slide 8)

- show old order and new order on iteration points

----------------------
Specify Tiling (slide 9)

- mapping for this example is on the slide
- show some of the mappings for the points
      1, 1 -> 0, 0, 1, 1
      1, 2 -> 0, 0, 1, 2

----------------------
Code Generation (slide 13)

Step through how code generation works.

Original loop

    for i = 1, 6
      for j = 1, 5
        A[i,j] = ...

Tiling specification (the division is integer division)

    {[i,j] -> [ti, tj, i, j] | ti = (i-1)/2 and tj = (j-1)/2}

Transformed iteration space

    {[ti, tj, i, j] | ti = (i-1)/2 and tj = (j-1)/2
                      and 1<=i<=6 and 1<=j<=5 }

Note that the above is not a polyhedron because of the integer division.
We can use the remainder theorem to create affine constraints, where ri
and rj are the remainders.

    ti = (i-1)/2 becomes 0 <= ri <= 2-1 and i-1 = ti*2 + ri
    tj = (j-1)/2 becomes 0 <= rj <= 2-1 and j-1 = tj*2 + rj

Performing Fourier Motzkin on

    {[ti, tj, i, j] | 0 <= ri <= 2-1 and i-1 = ti*2 + ri
                      and 0 <= rj <= 2-1 and j-1 = tj*2 + rj
                      and 1<=i<=6 and 1<=j<=5 }

will result in the following loop. Here ti and tj index tiles, so they
step by 1; a step equal to the tile size has to be added when the tile
iterator is the tile origin instead, as in jj = 2 (j/2) on slide 14.

    for ti = 0 to (6-1)/2
      for tj = 0 to (5-1)/2
        for i = max(1,ti*2+1) to min(6,ti*2+2)
          for j = max(1,tj*2+1) to min(5,tj*2+2)
            A[i,j] = ...

---------------------
Unroll and Jam is Tiling (slide 14)

- data dependences
      A(j) to A(j')
      j=j'
      dep vector = (0,<)

- tiling i and j, i with tile size 1 and j with tile size 2
      {[j,i] -> [jj,ii,j,i] : jj = 2 (j/2) && ii = i }
      {[j,i] -> [jj,ii,j] : jj = 2 (j/2) && ii = i }

--------------------
mstrout@cs.colostate.edu, 11/19/09
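
--------------------
Sketch: checking the tiled loop for the slide 13 example

The following is a minimal sketch, not part of the original notes, for the 6 x 5 example with tile size 2. It assumes the mapping ti = (i-1)/2 and tj = (j-1)/2 with integer (floor) division, and treats ti and tj as tile indices that step by 1. The function names original_order and tiled_order are hypothetical. It enumerates the tiled schedule so that one can check that every original iteration is visited exactly once and that each point lands in the tile the mapping assigns to it.

```python
def original_order(n_i=6, n_j=5):
    """Iteration points of the untiled nest: for i = 1..n_i, for j = 1..n_j."""
    return [(i, j) for i in range(1, n_i + 1) for j in range(1, n_j + 1)]


def tiled_order(n_i=6, n_j=5, tile=2):
    """Points (ti, tj, i, j) in the order a tiled loop nest would visit them."""
    points = []
    for ti in range(0, (n_i - 1) // tile + 1):        # tile index along i
        for tj in range(0, (n_j - 1) // tile + 1):    # tile index along j
            i_lo, i_hi = max(1, ti * tile + 1), min(n_i, ti * tile + tile)
            j_lo, j_hi = max(1, tj * tile + 1), min(n_j, tj * tile + tile)
            for i in range(i_lo, i_hi + 1):           # points inside the tile
                for j in range(j_lo, j_hi + 1):
                    points.append((ti, tj, i, j))
    return points
```

The first two points produced are (0, 0, 1, 1) and (0, 0, 1, 2), which agrees with the mappings listed under Specify Tiling (slide 9). The min(...) upper bounds clip the partial tiles at the j = 5 edge.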