CS553
Colorado State University

==========================================
Instruction Scheduling
==========================================
9/17/09

-----------------------
Pipelining basics
    - washer/dryer analogy
    - heads up: the figure in the book switches the time and instruction axes

-----------------------
Idealized Pipeline (slide 3)
    - pipeline used in the Hennessy and Patterson architecture book
    - draw a bit of the pipeline to show how the instructions move
      between phases

-----------------------
Stalls (Data Hazards) (slide 6)
    - forwarding logic would remove the pipeline bubbles in the example
    - forwarding logic would not be enough if the operation were something
      like a floating-point operation that took 5 cycles in the EX stage

-----------------------
Problem: How can instructions be scheduled so as to minimize pipeline stalls?

Constraints:
    -data dependences and hazards
    -effect of scheduling on register allocation

----------------------
Data dependences (slide 11)
    - also have them between memory references (loads and stores),
      not only between registers

---------------------
Phase ordering (slide 13)

    Q: Why does register allocation artificially constrain instruction
       scheduling?
    A: Register allocation introduces false dependences.

    Q: Why does scheduling increase register pressure?
    A: Instruction scheduling often moves loads earlier and increases the
       live ranges of variables, creating more arcs in the interference
       graph.

---------------------
List Scheduling (slides 14-20)
    -> write down architectural hazards
        load immediately followed by an ALU op
        store immediately followed by a load
        -assume a is in global data space and $sp points somewhere in the
         stack; we are assuming a minimal level of alias analysis for
         this example

    -> write down scheduling heuristics, schedule an instruction next if...
        -it does not interlock with the previously scheduled instruction
        -it interlocks with its dep graph successors
        -it has many successors
        -it is on the critical path

    -Can we remove all hazards? The schedule 2,5,4,3,6,1,7,9,8 does this.
     Where does the heuristic steer us wrong in this specific case?

---------------------------
Example 10.6 from book
    -> which dependences are flow, anti, and output?

---------------------------
Improving Instruction Scheduling (slide 23)
    -Register renaming
    -Scheduling loads
    -Loop unrolling
    -Software pipelining
    -Predication and speculation

---------------------------
First round SW pipelining example (slides 33-35)
    -> look at register usage to determine why the proposed pattern
       doesn't work

    One iteration of pipelined loop
        ST  B(i), R5
        ADD R5, R3, R4
        ADD R4, R2, R1
        LD  R3, C(i)
        LD  R2, B(i)
        LD  R1, A(i)

---------------------------
Second round on example with a more realistic machine model (slide 36)

    B[i] = A[i] + B[i] + C[i]

    Using machine in dragon book on page 739
    -one load, one store, one ALU op, and one branch per cycle
    -results of arithmetic ops are not available until 2 clocks later,
     other ops have single-cycle latency

------------------------
mstrout@cs.colostate.edu, 9/17/09
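
------------------------
Sketch: list scheduling with the heuristics above

The list-scheduling heuristics in these notes can be tried out on a toy
example. What follows is a minimal sketch, not the algorithm from the slides.
It assumes that the hazards are exactly the two adjacent-instruction pairs
written down above (load then ALU op, store then load), that the four
heuristics are applied in the order listed, and that ties go to the earliest
ready instruction. The six-instruction program (i1..i6), the Instr class, and
the helper names are all hypothetical and are not the example from the slides.

```python
from dataclasses import dataclass, field

# Hazard pairs as written in the notes: (previous kind, next kind).
HAZARDS = {("load", "alu"), ("store", "load")}


@dataclass
class Instr:
    name: str
    kind: str                                  # "load", "store", or "alu"
    succs: list = field(default_factory=list)  # dependence-graph successors


def interlocks(prev, nxt):
    """True if issuing nxt immediately after prev hits a listed hazard."""
    return prev is not None and (prev.kind, nxt.kind) in HAZARDS


def path_length(instrs, name, memo):
    """Instructions on the longest dependence chain starting at name."""
    if name not in memo:
        memo[name] = 1 + max(
            (path_length(instrs, s, memo) for s in instrs[name].succs),
            default=0,
        )
    return memo[name]


def list_schedule(instrs):
    """Greedy list scheduling over a dependence DAG (name -> Instr)."""
    preds_left = {n: 0 for n in instrs}
    for ins in instrs.values():
        for s in ins.succs:
            preds_left[s] += 1
    memo = {}

    def priority(n, prev):
        ins = instrs[n]
        return (
            not interlocks(prev, ins),                           # no stall now
            any(interlocks(ins, instrs[s]) for s in ins.succs),  # stalls its successors
            len(ins.succs),                                      # many successors
            path_length(instrs, n, memo),                        # critical path
        )

    ready = [n for n in instrs if preds_left[n] == 0]
    order, prev = [], None
    while ready:
        best = max(ready, key=lambda n: priority(n, prev))
        ready.remove(best)
        order.append(best)
        prev = instrs[best]
        for s in prev.succs:           # release successors whose preds are done
            preds_left[s] -= 1
            if preds_left[s] == 0:
                ready.append(s)
    return order


def count_stalls(instrs, order):
    """Number of adjacent pairs in order that hit a listed hazard."""
    return sum(interlocks(instrs[a], instrs[b]) for a, b in zip(order, order[1:]))


# Hypothetical program: two independent load / add / store chains,
# assuming the four memory locations do not alias.
example = {
    "i1": Instr("i1", "load", ["i2"]),   # r1 <- a
    "i2": Instr("i2", "alu", ["i3"]),    # r2 <- r1 + 1
    "i3": Instr("i3", "store"),          # b  <- r2
    "i4": Instr("i4", "load", ["i5"]),   # r3 <- c
    "i5": Instr("i5", "alu", ["i6"]),    # r4 <- r3 + 1
    "i6": Instr("i6", "store"),          # d  <- r4
}
original = list(example)
scheduled = list_schedule(example)
print(scheduled, count_stalls(example, original), count_stalls(example, scheduled))
```

On this toy input the program order has three hazard pairs and the greedy
schedule has one. Because the hazard test here looks only at instruction
kinds, the first ALU op always lands right after a load, so no ordering of
these six instructions removes every hazard under this model.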