CS 470 DL: Additional Notes

Performance of Single Cycle Machines

To help in understanding the "critical path" concept

from P&H, 3rd edition, revised, pp. 315ff

Example

Performance of Single-Cycle Machines

Assume that the operation times for the major functional units in this implementation are the following:

Memory units: 200 picoseconds (ps)
ALU and adders: 100 ps
Register file (read or write):o 50 ps

Assuming that the multiplexors, control unit, PC accesses, sign extension unit, and wires have no delay, which of the following implementations would be faster and by how much?

An implemenation in which every instruction operates in 1 clock cycle of a fixed length.
An implementation where every instruction executes in 1 clock cycle using a variable length clock, which for each instruction is only as long as it needs to be. (Such an approach is not terribly practical, but it will allow us to see what is being sacrificed when all the instructions must execute in a single clock of the same length.)

To compare the performance, assume the following instruction mix: 25% loads, 10% stores, 45% ALU instructions, 15% branches, and 5% jumps.

Answer

Let's start by comparing the CPU execution times. Recall from Chapter 4 (Sections 1.4 and 1.5 in P&H, 4th edition) that

CPU execution time = Instruction count × CPI × Clock cycle time

Since CPI must be 1, we can simplify this to

CPU execution time = Instruction count × Clock cycle time

We need only find the clock cycle time for the two implementations, since the instruction count and CPI are the same for both implementations. The critical path for the different instruction classes is as follows:

Instruction class	Functional units used by the instuction class
R-type	Instruction fetch	Register access	ALU	Register access
Load word	Instruction fetch	Register access	ALU	Memory access	Register access
Store word	Instruction fetch	Register access	ALU	Memory access
Branch	Instruction fetch	Register access	ALU
Jump	Instruction fetch

Using these critical paths, we can compute the required length for each instruction class:

Instruction class	Instruction memory	Register read	ALU operation	Data memory	Register write	Total
R-type	200	50	100	0	50	400 ps
Load word	200	50	100	200	50	600 ps
Store word	200	50	100	200		550 ps
Branch	200	50	100	0		350 ps
Jump	200					200 ps

The clock cycle for a machine with a single clock for all instructions will be determined by the longest instruction, which is 600 ps. (This timing is approximate, since our timing model is quite simplistic. In reality, the timing of modern digital systems is complex.)

Thus, the average time per instruction with a variable clock is

CPU clock cycle = 600 × 25% + 550 × 10% + 400 × 45% + 350 × 15% + 200 × 5%
= 447.5 ps

Since the variable clock implementation has a shorter average clock cycle, it is clearly faster. Let's find the performance ratio:

  CPU performance_{variable clock}      CPU execution time_{single clock}
  ---------------------------  =  -----------------------------
   CPU performance_{single clock}      CPU execution time_{variable clock}

      IC × CPU clock cycle_{single clock}
  =  --------------------------------
     IC × CPU clock cycle_{variable clock}

      CPU clock cycle_{single clock}
  =  ---------------------------
     CPU clock cycle_{variable clock}

  =  600/447.5 = 1.34

The variable clock implementation would be 1.34 times faster. Unfortunately, implementing a variable-speed clock for each instruction class is extremely difficult, and the overhead for such an approach could be larger than any advantage gained. As we will see in the next section, an alternative is to use a shorter clock cycle that does less work and then vary the number of clock cycles for the different instruction classes.

Comments from the instructor:

The example above is taken directly from Patterson & Hennessy, COD, 3rd edition, revised, pp. 315 ff, even though I omitted the appropriate quotation marks. There appears to be no such example in the 4th edition, which is unfortunate, because this example clearly shows what is involved in a critical path.

Also see the first paragraph on page 323 of the 4th edition:

"Figure 4.19 shows the operation of the datapath for an R-type instruction, such as add $t1, $t2, $t3. Although everything occurs in one clock cycle, we can think of four steps to execute the instruction; these steps are ordered by the flow of information ...."

These four steps are exactly those for the R-type instruction in the first table above.

Similarly, you may think of a store word instruction as having five steps in its critical path, while a jump instruction has only one step. Etc.

Please use these steps for determining the critical path on any exam question I may ask.

Please let me know if you find any typos in this writeup.