### Example

#### Performance of Single-Cycle Machines

Assume that the operation times for the major functional units in this implementation are the following:
• Memory units: 200 picoseconds (ps)

• ALU and adders: 100 ps

• Register file (read or write):o 50 ps
Assuming that the multiplexors, control unit, PC accesses, sign extension unit, and wires have no delay, which of the following implementations would be faster and by how much?
1. An implemenation in which every instruction operates in 1 clock cycle of a fixed length.

2. An implementation where every instruction executes in 1 clock cycle using a variable length clock, which for each instruction is only as long as it needs to be. (Such an approach is not terribly practical, but it will allow us to see what is being sacrificed when all the instructions must execute in a single clock of the same length.)

To compare the performance, assume the following instruction mix: 25% loads, 10% stores, 45% ALU instructions, 15% branches, and 5% jumps.

Let's start by comparing the CPU execution times. Recall from Chapter 4 (Sections 1.4 and 1.5 in P&H, 4th edition) that

CPU execution time = Instruction count × CPI × Clock cycle time

Since CPI must be 1, we can simplify this to

CPU execution time = Instruction count × Clock cycle time

We need only find the clock cycle time for the two implementations, since the instruction count and CPI are the same for both implementations. The critical path for the different instruction classes is as follows:

Instruction class Functional units used by the instuction class
R-type Instruction fetch Register access ALU Register access
Load word Instruction fetch Register access ALU Memory access Register access
Store word Instruction fetch Register access ALU Memory access
Branch Instruction fetch Register access ALU
Jump Instruction fetch

Using these critical paths, we can compute the required length for each instruction class:

Instruction
class
Instruction
memory
Register
ALU
operation
Data
memory
Register
write

Total
R-type 200 50 100 0 50 400 ps
Load word 200 50 100 200 50 600 ps
Store word 200 50 100 200   550 ps
Branch 200 50 100 0   350 ps
Jump 200         200 ps

The clock cycle for a machine with a single clock for all instructions will be determined by the longest instruction, which is 600 ps. (This timing is approximate, since our timing model is quite simplistic. In reality, the timing of modern digital systems is complex.)

Thus, the average time per instruction with a variable clock is

CPU clock cycle = 600 × 25% + 550 × 10% + 400 × 45% + 350 × 15% + 200 × 5%
= 447.5 ps

Since the variable clock implementation has a shorter average clock cycle, it is clearly faster. Let's find the performance ratio:

```  CPU performancevariable clock      CPU execution timesingle clock
---------------------------  =  -----------------------------
CPU performancesingle clock      CPU execution timevariable clock

IC × CPU clock cyclesingle clock
=  --------------------------------
IC × CPU clock cyclevariable clock

CPU clock cyclesingle clock
=  ---------------------------
CPU clock cyclevariable clock

=  600/447.5 = 1.34
```

The variable clock implementation would be 1.34 times faster. Unfortunately, implementing a variable-speed clock for each instruction class is extremely difficult, and the overhead for such an approach could be larger than any advantage gained. As we will see in the next section, an alternative is to use a shorter clock cycle that does less work and then vary the number of clock cycles for the different instruction classes.

The example above is taken directly from Patterson & Hennessy, COD, 3rd edition, revised, pp. 315 ff, even though I omitted the appropriate quotation marks. There appears to be no such example in the 4th edition, which is unfortunate, because this example clearly shows what is involved in a critical path.

Also see the first paragraph on page 323 of the 4th edition:

"Figure 4.19 shows the operation of the datapath for an R-type instruction, such as add \$t1, \$t2, \$t3. Although everything occurs in one clock cycle, we can think of four steps to execute the instruction; these steps are ordered by the flow of information ...."

These four steps are exactly those for the R-type instruction in the first table above.

Similarly, you may think of a store word instruction as having five steps in its critical path, while a jump instruction has only one step. Etc.

Please use these steps for determining the critical path on any exam question I may ask.

Please let me know if you find any typos in this writeup.