











# Principle of Locality

- Programs access a small proportion of their address space at any time
- Temporal locality
  - Items accessed recently are likely to be accessed again soon
  - e.g., instructions in a loop, induction variables
- Spatial locality
  - Items near those accessed recently are likely to be accessed soon
  - e.g., sequential instruction access, array data
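
Both kinds of locality can be seen in a toy simulation. The sketch below is a minimal illustration, not a model of any real processor: the direct-mapped organization, the 64-line size, the 16-byte blocks, and the helper name `hit_rate` are all assumptions chosen for the example.

```python
import random

def hit_rate(addresses, num_lines=64, block_size=16):
    """Fraction of accesses that hit in a toy direct-mapped cache."""
    tags = [None] * num_lines          # one stored tag per cache line
    hits = 0
    for addr in addresses:
        block = addr // block_size     # block number containing this address
        index = block % num_lines      # cache line this block maps to
        tag = block // num_lines       # tells blocks sharing a line apart
        if tags[index] == tag:
            hits += 1
        else:
            tags[index] = tag          # miss: bring the block in, evict the old one
    return hits / len(addresses)

N = 100_000
sequential = list(range(N))                  # array walk: spatial locality
loop = [i % 256 for i in range(N)]           # small loop: temporal locality
rng = random.Random(0)                       # fixed seed for repeatability
scattered = [rng.randrange(1 << 24) for _ in range(N)]  # little locality

seq_rate = hit_rate(sequential)
loop_rate = hit_rate(loop)
rand_rate = hit_rate(scattered)
print(f"sequential {seq_rate:.4f}, loop {loop_rate:.4f}, scattered {rand_rate:.4f}")
```

The sequential walk misses once per 16-byte block and hits on the remaining 15 accesses, the small loop misses only while its working set is first loaded, and the scattered pattern almost never hits.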

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 7

# Taking Advantage of Locality

- Memory hierarchy
- Store everything on disk
- Copy recently accessed (and nearby) items from disk to smaller DRAM memory
  - Main memory
- Copy more recently accessed (and nearby) items from DRAM to smaller SRAM memory
  - Cache memory attached to CPU
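
The copy-on-access idea can be sketched as a chain of small stores in front of disk. This is only an illustration under assumptions: the `Level` class, the tiny capacities, and the LRU eviction policy are hypothetical choices, not something the slides specify.

```python
from collections import OrderedDict

class Level:
    """One level of the hierarchy, holding a limited number of items (LRU)."""
    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.items = OrderedDict()     # oldest entry first

    def touch(self, key):
        """Return True on a hit; on a miss, copy the item in and evict the LRU one."""
        hit = key in self.items
        if hit:
            self.items.move_to_end(key)
        else:
            self.items[key] = True
            if len(self.items) > self.capacity:
                self.items.popitem(last=False)
        return hit

def access(levels, key, served):
    """Search from the fastest level down; disk holds everything."""
    for level in levels:
        if level.touch(key):
            served[level.name] += 1
            return
    served["disk"] += 1

levels = [Level("SRAM cache", 2), Level("DRAM", 4)]   # assumed capacities
served = {"SRAM cache": 0, "DRAM": 0, "disk": 0}
for key in [1, 2, 1, 3, 1, 2, 4, 5, 1]:
    access(levels, key, served)
print(served)
```

Each miss copies the item into every level it passed through, so a repeated access is served by a faster level until the item is evicted.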




























# Interactions with Advanced CPUs

- Out-of-order CPUs can execute instructions during cache miss
  - Pending store stays in load/store unit
  - Dependent instructions wait in reservation stations
    - Independent instructions continue
- Effect of miss depends on program data flow
  - Much harder to analyse
  - Use system simulation


# Virtual Memory

- Use main memory as a "cache" for secondary (disk) storage
  - Managed jointly by CPU hardware and the operating system (OS)
- Programs share main memory
  - Each gets a private virtual address space holding its frequently used code and data
  - Protected from other programs
- CPU and OS translate virtual addresses to physical addresses
  - VM "block" is called a page
  - VM translation "miss" is called a page fault
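
Translation can be sketched as a lookup keyed by virtual page number. The code below is a minimal sketch under assumptions: the 4096-byte page size, the dictionary-based page tables, and the names `translate` and `PageFault` are illustrative, not taken from the slides.

```python
PAGE_SIZE = 4096  # assumed page size in bytes

class PageFault(Exception):
    """Raised when a virtual page has no mapping in main memory."""

def translate(page_table, vaddr):
    """Map a virtual address to a physical address using one program's page table."""
    vpn, offset = divmod(vaddr, PAGE_SIZE)    # virtual page number, offset within page
    if vpn not in page_table:
        raise PageFault(f"virtual page {vpn} is not resident")
    return page_table[vpn] * PAGE_SIZE + offset

# Each program has its own table, so the same virtual address can map
# to a different physical page in each program.
program_a = {0: 5, 1: 2}
program_b = {0: 7}

addr_a = translate(program_a, 4100)   # page 1, offset 4 -> physical page 2
addr_b = translate(program_b, 4)      # page 0, offset 4 -> physical page 7

try:
    translate(program_b, 4100)        # page 1 is not mapped for program B
    faulted = False
except PageFault:
    faulted = True                    # the OS would now fetch the page from disk
```

In a real system the fault is handled by the operating system, which loads the page from disk, updates the table, and restarts the access.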






# Disk Access Example

- Given
  - 512B sector, 15,000rpm, 4ms average seek time, 100MB/s transfer rate, 0.2ms controller overhead, idle disk
- Average read time
  - 4ms seek time
  - ½ rotation / (15,000/60 rotations per second) = 2ms rotational latency
  - 512B / 100MB/s = 0.005ms transfer time
  - 0.2ms controller delay
  - Total = 4 + 2 + 0.005 + 0.2 ≈ 6.2ms
- If actual average seek time is 1ms
  - Average read time = 3.2ms
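
The same arithmetic can be written as a small function. This is a sketch of the calculation above; the function name `avg_read_time_ms` and the reading of 1MB as 10^6 bytes are assumptions.

```python
def avg_read_time_ms(sector_bytes, rpm, seek_ms, transfer_mb_per_s, controller_ms):
    """Average time to read one sector from an idle disk, in milliseconds."""
    # On average the head waits half a rotation for the sector to arrive.
    rotational_ms = 0.5 / (rpm / 60) * 1000
    # Assumes 1 MB = 10**6 bytes.
    transfer_ms = sector_bytes / (transfer_mb_per_s * 1e6) * 1000
    return seek_ms + rotational_ms + transfer_ms + controller_ms

rated = avg_read_time_ms(512, 15_000, seek_ms=4.0, transfer_mb_per_s=100, controller_ms=0.2)
measured = avg_read_time_ms(512, 15_000, seek_ms=1.0, transfer_mb_per_s=100, controller_ms=0.2)
print(f"{rated:.1f} ms with 4 ms seek, {measured:.1f} ms with 1 ms seek")
```

Seek time and rotational latency dominate; the transfer of a single sector contributes only about 0.005ms.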

Chapter 6 — Storage and Other I/O Topics — 29









Processor assumes a certain memory addressing scheme:

- A block of data is called a virtual page
- An address is called a virtual (or logical) address
- Main memory may have a different addressing scheme:
  - A real memory address is called a physical address; the MMU (memory management unit) translates virtual addresses to physical addresses
  - The complete address translation table is large and must therefore reside in main memory
  - The MMU contains a TLB (translation lookaside buffer), which is a small cache of the address translation table
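
The relationship between the TLB and the full translation table can be sketched as a small cache in front of a larger lookup. The class below is illustrative only: the four-entry capacity, the LRU replacement, and the dictionary standing in for the in-memory table are assumptions, not a description of any particular MMU.

```python
from collections import OrderedDict

class TLB:
    """Small LRU cache of virtual-page to physical-page translations."""
    def __init__(self, page_table, capacity=4):
        self.page_table = page_table     # full table, resident in main memory
        self.capacity = capacity
        self.entries = OrderedDict()     # least recently used entry first
        self.hits = 0
        self.misses = 0

    def lookup(self, vpn):
        if vpn in self.entries:
            self.hits += 1
            self.entries.move_to_end(vpn)      # mark as most recently used
            return self.entries[vpn]
        self.misses += 1
        frame = self.page_table[vpn]           # slow path: read the table in memory
        self.entries[vpn] = frame
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # evict the least recently used entry
        return frame

page_table = {vpn: vpn + 100 for vpn in range(16)}   # hypothetical mappings
tlb = TLB(page_table, capacity=4)
for vpn in [0, 1, 0, 1, 2, 0, 7, 1]:
    tlb.lookup(vpn)
print(tlb.hits, tlb.misses)
```

Because programs reuse a few pages heavily, most lookups are served from the small cache and only the first touch of each page pays for a table access.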




# Memory Protection

- Different tasks can share parts of their virtual address spaces
  - But need to protect against errant access
  - Requires OS assistance
- Hardware support for OS protection
  - Privileged supervisor mode (aka kernel mode)
  - Privileged instructions
  - Page tables and other state information only accessible in supervisor mode
  - System call exception (e.g., syscall in MIPS)








|                                   | Intel Nehalem | AMD Opteron X4 |
|-----------------------------------|---------------|----------------|
| L1 caches<br>(per core)           | L1 I-cache: 32KB, 64-byte blocks, 4-way, approx LRU replacement, hit time n/a<br>L1 D-cache: 32KB, 64-byte blocks, 8-way, approx LRU replacement, write-back/allocate, hit time n/a | L1 I-cache: 32KB, 64-byte blocks, 2-way, LRU replacement, hit time 3 cycles<br>L1 D-cache: 32KB, 64-byte blocks, 2-way, LRU replacement, write-back/allocate, hit time 9 cycles |
| L2 unified<br>cache<br>(per core) | 256KB, 64-byte blocks, 8-way, approx LRU replacement, write-back/allocate, hit time n/a | 512KB, 64-byte blocks, 16-way, approx LRU replacement, write-back/allocate, hit time n/a |
| L3 unified<br>cache<br>(shared)   | 8MB, 64-byte blocks, 16-way, replacement n/a, write-back/allocate, hit time n/a | 2MB, 64-byte blocks, 32-way, replace block shared by fewest cores, write-back/allocate, hit time 32 cycles |

# Concluding Remarks (§5.12)

- Fast memories are small, large memories are slow
  - We really want fast, large memories ☹
  - Caching gives this illusion ☺
- Principle of locality
  - Programs use a small part of their memory space frequently
- Memory hierarchy
  - L1 cache ↔ L2 cache ↔ … ↔ DRAM memory ↔ disk
- Memory system design is critical for multiprocessors