CS641 Mar. 2

5.3 Cache Performance

Last time: intro, the idea of CPU time for a program. A cache miss is “lost time” to the system, but it is officially counted as CPU time, since the miss is handled completely by the CPU. The CPU just rates as slower because of all the cache misses.

We see it’s all up to the CPU to manage the bus and access to main memory. Here’s the breakdown:

°         CPU time = (CPU exec clock cycles + memory stall cycles) * clock cycle time

°         Memory stalls = read stalls + write stalls

°         Read stall cycles = reads per program * read miss rate * read miss penalty

°         Write stalls for write-through

          write stall cycles = writes/program * write miss rate * write miss penalty + write buffer stalls

          Write stalls for write-back: more complicated, defer

°         Write buffer stalls are usually small so can ignore

°         Assume read miss penalty = write miss penalty (a block is read in either case)

Then memory stall cycles = read stall cycles + write stall cycles

 = reads/program * read miss rate * miss penalty + writes/program * write miss rate * miss penalty

= (reads/program * read miss rate + writes/program * write miss rate) * miss penalty

reads/program * read miss rate + writes/program * write miss rate = missed accesses/program = miss rate * (accesses/program)

So memory stall cycles = memory accesses/program * miss rate * miss penalty

= instructions/program * misses/instruction * miss penalty  (pg. 476), since misses/instruction = (accesses/instruction) * miss rate

(Miss penalty expressed in #clock cycles)
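As a minimal sketch, here are the two equivalent forms of the formula in Python (the function and variable names are mine, not the book's):

    # Memory stall cycles = accesses/program * miss rate * miss penalty
    def stall_cycles_from_accesses(accesses, miss_rate, miss_penalty):
        return accesses * miss_rate * miss_penalty

    # Per-instruction form (pg. 476) = instructions/program * misses/instruction * miss penalty
    def stall_cycles_from_instructions(instructions, misses_per_instruction, miss_penalty):
        return instructions * misses_per_instruction * miss_penalty

Both return the same count whenever misses/instruction = (accesses/instruction) * miss rate.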

Performance example pg. 477

°         Instruction cache miss rate = 2%

°         Data cache miss rate = 4%

°         Processor cycles per instruction (CPI) = 2 (without memory stalls)

°         Miss penalty = 100 cycles

°         36% of instructions are loads and stores

How much faster would the processor run if the cache never missed?

°         Assume I = total instruction count

°         Instruction miss cycles = I * 2% * 100 = 2*I

°         Data miss rate - depends on what programs we run

°         SPECint2000 benchmark (36% of instructions are loads/stores)

°         Data miss cycles = I * 36% * 4% * 100 = 1.44*I

°         Memory stall cycles = 1.44*I + 2*I = 3.44*I

°         CPU clock cycles per instruction = 2 + 3.44 = 5.44, vs. 2 for a perfect cache

°         Speedup = 5.44/2 = 2.72: a system with a perfect cache is 2.72 times faster

What happens if the CPU is sped up by a factor of 2 (improving pipelining without increasing the clock rate), but the memory system is not?

°         CPI goes from 2 to 1

°         Using the same memory stall figure,

°         CPU clock cycles per instruction = 1 + 3.44 = 4.44, down from 5.44, so speedup = 5.44/4.44 = 1.22: only 22% faster

°         The fraction of time spent on memory stalls rose from 3.44/5.44 = 63% to 3.44/4.44 = 77%

°         Need to speed up memory: it’s the big problem here
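As a sanity check, here is the whole example in Python (a sketch; the instruction count I cancels out, so everything is per instruction):

    i_miss_rate, d_miss_rate = 0.02, 0.04   # instruction / data cache miss rates
    loads_stores = 0.36                     # fraction of instructions that are loads/stores
    miss_penalty = 100                      # cycles

    stalls = i_miss_rate * miss_penalty + loads_stores * d_miss_rate * miss_penalty
    print(stalls)                           # 3.44 stall cycles per instruction

    cpi = 2 + stalls                        # 5.44 with cache misses
    print(cpi / 2)                          # 2.72x faster with a perfect cache

    cpi_fast = 1 + stalls                   # 4.44 after doubling CPU speed
    print(cpi / cpi_fast)                   # 1.22, i.e. only 22% faster overall
    print(stalls / cpi, stalls / cpi_fast)  # stall fraction: 0.63 -> 0.77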

 

General Approach: minimize Average Memory Access Time (AMAT)

AMAT = Hit Time x Hit Rate  +  Miss Time x Miss Rate

Miss Rate = 1 - Hit Rate


= Hit Time x (1 - Miss Rate) + Miss Time x Miss Rate

= Hit Time + Miss Rate x (Miss Time - Hit Time)

= Hit Time + Miss Rate x Miss Penalty
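For example (illustrative numbers, not from the text): with a 1-cycle hit time, a 5% miss rate, and a 100-cycle miss penalty, AMAT = 1 + 0.05 x 100 = 6 cycles.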

Hit Time = time to find and retrieve data from current level cache

Miss Penalty = time to fetch data from a lower level of memory hierarchy, including the time to access the block, transmit from one level to another, and insert it at the higher level

Hit Rate = % of requests that are found in current level cache

Miss penalty is determined by time to fetch block from next level of memory hierarchy + time to load into the cache

time to fetch block = latency to get first item + transfer time for each successive item (assume interleaved memory)
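For example (illustrative numbers): with 10 cycles of latency to the first word and 1 cycle per word after that, fetching a 4-word block costs 10 + 3 x 1 = 13 cycles.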

Early restart: resume execution as soon as the required word is available, rather than waiting for the entire block.

Often used for instructions; why not for data? (Instruction accesses are more sequential, but it is done for data in the Pentium 4.)

Fully Associative Cache: just the idea, since not on syllabus

In this picture, add comparators coming into each cache entry from the left. They work in parallel to detect a tag match and report to the controller.

 

Tags are checked in parallel, by lots of circuitry

So blocks can be stored anywhere (fully associative cache)

Hybrid: set associative cache
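As a rough software analogy of the idea (hardware compares all the tags at once; this Python sketch, with made-up names, compares them one at a time):

    # Fully associative: a block can live in any entry, so check every tag.
    def fa_lookup(cache, tag):
        for entry in cache:                 # hardware does these compares in parallel
            if entry["valid"] and entry["tag"] == tag:
                return entry["data"]        # hit
        return None                         # miss

    # Set associative: the index picks one set; only that set's tags are checked.
    def sa_lookup(sets, set_index, tag):
        return fa_lookup(sets[set_index], tag)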

Improving Miss Penalty with Multilevel Caches (pg. 487-489)

°         When caches first became popular, Miss Penalty ~ 10 processor clock cycles

°         Today: 1000 MHz processor (1 ns per clock cycle) and 100 ns to go to DRAM ⇒ 100 processor clock cycles!

Solution: another cache between memory and the processor cache: Second Level (L2) Cache

Analyzing Multi-level cache hierarchy

Avg Mem Access Time =

     L1 Hit Time + L1 Miss Rate * L1 Miss Penalty

     L1 Miss Penalty = L2 Hit Time + L2 Miss Rate * L2 Miss Penalty

Avg Mem Access Time =

     L1 Hit Time + L1 Miss Rate *

         (L2 Hit Time + L2 Miss Rate * L2 Miss Penalty)

°         L1

          size: tens of KB, say 16KB

          hit time: complete in one clock cycle

          miss rates: 1-5%

°         L2:

          size: hundreds of KB/MBs, say 1MB

          hit time: few clock cycles (SRAM over a very fast bus, say 3ns)

          miss rates: 10-20%

°         L2 miss rate is the fraction of L1 misses that also miss in L2

See example on pg. 487 that shows how adding an L2 cache speeds up a certain system by a factor of 2.6.
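As a sketch, the two-level formula in Python, with illustrative numbers in the ranges above (not the pg. 487 example):

    def amat_two_level(l1_hit, l1_miss_rate, l2_hit, l2_miss_rate, l2_penalty):
        # L1 miss penalty = L2 hit time + L2 miss rate * L2 miss penalty
        return l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * l2_penalty)

    print(amat_two_level(1, 0.05, 5, 0.20, 100))  # 1 + 0.05*(5 + 20) = 2.25 cycles
    print(1 + 0.05 * 100)                         # 6.0 cycles with no L2 at all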

 

Sec. 1.5: The Power Wall

See Fig. 1.15 for power use of processors, or why your electricity bill is so high

A faster processor needs a faster clock, but a faster clock means more switching per second at each transistor, which means more power use. See the equation on pg. 39: dynamic power = capacitive load x voltage² x frequency. Voltage has been lowered over the generations, so power hasn't risen in a strict line over time, but there's a big problem when we get up to 100W, like an old standard light bulb, inside a box. And the voltage can't be reduced further without causing other problems.
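A tiny sketch of that relationship in Python (the parts and numbers are illustrative, not from the book):

    def dynamic_power(cap_load, voltage, frequency):
        # power grows linearly with clock rate but with the square of voltage
        return cap_load * voltage**2 * frequency

    p_old = dynamic_power(1.0, 5.0, 100e6)  # an old part: 5 V at 100 MHz
    p_new = dynamic_power(1.0, 1.0, 3e9)    # a modern part: 1 V at 3 GHz
    print(p_new / p_old)                    # 1.2: 30x the clock for similar power,
                                            # only because voltage fell 5x, and it
                                            # can't drop much further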

Sec. 1.6:  The Big Switch to Multiprocessors

In the last few years, multi-core processors have become common. They are great for web servers and database servers, and of some use to home users. This means we programmers need to know about parallelism.

Chap. 7 Multicores, etc., intro

Terminology

core: one processor, usually in a multi-core chip

microprocessor: one whole processor chip, today often containing several cores

serial vs. parallel: how many processors are in use

parallel processing program: one that can utilize multiple processors (cores)

multiprocessing: utilizing multiple processors

multithreading: ability to run multiple threads (doesn’t require multiprocessing, just a decent OS)

Theme: Effective parallel programming is hard