Hacker News
The other hidden cycle-eating demon: code cache misses (multimedia.cx)
42 points by DarkShikari 3 days ago | 7 comments




4 points by DarkShikari 3 days ago | link

This post is a followup to this one: http://news.ycombinator.com/item?id=803826


2 points by Hoff 3 days ago | link

Cache sizing and memory latency are linked.

Up to a point, larger caches can improve aggregate performance in higher-latency processor designs. Beyond that point, larger caches aren't performance- or cost-effective. Conversely, in lower-latency designs cache misses are less costly, so the L1 and particularly the L2 caches can be smaller.

For a reasonable comparison of what changing the latencies within a design can provide, here is a LANL write-up from the Alpha microprocessor environment, and where Alpha EV7 had (for its time) low interprocessor and low memory latency and with toroidal processor links as compared with its Alpha EV68 predecessors and hierarchical or bus-based systems:

http://www.c3.lanl.gov/PAL/publications/papers/kerbyson02:EV...

Among the x86 designs, the Xeon Nehalem-class processors have substantially better memory latencies (around 27 ns local, 54 ns remote) than previous generations of Xeon processors, and rather better than the Alpha latencies discussed in the LANL document. This means the effects of different cache sizes or access patterns can change.

Branches, too, can play havoc with the instruction streams and with the efficacy of caching and of instruction decode. Branch often and performance can suffer. Highly pipelined designs can take bigger performance hits with branches.


2 points by briansmith 3 days ago | link

Why do you assume that you get the whole L1 code cache to yourself? I would think that on a real desktop system you would be lucky to get even half of it.


6 points by DarkShikari 3 days ago | link

The L1 cache is only 32 (or 64) kilobytes. One program's time slice lasts at least a few milliseconds, easily 100 times longer than it takes to fill the cache with that program's code.

The OS can't dedicate parts of the L1 cache to different applications (the CPU doesn't offer any feature to allow it to do so), nor would it be a good idea to do so.


3 points by briansmith 3 days ago | link

I see now. You take for granted that the L1 cache will get completely replaced during every context switch. And it wouldn't matter anyway, because the context switch itself is already a massive performance hit relative to the L1 cache misses.


2 points by ntoshev 3 days ago | link

I would expect hyperthreaded cores to share their cache between both threads, though.


1 point by DarkShikari 3 days ago | link

That could actually be a good explanation for why reducing code cache pressure can help even in cases where it seemingly shouldn't: another thread is also using that cache.

Though I wonder whether that's true of all SMT chips; do any have per-thread L1 caches for exactly this reason?




