When Spatial and Temporal Locality Collide: The Case of the Missing Cache Hits

Publication Type:

Conference Paper

Source:

Proceedings of the 4th ACM/SPEC International Conference on Performance Engineering, ACM, New York, NY, USA (2013)

ISBN:

978-1-4503-1636-1

URL:

http://doi.acm.org/10.1145/2479871.2479883

Keywords:

l1 data cache, padding, pipeline bubble, tilepro64

Abstract:

Even the simplest hardware, running the simplest programs, can behave in the strangest of ways. Tracking down the cause of a performance anomaly without the complete hardware reference of a processor is a prime example of black-box architectural exploration. When doubling the work of a simple benchmark program, run on a single core of Tilera's TILEPro64 processor, did not double the number of consumed cycles, a mystery was unveiled. After ruling out different levels of optimization for the two programs, a cycle-accurate simulation attributed the sub-optimal performance to an abnormally high number of L1 data cache misses. Further investigation showed that the processor stalled on every Read-After-Write instruction sequence when the following two conditions were met: 1) there are 0 or 1 instructions between the write and the read instruction and 2) the read and the write instructions target distinct memory locations that share an L1 cache line. We call this performance pitfall a RAW hiccup. We describe two countermeasures, memory padding and the explicit introduction of pipeline bubbles, that sidestep the RAW hiccup.
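The memory-padding countermeasure named in the abstract can be illustrated with a minimal C sketch. It is not taken from the paper: the struct names, the helper functions, and especially `ASSUMED_LINE_BYTES` are hypothetical and chosen only for illustration, and the line size is a placeholder and not a TILEPro64 figure. The sketch shows a layout in which a written and a read location share a cache line (condition 2), and a padded layout that moves the read location onto a different line.

```c
#include <stddef.h>

/* Assumption for illustration only: substitute the target's real L1 line size. */
#define ASSUMED_LINE_BYTES 64

/* Both fields fall in one line when the object is line-aligned, so a write
   to `w` followed closely by a read of `r` matches condition 2. */
struct unpadded {
    int w;
    int r;
};

/* Padding places `r` exactly one assumed line size after `w`. */
struct padded {
    int w;
    char pad[ASSUMED_LINE_BYTES - sizeof(int)];
    int r;
};

/* Line index of a byte offset, relative to a line-aligned base address. */
static size_t line_of(size_t offset) {
    return offset / ASSUMED_LINE_BYTES;
}

/* Returns 1 if two field offsets share a line (line-aligned object), else 0. */
static int shares_line(size_t a, size_t b) {
    return line_of(a) == line_of(b);
}

/* The access pattern itself: a store to one location, then a load from a
   distinct location. `volatile` keeps the compiler from dropping either
   access; how many instructions end up between them depends on the compiler
   and the target. */
static int write_then_read(volatile int *dst, volatile int *src, int value) {
    *dst = value;
    return *src;
}
```

Comparing `offsetof(struct unpadded, r)` with `offsetof(struct padded, r)` through `shares_line` shows the effect of the padding on the layout. Whether a stall actually occurs can only be observed on the hardware or in a cycle-accurate simulator. The second countermeasure, inserting pipeline bubbles, would instead place additional instructions between the store and the load and is not sketched here.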