Perils of performance analysis on single core

A new article on the Raspberry Pi (Broadcom BCM2835) memory hierarchy is almost ready. The first code has already been posted.

I’ve been working on multi-core processors for so long that I forgot what it’s like to take measurements on a single-core machine like the Raspberry Pi.

In an ideal world, a benchmark or performance test program has the machine to itself and no other program or system activity perturbs it. Measurements on the ideal machine accurately reflect the dynamic behavior and performance of the program. On multi-core, you can usually assign the test program to an idle core (or two), preferably a core that is free of operating system activity. With careful process or thread placement, results on multi-core approach the ideal.
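As a rough illustration of process placement on a multi-core Linux machine, here is a minimal sketch that pins the calling process to one core. The helper name `pin_to_cpu` is made up for this example, and the sketch assumes a Linux system where Python's `os.sched_setaffinity` is available. It only restricts where the test program runs; it does nothing to keep other tasks or the operating system off that core.

```python
import os

def pin_to_cpu(cpu):
    """Restrict the calling process to one CPU and return the resulting affinity set.

    Hypothetical helper for illustration; assumes Linux.
    """
    if not hasattr(os, "sched_setaffinity"):
        raise OSError("CPU affinity is not available on this platform")
    os.sched_setaffinity(0, {cpu})    # pid 0 means the calling process
    return os.sched_getaffinity(0)    # read the setting back to confirm it

if __name__ == "__main__":
    # Choose the highest-numbered CPU this process is allowed to use.
    allowed = sorted(os.sched_getaffinity(0))
    print(pin_to_cpu(allowed[-1]))
```

On a single-core machine the allowed set has only one member, so there is nothing to choose.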

On single core, we don’t have that luxury. The test program has to share the one core with other programs and the operating system. On Raspberry Pi, Linux fires up services that run periodically. Even if we shut the services off, the system clock continues to run and it generates interrupts. At the very least, extraneous activity affects elapsed, user, and system time measurements.
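To make the three kinds of time concrete, here is a minimal sketch that wraps a workload and reports elapsed, user, and system time, plus the number of involuntary context switches the process experienced along the way. The function name `measure` and the dictionary keys are made up for this example; it assumes a Unix-like system where Python's `resource` module is available. A nonzero switch count is only a hint that something else used the core during the run, not a measure of how much the run was disturbed.

```python
import resource
import time

def measure(workload):
    """Run workload() once and report times and involuntary context switches.

    Hypothetical helper for illustration; assumes a Unix-like system.
    """
    before = resource.getrusage(resource.RUSAGE_SELF)
    start = time.perf_counter()
    workload()
    elapsed = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_SELF)
    return {
        "elapsed_s": elapsed,                          # wall-clock time
        "user_s": after.ru_utime - before.ru_utime,    # CPU time in user mode
        "system_s": after.ru_stime - before.ru_stime,  # CPU time in the kernel
        # Times the kernel took the CPU away from this process.
        "involuntary_switches": after.ru_nivcsw - before.ru_nivcsw,
    }

if __name__ == "__main__":
    print(measure(lambda: sum(range(1_000_000))))
```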

When we measure performance events, however, there is a deeper level of interference. The core has one physical level 1 (L1) data cache, one physical MicroTLB, one physical Main TLB and one physical branch history table. These microarchitectural components are not visible at the architectural level, but they must be shared between programs and the OS. A context switch may cause a cache or TLB flush, which invalidates the entire contents of the cache or TLB. The cache, TLB or branch history may also be partially polluted by other software activity. The final performance event counts are affected by flushes and pollution and do not accurately reflect the behavior of the test program.

I ran into this issue while characterizing the memory hierarchy with performance events. One test case is designed to exercise only the L1 data cache and never touch primary memory. Yet the test case showed a rather significant number of data cache misses beyond the compulsory misses that I would have expected. The extra misses are most likely caused by timer interrupts. I now think of these extra misses as “background radiation” that biases the measurements.
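One way to get a feel for this kind of bias is to repeat a test case several times and compare the miss counts across runs. The sketch below is only an illustration: the function name `summarize_runs` is made up, the counts in the usage example are invented numbers and not measurements from the Raspberry Pi, and collecting the counts is left out entirely. It treats the smallest count as the least-disturbed run, which is an assumption; that run may still include some interference, so the floor is not the true compulsory miss count.

```python
import statistics

def summarize_runs(miss_counts):
    """Summarize data cache miss counts from repeated runs of one test case.

    Hypothetical helper for illustration. The minimum is used as a rough
    floor, assuming the least-disturbed run is closest to the ideal.
    """
    floor = min(miss_counts)
    return {
        "floor": floor,
        "median": statistics.median(miss_counts),
        # Misses in each run beyond the least-disturbed run.
        "excess": [count - floor for count in miss_counts],
    }

if __name__ == "__main__":
    # Invented counts, for illustration only.
    print(summarize_runs([100, 104, 100, 180, 101]))
```

A wide spread between the floor and the other runs suggests that interference is a large share of what is being counted.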

Such are the perils of performance measurement and analysis on single core!