An introduction to performance tuning (and counters)

My latest page is an overview of performance tuning on ARM11. The Raspberry Pi is a nifty little Linux box, but it’s kind of slow at 700MHz. Therefore, I suspect that programmers will have an interest in tuning up application programs and making them run faster. Performance tuning is also a good opportunity to learn more about computer architecture and machine organization, especially the ARM1176 core at the heart of the Raspberry Pi and its memory subsystem.

The ARM1176 has three performance counters which can measure over 20 different microarchitectural events. One of these counters is dedicated to core clock cycles while the other two are configurable. The new performance tuning page has a brief overview of the counters and it has a table with the supported events.

The new page also describes two different use cases for the counters: caliper mode and sampling mode. Caliper mode counts the number of microarchitectural hardware events that occur between two different points in program execution. Caliper mode is good for measuring the number of data cache accesses and misses for a hot code region like a loop. The programmer inserts code to start counting at the beginning of the hot region and inserts code to stop counting at the end of the hot region. This is the easiest use case to visualize and to implement. It’s the approach that I’m taking with my first performance measurement software and experiments (a custom kernel module plus some user-space code). These experiments are almost finished and ready for write up.

Sampling is a statistical technique that produces an event profile. A profile shows the distribution of events across program instructions, routines, source lines, or modules. This is a good way to find hot-spots in a program where tuning is most beneficial. Sampling does not require modification to source.

Performance Events for Linux (informally called “PERF”) is the standard tool for program profiling on Linux. At the moment, PERF has a bug which prevents it from sampling hardware events. I’ve been looking into this problem, too, and hope to post some results. In the long-run, I want to post examples using PERF in order to help people tune up their programs on Raspberry Pi.