Performance measurement and tuning - Sand, software and soundSand, software and sound

Performance measurement and tuning is practical because programmers often need to improve the performance of their programs. By now, you know that the Raspberry Pi is not the fastest machine in the world and programs running on the RPi need a tune up to make users happy. Performance tuning is a bit of a game:

My program ran in 30 seconds? Can I make it run in 15 seconds? 10 seconds? Less?
My game game has a frame rate of 20 frames per second (FPS) — not so good. Can I increase the frame rate to 30FPS or more?
Can I find a better algorithm to solve the problem?
Is my program taking good advantage of the processor pipelines? The memory subsystem?
Can I make memory access more predictable? Faster?

This page is the jumping off point for my experiments in performance measurement and tuning on Raspberry Pi. I hope it inspires experiments of your own!

Two key aspects of performance tuning are algorithms and microarchitecture. The traditional course of study in computer science teaches about algorithms, space-time complexity (“big O”) and why efficient algorithms are favored over inefficient algorithms (e.g., why an O(N log N) algorithm is better for sorting than O(N²)). Plenty has already been written about algorithms and many good textbooks and on-line resources are available.

On the other hand, how about microarchitecture? Programs should be written to be in harmony with the underlying machine structure. Computer architects design the microarchitecture to take advantage of certain kinds of program behavior such as sequential access to data in memory or predictable if-then-else decision making. This is where knowledge of the underlying microarchitecture can be applied to coding practice in order to write faster programs. Or, if you’re using the “get it right then make it fast approach,” knowledge of the microarchitecture and actual program behavior can be used to tune the code for speed.

Parallelism, by the way, is an important part of performance. Even a uni-processor like the ARM1176 in the RPi has multiple pipelines which execute instructions in parallel. Whenever there is an opportunity for parallelism (concurrent execution), go for it!

If you would like to know more about the microarchitecture of the ARM11 processor in the Raspberry Pi right now, please read the page on ARM11 microarchitecture. We have more to say about microarchitecture when we discuss the benchmark programs.

Performance counters and events

The Raspberry Pi, like many modern processors, is equipped with hardware performance counters. The performance counters count hardware-level microarchitectural events. The event counts give important clues about dynamic program behavior. These events may indicate the presence of a performance bottleneck which needs to be fixed up in the code. Performance tuning takes detective work, just like debugging!

The ARM1176 processor in the Raspberry Pi has three performance counters. All three performance counters are 32 bits in length (32-bit unsigned integers). 4,294,967,296 (2^32) events can be counted before a counter rolls over to 0. When the counter rolls over, it’s called an overflow.

One of the three counters is dedicated to counting processor cycles, that is, the number of hardware clock ticks. You probably know that the RPi has a 700MHz clock. So, this counter increments once per processor cycle, 700,000,000 times per second. Even though the counter is 32 bits, the count can easily overflow in about 6.13 seconds. This counter can be scaled by 64 so that it increments once every 64 processor clock ticks. Scaling is the most practical mode of operation for this counter unless cycle accuracy is absolutely paramount.

The other two performance counters are configurable. You, the programmer, get to choose the event that you want to count. Here is a table of the performance events supported by the ARM1176 (RPi).

ARM1176 performance events
Event number	Event
0x0	Instruction cache miss
0x1	Instruction buffer cannot deliver instruction
0x2	Data dependency stall
0x3	Instruction MicroTLB miss
0x4	Data MicroTLB miss
0x5	Branch instruction executed
0x6	Branch mispredicted
0x7	Instructions executed
0x9	Data cache access (nonsequential, cacheable only)
0xA	Data cache access (nonsequential, inc. noncacheable)
0xB	Data cache miss
0xC	Data cache write-back
0xD	Software changed the PC
0xF	Main TLB miss
0x10	Explicit external data access
0x11	Load Store Unit request queue full
0x12	Write buffer drained due to synchronization
0x23	Procedure call instruction executed
0x24	Procedure return instruction executed
0x25	Return address predicted correctly
0x26	Return address predicted incorrectly
0xFF	Cycles

At this point, don’t worry too much if you don’t understand what these events mean. The performance events are defined with respect to the underlying ARM1176 microarchitecture. We illustrate the meaning of these events when we discuss the benchmark programs. In the meantime, though, look at event number 0x7. This event occurs when an instruction successfully completes. Successful completion is often called retirement. When a performance counter is configured for event number 0x7, it counts the number of instructions executed, i.e., the counter increments every time an instruction successfully completes (when an instruction retires).

It’s worth emphasizing that the events depend upon the microarchitecture of the specific ARM processor that you are using. Although the ARM1156 and ARM1176 share some of the same events, the 1156 and 1176 have events which are unique to each model. Further, the number of counters and events are quite different between ARM11 and ARM Cortex-A8. You definitely need to RTFM when dealing with microarchitecture and performance monitoring!

Using counters: Calipers and samples

There are two different styles of operation: caliper mode and sampling mode.

A physical caliper is an adjustable instrument that measures the distance between two points. Like a real world caliper, the performance counters in caliper mode measure the number of hardware events between two points in the execution of a program. Let’s say that we want to measure the number of instructions that are executed by a loop in a program and the number of branches in the loop. Before the loop, we insert code to:

Select event number 0x7 (which measures instructions executed) and select event number 0x5 (which measures the number of branches executed),
Clear the performance counters (which sets the initial counts to zero), and
Start the counters.

After the loop, we insert code to stop the counters and read the final counts. The resulting code might look like:

uint32_t cycles, instructions, branches ; ... start_counting(ARMV6_EVENT_INSTR_EXEC, ARMV6_EVENT_BR_EXEC) ; while( tolerance > 0.001 ) { // Loop body ... } stop_counting(&cycles, &instructions, &branches) ;

where the start, clear and stop operations are embedded in the functions start_counting() and stop_counting(). The symbolic constants ARMV6_EVENT_INSTR_EXEC and ARMV6_EVENT_BR_EXEC select the performance events and the variables cycles, instructions and branches receive the final event counts. Once the program has the final counts, the number of events can be displayed, saved to a file, or whatever. Notice that start_counting() arms the cycle counter, too, since we always get processor cycles for free.

That’s all there is to caliper mode!

Sampling mode is more informative, but requires more software infrastructure. In sampling mode, the counters are configured to generate an interrupt on overflow. The counters are initialized with a value called the sampling period which determines how often each counter overflows and an interrupt is generated. When a sampling interrupt is generated, the interrupt handler (in the operating system) produces a data item called a sample. You should think of the sampling period as a weighting factor that specifies how many events a single sample represents. If a sample is taken after 25,000 instructions execute, for example, then each sample represents 25,000 executed instructions.

Samples are stored away, aggregated and displayed by a user-space performance measurement tool called a profiler. One of the things remembered with each sample is the place where the program was (the program counter or PC) when the sample was taken. The profiler collects a large number of samples and uses all of this information to build up a histogram-like picture called a profile. The profile shows the distribution of events across the program code. When the profiler measures executed instructions, the profile shows the most heavily executed portions of the program. The places with the most traffic are called hot-spots. The hot-spots are the best places to tune because they have the highest performance impact and pay-off.

Accessing and controlling the counters

The performance counters and the hardware register that controls them are read and written with the special ARM11 instructions MRC and MCR, respectively. The MRC and MCR instructions are privileged instructions and are ordinarily executed by the operating system. If a user-space program attempts to execute an MRC or MCR instruction, the program throws an illegal instruction exception and terminates abnormally.

The ARM11 makes a special exemption, however. The operating system can allow user-space programs to read and write the performance counters and the performance monitor control register through MRC and MCR. In kernel mode, Linux must flip the appropriate bit in the Secure User and Non-secure Access Validation Control Register. (Whew, what a name!) When this bit is turned on, a user-space program can read and write the performance monitoring unit registers using MRC and MCR. Sounds great, but user-space software must be careful to NOT enable counter overflow interrupts because user-space software has no way to handle the interrupts. Further, the kernel does not have a driver to handle the interrupts and the system may become unstable or crash.

Raspbian Wheezy does not provide an off-the-shelf way to enable user-space access. So, what do we do now?

Tools, tools, tools

Let’s say that we want to try our own hand at user-space performance counter access in caliper mode. The MRC and MCR instructions give us very fast and immediate access to the counters. The PERF SYSCALL interface is relatively slow and heavy-weight in comparison. Fast access may be an advantage in certain use cases. In order to access the counters from user-space, we need:

A loadable kernel module that flips the right bit in the Secure User and Non-secure Access Validation Control Register, and
A small library of C functions that configure, clear, start and stop the performance counters using the MRC and MCR instructions.

That takes me to the first round of experiments. I’ll show you how and provide the code.

Performance Events for Linux (PERF) is the standard tool for performance measurement on Linux. PERF supports both caliper mode and sampling mode. PERF uses a system call (SYSCALL) to configure and access the counter hardware. It also provides a pretty solid set of user-space profiling tools.

Unfortunately, at the time of this post (May 14, 2013), PERF refuses to collect hardware counter events. This is a bug which will eventually be repaired. We will eventually demonstrate PERF.