[At the time of this post, Performance Events for Linux (PERF) supports only software events on Raspberry Pi. Rpistat is a tool that counts hardware performance events in the style of perf stat and is a poor man’s substitute.]
So far, my example programs for the Raspberry Pi (matrix multiplication, pointer chasing) are instrumented with explicit code to measure and display hardware performance events. The instrumentation code executes privileged ARM instructions to read and write the performance counters and control register. The aprofile kernel module must be loaded in order to enable user-space access to the ARM1176 performance monitoring unit (PMU). If aprofile is not loaded, the privileged instructions trap to the OS and the process terminates abnormally.
This approach, which I call self-measurement, requires modification to the application source code. While self-measurement collects performance event data about specific code regions in the application, source code changes are intrusive, inconvenient and involve more work.
Performance Events for Linux (PERF) is a profiling infrastructure that provides a suite of performance analysis tools on Linux. The perf stat tool launches a workload and counts (selected) hardware or software performance events caused by the workload. No changes to the workload are required at either the source or binary levels.
Rpistat is similar to perf stat. It launches a workload and counts nine of the most commonly used ARM1176 hardware performance events while the workload executes. Running rpistat is easy. Just enter “rpistat” followed by the command you would normally use to launch the workload from the shell.
rpistat naive
In the example above, the workload is the naive matrix multiplication program. Rpistat writes a file named rpistat.txt when the workload completes. The file contains a basic performance report. Here is example output from rpistat.
***************************************************************
rpistat: ./naive
Thu Jun 27 09:58:08 2013
***************************************************************

System information
Number of processors: 1

Performance events
[ ... ] = scaled event count
PTI = per thousand instructions
Total periods: 169

Cycles: 11,759,598,287
Instructions: 315,810,640 [1,241,209,259]
IBUF stall cycles: 65,981,902 [259,324,219]
Instr periods: 43
CPI: 9.474
IBUF stall percent: 2.205 %

DC cached accesses: 4,558,795 [18,343,722]
DC misses: 933,837 [3,757,582]
DC periods: 42
DC miss ratio: 20.484 %

MicroTLB misses: 224,886 [904,898]
Main TLB misses: 172,973 [696,010]
TLB periods: 42
Micro miss rate: 0.729 PTI
Main miss rate: 0.561 PTI

Branches: 33,438,664 [134,550,814]
Mispredicted BR: 366,383 [1,474,255]
BR periods: 42
Branch rate: 108.403 PTI
Mispredict ratio: 1.096 %
The report includes raw event counts, scaled event counts, rates and ratios. The raw event counts are the actual number of events counted by the hardware performance counters. A rate tells us how often a given event is occurring in terms of events per thousand instructions (PTI). A ratio tells us what portion of events have a certain property, such as the percentage of non-sequential data cache accesses that result in a miss.
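The rates and ratios in the report can be reproduced from the counts with a few lines of C. This is a minimal sketch and not code taken from rpistat.c; the function names are hypothetical, and the guard against a zero denominator is my own assumption about how to handle a workload that is too short to measure.

```c
#include <stdint.h>

/* Rate: events per thousand instructions (PTI).
 * Hypothetical helper, not the rpistat.c implementation. */
double per_thousand_instructions(uint64_t events, uint64_t instructions)
{
    if (instructions == 0) {
        return 0.0;   /* assumption: report 0 when nothing was counted */
    }
    return 1000.0 * (double)events / (double)instructions;
}

/* Ratio: the share of events that have some property, as a percentage
 * (for example, data cache misses out of data cache accesses). */
double percent_ratio(uint64_t part, uint64_t whole)
{
    if (whole == 0) {
        return 0.0;
    }
    return 100.0 * (double)part / (double)whole;
}

/* Cycles per instruction (CPI). */
double cycles_per_instruction(uint64_t cycles, uint64_t instructions)
{
    if (instructions == 0) {
        return 0.0;
    }
    return (double)cycles / (double)instructions;
}
```

Using the numbers from the sample report, per_thousand_instructions(904898, 1241209259) comes to about 0.729 (the Micro miss rate), percent_ratio(933837, 4558795) comes to about 20.484 (the DC miss ratio), and cycles_per_instruction(11759598287, 1241209259) comes to about 9.474 (the CPI).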
Rpistat uses a time division multiplexing scheme to periodically switch the performance counters across the nine events of interest. Rpistat makes a switch every 100 milliseconds (0.1s). The ARM1176 dedicates one performance counter to processor cycles. Rpistat exploits this counter to the max and measures processor cycles all the time (a 100% duty-cycle for this event). Rpistat switches through the other eight events in pairs called event sets. Each event set is measured for about 25% of the overall time. Rpistat keeps track of the active time for each event and scales the raw event counts up to full duty-cycle estimates. These scaled event counts are displayed within square brackets [ ... ] in the output file.
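The multiplexing and scaling arithmetic can be sketched as follows. This is an illustration under stated assumptions, not the code in rpistat.c: the function names are hypothetical, the strict round-robin order of event sets is assumed, and so is the truncating integer division. The sample report is consistent with both assumptions (169 total periods split 43/42/42/42, and each bracketed count equal to raw × total periods ÷ active periods, rounded down).

```c
#include <stdint.h>

/* Number of measurement periods in which a given event set was active,
 * assuming the sets are rotated in strict round-robin order starting
 * with set 0. Hypothetical helper. */
uint64_t active_periods_for_set(uint64_t total_periods,
                                uint64_t set_index,
                                uint64_t num_sets)
{
    uint64_t base  = total_periods / num_sets;
    uint64_t extra = total_periods % num_sets;
    /* The first `extra` sets each get one additional period. */
    return base + (set_index < extra ? 1 : 0);
}

/* Scale a raw count, observed during `active_periods`, up to an
 * estimate for the full run of `total_periods`.
 * Note: raw * total_periods could overflow 64 bits for extremely
 * large inputs; this sketch does not guard against that. */
uint64_t scale_count(uint64_t raw,
                     uint64_t total_periods,
                     uint64_t active_periods)
{
    if (active_periods == 0) {
        return 0;   /* assumption: event set never ran, no estimate */
    }
    return raw * total_periods / active_periods;
}
```

With the report's values, scale_count(315810640, 169, 43) returns 1,241,209,259 for instructions and scale_count(4558795, 169, 42) returns 18,343,722 for data cache accesses, which match the bracketed counts.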
When rpistat switches between event sets, it first accumulates the current counts into 64-bit unsigned integers called virtual counters. Rpistat avoids overflow problems because the measurement period (0.1s) is much shorter than the time needed to overflow the 32-bit hardware counters. The long virtual counter length (64 bits) postpones any real overflow for a very long time, effectively eliminating the practical possibility of an overflow.
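A rough sketch of the accumulation step and of the overflow argument, in C. The struct layout, the names, and the event indexing are all hypothetical, and the hardware count is passed in as a plain argument because reading the real counters requires the privileged instructions described earlier. The 700 MHz figure in the usage note is an assumption (the stock clock of the original Raspberry Pi).

```c
#include <stdint.h>

#define NUM_EVENTS 9   /* cycles plus four event sets of two events */

/* 64-bit virtual counters, one per event, plus the number of
 * measurement periods each event has been active. Hypothetical layout. */
struct virtual_counters {
    uint64_t count[NUM_EVENTS];
    uint64_t periods[NUM_EVENTS];
};

/* Fold one period's 32-bit hardware count into the 64-bit virtual
 * counter. The caller is assumed to zero the hardware counter before
 * the next period starts. */
void accumulate(struct virtual_counters *vc, int event, uint32_t hw_count)
{
    vc->count[event]   += hw_count;
    vc->periods[event] += 1;
}

/* Time, in seconds, for a 32-bit counter to wrap at a steady event rate. */
double seconds_to_overflow(double events_per_second)
{
    return 4294967296.0 / events_per_second;   /* 2^32 events */
}
```

Even an event that fired once per clock cycle at 700 MHz would need a little over 6 seconds to wrap a 32-bit counter, which is about 60 times longer than the 0.1s measurement period.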
If you would like to know more about rpistat’s implementation including important design concerns and limitations, please read about it here.
How does rpistat compare against self-measurement for accuracy? Here’s a quick comparison between the two approaches when they are applied to the naive (textbook) matrix multiplication program.
Metric | Self-measurement | Rpistat |
---|---|---|
Elapsed time | 16.0s | 16.6s |
Scaled cycles | 181,984,943 | 184,288,556 |
Instructions | 1,190,897,822 | 1,245,571,856 |
DC access | 17,588,053 | 18,587,014 |
DC miss | 2,982,314 | 3,735,024 |
MicroTLB miss | 809,154 | 904,166 |
Main TLB miss | 612,688 | 661,667 |
CPI | 9.78 CPI | 9.47 CPI |
DC access rate | 14.77 PTI | 14.92 PTI |
DC miss rate | 2.50 PTI | 2.99 PTI |
DC miss ratio | 17.0% | 20.1% |
MicroTLB rate | 0.68 PTI | 0.73 PTI |
Main TLB rate | 0.51 PTI | 0.53 PTI |
The event counts, rates and ratios are within normal run-to-run variability in every case.
The matrix multiplication program has fairly consistent (uniform) dynamic behavior throughout its lifetime. Workloads with distinct processing phases may not fare as well. YMMV. Please see the rpistat page for more information and links to source (rpistat.c).