rpistat: RPi event monitoring tool - Sand, software and sound

[At the time of this post, Performance Events for Linux (PERF) supports only software events on Raspberry Pi. Rpistat is a tool that counts hardware performance events in the style of perf stat and is a poor man’s substitute.]

So far, my example programs for the Raspberry Pi (matrix multiplication, pointer chasing) are instrumented with explicit code to measure and display hardware performance events. The instrumentation code executes privileged ARM instructions to read and write the performance counters and control register. The aprofile kernel module must be loaded in order to enable user-space access to the ARM1176 performance monitoring unit (PMU). If aprofile is not loaded, the privileged instructions trap to the OS and the process terminates abnormally.

This approach, which I call self-measurement, requires modification to the application source code. While self-measurement collects performance event data about specific code regions in the application, source code changes are intrusive, inconvenient and involve more work.

Performance Events for Linux (PERF) is a profiling infrastructure that provides a suite of performance analysis tools on Linux. The perf stat tool launches a workload and counts (selected) hardware or software performance events caused by the workload. No changes to the workload are required at either the source or binary levels.

Rpistat is similar to perf stat. It launches a workload and counts nine of the most commonly used ARM1176 hardware performance events while the workload executes. Running rpistat is easy. Just enter “rpistat” followed by the command you would normally use to launch the workload from the shell.

rpistat naive

In the example above, the workload is the naive matrix multiplication program. Rpistat writes a file named rpistat.txt when the workload completes. The file contains a basic performance report. Here is example output from rpistat.

***************************************************************
rpistat: ./naive
Thu Jun 27 09:58:08 2013
***************************************************************

System information
  Number of processors:     1

Performance events
  [ ... ] = scaled event count
  PTI     = per thousand instructions
  Total periods:      169

  Cycles:             11,759,598,287
  Instructions:       315,810,640  [1,241,209,259]
  IBUF stall cycles:  65,981,902  [259,324,219]
  Instr periods:      43
  CPI:                9.474  
  IBUF stall percent: 2.205  %

  DC cached accesses: 4,558,795  [18,343,722]
  DC misses:          933,837  [3,757,582]
  DC periods:         42
  DC miss ratio:      20.484 %

  MicroTLB misses:    224,886  [904,898]
  Main TLB misses:    172,973  [696,010]
  TLB periods:        42
  Micro miss rate:    0.729   PTI
  Main miss rate:     0.561   PTI

  Branches:           33,438,664  [134,550,814]
  Mispredicted BR:    366,383  [1,474,255]
  BR periods:         42
  Branch rate:        108.403 PTI
  Mispredict ratio:   1.096  %

The report includes raw event counts, scaled event counts, rates and ratios. The raw event counts are the actual number of events counted by the hardware performance counters. A rate tells us how often a given event is occurring in terms of events per thousand instructions (PTI). A ratio tells us what portion of events have a certain property, such as the percentage of non-sequential data cache accesses that result in a miss.

Rpistat uses a time division multiplexing scheme to periodically switch the performance counters across the nine events of interest. Rpistat makes a switch every 100 milliseconds (0.1s). The ARM1176 dedicates one performance counter to processor cycles. Rpistat exploits this counter to the max and measures processor cycles all the time (a 100% duty-cycle for this event). Rpistat switches through the other eight events in pairs called event sets. Each event set is measured for about 25% of the overall time. Rpistat keeps track of the active time for each event and scales the raw event counts up to full duty-cycle estimates. These scaled event counts are displayed within square brackets [ ... ] in the output file.

When rpistat switches between event sets, it first accumulates the current counts into 64-bit unsigned integers called virtual counters. Rpistat avoids overflow problems because the measurement period (0.1s) is much shorter than the time needed to overflow the 32-bit hardware counters. The long virtual counter length (64 bits) postpones any real overflow for a very long time, effectively eliminating the practical possibility of an overflow.

If you would like to know more about rpistat’s implementation including important design concerns and limitations, please read about it here.

How does rpistat compare against self-measurement for accuracy? Here’s a quick comparison between the two approaches when they are applied to the naive (textbook) matrix multiplication program.

**Self-measurement vs. rpistat**
Metric	Self-measurement	Rpistat
Elapsed time	16.0s	16.6s
Scaled cycles	181,984,943	184,288,556
Instructions	1,190,897,822	1,245,571,856
DC access	17,588,053	18,587,014
DC miss	2,982,314	3,735,024
MicroTLB miss	809,154	904,166
Main TLB miss	612,688	661,667
CPI	9.78 CPI	9.47 CPI
DC access rate	14.77 PTI	14.92 PTI
DC miss rate	2.50 PTI	2.99 PTI
DC miss ratio	17.0%	20.1%
MicroTLB rate	0.68 PTI	0.73 PTI
Main TLB rate	0.51 PTI	0.53 PTI

The event counts, rates and ratios are within normal run-to-run variability in every case.

The matrix multiplication program has fairly consistent (uniform) dynamic behavior throughout its lifetime. Workloads with distinct processing phases may not fair as well. YMMV. Please see the rpistat page for more information and links to source (rpistat.c).