RPi performance event monitoring: rpistat

Source: Makefile, rpistat.c, readme
Common code: test_common.h, test_common.c, rpi_pmu.h, rpi_pmu.c

The matrix multiplication and pointer chasing programs use the BCM2835 (Raspberry Pi) performance counters to self-monitor their own behavior. The programs configure and start the performance counters before entering a critical loop and stop the performance counters after leaving the loop. This approach is intrusive and requires us to modify and compile the workload program in order to measure hardware performance events. But, what if we don’t want to modify the source?

This page describes a relatively simple program, rpistat, that:

  • Launches a workload in a child process,
  • Configures and starts the performance counters,
  • Waits for the workload to terminate,
  • Stops the performance counters, and
  • Writes performance event counts, rates and ratios to a file named rpistat.txt.

The parent, running the program rpistat, monitors the child and other system activity. The parent handles all of the performance counter work. The workload in the child doesn’t need to be modified/recompiled and is unaware of the monitoring.

The measurements in the result file are sufficient to do an initial performance triage for the workload. Rpistat usage is similar to the shell time command. Just follow rpistat with a command to launch the workload:

rpistat ./naive

I ran rpistat for the naive matrix multiplication program and received the output shown below.

***************************************************************
rpistat: ./naive
Wed Jun 26 15:12:33 2013
***************************************************************

System information
  Number of processors:     1

Performance events
  [ ... ] = scaled event count
  PTI     = per thousand instructions
  Total periods:      169

  Cycles:             11,794,467,561
  Instructions:       316,920,650  [1,245,571,856]
  IBUF stall cycles:  69,764,851  [274,192,088]
  Instr periods:      43
  CPI:                9.469  
  IBUF stall percent: 2.325  %

  DC cached accesses: 4,619,258  [18,587,014]
  DC misses:          928,231  [3,735,024]
  DC periods:         42
  DC miss ratio:      20.095 %

  MicroTLB misses:    224,704  [904,166]
  Main TLB misses:    164,438  [661,667]
  TLB periods:        42
  Micro miss rate:    0.726   PTI
  Main miss rate:     0.531   PTI

  Branches:           33,633,705  [135,335,622]
  Mispredicted BR:    369,167  [1,485,457]
  BR periods:         42
  Branch rate:        108.653 PTI
  Mispredict ratio:   1.098  %

If you read my earlier notes on the ARM1176JZF-S (BCM2835) performance counters, this output should raise an eyebrow. First, the BCM2835 only has two fully configurable performance counter registers; the third register is dedicated to processor cycles. Yet, the output has counts for nine different events (including cycles). Next, the performance counter registers are only 32 bits in length and are subject to overflow, thereby limiting workload run time. How does rpistat avoid counter overflows? Read on.

Time division multiplexing

The rpistat program uses time division multiplexing (TDM) to resolve both of these issues.

The program design separates the nine events into four event sets numbered 0, 1, 2, and 3. All event sets measure processor cycles in the Cycle Counter Register (CCR). Individual processor cycles are counted without scaling (i.e., no division by 64). The remaining two events in each set are assigned to performance counter 0 and counter 1.

Event set     Events in the set
-----------   ------------------------------------------------------------------
Event set 0   Processor cycles; executed instructions; instruction buffer stalls
Event set 1   Processor cycles; data cache accesses (cacheable); data cache misses
Event set 2   Processor cycles; data MicroTLB misses; main TLB misses
Event set 3   Processor cycles; executed branch instructions; mispredicted branches

Event sets (rpistat)

Rpistat operates in the following way. It does a fork/exec to launch the workload, configures the performance counters for the first event set, and starts the counters. Rpistat then sleeps for 100 milliseconds (0.1 s). After it wakes, rpistat checks whether the workload is still running or has terminated. If the workload is still running, rpistat stops the counters and accumulates the current event counts into 64-bit virtual counters (which are maintained in ordinary variables). It then configures the performance counters for the next event set, clears the counters and starts them again. Rpistat repeats this cycle of sleeping, accumulating event counts and switching event sets until the workload terminates.

The 64-bit virtual counters postpone the possibility of an overflow for a very long time. For example, a 64-bit virtual cycles counter overflows in 2.6E+10 seconds. Rpistat’s measurement period (100 milliseconds) is well within the critical limit for 32-bit physical counter overflows (6 seconds). TDM in combination with 64-bit virtual counters effectively eliminates overflow.

Each event in a set is measured for 100 milliseconds at a time. I call this a measurement period. The program keeps track of the total number of measurement periods and the number of periods for each event set. This allows us to scale the raw event counts into estimates of the full duty-cycle event counts, that is, what the event counts would be if the events were measured full-time. The event counts surrounded by square brackets [ ... ] in the sample output are scaled, full duty-cycle estimates.

Performance Events for Linux (PERF) implements a form of performance counter multiplexing. It, too, switches events on and off the physical performance counters.

Luxuries

The rpistat program provides a few luxuries. The program uses the locale to insert thousands separators into large integers when displaying the event counts in the output. The separators make it much easier to read and compare large integer values. Some workloads run for several minutes or even hours, and event counts in practice get VERY large.

Rpistat collects all of the most interesting events in a single workload run. This makes multiple experimental runs unnecessary and saves time. However, the number of actual events of each type is only one quarter (25%) of what a full-time measurement scheme would accumulate. The reduction could affect accuracy, especially for short-running programs (i.e., a small number of 100 millisecond measurement periods). Workloads must run and be measured for at least several seconds to attain minimal accuracy. Further, the TDM scheme assumes that workload behavior is relatively consistent (uniform) across the entire run — an assumption that does not always hold.

On the other hand, many event rates and ratios can be generated from the range of available performance event data. Event rates are expressed as “events per thousand instructions” or PTI. Events PTI yield a more intuitive understanding than “per instruction”. The program uses the period counts to determine a normalizing scale factor for each event. Events must have the same weight (duty-cycle) when performing arithmetic on them or comparing them.

Comparison to previous results

In a previous experiment, I measured the matrix multiplication function in the program named naive, as in the “naïve” (textbook) algorithm. The performance counters start just before matrix multiplication and stop just after. Four separate measurement runs are required to collect all event data because only two events plus the cycles event can be measured during a single run. I call this scheme self-measurement since the performance counter instrumentation code is embedded in the application program itself.

The following table compares the event counts taken by self-measurement and by rpistat.

Metric           Self-measurement   Rpistat
Elapsed time     16.0 s             16.6 s
Scaled cycles    181,984,943        184,288,556
Instructions     1,190,897,822      1,245,571,856
DC access        17,588,053         18,587,014
DC miss          2,982,314          3,735,024
MicroTLB miss    809,154            904,166
Main TLB miss    612,688            661,667
CPI              9.78               9.47
DC access rate   14.77 PTI          14.92 PTI
DC miss rate     2.50 PTI           2.99 PTI
DC miss ratio    17.0%              20.1%
MicroTLB rate    0.68 PTI           0.73 PTI
Main TLB rate    0.51 PTI           0.53 PTI

Self-measurement vs. rpistat

Event counts are scaled where necessary to make quantities comparable. Although rpistat measures array initialization in addition to matrix multiplication, the rpistat measurements are quite consistent with respect to self-measurement.

Limitations

Both rpistat and the workload run on the same single CPU core in the Raspberry Pi. Rpistat wakes up ten times per second and interrupts the execution of the workload each time. (Since rpistat and the workload are in separate processes, this entails a heavy-weight context switch.) The workload runs a little longer since it must share the core with rpistat. Further, rpistat pollutes key microarchitectural components: the level 1 (L1) caches, the TLBs, and the branch history table. Pollution affects the performance event counts and the workload does not run exactly as it would if it had the core to itself.

Rpistat wakes up and switches between event sets every 100 milliseconds. The length of the measurement period is a trade-off between periodic interference/pollution by rpistat and measurement accuracy. I do not recommend a higher multiplexing rate. At 100 milliseconds per period, the system already must perform 20 heavy context switches per second to alternate between rpistat and the workload.

The overall approach assumes a workload with reasonably consistent behavior over the life of the workload. The matrix multiplication program is a good fit with rpistat’s style. The program has two phases: a short initialization phase and the matrix multiplication phase. Matrix multiplication consumes most of the run time and inner loop behavior is highly consistent across all iterations. Thus, rpistat measurements taken over the life of the program accurately reflect the behavior of the matrix multiplication procedure.

Some workloads, however, have multiple phases where the locus and nature of CPU/memory activity are different for each phase. In this case, rpistat measurements aggregate the observable dynamic behavior (via performance events) across all phases. Additional measurement error occurs when one or more phases are shorter than the measurement period.

Possible extensions

One possible extension is to write the intermediate event counts to a file (rpistat.data) at the end of every measurement period. The cycles count is effectively the timestamp for the reported measurement period. A post-processing tool can read the data file and produce a time plot showing the event frequency for each measurement period. The time plot shows phase-specific workload behavior provided that the phases are long enough with respect to the length of the measurement period.

Another possible extension is to support multi-core. This extension isn’t relevant for the current Raspberry Pi Models A and B because the BCM2835 is a single core processor. However, future implementations may be multi-core. Neither rpistat nor the workload can make assumptions about the underlying core(s). Either process can migrate to a different core. In most multi-core processors, each core has its own set of performance counters. Rpistat cannot assume that the performance counters on its own core contain counts for the workload. The workload could be executing on a different core.

Gory details

Here is a little taste of the code (rpistat.c). There are three functions that handle the event sets.

  • handle_first_event_set(): This function clears the performance counters and gets the first event set going.
  • handle_event_sets(): This function accumulates counts for the current event set into the virtual counters and switches to the next event set. Rpistat steps through the event sets in round-robin fashion.
  • handle_last_event_set(): This function accumulates counts for the current event set into the virtual counters. It does not restart the performance counters.

The following code snippet (from rpistat.c) is the main control loop in rpistat.

handle_first_event_set() ;
for(;;) {
  if ((pid = waitpid(child, 0, WNOHANG)) == 0) {
    // Child is still running
    usleep(100000) ;
    // Stop the performance counters
    armv6_pmcr_write(0) ;
    handle_event_sets() ;
  } else if (pid == child) {
    // Child terminated
    // Stop the performance counters
    armv6_pmcr_write(0) ;
    handle_last_event_set() ;
    // fprintf(stdout, "Workload terminated\n") ;
    break ;
  } else if (pid == -1) {
    // Error
    fprintf(stderr, "*error* waitpid failed\n") ;
    perror("waitpid") ;
  } else {
    armv6_pmcr_write(0) ;
    fprintf(stderr, "*fatal* Unexpected return from waitpid()\n") ;
    exit( EXIT_FAILURE ) ;
  }
}

The function handle_first_event_set() starts the counters with the first set of events before entering the control loop. The loop body calls waitpid() without blocking (WNOHANG) to determine if the workload in the child process has terminated. If the child is still running (the normal case), rpistat sleeps for 100 milliseconds. When rpistat awakes, it stops the counters and calls handle_event_sets() to accumulate the current event counts and to switch to the next event set. If the workload has terminated, rpistat stops the counters and calls handle_last_event_set() to accumulate the latest and last event counts. The remaining cases handle errors.

Copyright © 2013 Paul J. Drongowski