rpistat: RPi event monitoring tool

[At the time of this post, Performance Events for Linux (PERF) supports only software events on Raspberry Pi. Rpistat is a tool that counts hardware performance events in the style of perf stat and is a poor man’s substitute.]

So far, my example programs for the Raspberry Pi (matrix multiplication, pointer chasing) are instrumented with explicit code to measure and display hardware performance events. The instrumentation code executes privileged ARM instructions to read and write the performance counters and control register. The aprofile kernel module must be loaded in order to enable user-space access to the ARM1176 performance monitoring unit (PMU). If aprofile is not loaded, the privileged instructions trap to the OS and the process terminates abnormally.

This approach, which I call self-measurement, requires modification to the application source code. While self-measurement collects performance event data about specific code regions in the application, source code changes are intrusive, inconvenient and involve more work.

Performance Events for Linux (PERF) is a profiling infrastructure that provides a suite of performance analysis tools on Linux. The perf stat tool launches a workload and counts (selected) hardware or software performance events caused by the workload. No changes to the workload are required at either the source or binary levels.

Rpistat is similar to perf stat. It launches a workload and counts nine of the most commonly used ARM1176 hardware performance events while the workload executes. Running rpistat is easy. Just enter “rpistat” followed by the command you would normally use to launch the workload from the shell.

rpistat ./naive

In the example above, the workload is the naive matrix multiplication program. Rpistat writes a file named rpistat.txt when the workload completes. The file contains a basic performance report. Here is example output from rpistat.

***************************************************************
rpistat: ./naive
Thu Jun 27 09:58:08 2013
***************************************************************

System information
  Number of processors:     1

Performance events
  [ ... ] = scaled event count
  PTI     = per thousand instructions
  Total periods:      169

  Cycles:             11,759,598,287
  Instructions:       315,810,640  [1,241,209,259]
  IBUF stall cycles:  65,981,902  [259,324,219]
  Instr periods:      43
  CPI:                9.474  
  IBUF stall percent: 2.205  %

  DC cached accesses: 4,558,795  [18,343,722]
  DC misses:          933,837  [3,757,582]
  DC periods:         42
  DC miss ratio:      20.484 %

  MicroTLB misses:    224,886  [904,898]
  Main TLB misses:    172,973  [696,010]
  TLB periods:        42
  Micro miss rate:    0.729   PTI
  Main miss rate:     0.561   PTI

  Branches:           33,438,664  [134,550,814]
  Mispredicted BR:    366,383  [1,474,255]
  BR periods:         42
  Branch rate:        108.403 PTI
  Mispredict ratio:   1.096  %

The report includes raw event counts, scaled event counts, rates and ratios. The raw event counts are the actual number of events counted by the hardware performance counters. A rate tells us how often a given event is occurring in terms of events per thousand instructions (PTI). A ratio tells us what portion of events have a certain property, such as the percentage of non-sequential data cache accesses that result in a miss.
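As a worked check using the scaled counts above: the branch rate is 134,550,814 branches / 1,241,209,259 instructions × 1,000 ≈ 108.403 PTI, and the DC miss ratio is 3,757,582 misses / 18,343,722 accesses ≈ 20.484 %, matching the figures in the report.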

Rpistat uses a time division multiplexing scheme to periodically switch the performance counters across the nine events of interest. Rpistat makes a switch every 100 milliseconds (0.1s). The ARM1176 dedicates one performance counter to processor cycles. Rpistat exploits this counter to the max and measures processor cycles all the time (a 100% duty-cycle for this event). Rpistat switches through the other eight events in pairs called event sets. Each event set is measured for about 25% of the overall time. Rpistat keeps track of the active time for each event and scales the raw event counts up to full duty-cycle estimates. These scaled event counts are displayed within square brackets [ ... ] in the output file.
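For example, the instruction count above was active for 43 of the 169 total periods, so the raw count scales as 315,810,640 × (169 / 43) ≈ 1,241,209,259, the bracketed value in the report. (Rpistat tracks active time rather than period counts, but with fixed 100 millisecond periods the two are equivalent.)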

When rpistat switches between event sets, it first accumulates the current counts into 64-bit unsigned integers called virtual counters. Rpistat avoids overflow problems because the measurement period (0.1s) is much shorter than the time needed to overflow the 32-bit hardware counters. The long virtual counter length (64 bits) postpones any real overflow for a very long time, effectively eliminating the practical possibility of an overflow.
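As a rough sketch of this bookkeeping (the structure, field and function names below are illustrative assumptions, not rpistat's actual internals):

#include <stdint.h>

/* Illustrative virtual counter: a 64-bit accumulator plus the number of
   100 ms periods during which the event was actually being counted. */
struct virtual_counter {
    uint64_t events ;
    uint32_t active_periods ;
} ;

/* At each switch point, fold the 32-bit hardware count into the 64-bit
   virtual counter before the hardware counter is reprogrammed. */
static void accumulate(struct virtual_counter* vc, uint32_t hw_count)
{
    vc->events += hw_count ;
    vc->active_periods++ ;
}

/* Scale a raw count up to a full duty-cycle estimate. The bracketed
   numbers in rpistat.txt are consistent with this computation. */
static uint64_t scale_count(const struct virtual_counter* vc,
                            uint32_t total_periods)
{
    if (vc->active_periods == 0) return( 0 ) ;
    return( (vc->events * total_periods) / vc->active_periods ) ;
}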

If you would like to know more about rpistat’s implementation including important design concerns and limitations, please read about it here.

How does rpistat compare against self-measurement for accuracy? Here’s a quick comparison between the two approaches when they are applied to the naive (textbook) matrix multiplication program.

Metric          Self-measurement        Rpistat
--------------  ----------------  -------------
Elapsed time               16.0s          16.6s
Scaled cycles        181,984,943    184,288,556
Instructions       1,190,897,822  1,245,571,856
DC access             17,588,053     18,587,014
DC miss                2,982,314      3,735,024
MicroTLB miss            809,154        904,166
Main TLB miss            612,688        661,667
CPI                     9.78 CPI       9.47 CPI
DC access rate         14.77 PTI      14.92 PTI
DC miss rate            2.50 PTI       2.99 PTI
DC miss ratio              17.0%          20.1%
MicroTLB rate           0.68 PTI       0.73 PTI
Main TLB rate           0.51 PTI       0.53 PTI
Self-measurement vs. rpistat

The event counts, rates and ratios are within normal run-to-run variability in every case.

The matrix multiplication program has fairly consistent (uniform) dynamic behavior throughout its lifetime. Workloads with distinct processing phases may not fare as well. YMMV. Please see the rpistat page for more information and links to source (rpistat.c).

Memory access time

In my last post, I used a simple pointer chasing loop to characterize the different levels of the BCM2835 (Raspberry Pi) memory hierarchy. I ran the pointer chasing loop for a range of linked list sizes where the list size determines the storage level to be exercised. For example, a linked list occupying 16KB or less causes the chase loop to exercise the level 1 (L1) data cache. I measured hardware performance events for each test case and summarized the results in a table. The results clearly show three different timing tiers: L1 data cache, access to primary memory (no TLB miss) and access to primary memory with a Main TLB miss. A miss in the Main TLB is quite slow because the hardware page table walker must read page mapping information from primary memory in order to update the TLB and to complete the memory operation in flight.

The one characteristic missing from the analysis, however, is the estimated access time for each level. Sure, the execution times clearly break the test cases into tiers, but elapsed execution time does not really measure the access (latency) time itself.

In the next experiment, I estimate the access time. (See the source code in latency.c.) The approach is essentially unchanged from the first experiment: use pointer chasing on lists of different sizes to hit in particular levels of the memory hierarchy. Like the first experiment, the program (named latency) successively traverses the same linked list many times (default: 1000 traversals). Part way through each traversal, the loop chooses a chase operation and measures its execution time. The sampled execution time is saved temporarily in an array until all traversals are complete. Then, the program writes the execution times to a file named samples.dat. The samples are aggregated into a histogram which shows the distribution of the times.

The sampling technique lets each full traversal get underway before taking a measurement in order to “warm up” the pipeline, cache and TLBs. Taking a single sample per traversal avoids major cache and TLB pollution due to the bookkeeping needed to take and save samples. The measured execution times should faithfully represent access time (plus some measurement bias).

The program measures execution time in cycles. It configures and enables the ARM1176 Cycle Counter Register (CCR) as a free-running cycles counter. The program resets and enables the CCR before starting a traversal, thus avoiding a counter overflow. (A full traversal completes before the CCR overflows.) The pointer chasing loop reads the CCR before chasing the next pointer and reads the CCR after chasing the pointer. The difference between the after and before counts is the estimated execution time of the chase operation. The chase operation is implemented as a single ARM load instruction, so the instruction execution time is a biased estimate of the memory access time.

Here is the code for the function sample_cycles() and the pointer chasing loop within. The function takes two arguments: a pointer to the head of the linked list and an integer value that chooses the chase operation to be sampled.

uint32_t sample_cycles(CacheLine* linked_list, int n)
{
  register CacheLine* item ;
  register uint32_t before, after, cycles ;
  register int count = n ;

  cycles = 0 ;
  for(item = linked_list ; item != NULL ; ) {
    if (count-- == 0) {
      /* Sample this chase: read CCR, chase one pointer, read CCR again. */
      before = armv6_read_ccr() ;
      item = item->ptrCacheLine[0] ;
      after = armv6_read_ccr() ;
      cycles = after - before ;
    } else {
      /* Ordinary (unsampled) chase operation. */
      item = item->ptrCacheLine[0] ;
    }
  }

  return( cycles ) ;
}

On each iteration, the loop body checks and decrements the sampling count maintained in the variable count. If the sampling count is not zero, then the loop quickly performs an ordinary chase operation. If the count is zero, then the code snaps the before and after CCR values by calling armv6_read_ccr(), an in-line function that performs an ARM MRC instruction to read CCR. The difference is computed and is saved in the variable cycles until it can be returned by the function after traversal. The chase loop exits when it reaches the NULL pointer at the end of the linked list.
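For reference, here is a minimal sketch of what armv6_read_ccr() might look like as a GCC in-line function (the actual implementation is in the author's source). On the ARM1176, MRC p15, 0, <Rd>, c15, c12, 1 reads the Cycle Counter Register:

#include <stdint.h>

/* Read the ARM1176 Cycle Counter Register (CCR). Requires user-space
   access to the performance counters (e.g., via the aprofile module). */
static inline uint32_t armv6_read_ccr(void)
{
    uint32_t ccr ;
    asm volatile("mrc p15, 0, %0, c15, c12, 1" : "=r" (ccr)) ;
    return( ccr ) ;
}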

Local variables are allocated to general registers in order to avoid extraneous cache and TLB references that would perturb measurements. I checked the compiled machine code to make sure that the compiler accepted the register hints and actually allocated the local variables to registers. I also made sure that the function armv6_read_ccr() was expanded in-line.

The latency program needs to have user-space access to the ARM1176 performance counters. You must load the aprofile kernel module before running latency. Otherwise, you will get an “Illegal instruction” exception when the program attempts to configure the performance counters.

Here is a histogram showing the distribution of execution time samples when the linked list is 8KB in size. The 8KB test case consistently hits in the L1 data cache.

9  1000 ################################################

The first column is the execution time in cycles. The second column is the number of samples with the value shown in the first column. All 1,000 samples are 9 cycles.

Chapter 16 of the ARM1176JZF-S Technical Reference Manual (TRM) describes instruction cycle timings. A load from L1 data cache has a 3 cycle load-to-use latency (access) time. Taking 3 cycles as the actual access time, I estimate a measurement bias of (9 cycles – 3 cycles) = 6 cycles due to the pipelined execution of the ARM MRC instructions that read the CCR and the load instruction that chases the linked list pointer.

The 32KB and 64KB test cases form the middle timing tier. Both test cases are within the coverage of the Main TLB (288KB) and memory accesses hit in either the MicroTLB or Main TLB. The distributions are similar. Here is the distribution for the 64KB case.

 9   49 ##### 
17    1 # 
61  320 ##########################
62  571 ################################################
63   11 # 
64    4 # 
65    1 # 
69    1 # 
70   27 ### 

There is a strong peak in the narrow range of [61:62] cycles. The nonbiased estimate for access time to primary memory with a TLB hit is (62 cycles – 6 cycles bias) = 56 cycles.

The test cases at 256KB and larger form the third timing tier. These cases exceed the Main TLB coverage. (The 256KB case almost occupies the entire Main TLB and the performance event data place this case in the third and longest running tier.) Load operations miss in the Main TLB forcing a hardware page table walk. The table walker performs at least one additional primary memory read operation in order to obtain page mapping information for the TLB. Here is the distribution for the 1MB case.

113    4 # 
116  224 ##############################################
117   23 ###### 
118   59 ############## 
119  129 ###########################
120    5 ## 
121  224 ##############################################
122  185 ######################################
125    4 # 
130    9 ### 
132    7 ## 
133   18 ##### 
135    5 ## 

The samples are clustered in the range of [116:122] cycles. The nonbiased estimate for access time to primary memory with a Main TLB miss is (122 cycles – 6 cycles bias) = 116 cycles. This is approximately twice the access time of a load that hits in either the data MicroTLB or the Main TLB.

The following table summarizes the access time for each level of the memory hierarchy.

Level           Access time  Condition
--------------  -----------  --------------
CPU register        1 cycle
L1 data cache      3 cycles   Data cache hit
Primary memory    56 cycles   No TLB miss
Primary memory   116 cycles   Main TLB miss
Memory hierarchy access times (latencies)

Where is the L2 cache? The BCM2835 dedicates the L2 cache to the Broadcom VideoCore GPU. Memory operations from the CPU are routed around the L2 cache. The L2 cache is not a factor here and is not part of the analysis.

From these estimates, we can see why the textbook (naive) matrix multiplication program is so slow. The textbook algorithm does not walk sequentially through one of its operand arrays. The long memory stride causes a large number of Main TLB misses. Memory access is twice as slow when an access misses in the Main TLB, thereby slowing down program execution and increasing elapsed time.

Memory hierarchy and performance events

My latest set of experiments with Raspberry Pi use the BCM2835 performance counters to exercise and observe the behavior of the RPi memory system. The experiments use a simple microbenchmark that traverses a linked list. This technique is often called pointer chasing because list traversal chases the pointers through the linked list from beginning to end. A pointer chasing loop is simple, is easy to instrument, and has consistently repeatable behavior.

Here’s the approach. We exercise a single specific level in the memory hierarchy (such as the level 1 data cache) simply by adjusting the size of the linked list. We just need to choose a size such that the physical size of the list is less than or equal to the capacity of a level — 16KB in the case of the level 1 (L1) data cache. Storage for the linked list is allocated as a contiguous block of memory (an array of bytes), so it’s easier to think of the linked list size in terms of its contiguous footprint in memory, the array size. Each linked list item is the same size as a cache line: 32 bytes.
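For concreteness, here is one way such a list could be laid out and linked. The struct mirrors the CacheLine type used by the chase loop shown earlier on this page, but the field layout and the sequential link order are illustrative assumptions, not necessarily the author's exact code:

#include <stdlib.h>

#define CACHE_LINE_SIZE 32

/* Each list item occupies exactly one 32-byte cache line. On the ARM1176
   a pointer is 4 bytes, so one line holds 8 pointer-sized slots. */
typedef struct _CacheLine CacheLine ;
struct _CacheLine {
    CacheLine* ptrCacheLine[CACHE_LINE_SIZE / sizeof(CacheLine*)] ;
} ;

/* Allocate a contiguous block of 'lines' cache lines and link each line
   to the next one; the last line terminates the list with NULL. */
CacheLine* build_list(size_t lines)
{
    CacheLine* block ;
    size_t i ;

    if (lines == 0) return( NULL ) ;
    block = malloc(lines * sizeof(CacheLine)) ;
    if (block == NULL) return( NULL ) ;

    for (i = 0 ; i + 1 < lines ; i++) {
        block[i].ptrCacheLine[0] = &block[i + 1] ;
    }
    block[lines - 1].ptrCacheLine[0] = NULL ;

    return( block ) ;
}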

The first experiment measures seven kinds of performance events: cycles, executed instructions, L1 data cache accesses, L1 data cache misses, data MicroTLB misses, Main TLB misses and explicit external data accesses.

Before we take a look at the results of the first experiment, here are a few things to keep in mind.

  • The Broadcom BCM2835 128KB L2 cache is dedicated to the VideoCore GPU. Memory requests from the CPU are routed around the L2 and go directly to primary memory. L2 cache is not a factor in our analysis.
  • Coverage is the amount of primary memory accessible through a translation lookaside buffer (TLB) without incurring a TLB miss.
  • The MicroTLB has ten fully associative (page) entries. Its coverage is (10 entries * 4,096 bytes) = 40KB.
  • The Main TLB handles MicroTLB misses and has 8 fully associative entries plus 64 2-way associative entries. Its coverage is (72 entries * 4,096 bytes) = 288KB.
  • A hardware page table walker handles Main TLB misses. A page table walk requires at least one additional read in primary memory in order to look up page mapping information.

These memory system characteristics affect the results and should be directly observable.

The following table summarizes basic measurements for ten test runs. The first column is the size of the dynamically allocated block of memory that holds the linked list elements. The second column is the size of the block expressed as 32-byte cache lines. There is one list item per cache line, so this is also the number of linked list items. The third column is the number of list traversals (one traversal per iteration). I adjusted the number of iterations for each run in order to perform exactly the same number of pointer chasing operations per test case. This allows us to make meaningful run-to-run comparisons.
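(For example, the 4MB case chases 128K lines × 1,024 traversals = 134,217,728 pointers, the same total as the 8KB case at 256 lines × 524,288 traversals.)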

 Size  Cache lines  Iterations   Time    CPI
-----  -----------  ----------  -----  -----
  4MB         128K        1024  25.76  20.51
  2MB          64K        2048  25.74  20.52
  1MB          32K        4096  25.98  20.70
512KB          16K        8192  25.56  20.41
256KB           8K       16384  24.62  19.71
128KB           4K       32768  19.99  16.24
 64KB           2K       65536  11.75   9.61
 32KB           1K      131072  10.49   8.77
 16KB          512      262144   2.66   2.30
  8KB          256      524288   2.39   2.04

The fourth column is the total elapsed time per run. There are roughly three distinct timing tiers:

  • Tier A: 4MB, 2MB, 1MB, 512KB, 256KB, 128KB
  • Tier B: 32KB, 64KB
  • Tier C: 8KB, 16KB

The 128KB case is somewhat “transitional,” but I placed it in Tier A. The cycles per instruction (CPI column) ratios are consistent with respect to the elapsed times.

The following table summarizes the performance event counts for the same test cases (block/array size and iterations).

                           MicroTLB  Main TLB  External
 Size DC access  DC miss      miss      miss    access  Tier
----- ---------  -------  --------  --------  --------  ----
  4MB  24696257  5107400   1049399    258783   7433309   A
  2MB  24927252  5119529   1051719    259531   7509503   A
  1MB  24878797  5145341   1065404    273960   7521771   A
512KB  24668739  5086410   1050024    259104   7461179   A
256KB  22999900  4803510   1057814    308477   7263176   A
128KB  20051408  4258682    965191    181398   6136462   A
 64KB  11972871  2683992    620354     91207   3785568   B
 32KB  10096551  2288319    548311     71756   3231861   B
 16KB   2621836   594034    136900      8265    804582   C
  8KB   3480732   446481     72712      2757    480707   C

Keeping the basic measurements in mind, we can see that the performance event counts for Tier C are consistent with L1 data cache hits. We are measuring on a single core computer, so the data cache and TLB misses in this tier are mostly due to other system activity (like OS timer interrupts). Tier B cases are mainly references to primary memory without any TLB miss: the Tier B array sizes are within the MicroTLB and/or Main TLB coverage, so accesses hit in either the MicroTLB or the Main TLB. Tier A cases exceed that coverage and miss in the TLBs. Each Tier A reference requires an additional primary memory read by the page table walker, which is why Tier A has the longest elapsed execution times.

Next up, we’ll measure the memory access (latency) time to primary memory.

A sweet one-liner for histograms

Here is a shell script one-liner that is just too sweet to go unrecognized.

I’ve written a program (latency.c) to measure the execution time of individual chase operations in a linked list pointer chasing loop. The program writes the execution times into a file named samples.dat. The distribution of the execution times should show the access time to different levels of the ARM1176 memory hierarchy.

I still needed a way to view the distribution in a histogram. So, I wrote a short C program (histogram.c) to postprocess the execution times. The program produces a histogram-like table — not a chart. Bummer.

A quick search of the Web brought up some rather sophisticated data visualization tools. I experimented with a few of them, but still couldn’t get what I wanted.

Then, this little gem came up from Small Labs, Inc. (Search on “command line histogram”.)

history | awk '{h[$2]++}END{for(i in h){print h[i],i|"sort -rn|head -20"}}' |awk '!max{max=$1;}{r="";i=s=60*$1/max;while(i-->0)r=r"#";printf "%15s %5d %s %s",$2,$1,r,"\n";}'

This one-liner displays a histogram of recent command lines (history) from the most to least frequently used. It displays only the 20 most frequent commands (head -20).

This one-liner looks quite promising, but it needs a few changes. First, it needs to read data from the file samples.dat. It needs to read only one item from each line of the file; history produces two items per line. Finally, we can discard some of the white space in the output and make the lines in the chart narrower.

Here is the revised one-liner, written in the form of a bash shell script. We’re going to analyze several files, so it’s convenient to put the one-liner into a shell command file.

#!/bin/bash

cat samples.dat | awk '{h[$1]++}END{for(i in h){print h[i],i|"sort -rn|head -20"}}' |awk '!max{max=$1;}{r="";i=s=60*$1/max;while(i-->0)r=r"#";printf "%6s %5d %s %s",$2,$1,r,"\n";}'

Here is some sample output.

    62  517 ###########################################################
    61  301 ################################### 
     9  119 ############## 
    70   12 ## 
    63   11 ## 
    65    9 ## 
    64    4 # 
    17    4 # 
   336    2 # 
   168    2 # 
   154    2 # 
    71    1 # 
    67    1 # 
    60    1 # 
   525    1 # 
   521    1 # 
   390    1 # 
   389    1 # 
   386    1 # 
   378    1 # 

We can easily see the most frequent values, but this doesn’t really show the distribution in a useful way. So, let’s just sort the output on the leading numeric values (sort -n). The numbers in the first column are access/latency times in cycles, by the way.

     9  119 ##############
    17    4 # 
    60    1 # 
    61  301 ###################################
    62  517 ###########################################################
    63   11 ## 
    64    4 # 
    65    9 ## 
    67    1 # 
    70   12 ## 
    71    1 # 
   154    2 # 
   168    2 # 
   336    2 # 
   378    1 # 
   386    1 # 
   389    1 # 
   390    1 # 
   521    1 # 
   525    1 # 

Ah, now we can see the level 1 data cache hits (9 cycles including measurement bias) and reads to primary memory (61 and 62 cycles). The bi-modal nature of the distribution is revealed.

Before leaving this topic, I’d like to give a shout-out to Dave Christie at AMD. Dave always gave me a good-natured kidding about these old school histograms. Dave, BTW, is one of the true unsung technical heroes at AMD. All the best!

Perils of performance analysis on single core

A new article on the Raspberry Pi (Broadcom BCM2835) memory hierarchy is almost ready. The first code has already been posted.

I’ve been working on multi-core processors for so long that I forgot what it’s like to take measurements on a single core machine like the Raspberry Pi.

In the ideal world, a benchmark or performance test program has the machine to itself and no other program or system activity perturbs it. Measurements on the ideal machine accurately and exactly reflect the dynamic behavior and performance of the program. On multi-core, you can usually assign the test program to an idle core (or two), preferably a core that is free of operating system activity. With careful process or thread placement, results on multi-core approach the ideal.

On single core, we don’t have that luxury. The test program has to share the one core with other programs and the operating system. On Raspberry Pi, Linux fires up services that run periodically. Even if we shut the services off, the system clock continues to run and it generates interrupts. At the very least, extraneous activity affects elapsed, user and system time measurements.

When we measure performance events, however, there is a deeper level of interference. The core has one physical level 1 (L1) data cache, one physical MicroTLB, one physical Main TLB and one physical branch history table. These microarchitectural components are transparent to the architecture, but they must be shared between programs and the OS. A context switch may cause a cache or TLB flush which invalidates the entire contents of the cache/TLB. Cache, TLB or branch history may be partially polluted by other software activity. The final performance event counts are affected by flushes and pollution and do not accurately reflect the behavior of the test program.

I ran into this issue while characterizing the memory hierarchy with performance events. One test case is designed to exercise only the L1 data cache and never touch primary memory. Yet, the test case measured a rather significant number of data cache misses beyond the compulsory misses that I would have expected. The extra misses are most likely caused by timer interrupts. I now think of these extra misses as “background radiation” which bias measurement.

Such are the perils of performance measurement and analysis on single core!

Faster matrix multiplication (part 2 of 2)

Part 2 of two parts on matrix multiplication demonstrates a fast matrix multiplication program. The algorithm is a simple transformation of the textbook algorithm — the olde loop nest interchange. The transformation changes the slow access pattern to one of the arrays so that the program steps sequentially through the array elements in memory. Execution time speeds up from about 16 seconds elapsed time to 6 seconds. Not bad for a few minutes work!
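As a sketch of the transformation (the matrix size and element type below are illustrative, not necessarily those used in the article):

#define N 500   /* illustrative matrix size */

static float a[N][N], b[N][N], c[N][N] ;   /* c assumed zero-initialized */

/* Textbook (i-j-k) order: the inner loop strides down a column of b,
   touching a new cache line, and eventually a new page, on every step. */
void multiply_naive(void)
{
    int i, j, k ;
    for (i = 0 ; i < N ; i++)
        for (j = 0 ; j < N ; j++)
            for (k = 0 ; k < N ; k++)
                c[i][j] += a[i][k] * b[k][j] ;
}

/* Interchanged (i-k-j) order: the inner loop now walks b and c row-wise,
   so successive accesses are sequential in memory. */
void multiply_interchanged(void)
{
    int i, j, k ;
    for (i = 0 ; i < N ; i++)
        for (k = 0 ; k < N ; k++)
            for (j = 0 ; j < N ; j++)
                c[i][j] += a[i][k] * b[k][j] ;
}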

All of the key memory-related performance events are improved since the access pattern is a better fit with the underlying memory microarchitecture. The analysis shows that we need to be careful when interpreting the Data Cache Access event because this event counts nonsequential memory accesses instead of all level 1 DC accesses or architectural loads and stores.

Part 2 also discusses operation or instruction counting to analyze program complexity at a micro-level. I like to look at the assembler code generated by the compiler to see if there are any potential speed-ups. The article shows how to look at the assembler code using the GCC -S option and using the objdump program. I use instruction counting to check the operation and meaning of performance events like the ARM11 Executed Instructions event.

The Broadcom BCM2835 in the Raspberry Pi has an integer core and a Vector Floating Point (VFP) coprocessor. The VFP operates concurrently with the integer core. In fact, it operates quite independently and only synchronizes with the integer core at a few well-defined points. VFP instructions are allowed to complete out of order, which allows for greater speed, but makes FP exceptions somewhat imprecise. (Now exactly where did that underflow/overflow occur?) The VFP coprocessor has 32 registers of its own, which are organized as four 8-register banks. GCC uses the coprocessor for scalar floating point arithmetic, but doesn’t exploit any parallelism.

The VFP operates on short vectors in a register bank. Potentially, the VFP coprocessor could be exploited to further speed up matrix multiplication. One possibility is to stream incoming data as a four-wide stripe through an array and operate on four elements at once. Or, stream four elements at a time from a single row/column. Take a look at the VFP Math Library.

It’s not all good news, however. The VFP coprocessor is not a true single instruction, multiple data (SIMD) engine. (It’s similar to an old school short vector architecture.) It only has a single floating multiply/accumulate (FMAC) pipeline. A true SIMD would have four FMAC units. Also, computations are relatively difficult to set up and stage. Computations must be double buffered where the integer Load Store Unit (LSU) is filling one register bank while the coprocessor is performing computations in a different register bank. Further, GCC vectorization doesn’t appear to support VFP.

ARM must have gotten the message from its users. Later processors implement NEON SIMD and just enough VFP for the sake of legacy compatibility. The Beaglebone Black (ARM Cortex-A8) has NEON and I’m looking forward to trying it out. GCC vectorization supports NEON, too, and it’s a whole lot easier to let the compiler vectorize your program for you than to write vector code yourself!

The BCM2835 also has the VideoCore GPU for SIMD computation. There are a bunch of folks who are reverse engineering the GPU in order to use it for general purpose computation (GPGPU). Have at it, guys and gals!

Even if VFP is an orphan, the coprocessor has 32 registers where you can stash data. Maybe you can find a way to make use of these extra registers? Side-to-side access (integer/floating) is pretty fast.