PERF tutorial part 3 is now on-line

Just wrapped up Part 3 of the Linux-tools PERF tutorial.

The tutorial now consists of three parts. Part 1 covers the most basic PERF commands and shows how to find program hot-spots using software performance events. Part 2 discusses hardware performance events and performance counters, and demonstrates how to measure hardware performance events using PERF counting mode. Part 2 introduces several derived performance metrics like instructions per second (IPC) and applies these metrics to the sample application programs.

Part 3 is the newest addition to the tutorial series. It builds on parts 1 and 2, showing how to use hardware performance events and counter sampling to profile an application program. Part 3 discusses sampling period and frequency, the sampling process, overhead, statistical accuracy/confidence and other practical concerns.

I hope you find the PERF tutorial to be useful in your work! Although I produced the example data on the ARM-based Raspberry Pi, the commands and techniques will also work on x86.

PERF tutorial part 2 now available

Part 2 of a three part tutorial about Linux-tools PERF is now available.

Part 1 of the series shows how to find hot execution spots in an application program. It demonstrates the basic PERF commands using software performance events such as CPU clock ticks and page faults.

Part 2 of the series — just released — introduces hardware performance counters and events. I show how to count hardware events with PERF and how to compute and apply a few basic derived measurements (e.g., instructions per cycle, cache miss rate) for analysis. Part 3 is in development and will show how to use sampling to profile a program and to isolate performance issues in code.

All three parts of the series use the same simple, easy to understand example: matrix multiplication. One version of the matrix multiplication program illustrates the impact of severe performance issues and what to look for in PERF measurements. The issues are mitigated in the second, improved version of the program. PERF measurements for the improved program are presented for comparison.

The test platform is the latest second generation Raspberry Pi 2 running Raspbian Wheezy 3.18.9-v7+. The Raspberry Pi 2 has a 900MHz quad-core ARM Cortex-A7 (ARMv7) processor with 1GByte of primary memory. Although the tutorial series demonstrates PERF on Cortex-A7, the same PERF commands and analytical techniques can be employed on other architectures like x86.

A special note for Raspberry Pi users. The current stable distribution of Raspbian Wheezy — 3.18.7-v7+ February 2015 — does not support PERF hardware events. Full PERF support was enabled in a later, intermediate release and full PERF support should be available in the next stable release of Raspbian Wheezy. In the meantime, Raspberry Pi 2 users may profile their programs using PERF software events as shown in Part 1 of the tutorial. First generation Raspberry Pi users are also restricted to software performance events.

Brave souls may try rpi-update to upgrade to the latest and possibly unstable release. I recommend waiting for the next stable release unless you really, really know what you are doing and are willing to chance an unstable kernel with potentially catastrophic consequences.

rpistat: RPi event monitoring tool

[At the time of this post, Performance Events for Linux (PERF) supports only software events on Raspberry Pi. Rpistat is a tool that counts hardware performance events in the style of perf stat and is a poor man’s substitute.]

So far, my example programs for the Raspberry Pi (matrix multiplication, pointer chasing) are instrumented with explicit code to measure and display hardware performance events. The instrumentation code executes privileged ARM instructions to read and write the performance counters and control register. The aprofile kernel module must be loaded in order to enable user-space access to the ARM1176 performance monitoring unit (PMU). If aprofile is not loaded, the privileged instructions trap to the OS and the process terminates abnormally.

This approach, which I call self-measurement, requires modification to the application source code. While self-measurement collects performance event data about specific code regions in the application, source code changes are intrusive, inconvenient and involve more work.

Performance Events for Linux (PERF) is a profiling infrastructure that provides a suite of performance analysis tools on Linux. The perf stat tool launches a workload and counts (selected) hardware or software performance events caused by the workload. No changes to the workload are required at either the source or binary levels.

Rpistat is similar to perf stat. It launches a workload and counts nine of the most commonly used ARM1176 hardware performance events while the workload executes. Running rpistat is easy. Just enter “rpistat” followed by the command you would normally use to launch the workload from the shell.

rpistat naive

In the example above, the workload is the naive matrix multiplication program. Rpistat writes a file named rpistat.txt when the workload completes. The file contains a basic performance report. Here is example output from rpistat.

rpistat: ./naive
Thu Jun 27 09:58:08 2013

System information
  Number of processors:     1

Performance events
  [ ... ] = scaled event count
  PTI     = per thousand instructions
  Total periods:      169

  Cycles:             11,759,598,287
  Instructions:       315,810,640  [1,241,209,259]
  IBUF stall cycles:  65,981,902  [259,324,219]
  Instr periods:      43
  CPI:                9.474  
  IBUF stall percent: 2.205  %

  DC cached accesses: 4,558,795  [18,343,722]
  DC misses:          933,837  [3,757,582]
  DC periods:         42
  DC miss ratio:      20.484 %

  MicroTLB misses:    224,886  [904,898]
  Main TLB misses:    172,973  [696,010]
  TLB periods:        42
  Micro miss rate:    0.729   PTI
  Main miss rate:     0.561   PTI

  Branches:           33,438,664  [134,550,814]
  Mispredicted BR:    366,383  [1,474,255]
  BR periods:         42
  Branch rate:        108.403 PTI
  Mispredict ratio:   1.096  %

The report includes raw event counts, scaled event counts, rates and ratios. The raw event counts are the actual number of events counted by the hardware performance counters. A rate tells us how often a given event is occurring in terms of events per thousand instructions (PTI). A ratio tells us what portion of events have a certain property, such as the percentage of non-sequential data cache accesses that result in a miss.

Rpistat uses a time division multiplexing scheme to periodically switch the performance counters across the nine events of interest. Rpistat makes a switch every 100 milliseconds (0.1s). The ARM1176 dedicates one performance counter to processor cycles. Rpistat exploits this counter to the max and measures processor cycles all the time (a 100% duty-cycle for this event). Rpistat switches through the other eight events in pairs called event sets. Each event set is measured for about 25% of the overall time. Rpistat keeps track of the active time for each event and scales the raw event counts up to full duty-cycle estimates. These scaled event counts are displayed within square brackets [ ... ] in the output file.

When rpistat switches between event sets, it first accumulates the current counts into 64-bit unsigned integers called virtual counters. Rpistat avoids overflow problems because the measurement period (0.1s) is much shorter than the time needed to overflow the 32-bit hardware counters. The long virtual counter length (64 bits) postpones any real overflow for a very long time, effectively eliminating the practical possibility of an overflow.

If you would like to know more about rpistat’s implementation including important design concerns and limitations, please read about it here.

How does rpistat compare against self-measurement for accuracy? Here’s a quick comparison between the two approaches when they are applied to the naive (textbook) matrix multiplication program.

Metric Self-measurement Rpistat
Elapsed time 16.0s 16.6s
Scaled cycles 181,984,943 184,288,556
Instructions 1,190,897,822 1,245,571,856
DC access 17,588,053 18,587,014
DC miss 2,982,314 3,735,024
MicroTLB miss 809,154 904,166
Main TLB miss 612,688 661,667
CPI 9.78 CPI 9.47 CPI
DC access rate 14.77 PTI 14.92 PTI
DC miss rate 2.50 PTI 2.99 PTI
DC miss ratio 17.0% 20.1%
MicroTLB rate 0.68 PTI 0.73 PTI
Main TLB rate 0.51 PTI 0.53 PTI
Self-measurement vs. rpistat

The event counts, rates and ratios are within normal run-to-run variability in every case.

The matrix multiplication program has fairly consistent (uniform) dynamic behavior throughout its lifetime. Workloads with distinct processing phases may not fair as well. YMMV. Please see the rpistat page for more information and links to source (rpistat.c).

Memory access time

In my last post, I used a simple pointer chasing loop to characterize the different levels of the BCM2835 (Raspberyy Pi) memory hierarchy. I ran the pointer chasing loop for a range of linked list sizes where the list size determines the storage level to be exercised. For example, a linked list occupying 16KB or less causes the chase loop to exercise the level 1 (L1) data cache. I measured hardware performance events for each test case and summarized the results in a table. The results clearly show three different timing tiers: L1 data cache, access to primary memory (no TLB miss) and access to primary memory with a Main TLB miss. A miss in the Main TLB is quite slow because the hardware page table walker must read page mapping information from primary memory in order to update the TLB and to complete the memory operation in flight.

The one characteristic missing from the analysis, however, is the estimated access time for each level. Sure, the execution times clearly break the test cases into tiers, but elapsed execution time does not really measure the access (latency) time itself.

In the next experiment, I estimate the access time. (See the source code in latency.c.) The approach is essentially unchanged from the first experiment: use pointer chasing on lists of different sizes to hit in particular levels of the memory hierarchy. Like the first experiment, the program (named latency) successively traverses the same linked list many times (default: 1000 traversals). Part way through each traversal, the loop chooses a chase operation and measures its execution time. The sampled execution time is saved temporarily in an array until all traversals are complete. Then, the program writes the execution times to a file named samples.dat. The samples are aggregated into a histogram which shows the distribution of the times.

The sampling technique lets each full traversal get underway before taking a measurement in order to “warm up” the pipeline, cache and TLBs. Taking a single sample per traversal avoids major cache and TLB pollution due to the bookkeeping needed to take and save samples. The measured execution times should faithfully represent access time (plus some measurement bias).

The program measures execution time in cycles. It configures and enables the ARM1176 Cycles Counter Register (CCR) as a free-running cycles counter. The program resets and enables the CCR before starting a traversal, thus avoiding a counter overflow. (A full traversal completes before the CCR overflows.) The pointer chasing loop reads the CCR before chasing the next pointer and reads the CCR after chasing the pointer. The difference between the after and before counts is the estimated execution time of the chase operation. The chase operation is implemented as a single ARM load instruction, so the instruction execution time is a biased estimate of the memory access time.

Here is the code for the function sample_cycles() and the pointer chasing loop within. The function takes two arguments: a pointer to the head of the linked list and an integer value that chooses the chase operation to be sampled.

uint32_t sample_cycles(CacheLine* linked_list, int n)
  register CacheLine* item ;
  register uint32_t before, after, cycles ;
  register count = n ;

  cycles = 0 ;
  for(item = linked_list ; item != NULL ; ) {
    if (count-- == 0) {
      before = armv6_read_ccr() ;
      item = item->ptrCacheLine[0] ;
      after = armv6_read_ccr() ;
      cycles = after - before ;
    } else {
      item = item->ptrCacheLine[0] ;

  return( cycles ) ;

On each iteration, the loop body checks and decrements the sampling count maintained in the variable count. If the sampling count is not zero, then the loop quickly performs an ordinary chase operation. If the count is zero, then the code snaps the before and after CCR values by calling armv6_read_ccr(), an in-line function that performs an ARM MRC instruction to read CCR. The difference is computed and is saved in the variable cycles until it can be returned by the function after traversal. The chase loop exits when it reaches the NULL pointer at the end of the linked list.

Local variables are allocated to general registers in order to avoid extraneous cache and TLB references that would perturb measurements. I checked the compiled machine code to make sure that the compiler accepted the register hints and actually allocated the local variables to registers. I also made sure that the function armv6_read_ccr() was expanded in-line.

The latency program needs to have user-space access to the ARM1176 performance counters. You must load the aprofile kernel module before running latency. Otherwise, you will get an “Illegal instruction” exception when the program attempts to configure the performance counters.

Here is a histogram showing the distribution of execution time samples when the linked list is 8KB in size. The 8KB test case consistently hits in the L1 data cache.

9  1000 ################################################

The first column is the execution time in cycles. The second column is the number of samples with the value shown in the first column. All 1,000 samples are 9 cycles.

Chapter 16 of the ARM1176JZF-S Technical Reference Manual (TRM) describes instruction cycle timings. A load from L1 data cache has a 3 cycle load-to-use latency (access) time. Taking 3 cycles as the actual access time, I estimate a measurement bias of (9 cycles – 3 cycles) = 6 cycles due to the pipelined execution of the ARM MRC instructions that read the CCR and the load instruction that chases the linked list pointer.

The 32KB and 64KB test cases form the middle timing tier. Both test cases are within the coverage of the Main TLB (288KB) and memory accesses hit in either the Micro TLB or Main TLB. The distributions are similar. Here is the distribution for the 64KB case.

 9   49 ##### 
17    1 # 
61  320 ##########################
62  571 ################################################
63   11 # 
64    4 # 
65    1 # 
69    1 # 
70   27 ### 

There is a strong peak in the narrow range of [61:62] cycles. The nonbiased estimate for access time to primary memory with a TLB hit is (62 cycles – 6 cycles bias) = 56 cycles.

The test cases at 256KB and larger form the third timing tier. These cases exceed the Main TLB coverage. (The 256KB case almost occupies the entire Main TLB and the performance event data place this case in the third and longest running tier.) Load operations miss in the Main TLB forcing a hardware page table walk. The table walker performs at least one additional primary memory read operation in order to obtain page mapping information for the TLB. Here is the distribution for the 1MB case.

113    4 # 
116  224 ##############################################
117   23 ###### 
118   59 ############## 
119  129 ###########################
120    5 ## 
121  224 ##############################################
122  185 ############
125    4 # 
130    9 ### 
132    7 ## 
133   18 ##### 
135    5 ## 

The samples are clustered in the range of [116:122] cycles. The nonbiased estimate for access time to primary memory with a Main TLB miss is (122 cycles – 6 cycles bias) = 116 cycles. This is approximately twice the access time of a load that hits in either the data MicroTLB or the Main TLB.

The following table summarizes the access time for each level of the memory hierarchy.

Level Access time Condition
CPU register 1 cycle
L1 data cache 3 cycles Data cache hit
Primary memory 56 cycles No TLB miss
Primary memory 116 cycles Main TLB miss
Memory hierarchy access times (latencies)

Where is the L2 cache? The BCM2835 dedicates the L2 cache to the Broadcom VideoCore GPU. Memory operations from the CPU are routed around the L2 cache. The L2 cache is not a factor here and is not part of the analysis.

From these estimates, we can see why the textbook (naive) matrix multiplication program is so slow. The textbook algorithm does not walk sequentially through one of its operand arrays. The long memory stride causes a large number of Main TLB misses. Memory access is twice as slow when an access misses in the Main TLB, thereby slowing down program execution and increasing elapsed time.

Memory hierarchy and performance events

My latest set of experiments with Raspberry Pi use the BCM2835 performance counters to exercise and observe the behavior of the RPi memory system. The experiments use a simple microbenchmark that traverses a linked list. This technique is often called pointer chasing because list traversal chases the pointers through the linked list from beginning to end. A pointer chasing loop is simple, is easy to instrument, and has consistently repeatable behavior.

Here’s the approach. We exercise a single specific level in the memory hierarchy (such as the level 1 data cache) simply by adjusting the size of the linked list. We just need to choose a size such that the the physical size of the list is less than or equal to the capacity of a level — 16KB in the case of the level 1 (L1) data cache. Storage for the linked list is allocated as a contiguous block of memory (an array of bytes), so it’s easier to think of the linked list size in terms of its contiguous footprint in memory, the array size. Each linked list item is the same size as a cache line: 32 bytes.

The first experiment measures seven kinds of performance events: cycles, executed instructions, L1 data cache accesses, L1 data cache misses, data MicroTLB misses, Main TLB misses and explicit external data accesses.

Before we take a look at the results of the first experiment, here’s a few things to keep in mind.

  • The Broadcom BCM2835 128KB L2 cache is dedicated to the VideoCore GPU. Memory requests from the CPU are routed around the L2 and go directly to primary memory. L2 cache is not a factor in our analysis.
  • Coverage is the amount of primary memory accessible through a translation lookaside buffer (TLB) without incurring a TLB miss.
  • The MicroTLB has ten fully associative (page) entries. Its coverage is (10 entries * 4,096 bytes) = 40KB.
  • The Main TLB handles MicroTLB misses and has 8 fully associative entries plus 64 2-way associative entries. Its coverage is (72 entries * 4,096 bytes) = 288KB.
  • A hardware page table walker handles Main TLB misses. A page table walk requires at least one additional read in primary memory in order to look up page mapping information.

These memory system characteristics affect the results and should be directly observable.

The following table summarizes basic measurements for ten test runs. The first column is the size of the dynamically allocated block of memory that holds the linked list elements. The second column is the size of the block expressed as 32-byte cache lines. There is one list item per cache lines, so this is also the number of linked list items. The third column is the number of list traversals (one traversal per iteration). I adjusted the number of iterations for each run in order to perform exactly the same number of pointer chasing operations per test case. This allows us to make meaningful run-to-run comparisons.

 Size  Cache lines  Iterations   Time    CPI
-----  -----------  ----------  -----  -----
  4MB         128K        1024  25.76  20.51
  2MB          64K        2048  25.74  20.52
  1MB          32K        4096  25.98  20.70
512KB          16K        8192  25.56  20.41
256KB           8K       16384  24.62  19.71
128KB           4K       32768  19.99  16.24
 64KB           2K       65536  11.75   9.61
 32KB           1K      131072  10.49   8.77
 16KB          512      262144   2.66   2.30
  8KB          256      524288   2.39   2.04

The fourth column is the total elapsed time per run. There are roughly three distinct timing tiers:

  • Tier A: 4MB, 2MB, 1MB, 512KB, 256KB, 128KB
  • Tier B: 32KB, 64KB
  • Tier C: 8KB, 16KB

The 128KB case is somewhat “transitional,” but I placed it in Tier A. The cycles per instruction (CPI column) ratios are consistent with respect to the elapsed times.

The following table summarizes the performance event counts for the same test cases (block/array size and iterations).

                           MicroTLB  Main TLB  External
 Size DC access  DC miss      miss      miss    access
----- ---------  -------  --------  --------  --------
  4MB  24696257  5107400   1049399    258783   7433309 C
  2MB  24927252  5119529   1051719    259531   7509503 C
  1MB  24878797  5145341   1065404    273960   7521771 C
512KB  24668739  5086410   1050024    259104   7461179 C
256KB  22999900  4803510   1057814    308477   7263176 C
128KB  20051408  4258682    965191    181398   6136462 C
 64KB  11972871  2683992    620354     91207   3785568 B
 32KB  10096551  2288319    548311     71756   3231861 B
 16KB   2621836   594034    136900      8265    804582 A
  8KB   3480732   446481     72712      2757    480707 A

Keeping the basic measurements in mind, we can see that the performance event counts for Tier C are consistent with L1 data cache access. We are measuring on a single core computer and the data cache and TLB misses are mostly due to other system activity (like OS timer interrupts). Tier B cases hit in either the MicroTLB or Main TLB. Tier B are mainly references to primary memory without any TLB miss. The Tier B array sizes are within the MicroTLB and/or Main TLB coverage. Tier A cases miss in the TLBs. Tier A references require an additional primary memory read by the page walker. Tier A has the longest elapsed execution time.

Next up, we’ll measure the memory access (latency) time to primary memory.

Perils of performance analysis on single core

A new article on the Raspberry Pi (Broadcom BCM2835) memory hierarchy is almost ready. The first code has already been posted.

I’ve been working on multi-core processors for so long that I forgot what it’s like to take measurements on a single core machine like the Raspberry Pi.

In the ideal world, a benchmark or performance test program has the machine to itself and no other program or system activity perturbs it. Measurements on the ideal machine accurately and exactly reflect the dynamic behavior and performance of the program. On multi-core, you can usually assign the test program to an idle core (or two), preferably a core that is free of operating system activity. With careful process or thread placement, results on multi-core approach the ideal.

On single core, we don’t have any luxury. The test program has to share the one core with other programs and the operating system. On Raspberry Pi, Linux fires up services that run periodically. Even if we shut the services off, the system clock continues to run and it generates interrupts. At the very least, extraneous activity affects elapsed, user and system time measurements.

When we measure performance events, however, there is a deeper level of interference. The core has one physical level 1 (L1) data cache, one physical MicroTLB, one physical Main TLB and one physical branch history table. These microarchitecural components are transparent to the architecture, but they must be shared between programs and the OS. A context switch may cause a cache or TLB flush which invalidates the entire contents of the cache/TLB. Cache, TLB or branch history may be partially polluted by other software activity. The final performance event counts are affected by flushes and pollution and do not accurately reflect the behavior of the test program.

I ran into this issue while characterizing the memory hierarchy with performance events. One test case is designed to exercise only the L1 data cache and never touch primary memory. Yet, the test case measured a rather significant number of data cache misses beyond the compulsory misses that I would have expected. The extra misses are most likely caused by timer interrupts. I now think of these extra misses as “background radiation” which bias measurement.

Such are the perils of performance measurement and analysis on single core!

Faster matrix multiplication (part 2 of 2)

Part 2 of two parts on matrix multiplication demonstrates a fast matrix multiplication program. The algorithm is a simple transformation of the textbook algorithm — the olde loop nest interchange. The transformation changes the slow access pattern to one of the arrays so that the program steps sequentially through the array elements in memory. Execution time speeds up from about 16 seconds elapsed time to 6 seconds. Not bad for a few minutes work!

All of the key memory-related performance events are improved since the access pattern is a better fit with the underlying memory microarchitecture. The analysis shows that we need to be careful when interpreting the Data Cache Access event because this event counts nonsequential memory accesses instead of all level 1 DC accesses or architectural loads and stores.

Part 2 also discusses operation or instruction counting to analyze program complexity at a micro-level. I like to look at the assembler code generated by the compiler to see if there are any potential speed-ups. The article shows how to look at the assembler code using the GCC -S option and using the objdump program. I use instruction counting to check the operation and meaning of performance events like the ARM11 Executed Instructions event.

The Broadcom BCM2835 in the Raspberry Pi has an integer core and a Vector Floating Point (VFP) coprocessor. The VFP operates concurrently with the integer core. In fact, it operates quite independently and only synchronizes with the integer core at a few well-defined points. VFP instructions are allowed to complete out of order, which allows for greater speed, but makes FP exceptions somewhat imprecise. (Now exactly where did that underflow/overflow occur?) The VFP coprocessor has 32 registers of its own, which are organized as four 8-register banks. GCC uses the coprocessor for scalar floating point arithmetic, but doesn’t exploit any parallelism.

The VFP operates on short vectors in a register bank. Potentialy, the VFP coprocessor could be exploited to further speed up matrix multiplication. One possibility is to stream incoming data as a four-wide stripe through an array and operate on four elements at once. Or, stream four elements at a time from a single row/column. Take a look at the VFP Math Library.

It’s not all good news, however. The VFP coprocessor is not a true single instruction, multiple data (SIMD) engine. (It’s similar to an old school short vector architecture.) It only has a single floating multiply/accumulate (FMAC) pipeline. A true SIMD would have four FMAC units. Also, computations are relatively difficult to set up and stage. Computations must be double buffered where the integer Load Store Unit (LSU) is filling one register bank while the coprocessor is performing computations in a different register bank. Further, GCC vectorization doesn’t appear to support VFP.

ARM must have gotten the message from its users. Later processors implement NEON SIMD and just enough VFP for the sake of legacy compatibility. The Beaglebone Black (ARM Cortex-A8) has NEON and I’m looking forward to trying it out. GCC vectorization supports NEON, too, and it’s a whole lot easier to let the compiler vectorize your program for you than to write vector code yourself!

The BCM2835 also has the VideoCore GPU for SIMD computation. There are a bunch of folks who are reverse engineering the GPU in order to use it for general purpose computation (GPGPU). Have it at, guys and gals!

Even if VFP is an orphan, the coprocessor has 32 registers where you can stash data. Maybe you can find a way to make use of these extra registers? Side-to-side access (integer/floating) is pretty fast.

Raspberry Pi performance counters (part 1)

Finally, an example to show the Raspberry Pi performance counters in action. My friends will no doubt chuckle because the first example is an analysis of matrix multiplication. (“He always starts with matrix multiplication…”) Matrix multiplication is a good place to start because it is a small easy to build and easy to analyze program with a known performance issue. It’s a great way to get an intuitive feel for the performance events on a new, unfamiliar platform like the Raspberry Pi. I’ve analyzed this example on x86, SPARC, Itanium and Alpha, so I already have a fair bit of history with it.

Part 1 of the example shows how to use the Raspberry Pi performance counter kernel module and the user-space support functions. I collect performance event data for the infamous textbook implementation of matrix multiplication and define a few useful rates and ratios to help interpret the event counts. There is also a brief introduction to memory hierarchy in order to provide a little background for data cache and translation lookaside buffer (TLB) behavior.

I’m in the process of writing part 2, which explains and demonstrates an improve matrix multiplication program. The code for part 2 is already in the source area of this site.

After doing some comparative analysis, I strongly encourage you to read carefully the definitions of the ARM1176 performance events. The “data cache access” events, in particular, only count nonsequential data cache accesses. This important qualification affects the interpretation of performance measurements. In particular, you can’t compute a pure data cache miss ratio, that is, all data cache misses divided by all data cache accesses.

The descriptions of the ARM1176 performance events are a little bit sketchy. ARM did a better job describing the Cortex-A8 events, for example. Adopting a Zen attitude, the ARM1176 events are what they are, they will not change or be updated, and we need to accept them.

Performance events and Zen

It’s always interesting to get started with a new microarchitecture and the ARM 1176 inside of the Raspberry Pi is no exception.

Back when we were in school, we all dutifully went to computer architecture class and learned about memory hierarchy, read/write access, cache memory, and the translation lookaside buffer (TLB). A hit was a clean hit and a miss was, well, a miss.

Real world behavior of cache memory and the TLB is far more complicated. Computer designers study the behavior of benchmark and application programs in order to find behavioral patterns which the hardware can exploit for speed. This includes behavior like sequential access, fixed-length address strides, temporal and spatial locality. Then the designers build hardware which implements every trick in the book, all in the name of faster access to memory data and ultimately, faster programs. In the case of low-power machines like ARM, computer designers use behavioral patterns to turn off or not actively use functional components in order to save power. Inactive components don’t consume power, but they don’t generate observable signals either.

A real world performance event is a dynamic condition which occurs in the midst of this complicated hardware. An access or miss may be counted only for the first of multiple sequential reads/writes to the same cache line. Other undocumented internal microarchitectural conditions affect the event counts. Suddenly, it’s not so easy to interpret performance event counts armed with our textbook notions of cache access and cache miss. It may not even be possible to effectively compare the behavior of two versions of the same program (one version tuned, the other version untuned).

We can whine, whinge and kvetch about the limitations of real world performance events, the lack of documentation, and other shortcomings. At the end of the day, the performance events are what they are and we need to accept them. Therein lies the Zen of performance events.

The fine print in the TRM

I’ve been busy writing up the first example showing how to use the Raspberry Pi performance counter kernel module and the ARM11 performance counters. These write-ups always take longer than I think, so here’s a few small things that I’ve learned along the way.

Run, don’t walk, to the ARM web site and download a copy of the ARM1176JZF-S Technical Reference Manual. The TRM has the goodies and details about the ARM1176 core in the Raspberry Pi. This includes information about cache and TLB sizes, branch prediction, instruction timing and, of course, the performance counters and events. The TRM is essential reading if you want to know more about the processor at the heart of the Raspberry Pi.

Good as it is, the TRM doesn’t contain all of the information and in some cases, the descriptions are very low level and esoteric. Computer designers describe hardware from their point of view, which is not necessarily the perspective of a software engineer. In fact, the terminology is often quite foreign to the ears of a software engineer. The TRM is no exception.

The first question many people ask about the Raspberry Pi is “What happened to the L2 cache?” The L2 cache is not part of the ARM1176 core, so the TRM describes the interface to the L2 and is mum about the L2 itself. The best clues about the L2 cache appear at the beginning of the Broadcom BCM2835 ARM Peripherals manual. Broadcom has implemented a pretty decent graphics processing unit (the VideoCore) and are reluctant to release too many details about it lest competitors learn too much about their design. The diagram at the beginning of the peripherals manual shows the memory layout and bus structure of the BCM2835. There are actually two memory management units (MMU). The ARM MMU maps program virtual addresses to physical addresses and the VC/ARM MMU maps physical addresses onto the VC/CPU bus and real honest to goodness physical memory. The high order bits of the VC/CPU address determine the cacheable status of memory regions including, TA-DA!, the L2 cache. The footnote on page 6 says, “BCM2835 provides a 128KB system L2 cache, which is used primarily by the GPU. Accesses to memory are routed either via or around the L2 cache depending on senior two bits of the bus address.” So, under normal circumstances, memory reads/writes made by a program are not routed through the L2 cache. I suspect, and this is a guess, that the L2 cache boosts the bandwidth for the memory hungry GPU. Well, we may never really know.

The ARM1176 has two separate level 1 (L1) caches: a 16KB instruction cache and a 16KB data cache. I verified the cache size using the coprocessor’s Cache Type Register. The TRM is not always clear about the number and size of the translation lookaside buffers (TLB). There are two level 1 MicroTLBs. Each MicroTLB has ten entries. The MicroTLBs are backed by a Main TLB consisting of eight fully associative elements and a 64 element low-associativity store. Depending on how Linux uses the eight fully associative, lockable entries, as many as 72 entries are available for address translation.

The other characteristics that require some digging are the load-to-use latencies for the L1 cache and the TLBs. I’m just diving into the instruction timing information in Chapter 16 of the TRM. However, the load-to-use latency for a hit in the L1 data cache is three cycles. That is, the data from a load hitting in the L1 data cache is not available for 3 cycles once the load is issued. Dedicated datapaths called bypasses forward load data to other instructions in the pipeline.

The Load/Store Unit does not always block on an L1 data cache miss and supports hit under miss (HUM). The miss goes into a holding state/buffer and non-dependent instructions are allowed to execute. Up to three outstanding misses are allowed. This increases parallelism letting some computations proceed even when a load misses in the L1 data cache.

Finally, the ARM1176 is a single issue machine, that is, one instruction is issued at a time. This constraint simplifies the issue logic. The ARM1176 doesn’t need to implement any complicated issue rules to determine whether two or more instructions of a certain type can issue in the same cycle. Instructions are issued in-order, but out-of-order completion is allowed. out-of-order completion increases exploitable fine-grained parallelism.

Another neat feature of the ARM — nearly all instructions can be predicated. An enabling condition (a predicate) can be defined for an instruction. The enabling condition gates the execution of the instruction. Predication is cool and is used to eliminate branches. More about predication one of these days…

Hope you enjoyed this trip through the fine print!