RPi 4 tuning: The code

I hope you have enjoyed my series of articles about Raspberry Pi 4 performance events, measurement and tuning.

Today, I want to wrap up the series with C code.

Please don’t forget my Performance Events for Linux tutorial and learn to make your own Raspberry Pi 4 (Broadcom BCM2711) performance measurements. The commands in the PERF tutorial apply to x86, AMD64 and other architectures, too.

Before getting too far, here is the link to the ZIP file with the code. 🙂 The main source files are:

  • makefile: The make file (duh!)
  • pe_assist.h: Performance event helper header
  • pe_assist.c: Performance event helper functions
  • pe_cortex_a72.h: A72-specific helper header
  • pe_cortex_a72.c: A72-specific helper functions
  • pe_test.c: pe_assist check-out test
  • pe_matrix.c: pe_assist matrix multiply example
  • a72_test.c: A72-specific check-out test
  • a72_walk.c: A72-specific array walk kernel
  • a72_matrix.c: A72-specific matrix multiply
  • a72_misp.c: A72-specific branch mispredict kernel
  • a72_chase.c: A72-specific pointer chasing kernel

There are a few surprises inside, too, such as earlier versions of the code.

The programs self-monitor, that is, they call perf_event_open() to configure, control and read the performance counters. perf_event_open() has many parameters, so I wrote helper functions that assist with counter configuration, control and access. There are two flavors: architecture independent and Cortex-A72 specific. The architecture independent functions are defined in pe_assist.* and the A72-specific functions are defined in pe_cortex_a72.*. The architecture independent functions should work on x86, etc., too.

Aside from the two check-out tests, the rest of the source modules are workloads. These are the programs that I used to collect data for the articles about Cortex-A72 performance measurement, analysis and tuning. Feel free to bash away at everything!

Helper functions

As I mentioned above, I separated the helper functions into architecture independent and Cortex-A72 specific modules. The architecture independent helper functions handle Linux performance counter set-up, control and read back:

  • peInitialize(): Initialize/reset the helper module
  • peMakeGroup(): Make a counter group
  • peAddLeader(): Add leader event to the group
  • peAddEvent(): Add an event to the group
  • peStartCounting(): Start the counter group
  • peStopCounting(): Stop the counter group
  • peResetCounters(): Reset the counters
  • peReadCount(): Read an event count
  • pePrintCount(): Print an event count

The interface is “lite” and uncomplicated. It’s just enough to get the job done. Early in a project, there is a temptation to build the Taj Mahal. I prefer to build something simple and get experience before building up and out. This simple interface proved to be good enough.

perf_event_open() supports simple event counting and sampling. If you’re familiar with perf stat, you’ve already seen simple event counting, AKA counting mode. perf stat measures events across the entire run of an application program. Self-monitoring is similar, except that you insert measurement code into the application program around the critical code that you wish to measure. perf stat doesn’t require code modification or recompilation, but it doesn’t let you focus on particular critical loops or other specific regions of code. Self-monitoring is a little more effort, but it allows that focus.

The usage model is straightforward:

  1. Initialize the module data structures.
  2. Create a performance event group.
  3. Add a leader event to the group.
  4. Add other events (up to 6 events for Cortex-A72) to the group.
  5. Start the counter group.
  6. Execute the workload or critical inner loops.
  7. Stop the counter group.
  8. Read and print the event counts.
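
In code, the pattern looks something like the sketch below. The function names come from pe_assist.h, but the argument lists shown here are my assumptions (check the header in the ZIP file for the real signatures), and run_critical_loops() is a stand-in for your own code:

    /* Hypothetical self-monitoring skeleton following steps 1-8 above. */
    #include <stdio.h>
    #include <linux/perf_event.h>
    #include "pe_assist.h"

    extern void run_critical_loops(void) ;        /* the code under test */

    void measure_workload(void)
    {
        peInitialize() ;                          /* 1. Initialize the module        */
        peMakeGroup() ;                           /* 2. Create an event group        */
        peAddLeader(PERF_COUNT_HW_CPU_CYCLES) ;   /* 3. Leader event                 */
        peAddEvent(PERF_COUNT_HW_INSTRUCTIONS) ;  /* 4. Add more events (up to six)  */

        peStartCounting() ;                       /* 5. Start the counter group      */
        run_critical_loops() ;                    /* 6. Execute the workload         */
        peStopCounting() ;                        /* 7. Stop the counter group       */

        pePrintCount(stdout) ;                    /* 8. Read and print event counts  */
    }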

Since this sequence is a recurring pattern, I also wrote a few functions which target common types of measurements such as:

  • peMeasureInstructionEvents()
  • pePrintInstructionEvents()
  • peMeasureDataAccessEvents()
  • pePrintDataAccessEvents()

These functions configure the pre-defined “symbolic” events which the Linux kernel has preselected for the platform architecture. Thus, you should be able to use the pe_assist.* module on any Linux box.

The Cortex-A72 module, pe_cortex_a72.*, uses “raw” event identifiers for configuration. The available events are defined in pe_cortex_a72.h and they are specific to ARM Cortex-A72. I rely mainly on the Cortex-A72 events because then I know exactly which A72 events I am measuring. The Cortex-A72 module calls the low-level helper functions and it exports only targeted measurement functions:

  • a72MeasureInstructionEvents()
  • a72PrintInstructionEvents()
  • a72MeasureDataAccessEvents()
  • a72PrintDataAccessEvents()
  • a72MeasureTlbEvents()
  • a72PrintTlbEvents()

Take a peek inside one of the test programs and you’ll see how to call the helper modules.
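
If you’re curious how a “raw” event identifier reaches the kernel, here is a minimal sketch of the essential perf_event_attr set-up. This is a simplified illustration, not the actual pe_cortex_a72.c source; the event numbers come from the Cortex-A72 TRM (see the event tables later in this series):

    #include <string.h>
    #include <linux/perf_event.h>

    #define A72_INST_RETIRED 0x08   /* Instruction architecturally executed */
    #define A72_CPU_CYCLES   0x11   /* Processor cycles                     */

    static void configure_raw_event(struct perf_event_attr *attr, unsigned long event)
    {
        memset(attr, 0, sizeof(*attr)) ;
        attr->size = sizeof(*attr) ;
        attr->type = PERF_TYPE_RAW ;   /* raw event, not a kernel symbolic event */
        attr->config = event ;         /* e.g., A72_INST_RETIRED                 */
        attr->exclude_kernel = 1 ;     /* count user-space activity only         */
        attr->exclude_hv = 1 ;
    }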

Internal design

perf_event_open() is the Swiss Army knife of performance counter configuration and control. On Linux, all counter-related operations go through this single kernel call.

perf_event_open() allows control of individual counters, of course. However, it also provides a way to control a group of counters. One can save additional trips in and out of the kernel through counter groups. Instead of making six system calls to start six counters, one only needs a single call on the group leader to start an entire group of six events.

A Cortex-A72 group consists of one to six event counters. Each group has a distinguished member: the leader event. You can start, stop and reset the entire group by referring to the leader. Because the group members usually share characteristics like the process (ID) to be measured, the CPU set, flags, etc., it makes sense to define all of these common properties for an entire group. This approach reduces the number of parameters to be passed around during configuration.
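
Here is a minimal sketch of group creation and control, similar in spirit to what the helper module does internally (error handling omitted). perf_event_open() has no glibc wrapper, so it is invoked through syscall():

    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <linux/perf_event.h>

    static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                int cpu, int group_fd, unsigned long flags)
    {
        return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags) ;
    }

    static int start_group(struct perf_event_attr *leader, struct perf_event_attr *member)
    {
        /* pid = 0 (this process), cpu = -1 (any CPU); group_fd = -1 opens a new group. */
        int leader_fd = perf_event_open(leader, 0, -1, -1, 0) ;
        int member_fd = perf_event_open(member, 0, -1, leader_fd, 0) ;
        (void) member_fd ;   /* each member is read back through its own descriptor */

        /* A single ioctl() on the leader resets and starts the whole group. */
        ioctl(leader_fd, PERF_EVENT_IOC_RESET,  PERF_IOC_FLAG_GROUP) ;
        ioctl(leader_fd, PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP) ;
        return leader_fd ;
    }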

In keeping with the “lite” philosophy, the helper module keeps the common flags and such in a few variables and arrays. The group and leader definition functions establish group-wide values for the member events in the group. That’s all there is to it, so hack away! The “lite” approach was good enough 99% of the time, so you might not need to dip into the helper modules at all.

Copyright © 2021 Paul J. Drongowski

ARM Cortex-A72 tuning: Memory access

Today’s post characterizes read access time to three different levels of the Raspberry Pi 4 (Broadcom BCM2711) memory hierarchy. The ARM Cortex-A72 processor has a two level cache structure: Level 1 data (L1D) cache and unified Level 2 cache. There is one L1D cache per core and all four cores share the L2 cache. Primary memory is the third and final level beyond L2 cache.

The test program is a simple kernel (inner loop) that runs through a linked list, i.e., pointer chasing. Each linked list element is exactly one A72 cache line in size, 64 bytes. I have used pointer chasing on other non-ARM architectures (Alpha and AMD64 come to mind) and it’s a pretty simple and effective way to characterize memory access speed.

The trick is to adjust the number of linked list elements so that the entire linked list fits within the memory level to be characterized. In order to facilitate run-to-run comparisons, there is an outer loop which repeatedly invokes list chasing, i.e., the entire list is walked multiple times per run.

There are two main run parameters:

  • The number of linked list elements (which determines the array size), and
  • The number of iterations (which is the number of times the full list is walked).

When the array is doubled, the number of iterations is cut in half. This keeps the number of individual pointer chase operations (approximately) constant across runs.
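
A short sketch makes the relationship concrete. The constant 2^28 is my reading of the run parameters in the table below (32 elements × 8,388,608 iterations = 268,435,456 pointer chases); the function name is illustrative:

    /* Halve the iteration count as the list doubles: the total number of
       pointer chase operations stays fixed at 2^28 per run. */
    #define TOTAL_CHASES (1UL << 28)

    unsigned long iterations_for(unsigned long num_elements)
    {
        return TOTAL_CHASES / num_elements ;   /* e.g., 32 elements -> 8,388,608 walks */
    }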

The following table summarizes the test run parameters and the memory level to be exercised by each run:

#Elements  Iterations  Array Size  Mem Level
---------  ----------  ----------  ---------
       32     8388608         2KB  L1D cache
       64     4194304         4KB  L1D cache
      128     2097152         8KB  L1D cache
      256     1048576        16KB  L1D cache
      512      524288        32KB  L1D cache
     1024      262144        64KB  L2 cache
     2048      131072       128KB  L2 cache
     4096       65536       256KB  L2 cache
     8192       32768       512KB  L2 cache
    16384       16384         1MB  L2 cache
    32768        8192         2MB  RAM
    65536        4096         4MB  RAM

Here is the C code for the test kernel:

    initialize(number_of_elements) ;
    a72MeasureDataAccessEvents() ;

    start_clock() ;
    peStartCounting() ;
    for ( ; iterations > 0 ; iterations--) {
        for (CacheLine *p = listHead ; p != NULL ; p = p->nextLine) ;
    }
    peStopCounting() ;

    print_clock_time(stdout, get_clock_time()) ;
    a72PrintDataAccessEvents(stdout) ;

Both Linux clock() time and Cortex-A72 performance counter events are measured.

In my first experiments, the linked list elements were laid down in a linear sequential fashion and in a simple ping-pong scheme. I quickly discovered that Cortex-A72’s aggressive data prefetch is too good: the naive layouts did not produce the expected number of L1D or L2 cache misses. The A72 speculatively reads the next cache line beyond a miss, so by the time execution reached the next list element, the needed element was already in cache or in flight.

Ideally, we want to fool the memory prefetcher and hit only the intended memory level, taking the full read access penalty each time we chase a pointer. I rewrote array/list initialization to lay down the list elements at (pseudo-)random positions in the array. The Fisher-Yates (Knuth) shuffle algorithm got the job done. Once list element layout was randomized, the pointer chasing test began producing the expected number of reads and misses.
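
Here is a sketch of the randomized layout. Each element occupies exactly one 64-byte cache line, and a Fisher-Yates shuffle of the element indices determines the chase order. The names are illustrative and not necessarily those used in a72_chase.c:

    #include <stdlib.h>

    typedef struct CacheLine CacheLine ;
    struct CacheLine {
        CacheLine *nextLine ;                  /* the chased pointer              */
        char pad[64 - sizeof(CacheLine *)] ;   /* fill out one 64-byte cache line */
    } ;

    static CacheLine *build_list(CacheLine *array, size_t n)
    {
        size_t *order = malloc(n * sizeof(size_t)) ;
        for (size_t i = 0 ; i < n ; i++) order[i] = i ;

        /* Fisher-Yates (Knuth) shuffle of the visit order. */
        for (size_t i = n - 1 ; i > 0 ; i--) {
            size_t j = rand() % (i + 1) ;
            size_t t = order[i] ; order[i] = order[j] ; order[j] = t ;
        }

        /* Link the elements in shuffled order and terminate the list. */
        for (size_t i = 0 ; i < n - 1 ; i++)
            array[order[i]].nextLine = &array[order[i + 1]] ;
        array[order[n - 1]].nextLine = NULL ;

        CacheLine *head = &array[order[0]] ;
        free(order) ;
        return head ;
    }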

The following table summarizes each run by the number of retired instructions, CPU cycles, instructions per cycle (IPC) and execution time in seconds:

Array  Mem  Retired Ins      CPU Cycles    IPC    Time
-----  ---  -----------  --------------  -----  ------
  2KB  L1D  847,249,436     855,660,776  0.990   0.609
  4KB  L1D  826,277,916   1,154,215,728  0.716   0.814
  8KB  L1D  815,792,156   1,114,379,370  0.732   0.806
 16KB  L1D  810,549,276   1,093,757,212  0.741   0.763
 32KB  L1D  807,927,836   1,382,324,229  0.584   0.975
 64KB  L2   806,617,116   5,074,763,198  0.159   3.446
128KB  L2   805,961,756   5,643,312,493  0.143   3.805
256KB  L2   805,634,076   6,621,262,142  0.122   4.452
512KB  L2   805,470,236   7,163,843,161  0.112   4.813
  1MB  L2   805,388,316  27,563,140,814  0.029  18.421
  2MB  RAM  805,347,356  49,317,924,775  0.016  32.969
  4MB  RAM  805,326,876  54,865,753,267  0.015  36.645

No surprise, access to L1D cache is best, L2 is second best and primary memory is worst. In terms of CPU cycles, access to L2 cache takes about five times as long as access to L1D cache, and access to primary memory takes nearly 50 times as long. The effect on IPC is very significant.

Taking a look at the L1D performance event counts:

Array  Mem  IPC    Time     L1D Reads    L1D Misses   Ratio
-----  ---  -----  ------  -----------  -----------  ------
  2KB  L1D  0.990   0.609  268,435,785          484   0.000
  4KB  L1D  0.716   0.814  268,435,630        1,316   0.000
  8KB  L1D  0.732   0.806  268,435,639        1,149   0.000
 16KB  L1D  0.741   0.763  268,435,622        4,319  <0.001
 32KB  L1D  0.584   0.975  268,435,828   17,343,069   0.065
 64KB  L2   0.159   3.446  268,435,603  234,906,566   0.875
128KB  L2   0.143   3.805  268,435,592  268,435,529   1.000
256KB  L2   0.122   4.452  268,435,625  268,435,588   1.000
512KB  L2   0.112   4.813  268,435,599  268,435,530   1.000
  1MB  L2   0.029  18.421  268,435,594  268,435,782   1.000
  2MB  RAM  0.016  32.969  268,435,579  268,435,960   1.000
  4MB  RAM  0.015  36.645  268,435,635  268,435,941   1.000

we see that pointer chasing correctly and independently exercises the L1D cache according to design. The L1D cache capacity is 32KB. The particular 32KB run shown here has the shortest execution time of the 32KB runs and thus is cherry-picked. As I’ve seen on other architectures, measurements get a bit “weird” near cache capacity: when a cache is nearly full, run statistics become inconsistent. The shortest run best shows the break between L1D and L2 access.

Finally, here are the L2 cache performance event counts.

Array  Mem  IPC    Time      L2 Reads     L2 Misses   Ratio
-----  ---  -----  ------  -----------  -----------  ------
  2KB  L1D  0.990   0.609        1,085           68   0.063
  4KB  L1D  0.716   0.814    8,490,994          228   0.000
  8KB  L1D  0.732   0.806    4,300,759          151   0.000
 16KB  L1D  0.741   0.763    2,102,562          163   0.000
 32KB  L1D  0.584   0.975   18,495,230        1,003  <0.001
 64KB  L2   0.159   3.446  235,483,730        1,517  <0.001
128KB  L2   0.143   3.805  270,831,005        2,745  <0.001
256KB  L2   0.122   4.452  269,203,020       31,340  <0.001
512KB  L2   0.112   4.813  270,893,954      443,477   0.002
  1MB  L2   0.029  18.421  302,452,386  107,397,408   0.355
  2MB  RAM  0.016  32.969  286,244,127  227,010,870   0.793
  4MB  RAM  0.015  36.645  277,293,265  252,881,540   0.912

As expected, we see a dramatic breakpoint at 1MB, which is the capacity of the unified L2 cache.

Bottom line, these performance measurements reinforce the importance of cache-friendly algorithms and data access patterns. Start with the best algorithms for your application, measure cache events and then tune for minimum misses. Data access should hit most frequently in the Level 1 data cache, then L2 cache. Primary memory is fifty times (!) more expensive than L1D cache and reads out to primary memory should be as infrequent as possible. Your mantra should be, “Bring it into cache, compute the heck out of the in-cache data, then write the final results back to memory, and move on.”

Please check out the other articles in this series.

Don’t forget my Performance Events for Linux tutorial and learn to make your own Raspberry Pi 4 (Broadcom BCM2711) performance measurements.

Next time, I will wrap up this long series of articles with C code so you can perform your own experiments.

Copyright © 2021 Paul J. Drongowski

Cortex-A72 tuning: Data access

In my discussion about instructions per cycle as a performance metric, I compared the textbook implementation of matrix multiplication against the loop nest interchange version. The textbook program ran slower (28.6 seconds) than the interchange version (19.6 seconds). The interchange program executes 2.053 instructions per cycle (IPC) while the textbook version has a less-than-stunning 0.909 IPC.

Let’s see why this is the case.

Like many other array-oriented scientific computations, matrix multiplication is memory bandwidth limited. Matrix multiplication has two incoming data streams — one stream from each of the two operand matrices. There is one outgoing data stream for the matrix product. Thanks to data dependency, the incoming streams are more important than the outgoing matrix product stream. Thus, anything that we can do to speed up the flow of the incoming data streams will improve program performance.

Matrix multiplication is one of the most studied examples due to its simplicity, wide applicability and familiar mathematics. So, nothing in this note should be much of a surprise! Let’s pretend, for a moment, that we don’t know the final outcome of our analysis.

I measured retired instructions, CPU cycles and level 1 data (L1D) cache and level 2 (L2) cache read events:

Event                   Textbook        Interchange
----------------------  --------------  --------------
Retired instructions    38,227,831,497  60,210,830,509
CPU cycles              42,068,324,320  29,279,037,884
Instructions per cycle           0.909           2.056
L1 D-cache reads        15,070,922,957  19,094,920,483
L1 D-cache misses        1,096,278,643       9,576,935
L2 cache reads           1,896,007,792     264,923,412
L2 cache read misses       124,888,097     125,524,763

There is one big take-away here. The textbook program misses in the data cache far more often than interchange. The textbook L1D cache miss ratio is 0.073 (7.3%) while the interchange cache miss ratio is 0.001 (0.1%). As a consequence, the textbook program reads the slower level 2 (L2) cache more often to find necessary data.

If you noticed slightly different counts for the same event, good eye! The counts are from different runs. It’s normal to have small variations from run to run due to measurement error, unintended interference from system interrupts, etc. Results are largely consistent across runs.

The behavioral differences come down to the memory access pattern in each program. In the C language, two-dimensional arrays are arranged in row-major order. The textbook program touches one operand matrix in row-major order and touches the other operand matrix in column-major order. The interchange program touches both operand arrays in row-major order. Thanks to row-major order’s sequential memory access, the interchange program finds its data in level 1 data (L1D) cache more often than the textbook implementation.
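
For reference, here are the two loop nests in sketch form (see the part 1 and part 2 articles for the complete programs; the element type and the assumption that c[][] starts zeroed are mine):

    /* Textbook order: the inner k loop strides through b[][] column-major. */
    void multiply_textbook(int n, float a[n][n], float b[n][n], float c[n][n])
    {
        for (int i = 0 ; i < n ; i++)
            for (int j = 0 ; j < n ; j++)
                for (int k = 0 ; k < n ; k++)
                    c[i][j] += a[i][k] * b[k][j] ;
    }

    /* Loop nest interchange: swap the j and k loops so the inner loop
       walks both a[][] and b[][] row-major (sequentially in memory). */
    void multiply_interchange(int n, float a[n][n], float b[n][n], float c[n][n])
    {
        for (int i = 0 ; i < n ; i++)
            for (int k = 0 ; k < n ; k++)
                for (int j = 0 ; j < n ; j++)
                    c[i][j] += a[i][k] * b[k][j] ;
    }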

There is another micro-architecture aspect to this situation, too. Here are the performance event counts for translation look-aside buffer (TLB) behavior:

Event                 Textbook        Interchange
--------------------  --------------  --------------
Retired instructions  38,227,830,517  60,210,830,503
L1 D-cache reads      15,070,845,178  19,094,937,273
L1 DTLB miss           1,001,149,440          17,556
L1 DTLB miss LD        1,000,143,621          10,854
L1 DTLB miss ST            1,005,819           6,702

Due to the chosen matrix dimensions, the textbook program makes long strides through one of the operand matrices, again, due to the column-major order data access pattern. The stride is big enough to touch different memory pages, thereby causing level 1 data TLB (DTLB) misses. The textbook program has a 0.066 (6.6%) DTLB miss ratio. The miss ratio is near zero for the interchange version.

I hope this discussion motivates the importance of cache- and TLB-friendly algorithms and code. Please see my other articles in this series if you need to brush up on ARM Cortex-A72 micro-architecture and performance events.

Check out my Performance Events for Linux tutorial and learn to make your own Raspberry Pi 4 (Broadcom BCM2711) performance measurements.

Here is a list of the ARM Cortex-A72 performance events that are most useful for measuring memory access (load, store and fetch) behavior. Please see the ARM Cortex-A72 MPCore Processor Technical Reference Manual (TRM) for the complete list of performance events.

Number  Mnemonic             Name
------  -------------------  --------------------------------------
0x01    L1I_CACHE_REFILL     Level 1 instruction cache refill
0x02    L1I_TLB_REFILL       Level 1 instruction TLB refill
0x03    L1D_CACHE_REFILL     Level 1 data cache refill
0x04    L1D_CACHE            Level 1 data cache access
0x05    L1D_TLB_REFILL       Level 1 data TLB refill
0x08    INST_RETIRED         Instruction architecturally executed
0x11    CPU_CYCLES           Processor cycles
0x13    MEM_ACCESS           Data memory access
0x14    L1I_CACHE            Level 1 instruction cache access
0x15    L1D_CACHE_WB         Level 1 data cache Write-Back
0x16    L2D_CACHE            Level 2 data cache access
0x17    L2D_CACHE_REFILL     Level 2 data cache refill
0x18    L2D_CACHE_WB         Level 2 data cache Write-Back
0x19    BUS_ACCESS           Bus access
0x40    L1D_CACHE_LD         Level 1 data cache access - Read
0x41    L1D_CACHE_ST         Level 1 data cache access - Write
0x42    L1D_CACHE_REFILL_LD  L1D cache refill - Read
0x43    L1D_CACHE_REFILL_ST  L1D cache refill - Write
0x46    L1D_CACHE_WB_VICTIM  L1D cache Write-back - Victim
0x47    L1D_CACHE_WB_CLEAN   L1D cache Write-back - Cleaning
0x48    L1D_CACHE_INVAL      L1D cache invalidate
0x4C    L1D_TLB_REFILL_LD    L1D TLB refill - Read
0x4D    L1D_TLB_REFILL_ST    L1D TLB refill - Write
0x50    L2D_CACHE_LD         Level 2 data cache access - Read
0x51    L2D_CACHE_ST         Level 2 data cache access - Write
0x52    L2D_CACHE_REFILL_LD  L2 data cache refill - Read
0x53    L2D_CACHE_REFILL_ST  L2 data cache refill - Write
0x56    L2D_CACHE_WB_VICTIM  L2 data cache Write-back - Victim
0x57    L2D_CACHE_WB_CLEAN   L2 data cache Write-back - Cleaning
0x58    L2D_CACHE_INVAL      L2 data cache invalidate
0x66    MEM_ACCESS_LD        Data memory access - Read
0x67    MEM_ACCESS_ST        Data memory access - Write
0x68    UNALIGNED_LD_SPEC    Unaligned access - Read
0x69    UNALIGNED_ST_SPEC    Unaligned access - Write
0x6A    UNALIGNED_LDST_SPEC  Unaligned access
0x70    LD_SPEC              Speculatively executed - Load
0x71    ST_SPEC              Speculatively executed - Store
0x72    LDST_SPEC            Speculatively executed - Load or store

Copyright © 2021 Paul J. Drongowski

ARM Cortex-A72 tuning: IPC

If you read my posts about ARM Cortex-A72 micro-architecture, you’re probably wondering, “How do I reduce all of this to practice on Raspberry Pi 4?”

Program performance tuning is experimental and measurement-based. Our goal is to reduce program execution time by efficiently exploiting the underlying machine micro-architecture. Tuning follows a systematic, multi-step process:

  1. Initial design and code.
  2. Run and measure execution time and performance events.
  3. Analyze measurements.
  4. Make a hypothesis about performance bottlenecks.
  5. Change the code.
  6. Go to step 2 until you’re satisfied.

“Satisfied” is a bit subjective, but generally means “produces a result within a defined time constraint”, “achieves the desired frame rate,” or some other time-related design requirement.

You will need performance measurement tools and techniques. This is where hardware performance events come into play. Raspberry Pi OS is Linux, and fortunately, Linux has a mature performance measurement infrastructure. Performance Events for Linux, often called “PERF,” is the best way to get started. I’ve written extensively about PERF, including my three part PERF tutorial.

The PERF tutorial illustrates performance measurement and tuning on Raspberry Pi models 1, 2 and 3. The mechanics of running PERF are the same on Raspberry Pi 4. Please see my other articles about Cortex-A72 performance tuning.

Lately, I have been experimenting with program performance self-monitoring using the Linux perf_event_open() system call. Stay tuned for more details and code. For the moment, I’m going to focus on ARM Cortex-A72 performance events — good enough to help you apply techniques and commands in the PERF tutorial.

Cortex-A72 performance events

A performance event is the occurrence of a micro-architectural condition. The simplest example events are retired instructions and processor (CPU) cycles. A retired instruction event occurs every time an instruction successfully completes (architectural) execution. A processor cycle event occurs every processor clock tick.

Each Cortex-A72 core has six performance counter registers. Using a tool like PERF, a performance event is assigned to each register. Yes, you can measure up to six performance events simultaneously, i.e., in a single experimental execution run. [More events can be measured via counter multiplexing, but I’m keeping things simple here.] The trick is to choose and configure the performance events that help you test your performance tuning hypothesis.

The ARM Cortex-A72 performance events are listed in the ARM Cortex-A72 Technical Reference Manual (TRM) available at the ARM corporate web site. The list is rather long and not all of the events are particularly relevant for application programmers. Thus, I won’t list them all here. There are several major event categories:

  • Instructions and cycles
  • Level 1 instruction (L1I) cache events
  • Level 1 data (L1D) cache events
  • Level 2 (L2) cache events
  • Level 1 instruction TLB (L1 ITLB) events
  • Level 1 data TLB (L1 DTLB) events
  • Branches and mispredicted branches
  • Bus and primary memory access
  • Memory barriers (speculative)
  • Instruction mix (speculative)
  • Exceptions taken
  • System register access

These are my own categories and should give you a rough impression about the kinds of micro-architectural events you can measure on Raspberry Pi 4. I listed the categories from highest to lowest priority placing the most relevant and generally useful event categories near the top of the list.

Each Cortex-A72 performance event type is assigned an event number. The event number identifies the event to measurement tools and to the event counting hardware.
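
With PERF, a raw event number is specified using the rXX syntax, where XX is the hexadecimal event number. For example, to count retired instructions (event 0x08) and processor cycles (event 0x11) for a program of your choice:

    perf stat -e r08 -e r11 <cmd>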

Time and instruction events

Processor/CPU cycles and retired instructions are the true all-rounders.

Processor cycles are a good proxy for actual execution time. Sure, you can measure wall-clock or CPU execution time using the Linux time command or system calls like gettimeofday(), time(), clock() or clock_gettime(). The CPU cycle event lets us measure time using a performance counter register. The cycle count tells us approximately how much CPU time was consumed by the program under test.

As mentioned earlier, the retired instruction event counts the number of successfully (architecturally) completed instructions. Given a specific data set, every program has a specific amount of work to be accomplished. Particularly in the case of a single-threaded program, the program executes the same instructions for the same given data set, every time. Thus, the retired instruction count is a measure of work accomplished and should be (roughly) the same every experimental run, assuming the same data set and no outside interference. (You shouldn’t run other applications while testing. Control the test environment!)

Number  Mnemonic      Name
------  ------------  ------------------------------------
0x08    INST_RETIRED  Instruction architecturally executed
0x11    CPU_CYCLES    Cycle
0x1B    INST_SPEC     Operation speculatively executed

Instructions per cycle (IPC)

Two basic performance tuning goals are:

  • Reduce the number of processor cycles, and
  • Reduce the number of retired instructions.

Reducing the number of processor cycles should reduce the overall execution time. That assumes, of course, that overall execution time is not dominated by input/output, page faults, human wait (interaction) time, or some other major factor!

Reducing the number of retired instructions should reduce the amount of work performed by the program. Optimizing compilers work hard to reduce the number of instructions in tight inner loops. In terms of conventional wisdom, the fastest instruction is an instruction which is never executed in the first place.

Practically, however, one program can execute more instructions and achieve a shorter execution time than another program (assuming the same data set and functionality, of course). How can this be? It comes down to a few fundamental factors:

  • Read and use data from fast cache memory.
  • Reduce reads (writes) from (to) slow primary memory.
  • Execute computations concurrently.
  • Overlap execution with read and write operations.

Short answer, it comes down to exploiting instruction-level parallelism (ILP), temporal data locality and spatial data locality.

Instructions per cycle (IPC) is one simple measure that tells us how we are doing overall. IPC is easy to measure and compute: Count the number of retired instructions, count the number of processor cycles, and divide:

    INST_RETIRED / CPU_CYCLES

The IPC ratio indicates the amount of useful work done during each processor cycle and we want to maximize it.

Goal IPC is very much application dependent. For a given critical inner loop, one might ask, “How many concurrent operations (computations, reads, writes) can Cortex-A72 perform assuming the data are cache-resident?” If IPC is significantly less than one, it’s probably time to tune. In scientific code with a mix of integer and floating point operations, an IPC of 2 is a good starting goal.
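
Here is a trivial sketch of the computation, suitable for counts read back via the helper functions described elsewhere in this series:

    #include <stdint.h>

    /* IPC = INST_RETIRED / CPU_CYCLES (guard against division by zero). */
    double compute_ipc(uint64_t inst_retired, uint64_t cpu_cycles)
    {
        return (cpu_cycles != 0) ? ((double) inst_retired / (double) cpu_cycles) : 0.0 ;
    }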

Speculative execution

ARM Cortex-A72 is a superscalar processor which predicts branch direction and executes instructions speculatively along predicted program paths. The A72 is capable of counting many speculative event types. Speculative event types are explicitly identified in the ARM Cortex-A72 Technical Reference Manual; look for “_SPEC” in the event mnemonic.

The speculated instruction event counts the number of ARMv8-A instructions issued speculatively during program execution. Ideally, branch predictions are always correct and every speculatively issued instruction eventually retires. Speculatively issued instructions on a wrong path consume execution resources just like correct path instructions that retire. Unfortunately, wrong path results are discarded, thereby wasting any resources which they consumed. Fewer wrong-path instructions produce less waste, that is, fewer instructions start execution, consume resources, and are discarded.

The ratio of retired instructions to speculated instructions:

    INST_RETIRED / INST_SPEC

indicates how often speculated instructions resolved into retirement. Best case, this ratio is one — all speculated instructions eventually retired (i.e., few execution resources are wasted).

Example: Matrix multiplication

Matrix multiplication is the classic example of performance tuning for micro-architecture. Mathematics specifies the end result — the matrix product. Algorithmically, however, there are two ways to compute the matrix product:

  1. Textbook algorithm and code: Straightforward implementation of the mathematics.
  2. Loop nest interchange algorithm and code: Cache-friendly implementation which exploits temporal and spatial locality.

For more detail about the algorithms and code, please see Textbook matrix multiplication (part 1) and Faster matrix multiplication (part 2).

I ran both the textbook and loop nest interchange programs on Raspberry Pi 4. The textbook code took 28.6 seconds and, as expected, the interchange code took less time, 19.6 seconds. Here are the raw event counts:

Event                   Textbook        Interchange
----------------------  --------------  --------------
Retired instructions    38,227,831,497  60,210,830,503
CPU cycles              42,041,568,760  29,332,934,027
Instructions per cycle           0.909           2.053

The textbook program executed less than one instruction per processor cycle, 0.909 IPC. The textbook code underperforms with respect to the availability of Cortex-A72 execution units (two integer, two FP units) and the opportunity to overlap computation with memory access. The interchange program achieves a respectable 2.053 instructions per cycle. The interchange version consumes far fewer processor cycles than the textbook version.

Just for grins, multiply the CPU cycle counts by the Raspberry Pi 4 clock period (the inverse of the 1.5GHz clock frequency). You get approximately the measured clock() CPU times: 28.028 seconds versus 28.6 actual and 19.555 seconds vs. 19.6 seconds actual.

Here are the raw event counts for retired and speculated instructions:

Event                    Textbook        Interchange
-----------------------  --------------  --------------
Retired instructions     38,227,831,497  60,210,830,503
Speculated instructions  46,576,925,991  60,254,256,720
Retired / speculated              0.821           0.999

The interchange version has a near ideal retired to speculated instruction ratio (0.999). The textbook version underperforms, with nearly 8.3 billion speculated instructions started and abandoned.

The programs are written in C and compiled with the -O0 optimization level. Try -O3. The results may further surprise you. 🙂

Copyright © 2021 Paul J. Drongowski

Performance events on Raspberry Pi 4: Tips

Performance measurement and tuning experiments with Raspberry Pi 4 are well-underway. Here are a few quick observations and tips.

Linux provides two entry points into performance measurement: Performance Events for Linux (PERF) and the kernel performance counter interface (perf_event_open()). PERF is an easy-to-use tool suite and is the best place to start explorations. If you want to measure an application without modifying its code, this is for you.

PERF is built on the kernel performance counter interface. The interface consists of the perf_event_open() system call and its associated ioctl() functions. The kernel interface is suitable for self-monitoring, that is, adding calls to an application in order to measure its internal operation. Performance counters provide two modes of operation: counting and sampling. Counting mode is most appropriate for self-monitoring. I’m currently writing code that makes self-monitoring a bit easier and hope to post the code when it’s ready.

In the meantime…

Installation

PERF and perf_event_open support are not usually installed with your typical Linux distribution. Originally, PERF was available solely as part of the Linux tools package. Well, it seems like somewhere along the way, Ubuntu and Debian diverged. Ubuntu installs PERF with Linux tools:

    sudo apt-get install linux-tools-common
    sudo apt-get install linux-tools-$(uname -r)

As PERF depends heavily upon kernel facilities and interfaces, you should install the version of PERF that matches the installed kernel.

Raspberry Pi OS (once known as Raspbian) is a Debian distro. Shucks, wouldn’t you know it, Debian installs PERF differently:

    sudo apt install linux-perf

There are different packages for buster and stretch (the current versions of Raspberry Pi OS and Debian at the time of this writing).

    https://packages.debian.org/buster/linux-perf
    https://packages.debian.org/stretch/linux-perf

Installing on buster produces output like:

    XXX@raspberrypi:~ $ sudo apt install linux-perf
    password for XXX:
    Reading package lists… Done
    Building dependency tree
    Reading state information… Done
    The following additional packages will be installed:
      linux-perf-4.9
    Suggested packages:
      linux-doc-4.9
    The following NEW packages will be installed:
      linux-perf linux-perf-4.9
    0 upgraded, 2 newly installed, 0 to remove and 107 not upgraded.
    Need to get 1,275 kB of archives.
    After this operation, 2,735 kB of additional space will be used.
    Do you want to continue? [Y/n]

Versioning gotcha

And, of course, it’s never that simple. My version of Raspberry Pi OS (buster) is expecting PERF version 5.4. When you enter “sudo perf list” or any other PERF command on the command line, the shell runs the script /usr/bin/perf. The script checks the version of PERF against the kernel and complains when versions don’t match. The Debian install pulled version 4.9, not 5.4.

Rather than sort out versioning, I’ve been entering “perf_4.9” instead of “perf“. This work-around bypasses the perf script which checks versions. Since PERF is now fairly mature, it all seems to work. At some point, I’ll sort out the versioning situation and install 5.4. In the meantime, full steam ahead!

Getting started

Here are a few PERF commands to get you started:

    perf stat --help
    perf list sw
    perf stat <cmd>
    perf top -a
    perf top -e cpu-clock
    perf record <cmd>
    perf report

The stat approach uses counting mode to measure software and hardware events triggered by an application program (“<cmd>”). The top approach displays event counts dynamically in real-time like the ever-popular “top” utility program. The record and report approach uses sampling to produce performance reports and profiles.

For additional usage information, check out the Linux performance analysis tutorial. There are several other fine tutorials and helpful sites on the Web. Many of the tutorials show use on x86 (Intel and AMD) systems, not Raspberry Pi and ARM. For that, I recommend my own three part tutorial:

  • Part 1 demonstrates how to use PERF to identify and analyze the hottest execution spots in a program. Part 1 covers the basic PERF commands, options and software performance events.
  • Part 2 introduces hardware performance events and demonstrates how to measure hardware events across an entire application.
  • Part 3 uses hardware performance event sampling to identify and analyze hot spots within an application program.

In addition to usage, I offer information and guidance concerning ARM micro-architecture. This information is especially helpful when you get into hardware performance events. Check out my summaries of the ARM11 and ARM Cortex-A72 micro-architectures. ARM11 covers Raspberry Pi models 1, 2, and 3 (BCM2835 and BCM2836), while the Cortex-A72 summary covers the Raspberry Pi 4 (BCM2711).

Several other helpful on-line resources are available as well.

Paranoia!

Performance measurement is fraught with security issues and holes. The kernel developers implemented a control flag file, /proc/sys/kernel/perf_event_paranoid, which sets the level of access and vulnerability when taking measurements. Quoting the Linux man page:

    The perf_event_paranoid file can be set to restrict access
    to the performance counters.

        2   allow only user-space measurements (default since
            Linux 4.6).
        1   allow both kernel and user measurements (default
            before Linux 4.6).
        0   allow access to CPU-specific data but not raw
            tracepoint samples.
       -1   no restrictions.

    The existence of the perf_event_paranoid file is the
    official method for determining if a kernel supports
    perf_event_open().
If you’re operating in a fairly closed, single-user environment, then set the content of the file to 0 or -1.
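
For example, the following command lifts all restrictions until the next reboot (the path and values come from the man page excerpt above):

    sudo sh -c 'echo -1 > /proc/sys/kernel/perf_event_paranoid'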

Read the perf_event_open() man page

I recommend reading the perf_event_open() man page. If you’re just starting your journey into performance measurement, you will be overwhelmed by the detail at first. However, just let the information wash over you and know that it’s there. The tutorials don’t always mention the perf_event_paranoid flag and other low-level details. Reading the man page should help you across future stumbling blocks and will enhance your understanding of events, counting and sampling.

Want to learn more about Raspberry Pi 4 (Cortex-A72 / Broadcom BCM2711) performance tuning? Please read the other articles in this series.

Copyright © 2020 Paul J. Drongowski

PERF tutorial part 3 is now on-line

Just wrapped up Part 3 of the Linux-tools PERF tutorial.

The tutorial now consists of three parts. Part 1 covers the most basic PERF commands and shows how to find program hot-spots using software performance events. Part 2 discusses hardware performance events and performance counters, and demonstrates how to measure hardware performance events using PERF counting mode. Part 2 introduces several derived performance metrics like instructions per cycle (IPC) and applies these metrics to the sample application programs.

Part 3 is the newest addition to the tutorial series. It builds on parts 1 and 2, showing how to use hardware performance events and counter sampling to profile an application program. Part 3 discusses sampling period and frequency, the sampling process, overhead, statistical accuracy/confidence and other practical concerns.

I hope you find the PERF tutorial to be useful in your work! Although I produced the example data on the ARM-based Raspberry Pi, the commands and techniques will also work on x86.

PERF tutorial part 2 now available

Part 2 of a three part tutorial about Linux-tools PERF is now available.

Part 1 of the series shows how to find hot execution spots in an application program. It demonstrates the basic PERF commands using software performance events such as CPU clock ticks and page faults.

Part 2 of the series — just released — introduces hardware performance counters and events. I show how to count hardware events with PERF and how to compute and apply a few basic derived measurements (e.g., instructions per cycle, cache miss rate) for analysis. Part 3 is in development and will show how to use sampling to profile a program and to isolate performance issues in code.

All three parts of the series use the same simple, easy to understand example: matrix multiplication. One version of the matrix multiplication program illustrates the impact of severe performance issues and what to look for in PERF measurements. The issues are mitigated in the second, improved version of the program. PERF measurements for the improved program are presented for comparison.

The test platform is the latest second generation Raspberry Pi 2 running Raspbian Wheezy 3.18.9-v7+. The Raspberry Pi 2 has a 900MHz quad-core ARM Cortex-A7 (ARMv7) processor with 1GByte of primary memory. Although the tutorial series demonstrates PERF on Cortex-A7, the same PERF commands and analytical techniques can be employed on other architectures like x86.

A special note for Raspberry Pi users. The current stable distribution of Raspbian Wheezy — 3.18.7-v7+ February 2015 — does not support PERF hardware events. Full PERF support was enabled in a later, intermediate release and should be available in the next stable release of Raspbian Wheezy. In the meantime, Raspberry Pi 2 users may profile their programs using PERF software events as shown in Part 1 of the tutorial. First generation Raspberry Pi users are also restricted to software performance events.

Brave souls may try rpi-update to upgrade to the latest and possibly unstable release. I recommend waiting for the next stable release unless you really, really know what you are doing and are willing to chance an unstable kernel with potentially catastrophic consequences.

Memory hierarchy and performance events

My latest set of experiments with Raspberry Pi use the BCM2835 performance counters to exercise and observe the behavior of the RPi memory system. The experiments use a simple microbenchmark that traverses a linked list. This technique is often called pointer chasing because list traversal chases the pointers through the linked list from beginning to end. A pointer chasing loop is simple, is easy to instrument, and has consistently repeatable behavior.

Here’s the approach. We exercise a single specific level in the memory hierarchy (such as the level 1 data cache) simply by adjusting the size of the linked list. We just need to choose a size such that the physical size of the list is less than or equal to the capacity of a level — 16KB in the case of the level 1 (L1) data cache. Storage for the linked list is allocated as a contiguous block of memory (an array of bytes), so it’s easier to think of the linked list size in terms of its contiguous footprint in memory, the array size. Each linked list item is the same size as a cache line: 32 bytes.

The first experiment measures seven kinds of performance events: cycles, executed instructions, L1 data cache accesses, L1 data cache misses, data MicroTLB misses, Main TLB misses and explicit external data accesses.

Before we take a look at the results of the first experiment, here’s a few things to keep in mind.

  • The Broadcom BCM2835 128KB L2 cache is dedicated to the VideoCore GPU. Memory requests from the CPU are routed around the L2 and go directly to primary memory. L2 cache is not a factor in our analysis.
  • Coverage is the amount of primary memory accessible through a translation lookaside buffer (TLB) without incurring a TLB miss.
  • The MicroTLB has ten fully associative (page) entries. Its coverage is (10 entries * 4,096 bytes) = 40KB.
  • The Main TLB handles MicroTLB misses and has 8 fully associative entries plus 64 2-way associative entries. Its coverage is (72 entries * 4,096 bytes) = 288KB.
  • A hardware page table walker handles Main TLB misses. A page table walk requires at least one additional read in primary memory in order to look up page mapping information.

These memory system characteristics affect the results and should be directly observable.

The following table summarizes basic measurements for ten test runs. The first column is the size of the dynamically allocated block of memory that holds the linked list elements. The second column is the size of the block expressed as 32-byte cache lines. There is one list item per cache line, so this is also the number of linked list items. The third column is the number of list traversals (one traversal per iteration). I adjusted the number of iterations for each run in order to perform exactly the same number of pointer chasing operations per test case. This allows us to make meaningful run-to-run comparisons.

 Size  Cache lines  Iterations   Time    CPI
-----  -----------  ----------  -----  -----
  4MB         128K        1024  25.76  20.51
  2MB          64K        2048  25.74  20.52
  1MB          32K        4096  25.98  20.70
512KB          16K        8192  25.56  20.41
256KB           8K       16384  24.62  19.71
128KB           4K       32768  19.99  16.24
 64KB           2K       65536  11.75   9.61
 32KB           1K      131072  10.49   8.77
 16KB          512      262144   2.66   2.30
  8KB          256      524288   2.39   2.04

The fourth column is the total elapsed time per run. There are roughly three distinct timing tiers:

  • Tier A: 4MB, 2MB, 1MB, 512KB, 256KB, 128KB
  • Tier B: 32KB, 64KB
  • Tier C: 8KB, 16KB

The 128KB case is somewhat “transitional,” but I placed it in Tier A. The cycles per instruction (CPI column) ratios are consistent with respect to the elapsed times.

The following table summarizes the performance event counts for the same test cases (block/array size and iterations).

                           MicroTLB  Main TLB  External
 Size DC access  DC miss      miss      miss    access
----- ---------  -------  --------  --------  --------
  4MB  24696257  5107400   1049399    258783   7433309 C
  2MB  24927252  5119529   1051719    259531   7509503 C
  1MB  24878797  5145341   1065404    273960   7521771 C
512KB  24668739  5086410   1050024    259104   7461179 C
256KB  22999900  4803510   1057814    308477   7263176 C
128KB  20051408  4258682    965191    181398   6136462 C
 64KB  11972871  2683992    620354     91207   3785568 B
 32KB  10096551  2288319    548311     71756   3231861 B
 16KB   2621836   594034    136900      8265    804582 A
  8KB   3480732   446481     72712      2757    480707 A

Keeping the basic measurements in mind, we can see that the performance event counts for Tier C are consistent with L1 data cache access. We are measuring on a single core computer, and the data cache and TLB misses are mostly due to other system activity (like OS timer interrupts). Tier B cases hit in either the MicroTLB or the Main TLB: the Tier B array sizes fall within MicroTLB and/or Main TLB coverage, so these runs are mainly references to primary memory without TLB misses. Tier A cases miss in the TLBs, and each Tier A reference requires an additional primary memory read by the page walker. Tier A has the longest elapsed execution time.

Next up, we’ll measure the memory access (latency) time to primary memory.

Perils of performance analysis on single core

A new article on the Raspberry Pi (Broadcom BCM2835) memory hierarchy is almost ready. The first code has already been posted.

I’ve been working on multi-core processors for so long that I forgot what it’s like to take measurements on a single core machine like the Raspberry Pi.

In the ideal world, a benchmark or performance test program has the machine to itself and no other program or system activity perturbs it. Measurements on the ideal machine accurately and exactly reflect the dynamic behavior and performance of the program. On multi-core, you can usually assign the test program to an idle core (or two), preferably a core that is free of operating system activity. With careful process or thread placement, results on multi-core approach the ideal.

On single core, we don’t have any luxury. The test program has to share the one core with other programs and the operating system. On Raspberry Pi, Linux fires up services that run periodically. Even if we shut the services off, the system clock continues to run and it generates interrupts. At the very least, extraneous activity affects elapsed, user and system time measurements.

When we measure performance events, however, there is a deeper level of interference. The core has one physical level 1 (L1) data cache, one physical MicroTLB, one physical Main TLB and one physical branch history table. These microarchitectural components are transparent to the architecture, but they must be shared between programs and the OS. A context switch may cause a cache or TLB flush which invalidates the entire contents of the cache/TLB. Cache, TLB or branch history may be partially polluted by other software activity. The final performance event counts are affected by flushes and pollution and do not accurately reflect the behavior of the test program.

I ran into this issue while characterizing the memory hierarchy with performance events. One test case is designed to exercise only the L1 data cache and never touch primary memory. Yet, the test case measured a rather significant number of data cache misses beyond the compulsory misses that I would have expected. The extra misses are most likely caused by timer interrupts. I now think of these extra misses as “background radiation” which bias measurement.

Such are the perils of performance measurement and analysis on single core!

Faster matrix multiplication (part 2 of 2)

Part 2 of two parts on matrix multiplication demonstrates a fast matrix multiplication program. The algorithm is a simple transformation of the textbook algorithm — the olde loop nest interchange. The transformation changes the slow access pattern to one of the arrays so that the program steps sequentially through the array elements in memory. Execution time speeds up from about 16 seconds elapsed time to 6 seconds. Not bad for a few minutes work!

All of the key memory-related performance events are improved since the access pattern is a better fit with the underlying memory microarchitecture. The analysis shows that we need to be careful when interpreting the Data Cache Access event because this event counts nonsequential memory accesses instead of all level 1 DC accesses or architectural loads and stores.

Part 2 also discusses operation or instruction counting to analyze program complexity at a micro-level. I like to look at the assembler code generated by the compiler to see if there are any potential speed-ups. The article shows how to look at the assembler code using the GCC -S option and using the objdump program. I use instruction counting to check the operation and meaning of performance events like the ARM11 Executed Instructions event.
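
If you want to try this yourself, the commands look something like the following (the file names mult.c and mult are illustrative). The first command writes the generated assembler code to mult.s; the second disassembles the compiled executable into a listing file:

    gcc -O2 -S mult.c
    objdump -d mult > mult.lst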

The Broadcom BCM2835 in the Raspberry Pi has an integer core and a Vector Floating Point (VFP) coprocessor. The VFP operates concurrently with the integer core. In fact, it operates quite independently and only synchronizes with the integer core at a few well-defined points. VFP instructions are allowed to complete out of order, which allows for greater speed, but makes FP exceptions somewhat imprecise. (Now exactly where did that underflow/overflow occur?) The VFP coprocessor has 32 registers of its own, which are organized as four 8-register banks. GCC uses the coprocessor for scalar floating point arithmetic, but doesn’t exploit any parallelism.

The VFP operates on short vectors in a register bank. Potentially, the VFP coprocessor could be exploited to further speed up matrix multiplication. One possibility is to stream incoming data as a four-wide stripe through an array and operate on four elements at once. Or, stream four elements at a time from a single row/column. Take a look at the VFP Math Library.

It’s not all good news, however. The VFP coprocessor is not a true single instruction, multiple data (SIMD) engine. (It’s similar to an old school short vector architecture.) It only has a single floating multiply/accumulate (FMAC) pipeline. A true SIMD would have four FMAC units. Also, computations are relatively difficult to set up and stage. Computations must be double buffered where the integer Load Store Unit (LSU) is filling one register bank while the coprocessor is performing computations in a different register bank. Further, GCC vectorization doesn’t appear to support VFP.

ARM must have gotten the message from its users. Later processors implement NEON SIMD and just enough VFP for the sake of legacy compatibility. The Beaglebone Black (ARM Cortex-A8) has NEON and I’m looking forward to trying it out. GCC vectorization supports NEON, too, and it’s a whole lot easier to let the compiler vectorize your program for you than to write vector code yourself!

The BCM2835 also has the VideoCore GPU for SIMD computation. There are a bunch of folks who are reverse engineering the GPU in order to use it for general purpose computation (GPGPU). Have at it, guys and gals!

Even if VFP is an orphan, the coprocessor has 32 registers where you can stash data. Maybe you can find a way to make use of these extra registers? Side-to-side access (integer/floating) is pretty fast.