RPi 4 tuning: The code

Posted on March 1, 2021 by pj

I hope you have enjoyed my series of articles about Raspberry Pi 4 performance events, measurement and tuning:

Today, I want to wrap up the series with C code.

Please don’t forget my Performance Events for Linux tutorial and learn to make your own Raspberry Pi 4 (Broadcom BCM2711) performance measurements. The commands in the PERF tutorial apply to x86, AMD64 and other architectures, too.

Before getting too far, here is the link to the ZIP file with the code. 🙂 The main source files are:

makefile: The make file (duh!)
pe_assist.h: Performance event helper header
pe_assist.c: Performance event helper functions
pe_cortex_a72.h: A72-specific helper header
pe_cortex_a72.c: A72-specific helper functions
pe_test.c: pe_assist check-out test
pe_matrix.c: pe_assist matrix multiply example
a72_test.c: A72-specific check-out test
a72_walk.c: A72-specific array walk kernel
a72_matrix.c: A72-specific matrix multiply
a72_misp.c: A72-specific branch mispredict kernel
a72_chase.c: A72-specific pointer chasing kernel

There are a few surprises, too, such as earlier versions of code, etc.

The programs self-monitor, that is, they call perf_event_open() to configure, control and read the performance counters. perf_event_open() has many parameters, so I wrote helper functions assisting counter configuration, control and access. There are two flavors: architecture independent and Cortex-A72 specific. The architecture independent functions are defined in pe_assist.* and the A72-specific functions are defined in pe_cortex_a72.*. The architecture independent functions should work on x86, etc., too.

Aside from the two check-out tests, the rest of the source modules are workloads. These are the programs that I used to collect data for the articles about Cortex-A72 performance measurement, analysis and tuning. Feel free to bash away at everything!

Helper functions

As I mentioned above, I separated the helper functions into architecture independent and Cortex-A72 specific modules. The architecture independent helper functions handle Linux performance counter set-up, control and read back:

peInitialize(): Initialize/reset the helper module:
peMakeGroup(): Make a counter group
peAddLeader(): Add leader event to the group
peAddEvent(): Add an event to the group
peStartCounting(): Start the counter group
peStopCounting(): Stop the counter group
peResetCounters(): Reset the counters
peReadCount(): Read an event count
pePrintCount(): Print and event count

The interface is “lite” and uncomplicated. It’s just enough to get the job done. Sometimes during early days, there is a temptation to build the Taj Mahal. I prefer to build something simple and get experience before building up and out. This simple interface proved to be good enough.

perf_event_open() supports simple event counting and sampling. If you’re familiar with perf stat, you’ve already seen simple event counting, AKA counting mode. perf stat measures events across the entire run of an application program. Self-monitoring is similar except you insert measurement code into the application program around the critical code that you wish to measure. perf stat doesn’t require code modification or recompile, but it doesn’t let you focus on particular critical loops or whatever. Self-monitoring is a little bit more effort, but it allows focus.

The usage model is straightforward:

Initialize the module data structures.
Create a performance event group.
Add a leader event to the group.
Add other events (up to 6 events for Cortex-A72) to the group.
Start the counter group.
Execute the workload or critical inner loops.
Stop the counter group.
Read and print the event counts.

Since this sequence is a recurring pattern, I also wrote a few functions which target common types of measurements such as:

peMeasureInstructionEvents()
pePrintInstructionEvents()
peMeasureDataAccessEvents()
pePrintDataAccessEvents()

These functions configure the pre-defined “symbolic” events which the Linux kernel has preselected for the platform architecture. Thus, you should be able to use the pe_assist.* module on any Linux box.

The Cortex-A72 module, pe_cortex_a72.*, use “raw” event identifiers for configuration. The available events are defined in pe_cortex_a72.h and they are specific to ARM Cortex-A72. I rely mainly on the Cortex-A72 events because then I know exactly which A72 events I am measuring. The Cortex-A72 module calls the low-level helper functions and it exports only targeted measurement functions:

a72MeasureInstructionEvents()
a72PrintInstructionEvents()
a72MeasureDataAccessEvents()
a72PrintDataAccessEvents()
a72MeasureTlbEvents()
a72PrintTlbEvents()

Take a peek inside one of the test programs and you’ll see how to call the helper modules.

Internal design

perf_event_open() is the Swiss Army knife of performance counter configuration and control. On Linux, all counter-related operations go through this single kernel call.

perf_event_open() allows control of individual counters, of course. However, it also provides a way to control a group of counters. One can save additional trips in and out of the kernel through counter groups. Instead of making six calls to start six counters, one only needs to make one perf_event_open() call to start an entire group of six events.

A Cortex-A72 group consists of one to six event counters. Each group has a distinguished member: the leader event. You can start, stop and reset the entire group by referring to the leader. Because the group members usually share characteristics like the process (ID) to be measured, the CPU set, flags, etc., it makes sense to define all of these common properties for an entire group. This approach reduces the number of parameters to be passed around during configuration.

In keeping with the “lite” philosophy, the helper module keeps the common flags and such in a few variables and arrays. The group and leader definition functions establish group-wide values for the member events in the group. That’s all there is to it, so hack away! The “lite” approach was good enough 99% of the time, so you might not need to dip into the helper modules at all.

ARM Cortex-A72 tuning: Memory access

Posted on February 22, 2021 by pj

Today’s post characterizes read access time to three different levels of the Raspberry Pi 4 (Broadcom BCM2711) memory hierarchy. The ARM Cortex-A72 processor has a two level cache structure: Level 1 data (L1D) cache and unified Level 2 cache. There is one L1D cache per core and all four cores share the L2 cache. Primary memory is the third and final level beyond L2 cache.

The test program is a simple kernel (inner loop) that runs through a linked list, i.e., pointer chasing. Each linked list element is exactly one A72 cache line in size, 64 bytes. I have used pointer chasing on other non-ARM architectures (Alpha and AMD64 come to mind) and it’s a pretty simple and effective way to characterize memory access speed.

The trick is to adjust the number of linked elements so that the entire linked list fits entirely within the memory to be characterized. In order to facilitate run-to-run comparisons, there is an outer loop which repeatedly invokes list chasing, i.e., the entire list is walked multiple times per run.

There are two main run parameters:

The number of linked list elements (which determines the array size), and
The number of iterations (which is the number of times the full list is walked).

When the array is doubled, the number of iterations is cut in half. This keeps the number of individual pointer chase operations (approximately) constant across runs.

The following table summarizes the test run parameters and the memory level to be exercised by the each run:

    #Elements  Iterations  Array Size  Mem Level 
    ---------  ----------  ----------  --------- 
           32    8388608       2KB     L1D cache 
           64    4194304       4KB     L1D cache 
          128    2097152       8KB     L1D cache 
          256    1048576      16KB     L1D cache 
          512     524288      32KB     L1D cache 
         1024     262144      64KB     L2 cache 
         2048     131072     128KB     L2 cache 
         4096      65536     256KB     L2 cache 
         8192      32768     512KB     L2 cache 
        16384      16384       1MB     L2 cache 
        32768       8192       2MB     RAM 
        65536       4096       4MB     RAM

Here is the C code for the test kernel:

  initialize(number_of_elements) ;
  a72MeasureDataAccessEvents() ; 

  start_clock() ; 
  peStartCounting() ; 
  for ( ; iterations > 0 ; iterations--) { 
    for (CacheLine *p = listHead ; p != NULL ; p = p->nextLine) ; 
  } 
  peStopCounting() ;

  print_clock_time(stdout, get_clock_time()) ; 
  a72PrintDataAccessEvents(stdout) ;

Both Linux clock() time and Cortex-A72 performance counter events are measured.

In my first experiments, the linked list elements were laid down in a linear sequential fashion and in a simple ping-pong scheme. I quickly discovered that Cortex-A72’s aggressive data prefetch is too good and naive layout did not produce the expected number of L1D or L2 cache misses. A72 speculatively reads the next cache line beyond a miss. By the time execution would reach the list element beyond the current one (or the very next element), the needed destination element would be available in cache or in flight.

Ideally, we want to fool the memory prefetcher and hit only the intended memory level, taking the full read access penalty each time we chase a pointer. I rewrote array/list initialization to lay down the list elements at (pseudo-)random positions in the array. The Fisher-Yates (Knuth) shuffle algorithm got the job done. Once list element layout was randomized, the pointer chasing test began producing the expected number of reads and misses.

The following table summarizes each run by the number of retired instructions, CPU cycles, instructions per cycle (IPC) and execution time:

    Array  Mem  Retired Ins    CPU Cycles     IPC    Time 
    -----  ---  -----------  --------------  -----  ------ 
      2KB  L1D  847,249,436     855,660,776  0.990   0.609 
      4KB  L1D  826,277,916   1,154,215,728  0.716   0.814 
      8KB  L1D  815,792,156   1,114,379,370  0.732   0.806 
     16KB  L1D  810,549,276   1,093,757,212  0.741   0.763 
     32KB  L1D  807,927,836   1,382,324,229  0.584   0.975 
     64KB  L2   806,617,116   5,074,763,198  0.159   3.446 
    128KB  L2   805,961,756   5,643,312,493  0.143   3.805 
    256KB  L2   805,634,076   6,621,262,142  0.122   4.452 
    512KB  L2   805,470,236   7,163,843,161  0.112   4.813 
      1MB  L2   805,388,316  27,563,140,814  0.029  18.421 
      2MB  RAM  805,347,356  49,317,924,775  0.016  32.969 
      4MB  RAM  805,326,876  54,865,753,267  0.015  36.645

No surprise, access to L1D cache is best, L2 is second best and primary memory is worst. Access to L2 cache is about five times as long as L1D cache, in terms of CPU cycles. Access to primary memory is nearly 50 times longer than L1D cache. The effect on IPC is very significant.

Taking a look at the L1D performance event counts:

    Array  Mem   IPC    Time    L1D Reads    L1D Misses  Ratio 
    -----  ---  -----  ------  -----------  -----------  ----- 
      2KB  L1D  0.990   0.609  268,435,785          484  0.000 
      4KB  L1D  0.716   0.814  268,435,630        1,316  0.000 
      8KB  L1D  0.732   0.806  268,435,639        1,149  0.000 
     16KB  L1D  0.741   0.763  268,435,622        4,319 <0.001 
     32KB  L1D  0.584   0.975  268,435,828   17,343,069  0.065 
     64KB  L2   0.159   3.446  268,435,603  234,906,566  0.875 
    128KB  L2   0.143   3.805  268,435,592  268,435,529  1.000 
    256KB  L2   0.122   4.452  268,435,625  268,435,588  1.000 
    512KB  L2   0.112   4.813  268,435,599  268,435,530  1.000 
      1MB  L2   0.029  18.421  268,435,594  268,435,782  1.000 
      2MB  RAM  0.016  32.969  268,435,579  268,435,960  1.000 
      4MB  RAM  0.015  36.645  268,435,635  268,435,941  1.000

we see that pointer chasing correctly and independently exercises L1D cache according to design. The L1D cache capacity is 32KB. The particular 32KB run shown here has the shortest execution time of the 32KB runs and thus, is cherry-picked. As I’ve seen on other architectures, measurements get a bit “weird” near cache capacity. When a cache gets nearly full, “weird stuff” starts to happen and run statistics become inconsistent. The shortest run best shows the break between L1D and L2 access.

Finally, here are the L2 cache performance event counts.

    Array  Mem   IPC    Time     L2 Reads    L2 Misses   Ratio 
    -----  ---  -----  ------  -----------  -----------  ----- 
      2KB  L1D  0.990   0.609        1,085           68  0.063 
      4KB  L1D  0.716   0.814    8,490,994          228  0.000 
      8KB  L1D  0.732   0.806    4,300,759          151  0.000 
     16KB  L1D  0.741   0.763    2,102,562          163  0.000 
     32KB  L1D  0.584   0.975   18,495,230        1,003 <0.001 
     64KB  L2   0.159   3.446  235,483,730        1,517 <0.001 
    128KB  L2   0.143   3.805  270,831,005        2,745 <0.001 
    256KB  L2   0.122   4.452  269,203,020       31,340 <0.001 
    512KB  L2   0.112   4.813  270,893,954      443,477  0.002 
      1MB  L2   0.029  18.421  302,452,386  107,397,408  0.355 
      2MB  RAM  0.016  32.969  286,244,127  227,010,870  0.793 
      4MB  RAM  0.015  36.645  277,293,265  252,881,540  0.912

As expected, we see a dramatic breakpoint at 1MB, which is the capacity of the unified L2 cache.

Bottom line, these performance measurements reinforce the importance of cache-friendly algorithms and data access patterns. Start with the best algorithms for your application, measure cache events and then tune for minimum misses. Data access should hit most frequently in the Level 1 data cache, then L2 cache. Primary memory is fifty times (!) more expensive than L1D cache and reads out to primary memory should be as infrequent as possible. Your mantra should be, “Bring it into cache, compute the heck out of the in-cache data, then write the final results back to memory, and move on.”

Please check out other articles in this series:

Don’t forget my Performance Events for Linux tutorial and learn to make your own Raspberry Pi 4 (Broadcom BCM2711) performance measurements.

Next time, I will wrap up this long series of articles with C code so you can perform your own experiments.

ARM Cortex-A72 tuning: Branch mispredictions

Posted on February 10, 2021 by pj

Back on the old day job, I developed and tested software and hardware for program profiling. Testing may sound like drudge-work, but there are ways to make things fun!

Two questions arise while testing a profiling infrastructure — software plus hardware:

Does the hardware accurately count (or sample) performance events for a given specific workload?
Does the software accurately display the counts or samples?

Clearly, ya need working hardware before you can build working software.

Testing requires a solid, known-good (KG) baseline in order to decide if new test results are correct. Here’s one way to get a KG baseline — a combination of static analysis and measurement:

Static analysis: Analyze the post-compilation machine code and predict the expected number of instruction retires, cache reads, misses, etc.
Measurement: Run the code and count performance events.
Validation: Compare the measured results against the predicted results.

Thereafter, one can compare new measurements taken from the system under test (SUT) and compare against both predicted results and baseline measured results.

Applying this method to performance counter counting mode is straightforward. You might get a little “hair” in the counts due to run-to-run variability, however, the results should be well-within a small measurement error. Performance counter sampling mode is more difficult to assess and one must be sure to collect a statistically significant number of samples within critical workload code in order to have confidence in a result.

One way to make testing fun is to make it a game. I wrotekernel programs that exercised specific hardware events and analyzed the inner test loops. You could call these programs “test kernels.” The kernels are pathologically bad (or good!) code which triggers a large number of specific performance events. It’s kind of a game to write such bad code…

The expected number of performance events is predicted through machine code level complexity analysis known as program “microanalysis.” For example, the inner loops of matrix multiplication are examined and, knowing the matrix sizes, the number of retired instructions, cache reads, branches, etc. are computed in closed form, e.g.,

    (38 inner loop instructions) * (1,000,000,000 iterations) + 
    (26 middle loop instructions) * (1,000,000 iterations) + 
    (9 outer loop instructions) * (1,000 iterations) 
    ----------------------------------------------------------- 
    38,026,009,000 retired instructions expected 
    38,227,831,497 retired instructions measured

This formula is the closed form expression for the retired instruction count within the textbook matrix multiplication kernel. The microanalysis approach worked successfully on Alpha, Itanium, x86, x64 and (now) ARM. [That’s a short list of machines that I’ve worked on. 🙂 ]

With that background in mind, let’s write a program kernel to deliberately cause branch mispredictions and measure branch mispredict events.

The ARM Cortex-A72 core predicts conditional branch direction in order to aggressively prefetch and dispatch instructions along an anticipated program path before the actual branch direction is known. A branch mispredict event occurs when the core detects a mistaken prediction. Micro-ops on the wrong path must be discarded and the front-end must be steered down the correct program path. The Cortex-A72 mispredict penalty is 15 cycles.

What we need is a program condition that consistently fools the Cortex-A72 branch prediction hardware. Branch predictors try to remember a program’s tendency to take or not take a branch and the predictors are fairly sensitive; even a 49%/51% split between taken and not taken has a beneficial effect on performance. So, we need a program condition which has 50%/50% split with a random pattern of taken and not taken direction.

Here’s the overall approach. We fill a large array with a random pattern of ‘0’ and ‘1’ characters. Then, we walk through the array and count the number of ‘1’ characters. The function initialize_test_array() fills the array with a (pseudo-)random pattern of ones and zeroes:

void initialize_test_array(int size, char* array, 
               int always_one, int always_zero) 
{ 
    register char* r = array ; 
    int s ; 
    for (s = size ; s > 0 ; s--) { 
        if (always_one) { 
            *r++ = '1' ; 
        } else if (always_zero) { 
            *r++ = '0' ; 
        } else { 
            *r++ = ((rand() & 0x1) ? '1' : '0') ;
        } 
    } 
}

The function has options to fill the array with all ones or all zeroes in case you want to see what happens when the inner conditional branch is well-predicted. BTW, I made the array 20,000,000 characters long. The size is not especially important other than the desire to have a modestly long run time.

The function below, test_loop(), contains the inner condition itself:

int test_loop(int size, char* array) 
{ 
    register int count = 0 ; 
    register char* r = array ; 
    int s ; 
    for (s = size ; s > 0 ; s--) { 
        if (*r++ == '1') count++ ;    // Should mispredict!
    }     return( count ) ; 
}

The C compiler translates the test for ‘1’ to a conditional branch instruction. Given an array with random ‘0’ and ‘1’ characters, we should be able to fool the hardware branch predictor. Please note that the compiler generates a conditional branch for the array/loop termination condition, s > 0. This conditional branch should be almost always predicted correctly.

The function run_the_test() runs the test loop:

void run_the_test(int iteration_count, int array_size, char* array) 
{ 
  register int rarray_size = array_size ; 
  register char* rarray = array ; 
  int i ; 
  for (i = iteration_count ; i-- ; ) {
      test_loop(array_size, array) ; 
  } 
}

It calls test_loop() many times as determined by iteration_count. Redundant iterations aren’t strictly necessary when taking measurements in counting mode. They are needed, however, in sampling mode in order to collect a statistically significant number of performance event samples. I set the iteration count to 200 — enough to get a reasonable run time when sampling.

The test driver code initializes the branch condition array, configures the ARM Cortex-A72 performance counters, starts the counters, runs the test loop, stops the counters and prints the performance event counts:

initialize_test_array(array_size, array, always_one, always_zero) ; 
a72MeasureInstructionEvents() ; 
peStartCounting() ; 
run_the_test(iteration_count, array_size, array) ; 
peStopCounting() ; 
a72PrintInstructionEvents(stdout) ;

The four counter configuration, control and display functions are part of a small utility module that I wrote. I will explain the utility module in a future post and will publish the code, too.

Finally, here are the measurements when scanning an array holding a random pattern of ‘0’ and ‘1’ characters:

    Instructions ret'd:      45,999,735,845 
    Instructions spec'd:     98,395,483,123 
    CPU cycles:              59,010,851,259 
    Branch speculated :       8,012,669,711 
    Branch mispredicted:      2,001,934,251 
    Branch predicted          8,012,669,710 
    Instructions per cycle:      0.780 
    Retired/spec'd ratio:        0.467 
    Branches per 1000 (PTI):   174.189 
    Branch mispredict ratio:     0.250

Please recall that there are two conditional branches in the inner test loop: a conditional branch to detect ‘1’ characters and a conditional branch to check the array/loop termination condition. The loop check should be predicted correctly almost all the time, accounting for 50% of the total number of correctly predicted branches. The character test, however, should be incorrectly predicted 50% of the time. It’s like guessing coin flips — you’ll be right half the time on average. Overall, 25% of branch predictions should be incorrect, and yes, the measured branch mispredict ratio is 0.250 or 25%.

The number of speculated instructions is also very interesting. Cortex-A72 speculated twice as many ARMv8 instructions as it retired. Over half of the speculated instructions did not complete architecturally and were discarded. That’s what happens when a conditional branch is grossly mispredicted!

I hope you enjoyed this simple experiment. It makes the Cortex-A72 fetch and branch prediction behavior come alive. As a follow-up experiment, I suggest trying all-ones or all-zeroes.

Please check out other articles in this series:

Don’t forget my Performance Events for Linux tutorial and learn to make your own Raspberry Pi 4 (Broadcom BCM2711) performance measurements.

ARM Cortex-A72 branch-related performance events:

 Number Mnemonic          Event name
 ------ ----------------  -----------------------------------------
 0x08   INST_RETIRED      Instruction architecturally executed
 0x10   BR_MIS_PRED       Mispredicted or not predicted branches
 0x11   CPU_CYCLES        Processor cycles
 0x12   BR_PRED           Predictable branch speculatively executed
 0x1B   INST_SPEC         Operation speculatively executed
 0x76   PC_WRITE_SPEC     Software change of the PC (speculative)
 0x78   BR_IMMED_SPEC     Immediate branch (speculative)
 0x79   BR_RETURN_SPEC    Procedure return (speculative)
 0x7A   BR_INDIRECT_SPEC  Indirect branch (speculative)

Disassembled code for test_loop():

00010678 :
    10678:  e92d0830   push {r4, r5, fp}
    1067c:  e28db008   add  fp, sp, #8
    10680:  e24dd014   sub  sp, sp, #20
    10684:  e50b0018   str  r0, [fp, #-24]  ; 0xffffffe8
    10688:  e50b101c   str  r1, [fp, #-28]  ; 0xffffffe4
    1068c:  e3a04000   mov  r4, #0
    10690:  e51b501c   ldr  r5, [fp, #-28]  ; 0xffffffe4
    10694:  e51b3018   ldr  r3, [fp, #-24]  ; 0xffffffe8
    10698:  e50b3010   str  r3, [fp, #-16]
    1069c:  ea000008   b    106c4 
    106a0:  e1a03005   mov  r3, r5
    106a4:  e2835001   add  r5, r3, #1
    106a8:  e5d33000   ldrb r3, [r3]
    106ac:  e3530031   cmp  r3, #49  ; 0x31
    106b0:  1a000000   bne  106b8    ; Should mispredict!
    106b4:  e2844001   add  r4, r4, #1
    106b8:  e51b3010   ldr  r3, [fp, #-16]
    106bc:  e2433001   sub  r3, r3, #1
    106c0:  e50b3010   str  r3, [fp, #-16]
    106c4:  e51b3010   ldr  r3, [fp, #-16]
    106c8:  e3530000   cmp  r3, #0
    106cc:  cafffff3   bgt  106a0    ; Correctly predicted
    106d0:  e1a03004   mov  r3, r4
    106d4:  e1a00003   mov  r0, r3
    106d8:  e24bd008   sub  sp, fp, #8
    106dc:  e8bd0830   pop  {r4, r5, fp}
    106e0:  e12fff1e   bx   lr

ARM Cortex-A72 tuning: IPC

Posted on January 27, 2021 by pj

If you read my posts about ARM Cortex-A72 micro-architecture:

you’re probably wondering, “How I do I reduce all of this to practice on Raspberry Pi 4?”

Program performance tuning is experimental and is measurement-based.
Our goal is to reduce program execution time by efficiently exploiting the underlying machine micro-architecture. Tuning follows a systematic, multi-step process:

Initial design and code.
Run and measure execution time and performance events.
Analyze measurements.
Make a hypothesis about performance bottlenecks.
Change the code.
Go to step 2 until you’re satisfied.

“Satisfied” is a bit subjective, but generally means “produces a result within a defined time constraint”, “achieves the desired frame rate,” or some other time-related design requirement.

You will need performance measurement tools and techniques. This is where hardware performance events come into play. Raspberry Pi OS is Linux, and fortunately, Linux has a mature performance measurement infrastructure. Performance Events for Linux, often called “PERF,” is the best way to get started. I’ve written extensively about PERF including my three part PERF tutorial:

The PERF tutorial illustrates performance measurement and tuning on Raspberry Pi models 1, 2 and 3. The mechanics of running PERF are the same on Raspberry Pi 4. Please see my other articles about Cortex-A72 performance tuning:

Cortex-A72 tuning: Data Access

Lately, I have been experimenting with program performance self-monitoring using the Linux perf_event_open() system call. Stay tuned for more details and code. For the moment, I’m going to focus on ARM Cortex-A72 performance events — good enough to help you apply techniques and commands in the PERF tutorial.

Cortex-A72 performance events

A performance event is the occurrence of a micro-architectural condition. The simplest example events are retired instructions and processor (CPU) cycles. A retired instruction event occurs every time an instruction successfully completes (architectural) execution. A processor cycle event occurs every processor clock tick.

Each Cortex-A72 core has six performance counter registers. Using a tool like PERF, a performance event is assigned to each register. Yes, you can measure up to six performance events simultaneously, i.e., in a single experimental execution run. [More events can be measured via counter multiplexing, but I’m keeping things simple here.] The trick is to choose and configure the performance events that help you test your performance tuning hypothesis.

The ARM Cortex-A72 performance events are listed in the ARM Cortex-A72 Technical Reference Manual (TRM) available at the ARM corporate web site. The list is rather long and not all of the events are particularly relevant for application programmers. Thus, I won’t list them all here. There are several major event categories:

Instructions and cycles
Level 1 instruction (L1I) cache events
Level 1 data (L1D) cache events
Level 2 (L2) cache events
Level 1 instruction TLB (L1 ITLB) events
Level 1 data TLB (L1 DTLB) events
Branches and mispredicted branches
Bus and primary memory access
Memory barriers (speculative)
Instruction mix (speculative)
Exceptions taken
System register access

These are my own categories and should give you a rough impression about the kinds of micro-architectural events you can measure on Raspberry Pi 4. I listed the categories from highest to lowest priority placing the most relevant and generally useful event categories near the top of the list.

Each Coretex-A72 performance event type is assigned an event number. The event number identifies the event to measurement tools and to the event counting hardware.

Time and instruction events

Processor/CPU cycles and retired instructions are the true all-rounders.

Processor cycles are a good proxy for actual execution time. Sure, you can measure wall-clock or CPU execution time using the Linux time command or system calls like gettimeofday(), time(), clock() or clock_gettime(). The CPU cycle event lets us measure time using a performance counter register. The cycle count tells us approximately how much CPU time was consumed by the program under test.

As mentioned earlier, the retired instruction event counts the number of successfully (architecturally) completed instructions. Given a specific data set, every program has a specific amount of work to be accomplished. Particularly in the case of a single-threaded program, the program executes the same instructions for the same given data set, every time. Thus, the retired instruction count is a measure of work accomplished and should be (roughly) the same every experimental run, assuming the same data set and no outside interference. (You shouldn’t run other applications while testing. Control the test environment!)

    Number  Mnemonic      Name 
    ------  ------------  ------------------------------------ 
    0x08    INST_RETIRED  Instruction architecturally executed 
    0x11    CPU_CYCLES    Cycle 
    0x1B    INST_SPEC     Operation speculatively executed

Instructions per cycle (IPC)

Two basic performance tuning goals are:

Reduce the number of processor cycles, and
Reduce the number of retired instructions.

Reducing the number of processor cycles should reduce the overall execution time. That assumes, of course, that overall execution time is not dominated by input/output, page faults, human wait (interaction) time, or some other major factor!

Reducing the number of retired instructions should reduce the amount of work performed by the program. Optimizing compilers work hard to reduce the number of instructions in tight inner loops. In terms of conventional wisdom, the fastest instruction is an instruction which is never executed in the first place.

Practically, however, one program can execute more instructions and achieve a shorter execution time than another program (assuming the same data set and functionality, of course). How can this be? It comes down to a few fundamental factors:

Read and use data from fast cache memory.
Reducing reads (writes) from (to) slow primary memory.
Execute computations concurrently.
Overlap execution with read and write operations.

Short answer, it comes down to exploiting instruction-level parallelism (ILP), temporal data locality and spatial data locality.

Instructions per cycle (IPC) is one simple measure that tells us how we are doing overall. IPC is easy to measure and compute: Count the number of retired instructions, count the number of processor cycles, and divide:

    INST_RETIRED / CPU_CYCLES

The IPC ratio indicates the amount of useful work done during each processor cycle and we want to maximize it.

Goal IPC is very much application dependent. For a given critical inner loop, one might ask, “How many concurrent operations (computations, reads, writes) can Cortex-A72 perform assuming the data are cache-resident?” If IPC is significantly less than one, it’s probably time to tune. In scientific code with a mix of integer and floating point operations, an IPC of 2 is a good starting goal.

Speculative execution

ARM Cortex-A72 is a superscalar processor which predicts branch direction and executes instructions speculatively along predicted program paths. The A72 is capable of counting many speculative event types. Speculative event types are explicitly identified in the ARM Cortex-A72 Technical Reference Manual; Look for “_SPEC” in the event mnemonic.

The speculated instruction event count is the number of ARMv8-A issued
speculatively during program execution. Ideally, branch predictions are always correct and every speculatively issued instruction eventually retires. Speculatively issued instructions on a wrong path consume execution resources just like correct path instructions that retire. Unfortunately, wrong path results are discarded, thereby wasting any resources which they consumed. Fewer wrong-path instructions produces less waste, that is, fewer wrong-path instructions start execution, consume resources, and are discarded.

The ratio of speculated instructions to retired instructions:

    INST_RETIRED / INST_SPEC

indicates how often speculated instructions resolved into retirement. Best case, this ratio is one — all speculated instructions eventually retired (i.e., few execution resources are wasted).

Example: Matrix multiplication

Matrix multiplication is the classic example of performance tuning for micro-architecture. Mathematics specifies the end result — the matrix product. Algorithmically, however, there are two ways to compute the matrix product:

Textbook algorithm and code: Straightforward implementation of the mathematics.
Loop nest interchange algorithm and code: Cache-friendly implementation which exploits temporal and spatial locality.

For more detail about the algorithms and code, please see Textbook matrix multiplication (part 1) and Faster matrix multiplication (part 2).

I ran both the textbook and loop nest interchange programs on Raspberry Pi 4. The textbook code took 28.6 seconds and, as expected, the interchange code took more time, 19.6 seconds. Here are the raw event counts:

    Event                    Textbook        Interchange 
    -----------------------  --------------  -------------- 
    Retired instructions     38,227,831,497  60,210,830,503 
    CPU cycles               42,041,568,760  29,332,934,027 
    Instructions per cycle   0.909           2.053

The textbook program executed less than one instruction per processor cycle, 0.909 IPC. The textbook code underperforms with respect to the availability of Cortex-A72 execution units (two integer, two FP units) and the opportunity to overlap computation with memory access. The interchange program achieves a respectable 2.053 instructions per cycle. The interchange version consumes far fewer processor cycles than the textbook version.

Just for grins, multiply the CPU cycle counts by the Raspberry Pi 4 clock period (the inverse of the 1.5GHz clock frequency). You get approximately the measured clock() CPU times: 28.028 seconds versus 28.6 actual and 19.555 seconds vs. 19.6 seconds actual.

Here are the raw event counts for retired and speculated instructions:

    Event                    Textbook        Interchange 
    -----------------------  --------------  -------------- 
    Retired instructions     38,227,831,497  60,210,830,503 
    Speculated instructions  46,576,925,991  60,254,256,720 
    Retired / speculated     0.821           0.999

The interchange version has a near ideal retired to speculated instruction ratio (0.999). The textbook slightly underperforms with nearly 8 million speculated instructions started and abandoned.

The programs are written in C and compiled with the -O0 optimization level. Try -O3. The results may further surprise you. 🙂

Performance events on Raspberry Pi 4: Tips

Posted on November 23, 2020 by pj

Performance measurement and tuning experiments with Raspberry Pi 4 are well-underway. Here are a few quick observations and tips.

Linux provides two entries into performance measurement: Performance Events for Linux (PERF) and the kernel performance counter interface (perf_event_open()). PERF is an easy-to-use tool suite and is the best place to start explorations. If you want to measure an application without modifying its code, this is for you.

PERF is built on the kernel performance counter interface. The interface consists of two calls: perf_event_open() and its associated ioctl() functions. The kernel interface is suitable for self-monitoring, that is, adding calls to an application in order to measure its internal operation. Performance counters provide two modes of operation: counting and sampling. Counting mode is most appropriate for self-monitoring. I’m currently writing code that makes self-monitoring a bit easier and hope to post the code when it’s ready.

In the meantime…

Installation

PERF and perf_event_open support are not usually installed with your typical Linux distribution. Originally, PERF was available solely as part of the Linux tools package. Well, it seems like somewhere along the way, Ubuntu and Debian diverged. Ubuntu installs PERF with Linux tools:

    sudo apt-get install linux-tools-common 
    sudo apt-get install linux-tools-common-$(uname -r)

As PERF depends heavily upon kernel facilities and interfaces, you should install the version of PERF that matches the installed kernel.

Raspberry Pi OS (once known as Raspian) is a Debian distro. Shucks, wouldn’t you know it, Debian installs PERF differently:

    sudo apt install linux-perf

There are different packages for buster and stretch (the current versions of Raspberry Pi OS and Debian at the time of this writing).

    https://packages.debian.org/buster/linux-perf 
    https://packages.debian.org/stretch/linux-perf

Installing on buster produces output like:

    XXX@raspberrypi:~ $ sudo apt install linux-perf 
    password for XXX: 
    Reading package lists… Done
    Building dependency tree       
    Reading state information… Done
    The following additional packages will be installed:
       linux-perf-4.9
    Suggested packages:
       linux-doc-4.9
    The following NEW packages will be installed:
       linux-perf linux-perf-4.9
    0 upgraded, 2 newly installed, 0 to remove and 107 not upgraded.
    Need to get 1,275 kB of archives.
    After this operation, 2,735kB of additional space will be used.
    Do you want to continue? [Y/n]

Versioning gotcha

And, of course, it’s never that simple. My version of Raspberry Pi OS (buster) is expecting PERF version 5.4. When you enter “sudo perf list” or any other PERF command on the command line, the shell runs the script /usr/bin/perf. The script checks the version of PERF against the kernel and complains when versions don’t match. The Debian install pulled version 4.9, not 5.4.

Rather than sort out versioning, I’ve been entering “perf_4.9” instead of “perf“. This work-around bypasses the perf script which checks versions. Since PERF is now fairly mature, it all seems to work. At some point, I’ll sort out the versioning situation and install 5.4. In the meantime, full steam ahead!

Getting started

Here’s a few PERF commands to get you started:

    perf stat --help 
    perf list sw 
    perf stat 
    perf top -a 
    perf top -e cpu_clock 
    perf record 
    perf report

The stat approach uses counting mode to measure software and hardware events triggered by an application program (“<cmd>”). The top approach displays event counts dynamically in real-time like the ever-popular “top” utility program. The record and report approach uses sampling to produce performance reports and profiles.

For additional usage information, check out the Linux performance analysis tutorial. There are several other fine tutorials and helpful sites on the Web. Many of the tutorials show use on x86 (Intel and AMD) systems, not Raspberry Pi and ARM. For that, I recommend my own three part tutorial:

Part 1 demonstrates how to use PERF to identify and analyze the hottest execution spots in a program. Part 1 covers the basic PERF commands, options and software performance events.
Part 2 introduces hardware performance events and demonstrates how to measure hardware events across an entire application.
Part 3 uses hardware performance event sampling to identify and analyze hot spots within an application program.

In addition to usage, I offer information and guidance concerning ARM micro-architecture. This information is especially helpful when you get into hardware performance events. Check out my summaries of the ARM11 and ARM Cortex-A72 micro-architectures. ARM11 covers Raspberry Pi models 1, 2, and 3 (BCM2835 and BCM2836), while the Cortex-A72 summary covers the Raspberry Pi 4 (BCM2711).

Other helpful on-line resources are:

“PERF Examples” by Brendan Gregg.
Using the perf utility on ARM, by Stefan.
“Enabling Raspberry Pi Performance Counter Support on Linux perf_event”, by Chad Paradis and Vincent Weaver, UMaine ECE Tech Report 2014-2.

Paranoia!

Performance measurement is fraught with security issues and holes. The kernel developers implemented a control flag file, /proc/sys/kernel/perf_event_paranoid which sets the level of access and vulnerability when taking measurements. Quoting the Linux man page:

    The perf_event_paranoid file can be set to restrict access 
    to the performance counters. 
        2   allow only user-space measurements (default since 
            Linux 4.6). 
        1   allow both kernel and user measurements (default 
            before Linux 4.6). 
        0   allow access to CPU-specific data but not raw 
            tracepoint samples. 
       -1   no restrictions. 
    The existence of the perf_event_paranoid file is the 
    official method for determining if a kernel supports 
    perf_event_open().

If you’re operating in a fairly closed, single-user environment, then set the content of the file to 0 or -1.

Read the perf_event_open() man page

I recommend reading the perf_event_open() man page. If you’re just starting your journey into performance measurement, you will be overwhelmed by the detail at first. However, just let the information wash over you and know that it’s there. The tutorials don’t always mention the perf_event_paranoid flag and other low-level details. Reading the man page should help you across future stumbling blocks and will enhance your understanding of events, counting and sampling.

Want to learn more about Raspberry Pi 4 (Cortex-A72 / Broadcom BCM2711) performance tuning? Please read:

PERF tutorial part 3 is now on-line

Posted on April 3, 2015 by pj

Just wrapped up Part 3 of the Linux-tools PERF tutorial.

The tutorial now consists of three parts. Part 1 covers the most basic PERF commands and shows how to find program hot-spots using software performance events. Part 2 discusses hardware performance events and performance counters, and demonstrates how to measure hardware performance events using PERF counting mode. Part 2 introduces several derived performance metrics like instructions per second (IPC) and applies these metrics to the sample application programs.

Part 3 is the newest addition to the tutorial series. It builds on parts 1 and 2, showing how to use hardware performance events and counter sampling to profile an application program. Part 3 discusses sampling period and frequency, the sampling process, overhead, statistical accuracy/confidence and other practical concerns.

I hope you find the PERF tutorial to be useful in your work! Although I produced the example data on the ARM-based Raspberry Pi, the commands and techniques will also work on x86.

PERF tutorial part 2 now available

Posted on March 26, 2015 by pj

Part 2 of a three part tutorial about Linux-tools PERF is now available.

Part 1 of the series shows how to find hot execution spots in an application program. It demonstrates the basic PERF commands using software performance events such as CPU clock ticks and page faults.

Part 2 of the series — just released — introduces hardware performance counters and events. I show how to count hardware events with PERF and how to compute and apply a few basic derived measurements (e.g., instructions per cycle, cache miss rate) for analysis. Part 3 is in development and will show how to use sampling to profile a program and to isolate performance issues in code.

All three parts of the series use the same simple, easy to understand example: matrix multiplication. One version of the matrix multiplication program illustrates the impact of severe performance issues and what to look for in PERF measurements. The issues are mitigated in the second, improved version of the program. PERF measurements for the improved program are presented for comparison.

The test platform is the latest second generation Raspberry Pi 2 running Raspbian Wheezy 3.18.9-v7+. The Raspberry Pi 2 has a 900MHz quad-core ARM Cortex-A7 (ARMv7) processor with 1GByte of primary memory. Although the tutorial series demonstrates PERF on Cortex-A7, the same PERF commands and analytical techniques can be employed on other architectures like x86.

A special note for Raspberry Pi users. The current stable distribution of Raspbian Wheezy — 3.18.7-v7+ February 2015 — does not support PERF hardware events. Full PERF support was enabled in a later, intermediate release and full PERF support should be available in the next stable release of Raspbian Wheezy. In the meantime, Raspberry Pi 2 users may profile their programs using PERF software events as shown in Part 1 of the tutorial. First generation Raspberry Pi users are also restricted to software performance events.

Brave souls may try rpi-update to upgrade to the latest and possibly unstable release. I recommend waiting for the next stable release unless you really, really know what you are doing and are willing to chance an unstable kernel with potentially catastrophic consequences.

rpistat: RPi event monitoring tool

Posted on June 28, 2013 by pj

[At the time of this post, Performance Events for Linux (PERF) supports only software events on Raspberry Pi. Rpistat is a tool that counts hardware performance events in the style of perf stat and is a poor man’s substitute.]

So far, my example programs for the Raspberry Pi (matrix multiplication, pointer chasing) are instrumented with explicit code to measure and display hardware performance events. The instrumentation code executes privileged ARM instructions to read and write the performance counters and control register. The aprofile kernel module must be loaded in order to enable user-space access to the ARM1176 performance monitoring unit (PMU). If aprofile is not loaded, the privileged instructions trap to the OS and the process terminates abnormally.

This approach, which I call self-measurement, requires modification to the application source code. While self-measurement collects performance event data about specific code regions in the application, source code changes are intrusive, inconvenient and involve more work.

Performance Events for Linux (PERF) is a profiling infrastructure that provides a suite of performance analysis tools on Linux. The perf stat tool launches a workload and counts (selected) hardware or software performance events caused by the workload. No changes to the workload are required at either the source or binary levels.

Rpistat is similar to perf stat. It launches a workload and counts nine of the most commonly used ARM1176 hardware performance events while the workload executes. Running rpistat is easy. Just enter “rpistat” followed by the command you would normally use to launch the workload from the shell.

rpistat naive

In the example above, the workload is the naive matrix multiplication program. Rpistat writes a file named rpistat.txt when the workload completes. The file contains a basic performance report. Here is example output from rpistat.

***************************************************************
rpistat: ./naive
Thu Jun 27 09:58:08 2013
***************************************************************

System information
  Number of processors:     1

Performance events
  [ ... ] = scaled event count
  PTI     = per thousand instructions
  Total periods:      169

  Cycles:             11,759,598,287
  Instructions:       315,810,640  [1,241,209,259]
  IBUF stall cycles:  65,981,902  [259,324,219]
  Instr periods:      43
  CPI:                9.474  
  IBUF stall percent: 2.205  %

  DC cached accesses: 4,558,795  [18,343,722]
  DC misses:          933,837  [3,757,582]
  DC periods:         42
  DC miss ratio:      20.484 %

  MicroTLB misses:    224,886  [904,898]
  Main TLB misses:    172,973  [696,010]
  TLB periods:        42
  Micro miss rate:    0.729   PTI
  Main miss rate:     0.561   PTI

  Branches:           33,438,664  [134,550,814]
  Mispredicted BR:    366,383  [1,474,255]
  BR periods:         42
  Branch rate:        108.403 PTI
  Mispredict ratio:   1.096  %

The report includes raw event counts, scaled event counts, rates and ratios. The raw event counts are the actual number of events counted by the hardware performance counters. A rate tells us how often a given event is occurring in terms of events per thousand instructions (PTI). A ratio tells us what portion of events have a certain property, such as the percentage of non-sequential data cache accesses that result in a miss.

Rpistat uses a time division multiplexing scheme to periodically switch the performance counters across the nine events of interest. Rpistat makes a switch every 100 milliseconds (0.1s). The ARM1176 dedicates one performance counter to processor cycles. Rpistat exploits this counter to the max and measures processor cycles all the time (a 100% duty-cycle for this event). Rpistat switches through the other eight events in pairs called event sets. Each event set is measured for about 25% of the overall time. Rpistat keeps track of the active time for each event and scales the raw event counts up to full duty-cycle estimates. These scaled event counts are displayed within square brackets [ ... ] in the output file.

When rpistat switches between event sets, it first accumulates the current counts into 64-bit unsigned integers called virtual counters. Rpistat avoids overflow problems because the measurement period (0.1s) is much shorter than the time needed to overflow the 32-bit hardware counters. The long virtual counter length (64 bits) postpones any real overflow for a very long time, effectively eliminating the practical possibility of an overflow.

If you would like to know more about rpistat’s implementation including important design concerns and limitations, please read about it here.

How does rpistat compare against self-measurement for accuracy? Here’s a quick comparison between the two approaches when they are applied to the naive (textbook) matrix multiplication program.

**Self-measurement vs. rpistat**
Metric	Self-measurement	Rpistat
Elapsed time	16.0s	16.6s
Scaled cycles	181,984,943	184,288,556
Instructions	1,190,897,822	1,245,571,856
DC access	17,588,053	18,587,014
DC miss	2,982,314	3,735,024
MicroTLB miss	809,154	904,166
Main TLB miss	612,688	661,667
CPI	9.78 CPI	9.47 CPI
DC access rate	14.77 PTI	14.92 PTI
DC miss rate	2.50 PTI	2.99 PTI
DC miss ratio	17.0%	20.1%
MicroTLB rate	0.68 PTI	0.73 PTI
Main TLB rate	0.51 PTI	0.53 PTI

The event counts, rates and ratios are within normal run-to-run variability in every case.

The matrix multiplication program has fairly consistent (uniform) dynamic behavior throughout its lifetime. Workloads with distinct processing phases may not fair as well. YMMV. Please see the rpistat page for more information and links to source (rpistat.c).

Memory access time

Posted on June 21, 2013 by pj

In my last post, I used a simple pointer chasing loop to characterize the different levels of the BCM2835 (Raspberyy Pi) memory hierarchy. I ran the pointer chasing loop for a range of linked list sizes where the list size determines the storage level to be exercised. For example, a linked list occupying 16KB or less causes the chase loop to exercise the level 1 (L1) data cache. I measured hardware performance events for each test case and summarized the results in a table. The results clearly show three different timing tiers: L1 data cache, access to primary memory (no TLB miss) and access to primary memory with a Main TLB miss. A miss in the Main TLB is quite slow because the hardware page table walker must read page mapping information from primary memory in order to update the TLB and to complete the memory operation in flight.

The one characteristic missing from the analysis, however, is the estimated access time for each level. Sure, the execution times clearly break the test cases into tiers, but elapsed execution time does not really measure the access (latency) time itself.

In the next experiment, I estimate the access time. (See the source code in latency.c.) The approach is essentially unchanged from the first experiment: use pointer chasing on lists of different sizes to hit in particular levels of the memory hierarchy. Like the first experiment, the program (named latency) successively traverses the same linked list many times (default: 1000 traversals). Part way through each traversal, the loop chooses a chase operation and measures its execution time. The sampled execution time is saved temporarily in an array until all traversals are complete. Then, the program writes the execution times to a file named samples.dat. The samples are aggregated into a histogram which shows the distribution of the times.

The sampling technique lets each full traversal get underway before taking a measurement in order to “warm up” the pipeline, cache and TLBs. Taking a single sample per traversal avoids major cache and TLB pollution due to the bookkeeping needed to take and save samples. The measured execution times should faithfully represent access time (plus some measurement bias).

The program measures execution time in cycles. It configures and enables the ARM1176 Cycles Counter Register (CCR) as a free-running cycles counter. The program resets and enables the CCR before starting a traversal, thus avoiding a counter overflow. (A full traversal completes before the CCR overflows.) The pointer chasing loop reads the CCR before chasing the next pointer and reads the CCR after chasing the pointer. The difference between the after and before counts is the estimated execution time of the chase operation. The chase operation is implemented as a single ARM load instruction, so the instruction execution time is a biased estimate of the memory access time.

Here is the code for the function sample_cycles() and the pointer chasing loop within. The function takes two arguments: a pointer to the head of the linked list and an integer value that chooses the chase operation to be sampled.

uint32_t sample_cycles(CacheLine* linked_list, int n)
{
  register CacheLine* item ;
  register uint32_t before, after, cycles ;
  register count = n ;

  cycles = 0 ;
  for(item = linked_list ; item != NULL ; ) {
    if (count-- == 0) {
      before = armv6_read_ccr() ;
      item = item->ptrCacheLine[0] ;
      after = armv6_read_ccr() ;
      cycles = after - before ;
    } else {
      item = item->ptrCacheLine[0] ;
    }
  }

  return( cycles ) ;
}

On each iteration, the loop body checks and decrements the sampling count maintained in the variable count. If the sampling count is not zero, then the loop quickly performs an ordinary chase operation. If the count is zero, then the code snaps the before and after CCR values by calling armv6_read_ccr(), an in-line function that performs an ARM MRC instruction to read CCR. The difference is computed and is saved in the variable cycles until it can be returned by the function after traversal. The chase loop exits when it reaches the NULL pointer at the end of the linked list.

Local variables are allocated to general registers in order to avoid extraneous cache and TLB references that would perturb measurements. I checked the compiled machine code to make sure that the compiler accepted the register hints and actually allocated the local variables to registers. I also made sure that the function armv6_read_ccr() was expanded in-line.

The latency program needs to have user-space access to the ARM1176 performance counters. You must load the aprofile kernel module before running latency. Otherwise, you will get an “Illegal instruction” exception when the program attempts to configure the performance counters.

Here is a histogram showing the distribution of execution time samples when the linked list is 8KB in size. The 8KB test case consistently hits in the L1 data cache.

9  1000 ################################################

The first column is the execution time in cycles. The second column is the number of samples with the value shown in the first column. All 1,000 samples are 9 cycles.

Chapter 16 of the ARM1176JZF-S Technical Reference Manual (TRM) describes instruction cycle timings. A load from L1 data cache has a 3 cycle load-to-use latency (access) time. Taking 3 cycles as the actual access time, I estimate a measurement bias of (9 cycles – 3 cycles) = 6 cycles due to the pipelined execution of the ARM MRC instructions that read the CCR and the load instruction that chases the linked list pointer.

The 32KB and 64KB test cases form the middle timing tier. Both test cases are within the coverage of the Main TLB (288KB) and memory accesses hit in either the Micro TLB or Main TLB. The distributions are similar. Here is the distribution for the 64KB case.

 9   49 ##### 
17    1 # 
61  320 ##########################
62  571 ################################################
63   11 # 
64    4 # 
65    1 # 
69    1 # 
70   27 ###

There is a strong peak in the narrow range of [61:62] cycles. The nonbiased estimate for access time to primary memory with a TLB hit is (62 cycles – 6 cycles bias) = 56 cycles.

The test cases at 256KB and larger form the third timing tier. These cases exceed the Main TLB coverage. (The 256KB case almost occupies the entire Main TLB and the performance event data place this case in the third and longest running tier.) Load operations miss in the Main TLB forcing a hardware page table walk. The table walker performs at least one additional primary memory read operation in order to obtain page mapping information for the TLB. Here is the distribution for the 1MB case.

113    4 # 
116  224 ##############################################
117   23 ###### 
118   59 ############## 
119  129 ###########################
120    5 ## 
121  224 ##############################################
122  185 ############
125    4 # 
130    9 ### 
132    7 ## 
133   18 ##### 
135    5 ##

The samples are clustered in the range of [116:122] cycles. The nonbiased estimate for access time to primary memory with a Main TLB miss is (122 cycles – 6 cycles bias) = 116 cycles. This is approximately twice the access time of a load that hits in either the data MicroTLB or the Main TLB.

The following table summarizes the access time for each level of the memory hierarchy.

**Memory hierarchy access times (latencies)**
Level	Access time	Condition
CPU register	1 cycle
L1 data cache	3 cycles	Data cache hit
Primary memory	56 cycles	No TLB miss
Primary memory	116 cycles	Main TLB miss

Where is the L2 cache? The BCM2835 dedicates the L2 cache to the Broadcom VideoCore GPU. Memory operations from the CPU are routed around the L2 cache. The L2 cache is not a factor here and is not part of the analysis.

From these estimates, we can see why the textbook (naive) matrix multiplication program is so slow. The textbook algorithm does not walk sequentially through one of its operand arrays. The long memory stride causes a large number of Main TLB misses. Memory access is twice as slow when an access misses in the Main TLB, thereby slowing down program execution and increasing elapsed time.

Faster matrix multiplication (part 2 of 2)

Posted on June 3, 2013 by pj

Part 2 of two parts on matrix multiplication demonstrates a fast matrix multiplication program. The algorithm is a simple transformation of the textbook algorithm — the olde loop nest interchange. The transformation changes the slow access pattern to one of the arrays so that the program steps sequentially through the array elements in memory. Execution time speeds up from about 16 seconds elapsed time to 6 seconds. Not bad for a few minutes work!

All of the key memory-related performance events are improved since the access pattern is a better fit with the underlying memory microarchitecture. The analysis shows that we need to be careful when interpreting the Data Cache Access event because this event counts nonsequential memory accesses instead of all level 1 DC accesses or architectural loads and stores.

Part 2 also discusses operation or instruction counting to analyze program complexity at a micro-level. I like to look at the assembler code generated by the compiler to see if there are any potential speed-ups. The article shows how to look at the assembler code using the GCC -S option and using the objdump program. I use instruction counting to check the operation and meaning of performance events like the ARM11 Executed Instructions event.

The Broadcom BCM2835 in the Raspberry Pi has an integer core and a Vector Floating Point (VFP) coprocessor. The VFP operates concurrently with the integer core. In fact, it operates quite independently and only synchronizes with the integer core at a few well-defined points. VFP instructions are allowed to complete out of order, which allows for greater speed, but makes FP exceptions somewhat imprecise. (Now exactly where did that underflow/overflow occur?) The VFP coprocessor has 32 registers of its own, which are organized as four 8-register banks. GCC uses the coprocessor for scalar floating point arithmetic, but doesn’t exploit any parallelism.

The VFP operates on short vectors in a register bank. Potentialy, the VFP coprocessor could be exploited to further speed up matrix multiplication. One possibility is to stream incoming data as a four-wide stripe through an array and operate on four elements at once. Or, stream four elements at a time from a single row/column. Take a look at the VFP Math Library.

It’s not all good news, however. The VFP coprocessor is not a true single instruction, multiple data (SIMD) engine. (It’s similar to an old school short vector architecture.) It only has a single floating multiply/accumulate (FMAC) pipeline. A true SIMD would have four FMAC units. Also, computations are relatively difficult to set up and stage. Computations must be double buffered where the integer Load Store Unit (LSU) is filling one register bank while the coprocessor is performing computations in a different register bank. Further, GCC vectorization doesn’t appear to support VFP.

ARM must have gotten the message from its users. Later processors implement NEON SIMD and just enough VFP for the sake of legacy compatibility. The Beaglebone Black (ARM Cortex-A8) has NEON and I’m looking forward to trying it out. GCC vectorization supports NEON, too, and it’s a whole lot easier to let the compiler vectorize your program for you than to write vector code yourself!

The BCM2835 also has the VideoCore GPU for SIMD computation. There are a bunch of folks who are reverse engineering the GPU in order to use it for general purpose computation (GPGPU). Have it at, guys and gals!

Even if VFP is an orphan, the coprocessor has 32 registers where you can stash data. Maybe you can find a way to make use of these extra registers? Side-to-side access (integer/floating) is pretty fast.

Sand, software and sound

Electronics and computing for the fun of it

Tag Archives: performance counters