RPi 4 tuning: The code

I hope you have enjoyed my series of articles about Raspberry Pi 4 performance events, measurement and tuning:

Today, I want to wrap up the series with C code.

Please don’t forget my Performance Events for Linux tutorial and learn to make your own Raspberry Pi 4 (Broadcom BCM2711) performance measurements. The commands in the PERF tutorial apply to x86, AMD64 and other architectures, too.

Before getting too far, here is the link to the ZIP file with the code. 🙂 The main source files are:

  • makefile: The make file (duh!)
  • pe_assist.h: Performance event helper header
  • pe_assist.c: Performance event helper functions
  • pe_cortex_a72.h: A72-specific helper header
  • pe_cortex_a72.c: A72-specific helper functions
  • pe_test.c: pe_assist check-out test
  • pe_matrix.c: pe_assist matrix multiply example
  • a72_test.c: A72-specific check-out test
  • a72_walk.c: A72-specific array walk kernel
  • a72_matrix.c: A72-specific matrix multiply
  • a72_misp.c: A72-specific branch mispredict kernel
  • a72_chase.c: A72-specific pointer chasing kernel

There are a few surprises, too, such as earlier versions of code, etc.

The programs self-monitor, that is, they call perf_event_open() to configure, control and read the performance counters. perf_event_open() has many parameters, so I wrote helper functions assisting counter configuration, control and access. There are two flavors: architecture independent and Cortex-A72 specific. The architecture independent functions are defined in pe_assist.* and the A72-specific functions are defined in pe_cortex_a72.*. The architecture independent functions should work on x86, etc., too.

Aside from the two check-out tests, the rest of the source modules are workloads. These are the programs that I used to collect data for the articles about Cortex-A72 performance measurement, analysis and tuning. Feel free to bash away at everything!

Helper functions

As I mentioned above, I separated the helper functions into architecture independent and Cortex-A72 specific modules. The architecture independent helper functions handle Linux performance counter set-up, control and read back:

  • peInitialize(): Initialize/reset the helper module:
  • peMakeGroup(): Make a counter group
  • peAddLeader(): Add leader event to the group
  • peAddEvent(): Add an event to the group
  • peStartCounting(): Start the counter group
  • peStopCounting(): Stop the counter group
  • peResetCounters(): Reset the counters
  • peReadCount(): Read an event count
  • pePrintCount(): Print and event count

The interface is “lite” and uncomplicated. It’s just enough to get the job done. Sometimes during early days, there is a temptation to build the Taj Mahal. I prefer to build something simple and get experience before building up and out. This simple interface proved to be good enough.

perf_event_open() supports simple event counting and sampling. If you’re familiar with perf stat, you’ve already seen simple event counting, AKA counting mode. perf stat measures events across the entire run of an application program. Self-monitoring is similar except you insert measurement code into the application program around the critical code that you wish to measure. perf stat doesn’t require code modification or recompile, but it doesn’t let you focus on particular critical loops or whatever. Self-monitoring is a little bit more effort, but it allows focus.

The usage model is straightforward:

  1. Initialize the module data structures.
  2. Create a performance event group.
  3. Add a leader event to the group.
  4. Add other events (up to 6 events for Cortex-A72) to the group.
  5. Start the counter group.
  6. Execute the workload or critical inner loops.
  7. Stop the counter group.
  8. Read and print the event counts.

Since this sequence is a recurring pattern, I also wrote a few functions which target common types of measurements such as:

  • peMeasureInstructionEvents()
  • pePrintInstructionEvents()
  • peMeasureDataAccessEvents()
  • pePrintDataAccessEvents()

These functions configure the pre-defined “symbolic” events which the Linux kernel has preselected for the platform architecture. Thus, you should be able to use the pe_assist.* module on any Linux box.

The Cortex-A72 module, pe_cortex_a72.*, use “raw” event identifiers for configuration. The available events are defined in pe_cortex_a72.h and they are specific to ARM Cortex-A72. I rely mainly on the Cortex-A72 events because then I know exactly which A72 events I am measuring. The Cortex-A72 module calls the low-level helper functions and it exports only targeted measurement functions:

  • a72MeasureInstructionEvents()
  • a72PrintInstructionEvents()
  • a72MeasureDataAccessEvents()
  • a72PrintDataAccessEvents()
  • a72MeasureTlbEvents()
  • a72PrintTlbEvents()

Take a peek inside one of the test programs and you’ll see how to call the helper modules.

Internal design

perf_event_open() is the Swiss Army knife of performance counter configuration and control. On Linux, all counter-related operations go through this single kernel call.

perf_event_open() allows control of individual counters, of course. However, it also provides a way to control a group of counters. One can save additional trips in and out of the kernel through counter groups. Instead of making six calls to start six counters, one only needs to make one perf_event_open() call to start an entire group of six events.

A Cortex-A72 group consists of one to six event counters. Each group has a distinguished member: the leader event. You can start, stop and reset the entire group by referring to the leader. Because the group members usually share characteristics like the process (ID) to be measured, the CPU set, flags, etc., it makes sense to define all of these common properties for an entire group. This approach reduces the number of parameters to be passed around during configuration.

In keeping with the “lite” philosophy, the helper module keeps the common flags and such in a few variables and arrays. The group and leader definition functions establish group-wide values for the member events in the group. That’s all there is to it, so hack away! The “lite” approach was good enough 99% of the time, so you might not need to dip into the helper modules at all.

Copyright © 2021 Paul J. Drongowski

ARM Cortex-A72 tuning: Branch mispredictions

Back on the old day job, I developed and tested software and hardware for program profiling. Testing may sound like drudge-work, but there are ways to make things fun!

Two questions arise while testing a profiling infrastructure — software plus hardware:

  • Does the hardware accurately count (or sample) performance events for a given specific workload?
  • Does the software accurately display the counts or samples?

Clearly, ya need working hardware before you can build working software.

Testing requires a solid, known-good (KG) baseline in order to decide if new test results are correct. Here’s one way to get a KG baseline — a combination of static analysis and measurement:

  • Static analysis: Analyze the post-compilation machine code and predict the expected number of instruction retires, cache reads, misses, etc.
  • Measurement: Run the code and count performance events.
  • Validation: Compare the measured results against the predicted results.

Thereafter, one can compare new measurements taken from the system under test (SUT) and compare against both predicted results and baseline measured results.

Applying this method to performance counter counting mode is straightforward. You might get a little “hair” in the counts due to run-to-run variability, however, the results should be well-within a small measurement error. Performance counter sampling mode is more difficult to assess and one must be sure to collect a statistically significant number of samples within critical workload code in order to have confidence in a result.

One way to make testing fun is to make it a game. I wrotekernel programs that exercised specific hardware events and analyzed the inner test loops. You could call these programs “test kernels.” The kernels are pathologically bad (or good!) code which triggers a large number of specific performance events. It’s kind of a game to write such bad code…

The expected number of performance events is predicted through machine code level complexity analysis known as program “microanalysis.” For example, the inner loops of matrix multiplication are examined and, knowing the matrix sizes, the number of retired instructions, cache reads, branches, etc. are computed in closed form, e.g.,

    (38 inner loop instructions) * (1,000,000,000 iterations) + 
(26 middle loop instructions) * (1,000,000 iterations) +
(9 outer loop instructions) * (1,000 iterations)
-----------------------------------------------------------
38,026,009,000 retired instructions expected
38,227,831,497 retired instructions measured

This formula is the closed form expression for the retired instruction count within the textbook matrix multiplication kernel. The microanalysis approach worked successfully on Alpha, Itanium, x86, x64 and (now) ARM. [That’s a short list of machines that I’ve worked on. 🙂 ]

With that background in mind, let’s write a program kernel to deliberately cause branch mispredictions and measure branch mispredict events.

The ARM Cortex-A72 core predicts conditional branch direction in order to aggressively prefetch and dispatch instructions along an anticipated program path before the actual branch direction is known. A branch mispredict event occurs when the core detects a mistaken prediction. Micro-ops on the wrong path must be discarded and the front-end must be steered down the correct program path. The Cortex-A72 mispredict penalty is 15 cycles.

What we need is a program condition that consistently fools the Cortex-A72 branch prediction hardware. Branch predictors try to remember a program’s tendency to take or not take a branch and the predictors are fairly sensitive; even a 49%/51% split between taken and not taken has a beneficial effect on performance. So, we need a program condition which has 50%/50% split with a random pattern of taken and not taken direction.

Here’s the overall approach. We fill a large array with a random pattern of ‘0’ and ‘1’ characters. Then, we walk through the array and count the number of ‘1’ characters. The function initialize_test_array() fills the array with a (pseudo-)random pattern of ones and zeroes:

void initialize_test_array(int size, char* array, 
int always_one, int always_zero)
{
register char* r = array ;
int s ;
for (s = size ; s > 0 ; s--) {
if (always_one) {
*r++ = '1' ;
} else if (always_zero) {
*r++ = '0' ;
} else {
*r++ = ((rand() & 0x1) ? '1' : '0') ;
}
}
}

The function has options to fill the array with all ones or all zeroes in case you want to see what happens when the inner conditional branch is well-predicted. BTW, I made the array 20,000,000 characters long. The size is not especially important other than the desire to have a modestly long run time.

The function below, test_loop(), contains the inner condition itself:

int test_loop(int size, char* array) 
{
register int count = 0 ;
register char* r = array ;
int s ;
for (s = size ; s > 0 ; s--) {
if (*r++ == '1') count++ ; // Should mispredict!
} return( count ) ;
}

The C compiler translates the test for ‘1’ to a conditional branch instruction. Given an array with random ‘0’ and ‘1’ characters, we should be able to fool the hardware branch predictor. Please note that the compiler generates a conditional branch for the array/loop termination condition, s > 0. This conditional branch should be almost always predicted correctly.

The function run_the_test() runs the test loop:

void run_the_test(int iteration_count, int array_size, char* array) 
{
register int rarray_size = array_size ;
register char* rarray = array ;
int i ;
for (i = iteration_count ; i-- ; ) {
test_loop(array_size, array) ;
}
}

It calls test_loop() many times as determined by iteration_count. Redundant iterations aren’t strictly necessary when taking measurements in counting mode. They are needed, however, in sampling mode in order to collect a statistically significant number of performance event samples. I set the iteration count to 200 — enough to get a reasonable run time when sampling.

The test driver code initializes the branch condition array, configures the ARM Cortex-A72 performance counters, starts the counters, runs the test loop, stops the counters and prints the performance event counts:

initialize_test_array(array_size, array, always_one, always_zero) ; 
a72MeasureInstructionEvents() ;
peStartCounting() ;
run_the_test(iteration_count, array_size, array) ;
peStopCounting() ;
a72PrintInstructionEvents(stdout) ;

The four counter configuration, control and display functions are part of a small utility module that I wrote. I will explain the utility module in a future post and will publish the code, too.

Finally, here are the measurements when scanning an array holding a random pattern of ‘0’ and ‘1’ characters:

    Instructions ret'd:      45,999,735,845 
Instructions spec'd: 98,395,483,123
CPU cycles: 59,010,851,259
Branch speculated : 8,012,669,711
Branch mispredicted: 2,001,934,251
Branch predicted 8,012,669,710
Instructions per cycle: 0.780
Retired/spec'd ratio: 0.467
Branches per 1000 (PTI): 174.189
Branch mispredict ratio: 0.250

Please recall that there are two conditional branches in the inner test loop: a conditional branch to detect ‘1’ characters and a conditional branch to check the array/loop termination condition. The loop check should be predicted correctly almost all the time, accounting for 50% of the total number of correctly predicted branches. The character test, however, should be incorrectly predicted 50% of the time. It’s like guessing coin flips — you’ll be right half the time on average. Overall, 25% of branch predictions should be incorrect, and yes, the measured branch mispredict ratio is 0.250 or 25%.

The number of speculated instructions is also very interesting. Cortex-A72 speculated twice as many ARMv8 instructions as it retired. Over half of the speculated instructions did not complete architecturally and were discarded. That’s what happens when a conditional branch is grossly mispredicted!

I hope you enjoyed this simple experiment. It makes the Cortex-A72 fetch and branch prediction behavior come alive. As a follow-up experiment, I suggest trying all-ones or all-zeroes.

Please check out other articles in this series:

Don’t forget my Performance Events for Linux tutorial and learn to make your own Raspberry Pi 4 (Broadcom BCM2711) performance measurements.

Copyright © 2021 Paul J. Drongowski

ARM Cortex-A72 branch-related performance events:

 Number Mnemonic          Event name
------ ---------------- -----------------------------------------
0x08 INST_RETIRED Instruction architecturally executed
0x10 BR_MIS_PRED Mispredicted or not predicted branches
0x11 CPU_CYCLES Processor cycles
0x12 BR_PRED Predictable branch speculatively executed
0x1B INST_SPEC Operation speculatively executed
0x76 PC_WRITE_SPEC Software change of the PC (speculative)
0x78 BR_IMMED_SPEC Immediate branch (speculative)
0x79 BR_RETURN_SPEC Procedure return (speculative)
0x7A BR_INDIRECT_SPEC Indirect branch (speculative)

Disassembled code for test_loop():

00010678 :
10678: e92d0830 push {r4, r5, fp}
1067c: e28db008 add fp, sp, #8
10680: e24dd014 sub sp, sp, #20
10684: e50b0018 str r0, [fp, #-24] ; 0xffffffe8
10688: e50b101c str r1, [fp, #-28] ; 0xffffffe4
1068c: e3a04000 mov r4, #0
10690: e51b501c ldr r5, [fp, #-28] ; 0xffffffe4
10694: e51b3018 ldr r3, [fp, #-24] ; 0xffffffe8
10698: e50b3010 str r3, [fp, #-16]
1069c: ea000008 b 106c4
106a0: e1a03005 mov r3, r5
106a4: e2835001 add r5, r3, #1
106a8: e5d33000 ldrb r3, [r3]
106ac: e3530031 cmp r3, #49 ; 0x31
106b0: 1a000000 bne 106b8 ; Should mispredict!
106b4: e2844001 add r4, r4, #1
106b8: e51b3010 ldr r3, [fp, #-16]
106bc: e2433001 sub r3, r3, #1
106c0: e50b3010 str r3, [fp, #-16]
106c4: e51b3010 ldr r3, [fp, #-16]
106c8: e3530000 cmp r3, #0
106cc: cafffff3 bgt 106a0 ; Correctly predicted
106d0: e1a03004 mov r3, r4
106d4: e1a00003 mov r0, r3
106d8: e24bd008 sub sp, fp, #8
106dc: e8bd0830 pop {r4, r5, fp}
106e0: e12fff1e bx lr

Cortex-A72 tuning: Data access

In my discussion about instructions per cycle as a performance metric, I compared the textbook implementation of matrix multiplication against the loop next interchange version. The textbook program ran slower (28.6 seconds) than the interchange version (19.6 seconds). The interchange program executes 2.053 instructions per cycle (IPC) while the textbook version has a less than stunning 0.909 IPC.

Let’s see why this is the case.

Like many other array-oriented scientific computations, matrix multiplication is memory bandwidth limited. Matrix multiplication has two incoming data streams — one stream from each of the two operand matrices. There is one outgoing data stream for the matrix product. Thanks to data dependency, the incoming streams are more important than the outgoing matrix product stream. Thus, anything that we can do to speed up the flow of the incoming data streams will improve program performance.

Matrix multiplication is one of the most studied examples due to its simplicity, wide-applicability and familiar mathematics. So, nothing in this note should be much of a surprise! Let’s pretend, for a moment, that we don’t know the final outcome to our analysis.

I measured retired instructions, CPU cycles and level 1 data (L1D) cache and level 2 (L2) cache read events:

    Event                          Textbook     Interchange 
----------------------- -------------- --------------
Retired instructions 38,227,831,497 60,210,830,509
CPU cycles 42,068,324,320 29,279,037,884
Instructions per cycle 0.909 2.056
L1 D-cache reads 15,070,922,957 19,094,920,483
L1 D-cache misses 1,096,278,643 9,576,935
L2 cache reads 1,896,007,792 264,923,412
L2 cache read misses 124,888,097 125,524,763

There is one big take-away here. The textbook program misses in the data cache far more often than interchange. The textbook L1D cache miss ratio is 0.073 (7.3%) while the interchange cache miss ratio is 0.001 (0.1%). As a consequence, the textbook program reads the slower level 2 (L2) cache more often to find necessary data.

If you noticed slightly different counts for the same event, good eye! The counts are from different runs. It’s normal to have small variations from run to run due to measurement error, unintended interference from system interrupts, etc. Results are largely consistent across runs.

The behavioral differences come down to the memory access pattern in each program. In C language, two dimensional arrays are arranged in row-major order. The textbook program touches one operand matrix in row-major order and touches the other operand matrix in column-major order. The interchange program touches both operand arrays in row-major order. Thanks to row-major order’s sequential memory access, the interchange program finds its data in level 1 data (L1D) cache more often than the textbook implementation.

There is another micro-architecture aspect to this situation, too. Here are the performance event counts for translation look-aside buffer (TLB) behavior:

    Event                          Textbook     Interchange 
----------------------- -------------- --------------
Retired instructions 38,227,830,517 60,210,830,503
L1 D-cache reads 15,070,845,178 19,094,937,273
L1 DTLB miss 1,001,149,440 17,556
L1 DTLB miss LD 1,000,143,621 10,854
L1 DTLB miss ST 1,005,819 6,702

Due to the chosen matrix dimensions, the textbook program makes long strides through one of the operand matrices, again, due to the column-major order data access pattern. The stride is big enough to touch different memory pages, thereby causing level 1 data TLB (DTLB) misses. The textbook program has a 0.066 (6.6%) DTLB miss ratio. The miss ratio is near zero for the interchange version.

I hope this discussion motivates the importance of cache- and TLB-friendly algorithms and code. Please see the following articles if you need to brush up on ARM Cortex-A72 micro-architecture and performance events:

Check out my Performance Events for Linux tutorial and learn to make your own Raspberry Pi 4 (Broadcom BCM2711) performance measurements.

Here is a list of the ARM Cortex-A72 performance events that are most useful for measuring memory access (load, store and fetch) behavior. Please see the ARM Cortex-A72 MPCore Processor Technical Reference Manual (TRM) for the complete list of performance events.

Number  Mnemonic            Name 
------ ------------------ ------------------------------------
0x01 L1I_CACHE_REFILL Level 1 instruction cache refill
0x02 L1I_TLB_REFILL Level 1 instruction TLB refill
0x03 L1D_CACHE_REFILL Level 1 data cache refill
0x04 L1D_CACHE Level 1 data cache access
0x05 L1D_TLB_REFILL Level 1 data TLB refill
0x08 INST_RETIRED Instruction architecturally executed
0x11 CPU_CYCLES Processor cycles
0x13 MEM_ACCESS Data memory access
0x14 L1I_CACHE Level 1 instruction cache access
0x15 L1D_CACHE_WB Level 1 data cache Write-Back
0x16 L2D_CACHE Level 2 data cache access
0x17 L2D_CACHE_REFILL Level 2 data cache refill
0x18 L2D_CACHE_WB Level 2 data cache Write-Back
0x19 BUS_ACCESS Bus access
0x40 L1D_CACHE_LD Level 1 data cache access - Read
0x41 L1D_CACHE_ST Level 1 data cache access - Write
0x42 L1D_CACHE_REFILL_LD L1D cache refill - Read
0x43 L1D_CACHE_REFILL_ST L1D cache refill - Write
0x46 L1D_CACHE_WB_VICTIM L1D cache Write-back - Victim
0x47 L1D_CACHE_WB_CLEAN L1D cache Write-back - Cleaning
0x48 L1D_CACHE_INVAL L1D cache invalidate
0x4C L1D_TLB_REFILL_LD L1D TLB refill - Read
0x4D L1D_TLB_REFILL_ST L1D TLB refill - Write
0x50 L2D_CACHE_LD Level 2 data cache access - Read
0x51 L2D_CACHE_ST Level 2 data cache access - Write
0x52 L2D_CACHE_REFILL_LD L2 data cache refill - Read
0x53 L2D_CACHE_REFILL_ST L2 data cache refill - Write
0x56 L2D_CACHE_WB_VICTIM L2 data cache Write-back - Victim
0x57 L2D_CACHE_WB_CLEAN L2 data cache Write-back - Cleaning
0x58 L2D_CACHE_INVAL L2 data cache invalidate
0x66 MEM_ACCESS_LD Data memory access - Read
0x67 MEM_ACCESS_ST Data memory access - Write
0x68 UNALIGNED_LD_SPEC Unaligned access - Read
0x69 UNALIGNED_ST_SPEC Unaligned access - Write
0x6A UNALIGNED_LDST_SPEC Unaligned access
0x70 LD_SPEC Speculatively executed - Load
0x71 ST_SPEC Speculatively executed - Store
0x72 LDST_SPEC Speculatively executed - Load or store

Copyright © 2021 Paul J. Drongowski

ARM Cortex-A72 execution and load/store

I hope you had an opportunity to read about ARM Cortex-A72 fetch and processing. ARM Cortex-A72 is the high performance application core in the Broadcom BCM2711, also known as the Raspberry Pi 4. In this post, I’m going to continue my exploration of the A72 micro-architecture, concentrating on the execution units and load/store operation.

Execution pipelines

Cortex-A72 has eight independent execution units (pipelines):

  • Branch: Branch micro-ops
  • Integer 0: Integer ALU micro-ops
  • Integer 1: Integer ALU micro-ops
  • Integer Multi-Cycle: Integer shift-ALU, multiply, divide, CRC and sum-of-absolute differences micro-ops
  • FP/ASIMD 0: ASIMD ALU, ASIMD misc, ASIMD integer multiply, FP convert, FP misc, FP add, FP multiply, FP divide and crypto micro-ops
  • FP/ASIMD 1: ASIMD ALU, ASIMD misc, FP misc, FP add, FP multiply, FP square root and ASIMD shift micro-ops
  • Load: Load and register transfer micro-ops
  • Store: Store and special memory micro-ops

The Cortex-A72 front-end puts micro-ops into per-pipe issue queues which, in turn, feed the execution units. There are eight issue queues. The queues have eight entries each except the branch queue, which has ten entries (66 queue entries total like the old Cortex-A57). The queues provide rate-balancing between the core front-end (i.e., the instruction/micro-op stream) and the execution units. The queues allow greater parallelism between units, too, letting each pipeline run at its own independent single- or multi-cycle speed.

ARM Cortex-A72 micro-architecture (Source: Hiroshige Goto)

The branch and integer pipes are very fast, each pipe executing a micro-op in a single processor cycle. The integer pipelines have multiple, zero-cycle forwarding datapaths. These paths, sometimes called “by-passes,” send intermediate results directly to stages (computations) needing the result right darned now without writing the result into the rename register file first.

The integer multi-cycle pipe handles integer micro-ops which require 2 or more processor cycles for execution. Shift operations are relatively fast: 2+ cycles. Integer multiplication has a 3 to 5 cycle latency. Integer divide is relatively slow taking anywhere from 4 to 20 cycles. Multiplication can be accelerated through dedicated combinational logic; division is sequential by nature and requires many steps.

The FP (floating point) and ASIMD (Advanced Single Instruction Multiple Data) units perform floating point and SIMD computations. Both units are generalist and perform commonly occurring FP operations: FP ADD, SUB, MUL, NEG, ABS, MAX, MIN, etc. Execution latency varies from 3 to 4 cycles for these basic operations. The FP pipes support late forwarding of FP MUL products to FP multiply-accumulate micro-ops, letting FP multiply-accumulate complete in 6 cycles.

Each FP unit is a specialist, too:

  • FP/ASIMD 0: FP CONVERT, ROUND, DIV, CRYPTO
  • FP/ASIMD 1: FP COMPARE, SQRT

FP divide and square root operations are performed using iterative algorithms. Only one FP DIV or SQRT operation at a time may execute in a pipe. Latencies are long: 6 to 18 cycles for DIV and 6 to 32 cycles for SQRT, depending upon FP datatype.

Please see the ARM Cortex-A72 Software Optimization Guide for detailed instruction timing, pipe assignment and ASIMD operation.

Load and store micro-ops are executed by the load and store units. The load and store units are mutually independent. One load and one store micro-op can execute each processor cycle. (Load and store are discussed below.) Load and store micro-ops issue speculatively. Under speculative execution, a load or store may reside on a correctly predicted branch path (the correct path) or an incorrectly predicted branch path (the wrong path). Loads, stores and associated data on a wrong path must be discarded. Store operations are buffered (wait) until they are determined to be on the correct path and are committed architecturally to primary memory.

Memory hierarchy

As I mentioned in my Cortex-A72 overview, the Raspberry Pi 4 (Broadcom BCM2711) has a four level memory hierarchy:

          Register          Fast, but small 
|
Level 1 caches
|
Level 2 cache
|
RAM Big, but slow

The RPi4 has four A72 cores. Each core has a register file and level 1 instruction and data caches. The four cores share a single unified level 2 cache and primary memory (RAM).

The register file is the fastest, but has the smallest capacity. Registers are read or written in a single processor cycle. RAM has the most capacity, but is relatively slow. RPi4 primary memory is LPDDR4-3200 SDRAM:

Memory array clock200 MHz
Prefetch size16n
I/O bus clock frequency1600 MHz
Data transfer rate (DDR)3200 Mb/s
Memory accesss bandwidth (MABW)12.8 GB/s

MABW above is peak. Memory bandwidth measurements using RAMspeed/SMP indicate actual RPi4 model B bandwidth is approximately 4.4 GB/s. (RAMspeed/SMP reads/writes memory in 1MB blocks.)

The following table summarizes Cortex-A72 cache characteristics:

L1I cache capacity48KB
L1I cache organizationPer-core, 3-way set associative, 64B line
L1D cache capacity32KB
L1D cache organizationPer-core, 2-way set associative, 64B line
L2 cache capacity1MB
L2 cache organizationShared, 16-way set associative, 64B line

Data accesses are handled by the Level 1 Data (L1D) cache. Instruction fetches are handled by the Level 1 Instruction (L1I) cache. The Level 2 (L2) cache is unified, handling both data and instructions. The Raspberry Pi BCM2711 ARM Peripherals manual states the following caveat with respect to the L2 cache:

BCM2711 provides a 1MB system L2 cache, which is used primarily by the GPU. Accesses to memory are routed either via or around the L2 cache depending on the address range being used.

Thus, application programs should not expect to receive a performance assist from the L2 cache! The VideoCode GPU accesses L2 cache through the Cortex-A72 ACP/AXI interface.

The L1D cache load-to-use latency is 4 cycles when the load hits in the L1D cache. The Level 2 (L2) cache load-to-use latency is 9 cycles when the load hits in the L2 cache.

A read access (e.g., a data load or instruction fetch) first tries the appropriate level 1 cache (Load: L1D cache, Fetch: L1I cache). If it finds the requested item in the level 1 cache — a hit — the item is sent to either the rename registers (for loads) or the instruction decoder (for fetches). Load data may also be sent through a bypass to an execution stage (micro-op) awaiting the incoming data.

If the read access misses the level 1 cache, the request is sent (optionally) to the unified L2 cache. If the requested item is found in L2 cache, the cache line containing the item is written into the level 1 cache, thereby replacing one of the existing lines. This operation is called a “refill.” If the line to be replaced is dirty (modified), then the old value is evicted and is written to primary memory. The requested item is selected from the incoming cache line and is routed to the appropriate destination (i.e., functional unit or instruction decoder).

If the read access misses (or optionally, bypasses) the L2 cache, the 64 byte line containing the item is read from primary memory. The incoming line is written to the level 1 cache (a refill) and (optionally) the L2 cache. Again, dirty lines are evicted.

Instruction and data bytes are read, written and transferred in 64-byte chunks (lines). This is true even if a load instruction requests a single byte from memory. Application programs should strive to use each entire cache line completely before moving on to the next line. Programs that exploit spatial and temporal locality perform better. Programmers need to pay careful attention to algorithm selection, data structure/layout and memory access patterns in order to make good, efficient use of data caching.

The description of Cortex-A72 cache operation above is simplified. Consider, for example, memory transaction types. Memory attributes within the Memory Management Unit (MMU) and page tables determine memory transaction types for each memory region:

  • Write-Back Read-Write-Allocate
  • Write-Back No-Allocate
  • Write-Through
  • Non-cacheable
  • Device

Memory transaction type affects cache behavior.

Write-Back Read-Write-Allocate is the most common and highest performing memory type. Incoming lines are written to the L1D cache and the read (or write) completes from the L1D cache. A store that hits a Write-Back cache line does not update main memory.

Write-Back No-Allocate does not write an incoming line to L1D cache. This prevents cache pollution when accessing large, one-time use data structures.

Non-cacheable memory bypasses both the level 1 caches and L2 cache. Requests go directly to primary memory. The Cortex-A72 treats Write-Through memory as Non-cacheable.

Instruction fetch (more details)

Instruction fetches are speculative and there is no guarantee that fetched instructions are executed. Instructions are aggressively prefetched pursuing either sequential execution flow or branch targets based on path prediction.

The L1I cache is fed by three fill buffers that hold instructions from either the unified L2 cache or primary memory. The fill buffers are non-blocking. A line may remain in a fill buffer until it is transferred to the L1I cache or discarded. Primary memory regions may be marked as non-cacheable regions or the L1I cache may be disabled, and incoming lines are not written to the L1I cache. A line is not committed to the L1I cache unless it is demanded by a fetch. The hardware also has an L2 instruction prefetcher.

Cortex-A72 treats the preload instruction cache instruction (PLDI) as a NOP.

Memory Management Unit

The Memory Management Unit (MMU) performs virtual to physical address translation and enforces secure, restricted access to memory regions. As to security, suffice it to say that the MMU restricts access by Address Space Identifier (ASID) and Virtual Machine Identifier (VMID). These concerns are addressed by the operating system and are generally transparent to application programmers. [And I won’t be dealing with access control here.]

Address type AArch64 AArch32
Virtual address (VA)48 bits 32 bits
Physical address (PA)44 bits 40 bits

The Cortex-A72 hardware supports 4KB, 64KB and 1MB page sizes. The Raspberry Pi Operating System (formerly known as “Raspbian”) organizes primary memory into 4KByte pages. [Huge pages must be enabled in the kernel and I will assume that it’s 4KB all the way on RPi4.] The operating system maintains page tables that specify the physical location of application program pages (both instructions and data).

Application programs use virtual addresses to identify instructions and data items. Conceivably, hardware could use memory-resident page tables to map a virtual address to its corresponding physical address. This approach is way too slow to be practical. Better, the Cortex-A72 maintains page (address) mapping information in a multi-level, hierarchical memory system:

       Level 1 TLB          Fast, but small 
|
Level 2 TLB
|
RAM Big page tables, but slow

The TLB structure is separate from the register/cache/memory hierarchy and it operates independently. The organizing principle is the same — most recent and frequently used mappings reside in fast memory and big page tables reside in slow primary memory.

“TLB” is the acronym for “translation lookaside buffer.” A TLB is an array where each entry describes the mapping from a virtual page to a physical page. Internal operation of a TLB is similar to a data cache. The following table summarizes Cortex-A72 TLB characteristics:

L1I TLB capacity48 entries
L1I TLB organizationFully associative
L1D TLB capacity32 entries
L1D TLB organizationFully associative
L2 TLB capacity1024 entries
L2 TLB organization4-way set associative

The L1I and L2D TLBs support 4KB, 64KB, and 1MB page sizes. The L2 TLB supports 4KB, 64KB, 1MB and 16MB page sizes. (Also, 2MB and 1GB using AArch32 long descriptor format translation.) Alas, Raspberry Pi OS uses 4KB pages.

TLB operation is similar to caching. Address translation is first tried in the L1I TLB (fetches) or L1D TLB (loads and stores). If the translation information is found (a hit), the physical address is returned in one cycle. Access permission is checked at the same time.

If translation misses in a level 1 TLB, address translation is attempted in the main L2 TLB. If the translation information is found, the physical address is returned (after one or more cycles).

If translation misses in the L2 TLB, the MMU performs a hardware translation table walk. (Page tables have a fairly complicated structure which is beyond the scope of this discussion.) Because page tables reside in slow primary memory, a hardware translation table walk takes a relatively long time to complete with respect to an L2 TLB look-up.

If the required page is not in memory and is on the RPi OS swap device, the operating system reads the page into primary memory before attempting re-translation. These exceptions, page faults, are the slowest of all and they should be avoided like the plague.

Once again, program performance depends up good temporal and spatial locality, albeit locality at the page level. An application program can touch as many as 32 different data pages without triggering an L1D TLB refill:

    32 L1D TLB entries * 4 KBytes/page = 128 KByte data working set

This is a modest-sized working set of pages, and like cache line strategy, a program should make maximal, efficient use of a page working set before demanding new page translation information from the L2 TLB. The L2 TLB supports a larger combined data/instruction working set:

   1024 L2 TLB entries * 4 KBytes/page = 4 MByte total working set 

The L2 TLB footprint (working set size) is larger. However, the L1D TLB and the L1I TLB compete for page translation entries in the unified L2 TLB.

As to data-page utilization, data structure layout and access pattern come into play once again. A program should work as much as possible within the current working set before moving to pages outside the current set. Random access within a big heap pays a penalty when heap items are distributed across many pages (i.e., when the working set exceeds 32 pages).

With respect to L1I TLB utilization strategy, frequently executed, related code should reside within the same page or just a few pages. Related code which is spread across many pages (i.e., a large instruction working set) will jump between pages and possibly cause L1I TLB or L2 TLB refills.

The above description of the translation process is simplified. For example, access is checked against page permissions, etc. and violations are reported after aborting offending translation and instruction. I tried to focus mainly on performance-related concerns of interest to application programmers.

Load and store operations

After absorbing all of that, let’s pick up a few additional odds and ends about load and store operations.

Cortex-A72 memory operations are weakly ordered. They may be performed out-of-order as long as data dependencies are honored. Due to the weak ordering, explicit synchronization barriers are needed in circumstance where strong ordering is required. There are four kinds of barriers:

  • Instruction Synchronization Barrier (ISB)
  • Data Synchronization Barrier (DSB)
  • Data Memory Barrier (DMB)
  • Load-Acquire (LDAR) and Store-Release (STLR)

Please see the Programmer’s Guide for ARMv8-A for further details.

More generally important to application programmers is data alignment. Naturally aligned data is accessed faster than unaligned data, especially unaligned data items that cross cache line boundaries. The following table summarizes alignment requirements:

Source: Programmer’s Guide for ARMv8-A

Load operations should not cross 64-byte, cache line boundaries. Store operations should not cross 16-byte boundaries.

As a general program design principle, computations proceed as fast as data items can stream from primary memory, and secondarily, as fast as results can stream back to primary memory. Data prefetching increases the speed of incoming data stream(s). The programmer or compiler should schedule load operations further ahead of instructions which consume the incoming data item. Ideally, other independent instructions are scheduled and executed ahead of the consuming load thereby overlapping useful computation with load latency (4 cycles from the L1D cache at a minimum).

A program may signal the need for a data item through an explicit prefetch instruction. The A72 supports three instruction prefetch hint instructions: PLD, PLDW, and PRFM. These are only hints and may be ignored. The PLD and PLDW instructions allocate a line in the Level 1 Data cache. Prefetch from Memory (PRFM) hints that data from a specific address will soon be needed. If accepted, these hints can bring in a data item (cache line) before it is required.

Programmers may further manage data cache contents via non-temporal load and store instructions (LDNP and STNP). These instructions hint that caching is not useful for data at an address, thereby preventing unnecessary cache pollution. Non-temporal load and store instructions may require explicit load barriers. (See the Programmer’s Guide for ARMv8-A for more details.)

The Cortex-A72 hardware has a load-side prefetcher which dynamically analyzes memory access patterns. Based on its analysis, the load-side prefetcher brings data into either the L1D cache, the L2 cache, or both. The hardware also has a store-side prefetcher which brings data into the L2 cache.

Outgoing data streams benefit from write combining which merges data from multiple store operations into a single memory write access.

Outstanding read and write requests (i.e., pending requests to primary memory) wait in the Fill/Eviction Queue (FEQ). The Cortex-A72 has a configurable FEQ: 20, 24, or 28 entries. The A72 write issuing capability is 16, that is, up to 16 writes may be outstanding at any time. The read issuing capability is 19, 23, or 27 depending upon FEQ configuration (capacity). L2 prefetch is throttled based on the FEQ occupancy count. [Extra credit: What is the specific Raspberry Pi 4 FEQ configuration and occupancy threshold?]

One important simplification in the cache discussion is cache coherency. Most application programmers needn’t worry about cache coherency. However, if you are writing a program with multiple, co-operating threads that actively share memory locations or regions, you should MOESI over to the ARM Cortex-A72 Technical Reference Manual (TRM) and read up on the details. The A72 Snoop Control Unit (SCU) uses a hybrid protocol (MESI+MOESI) to maintain coherency between the per-core L1 data caches (MESI) and the common L2 cache (MOESI). “MESI” and “MOESI” refer to the coherency status of each cache line:

  • Modified (M)
  • Owned (O)
  • Exclusive (E)
  • Shared (S)
  • Invalid (I)

The BCM2711 employs an Advanced Microcontroller Bus Architecture (AMBA) Advanced xExtensible Interface (AXI) bus interface. Broadcom does not specify if either AXI Coherency Extensions (ACE) or the Coherency Hub Interface (CHI) are supported. (Man, these acronyms stack up!) Since ACE and CHI are intended for SMP processor clusters (e.g., big.LITTLE) these features may have been left out or disabled.

If you do care about cache coherency, please be aware that load data may be sourced from a remote L1D cache as well as the shared L2 cache or primary memory.

Sources

Before closing, I want to offer a few words about my sources. My primary resources are:

  • ARM Cortex-A72 Technical Reference Manual (TRM)
  • ARM Cortex-A72 Software Optimization Guide
  • ARM Programmer’s Guide for ARMv8-A
  • BCM2711 ARM Peripherals

These resources are authoritative. I also relied upon ARM’s own briefings and presentations about Cortex-A72 as found on the Web. I tried to verify Web sources and briefings against the written TRM and programmer guides.

In closing

Hopefully, my write-ups will help developers tune their programs for ARM Cortex-A72. If you’re just getting started with performance tuning, I would first concentrate on cache- and page-friendly algorithms, data structures and access patterns. Fill buffers, FEQ, memory ordering, and memory transaction types are esoteric subjects for most application programmers.

Want to learn more about Raspberry Pi 4 (Cortex-A72 / Broadcom BCM2711) performance tuning? Please read:

If you’re interested in early model Raspberry Pi, I wrote several posts about micro-architecture, performance measurement and performance events:

There you will find general principles and techniques that apply to Raspberry Pi 4 although some details (e.g., cache and TLB capacity) differ.

Copyright © 2021 Paul J. Drongowski

ARM Cortex-A72 fetch and branch processing

Let’s take a closer look at instruction fetch, decode and dispatch in the Cortex-A72 micro-architecture. These are the “front-end” stages of the core pipeline. The “back-end” of the pipeline consists of the register file(s), execution units and retirement (reorder) buffer. Branch prediction is frequently associated with the front-end since it directly affects instruction fetch.

[Update: This post is part 1 of a two part series. Part 2 discusses ARM Cortex-A72 execution and load/store operations.]

The front-end has one major job: Fetch ARMv8 instructions and keep the back-end execution units as busy as possible.

This page is required reading for Raspberry Pi 4 (BCM2711, ARM Cortex-A72) programmers who want to tune their programs for the ARM Cortex-A72. It is also necessary background information for programmers doing performance measurement with PERF (Performance Events for Linux) on Raspberry Pi 4.

Fetch, decode and dispatch

The Cortex-A72 pipeline has 15 stages. [On-line sources disagree on the length; some sources claim 14 stages.] The front-end stages are:

  • Fetch (5 stages)
  • Decode (3 stages)
  • Rename (1 stage)
  • Dispatch (2 stages)

The back-end pipeline stages are:

  • Execute (1 to 6 stages depending upon unit)
  • Write-back/retirement (2 stages)

Each execution unit has its own pipe:

  • Integer 0 (1 stage/cycle)
  • Integer 1 (1 stage/cycle)
  • Integer multi-cycle (4 to 12 cycles)
  • FP/ASIMD 0 (6 to 18 cycles)
  • FP/ASIMD 1 (6 to 32 cycles)
  • Load L1D cache hit: 4 cycles)
  • Store (1 cycle)
  • Branch (1 cycle)

See the ARM Cortex-A72 software optimization guide for exact instruction execution latencies.

ARM Cortex-A72 pipeline (Cortex-A72 Software Optimization Guide)

70 plus years after von Neumann, we still adhere to a linear, sequential program model. In the programmer’s view, there is a program counter (PC) which steps through instructions sequentially, even if some of the intended high-level computations could be performed in parallel by the execution units. Further, we force compilers to lay down instructions in the same linear, sequential fashion.

The front-end’s job is to find instruction-level parallelism (ILP) within the incoming instruction stream. The Cortex-A72 front-end translates ARMv8 instructions into micro-ops. It sends the micro-ops to the back-end functional units for execution and architecture-level retirement.

This all may seem inefficient and crazy and it is. Program representation and execution need a major re-think along the lines of the once-investigated dataflow architecture. Why do and then un-do?

Front-end translation is really two steps: translating ARMv8 instruction into macro-ops and then translating macro-ops into micro-ops. Let’s examine the details.

First, fetch acquires a 16-byte (128 bit quadword) fetch window and it recognizes ARMv8 instructions within the window (and across windows). Then, the decode stages turn the ARMv8 instructions into one or more macro-ops. The decoder will fuse multiple ARMv8 instructions into a single macro-op if such an optimization is possible. Architectural registers are renamed to an internal register file which temporarily holds intermediate results. Next, the macro-ops are translated into micro-ops and the micro-ops are dispatched to the issue queue belonging to the appropriate functional unit. Micro-ops are dispatched into 8 independent issue queues (one queue per execution pipeline). When an instruction successfully completes, it is retired and its result is committed to the architectural machine state. [I will discuss the meaning of “successfully completes” in a minute.]

There are limits on the number of ARMv8 instructions, macro-ops and micro-ops which are processed during a machine cycle:

    3 ARMv8 instructions --> 3 macro-ops --> 5 micro-ops 
Decode Rename Dispatch

During each cycle, the Cortex-A72 can decode 3 ARMv8 instructions, produce up to 3 macro-ops and dispatch up to 5 micro-ops. There are additional limitations on the number of micro-ops of each type that can be simultaneously dispatched (quoting the Cortex-A72 software optimization guide):

  • One micro-op using the Branch pipeline
  • Up to two micro-ops using the Integer pipelines
  • Up to two micro-ops using the Multi-cycle pipeline
  • One micro-op using the F0 pipeline
  • One micro-op using the F1 pipeline
  • Up to two micro-ops using the Load or Store pipeline

If there are more micro-ops to be dispatched above these limitations, they are dispatched in oldest-to-youngest age order.

According to ARM, most ARMv8 instructions are converted to a single micro-op (average: 1.08 micro-ops per instruction).

Register renaming allows micro-ops to execute out-of-order. Remember, we forced the compiler to lay down ARMv8 instructions in-order. Register renaming allows the micro-ops to execute out-of-order without violating data dependencies between architectural registers. There are 128 physical rename registers.

Speculative execution

Now the really tricky stuff — speculative execution. A basic block is a sequence of “straight-line” code with a single branch instruction at the end of the block. Program control flows into a basic block and is redirected at the end to either the same basic block or perhaps a different basic block.

Generally, basic blocks are about 9 instructions long. The branch at the end may be a conditional branch. The combination of short block length and conditional branching limits the number of ARMv8 instructions which can be aggressively fetched, decoded and dispatched within a single basic block. Without speculative execution, the processor must wait every time a conditional branch needs to be decided.

Enter branch prediction. When Cortex-A72 hits a conditional branch, it predicts the direction: taken or not taken. [More about branch prediction in a minute.] The front-end continues to fetch, decode and dispatch along the predicted control flow path. If the prediction is correct, hurray! The predictor has guessed correctly and the execution units have already been computing useful results. As each ARMv8 instruction completes along a known-to-be correct path, the instructions along the path retire in architectural order. This is the meaning of “successfully completes” above.

If a conditional branch is not predicted correctly (a “mispredict”), the intermediate results along the wrong path are discarded and the fetch stage is told to start fetching from the correct target address. This is called a “re-steer” as in the phrase “re-steering the front-end.” Recovery from a branch mispredict requires a pipeline flush and, yes, it’s expensive — at least 15 cycles.

The Write-back/retirement stage keeps track of the “retire pointer,” that is, the architectural program counter. The retirement stage is some of the most difficult hardware logic to design and it must be correct. The retirement stage maintains a reorder buffer which keeps the books on instruction and micro-op status. The reorder buffer has 128 entries, allowing up to 128 ARMv8 instructions to be simultaneously in flight.

Branch prediction

High performance rides on branch prediction accuracy. If the predictor guesses correctly most of the time, then speculative execution is a win. If the predictor guesses incorrectly, the core must throw away useless results and the execution pipeline stalls.

A deep dive into branch prediction is beyond the scope of this note. So, here’s a sketch. The predictor maintains a Pattern History Table (PHT) which retains the taken or not-taken status of (recent) branches encountered by the core. The predictor also maintains a Branch Target Buffer (BTB) containing the target addresses of these branches. The predictor uses the pattern history to predict the direction of a branch when it encounters the branch again. The BTB supplies the associated target address.

Cortex-A72 allows split BTB entries, accomodating both near branches (small target address) and far branches (large target addresses). That’s why you will see the BTB capacity quoted as 2K to 4K entries. The A72 BTB can hold as a many as 2K large target address (far) and 4K small target addresses (near). The A72 also has a micro-BTB which acts as a cache memory for the main BTB. The micro-BTB has 64 entries.

The branch prediction window is 16 bytes, which has implications for basic block layout. According to the Cortex-A72 software optimization guide, branch targets should be quadword aligned (i.e., 16 byte boundaries) and not more than two taken branches should be included with the same quadword-aligned quadword of instruction memory.

The compiler should really take care of quadword alignment and packing for you. As a higher-level language programmer, you can improve branch prediction by biasing flow conditions toward a true program path or the respective false program path. If a given path (true if-clause or false if-clause) is more frequently executed, the branch predictor should do a better job predicting the underlying conditional branch generated by the compiler.

Cortex-A72 has predictors for subroutine return and indirect branch. The predictor maintains an 8 [?] entry Call/Return Stack which remembers the most recent return addresses. Measurement shows that 8 entries (or so) are enough for most high-level language workloads, covering the most recently called functions.

Bi-mode prediction

There are three sources of interference in a Pattern History Table (PHT):
* Cold miss (compulsory alias)
* PHT capacity miss (capacity alias)
* Conflict miss (conflict alias)
Conflict misses can be reduced by partitioning the PHT or using a different
indexing scheme.

The ARM Cortex-A15 Bi-Mode predictor uses two pattern history tables and a choice predictor to reduce negative interference of branches in different
program modes. Instead of one big PHT, the pattern history is partitioned
into two halves, i.e., two smaller PHTs. A choice predictor table selects
one of the two PHTs. The chosen side delivers the final prediction.

Cortex-A72 does not employ a bi-mode predictor. In case you’re wondering, Cortex-A72 does not have a micro-op loop buffer either.

Want to learn more about Raspberry Pi 4 (Cortex-A72 / Broadcom BCM2711) performance tuning? Please read:

If you have an early model Raspberry Pi, read about the ARM11 micro-architecture here. Please don’t forget my PERF (Performance Events for Linux) tutorial.

Copyright © 2020 Paul J. Drongowski

Raspberry Pi 4 ARM Cortex-A72 processor

Raspberry Pi 4 (RPi4) is a big step beyond the earlier models 1, 2 and 3. Both desktop interaction and browsing are snappier and don’t have that laggy feel. I haven’t even thought (yet) about the RPi4’s music making and synthesis potential!

The Raspbeery Pi 4 is powered by a new processor from Broadcom: the BCM2711. The BCM2711 is an improvement over the BCM2835/2836 used in earlier models. Like the BCM2836, main memory is external. I’m running an RPi with 4GB of RAM (LPDDR4-3200 SDRAM, 3200Mb/s, dual channel). The old RPi2 has only 1GB of RAM. The BCM2711 supports Gigabit Ethernet (1000 BaseT) while the old RPi2 is just 100Megabit Ethernet. Faster Internet speed makes updates and browsing so much faster.

The RPi4 is a quad-core ARM Cortex-A72 processor clocking at 1.5GHz. The old RPi2 is a 900MHz quad-core ARM Cortex-A7 processor. The old BCM2835 is a member of the ARM11 family (ARM1176JZF-S, to be exact). The ARM Cortex-A72 within the BCM2711 has a much improved CPU core and memory subsystem.

The old ARM1176 is a relatively simple beast. It is a single issue machine, that is, it issues a single instruction per cycle. The ARM1176 core has eight pipeline stages and three execution pipes: 1. ALU, shift, saturation, 2. Multiply-accumulate, and 3. Load/store.

The Cortex-A72, on the other hand, performs 3-way instruction decoding and can issue as many as five operations per cycle. It is an out-of-order superscalar machine allowing speculative issue. That is waaay more sophisticated than the ARM1176, putting the Cortex-A72 on the same level as x86 superscalar machines. In fact, it translates ARM instructions into micro-ops like most modern x86 superscalar processors. It even performs micro-op fusion in some cases. The Cortex-A72 performs register renaming, letting micro-ops (instructions) execute when program data are ready (out-of-order execution, in-order retirement).

The Cortex-A72 issues micro-ops to eight execution pipelines:

  • Branch: Branch micro-ops
  • Integer 0: Integer ALU micro-ops
  • Integer 1: Integer ALU micro-ops
  • Integer Multi-Cycle: Integer shift-ALU, multiply, divide, CRC and sum-of-absolute differences micro-ops
  • FP/ASIMD 0: ASIMD ALU, ASIMD misc, ASIMD integer multiply, FP convert, FP misc, FP add, FP multiply, FP divide and crypto micro-ops
  • FP/ASIMD 1: ASIMD ALU, ASIMD misc, FP misc, FP add, FP multiply, FP square root and ASIMD shift micro-ops
  • Load: Load and register transfer micro-ops
  • Store: Store and special memory micro-ops

Up to 5-way issue and a larger number of independent execution pipelines permit more fine-grained parallelism than ARM1176. Of course, the compiler must know how to exploit all of this parallelism, but the potential is there. The ARM Cortex-A72 Software Optimization Guide specifies the number of execution cycles and pipeline units for each kind of ARM instruction. This information is incorporated into a compiler and guides the choice and scheduling of machine instructions.

ARM Cortex-A72 block diagram

The Cortex-A72 allows speculative execution. Without speculation, a CPU must wait at each conditional program branch until the direction is decided and instruction fetch can proceed along the chosen branch. The Core-A72 processor predicts branch direction (speculates) and aggressively issues instructions along predicted branches. The Cortex-A72 branch predictor is also improved over ARM1176. (I’m still digging into details.) If a branch is mispredicted, speculative results are discarded. So, it’s important to have a good branch predictor.

The Cortex-A72 can perform a load operation and a store operation every cycle because it has separate load and store pipelines. The ARMv8-A instruction set architecture (ISA) allows arbitrary data alignment and access. However, the Cortex-A72 hardware penalizes load operations that cross a cache-line (64-byte) boundary and store operations that cross a 16-byte boundary. Programmers (and compilers) should keep that in mind when laying down data structures in memory.

Like all modern high-performance computers, the Cortex-A72 organizes physical memory into a hierarchy with the fastest/smallest memory (registers) near the arithmetic/logic unit (ALU) and the slowest/largest memory (RAM) far away and off-chip. The registers and RAM are connected to intervening levels of memory — the caches:

          Register          Fast, but small 
|
Level 1 caches
|
Level 2 cache
|
RAM Big, but slow

Data and instructions are read (and written) in efficient chunks making data and instructions available when needed by the registers and ALU. The chunks are called “cache lines.” Thanks to cache memory, programs run faster when they (re)use data that are close together in memory (i.e., occupy the same cache line) and are the most recently accessed. These notions are called “spatial locality” and “temporal locality.”

The following table is a quick summary of the level 1 and level 2 cache structures of the ARM1176 and Cortex-A72.

Feature ARM1176 Cortex-A72
L1 I-cache capacity 16KB 48KB
L1 I-cache organization 4-way set associative, 32B line 3-way set associative, 64B line
L1 D-cache capacity 16KB 32KB
L1 D-cache organization 4-way set associative, 32B line 2-way set associative, 64B line
L2 cache capacity 128KB 1MB
L2 cache organization Shared, 8-way set associative, 64B line Shared, 16-way set associative, 64B line

Each core has an Instruction Cache (I-Cache) and Data Cache (D-Cache). The four cores share the Level 2 (L2) cache.

As you can see, the RPi4 (BCM2711) has larger caches and a bigger cache line size (64 bytes) than ARM11. RPi4 programs are more likely to find instructions and data in cache than earlier RPi models.

Contemporary processors have one or more memory management units (MMU) that break physical RAM into logical pages. This scheme is called “virtual memory.” The MMU translate logical program addresses (from loads, stores and instruction fetches) into physical RAM addresses. Address translation has its own memory hierarchy:

   Translation registers       Fast, but only a single mapping 
|
Level 1 TLBs
|
Level 2 TLB
|
RAM Big page tables, but slow

Page tables in RAM are maps that describe the layout of pages in the operating system and application programs. Translation lookaside buffers (TLB) are cache-like hardware structures that hold the most recently used (MRU) address translation information, i.e., where a logical page is located in physical memory. TLBs greatly speed up the translation process by keeping MRU page table information on-chip within the CPU.

Cortex-A72 has larger translation lookaside buffers (TLB) than ARM1176, as summarized in the table below. With larger TLBs, a program can touch more locations in memory without triggering a performance robbing page fault — an event which brings page translation information into the CPU from relatively slow RAM.

Feature ARM1176 Cortex-A72
D-MicroTLB capacity 10 entries 32 entries
D-MicroTLB organization Fully assoc, 1 lookup/cycle Fully assoc, 1 lookup/cycle
I-MicroTLB capacity 10 entries 48 entries
I-MicroTLB organization Fully assoc, 1 lookup/cycle Fully assoc, 1 lookup/cycle
L2 TLB capacity 256 entries 1024 entries
L2 TLB organization Unified, 2-way set assoc Unified, 4-way set assoc

Each core has a Data Micro-TLB (D-MicroTLB), Instruction Micro-TLB (I-MicroTLB), and Level 2 (L2) TLB. (In ARM1176 terminology, the L2 TLB is called the “Main TLB”).

In summary, the RPi4’s BCM2711 processor is a powerhouse even though it won’t knock that gaming machine off your desktop. 🙂 If you’ve been waiting to dive into Raspberry Pi or to upgrade, please don’t hesitate any longer.

I’m getting the itch to play with RPi4’s hardware performance counters and post results. In the meantime, check out my summary of the ARM11 micro-architecture. If you would like to know more about performance measurement and events in ARM1176-based Raspberry Pi’s, please see my Performance Events for Linux (PERF) tutorial.

Also, I have uploaded all of my teaching notes about computer design, VLSI systems and computer architecture:

These resources should help students and teachers alike!

Copyright © 2020 Paul J. Drongowski