Raspberry Pi 4 ARM Cortex-A72 processor

Raspberry Pi 4 (RPi4) is a big step beyond the earlier models 1, 2 and 3. Both desktop interaction and browsing are snappier and don’t have that laggy feel. I haven’t even thought (yet) about the RPi4’s music making and synthesis potential!

The Raspbeery Pi 4 is powered by a new processor from Broadcom: the BCM2711. The BCM2711 is an improvement over the BCM2835/2836 used in earlier models. Like the BCM2836, main memory is external. I’m running an RPi with 4GB of RAM (LPDDR4-3200 SDRAM, 3200Mb/s, dual channel). The old RPi2 has only 1GB of RAM. The BCM2711 supports Gigabit Ethernet (1000 BaseT) while the old RPi2 is just 100Megabit Ethernet. Faster Internet speed makes updates and browsing so much faster.

The RPi4 is a quad-core ARM Cortex-A72 processor clocking at 1.5GHz. The old RPi2 is a 900MHz quad-core ARM Cortex-A7 processor. The old BCM2835 is a member of the ARM11 family (ARM1176JZF-S, to be exact). The ARM Cortex-A72 within the BCM2711 has a much improved CPU core and memory subsystem.

The old ARM1176 is a relatively simple beast. It is a single issue machine, that is, it issues a single instruction per cycle. The ARM1176 core has eight pipeline stages and three execution pipes: 1. ALU, shift, saturation, 2. Multiply-accumulate, and 3. Load/store.

The Cortex-A72, on the other hand, performs 3-way instruction decoding and can issue as many as five operations per cycle. It is an out-of-order superscalar machine allowing speculative issue. That is waaay more sophisticated than the ARM1176, putting the Cortex-A72 on the same level as x86 superscalar machines. In fact, it translates ARM instructions into micro-ops like most modern x86 superscalar processors. It even performs micro-op fusion in some cases. The Cortex-A72 performs register renaming, letting micro-ops (instructions) execute when program data are ready (out-of-order execution, in-order retirement).

The Cortex-A72 issues micro-ops to eight execution pipelines:

  • Branch: Branch micro-ops
  • Integer 0: Integer ALU micro-ops
  • Integer 1: Integer ALU micro-ops
  • Integer Multi-Cycle: Integer shift-ALU, multiply, divide, CRC and sum-of-absolute differences micro-ops
  • FP/ASIMD 0: ASIMD ALU, ASIMD misc, ASIMD integer multiply, FP convert, FP misc, FP add, FP multiply, FP divide and crypto micro-ops
  • FP/ASIMD 1: ASIMD ALU, ASIMD misc, FP misc, FP add, FP multiply, FP square root and ASIMD shift micro-ops
  • Load: Load and register transfer micro-ops
  • Store: Store and special memory micro-ops

Up to 5-way issue and a larger number of independent execution pipelines permit more fine-grained parallelism than ARM1176. Of course, the compiler must know how to exploit all of this parallelism, but the potential is there. The ARM Cortex-A72 Software Optimization Guide specifies the number of execution cycles and pipeline units for each kind of ARM instruction. This information is incorporated into a compiler and guides the choice and scheduling of machine instructions.

ARM Cortex-A72 block diagram

The Cortex-A72 allows speculative execution. Without speculation, a CPU must wait at each conditional program branch until the direction is decided and instruction fetch can proceed along the chosen branch. The Core-A72 processor predicts branch direction (speculates) and aggressively issues instructions along predicted branches. The Cortex-A72 branch predictor is also improved over ARM1176. (I’m still digging into details.) If a branch is mispredicted, speculative results are discarded. So, it’s important to have a good branch predictor.

The Cortex-A72 can perform a load operation and a store operation every cycle because it has separate load and store pipelines. The ARMv8-A instruction set architecture (ISA) allows arbitrary data alignment and access. However, the Cortex-A72 hardware penalizes load operations that cross a cache-line (64-byte) boundary and store operations that cross a 16-byte boundary. Programmers (and compilers) should keep that in mind when laying down data structures in memory.

Like all modern high-performance computers, the Cortex-A72 organizes physical memory into a hierarchy with the fastest/smallest memory (registers) near the arithmetic/logic unit (ALU) and the slowest/largest memory (RAM) far away and off-chip. The registers and RAM are connected to intervening levels of memory — the caches:

          Register          Fast, but small 
|
Level 1 caches
|
Level 2 cache
|
RAM Big, but slow

Data and instructions are read (and written) in efficient chunks making data and instructions available when needed by the registers and ALU. The chunks are called “cache lines.” Thanks to cache memory, programs run faster when they (re)use data that are close together in memory (i.e., occupy the same cache line) and are the most recently accessed. These notions are called “spatial locality” and “temporal locality.”

The following table is a quick summary of the level 1 and level 2 cache structures of the ARM1176 and Cortex-A72.

Feature ARM1176 Cortex-A72
L1 I-cache capacity 16KB 48KB
L1 I-cache organization 4-way set associative, 32B line 3-way set associative, 64B line
L1 D-cache capacity 16KB 32KB
L1 D-cache organization 4-way set associative, 32B line 2-way set associative, 64B line
L2 cache capacity 128KB 1MB
L2 cache organization Shared, 8-way set associative, 64B line Shared, 16-way set associative, 64B line

Each core has an Instruction Cache (I-Cache) and Data Cache (D-Cache). The four cores share the Level 2 (L2) cache.

As you can see, the RPi4 (BCM2711) has larger caches and a bigger cache line size (64 bytes) than ARM11. RPi4 programs are more likely to find instructions and data in cache than earlier RPi models.

Contemporary processors have one or more memory management units (MMU) that break physical RAM into logical pages. This scheme is called “virtual memory.” The MMU translate logical program addresses (from loads, stores and instruction fetches) into physical RAM addresses. Address translation has its own memory hierarchy:

   Translation registers       Fast, but only a single mapping 
|
Level 1 TLBs
|
Level 2 TLB
|
RAM Big page tables, but slow

Page tables in RAM are maps that describe the layout of pages in the operating system and application programs. Translation lookaside buffers (TLB) are cache-like hardware structures that hold the most recently used (MRU) address translation information, i.e., where a logical page is located in physical memory. TLBs greatly speed up the translation process by keeping MRU page table information on-chip within the CPU.

Cortex-A72 has larger translation lookaside buffers (TLB) than ARM1176, as summarized in the table below. With larger TLBs, a program can touch more locations in memory without triggering a performance robbing page fault — an event which brings page translation information into the CPU from relatively slow RAM.

Feature ARM1176 Cortex-A72
D-MicroTLB capacity 10 entries 32 entries
D-MicroTLB organization Fully assoc, 1 lookup/cycle Fully assoc, 1 lookup/cycle
I-MicroTLB capacity 10 entries 48 entries
I-MicroTLB organization Fully assoc, 1 lookup/cycle Fully assoc, 1 lookup/cycle
L2 TLB capacity 256 entries 1024 entries
L2 TLB organization Unified, 2-way set assoc Unified, 4-way set assoc

Each core has a Data Micro-TLB (D-MicroTLB), Instruction Micro-TLB (I-MicroTLB), and Level 2 (L2) TLB. (In ARM1176 terminology, the L2 TLB is called the “Main TLB”).

In summary, the RPi4’s BCM2711 processor is a powerhouse even though it won’t knock that gaming machine off your desktop. 🙂 If you’ve been waiting to dive into Raspberry Pi or to upgrade, please don’t hesitate any longer.

I’m getting the itch to play with RPi4’s hardware performance counters and post results. In the meantime, check out my summary of the ARM11 micro-architecture. If you would like to know more about performance measurement and events in ARM1176-based Raspberry Pi’s, please see my Performance Events for Linux (PERF) tutorial.

Also, I have uploaded all of my teaching notes about computer design, VLSI systems and computer architecture:

These resources should help students and teachers alike!

Copyright © 2020 Paul J. Drongowski

Performance counter kernel module

As promised, I’ve described the design of a Linux loadable kernel module that allows user-space access to the Raspberry Pi (ARM 1176) performance counters. By the way, the design of the module is not specific to Raspbian Wheezy or even the Raspberry Pi for that matter. I believe that the kernel module could be used on the new Beagleboard Black (BBB) to enable user-space counter access on its ARM Cortex-A8 processor under Linux. I just ordered a BBB and will try out the code when possible. (Assuming quick delivery!)

The kernel module alone isn’t enough to measure performance events. In fact, the kernel module doesn’t even touch the counters. It merely flips a privileged hardware bit which lets user-space programs read and write the performance counters and control register. So, I have also written a few user-space C functions to configure, clear, start and stop the performance counters. An application program just needs to call a few functions to choose the events to be measured and start counting, to stop counting, to get the raw counts, and to print the event counts.

I have uploaded the source for both the kernel module (aprof.c) and the user-space functions (rpi_pmu.h and rpi_pmu.c). In addition, there is source for some utility functions that I like to use in benchmark programs (test_common.h and test_common.c). All of this is a work in progress and I will update the source when major enhancements or changes are made.

Speaking of source, I have found a way of organizing and storing source code through WordPress. WordPress is kind of security paranoid and doesn’t allow you to upload source code or even gzip’ed TAR files. I ran into this issue when I attempted to upload a make file and WordPress wouldn’t let me do it (with complaints about potentially malicious code and so forth). WordPress does let you post source for viewing, however.

So, I’ve added a Source menu item to the main menu. I want the menu structure below the Source item to operate like a browsable code repository. The first level of items below Source are projects, like the kernel module. The next level of menu items navigate into the source belonging to a project. Each make file and source file is a separate page. The source code is displayed using the SyntaxHighligher plug-in in order to keep indentation. No other formatting or highlighting is done just to keep things simple. I could cut and paste code from these pages, so I hope you can, too!

An introduction to performance tuning (and counters)

My latest page is an overview of performance tuning on ARM11. The Raspberry Pi is a nifty little Linux box, but it’s kind of slow at 700MHz. Therefore, I suspect that programmers will have an interest in tuning up application programs and making them run faster. Performance tuning is also a good opportunity to learn more about computer architecture and machine organization, especially the ARM1176 core at the heart of the Raspberry Pi and its memory subsystem.

The ARM1176 has three performance counters which can measure over 20 different microarchitectural events. One of these counters is dedicated to core clock cycles while the other two are configurable. The new performance tuning page has a brief overview of the counters and it has a table with the supported events.

The new page also describes two different use cases for the counters: caliper mode and sampling mode. Caliper mode counts the number of microarchitectural hardware events that occur between two different points in program execution. Caliper mode is good for measuring the number of data cache accesses and misses for a hot code region like a loop. The programmer inserts code to start counting at the beginning of the hot region and inserts code to stop counting at the end of the hot region. This is the easiest use case to visualize and to implement. It’s the approach that I’m taking with my first performance measurement software and experiments (a custom kernel module plus some user-space code). These experiments are almost finished and ready for write up.

Sampling is a statistical technique that produces an event profile. A profile shows the distribution of events across program instructions, routines, source lines, or modules. This is a good way to find hot-spots in a program where tuning is most beneficial. Sampling does not require modification to source.

Performance Events for Linux (informally called “PERF”) is the standard tool for program profiling on Linux. At the moment, PERF has a bug which prevents it from sampling hardware events. I’ve been looking into this problem, too, and hope to post some results. In the long-run, I want to post examples using PERF in order to help people tune up their programs on Raspberry Pi.

ARM11 microarchitecture

You probably know by now that the Raspberry Pi uses an ARM processor. In particular, the Raspberry Pi model B uses the Broadcom BCM2835 system on a chip (SoC). The BCM2835 is a member of the ARM11 family. Its name is the ARM1176JZF-S. (Whew!)

Like all computers, the BCM2835 has an internal processor structure called its “microarchitecture”. The word “architecture” refers to the machine features that are visible to a programmer — things like the instruction set. The microarchitecture refers to the building blocks in the guts of the machine, or more properly, in the guts of a specific implementation (BCM2835) of an architectural family (ARM11 or ARMv6).

The microarchitecture can have a big effect on program performance. Compiler writers, for example, study the microarchitecture in order to build compilers that generate the best possible code for the microarchitecture. As we’ll see in later posts, application programmers can also take steps to tune their programs for the underlying microarchitecture. Tuning is important on Raspberry Pi because at 700 MHz, this machine is running its heart out!

Today, I added a page that summarizes the characteristics of the BCM2835 (ARM11) microarchitecture. Please check out the info! We will revisit this page when I discuss profiling and tuning.