ARM11 microarchitecture

This page is a very brief summary of the Broadcom BCM2835 microarchitecture. Processor microarchitecture has many important effects on program performance. A well-tuned program should make good use of the underlying microarchitecture and should avoid known performance pitfalls.

The Raspberry Pi uses the Broadcom BCM2835 system on a chip (SoC). The Raspberry Pi Model B has 512MB of primary memory (RAM), and the CPU is clocked at 700MHz. The BCM2835 incorporates a specific implementation of an ARM11 processor.

The CPU core is the ARM1176JZF-S, a member of the ARM11 family (ARMv6 architecture with floating point). The GPU is a Broadcom VideoCore IV.

Overview

The source for the information in the following sections is the ARM1176JZF-S Technical Reference Manual (TRM). Please see this manual for additional details (especially if you’re a compiler guy).

Generally, the features of the ARM1176JZF-S core are:

  • Eight-stage pipeline
  • Internal coprocessors CP14 and CP15
  • Three instruction sets:
    • 32-bit ARM instruction set (ARM state)
    • 16-bit Thumb instruction set (Thumb state)
    • 8-bit Java bytecodes (Jazelle state)
  • Datapath consisting of three pipelines:
    • ALU, Shift, Sat pipeline (Sat implements saturation logic)
    • MAC pipeline (MAC executes multiply and multiply-accumulate operations)
    • Load or store pipeline

The ARM1176 is a single-issue machine, that is, one instruction is issued at a time. This constraint simplifies the issue logic; the ARM1176 doesn't implement complicated issue rules to determine whether two or more instructions of certain types can issue in the same cycle. Instructions are issued in order, but out-of-order completion is allowed. Out-of-order completion increases exploitable fine-grained parallelism.
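As a concrete illustration (a hypothetical sketch, not from the TRM): out-of-order completion pays off when nearby instructions are independent. Splitting a reduction into two independent accumulator chains gives the single-issue pipeline that independence:

```c
#include <stddef.h>

/* Sketch: two independent accumulator chains. Each add into s1 does not
 * depend on the adds into s0, so a long-latency operation in one chain
 * can complete out of order while the other chain keeps issuing. */
long sum_two_chains(const int *a, size_t n)
{
    long s0 = 0, s1 = 0;
    size_t i;
    for (i = 0; i + 1 < n; i += 2) {
        s0 += a[i];     /* chain 0 */
        s1 += a[i + 1]; /* chain 1, independent of chain 0 */
    }
    if (i < n)
        s0 += a[i];     /* odd leftover element */
    return s0 + s1;
}
```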

Nearly all ARM instructions can be predicated: an enabling condition (a predicate) can be attached to an instruction, and that condition gates the instruction's execution. Predication can be used to eliminate branches.
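For example (a sketch; the function name is hypothetical), a data-dependent select written as a conditional expression gives a compiler targeting ARMv6 an easy opportunity to emit predicated instructions instead of a branch:

```c
/* Sketch: branchless select. A compiler for ARMv6 can lower this to a
 * compare plus a conditionally executed move (e.g. CMP; MOVLT), so no
 * branch is executed and no branch prediction is involved. */
int max_s32(int a, int b)
{
    return (a >= b) ? a : b;
}
```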

Prefetch unit / branch prediction

The prefetch unit fetches instructions and performs branch prediction. Prefetching keeps the execution pipelines busy by following the (predicted) program control flow and fetching instructions. Prefetching is a definite performance win in straight-line code (no branches) because instructions are read sequentially from memory, issued, and executed.

Branches disrupt straight line control flow and sequential access. The prefetch unit tries to follow the control flow through the branch instructions in order to maintain a steady flow of instructions into the execution pipelines. Conditional branches are particularly troublesome because the branch direction (taken or not taken) isn’t known with certainty until the condition is evaluated. The prefetch unit tries to predict the flow out of conditional branches using branch prediction.

When prediction is accurate, the execution pipelines stay busy. Trouble arises, however, when a branch is mispredicted. Any instructions along the wrong, mispredicted path must be discarded (flushed) from the pipeline. Then fetching must restart along the correct control path, a process often called "redirection." Since the pipeline is deep (eight stages), the branch penalty is high when the control flow is mispredicted: a lot of unnecessary work must be discarded. Thus, computer architects work hard to design and implement accurate branch prediction. Your program can help the branch predictor by strongly (consistently) favoring one of the two control paths leading out of conditional branches (if-then-else decisions).
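One way to strongly favor one path in C (a sketch using the GCC/Clang-specific `__builtin_expect` hint; the function name is hypothetical) is to tell the compiler which outcome is common, so it lays out the frequent case as the straight-line fall-through path:

```c
#include <stddef.h>

/* Sketch: marking the rare case so the compiler places the common case
 * on the fall-through path. The hint changes code layout, not the
 * hardware predictor; the predictor then sees a branch that resolves
 * the same way almost every time. */
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

long count_nonnegative(const int *a, size_t n)
{
    long count = 0;
    for (size_t i = 0; i < n; i++) {
        if (UNLIKELY(a[i] < 0))
            continue;   /* rare path */
        count++;        /* common, fall-through path */
    }
    return count;
}
```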

Two levels of prediction (dynamic and static branch prediction) are performed. Dynamic branch prediction is the first level and makes use of branch history. The dynamic predictor consists of a 128-entry Branch Target Address Cache (BTAC), a virtually addressed, direct-mapped cache. Each BTAC entry holds a virtual target address and a 2-bit branch history. If the PC of a branch matches a BTAC entry, the branch history and target address are used to redirect the instruction fetch stream.

Dynamic prediction can also remove some dynamically predicted branches from the instruction stream entirely, a technique called "branch folding."

Static prediction is the second level of prediction and is used when a dynamic prediction cannot be made (i.e., no history is available). It predicts based solely on branch direction. Branches are predicted in the following way:

  • All forward conditional branches are predicted not taken
  • All backward conditional branches are predicted taken

Statically predicted taken branches have a one cycle penalty.
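The heuristic matches typical loop structure. In a sketch like the following (hypothetical code), the loop-closing conditional branch is backward and is taken on every iteration except the last, so static prediction gets it right n-1 times out of n:

```c
/* Sketch: a counted loop whose compiled form ends with a backward
 * conditional branch. The "backward taken" rule predicts this branch
 * correctly on every iteration except the final exit. */
int sum_1_to_n(int n)   /* assumes n >= 1 */
{
    int s = 0, i = 0;
    do {
        s += ++i;       /* loop body */
    } while (i < n);    /* backward branch: statically predicted taken */
    return s;
}
```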

Branch misprediction causes a pipeline flush. This is expensive since all intermediate results are thrown out and fetching must start from a new target address.

Return stack

The return stack is used to predict a procedure return address. The return stack is a three-entry circular buffer.

A procedure call or return is a write to the PC register. A procedure call pushes the return address onto the stack. A procedure return pops the stack and uses the predicted target address. No prediction is made when the return stack is empty. Misprediction occurs on condition code failure or an incorrectly predicted target address.

Static branch prediction of procedure calls is made in the following way:

  • Unconditional procedure calls are predicted taken.
  • Conditional procedure calls are predicted not taken.
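A practical consequence (illustrated with a hypothetical sketch): call chains deeper than three exceed the return stack, so the returns on the way back up go unpredicted. Recasting a deep recursion as iteration removes those call/return pairs entirely:

```c
/* Sketch: a recursion of depth n pushes n return addresses, but only
 * the last three fit in the 3-entry return stack. The iterative
 * version makes no calls, so there are no returns to mispredict. */
unsigned long fact_recursive(unsigned n)
{
    return (n <= 1) ? 1ul : n * fact_recursive(n - 1);
}

unsigned long fact_iterative(unsigned n)
{
    unsigned long r = 1;
    while (n > 1)
        r *= n--;
    return r;
}
```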

Load Store Unit (LSU)

The LSU performs load and store operations. It decouples loads and stores from the execution pipelines, allowing loads and stores to execute and complete in parallel with ALU computations.

The Load/Store Unit does not always block on an L1 data cache miss; it supports hit under miss (HUM). The miss goes into a holding state/buffer and non-dependent instructions are allowed to execute. Up to three outstanding misses are allowed. This increases parallelism, letting some computations proceed even when a load misses in the L1 data cache.
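Hit under miss rewards code that has independent work available behind a load. A simple software-pipelining sketch (hypothetical code) issues the next element's load before using the current one, so a miss can overlap with the add:

```c
#include <stddef.h>

/* Sketch: start the next load early. If a[i + 1] misses in the L1 data
 * cache, the miss sits in the holding buffer while the non-dependent
 * add of 'cur' executes underneath it. */
long sum_pipelined(const int *a, size_t n)
{
    if (n == 0)
        return 0;
    long s = 0;
    int cur = a[0];                 /* load for iteration 0 */
    for (size_t i = 0; i + 1 < n; i++) {
        int next = a[i + 1];        /* next load issues before cur is used */
        s += cur;                   /* independent of the pending load */
        cur = next;
    }
    return s + cur;
}
```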

Level 1 memory system

The level one memory system consists of:

  • Separate instruction and data caches
  • Separate instruction and data Tightly-Coupled Memories (TCM)

All datapaths are 64 bits wide.

Level 1 instruction cache

The level 1 (L1) instruction cache is a fast associative memory that holds the most recently used instructions. The instruction cache has the following characteristics.

  • 16KB capacity (BCM2835)
  • 4-way set associative
  • Virtually indexed, physically tagged
  • 32-byte cache line (eight 32-bit words)

Level 1 data cache

The level 1 (L1) data cache is a fast associative memory that holds the most recently used data items. The data cache has the following characteristics.

  • 16KB capacity (BCM2835)
  • 4-way set associative
  • Three-cycle load-to-use latency
  • Virtually indexed, physically tagged
  • 32-byte cache line (eight 32-bit words)
  • Supports three outstanding data cache misses

The load-to-use latency for a hit in the L1 data cache is three cycles. That is, data from a load that hits in the L1 data cache is not available until three cycles after the load issues. Dedicated datapaths called bypasses forward load data to other instructions in the pipeline.
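One scheduling implication (a hypothetical sketch): issue both loads before either value is used, so the second load executes during the first load's three-cycle latency window instead of the add stalling immediately:

```c
/* Sketch: separating loads from their first use. Loading x[i] and y[i]
 * back to back lets load 2 overlap part of load 1's three-cycle
 * load-to-use window; the add then receives both values through the
 * bypass network with little or no stall. */
void add_arrays(int *dst, const int *x, const int *y, int n)
{
    for (int i = 0; i < n; i++) {
        int xv = x[i];      /* load 1 */
        int yv = y[i];      /* load 2 overlaps load 1's latency */
        dst[i] = xv + yv;   /* first use, after both loads are in flight */
    }
}
```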

Level 2 cache

The BCM2835 has a 128KB level 2 (L2) cache and two memory management units (MMU). The ARM MMU maps program virtual addresses to physical addresses, and the VC/ARM MMU maps physical addresses onto the VideoCore/CPU bus which communicates with physical memory. The high-order bits of the VC/CPU address determine the cacheable status of memory regions, including the L2 cache.

The Broadcom BCM2835 ARM Peripherals manual states: "BCM2835 provides a 128KB system L2 cache, which is used primarily by the GPU. Accesses to memory are routed either via or around the L2 cache depending on senior two bits of the bus address." "Bus" refers to the CPU/VideoCore (VC) bus that communicates with physical memory. Under normal circumstances, memory reads/writes made by a program are not routed through the L2 cache. Thus, the L2 cache does not directly affect application program performance on the BCM2835.

Tightly-Coupled Memories (TCM)

Tightly-Coupled Memories provide high speed access to code and data.

  • Instruction TCM holds interrupt/exception handling code.
  • Data TCM holds data for media processing and other intensive tasks.

Two Direct Memory Access (DMA) channels transfer data to/from TCMs.

The BCM2835 TCMs consist of two 8KB TCM blocks.

ARM Memory Management Unit (MMU)

The ARM MMU translates virtual addresses to physical addresses using page information. The MMU supports four page sizes: 4KB small pages, 64KB large pages, 1MB sections, and 16MB supersections. Address mapping is performed using two levels of translation lookaside buffers (TLB): the Main TLB and two MicroTLBs. The Main TLB backs separate MicroTLBs for the instruction and data caches.
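The arithmetic behind these page sizes is simple power-of-two splitting of a 32-bit virtual address (a sketch; the constants follow the page sizes above, and the helper names are hypothetical):

```c
#include <stdint.h>

/* Sketch: splitting a 32-bit virtual address into a page/section index
 * and a byte offset. 4KB = 2^12 and 1MB = 2^20, matching the small-page
 * and section sizes the MMU supports. */
#define SMALL_PAGE_SHIFT 12u   /* 4KB small page */
#define SECTION_SHIFT    20u   /* 1MB section    */

uint32_t small_page_index(uint32_t va)  { return va >> SMALL_PAGE_SHIFT; }
uint32_t small_page_offset(uint32_t va) { return va & ((1u << SMALL_PAGE_SHIFT) - 1u); }
uint32_t section_index(uint32_t va)     { return va >> SECTION_SHIFT; }
uint32_t section_offset(uint32_t va)    { return va & ((1u << SECTION_SHIFT) - 1u); }
```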

Address translation is first attempted in a MicroTLB. If the address cannot be translated in the MicroTLB, then the Main TLB is tried. If the address cannot be translated through the Main TLB, then hardware page walking is invoked.

The Main TLB has the following characteristics:

  • Unified (both instruction and data page information)
  • 64 low-associativity entries plus 8 fully associative entries
  • The 8 fully associative entries are lockable
  • The low-associativity entries are 2-way associative
  • Main TLB handles MicroTLB misses

There is a MicroTLB for each of the instruction cache and the data cache. The MicroTLBs have the following characteristics.

  • 10 entries
  • Fully associative virtual address lookup in one cycle
  • Returns the physical address to the cache for comparison by the cache
  • Implements round robin (default) or random replacement

Copyright © 2013 Paul J. Drongowski