Memory hierarchy and performance events

My latest set of experiments with Raspberry Pi use the BCM2835 performance counters to exercise and observe the behavior of the RPi memory system. The experiments use a simple microbenchmark that traverses a linked list. This technique is often called pointer chasing because list traversal chases the pointers through the linked list from beginning to end. A pointer chasing loop is simple, is easy to instrument, and has consistently repeatable behavior.

Here’s the approach. We exercise a single specific level in the memory hierarchy (such as the level 1 data cache) simply by adjusting the size of the linked list. We just need to choose a size such that the the physical size of the list is less than or equal to the capacity of a level — 16KB in the case of the level 1 (L1) data cache. Storage for the linked list is allocated as a contiguous block of memory (an array of bytes), so it’s easier to think of the linked list size in terms of its contiguous footprint in memory, the array size. Each linked list item is the same size as a cache line: 32 bytes.

The first experiment measures seven kinds of performance events: cycles, executed instructions, L1 data cache accesses, L1 data cache misses, data MicroTLB misses, Main TLB misses and explicit external data accesses.

Before we take a look at the results of the first experiment, here’s a few things to keep in mind.

  • The Broadcom BCM2835 128KB L2 cache is dedicated to the VideoCore GPU. Memory requests from the CPU are routed around the L2 and go directly to primary memory. L2 cache is not a factor in our analysis.
  • Coverage is the amount of primary memory accessible through a translation lookaside buffer (TLB) without incurring a TLB miss.
  • The MicroTLB has ten fully associative (page) entries. Its coverage is (10 entries * 4,096 bytes) = 40KB.
  • The Main TLB handles MicroTLB misses and has 8 fully associative entries plus 64 2-way associative entries. Its coverage is (72 entries * 4,096 bytes) = 288KB.
  • A hardware page table walker handles Main TLB misses. A page table walk requires at least one additional read in primary memory in order to look up page mapping information.

These memory system characteristics affect the results and should be directly observable.

The following table summarizes basic measurements for ten test runs. The first column is the size of the dynamically allocated block of memory that holds the linked list elements. The second column is the size of the block expressed as 32-byte cache lines. There is one list item per cache lines, so this is also the number of linked list items. The third column is the number of list traversals (one traversal per iteration). I adjusted the number of iterations for each run in order to perform exactly the same number of pointer chasing operations per test case. This allows us to make meaningful run-to-run comparisons.

 Size  Cache lines  Iterations   Time    CPI
-----  -----------  ----------  -----  -----
  4MB         128K        1024  25.76  20.51
  2MB          64K        2048  25.74  20.52
  1MB          32K        4096  25.98  20.70
512KB          16K        8192  25.56  20.41
256KB           8K       16384  24.62  19.71
128KB           4K       32768  19.99  16.24
 64KB           2K       65536  11.75   9.61
 32KB           1K      131072  10.49   8.77
 16KB          512      262144   2.66   2.30
  8KB          256      524288   2.39   2.04

The fourth column is the total elapsed time per run. There are roughly three distinct timing tiers:

  • Tier A: 4MB, 2MB, 1MB, 512KB, 256KB, 128KB
  • Tier B: 32KB, 64KB
  • Tier C: 8KB, 16KB

The 128KB case is somewhat “transitional,” but I placed it in Tier A. The cycles per instruction (CPI column) ratios are consistent with respect to the elapsed times.

The following table summarizes the performance event counts for the same test cases (block/array size and iterations).

                           MicroTLB  Main TLB  External
 Size DC access  DC miss      miss      miss    access
----- ---------  -------  --------  --------  --------
  4MB  24696257  5107400   1049399    258783   7433309 C
  2MB  24927252  5119529   1051719    259531   7509503 C
  1MB  24878797  5145341   1065404    273960   7521771 C
512KB  24668739  5086410   1050024    259104   7461179 C
256KB  22999900  4803510   1057814    308477   7263176 C
128KB  20051408  4258682    965191    181398   6136462 C
 64KB  11972871  2683992    620354     91207   3785568 B
 32KB  10096551  2288319    548311     71756   3231861 B
 16KB   2621836   594034    136900      8265    804582 A
  8KB   3480732   446481     72712      2757    480707 A

Keeping the basic measurements in mind, we can see that the performance event counts for Tier C are consistent with L1 data cache access. We are measuring on a single core computer and the data cache and TLB misses are mostly due to other system activity (like OS timer interrupts). Tier B cases hit in either the MicroTLB or Main TLB. Tier B are mainly references to primary memory without any TLB miss. The Tier B array sizes are within the MicroTLB and/or Main TLB coverage. Tier A cases miss in the TLBs. Tier A references require an additional primary memory read by the page walker. Tier A has the longest elapsed execution time.

Next up, we’ll measure the memory access (latency) time to primary memory.