My latest set of experiments with Raspberry Pi use the BCM2835 performance counters to exercise and observe the behavior of the RPi memory system. The experiments use a simple microbenchmark that traverses a linked list. This technique is often called pointer chasing because list traversal chases the pointers through the linked list from beginning to end. A pointer chasing loop is simple, is easy to instrument, and has consistently repeatable behavior.
Here’s the approach. We exercise a single specific level in the memory hierarchy (such as the level 1 data cache) simply by adjusting the size of the linked list. We just need to choose a size such that the the physical size of the list is less than or equal to the capacity of a level — 16KB in the case of the level 1 (L1) data cache. Storage for the linked list is allocated as a contiguous block of memory (an array of bytes), so it’s easier to think of the linked list size in terms of its contiguous footprint in memory, the array size. Each linked list item is the same size as a cache line: 32 bytes.
The first experiment measures seven kinds of performance events: cycles, executed instructions, L1 data cache accesses, L1 data cache misses, data MicroTLB misses, Main TLB misses and explicit external data accesses.
Before we take a look at the results of the first experiment, here’s a few things to keep in mind.
- The Broadcom BCM2835 128KB L2 cache is dedicated to the VideoCore GPU. Memory requests from the CPU are routed around the L2 and go directly to primary memory. L2 cache is not a factor in our analysis.
- Coverage is the amount of primary memory accessible through a translation lookaside buffer (TLB) without incurring a TLB miss.
- The MicroTLB has ten fully associative (page) entries. Its coverage is (10 entries * 4,096 bytes) = 40KB.
- The Main TLB handles MicroTLB misses and has 8 fully associative entries plus 64 2-way associative entries. Its coverage is (72 entries * 4,096 bytes) = 288KB.
- A hardware page table walker handles Main TLB misses. A page table walk requires at least one additional read in primary memory in order to look up page mapping information.
These memory system characteristics affect the results and should be directly observable.
The following table summarizes basic measurements for ten test runs. The first column is the size of the dynamically allocated block of memory that holds the linked list elements. The second column is the size of the block expressed as 32-byte cache lines. There is one list item per cache lines, so this is also the number of linked list items. The third column is the number of list traversals (one traversal per iteration). I adjusted the number of iterations for each run in order to perform exactly the same number of pointer chasing operations per test case. This allows us to make meaningful run-to-run comparisons.
Size Cache lines Iterations Time CPI ----- ----------- ---------- ----- ----- 4MB 128K 1024 25.76 20.51 2MB 64K 2048 25.74 20.52 1MB 32K 4096 25.98 20.70 512KB 16K 8192 25.56 20.41 256KB 8K 16384 24.62 19.71 128KB 4K 32768 19.99 16.24 64KB 2K 65536 11.75 9.61 32KB 1K 131072 10.49 8.77 16KB 512 262144 2.66 2.30 8KB 256 524288 2.39 2.04
The fourth column is the total elapsed time per run. There are roughly three distinct timing tiers:
- Tier A: 4MB, 2MB, 1MB, 512KB, 256KB, 128KB
- Tier B: 32KB, 64KB
- Tier C: 8KB, 16KB
The 128KB case is somewhat “transitional,” but I placed it in Tier A. The cycles per instruction (CPI column) ratios are consistent with respect to the elapsed times.
The following table summarizes the performance event counts for the same test cases (block/array size and iterations).
MicroTLB Main TLB External Size DC access DC miss miss miss access ----- --------- ------- -------- -------- -------- 4MB 24696257 5107400 1049399 258783 7433309 C 2MB 24927252 5119529 1051719 259531 7509503 C 1MB 24878797 5145341 1065404 273960 7521771 C 512KB 24668739 5086410 1050024 259104 7461179 C 256KB 22999900 4803510 1057814 308477 7263176 C 128KB 20051408 4258682 965191 181398 6136462 C 64KB 11972871 2683992 620354 91207 3785568 B 32KB 10096551 2288319 548311 71756 3231861 B 16KB 2621836 594034 136900 8265 804582 A 8KB 3480732 446481 72712 2757 480707 A
Keeping the basic measurements in mind, we can see that the performance event counts for Tier C are consistent with L1 data cache access. We are measuring on a single core computer and the data cache and TLB misses are mostly due to other system activity (like OS timer interrupts). Tier B cases hit in either the MicroTLB or Main TLB. Tier B are mainly references to primary memory without any TLB miss. The Tier B array sizes are within the MicroTLB and/or Main TLB coverage. Tier A cases miss in the TLBs. Tier A references require an additional primary memory read by the page walker. Tier A has the longest elapsed execution time.
Next up, we’ll measure the memory access (latency) time to primary memory.