ARM Cortex-A72 tuning: Memory access

Today’s post characterizes read access time at three different levels of the Raspberry Pi 4 (Broadcom BCM2711) memory hierarchy. The ARM Cortex-A72 processor has a two-level cache structure: a Level 1 data (L1D) cache and a unified Level 2 (L2) cache. There is one L1D cache per core, and all four cores share the L2 cache. Primary memory is the third and final level, beyond the L2 cache.

The test program is a simple kernel (inner loop) that runs through a linked list, i.e., pointer chasing. Each linked list element is exactly one A72 cache line in size, 64 bytes. I have used pointer chasing on other non-ARM architectures (Alpha and AMD64 come to mind) and it’s a pretty simple and effective way to characterize memory access speed.

The trick is to adjust the number of linked elements so that the entire linked list fits within the memory level to be characterized. To facilitate run-to-run comparisons, there is an outer loop which repeatedly invokes list chasing, i.e., the entire list is walked multiple times per run.
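
For reference, here is a sketch of a cache-line-sized list element in C. The CacheLine type and the nextLine link appear in the kernel below; the padding field and the 64-byte alignment attribute are my own guesses at the layout:

  #include <stdint.h>

  /* One linked list element, padded to exactly one 64-byte A72 cache line. */
  typedef struct CacheLine CacheLine ;

  struct CacheLine {
      CacheLine *nextLine ;                      /* next element in the pointer chase */
      uint8_t pad[64 - sizeof(CacheLine *)] ;    /* pad the element out to 64 bytes */
  } __attribute__((aligned(64))) ;               /* align each element on a 64-byte boundary */

Padding each element to a full line and aligning it on a 64-byte boundary guarantees that every pointer chase touches a different cache line.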

There are two main run parameters:

  • The number of linked list elements (which determines the array size), and
  • The number of iterations (which is the number of times the full list is walked).

When the array size is doubled, the number of iterations is cut in half. This keeps the number of individual pointer chase operations (approximately) constant across runs.
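
Here is a quick sketch of that scaling. The total of 2^28 (268,435,456) chases per run follows from the first table row below (32 elements × 8,388,608 iterations); everything else in this snippet is just arithmetic:

  #include <stdio.h>

  #define TOTAL_CHASES (1L << 28)   /* 2^28 = 268,435,456 pointer chases per run */

  int main(void)
  {
      /* Doubling the element count halves the iteration count, keeping the
         total number of pointer chase operations constant across runs. */
      for (long elements = 32 ; elements <= 65536 ; elements *= 2) {
          long iterations = TOTAL_CHASES / elements ;
          long arrayBytes = elements * 64 ;     /* each element is one 64-byte cache line */
          printf("%8ld elements  %10ld iterations  %10ld bytes\n",
                 elements, iterations, arrayBytes) ;
      }
      return 0 ;
  }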

The following table summarizes the test run parameters and the memory level exercised by each run:

    #Elements  Iterations  Array Size  Mem Level
    ---------  ----------  ----------  ---------
           32     8388608         2KB  L1D cache
           64     4194304         4KB  L1D cache
          128     2097152         8KB  L1D cache
          256     1048576        16KB  L1D cache
          512      524288        32KB  L1D cache
         1024      262144        64KB  L2 cache
         2048      131072       128KB  L2 cache
         4096       65536       256KB  L2 cache
         8192       32768       512KB  L2 cache
        16384       16384         1MB  L2 cache
        32768        8192         2MB  RAM
        65536        4096         4MB  RAM

Here is the C code for the test kernel:

  initialize(number_of_elements) ;              // Build the randomized linked list
  a72MeasureDataAccessEvents() ;                // Choose the A72 data access events to count

  start_clock() ;                               // Start Linux clock() timing
  peStartCounting() ;                           // Start the performance counters
  for ( ; iterations > 0 ; iterations--) {
      // Walk the entire list: one dependent load (pointer chase) per element
      for (CacheLine *p = listHead ; p != NULL ; p = p->nextLine) ;
  }
  peStopCounting() ;                            // Stop the performance counters

  print_clock_time(stdout, get_clock_time()) ;  // Report elapsed time
  a72PrintDataAccessEvents(stdout) ;            // Report the event counts

Both Linux clock() time and Cortex-A72 performance counter events are measured.
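
Since the kernel measures Linux clock() time, the timing helpers can be thin wrappers around the standard C library clock() call. Here is a minimal sketch (the helpers in the actual test program may differ):

  #include <stdio.h>
  #include <time.h>

  static clock_t clockStart ;

  /* Record the starting processor time. */
  void start_clock(void)
  {
      clockStart = clock() ;
  }

  /* Return processor time, in seconds, elapsed since start_clock(). */
  double get_clock_time(void)
  {
      return (double) (clock() - clockStart) / (double) CLOCKS_PER_SEC ;
  }

  /* Print the elapsed time on the given stream. */
  void print_clock_time(FILE *stream, double seconds)
  {
      fprintf(stream, "Elapsed time: %.3f seconds\n", seconds) ;
  }

Note that clock() reports processor time rather than wall-clock time, which is fine for a single-threaded, CPU-bound kernel like this one.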

In my first experiments, the linked list elements were laid down in a linear, sequential fashion and in a simple ping-pong scheme. I quickly discovered that the Cortex-A72’s aggressive data prefetching is too good: the naive layouts did not produce the expected number of L1D or L2 cache misses. The A72 speculatively reads the next cache line beyond a miss, so by the time execution reached the next list element (or the one after it), the needed cache line was already in cache or in flight.

Ideally, we want to fool the memory prefetcher and hit only the intended memory level, taking the full read access penalty each time we chase a pointer. I rewrote array/list initialization to lay down the list elements at (pseudo-)random positions in the array. The Fisher-Yates (Knuth) shuffle algorithm got the job done. Once list element layout was randomized, the pointer chasing test began producing the expected number of reads and misses.
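
Here is a sketch of the randomized initialization, assuming the CacheLine element shown earlier. The details may differ from the actual test program, but the Fisher-Yates shuffle is the important part:

  #include <stdlib.h>

  static CacheLine *linesArray ;   /* backing array of cache-line-sized elements */
  static CacheLine *listHead ;     /* head of the shuffled linked list */

  /* Lay the list elements down at pseudo-random positions so the hardware
     prefetcher cannot predict the address of the next element. */
  void initialize(long number_of_elements)
  {
      /* perm[] holds a random permutation of 0 .. number_of_elements-1 */
      long *perm = malloc(number_of_elements * sizeof(long)) ;
      for (long i = 0 ; i < number_of_elements ; i++) perm[i] = i ;

      /* Fisher-Yates (Knuth) shuffle */
      for (long i = number_of_elements - 1 ; i > 0 ; i--) {
          long j = random() % (i + 1) ;
          long temp = perm[i] ; perm[i] = perm[j] ; perm[j] = temp ;
      }

      /* Allocate the array and chain the elements together in shuffled order */
      linesArray = aligned_alloc(64, number_of_elements * sizeof(CacheLine)) ;
      for (long i = 0 ; i < number_of_elements - 1 ; i++) {
          linesArray[perm[i]].nextLine = &linesArray[perm[i + 1]] ;
      }
      linesArray[perm[number_of_elements - 1]].nextLine = NULL ;   /* terminate the list */

      listHead = &linesArray[perm[0]] ;
      free(perm) ;
  }

Any reasonably uniform permutation is enough here; the quality of random() is not critical, only that the next address is unpredictable to the prefetcher.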

The following table summarizes each run by the number of retired instructions, CPU cycles, instructions per cycle (IPC) and execution time in seconds:

    Array  Mem  Retired Ins      CPU Cycles    IPC    Time
    -----  ---  -----------  --------------  -----  ------
      2KB  L1D  847,249,436     855,660,776  0.990   0.609
      4KB  L1D  826,277,916   1,154,215,728  0.716   0.814
      8KB  L1D  815,792,156   1,114,379,370  0.732   0.806
     16KB  L1D  810,549,276   1,093,757,212  0.741   0.763
     32KB  L1D  807,927,836   1,382,324,229  0.584   0.975
     64KB  L2   806,617,116   5,074,763,198  0.159   3.446
    128KB  L2   805,961,756   5,643,312,493  0.143   3.805
    256KB  L2   805,634,076   6,621,262,142  0.122   4.452
    512KB  L2   805,470,236   7,163,843,161  0.112   4.813
      1MB  L2   805,388,316  27,563,140,814  0.029  18.421
      2MB  RAM  805,347,356  49,317,924,775  0.016  32.969
      4MB  RAM  805,326,876  54,865,753,267  0.015  36.645

No surprise: access to L1D cache is fastest, L2 is second best, and primary memory is slowest. In terms of CPU cycles, access to L2 cache takes roughly five times as long as access to L1D cache, and access to primary memory takes nearly 50 times as long as L1D cache. The effect on IPC is dramatic.
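
As a back-of-the-envelope check (my own arithmetic on the numbers above, not additional measurements), dividing the CPU cycle counts by the roughly 2^28 pointer chases per run gives an approximate cost per chase at each level:

  #include <stdio.h>

  int main(void)
  {
      const double chases = 268435456.0 ;        /* ~2^28 pointer chases per run */

      /* CPU cycle counts copied from the table above */
      const double l1dCycles = 1093757212.0 ;    /* 16KB run: hits in L1D cache */
      const double l2Cycles  = 5074763198.0 ;    /* 64KB run: hits in L2 cache */
      const double ramCycles = 54865753267.0 ;   /* 4MB run: misses out to primary memory */

      printf("L1D: %6.1f cycles per chase\n", l1dCycles / chases) ;   /* roughly 4 cycles */
      printf("L2 : %6.1f cycles per chase\n", l2Cycles / chases) ;    /* roughly 19 cycles */
      printf("RAM: %6.1f cycles per chase\n", ramCycles / chases) ;   /* roughly 204 cycles */
      return 0 ;
  }

These per-chase figures include loop overhead, but the ratios line up with the roughly five-fold (L2) and fifty-fold (primary memory) penalties relative to L1D cache.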

Taking a look at the L1D performance event counts (the Ratio column is L1D misses divided by L1D reads):

    Array  Mem    IPC    Time    L1D Reads   L1D Misses   Ratio
    -----  ---  -----  ------  -----------  -----------  ------
      2KB  L1D  0.990   0.609  268,435,785          484   0.000
      4KB  L1D  0.716   0.814  268,435,630        1,316   0.000
      8KB  L1D  0.732   0.806  268,435,639        1,149   0.000
     16KB  L1D  0.741   0.763  268,435,622        4,319  <0.001
     32KB  L1D  0.584   0.975  268,435,828   17,343,069   0.065
     64KB  L2   0.159   3.446  268,435,603  234,906,566   0.875
    128KB  L2   0.143   3.805  268,435,592  268,435,529   1.000
    256KB  L2   0.122   4.452  268,435,625  268,435,588   1.000
    512KB  L2   0.112   4.813  268,435,599  268,435,530   1.000
      1MB  L2   0.029  18.421  268,435,594  268,435,782   1.000
      2MB  RAM  0.016  32.969  268,435,579  268,435,960   1.000
      4MB  RAM  0.015  36.645  268,435,635  268,435,941   1.000

The counts show that pointer chasing exercises the L1D cache as designed. The L1D cache capacity is 32KB, and the 32KB run shown here is cherry-picked: it has the shortest execution time of the 32KB runs. As I’ve seen on other architectures, measurements get a bit “weird” near cache capacity; when a cache is nearly full, run statistics become inconsistent. The shortest run best shows the break between L1D and L2 access.

Finally, here are the L2 cache performance event counts (here, the Ratio column is L2 misses divided by L2 reads):

    Array  Mem    IPC    Time     L2 Reads    L2 Misses   Ratio
    -----  ---  -----  ------  -----------  -----------  ------
      2KB  L1D  0.990   0.609        1,085           68   0.063
      4KB  L1D  0.716   0.814    8,490,994          228   0.000
      8KB  L1D  0.732   0.806    4,300,759          151   0.000
     16KB  L1D  0.741   0.763    2,102,562          163   0.000
     32KB  L1D  0.584   0.975   18,495,230        1,003  <0.001
     64KB  L2   0.159   3.446  235,483,730        1,517  <0.001
    128KB  L2   0.143   3.805  270,831,005        2,745  <0.001
    256KB  L2   0.122   4.452  269,203,020       31,340  <0.001
    512KB  L2   0.112   4.813  270,893,954      443,477   0.002
      1MB  L2   0.029  18.421  302,452,386  107,397,408   0.355
      2MB  RAM  0.016  32.969  286,244,127  227,010,870   0.793
      4MB  RAM  0.015  36.645  277,293,265  252,881,540   0.912

As expected, we see a dramatic breakpoint at 1MB, which is the capacity of the unified L2 cache.

Bottom line, these performance measurements reinforce the importance of cache-friendly algorithms and data access patterns. Start with the best algorithms for your application, measure cache events and then tune for minimum misses. Data access should hit most frequently in the Level 1 data cache, then L2 cache. Primary memory is fifty times (!) more expensive than L1D cache and reads out to primary memory should be as infrequent as possible. Your mantra should be, “Bring it into cache, compute the heck out of the in-cache data, then write the final results back to memory, and move on.”

Please check out the other articles in this series.

Don’t forget my Performance Events for Linux tutorial and learn to make your own Raspberry Pi 4 (Broadcom BCM2711) performance measurements.

Next time, I will wrap up this long series of articles with C code so you can perform your own experiments.

Copyright © 2021 Paul J. Drongowski