Memory hierarchy and access time

PMU measurement source: Makefile, chase.c, readme
Latency measurement source: Makefile, latency.c, readme
Common code: test_common.h, test_common.c, rpi_pmu.h, rpi_pmu.c

This page takes a closer look at the Raspberry Pi memory hierarchy. Each level of the memory hierarchy has a capacity and speed. Capacities are relatively easy to discover by querying the operating system or reading the ARM1176 technical reference manual. Speed, however, is not as easy to discover and must usually be measured. I use a simple pointer chasing technique to characterize the behavior of each level in the hierarchy. The technique also reveals the behavior of memory-related performance counter events at each level.

Background

The Raspberry Pi implements five levels in its memory hierarchy. The levels are summarized in the table below.

  • Virtual pages in secondary storage
  • Virtual/physical pages in primary memory
  • Level 2 (L2) cache
  • Level 1 (L1) instruction and data caches
  • CPU and VFP11 coprocessor registers

Raspberry Pi memory hierarchy (highest, slowest level first)

The highest level consists of virtual memory pages that are maintained in secondary storage. Raspbian Wheezy keeps its swap space in the file /var/swap on the SDHC card. Here’s one way to find the location and size of the swap file:

> cat /proc/swaps
Filename              Type      Size    Used    Priority
/var/swap             file      102396  0       -1
> ls -l /var/swap
-rw------- 1 root root 104857600 Feb  8 22:40 /var/swap

This is enough space for 25,600 4KB pages. The free command shows similar information:

> free -k
             total       used       free     shared    buffers     cached
Mem:        448776     233840     214936          0      20480      95416
-/+ buffers/cache:     117944     330832
Swap:       102396          0     102396

Another method is the swapon command:

> swapon -s
Filename           Type        Size     Used    Priority
/var/swap          file        102396   0       -1

You are allowed as many pages as will fit into the preallocated swap space.

The Raspberry Pi has either 256MB (Model A) or 512MB (Model B) of primary memory. That is enough space for 65,536 (Model A) or 131,072 (Model B) physical 4KB pages, if all of primary memory were available for paging. It isn't all available for user-space programs because the Linux kernel needs space for its own code and data. Linux also supports large pages, but that's a separate topic for now. The vmstat command displays information about virtual memory usage; please refer to the man page for details. Vmstat is a good tool for troubleshooting paging-related performance issues since it shows page-in and page-out statistics.

The processor in the Raspberry Pi is the Broadcom BCM2835. The BCM2835 does have a unified level 2 (L2) cache. However, the L2 cache is devoted to the VideoCore GPU. Memory references from the CPU side are routed around the L2 cache. In case you’re wondering, though, the L2 cache capacity is 128KB.

The BCM2835 has two level 1 (L1) caches: a 16KB instruction cache and a 16KB data cache. Our analysis below concentrates on the data cache. The data cache is 4-way set associative. Each way in an associative set stores a 32-byte cache line. The cache can handle up to four active references to the same set without conflict. If all four ways in a set are valid and a fifth reference is made to the set, then a conflict occurs and one of the four ways is victimized to make room for the new reference.

The data cache is virtually indexed and physically tagged. Cache lines and tags are stored separately in DATARAM and TAGRAM, respectively. Virtual address bits 11:5 index the TAGRAM and DATARAM. Given a 16KB capacity, 32 byte lines and 4 ways, there must be 128 sets. Virtual address bits 4:0 are the offset into the cache line. The data MicroTLB translates a virtual address to a physical address and sends the physical address to the L1 data cache. The L1 data cache compares the physical address with the tag and determines hit/miss status and the correct way. The load-to-use latency is three (3) cycles for an L1 data cache hit.

The BCM2835 implements a two level translation lookaside buffer (TLB) structure for virtual to physical address translation. There are two MicroTLBs: a ten entry data MicroTLB and a ten entry instruction MicroTLB. The MicroTLBs are backed by the Main TLB (i.e., the second level TLB). The MicroTLBs are fully associative. Each MicroTLB translates a virtual address to a physical address in one cycle when the page mapping information is resident in the MicroTLB (that is, a hit in the MicroTLB).

The Main TLB is a unified TLB that handles misses from the instruction and data MicroTLBs. The Main TLB has two elements:

  • Eight (8) fully associative, lockable entries, and
  • A 64-entry, 2-way associative structure.

Main TLB misses are handled by a hardware page table walker. A page table walk requires at least one additional memory access to find the page mapping information in primary memory. The overall Memory Management Unit (TLBs, page walker, etc.) supports four mapping sizes: 4KB, 64KB, 1MB, and 16MB.

The lowest level of the memory hierarchy consists of the fast registers in the CPU. The ARM1176JZF-S core within the BCM2835 has 33 general purpose 32-bit registers and 7 dedicated 32-bit registers. The BCM2835 also implements a VFP11 (vector floating point) coprocessor. The VFP11 register file consists of four banks of eight registers.

Pointer chasing technique

Our approach is to exercise exactly one particular level in the memory hierarchy at a time and observe its behavior by counting hardware performance events. We need a test program with a small simple loop that repeatedly touches the same level in the hierarchy. We instrument the program and measure both the elapsed time and performance counter events. By placing these measurements in a table, we get a profile of memory system behavior.

Linked list pointer chasing is a classic approach to this problem. It is a synthetic microbenchmark — synthetic because it is not a real-world application and micro because it is small and specific. A pointer chasing loop:

  ListItem* p ;

  for(p = head ; p != NULL ; p = p->next) ;

is small, tight and devoid of extraneous arithmetic operations. The trick is to force access to a single level of the memory hierarchy.

We know that each level of the hierarchy has a specific capacity. The hardware sends memory requests to the next level when the capacity of a given level is exceeded. The level 1 (L1) data cache, for example, holds 16KB and is 4-way set associative. Another way to state the L1 capacity is to say that it holds 512 32-byte cache lines. As long as the linked list fits within the L1 data cache, every chase through the next pointer should hit in the L1 data cache.

If we increase the size of the linked list and exceed the capacity of a level, we force chase operations into the next level of the hierarchy due to capacity misses. In the case of the L1 data cache, each chase operation should cause a miss in the L1 data cache in order to force all references into the next level, thereby exercising the next level exclusively. This is accomplished by making each list item the same size as a cache line (conveniently a power of two) and forcing conflict misses. Every chase operation should cause a conflict miss in the L1 data cache once the cache is warmed up and each set is full. Higher levels are exercised and measured by progressively increasing the size of the linked list, causing both capacity and conflict misses at the lower levels.

The fundamental data structure in the pointer chasing program (called chase) is the linked list item:

typedef struct _CacheLine
{
  struct _CacheLine* ptrCacheLine[POINTERS_PER_LINE] ;
} CacheLine ;

This declaration defines a type called CacheLine which contains a one dimensional array of pointers. POINTERS_PER_LINE is a symbolic constant which specifies the number of pointers (32-bit addresses) that fit in a 32-byte cache line, e.g., eight pointers. The program stores the item’s next pointer at array index zero. (Consistency of placement and a cache size which is a power of two assures cache conflicts. Why?)

Linked list initialization is the trickiest part of the program. The function initialize_array() calls malloc() to allocate a contiguous block of memory that is big enough to hold the entire linked list. The base address is cast to the CacheLine* type so we can manipulate the cache lines and pointers within the lines more easily. For safety’s sake, the function initializes all of the pointers to zero (NULL pointers).

The array of cache lines is treated in two halves: top and bottom. The next pointers are initialized such that an item in the bottom half points to an item in the top half and an item in the top half points to an item in the bottom half. Thus, the chase ping-pongs between the top and bottom halves of the array. If the linked list items are laid down in order in the array instead, a chase operation would be a sequential reference from the current cache line to the following cache line in memory. The ARM hardware would be able to detect the sequential reference and optimize for it. The ping-pong defeats the sequential access pattern.

Ping-ponging is not enough to fool some processors. AMD processors, for example, detect reference patterns for long strides. The processors can detect patterns for multiple reference streams even when the accesses from different streams are interspersed. One way to defeat pattern detection in this case is to lay down the list items in random order within the array. Try it!

Here is the array initialization loop within the guts of initialize_array():

  steps = array_elements / 2 ;
  bottom = 0 ;
  top = steps ;
  for (i = 0 ; i < steps ; i++) {
    array[bottom].ptrCacheLine[0] = &array[top] ;
    array[top].ptrCacheLine[0] = &array[bottom+1] ;
    top++ ;
    bottom++ ;
  }
  array[(array_elements-1)].ptrCacheLine[0] = NULL ;

The final assignment statement terminates the linked list.

The actual pointer chasing loop within the function test_loop_prototype() is quite simple:

void test_loop_prototype(CacheLine* array)
{
  CacheLine* p ;
  for(p = array ; p != NULL ; p = p->ptrCacheLine[0]) ;
}

The loop chases through the linked list items until it reaches the NULL pointer at the end of the list.

The time required to chase through a 4MB array (the largest test case) is incredibly short. So, the program calls test_loop_prototype() many times in order to increase the overall run time. The longer run time improves repeatability and reduces the effect of clock resolution on time measurements. The program accepts command line options to set the array size, to set the number of test iterations and to select the performance events to be measured. Here is the usage summary for chase:

> ./chase -h
Usage: ./chase
  -a N  Allocate/initialize array on core N
  -c    Measure L1 data cache events
  -d    Display debugging information
  -e    Measure explicit external data access
  -h    Display usage information
  -i    Measure instruction and cycle events
  -l N  Set the cache line size to N bytes
  -n N  Set the number of tests (iteration count)
  -p N  Perform pointer chasing on core N
  -s N  Allocate an array with N * 1024 elements
  -t    Measure TLB events
  -v    Display more information (verbose output)

Experimental design and results

We built a small test script that runs chase for a range of linked list/array sizes. The test script makes it easier to configure and launch all of the test runs. Don't forget that we need to run chase multiple times to collect all of the performance events of interest, further increasing the number of test runs.

We want to perform the same number of chase operations per run in order to make meaningful comparisons between runs. The iteration count (command line option -n) is adjusted along with array size to achieve the same number of chase operations per run (128M chases per run to be exact). For the 16KB (512 cache lines) and 4MB (128K cache lines) cases,

512 cache lines * 262,144 iterations = 128M chase operations
128K cache lines * 1,024 iterations = 128M chase operations

the 16KB case executes 262,144 iterations of the inner test loop while the 4MB case executes only 1,024 iterations.

Chase loop performance measurements

I have summarized the test results in the three tables below. The first table shows the configuration parameters for each run: the array size in bytes, the array (linked list) size in 32-byte cache lines, the number of test iterations, and the elapsed time in seconds. Please notice how the number of test iterations increases as the array size decreases in order to keep the total number of chase operations constant across runs.

Array size  Cache lines  Iterations   Time
----------  -----------  ----------  -----
       4MB         128K        1024  25.76
       2MB          64K        2048  25.74
       1MB          32K        4096  25.98
     512KB          16K        8192  25.56
     256KB           8K       16384  24.62
     128KB           4K       32768  19.99
      64KB           2K       65536  11.75
      32KB           1K      131072  10.49
      16KB          512      262144   2.66
       8KB          256      524288   2.39

First, let's take a look at the elapsed times in the fourth column. There are three clear timing tiers (~2.5 seconds, ~11 seconds, and ~25 seconds). The BCM2835 processor has a 16KB L1 data cache, a 128KB unified L2 cache and (on the Model B) 512MB of primary memory. The L2 cache is dedicated to the VideoCore GPU and CPU memory operations are routed around the L2 cache to primary memory. Therefore, one would expect to find only two timing tiers: L1 data cache and primary memory.

What's up? The fastest two runs are clearly associated with the L1 data cache since the array sizes (8KB and 16KB) fit in the L1 data cache. The remaining runs access primary memory, but are affected by MicroTLB and Main TLB behavior. Coverage is the maximum number of bytes in the pages that the program can touch through a TLB without a miss. The MicroTLB has a coverage of (10 * 4,096) = 40KB and the Main TLB has a coverage of (72 * 4,096) = 288KB. As a TLB begins to saturate, the effective access time goes up due to TLB misses and fills. The MicroTLB fills from the Main TLB and the Main TLB fills from primary memory via the hardware page table walker. A page table walk is especially slow because it must read page information from primary memory.

The second table shows the number of scaled cycles and executed instructions for each run. Cycles per instruction (CPI) is computed from the raw event counts and is shown in the fourth column. CPI is consistent with elapsed time, as expected.

Array size  Cycles/64  Exec Inst    CPI
----------  ---------  --------- ------
       4MB  281762460  879073813  20.51
       2MB  281866308  878960062  20.52
       1MB  284716446  880339064  20.70
     512KB  280131077  878443048  20.41
     256KB  270032804  877033078  19.71
     128KB  219178722  863523985  16.24
      64KB  128910101  841130729   9.81
      32KB  114946914  838561189   8.77
      16KB   29413167  819668676   2.30
       8KB   26223823  824428534   2.04

The third table summarizes memory-related performance events. TLB misses increase with array size and are consistent with the coverage values of 40KB and 288KB. The numeric breaks in the event data are not distinct due to cache and TLB pollution and to slow saturation of the TLBs.

The elapsed times and CPI values give us a rough notion about the relative speed of the L1 data cache and primary memory levels of the hierarchy. However, we need to stop well short of using the elapsed time and CPI as absolute measures of access speed. The program instrumentation measures the chase operations plus the conditional NULL pointer test in the chase loop and the surrounding iterative test loop. Thus, elapsed time and CPI are not pure measures of access speed alone.

Array size  DC access  DC miss  MicroTLB  Main TLB  Ext acc
----------  ---------  -------  --------  --------  -------
       4MB   24696257  5107400   1049399    258783  7433309
       2MB   24927252  5119529   1051719    259531  7509503
       1MB   24878797  5145341   1065404    273960  7521771
     512KB   24668739  5086410   1050024    259104  7461179
     256KB   22999900  4803510   1057814    308477  7263176
     128KB   20051408  4258682    965191    181398  6136462
      64KB   11972871  2683992    620354     91207  3785568
      32KB   10096551  2288319    548311     71756  3231861
      16KB    2621836   594034    136900      8265   804582
       8KB    3480732   446481     72712      2757   480707

The ping-ponging between the bottom and top halves of the array makes a slow march through the items in the bottom and top halves. The march bounces between two current pages (i.e., the working set size is 2 pages). There are 128 items in a 4KB page, so each page is touched 128 times before a new page is required. Thus, two TLB misses (worst case) occur after every 256 chase operations; a TLB miss does not occur for every reference through a next pointer. A random item layout would cause more TLB misses due to a larger working set.

The BCM2835 is a single core machine and all software must execute on the same core. User-space programs and the operating system must share internal hardware structures: the caches, the TLBs and the branch history table. Under ideal test conditions, the benchmark program would run on a single core alone without interference from other programs. Unfortunately, this is not the case here and the shared hardware structures are polluted by the activities of other programs and the OS. (Even if no applications and background services are running, the OS still handles clock ticks and other interrupts.)

The 8KB and 16KB runs best show pollution from other activity on the BCM2835. The array/linked list is small enough to fit into the L1 data cache in each case. There simply shouldn't be any data cache or TLB misses (except for a few compulsory or incidental misses). These cases demonstrate the hazards of performance measurement and analysis on a single core machine.

Direct measurement technique

Many processor implementations like x86 provide a cycles counter which can be used to measure the duration of a code region such as a loop. The x86 implements a special instruction, RDTSC, to read the cycles counter. ("RD" means read and "TSC" means time stamp counter.) The program captures the value of the cycles counter at the beginning and end of a code region and subtracts the start time from the end time to obtain the duration.

This technique can sometimes measure the duration of individual instructions. Extreme care must be taken to assure that the captured values are valid and useful. Superscalar (multiple issue) processors with out-of-order execution may re-schedule and re-order instructions in the pipeline. In-order execution is not guaranteed: the instruction that reads the start time may execute after the instruction to be measured, and other reorderings are possible, affecting correctness and validity. Further, the instruction which reads the cycles counter takes time to execute, too, and that measurement bias must be taken into account.

The ARM1176JZF-S is a single issue, in-order pipeline. Its simple issue behavior gives us the confidence to measure the execution time of a single instruction. I cloned the chase program and modified the pointer chasing test loops. The new program is named latency. Like chase, the new program runs the inner pointer chasing loop many times. First, list traversal is started and the cache, TLBs, and pipeline are allowed to warm up. During list traversal, the loop selects a chase operation. The loop measures the execution time (cycles) of the chosen pointer dereferencing instruction and the loop stores the execution time in an array. The array is written to a file named samples.dat after all iterations are complete. Thus, samples.dat contains a statistical sampling of the execution times for the pointer dereferencing instruction. The distribution of samples should show us the dominant latency times (access times) for a given level in the memory hierarchy.

I cannot guarantee that this technique will work on any other ARM implementation. In fact, there may be situations on ARM1176 where this technique will fail for all I know!

The following source code is the outer test control loop. The number of iterations is determined by the variable iteration_count. For every test iteration, the loop body:

  • Sets up and starts the ARM1176 Cycles Counter Register (CCR),
  • Calls sample_cycles() to traverse the linked list and take a latency sample,
  • Stores the sampled latency time in the array samples, and
  • Stops the cycles counter.

The cycles count is not scaled by 64 because we want to count individual cycles. By starting and stopping the CCR with each iteration, we avoid any overflow issues. The inner chase loop executes in much less than the time it would take to overflow the 32-bit CCR.

void sample_latency_cycles()
{
  register int i ;

  // Clear the sticky overflow bits
  armv6_pmcr_write(ARMV6_PMCR_CCOUNT_OVERFLOW |
                   ARMV6_PMCR_COUNT0_OVERFLOW |
                   ARMV6_PMCR_COUNT1_OVERFLOW
		   ) ;

  for(i = iteration_count-1 ; i >= 0 ; i--) {
    // Clear and start the performance counters
    armv6_pmcr_write(ARMV6_PMCR_ENABLE |
		     ARMV6_PMCR_CCOUNT_RESET |
		     ARMV6_PMCR_CTR01_RESET |
		     (0xFF << ARMV6_PMCR_EVT_COUNT0_SHIFT) |
		     (0xFF << ARMV6_PMCR_EVT_COUNT1_SHIFT)
		     ) ;

    // Kind of a bogus "random" sampling within a 64
    // iteration window during list traversal
    samples[i] = sample_cycles(array, 187+(i&0x3F) ) ;

    // Stop the performance counters
    armv6_pmcr_write(0) ;
  }
}

The test control loop passes two arguments to the function sample_cycles(). The first argument is a pointer to the first item in the linked list. The second argument specifies when the chase loop should sample a latency time. The chase loop traverses at least 187 elements before taking a sample. The current iteration count chooses a chase operation within a 64 operation window after that point. The chosen operation is sampled and returned by sample_cycles().

Here is the source code for the pointer chasing function sample_cycles() from latency.c.

static inline uint32_t armv6_read_ccr()
{
  uint32_t value ;
  asm volatile("mrc   p15, 0, %0, c15, c12, 1" : "=r"(value)) ;
  return( value ) ;
}

uint32_t sample_cycles(CacheLine* linked_list, int n)
{
  register CacheLine* item ;
  register uint32_t before, after, cycles ;
  register int count = n ;

  cycles = 0 ;
  for(item = linked_list ; item != NULL ; ) {
    if (count-- == 0) {
      before = armv6_read_ccr() ;
      item = item->ptrCacheLine[0] ;
      after = armv6_read_ccr() ;
      cycles = after - before ;
    } else {
      item = item->ptrCacheLine[0] ;
    }
  }
  return( cycles ) ;
}

The pointer chasing loop traverses the linked list. It counts down from n which is the number of chase operations to perform before taking a sample. When the running count hits zero, the code:

  • Captures the current CCR value in the variable before,
  • Executes the chase operation by dereferencing the next pointer,
  • Captures the then current CCR value in after, and
  • Stores the duration in the variable cycles.

When the list is fully traversed, the sampled latency, cycles, is returned. As we saw earlier, the sampled latency is stored in an array until it is finally written to a file.

The program must be compiled with the -O2 option to assure that instructions to read CCR are inserted in-line and to keep temporary variables in registers to avoid cache pollution.

Direct measurement results

I ran the latency program over the same range of array/list sizes and plotted the distribution of the instruction execution times for each test case. The 8KB test case hits exclusively in the L1 data cache as shown in the following histogram (distribution):

9  1000 ################################################

The first column is execution time in cycles. The second column is the number of samples with that time. In the 8KB case, all 1000 samples are 9 cycles. This is good news because the measurements match our expectations -- all read accesses hit in the L1 data cache. The histogram for the 16KB case is similar with 993 samples at 9 cycles and 6 samples in the range [61:63].

The ARM1176JZF-S Technical Reference Manual specifies a 3 cycle load-to-use latency for a hit in the L1 data cache. The measured value (9 cycles) is biased by two factors:

  • The MRC instructions that capture the before and after cycle counts take time to execute. If we remove the load instruction that chases the next pointer, all 1000 samples are 8 cycles.
  • The execution of the MRC instructions and the load instruction partially overlap in the pipeline.

The measurement bias is estimated to be (9 cycles - 3 cycles) = 6 cycles.

The 32KB and 64KB cases are the next timing tier with similar distributions. Here is the distribution for the 64KB case.

 9   49 ##### 
17    1 # 
61  320 ###########################
62  571 ################################################
63   11 # 
64    4 # 
65    1 # 
69    1 # 
70   27 ### 

Five percent of the samples hit in the L1 data cache. 92% of the samples are satisfied from primary memory. Ten samples (1%) are statistical outliers and are not shown in the interest of space. The 32KB and 64KB cases are within the coverage of the Main TLB. These accesses may be affected by MicroTLB misses, but it's reasonable to say that primary memory access takes (62 cycles - 6 cycles bias) = 56 cycles when the address is found in the Main TLB or MicroTLB.

The distribution for the test case at 128KB is bimodal and shows the effect of Main TLB misses on access time (latency).

  9    2 # 
 61   91 ########## 
 62  123 ##############
 70   18 ## 
123  465 ###############################################
124  118 #############
126   63 #######
137   21 ### 
138    8 # 
140    9 # 
145    3 # 
146    3 # 

35 outliers above 146 cycles (3.5% of the samples) are not shown. The peak at [61:62] consists of primary memory accesses where the address hits in either the MicroTLB or Main TLB. The peak at [123:137] contains primary memory accesses where the address misses in the Main TLB. In that case, the hardware initiates a page table walk which requires at least one additional primary memory access. The slower accesses are almost exactly twice as long as the faster accesses that hit in the TLBs. This is strong evidence for a primary memory latency in the range of 55 to 62 cycles.

The distributions for the remaining test cases are quite similar. Virtually all primary memory accesses miss the Main TLB and cause a page table walk. Here is the distribution for the 1MB case.

113    4 # 
116  224 ###############################################
117   23 ######
118   59 ############
119  129 ##########################
120    5 ## 
121  224 ###############################################
122  185 #######################################
125    4 # 
130    9 ### 
132    7 ## 
133   18 ##### 
135    5 ## 

It shows a broad cluster around 122 cycles. 47 outliers above 135 cycles (4.7%) are not shown.

Our results are summarized in the table below. Six cycle measurement bias is subtracted from the measured latency times in order to estimate the actual access time.

Level           Access time  Condition
--------------  -----------  --------------
L1 data cache     3 cycles   Data cache hit
Primary memory   56 cycles   No TLB miss
Primary memory  116 cycles   Main TLB miss

Memory hierarchy access time (latency)

A few closing thoughts

Memory requests from the CPU are routed around the BCM2835 L2 cache. One has to wonder about the L2 latency time and how much the L2 cache would affect overall performance. There is a big spread between 3 cycles and 56 cycles. The L2 could also speed up Main TLB fills by caching recently used page table entries.

Average latency is not always a good performance indicator. The 128KB test case has a bimodal distribution where 23% of the samples access primary memory without a Main TLB miss and 69% of the samples incur a Main TLB miss. The average latency is 110 cycles (adjusted for bias), which is close to the measured 116 cycles. If, however, we were analyzing a program with more pronounced bimodal behavior (a 50-50 split), the average would fall between the two peaks in the distribution. The average would either overstate or understate performance.

Finally, the measurements show the painful effect of Main TLB misses. Data placement and memory access patterns are critical to good program performance.

Copyright © 2013 Paul J. Drongowski