In my last post, I used a simple pointer chasing loop to characterize the different levels of the BCM2835 (Raspberry Pi) memory hierarchy. I ran the pointer chasing loop for a range of linked list sizes, where the list size determines the storage level to be exercised. For example, a linked list occupying 16KB or less causes the chase loop to exercise the level 1 (L1) data cache. I measured hardware performance events for each test case and summarized the results in a table. The results clearly show three different timing tiers: the L1 data cache, access to primary memory without a TLB miss, and access to primary memory with a Main TLB miss. A miss in the Main TLB is quite slow because the hardware page table walker must read page mapping information from primary memory in order to update the TLB and complete the memory operation in flight.
The one characteristic missing from the analysis, however, is the estimated access time for each level. Sure, the execution times clearly break the test cases into tiers, but elapsed execution time does not really measure the access (latency) time itself.
In the next experiment, I estimate the access time. (See the source code in latency.c.) The approach is essentially unchanged from the first experiment: use pointer chasing on lists of different sizes to hit in particular levels of the memory hierarchy. Like the first experiment, the program (named latency) successively traverses the same linked list many times (default: 1000 traversals). Part way through each traversal, the loop chooses a chase operation and measures its execution time. The sampled execution time is saved temporarily in an array until all traversals are complete. Then, the program writes the execution times to a file named samples.dat. The samples are aggregated into a histogram which shows the distribution of the times.
The sampling technique lets each full traversal get underway before taking a measurement in order to “warm up” the pipeline, cache and TLBs. Taking a single sample per traversal avoids major cache and TLB pollution due to the bookkeeping needed to take and save samples. The measured execution times should faithfully represent access time (plus some measurement bias).
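To make the overall flow concrete, here is a rough sketch of the measurement driver. It is not the actual code in latency.c; the helper name armv6_reset_and_enable_ccr(), the choice of sample point and NUM_TRAVERSALS are placeholders of my own, while CacheLine and sample_cycles() are the type and function shown below.

#include <stdio.h>
#include <stdint.h>

#define NUM_TRAVERSALS 1000              /* default: 1000 traversals */

void run_measurements(CacheLine* list, int list_length)
{
    uint32_t samples[NUM_TRAVERSALS] ;
    int i ;

    for (i = 0 ; i < NUM_TRAVERSALS ; i++) {
        /* Reset and enable the CCR so it cannot overflow within one traversal. */
        armv6_reset_and_enable_ccr() ;

        /* Traverse the whole list, sampling a single chase operation partway
         * through, after the pipeline, caches and TLBs have warmed up. */
        samples[i] = sample_cycles(list, list_length / 2) ;
    }

    /* Save the raw samples; a separate step aggregates them into a histogram. */
    FILE* file = fopen("samples.dat", "w") ;
    if (file != NULL) {
        for (i = 0 ; i < NUM_TRAVERSALS ; i++)
            fprintf(file, "%u\n", (unsigned int) samples[i]) ;
        fclose(file) ;
    }
}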
The program measures execution time in cycles. It configures and enables the ARM1176 Cycles Counter Register (CCR) as a free-running cycles counter. The program resets and enables the CCR before starting a traversal, thus avoiding a counter overflow. (A full traversal completes before the CCR overflows.) The pointer chasing loop reads the CCR before chasing the next pointer and reads the CCR after chasing the pointer. The difference between the after and before counts is the estimated execution time of the chase operation. The chase operation is implemented as a single ARM load instruction, so the instruction execution time is a biased estimate of the memory access time.
Here is the code for the function sample_cycles() and the pointer chasing loop within. The function takes two arguments: a pointer to the head of the linked list and an integer value that chooses the chase operation to be sampled.
uint32_t sample_cycles(CacheLine* linked_list, int n)
{
    register CacheLine* item ;
    register uint32_t before, after, cycles ;
    register int count = n ;

    cycles = 0 ;

    for(item = linked_list ; item != NULL ; ) {
        if (count-- == 0) {
            /* Sample this chase: read the CCR before and after the load. */
            before = armv6_read_ccr() ;
            item = item->ptrCacheLine[0] ;
            after = armv6_read_ccr() ;
            cycles = after - before ;
        } else {
            /* Ordinary (unsampled) chase operation. */
            item = item->ptrCacheLine[0] ;
        }
    }

    return( cycles ) ;
}
On each iteration, the loop body checks and decrements the sampling count maintained in the variable count. If the sampling count is not zero, then the loop quickly performs an ordinary chase operation. If the count is zero, then the code snaps the before and after CCR values by calling armv6_read_ccr(), an in-line function that performs an ARM MRC instruction to read the CCR. The difference is computed and saved in the variable cycles until it can be returned by the function after the traversal completes. The chase loop exits when it reaches the NULL pointer at the end of the linked list.
Local variables are allocated to general registers in order to avoid extraneous cache and TLB references that would perturb measurements. I checked the compiled machine code to make sure that the compiler accepted the register hints and actually allocated the local variables to registers. I also made sure that the function armv6_read_ccr() was expanded in-line.
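For reference, reading the CCR boils down to a single MRC instruction, and the reset-and-enable step goes through the Performance Monitor Control Register. The sketch below shows one way to write these helpers. It is not necessarily the exact code in latency.c, and the name armv6_reset_and_enable_ccr() (used in the driver sketch above) is my own; the coprocessor register numbers and control bits come from the ARM1176JZF-S TRM.

#include <stdint.h>

/* Read the ARM1176 Cycle Counter Register: MRC p15, 0, <Rd>, c15, c12, 1 */
static inline uint32_t armv6_read_ccr(void)
{
    uint32_t ccr ;
    asm volatile("mrc p15, 0, %0, c15, c12, 1" : "=r" (ccr)) ;
    return( ccr ) ;
}

/* Reset and enable the cycle counter via the Performance Monitor Control
 * Register (c15, c12, 0): bit 0 enables the counters, bit 2 resets the
 * cycle counter. (Helper name is mine, not from latency.c.) */
static inline void armv6_reset_and_enable_ccr(void)
{
    uint32_t pmcr ;
    asm volatile("mrc p15, 0, %0, c15, c12, 0" : "=r" (pmcr)) ;
    pmcr |= (1 << 0) | (1 << 2) ;
    asm volatile("mcr p15, 0, %0, c15, c12, 0" : : "r" (pmcr) : "memory") ;
}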
The latency program needs to have user-space access to the ARM1176 performance counters. You must load the aprofile kernel module before running latency. Otherwise, you will get an “Illegal instruction” exception when the program attempts to configure the performance counters.
Here is a histogram showing the distribution of execution time samples when the linked list is 8KB in size. The 8KB test case consistently hits in the L1 data cache.
9 1000 ################################################
The first column is the execution time in cycles. The second column is the number of samples with the value shown in the first column. All 1,000 samples are 9 cycles.
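The exact aggregation tool doesn't matter. Assuming samples.dat holds one decimal cycle count per line (an assumption about the file format on my part), a short C program along the following lines is enough to produce a histogram in this format.

#include <stdio.h>

#define MAX_CYCLES 1024   /* arbitrary upper bound for this sketch */

int main(void)
{
    unsigned int value, max_count = 0 ;
    unsigned int counts[MAX_CYCLES] = { 0 } ;
    FILE* file = fopen("samples.dat", "r") ;
    if (file == NULL) return 1 ;

    /* Count how many samples have each cycle value. */
    while (fscanf(file, "%u", &value) == 1) {
        if (value < MAX_CYCLES && ++counts[value] > max_count)
            max_count = counts[value] ;
    }
    fclose(file) ;

    /* Print "cycles  count  bar", scaling the longest bar to 48 characters. */
    for (value = 0 ; value < MAX_CYCLES ; value++) {
        if (counts[value] == 0) continue ;
        int bar = (counts[value] * 48 + max_count - 1) / max_count ;
        printf("%4u %6u ", value, counts[value]) ;
        while (bar-- > 0) putchar('#') ;
        putchar('\n') ;
    }
    return 0 ;
}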
Chapter 16 of the ARM1176JZF-S Technical Reference Manual (TRM) describes instruction cycle timings. A load from L1 data cache has a 3 cycle load-to-use latency (access) time. Taking 3 cycles as the actual access time, I estimate a measurement bias of (9 cycles – 3 cycles) = 6 cycles due to the pipelined execution of the ARM MRC instructions that read the CCR and the load instruction that chases the linked list pointer.
The 32KB and 64KB test cases form the middle timing tier. Both test cases are within the coverage of the Main TLB (288KB) and memory accesses hit in either the Micro TLB or Main TLB. The distributions are similar. Here is the distribution for the 64KB case.
9 49 #####
17 1 #
61 320 ##########################
62 571 ################################################
63 11 #
64 4 #
65 1 #
69 1 #
70 27 ###
There is a strong peak in the narrow range of [61:62] cycles. The nonbiased estimate for access time to primary memory with a TLB hit is (62 cycles – 6 cycles bias) = 56 cycles.
The test cases at 256KB and larger form the third timing tier. These cases exceed the Main TLB coverage. (The 256KB case nearly fills the Main TLB, and the performance event data place it in the third, longest-running tier.) Load operations miss in the Main TLB, forcing a hardware page table walk. The table walker performs at least one additional primary memory read operation in order to obtain page mapping information for the TLB. Here is the distribution for the 1MB case.
113 4 #
116 224 ##############################################
117 23 ######
118 59 ##############
119 129 ###########################
120 5 ##
121 224 ##############################################
122 185 ############
125 4 #
130 9 ###
132 7 ##
133 18 #####
135 5 ##
The samples are clustered in the range of [116:122] cycles. The nonbiased estimate for access time to primary memory with a Main TLB miss is (122 cycles – 6 cycles bias) = 116 cycles. This is approximately twice the access time of a load that hits in either the data MicroTLB or the Main TLB.
The following table summarizes the access time for each level of the memory hierarchy.
Level          | Access time | Condition
CPU register   | 1 cycle     |
L1 data cache  | 3 cycles    | Data cache hit
Primary memory | 56 cycles   | No TLB miss
Primary memory | 116 cycles  | Main TLB miss

Memory hierarchy access times (latencies)
Where is the L2 cache? The BCM2835 dedicates the L2 cache to the Broadcom VideoCore GPU. Memory operations from the CPU are routed around the L2 cache. The L2 cache is not a factor here and is not part of the analysis.
From these estimates, we can see why the textbook (naive) matrix multiplication program is so slow. The textbook algorithm does not walk sequentially through one of its operand arrays. The long memory stride causes a large number of Main TLB misses, and each access that misses in the Main TLB takes roughly twice as long, which shows up directly in the program's elapsed time.
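To make that concrete, here is the textbook triple loop (the dimension N and the element type are arbitrary choices for illustration). The inner product reads a[i][k] sequentially along a row, but b[k][j] jumps an entire row of N elements on every iteration. For large matrices, that column walk through b sweeps far more memory than the Main TLB's 288KB of coverage, so its loads keep missing in the Main TLB and each miss pays roughly the doubled access time measured above.

#define N 512    /* arbitrary matrix dimension for illustration */

/* Textbook (naive) matrix multiplication: c = a * b */
void naive_matrix_multiply(float a[N][N], float b[N][N], float c[N][N])
{
    int i, j, k ;

    for (i = 0 ; i < N ; i++) {
        for (j = 0 ; j < N ; j++) {
            float sum = 0.0f ;
            for (k = 0 ; k < N ; k++) {
                /* a[i][k] walks sequentially through memory; b[k][j]
                 * strides by a full row (N * sizeof(float) bytes) per
                 * iteration, the long stride that defeats the Main TLB. */
                sum += a[i][k] * b[k][j] ;
            }
            c[i][j] = sum ;
        }
    }
}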