Textbook matrix multiplication (part 1) - Sand, software and sound

Source: Makefile, naive.c, readme
Common code: test_common.h, test_common.c, rpi_pmu.h, rpi_pmu.c

Now that we have a kernel module and user-space functions to use the Raspberry Pi (ARM11) performance counters, let’s take some measurements!

Sure, matrix multiplication is a bit of a trivial example, but there are a number of reasons why it’s a good place to start.

Matrix multiplication has only one hot-spot and is easy to analyze.
Matrix multiplication runs the same way every time and the event counts and execution statistics are very repeatable.
Execution time and problem size are easy to adjust — just change the size of the matrices.
The machine code for the inner loops is straight-line and relatively short making it easy to do manual instruction counting.
Performance depends heavily on memory access speed and pattern.
The textbook (naive) algorithm has known performance divots.
The loop nest interchange algorithm demonstrates a quickly achieved speed-up.

As usual, we provide the source code for the naive (textbook) implementation and the loop nest interchange version.

Naive matrix multiplication

Our example program multiplies two square matrices (arrays matrix_a and matrix_b) produces a square result matrix (array matrix_r). The matrix elements are each four byte floating point values. The matrix (array) size is determined by the symbolic constant MSIZE.

#define MSIZE     500
int matrix_size = MSIZE ;

float matrix_a[MSIZE][MSIZE] ;
float matrix_b[MSIZE][MSIZE] ;
float matrix_r[MSIZE][MSIZE] ;

The function initialize_matrices() initializes the arrays matrix_a and matrix_b with random values between 0.0 and 1.0. It initializes the elements of the result array matrix_r to 0.0.

The function multiply_matrices() is where the action is. This function consists of three nested loops. The two outer loops (index variables i and j) determine the row and column of the result element. The innermost loop (index variable k) walks the row in matrix_a and the column in matrix_b to be scalar multiplied. The index variable k changes the fastest. The result is stored in the element of matrix_r selected by i and j.

void multiply_matrices()
{
  int i, j, k ;

  for (i = 0 ; i < MSIZE ; i++) {
    for (j = 0 ; j < MSIZE ; j++) {
      float sum = 0.0 ;
      for (k = 0 ; k < MSIZE ; k++) {
          sum = sum + matrix_a[i][k] * matrix_b[k][j] ;
      }
      matrix_r[i][j] = sum ;
    }
  }
}

This is the algorithm which we have all seen in math textbooks. It mirrors the mathematical definition of matrix multiplication in linear algebra. We call this the naive algorithm because, as we’ll see, the algorithm has a few known performance issues.

I decided to write three additional functions to count ARM11 performance events while multiplying the matrices together. These functions call the user-space performance counter functions to start and stop counting.

void measure_cpi()
{
  initialize_matrices() ;

  start_counting(ARMV6_EVENT_INSTR_EXEC, ARMV6_EVENT_CPU_CYCLES) ;

  multiply_matrices() ;

  stop_counting() ;
}

void measure_cache()
{
  initialize_matrices() ;

  start_counting(ARMV6_EVENT_DCACHE_CACCESS, ARMV6_EVENT_DCACHE_MISS) ;

  multiply_matrices() ;

  stop_counting() ;
}

void measure_tlb()
{
  initialize_matrices() ;

  start_counting(ARMV6_EVENT_DTLB_MISS, ARMV6_EVENT_MAIN_TLB_MISS) ;

  multiply_matrices() ;

  stop_counting() ;
}

The three functions are:

measure_cpi() measures executed instructions and processor clock cycles.
measure_cache() measures data cache accesses and misses.
measure_tlb() measures translation lookaside buffer misses.

All of the events are interesting and useful, but the ARM11 has only two configurable counters. Thus, we need three separate experimental runs to measure everything. A command line option (one of -i, -c, and -t) chooses the events to measure. If no option is given, then the program will measure cycles per instruction (CPI). The program writes the events counts and other information to a file named “naive.txt”.

Let’s run it!

First, we need to load the kernel module aprofile to enable user-space performance counter access.

sudo modprobe aprofile

If you don’t do this and you run the example program (named naive), you will see:

> ./naive Illegal instruction

“>” is the command line prompt (above). Don’t type it. Once the kernel module is loaded, user-space counter access remains granted until the module is unloaded or the system is rebooted.

After loading the kernel module, run naive and then take a look at the output in the file named “naive.txt”.

***************************************************************
Naive matrix multiplication
Thu May 23 16:08:18 2013
***************************************************************
System information
  Number of processors:     1
Run information
  Outer loop instructions:  4000
  Middle loop instructions: 2250000
  Inner loop instructions:  1125000000
  Total instructions:       1127254000
  Estimated cycles:         11200000000
  Cycles (scaled by 64):    175000000
Performance Monitor events
  Cycle Counter: 181984943  Processor cycles (scaled by 64)
  Counter 0:     1190897822  Executed instruction
  Counter 1:     3057101795 (OVERFLOW)  CPU cycles

Run execution summary
  User time:                  16.490 sec
  System time:                 0.030 sec
  Elapsed time:               16.000 (0.000001 sec resolution)

The program took roughly 16 seconds to execute. The kernel measured 16.49 user CPU seconds.

By default (no command line option), naive measures processor cycles and executed instructions. The processor cycles in the cycle counter are scaled by 64. Thus, each count in this register represents 64 clock cycles. The actual number of cycles is 64 times the final event count. The processor cycles in Counter 1 are not scaled. Please note that Counter 1 overflowed. It’s final value is useless and should be ignored. All of the counters are only 32 bits long and are subject to overflow. Unless cycles are scaled, a 32 bit counter overflows in 6.14 seconds. The user-space routines detect and flag counter overflows.

The raw event counts alone are not always meaningful. So, we typically compute one or more ratios that are easier to understand and interpret. In this case, let’s compute cycles per instruction (CPI).

CPI = (64 * (Scaled clock cycles)) / (Executed instruction)
    = (64 * 181,984,943) / 1,190,897,822
    = 11,647,036,352 / 1,190,897,822
    = 9.78 cycles per instruction

In an ideal world, the processor would execute one or more instructions per cycle and CPI would be 1.0 or less. We probably can never attain a CPI of 1.0 for matrix multiplication due to the amount of memory traffic. However, 9.78 cycles per instruction is relatively big and something is slowing down the program.

Memory size, access and performance

One important consequence of memory circuit design is that “You can have fast memory and you can have high density (high capacity) memory, but you cannot have fast high density memory.” One might also add power considerations, but the consequence is the same. The CPU registers are really fast and can read/write a value in one processor clock cycle. Primary memory cells are small, but quite slow. A read/write to primary memory takes tens, if not hundreds, of processor clock cycles. On a bit-by-bit basis, it takes more chip area to implement a register cell (several transistors, higher power) than a dynamic memory cell (one transistor, lower power).

In a way, each kind of memory is suited to its purpose. Fast registers don’t keep the arithmetic-logic unit waiting. Ultra small size dynamic memory cells let us implement high capacity storage in a modest amount of physical space.

On the other hand, the difference in processor speed and primary memory speed is the source of many performance issues and considerations. For example, compilers try to make maximal use of fast CPU registers and to keep as much frequently used data in registers as possible. Good register use leads to higher execution speed since data are readily available and immediately accessible for computations and decisions. The program slows down, however, when data are not in registers and must be read (or written) to primary memory.

**Typical memory hierarchy**
Secondary storage
Primary memory
Level 2 (L2) cache memory
Level 1 (L1) cache memory
CPU registers

Computer architects mitigate the speed/space trade-off by dividing the memory system into a hierarchy of memory subsystems. Storage inside the CPU, the registers, is small and very fast (one cycle per read/write). The CPU and the registers are fed by a relatively larger and slower level 1 (L1) cache memory with a load-to-use latency of a few cycles. The level 1 cache memory is fed by an even larger and slower level 2 (L2) cache memory. Finally, the level 2 cache memory is fed by much bigger and slower primary memory.

The hardware tries to a anticipate when data are needed and bring the data into the next lower level of the hierarchy. Fortunately, programs have certain characteristics, spatial and temporal locality, that hardware can exploit.

Spatial locality: If a program accesses a memory location, it will likely access a neighboring memory location.
Temporal locality: If a program accesses a memory location, it will likely access the same location again within a short period of time.

Instruction fetch is a good example. The program counter (PC) marches sequentially through instructions (locations) in memory until the processor hits a branch instruction. If the branch forms a loop, then the same recently used instructions (locations) are fetched and executed within a short period of time.

You probably noticed that we didn’t mention the top-most level in the hierarchy: secondary storage such as disk or flash memory. This level represents virtual memory where each program executes within its own virtual address space. Thanks to Linux and the hardware memory management unit (MMU), the virtual address space can be larger than the physical size of primary memory. Each virtual address space is broken into 4KB pages. The full virtual program image is maintained in secondary storage and only the most recently used pages are kept in primary memory. Infrequently used pages are kept at the ready in secondary storage. Of course, access is really slow when a page is brought into primary memory (thousands of processor cycles) to satisfy a demand.

Whenever a load or store instruction accesses virtual memory, the virtual address must be translated to a physical address (the physical location where the information is stored). The mapping information is maintained in a page table that describes each and every virtual memory page. The page table itself is quite large, but let’s assume that it is stored in primary memory. If a table look-up is performed for every translation, then a simple read, for example, would require two memory operations: one to read the mapping information and one to read the actual data item. Computer architects speed up translation by caching the mapping information in a small, fast memory called a translation lookaside buffer (TLB). Like the data/instruction caches, the TLB may be divided into two levels. The architects assume that access to a page will be followed by many additional accesses to the same page (locality).

This introduction to memory system behavior is a gross simplification. Real hardware has many tricks to prefetch instructions and data and to route them to where they are needed most! These special cases make certain kinds of forecasting a living Hell…

Measuring memory performance

Now, let’s run naive twice more to count data cache (-c option) and TLB events (-t option). We obtain the following event counts:

Performance Monitor events
  Cycle Counter: 185457520  Processor cycles (scaled by 64)
  Counter 0:     16952312  Data cache access (cacheable)
  Counter 1:     3211874  Data cache miss
Performance Monitor events
  Cycle Counter: 182076277  Processor cycles (scaled by 64)
  Counter 0:     809154  Data MicroTLB miss
  Counter 1:     612688  Main TLB miss

Once again, the raw event counts are not especially informative. Here are a few useful ratios to calculate:

Data cache access rate  = (Data cache access) / ((Executed instructions)/1000)
Data cache miss rate    = (Data cache miss) / ((Executed instructions)/1000)
Data cache miss ratio   = (Data cache miss) / (Data cache access)
Data MicroTLB miss rate =  (Data MicroTLB miss) / ((Executed instructions)/1000)
Main TLB miss rate      =  (Main TLB miss) / ((Executed instructions)/1000)

The access and miss rates are expressed as events per thousand instructions, abbreviated “PTI”. The rates are more intuitive than events per instruction. They better express the frequency at which the events are occurring in the program. The data cache miss ratio expresses the portion of data cache accesses which result in a miss.

Earlier, we measured 1,190,897,822 executed instructions or 1,190,898 thousand instructions. Plugging in the event counts, we get the following rates and miss ratio:

Data cache access rate  = 16,952,312 / 1,190,898 = 14.23 PTI
Data cache miss rate    = 3,211,874 / 1,190,898  =  2.70 PTI
Data cache miss ratio   = 3,211,874 / 16,952,312 =  0.19
Data MicroTLB miss rate = 809,154  / 1,190,898   =  0.68 PTI
Main TLB miss rate      = 612,688  / 1,190,898   =  0.51 PTI

19% of the counted accesses miss in the L1 data cache. The reciprocal of a rate expresses how often an event is observed and counted.

Data cache access freq  = 1000 / 14.23 =  1 access per 70 instructions
Data cache miss freq    = 1000 /  2.70 =  1 miss per 370 instructions
Data MicroTLB miss freq = 1000 /  0.68 =  1 miss per 1,470 instructions
Main TLB miss freq      = 1000 /  0.51 =  1 miss per 1,960 instructions

It’s important to emphasize that ARM11 microarchitectural events are counted here, not architectural events. Take data cache accesses, for example. The hardware counts nonsequential accesses to the data cache. The number of these events depends upon the flow and sequence of read/write operations and the associated addresses through the ARM11 Load Store Unit. This event does not have the same meaning as architectural load/store operations that are executed by the program running on an abstract ARM ISA (instruction set architecture) engine. Only the first nonsequential access to a cache line is counted so that multiple reads/writes to the same cache line are counted as a single access. Thus, the cache miss ratio is an estimate, but maybe not a very good one.

Further, event counts may be affected by power-saving measures implemented in the hardware. Sequential reads to the same cache line do not access the cache tag RAM and, presumably, do not produce hit/miss status or increment a counter.

Therefore, you need to take these rates and miss ratio with a grain of salt. Your mileage may vary (YMMV). The event measurements provide clues to assist code tuning, but should not be accepted at face value. An event may be called “data cache access” or “data cache miss”, but it measures a much more complicated hardware condition than the textbook definitions of these events.

Finally, you may be wondering about L2 cache. The Broadcom BCM2835 at the heart of the Raspberry Pi has a 128KB L2 cache. However, the L2 cache is dedicated to the VideoCore GPU and is not a factor in a program’s direct access to memory. We are ignoring the L2 for that reason.

Extra credit

Here’s an experiment to try on your own. Change the matrix dimension from 500 elements per row and column to 512 elements. The TLB misses become (address) conflict misses instead of capacity misses. Did the overall execution time increase or decrease? How did the event counts, miss ratio and rates change?