Faster matrix multiplication (part 2)

Source: Makefile, interchange.c, readme
Common code: test_common.h, test_common.c, rpi_pmu.h, rpi_pmu.c

Part 2 of the matrix multiplication example introduces and demonstrates a faster algorithm. The faster algorithm walks sequentially through the elements of the operand arrays, thereby reducing the number of performance robbing data cache and TLB misses.

Why is the textbook algorithm slow?

The number and frequency of data cache and TLB miss events provide an important clue for our investigation. They tell us to analyze the memory access speed and patterns in the naive matrix multiplication program.

People in high-performance computing (HPC) often refer to the data streams which drive a program. Matrix multiplication has two input streams consisting of elements from the rows in matrix_a and the columns in matrix_b. The elements from a row and a column must be combined in order to form a single element in the result matrix, matrix_r. The input streams crave memory bandwidth and the overall computation does not progress until the required elements from these streams flow into the CPU. The output stream to matrix_r is a much lower bandwidth stream and has a lesser effect on overall program performance. The program is limited by the slowest input stream.

C language arrays are stored in row major order in memory. The first row is followed by the second row, and so forth. The elements within a row are stored in sequential array and memory order. The elements of the two dimensional array example below:

float example[3][3] = { {1, 2, 3}, {4, 5, 6}, {7, 8, 9} } ;

are arranged contiguously in primary memory in the following way:

1, 2, 3, 4, 5, 6, 7, 8, 9

The nested loops below step through the array elements in sequential memory order.

for(i = 0 ; i < 3 ; i++)
  for(j = 0 ; j < 3; j++)
    sum += example[i][j] ;

Note that the second index (the column index j) changes the most rapidly. Python, by the way, also stores arrays in row major order. Some other languages, most notably FORTRAN, store arrays in column major order where the columns are contiguous and the first index (the row index) changes the fastest.

In part 1, we mentioned that computer architects have tuned the memory design to favor sequential memory access. Therefore, we expect good performance from C code that steps sequentially through each row in an array. Check out the nested loops in the textbook algorithm:

void multiply_matrices()
  int i, j, k ;

  for (i = 0 ; i < MSIZE ; i++) {
    for (j = 0 ; j < MSIZE ; j++) {
      float sum = 0.0 ;
      for (k = 0 ; k < MSIZE ; k++) {
          sum = sum + matrix_a[i][k] * matrix_b[k][j] ;
      matrix_r[i][j] = sum ;

The index k within the innermost loop changes the fastest. The innermost loop walks sequentially through the elements of matrix_a in memory. In fact, the machine code keeps a pointer to the current element of matrix_a and increments the pointer by 4 to advance to the next float element in the array. (Please recall that a float value is four bytes in size.)

This kind of sequential access is good from the hardware's point of view. ARM1176 cache memory contents are organized into 32 byte lines. The program touches each 4 byte float in a cache line before advancing to the next line (possibly incurring a data cache miss). Efficiency of cache line access is 32/32 = 100%. The hardware may even be able to anticipate access to the following cache line, preload the line, and avoid a compulsory cache miss. This kind of sequential access is also good for the TLB. The program uses every byte of a 4KB page before advancing to the next page in memory (possibly incurring a TLB miss).

Unfortunately, the innermost loop leaps across the rows of matrix_b in memory. The machine code keeps a pointer into matrix_b and increments the pointer by 2,000 (500 row elements * 4 floats) to advance to the next element in the array. The program uses only 4 bytes from a cache line before jumping to a different cache line. Efficiency of cache line access is only 4/32 = 12.5%. Even worse, the innermost loop uses only 8 bytes in a 4KB page before advancing to the next page. The program quickly exhausts the available entries in the data MicroTLB and the Main TLB thereby forcing the hardware to read page mapping information from the page tables in slow primary memory. These reads are counted as TLB misses.

The input stream from matrix_b is choking performance. We need to change the access pattern to matrix_b in order to speed up the matrix multiplication program.

A faster algorithm

We made two small changes to the original matrix multiplication program. First, we interchanged the order of the two inner loops. The innermost loop now increments the index variable j and the middle loop increments the index variable k. The outermost loop remains the same (index variable i). Secondly, we now accumulate the partial sums directly into the result matrix. The index variables j and k, which are the column indices that step through the elements in a row, change the fastest.

void multiply_matrices()
  int i, j, k ;

  for (i = 0 ; i < MSIZE ; i++) {
    for (k = 0 ; k < MSIZE ; k++) {
      for (j = 0 ; j < MSIZE ; j++) {
        matrix_r[i][j] = matrix_r[i][j] +
          matrix_a[i][k] * matrix_b[k][j]) ;

These transformations preserve correctness. (Try writing a program that computes the matrix product both ways and compare the results!) Here is an informal argument for correctness/equivalence. Both programs step through all values of i, j and k albeit in a different order. Both programs compute exactly the same product terms, again in a different temporal order. The individual product terms are accumulated directly into the result matrix element to which they belong. Although the execution order is different, the final sum in each result element is the same because addition is both commutative and associative.

The transformation is called a loop nest interchange and we have named the program interchange after the new algorithm. The interchange program should be faster because it steps through the elements in the rows -- and primary memory -- in sequential order. Let's try it and find out!

Naive vs. interchange

I ran interchange and measured the execution time and performance events. The table below shows the measurements for both naive and interchange. The table lets us make a side-by-side comparison of the programs' behavior.

Metric Naive Interchange
Elapsed time 16.0s 6.0s
Scaled cycles 181,984,943 72,657,201
Instructions 1,190,897,822 1,145,963,533
Data cache access 17,588,053 6,220,726
Data cache miss 2,982,314 976,290
MicroTLB miss 809,154 211,476
Main TLB miss 612,688 34,229
CPI 9.78 CPI 4.06 CPI
Data cache access rate 14.77 PTI 5.43 PTI
Data cache miss rate 2.50 PTI 0.85 PTI
Data cache miss ratio 17% 16%
MicroTLB miss rate 0.68 PTI 0.18 PTI
Main TLB miss rate 0.51 PTI 0.03 PTI
Naive vs. interchange

The elapsed execution time is substantially reduced from 16 seconds to 6 seconds. The cycles per instruction (CPI) are more than cut in half. The raw numbers of L1 data cache, data MicroTLB and Main TLB events are greatly reduced including the event per thousand instruction (PTI) rates. The loop nest interchange transformation is clearly a win.

The number of data cache accesses, the data cache miss ratio and data cache access rate should raise an eyebrow, however. The raw number of L1 data cache accesses is lower for interchange. Also, the miss ratio is roughly the same. What's up? Please recall our cautionary statements about the definition of performance events from Part 1. The ARM1176 performance event 0x9 is called "Data cache access," but it really counts nonsequential data cache accesses. These events are a subset of all data cache accesses. The interchange program has an improved memory access pattern that increases the number of sequential data cache accesses. Thus, the raw number of nonsequential data cache access events is much lower. The miss ratio states the portion of misses among the nonsequential accesses.

The number of nonsequential data cache accesses is a useful notion. We really want our programs to step sequentially through memory in order to get good performance. Thus, we want to minimize nonsequential data cache access. However, the more limited behavior and meaning of this performance event doesn't let us compute the pure overall data cache miss rate, that is, all data cache misses divided by all data cache accesses.

Extra credit: Operation counting

We're all familiar (or should be familiar!) with the "big-O" complexity of algorithms. The matrix multiplication algorithms are O(N³), where N is the row and column dimension of the sqaure matrix arguments. We can also study space and time complexity at a finer granularity, sometimes called the microanalysis of programs. Microanalysis is a detailed, close look at the individual operations that make up a program. It uses knowledge about the size of data structures, loop counts and program structure to estimate the number operations (loads, stores, adds, multiplies, etc.) performed by a program. Fine grain analysis occasionally reveals local optimizations to further improve performance.

Even though C is regarded as a fairly low-level programming language, the C source code is still a relatively abstract expression of an algorithm. The C compiler and the optimization flags given to the compiler have enormous control over the machine level expression of the algorithm. Thus, we can learn a lot by taking a look at the machine code emitted by the compiler. It's also a great way to learn ARM assembly language!

There are two ways to obtain the assembly language code for a program using GCC and Linux. The first method tells the compiler to leave a copy of the assembly language file:

gcc -c -S -O2 interchange.c

The -S option asks the compiler to leave a copy of the assembly language code. The code can be found in the file interchange.s. The second method reads the executable binary program file and disassembles it into assembly language.

objdump -d interchange > interchange.asm

The objdump utility can generate all sorts of useful information about a program. The -d option requests a disassembly. In the command above, the disassembly is redirected into the file interchange.asm. The disassembly is somewhat different from the assembler file produced by the compiler. Generally, some information is lost in the final translation to binary.

Here is the assembly language code for the function multiply_matrices() as generated by the C compiler.

	@ args = 0, pretend = 0, frame = 0
	@ frame_needed = 0, uses_anonymous_args = 0
	@ link register save eliminated.
	stmfd	sp!, {r4, r5, r6, r7, r8, r9, sl}
	mov	r7, #0
	ldr	sl, .L16
	ldr	r5, .L16+4
	ldr	r8, .L16+8
	mov	r4, #2000
	mul	r6, r4, r7
	mov	r0, #0
	sub	ip, r6, #4
	add	ip, r8, ip
	add	r6, sl, r6
	add	ip, ip, #4
	mla	r1, r4, r0, r5
	flds	s13, [ip, #0]
	mov	r2, r6
	sub	r1, r1, #4
	mov	r3, #0
	add	r1, r1, #4
	flds	s15, [r2, #0]
	flds	s14, [r1, #0]
	add	r3, r3, #1
	cmp	r3, #500
	mov	r9, r1
	fmacs	s15, s13, s14
	fstmias	r2!, {s15}
	bne	.L11
	add	r0, r0, #1
	cmp	r0, #500
	bne	.L14
	add	r7, r7, #1
	cmp	r7, #500
	bne	.L10
	ldmfd	sp!, {r4, r5, r6, r7, r8, r9, sl}
	bx	lr

The labels .L10, .L14, and .L11 denote the tops of the three inner loops, respectively. The innermost loop begins at .L11. Register r3 is the loop counter. Register r1 points at an element in matrix_b and register r2 points at an element in matrix_r. The flds instructions load the two elements from their respective arrays. The floating point register s13 contains the element matrix_a[i][k]. The fmacs instruction multiplies matrix_a[i][k] times the element from matrix_b and accumulates the product into floating point register s15. The fstmias instruction stores the partial sum of products back into matrix_r and increments the pointer in register r2 by 4.

Try it: Can you puzzle out the code in the other two nested loops?

We can use operation (instruction) counting to estimate the number of instructions executed by the nested loops. Given the matrix dimension (500) and the loop nest, each instruction within the innermost loop is executed (500 * 500 * 500) = 125,000,000 times. There are nine instructions in the innermost loop, so the innermost loop executes (9 * 125,000,000) = 1,125,000,000 instructions.

The interchange program estimates the number of instructions performed by the outer, middle and innermost loops. It prints these estimates in the results file named "interchange.txt". Compare the estimate with the actual count as reported in Counter 0 (below) and see if the actual count is in the same ballpark as the estimate. This comparison gives us a sanity check on the performance counter and the executed instructions event.

Loop nest interchange matrix multiplication
Thu May 30 15:33:19 2013
System information
  Number of processors:     1
Run information
  Outer loop instructions:  4000
  Middle loop instructions: 2250000
  Inner loop instructions:  1125000000
  Total instructions:       1127254000
  Estimated cycles:         4725000000
  Cycles (scaled by 64):    73828125
Performance Monitor events
  Cycle Counter: 73335322  Processor cycles (scaled by 64)
  Counter 0:     1150281598  Executed instruction
  Counter 1:     398493328 (OVERFLOW)  CPU cycles

Run execution summary
  User time:                   6.740 sec
  System time:                 0.010 sec
  Elapsed time:                6.750 (0.000001 sec resolution)

The actual count does vary and deviate from the estimate. Why do think this is so? Could the variance and deviation be due to interrupts or other system activity that runs on the single processing core in the Raspberry Pi? (Hint: Tick, tick.)

The expected number of processor cycles is estimated by multiplying the elapsed time times the processor clock frequency (700MHz) and then dividing the product by 64 to obtain scaled cycles. The elapsed time is somewhat longer than the execution time of the matrix multiplication itself. Thus, the program tends to overestimate the expected number of cycles by a small amount.

It would be terrific if we could check the accuracy of the microarchitectural events like data cache access and data cache miss. We would need a precise model for the microarchitecture and the behavior of the program in the pipeline. Unfortunately, this kind of precise information is usually proprietary (part of a cycle-accurate simulator) and out of reach.

Copyright © 2013 Paul J. Drongowski