Faster matrix multiplication (part 2 of 2)

Part 2 of two parts on matrix multiplication demonstrates a fast matrix multiplication program. The algorithm is a simple transformation of the textbook algorithm — the olde loop nest interchange. The transformation changes the slow access pattern to one of the arrays so that the program steps sequentially through the array elements in memory. Execution time speeds up from about 16 seconds elapsed time to 6 seconds. Not bad for a few minutes work!

All of the key memory-related performance events are improved since the access pattern is a better fit with the underlying memory microarchitecture. The analysis shows that we need to be careful when interpreting the Data Cache Access event because this event counts nonsequential memory accesses instead of all level 1 DC accesses or architectural loads and stores.

Part 2 also discusses operation or instruction counting to analyze program complexity at a micro-level. I like to look at the assembler code generated by the compiler to see if there are any potential speed-ups. The article shows how to look at the assembler code using the GCC -S option and using the objdump program. I use instruction counting to check the operation and meaning of performance events like the ARM11 Executed Instructions event.

The Broadcom BCM2835 in the Raspberry Pi has an integer core and a Vector Floating Point (VFP) coprocessor. The VFP operates concurrently with the integer core. In fact, it operates quite independently and only synchronizes with the integer core at a few well-defined points. VFP instructions are allowed to complete out of order, which allows for greater speed, but makes FP exceptions somewhat imprecise. (Now exactly where did that underflow/overflow occur?) The VFP coprocessor has 32 registers of its own, which are organized as four 8-register banks. GCC uses the coprocessor for scalar floating point arithmetic, but doesn’t exploit any parallelism.

The VFP operates on short vectors in a register bank. Potentialy, the VFP coprocessor could be exploited to further speed up matrix multiplication. One possibility is to stream incoming data as a four-wide stripe through an array and operate on four elements at once. Or, stream four elements at a time from a single row/column. Take a look at the VFP Math Library.

It’s not all good news, however. The VFP coprocessor is not a true single instruction, multiple data (SIMD) engine. (It’s similar to an old school short vector architecture.) It only has a single floating multiply/accumulate (FMAC) pipeline. A true SIMD would have four FMAC units. Also, computations are relatively difficult to set up and stage. Computations must be double buffered where the integer Load Store Unit (LSU) is filling one register bank while the coprocessor is performing computations in a different register bank. Further, GCC vectorization doesn’t appear to support VFP.

ARM must have gotten the message from its users. Later processors implement NEON SIMD and just enough VFP for the sake of legacy compatibility. The Beaglebone Black (ARM Cortex-A8) has NEON and I’m looking forward to trying it out. GCC vectorization supports NEON, too, and it’s a whole lot easier to let the compiler vectorize your program for you than to write vector code yourself!

The BCM2835 also has the VideoCore GPU for SIMD computation. There are a bunch of folks who are reverse engineering the GPU in order to use it for general purpose computation (GPGPU). Have it at, guys and gals!

Even if VFP is an orphan, the coprocessor has 32 registers where you can stash data. Maybe you can find a way to make use of these extra registers? Side-to-side access (integer/floating) is pretty fast.