Raspberry Pi performance counters (part 1)

Finally, an example to show the Raspberry Pi performance counters in action. My friends will no doubt chuckle because the first example is an analysis of matrix multiplication. (“He always starts with matrix multiplication…”) Matrix multiplication is a good place to start because it is a small easy to build and easy to analyze program with a known performance issue. It’s a great way to get an intuitive feel for the performance events on a new, unfamiliar platform like the Raspberry Pi. I’ve analyzed this example on x86, SPARC, Itanium and Alpha, so I already have a fair bit of history with it.

Part 1 of the example shows how to use the Raspberry Pi performance counter kernel module and the user-space support functions. I collect performance event data for the infamous textbook implementation of matrix multiplication and define a few useful rates and ratios to help interpret the event counts. There is also a brief introduction to memory hierarchy in order to provide a little background for data cache and translation lookaside buffer (TLB) behavior.

I’m in the process of writing part 2, which explains and demonstrates an improve matrix multiplication program. The code for part 2 is already in the source area of this site.

After doing some comparative analysis, I strongly encourage you to read carefully the definitions of the ARM1176 performance events. The “data cache access” events, in particular, only count nonsequential data cache accesses. This important qualification affects the interpretation of performance measurements. In particular, you can’t compute a pure data cache miss ratio, that is, all data cache misses divided by all data cache accesses.

The descriptions of the ARM1176 performance events are a little bit sketchy. ARM did a better job describing the Cortex-A8 events, for example. Adopting a Zen attitude, the ARM1176 events are what they are, they will not change or be updated, and we need to accept them.

Performance events and Zen

It’s always interesting to get started with a new microarchitecture and the ARM 1176 inside of the Raspberry Pi is no exception.

Back when we were in school, we all dutifully went to computer architecture class and learned about memory hierarchy, read/write access, cache memory, and the translation lookaside buffer (TLB). A hit was a clean hit and a miss was, well, a miss.

Real world behavior of cache memory and the TLB is far more complicated. Computer designers study the behavior of benchmark and application programs in order to find behavioral patterns which the hardware can exploit for speed. This includes behavior like sequential access, fixed-length address strides, temporal and spatial locality. Then the designers build hardware which implements every trick in the book, all in the name of faster access to memory data and ultimately, faster programs. In the case of low-power machines like ARM, computer designers use behavioral patterns to turn off or not actively use functional components in order to save power. Inactive components don’t consume power, but they don’t generate observable signals either.

A real world performance event is a dynamic condition which occurs in the midst of this complicated hardware. An access or miss may be counted only for the first of multiple sequential reads/writes to the same cache line. Other undocumented internal microarchitectural conditions affect the event counts. Suddenly, it’s not so easy to interpret performance event counts armed with our textbook notions of cache access and cache miss. It may not even be possible to effectively compare the behavior of two versions of the same program (one version tuned, the other version untuned).

We can whine, whinge and kvetch about the limitations of real world performance events, the lack of documentation, and other shortcomings. At the end of the day, the performance events are what they are and we need to accept them. Therein lies the Zen of performance events.

The fine print in the TRM

I’ve been busy writing up the first example showing how to use the Raspberry Pi performance counter kernel module and the ARM11 performance counters. These write-ups always take longer than I think, so here’s a few small things that I’ve learned along the way.

Run, don’t walk, to the ARM web site and download a copy of the ARM1176JZF-S Technical Reference Manual. The TRM has the goodies and details about the ARM1176 core in the Raspberry Pi. This includes information about cache and TLB sizes, branch prediction, instruction timing and, of course, the performance counters and events. The TRM is essential reading if you want to know more about the processor at the heart of the Raspberry Pi.

Good as it is, the TRM doesn’t contain all of the information and in some cases, the descriptions are very low level and esoteric. Computer designers describe hardware from their point of view, which is not necessarily the perspective of a software engineer. In fact, the terminology is often quite foreign to the ears of a software engineer. The TRM is no exception.

The first question many people ask about the Raspberry Pi is “What happened to the L2 cache?” The L2 cache is not part of the ARM1176 core, so the TRM describes the interface to the L2 and is mum about the L2 itself. The best clues about the L2 cache appear at the beginning of the Broadcom BCM2835 ARM Peripherals manual. Broadcom has implemented a pretty decent graphics processing unit (the VideoCore) and are reluctant to release too many details about it lest competitors learn too much about their design. The diagram at the beginning of the peripherals manual shows the memory layout and bus structure of the BCM2835. There are actually two memory management units (MMU). The ARM MMU maps program virtual addresses to physical addresses and the VC/ARM MMU maps physical addresses onto the VC/CPU bus and real honest to goodness physical memory. The high order bits of the VC/CPU address determine the cacheable status of memory regions including, TA-DA!, the L2 cache. The footnote on page 6 says, “BCM2835 provides a 128KB system L2 cache, which is used primarily by the GPU. Accesses to memory are routed either via or around the L2 cache depending on senior two bits of the bus address.” So, under normal circumstances, memory reads/writes made by a program are not routed through the L2 cache. I suspect, and this is a guess, that the L2 cache boosts the bandwidth for the memory hungry GPU. Well, we may never really know.

The ARM1176 has two separate level 1 (L1) caches: a 16KB instruction cache and a 16KB data cache. I verified the cache size using the coprocessor’s Cache Type Register. The TRM is not always clear about the number and size of the translation lookaside buffers (TLB). There are two level 1 MicroTLBs. Each MicroTLB has ten entries. The MicroTLBs are backed by a Main TLB consisting of eight fully associative elements and a 64 element low-associativity store. Depending on how Linux uses the eight fully associative, lockable entries, as many as 72 entries are available for address translation.

The other characteristics that require some digging are the load-to-use latencies for the L1 cache and the TLBs. I’m just diving into the instruction timing information in Chapter 16 of the TRM. However, the load-to-use latency for a hit in the L1 data cache is three cycles. That is, the data from a load hitting in the L1 data cache is not available for 3 cycles once the load is issued. Dedicated datapaths called bypasses forward load data to other instructions in the pipeline.

The Load/Store Unit does not always block on an L1 data cache miss and supports hit under miss (HUM). The miss goes into a holding state/buffer and non-dependent instructions are allowed to execute. Up to three outstanding misses are allowed. This increases parallelism letting some computations proceed even when a load misses in the L1 data cache.

Finally, the ARM1176 is a single issue machine, that is, one instruction is issued at a time. This constraint simplifies the issue logic. The ARM1176 doesn’t need to implement any complicated issue rules to determine whether two or more instructions of a certain type can issue in the same cycle. Instructions are issued in-order, but out-of-order completion is allowed. out-of-order completion increases exploitable fine-grained parallelism.

Another neat feature of the ARM — nearly all instructions can be predicated. An enabling condition (a predicate) can be defined for an instruction. The enabling condition gates the execution of the instruction. Predication is cool and is used to eliminate branches. More about predication one of these days…

Hope you enjoyed this trip through the fine print!

Performance counter kernel module

As promised, I’ve described the design of a Linux loadable kernel module that allows user-space access to the Raspberry Pi (ARM 1176) performance counters. By the way, the design of the module is not specific to Raspbian Wheezy or even the Raspberry Pi for that matter. I believe that the kernel module could be used on the new Beagleboard Black (BBB) to enable user-space counter access on its ARM Cortex-A8 processor under Linux. I just ordered a BBB and will try out the code when possible. (Assuming quick delivery!)

The kernel module alone isn’t enough to measure performance events. In fact, the kernel module doesn’t even touch the counters. It merely flips a privileged hardware bit which lets user-space programs read and write the performance counters and control register. So, I have also written a few user-space C functions to configure, clear, start and stop the performance counters. An application program just needs to call a few functions to choose the events to be measured and start counting, to stop counting, to get the raw counts, and to print the event counts.

I have uploaded the source for both the kernel module (aprof.c) and the user-space functions (rpi_pmu.h and rpi_pmu.c). In addition, there is source for some utility functions that I like to use in benchmark programs (test_common.h and test_common.c). All of this is a work in progress and I will update the source when major enhancements or changes are made.

Speaking of source, I have found a way of organizing and storing source code through WordPress. WordPress is kind of security paranoid and doesn’t allow you to upload source code or even gzip’ed TAR files. I ran into this issue when I attempted to upload a make file and WordPress wouldn’t let me do it (with complaints about potentially malicious code and so forth). WordPress does let you post source for viewing, however.

So, I’ve added a Source menu item to the main menu. I want the menu structure below the Source item to operate like a browsable code repository. The first level of items below Source are projects, like the kernel module. The next level of menu items navigate into the source belonging to a project. Each make file and source file is a separate page. The source code is displayed using the SyntaxHighligher plug-in in order to keep indentation. No other formatting or highlighting is done just to keep things simple. I could cut and paste code from these pages, so I hope you can, too!

An introduction to performance tuning (and counters)

My latest page is an overview of performance tuning on ARM11. The Raspberry Pi is a nifty little Linux box, but it’s kind of slow at 700MHz. Therefore, I suspect that programmers will have an interest in tuning up application programs and making them run faster. Performance tuning is also a good opportunity to learn more about computer architecture and machine organization, especially the ARM1176 core at the heart of the Raspberry Pi and its memory subsystem.

The ARM1176 has three performance counters which can measure over 20 different microarchitectural events. One of these counters is dedicated to core clock cycles while the other two are configurable. The new performance tuning page has a brief overview of the counters and it has a table with the supported events.

The new page also describes two different use cases for the counters: caliper mode and sampling mode. Caliper mode counts the number of microarchitectural hardware events that occur between two different points in program execution. Caliper mode is good for measuring the number of data cache accesses and misses for a hot code region like a loop. The programmer inserts code to start counting at the beginning of the hot region and inserts code to stop counting at the end of the hot region. This is the easiest use case to visualize and to implement. It’s the approach that I’m taking with my first performance measurement software and experiments (a custom kernel module plus some user-space code). These experiments are almost finished and ready for write up.

Sampling is a statistical technique that produces an event profile. A profile shows the distribution of events across program instructions, routines, source lines, or modules. This is a good way to find hot-spots in a program where tuning is most beneficial. Sampling does not require modification to source.

Performance Events for Linux (informally called “PERF”) is the standard tool for program profiling on Linux. At the moment, PERF has a bug which prevents it from sampling hardware events. I’ve been looking into this problem, too, and hope to post some results. In the long-run, I want to post examples using PERF in order to help people tune up their programs on Raspberry Pi.

Building a loadable kernel module

I needed to design, implement and build a loadable kernel module in order to access the ARM11 performance counters from user-space. I will slowly roll out the design and code for the kernel module. But, first, I’ve posted my notes for building a loadable kernel module. It’s easier to explain the process of building the module and the internal design of the module if I separate the two discussions.

There were a few problems along the way. The kernel source for Raspbian Wheezy 3.6.11+ is not available using Synaptic. Only 3.6.9 is available through Synaptic. I needed to download the source for 3.6.11+ from github.com in order to match the installed Linux image. Next, I needed the module version information for 3.6.11+. Usually this information is built along with the kernel and is stored in the file named Module.symvers. Raspbian Wheezy takes 10+ hours to build on the Raspberry Pi, according to reports on the Web, so I didn’t want to undertake a long-running kernel build just to generate the version information. Fortunately, I could download Module.symvers from github.com, too.

I hope the next Raspbian Wheezy distro has all of the essentials for a module build — headers, version information, the whole she-bang. This would really help a brother out because many people are building custom hardware for the RPi and they need to build custom device drivers, too.

I’m currently in the process of writing pages on performance monitoring on the RPi. That discussion will include the design and code for my kernel module. The kernel module is about as simple as can be and is a kind of “Hello world” example. Please stay tuned.

ARM11 microarchitecture

You probably know by now that the Raspberry Pi uses an ARM processor. In particular, the Raspberry Pi model B uses the Broadcom BCM2835 system on a chip (SoC). The BCM2835 is a member of the ARM11 family. Its name is the ARM1176JZF-S. (Whew!)

Like all computers, the BCM2835 has an internal processor structure called its “microarchitecture”. The word “architecture” refers to the machine features that are visible to a programmer — things like the instruction set. The microarchitecture refers to the building blocks in the guts of the machine, or more properly, in the guts of a specific implementation (BCM2835) of an architectural family (ARM11 or ARMv6).

The microarchitecture can have a big effect on program performance. Compiler writers, for example, study the microarchitecture in order to build compilers that generate the best possible code for the microarchitecture. As we’ll see in later posts, application programmers can also take steps to tune their programs for the underlying microarchitecture. Tuning is important on Raspberry Pi because at 700 MHz, this machine is running its heart out!

Today, I added a page that summarizes the characteristics of the BCM2835 (ARM11) microarchitecture. Please check out the info! We will revisit this page when I discuss profiling and tuning.

Raspberry Pi tips and tricks

If you’re new to the world of Raspberry Pi, you should check out the page about Raspberry Pi tips and tricks. The page has all of the things that I did to configure Raspberry Pi and get started with the operating system, Raspbian Wheezy.

One of the most important things you should do is to configure the operating system for your geographic location or “locale”. The Raspbian Wheezy image is configured for the United Kingdom (UK). Therefore, the operating system and other software formats the data, time, and money for the UK. If you’re living somewhere else, you’ll want to set things to local convention. That’s where the locale comes into play.

Run the raspi-config program to change settings:

sudo raspi-config

The program implements a simple quasi-GUI where you navigate using the arrow keys. The space bar makes selections and the enter key confirms selections, etc.

Of course, you should set the timezone using the change_timezone option. Use the change_locale option to set your locale.

You would think that change_locale would set the keymap for your keyboard, but no! You need to change the keymap (keyboard layout) using the configure_keyboard option. If you type a key like “|” and you get a different character, you probably need to change the keymap to match your keyboard. Raspian Wheezy will build the new keymap when it reboots the next time (a step that takes a little time to execute).

Finally, you should know how to properly shut down the operating system and Raspberry Pi. I use the shutdown program:

sudo shutdown -h now

The -h option tells the processor to halt after shutdown, allowing you to safely pull the adapter plug from the wall. Use the -r option if you want to reboot instead. Don’t forget to do this since the OS may need to update information in permanent storage, thereby avoiding file system corruption.

Hey, don’t forget to have fun with your new system!

What’s this all about?

Electronics and computing are great hobbies. You can even make a career out of them!

I’m a computer scientist and engineer who is exploring Raspberry Pi, Arduino, Papilio and other terrific, low-cost toys. I’m hoping to share my knowledge and discoveries as I go along so that you can benefit from my experience, too. Students and educators are especially welcome! In addition to new information and activities, I will eventually post classroom material about data structures, computer architecture, and computer design which I used when teaching these subjects.

Thanks for visiting this site and I hope that you will check back often for updates. For the moment, comments are closed. Once I gain a little more expertise with WordPress, I will open the site for comments. Thanks, again.