A sweet one-liner for histograms

Posted on June 9, 2013 by pj

Here is a shell script one-liner that is just too sweet to go unrecognized.

I’ve written a program (latency.c) to measure the execution time of individual chase operations in a linked list pointer chasing loop. The program writes the execution times into a file named samples.dat. The distribution of the execution times should show the access time to different levels of the ARM1176 memory hierarchy.

I still needed a way to view the distribution in a histogram. So, I wrote a short C program (histogram.c) to postprocess the execution times. The program produces a histogram-like table — not a chart. Bummer.

A quick search of the Web brought up some rather sophisticated data visualization tools. I experimented with a few of them, but still couldn’t get want I wanted.

Then, this little gem came up from Small Labs, Inc. (Search on “command line histogram.)

history | awk '{h[$2]++}END{for(i in h){print h[i],i|"sort -rn|head -20"}}' |awk '!max{max=$1;}{r="";i=s=60*$1/max;while(i-->0)r=r"#";printf "%15s %5d %s %s",$2,$1,r,"\n";}'

This one-liner displays a histogram of recent command lines (history) from the most to least frequently used. It displays only the 20 most frequent commands (head -20).

This one-liner looks quite promising, but it needs a few changes. First, it needs to read data from the file samples.dat. It needs to read only one item from each line of the file; history produces two items per line. Finally, we can discard some of the white space in the output and make the lines in the chart narrower.

Here is the revised one-line written in the form of a bash shell script. We’re going to analyze several files and it’s appropriate to put the one-liner into a shell command file.

#!/bin/bash

cat samples.dat | awk '{h[$1]++}END{for(i in h){print h[i],i|"sort -rn|head -20"}}' |awk '!max{max=$1;}{r="";i=s=60*$1/max;while(i-->0)r=r"#";printf "%6s %5d %s %s",$2,$1,r,"\n";}'

Here is some sample output.

    62  517 ###########################################################
    61  301 ################################### 
     9  119 ############## 
    70   12 ## 
    63   11 ## 
    65    9 ## 
    64    4 # 
    17    4 # 
   336    2 # 
   168    2 # 
   154    2 # 
    71    1 # 
    67    1 # 
    60    1 # 
   525    1 # 
   521    1 # 
   390    1 # 
   389    1 # 
   386    1 # 
   378    1 #

We can easily see the most frequent values, but this doesn’t really show the distribution in a useful way. So, let’s just sort the output on the leading numeric values (sort -n). The numbers in the first column are access/latency times in cycles, by the way.

     9  119 ##############
    17    4 # 
    60    1 # 
    61  301 ###################################
    62  517 ###########################################################
    63   11 ## 
    64    4 # 
    65    9 ## 
    67    1 # 
    70   12 ## 
    71    1 # 
   154    2 # 
   168    2 # 
   336    2 # 
   378    1 # 
   386    1 # 
   389    1 # 
   390    1 # 
   521    1 # 
   525    1 #

Ah, now we can see the level 1 data cache hits (9 cycles including measurement bias) and reads to primary memory (61 and 62 cycles). The bi-modal nature of the distribution is revealed.

Before leaving this topic, I’d like to give a shout-out to Dave Christie at AMD. Dave always gave me a good-natured kidding about these old school histograms. Dave, BTW, is one of the true unsung technical heroes at AMD. All the best!

Perils of performance analysis on single core

Posted on June 7, 2013 by pj

A new article on the Raspberry Pi (Broadcom BCM2835) memory hierarchy is almost ready. The first code has already been posted.

I’ve been working on multi-core processors for so long that I forgot what it’s like to take measurements on a single core machine like the Raspberry Pi.

In the ideal world, a benchmark or performance test program has the machine to itself and no other program or system activity perturbs it. Measurements on the ideal machine accurately and exactly reflect the dynamic behavior and performance of the program. On multi-core, you can usually assign the test program to an idle core (or two), preferably a core that is free of operating system activity. With careful process or thread placement, results on multi-core approach the ideal.

On single core, we don’t have any luxury. The test program has to share the one core with other programs and the operating system. On Raspberry Pi, Linux fires up services that run periodically. Even if we shut the services off, the system clock continues to run and it generates interrupts. At the very least, extraneous activity affects elapsed, user and system time measurements.

When we measure performance events, however, there is a deeper level of interference. The core has one physical level 1 (L1) data cache, one physical MicroTLB, one physical Main TLB and one physical branch history table. These microarchitecural components are transparent to the architecture, but they must be shared between programs and the OS. A context switch may cause a cache or TLB flush which invalidates the entire contents of the cache/TLB. Cache, TLB or branch history may be partially polluted by other software activity. The final performance event counts are affected by flushes and pollution and do not accurately reflect the behavior of the test program.

I ran into this issue while characterizing the memory hierarchy with performance events. One test case is designed to exercise only the L1 data cache and never touch primary memory. Yet, the test case measured a rather significant number of data cache misses beyond the compulsory misses that I would have expected. The extra misses are most likely caused by timer interrupts. I now think of these extra misses as “background radiation” which bias measurement.

Such are the perils of performance measurement and analysis on single core!

Faster matrix multiplication (part 2 of 2)

Posted on June 3, 2013 by pj

Part 2 of two parts on matrix multiplication demonstrates a fast matrix multiplication program. The algorithm is a simple transformation of the textbook algorithm — the olde loop nest interchange. The transformation changes the slow access pattern to one of the arrays so that the program steps sequentially through the array elements in memory. Execution time speeds up from about 16 seconds elapsed time to 6 seconds. Not bad for a few minutes work!

All of the key memory-related performance events are improved since the access pattern is a better fit with the underlying memory microarchitecture. The analysis shows that we need to be careful when interpreting the Data Cache Access event because this event counts nonsequential memory accesses instead of all level 1 DC accesses or architectural loads and stores.

Part 2 also discusses operation or instruction counting to analyze program complexity at a micro-level. I like to look at the assembler code generated by the compiler to see if there are any potential speed-ups. The article shows how to look at the assembler code using the GCC -S option and using the objdump program. I use instruction counting to check the operation and meaning of performance events like the ARM11 Executed Instructions event.

The Broadcom BCM2835 in the Raspberry Pi has an integer core and a Vector Floating Point (VFP) coprocessor. The VFP operates concurrently with the integer core. In fact, it operates quite independently and only synchronizes with the integer core at a few well-defined points. VFP instructions are allowed to complete out of order, which allows for greater speed, but makes FP exceptions somewhat imprecise. (Now exactly where did that underflow/overflow occur?) The VFP coprocessor has 32 registers of its own, which are organized as four 8-register banks. GCC uses the coprocessor for scalar floating point arithmetic, but doesn’t exploit any parallelism.

The VFP operates on short vectors in a register bank. Potentialy, the VFP coprocessor could be exploited to further speed up matrix multiplication. One possibility is to stream incoming data as a four-wide stripe through an array and operate on four elements at once. Or, stream four elements at a time from a single row/column. Take a look at the VFP Math Library.

It’s not all good news, however. The VFP coprocessor is not a true single instruction, multiple data (SIMD) engine. (It’s similar to an old school short vector architecture.) It only has a single floating multiply/accumulate (FMAC) pipeline. A true SIMD would have four FMAC units. Also, computations are relatively difficult to set up and stage. Computations must be double buffered where the integer Load Store Unit (LSU) is filling one register bank while the coprocessor is performing computations in a different register bank. Further, GCC vectorization doesn’t appear to support VFP.

ARM must have gotten the message from its users. Later processors implement NEON SIMD and just enough VFP for the sake of legacy compatibility. The Beaglebone Black (ARM Cortex-A8) has NEON and I’m looking forward to trying it out. GCC vectorization supports NEON, too, and it’s a whole lot easier to let the compiler vectorize your program for you than to write vector code yourself!

The BCM2835 also has the VideoCore GPU for SIMD computation. There are a bunch of folks who are reverse engineering the GPU in order to use it for general purpose computation (GPGPU). Have it at, guys and gals!

Even if VFP is an orphan, the coprocessor has 32 registers where you can stash data. Maybe you can find a way to make use of these extra registers? Side-to-side access (integer/floating) is pretty fast.

Raspberry Pi performance counters (part 1)

Posted on May 30, 2013 by pj

Finally, an example to show the Raspberry Pi performance counters in action. My friends will no doubt chuckle because the first example is an analysis of matrix multiplication. (“He always starts with matrix multiplication…”) Matrix multiplication is a good place to start because it is a small easy to build and easy to analyze program with a known performance issue. It’s a great way to get an intuitive feel for the performance events on a new, unfamiliar platform like the Raspberry Pi. I’ve analyzed this example on x86, SPARC, Itanium and Alpha, so I already have a fair bit of history with it.

Part 1 of the example shows how to use the Raspberry Pi performance counter kernel module and the user-space support functions. I collect performance event data for the infamous textbook implementation of matrix multiplication and define a few useful rates and ratios to help interpret the event counts. There is also a brief introduction to memory hierarchy in order to provide a little background for data cache and translation lookaside buffer (TLB) behavior.

I’m in the process of writing part 2, which explains and demonstrates an improve matrix multiplication program. The code for part 2 is already in the source area of this site.

After doing some comparative analysis, I strongly encourage you to read carefully the definitions of the ARM1176 performance events. The “data cache access” events, in particular, only count nonsequential data cache accesses. This important qualification affects the interpretation of performance measurements. In particular, you can’t compute a pure data cache miss ratio, that is, all data cache misses divided by all data cache accesses.

The descriptions of the ARM1176 performance events are a little bit sketchy. ARM did a better job describing the Cortex-A8 events, for example. Adopting a Zen attitude, the ARM1176 events are what they are, they will not change or be updated, and we need to accept them.

Performance events and Zen

Posted on May 27, 2013 by pj

It’s always interesting to get started with a new microarchitecture and the ARM 1176 inside of the Raspberry Pi is no exception.

Back when we were in school, we all dutifully went to computer architecture class and learned about memory hierarchy, read/write access, cache memory, and the translation lookaside buffer (TLB). A hit was a clean hit and a miss was, well, a miss.

Real world behavior of cache memory and the TLB is far more complicated. Computer designers study the behavior of benchmark and application programs in order to find behavioral patterns which the hardware can exploit for speed. This includes behavior like sequential access, fixed-length address strides, temporal and spatial locality. Then the designers build hardware which implements every trick in the book, all in the name of faster access to memory data and ultimately, faster programs. In the case of low-power machines like ARM, computer designers use behavioral patterns to turn off or not actively use functional components in order to save power. Inactive components don’t consume power, but they don’t generate observable signals either.

A real world performance event is a dynamic condition which occurs in the midst of this complicated hardware. An access or miss may be counted only for the first of multiple sequential reads/writes to the same cache line. Other undocumented internal microarchitectural conditions affect the event counts. Suddenly, it’s not so easy to interpret performance event counts armed with our textbook notions of cache access and cache miss. It may not even be possible to effectively compare the behavior of two versions of the same program (one version tuned, the other version untuned).

We can whine, whinge and kvetch about the limitations of real world performance events, the lack of documentation, and other shortcomings. At the end of the day, the performance events are what they are and we need to accept them. Therein lies the Zen of performance events.

The fine print in the TRM

Posted on May 23, 2013 by pj

I’ve been busy writing up the first example showing how to use the Raspberry Pi performance counter kernel module and the ARM11 performance counters. These write-ups always take longer than I think, so here’s a few small things that I’ve learned along the way.

Run, don’t walk, to the ARM web site and download a copy of the ARM1176JZF-S Technical Reference Manual. The TRM has the goodies and details about the ARM1176 core in the Raspberry Pi. This includes information about cache and TLB sizes, branch prediction, instruction timing and, of course, the performance counters and events. The TRM is essential reading if you want to know more about the processor at the heart of the Raspberry Pi.

Good as it is, the TRM doesn’t contain all of the information and in some cases, the descriptions are very low level and esoteric. Computer designers describe hardware from their point of view, which is not necessarily the perspective of a software engineer. In fact, the terminology is often quite foreign to the ears of a software engineer. The TRM is no exception.

The first question many people ask about the Raspberry Pi is “What happened to the L2 cache?” The L2 cache is not part of the ARM1176 core, so the TRM describes the interface to the L2 and is mum about the L2 itself. The best clues about the L2 cache appear at the beginning of the Broadcom BCM2835 ARM Peripherals manual. Broadcom has implemented a pretty decent graphics processing unit (the VideoCore) and are reluctant to release too many details about it lest competitors learn too much about their design. The diagram at the beginning of the peripherals manual shows the memory layout and bus structure of the BCM2835. There are actually two memory management units (MMU). The ARM MMU maps program virtual addresses to physical addresses and the VC/ARM MMU maps physical addresses onto the VC/CPU bus and real honest to goodness physical memory. The high order bits of the VC/CPU address determine the cacheable status of memory regions including, TA-DA!, the L2 cache. The footnote on page 6 says, “BCM2835 provides a 128KB system L2 cache, which is used primarily by the GPU. Accesses to memory are routed either via or around the L2 cache depending on senior two bits of the bus address.” So, under normal circumstances, memory reads/writes made by a program are not routed through the L2 cache. I suspect, and this is a guess, that the L2 cache boosts the bandwidth for the memory hungry GPU. Well, we may never really know.

The ARM1176 has two separate level 1 (L1) caches: a 16KB instruction cache and a 16KB data cache. I verified the cache size using the coprocessor’s Cache Type Register. The TRM is not always clear about the number and size of the translation lookaside buffers (TLB). There are two level 1 MicroTLBs. Each MicroTLB has ten entries. The MicroTLBs are backed by a Main TLB consisting of eight fully associative elements and a 64 element low-associativity store. Depending on how Linux uses the eight fully associative, lockable entries, as many as 72 entries are available for address translation.

The other characteristics that require some digging are the load-to-use latencies for the L1 cache and the TLBs. I’m just diving into the instruction timing information in Chapter 16 of the TRM. However, the load-to-use latency for a hit in the L1 data cache is three cycles. That is, the data from a load hitting in the L1 data cache is not available for 3 cycles once the load is issued. Dedicated datapaths called bypasses forward load data to other instructions in the pipeline.

The Load/Store Unit does not always block on an L1 data cache miss and supports hit under miss (HUM). The miss goes into a holding state/buffer and non-dependent instructions are allowed to execute. Up to three outstanding misses are allowed. This increases parallelism letting some computations proceed even when a load misses in the L1 data cache.

Finally, the ARM1176 is a single issue machine, that is, one instruction is issued at a time. This constraint simplifies the issue logic. The ARM1176 doesn’t need to implement any complicated issue rules to determine whether two or more instructions of a certain type can issue in the same cycle. Instructions are issued in-order, but out-of-order completion is allowed. out-of-order completion increases exploitable fine-grained parallelism.

Another neat feature of the ARM — nearly all instructions can be predicated. An enabling condition (a predicate) can be defined for an instruction. The enabling condition gates the execution of the instruction. Predication is cool and is used to eliminate branches. More about predication one of these days…

Hope you enjoyed this trip through the fine print!

Performance counter kernel module

Posted on May 20, 2013 by pj

As promised, I’ve described the design of a Linux loadable kernel module that allows user-space access to the Raspberry Pi (ARM 1176) performance counters. By the way, the design of the module is not specific to Raspbian Wheezy or even the Raspberry Pi for that matter. I believe that the kernel module could be used on the new Beagleboard Black (BBB) to enable user-space counter access on its ARM Cortex-A8 processor under Linux. I just ordered a BBB and will try out the code when possible. (Assuming quick delivery!)

The kernel module alone isn’t enough to measure performance events. In fact, the kernel module doesn’t even touch the counters. It merely flips a privileged hardware bit which lets user-space programs read and write the performance counters and control register. So, I have also written a few user-space C functions to configure, clear, start and stop the performance counters. An application program just needs to call a few functions to choose the events to be measured and start counting, to stop counting, to get the raw counts, and to print the event counts.

I have uploaded the source for both the kernel module (aprof.c) and the user-space functions (rpi_pmu.h and rpi_pmu.c). In addition, there is source for some utility functions that I like to use in benchmark programs (test_common.h and test_common.c). All of this is a work in progress and I will update the source when major enhancements or changes are made.

Speaking of source, I have found a way of organizing and storing source code through WordPress. WordPress is kind of security paranoid and doesn’t allow you to upload source code or even gzip’ed TAR files. I ran into this issue when I attempted to upload a make file and WordPress wouldn’t let me do it (with complaints about potentially malicious code and so forth). WordPress does let you post source for viewing, however.

So, I’ve added a Source menu item to the main menu. I want the menu structure below the Source item to operate like a browsable code repository. The first level of items below Source are projects, like the kernel module. The next level of menu items navigate into the source belonging to a project. Each make file and source file is a separate page. The source code is displayed using the SyntaxHighligher plug-in in order to keep indentation. No other formatting or highlighting is done just to keep things simple. I could cut and paste code from these pages, so I hope you can, too!

An introduction to performance tuning (and counters)

Posted on May 15, 2013 by pj

My latest page is an overview of performance tuning on ARM11. The Raspberry Pi is a nifty little Linux box, but it’s kind of slow at 700MHz. Therefore, I suspect that programmers will have an interest in tuning up application programs and making them run faster. Performance tuning is also a good opportunity to learn more about computer architecture and machine organization, especially the ARM1176 core at the heart of the Raspberry Pi and its memory subsystem.

The ARM1176 has three performance counters which can measure over 20 different microarchitectural events. One of these counters is dedicated to core clock cycles while the other two are configurable. The new performance tuning page has a brief overview of the counters and it has a table with the supported events.

The new page also describes two different use cases for the counters: caliper mode and sampling mode. Caliper mode counts the number of microarchitectural hardware events that occur between two different points in program execution. Caliper mode is good for measuring the number of data cache accesses and misses for a hot code region like a loop. The programmer inserts code to start counting at the beginning of the hot region and inserts code to stop counting at the end of the hot region. This is the easiest use case to visualize and to implement. It’s the approach that I’m taking with my first performance measurement software and experiments (a custom kernel module plus some user-space code). These experiments are almost finished and ready for write up.

Sampling is a statistical technique that produces an event profile. A profile shows the distribution of events across program instructions, routines, source lines, or modules. This is a good way to find hot-spots in a program where tuning is most beneficial. Sampling does not require modification to source.

Performance Events for Linux (informally called “PERF”) is the standard tool for program profiling on Linux. At the moment, PERF has a bug which prevents it from sampling hardware events. I’ve been looking into this problem, too, and hope to post some results. In the long-run, I want to post examples using PERF in order to help people tune up their programs on Raspberry Pi.

Building a loadable kernel module

Posted on May 14, 2013 by pj

I needed to design, implement and build a loadable kernel module in order to access the ARM11 performance counters from user-space. I will slowly roll out the design and code for the kernel module. But, first, I’ve posted my notes for building a loadable kernel module. It’s easier to explain the process of building the module and the internal design of the module if I separate the two discussions.

There were a few problems along the way. The kernel source for Raspbian Wheezy 3.6.11+ is not available using Synaptic. Only 3.6.9 is available through Synaptic. I needed to download the source for 3.6.11+ from github.com in order to match the installed Linux image. Next, I needed the module version information for 3.6.11+. Usually this information is built along with the kernel and is stored in the file named Module.symvers. Raspbian Wheezy takes 10+ hours to build on the Raspberry Pi, according to reports on the Web, so I didn’t want to undertake a long-running kernel build just to generate the version information. Fortunately, I could download Module.symvers from github.com, too.

I hope the next Raspbian Wheezy distro has all of the essentials for a module build — headers, version information, the whole she-bang. This would really help a brother out because many people are building custom hardware for the RPi and they need to build custom device drivers, too.

I’m currently in the process of writing pages on performance monitoring on the RPi. That discussion will include the design and code for my kernel module. The kernel module is about as simple as can be and is a kind of “Hello world” example. Please stay tuned.

ARM11 microarchitecture

Posted on May 8, 2013 by pj

You probably know by now that the Raspberry Pi uses an ARM processor. In particular, the Raspberry Pi model B uses the Broadcom BCM2835 system on a chip (SoC). The BCM2835 is a member of the ARM11 family. Its name is the ARM1176JZF-S. (Whew!)

Like all computers, the BCM2835 has an internal processor structure called its “microarchitecture”. The word “architecture” refers to the machine features that are visible to a programmer — things like the instruction set. The microarchitecture refers to the building blocks in the guts of the machine, or more properly, in the guts of a specific implementation (BCM2835) of an architectural family (ARM11 or ARMv6).

The microarchitecture can have a big effect on program performance. Compiler writers, for example, study the microarchitecture in order to build compilers that generate the best possible code for the microarchitecture. As we’ll see in later posts, application programmers can also take steps to tune their programs for the underlying microarchitecture. Tuning is important on Raspberry Pi because at 700 MHz, this machine is running its heart out!

Today, I added a page that summarizes the characteristics of the BCM2835 (ARM11) microarchitecture. Please check out the info! We will revisit this page when I discuss profiling and tuning.

Sand, software and sound

Electronics and computing for the fun of it