PERF tutorial part 2 now available

Part 2 of a three-part tutorial about Linux-tools PERF is now available.

Part 1 of the series shows how to find hot execution spots in an application program. It demonstrates the basic PERF commands using software performance events such as CPU clock ticks and page faults.

Part 2 of the series — just released — introduces hardware performance counters and events. I show how to count hardware events with PERF and how to compute and apply a few basic derived measurements (e.g., instructions per cycle, cache miss rate) for analysis. Part 3 is in development and will show how to use sampling to profile a program and to isolate performance issues in code.
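As a rough illustration of the derived measurements mentioned above, here is a minimal sketch that turns raw event counts into instructions per cycle and a cache miss rate. The helper name `ratio` and all of the counts are made up for illustration; they are not measurements from the tutorial. In practice the counts would come from something like `perf stat -e instructions,cycles,cache-references,cache-misses ./program`, assuming the platform exposes those events.

```shell
#!/bin/sh
# Hypothetical helper: ratio NUMERATOR DENOMINATOR
# Prints the quotient to three decimal places, or "n/a" if the denominator is zero.
ratio() {
    awk -v n="$1" -v d="$2" 'BEGIN {
        if (d == 0) { print "n/a"; exit }
        printf "%.3f\n", n / d
    }'
}

# Made-up event counts, for illustration only.
instructions=1200000
cycles=2000000
cache_refs=500000
cache_misses=25000

# Instructions per cycle: retired instructions divided by CPU cycles.
echo "IPC:             $(ratio "$instructions" "$cycles")"
# Cache miss rate: cache misses divided by cache references.
echo "Cache miss rate: $(ratio "$cache_misses" "$cache_refs")"
```

With these sample counts the script prints an IPC of 0.600 and a cache miss rate of 0.050.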

All three parts of the series use the same simple, easy-to-understand example: matrix multiplication. The first version of the matrix multiplication program illustrates the impact of severe performance issues and what to look for in PERF measurements. The issues are mitigated in the second, improved version of the program. PERF measurements for the improved program are presented for comparison.

The test platform is the second generation Raspberry Pi 2 running Raspbian Wheezy 3.18.9-v7+. The Raspberry Pi 2 has a 900MHz quad-core ARM Cortex-A7 (ARMv7) processor with 1GByte of primary memory. Although the tutorial series demonstrates PERF on Cortex-A7, the same PERF commands and analytical techniques can be employed on other architectures like x86.

A special note for Raspberry Pi users: the current stable distribution of Raspbian Wheezy (3.18.7-v7+, February 2015) does not support PERF hardware events. Full PERF support was enabled in a later, intermediate release and should be available in the next stable release of Raspbian Wheezy. In the meantime, Raspberry Pi 2 users may profile their programs using PERF software events as shown in Part 1 of the tutorial. First generation Raspberry Pi users are also restricted to software performance events.

Brave souls may try rpi-update to upgrade to the latest and possibly unstable release. I recommend waiting for the next stable release unless you really, really know what you are doing and are willing to chance an unstable kernel with potentially catastrophic consequences.

RPi2: Work in progress 1

Here’s a quick status update on working with Raspberry Pi gen 2. The installed operating system is Raspbian Wheezy 3.18.7-v7+ built on 16 February 2015.

I’m happy to report that I could profile programs using PERF software events. I’m disappointed to report that PERF does not recognize any hardware (performance counter) events. This distro has linux-tools-3.2 installed. I uninstalled 3.2 and installed 3.18, which matches the kernel:

sudo apt-get remove linux-tools-3.2
sudo apt-get install linux-tools-3.18

Still no joy when attempting to use hardware events. If you want to profile your program using PERF software events, please see my current PERF tutorial about finding execution hot-spots. I tried all of the commands and, with the exception of one typo, everything still works!
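One quick way to check whether a given kernel and PERF pairing exposes any hardware events is to count the entries that `perf list` marks as hardware events. This is a minimal sketch, assuming `perf list` labels each such entry with the tag `[Hardware event]`; the helper name `count_hw_events` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads `perf list` output on stdin and prints the
# number of lines carrying the "[Hardware event]" tag (an assumption
# about the output format). Prints 0 when there are none.
count_hw_events() {
    # grep -c prints the count; "|| true" keeps the exit status at 0
    # even when the count is zero.
    grep -c '\[Hardware event\]' || true
}

# Usage (requires perf to be installed):
#   perf list 2>/dev/null | count_hw_events
```

A result of 0 would match the situation described here: software events work, but no hardware events are recognized.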

I’m in the process of troubleshooting my loadable kernel module for user-space performance counter events. I’ve encountered many of the same old stumbling blocks (e.g., finding the correct headers and Module.symvers file). At the present time, the kernel will attempt to load the module, then die. I cannot tell at this stage if there is a problem in the module itself or if there is a bug in Raspbian Wheezy. In case you want to dive into module development yourself, I’ve started a permanent page for building kernel modules on RPi2.

Once again, after two+ years, I want to make a public plea for more open information about the underlying hardware and for guidance and support for end-user device driver development. Quite frankly, Broadcom plays this situation too close to the chest, especially for a computer that’s advertised as a vehicle for learning and education. The dearth of information is stifling. People still struggle to identify and download essential information (e.g., Module.symvers) for device driver development. This is not true of other major Linux distros and the Raspbian folks really need to take note! Broadcom, in particular, runs the risk of killing off the goose laying the golden eggs.

Before signing off, here is a quick PERF command cheat sheet. I recommend reading the tutorial, but if you really must peck away at the keyboard… All the best!

perf help
perf list
perf stat -e cpu-clock ./program
perf record -e cpu-clock ./program
perf record -e cpu-clock,faults ./program
perf report
perf report --stdio --sort comm,dso --header
perf report --stdio --dsos=program
perf annotate --stdio --dsos=program --symbol=function
perf annotate --stdio --dsos=program --symbol=function --no-source
perf record -e cpu-clock --freq=8000 ./program
perf evlist -F

Replace “program” with the name of your application program and replace “function” with the name of a function in your program.

Second generation RPi is here

The second generation Raspberry Pi (RPi2) is now shipping in large quantities! Given the excitement on the Web, this machine should be at least as popular as its first generation parents. Although the RPi2 model B has the same overall form factor as the first generation model B+, the designers made two substantial improvements which make the RPi2 a contender for your desktop:

  1. The single core Broadcom BCM2835 is replaced by the quad core BCM2836.
  2. Primary memory is increased to 1GByte of LPDDR2 RAM.

That’s just the face of it. Not only does the BCM2836 have four processor cores instead of one core, the cores are based on the ARMv7 architecture (Cortex-A7) including the NEON single instruction, multiple data (SIMD) instructions. The clock frequency is increased to 900MHz (from 700MHz). I’ve already begun to explore the ARMv7 micro-architecture and plan to write up a concise summary of its performance-related characteristics.

The BCM2836 has a different memory controller. Primary memory is no longer implemented using the Package on Package (PoP) approach. The Elpida (Micron) B8132C4PB-8D-F memory chip is mounted on the bottom of the RPi2 board (instead of the PoP piggyback).

The RPi2 sold out at Sparkfun almost immediately. Fortunately, Canakit, Element14 and Microcenter have received shipments, too. Amazon advertised the Canakit Raspberry Pi 2 Ultimate Starter Kit at a very attractive price and I immediately bought a kit. Microcenter in Cambridge had a mound of RPi2s and impatience got the better of me, so I bought one. Yes, after getting the mail, I now have two.

I copied the latest Raspbian Wheezy release (16 February 2015) to a 16GByte microSD card using Win32DiskImager. The Canakit ships with NOOBS on an 8GByte card and I hope to try NOOBS and report on it later. There was a little drama while bringing up Raspbian Wheezy as some relatively small, but annoying problems did crop up. Once I got past the sand traps, the new RPi2 proved to be an able performer.

Today, I copied my test software over to the RPi2. Here is a quick comparison between the older RPi model B and the new RPi2.

Platform             Naïve MM     Interchange MM
RPi model B gen 1    18.67 sec    6.75 sec
RPi gen 2            3.15 sec     2.42 sec

The two test cases are the naïve matrix multiplication program and the loop nest interchange matrix multiplication program. (Get the code in the source section of the web site.) Yes, that is nearly a 6x improvement in performance for the naïve case (18.67 sec versus 3.15 sec). It’ll be fun to explore and find the reasons behind the speed-up. Fast matrix multiplication depends upon memory bandwidth and there must be some significant improvements in the memory subsystem. Naïve matrix multiplication incurs a lot of translation lookaside buffer (TLB) misses, so improvements in TLB miss handling could also contribute to the speed-up in the naïve test case.
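The speed-up figures follow directly from the timings in the table. Here is a minimal sketch of the arithmetic, using a hypothetical shell helper named `speedup` (it is not part of the test software):

```shell
#!/bin/sh
# Hypothetical helper: speedup OLD_SECONDS NEW_SECONDS
# Prints how many times faster the new run is, to two decimal places.
speedup() {
    awk -v old="$1" -v new="$2" 'BEGIN { printf "%.2f\n", old / new }'
}

# Timings taken from the comparison table (seconds).
echo "Naive MM:       $(speedup 18.67 3.15)x"   # prints 5.93x
echo "Interchange MM: $(speedup 6.75 2.42)x"    # prints 2.79x
```

The interchange version gains less (about 2.8x) than the naïve version (about 5.9x), which fits the idea that the naïve program was held back more by the memory subsystem on the first generation board.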

I ditched the Epiphany Web browser as it seems to have significant bugs. The browser crashed repeatedly when loading the New York Times front page. This is unacceptable. I installed Midori, which came with the initial release of Raspbian Wheezy. The New York Times front page is a bit of a torture test. Midori loaded the page in less time than it took on the RPi gen 1, but still felt slow and logy. I suspect that many applications will need to be compiled for ARMv7 before we end-users get the full benefit of the BCM2836. The initial result, however, is encouraging.

Well, I’ve started to reorganize the site’s menu structure in order to get ready for new content about the RPi2. I intend to retain the older articles as they remain quite relevant. More to come!

Make music with MMS on a PSR

Yamaha Mobile Music Sequencer includes features for Motif, MOX and Tyros5, but did you know that you can create music using MMS on your PSR arranger? Yes, you can!

I’m using MMS with both the Yamaha PSR-E443 and PSR-S950 and I have written up a tutorial on making music with MMS on PSR/Tyros. This article concentrates on set-up, MIDI voice selection and MIDI file export, which are aspects not covered by the MMS manual. The tutorial complements the many on-line videos that demonstrate composition and mix down. In particular, I show how to use the full 128-voice General MIDI set in the PSR, thereby expanding your sonic palette beyond the limited range of voices built into MMS.

Enjoy and keep on keepin’ on!