Performance events on Raspberry Pi 4: Tips

Performance measurement and tuning experiments with Raspberry Pi 4 are well-underway. Here are a few quick observations and tips.

Linux provides two entries into performance measurement: Performance Events for Linux (PERF) and the kernel performance counter interface (perf_event_open()). PERF is an easy-to-use tool suite and is the best place to start explorations. If you want to measure an application without modifying its code, this is for you.

PERF is built on the kernel performance counter interface. The interface consists of two calls: perf_event_open() and its associated ioctl() functions. The kernel interface is suitable for self-monitoring, that is, adding calls to an application in order to measure its internal operation. Performance counters provide two modes of operation: counting and sampling. Counting mode is most appropriate for self-monitoring. I’m currently writing code that makes self-monitoring a bit easier and hope to post the code when it’s ready.

In the meantime…

Installation

PERF and perf_event_open support are not usually installed with your typical Linux distribution. Originally, PERF was available solely as part of the Linux tools package. Well, it seems like somewhere along the way, Ubuntu and Debian diverged. Ubuntu installs PERF with Linux tools:

    sudo apt-get install linux-tools-common 
sudo apt-get install linux-tools-common-$(uname -r)

As PERF depends heavily upon kernel facilities and interfaces, you should install the version of PERF that matches the installed kernel.

Raspberry Pi OS (once known as Raspian) is a Debian distro. Shucks, wouldn’t you know it, Debian installs PERF differently:

    sudo apt install linux-perf

There are different packages for buster and stretch (the current versions of Raspberry Pi OS and Debian at the time of this writing).

    https://packages.debian.org/buster/linux-perf 
https://packages.debian.org/stretch/linux-perf

Installing on buster produces output like:

    XXX@raspberrypi:~ $ sudo apt install linux-perf 
password for XXX:
Reading package lists… Done
Building dependency tree
Reading state information… Done
The following additional packages will be installed:
linux-perf-4.9
Suggested packages:
linux-doc-4.9
The following NEW packages will be installed:
linux-perf linux-perf-4.9
0 upgraded, 2 newly installed, 0 to remove and 107 not upgraded.
Need to get 1,275 kB of archives.
After this operation, 2,735kB of additional space will be used.
Do you want to continue? [Y/n]

Versioning gotcha

And, of course, it’s never that simple. My version of Raspberry Pi OS (buster) is expecting PERF version 5.4. When you enter “sudo perf list” or any other PERF command on the command line, the shell runs the script /usr/bin/perf. The script checks the version of PERF against the kernel and complains when versions don’t match. The Debian install pulled version 4.9, not 5.4.

Rather than sort out versioning, I’ve been entering “perf_4.9” instead of “perf“. This work-around bypasses the perf script which checks versions. Since PERF is now fairly mature, it all seems to work. At some point, I’ll sort out the versioning situation and install 5.4. In the meantime, full steam ahead!

Getting started

Here’s a few PERF commands to get you started:

    perf stat --help 
perf list sw
perf stat
perf top -a
perf top -e cpu_clock
perf record
perf report

The stat approach uses counting mode to measure software and hardware events triggered by an application program (“<cmd>”). The top approach displays event counts dynamically in real-time like the ever-popular “top” utility program. The record and report approach uses sampling to produce performance reports and profiles.

For additional usage information, check out the Linux performance analysis tutorial. There are several other fine tutorials and helpful sites on the Web. Many of the tutorials show use on x86 (Intel and AMD) systems, not Raspberry Pi and ARM. For that, I recommend my own three part tutorial:

  • Part 1 demonstrates how to use PERF to identify and analyze the hottest execution spots in a program. Part 1 covers the basic PERF commands, options and software performance events.
  • Part 2 introduces hardware performance events and demonstrates how to measure hardware events across an entire application.
    Part 3 uses hardware performance event sampling to identify and analyze hot spots within an application program.

In addition to usage, I offer information and guidance concerning ARM micro-architecture. This information is especially helpful when you get into hardware performance events. Check out my summaries of the ARM11 and ARM Cortex-A72 micro-architectures. ARM11 covers Raspberry Pi models 1, 2, and 3 (BCM2835 and BCM2836), while the Cortex-A72 summary covers the Raspberry Pi 4 (BCM2711).

Other helpful on-line resources are:

Paranoia!

Performance measurement is fraught with security issues and holes. The kernel developers implemented a control flag file, /proc/sys/kernel/perf_event_paranoid which sets the level of access and vulnerability when taking measurements. Quoting the Linux man page:

    The perf_event_paranoid file can be set to restrict access 
to the performance counters.
2 allow only user-space measurements (default since
Linux 4.6).
1 allow both kernel and user measurements (default
before Linux 4.6).
0 allow access to CPU-specific data but not raw
tracepoint samples.
-1 no restrictions.
The existence of the perf_event_paranoid file is the
official method for determining if a kernel supports
perf_event_open().

If you’re operating in a fairly closed, single-user environment, then set the content of the file to 0 or -1.

Read the perf_event_open() man page

I recommend reading the perf_event_open() man page. If you’re just starting your journey into performance measurement, you will be overwhelmed by the detail at first. However, just let the information wash over you and know that it’s there. The tutorials don’t always mention the perf_event_paranoid flag and other low-level details. Reading the man page should help you across future stumbling blocks and will enhance your understanding of events, counting and sampling.

Want to learn more about Raspberry Pi 4 (Cortex-A72 / Broadcom BCM2711) performance tuning? Please read:

Copyright © 2020 Paul J. Drongowski

Welcome CS teachers and students!

[Be sure to visit Living Computers in Seattle. SIGCSE 2017 attendees are admitted free during the conference. I visited the museum today and it was a lot of fun! K-12 teachers will enjoy the hands on exhibits.]

The annual ACM Special Interest Group on Computer Science Education (SIGCSE 2017) Technical Symposium is next week (March 8 – 11) in Seattle, Washington. The symposium brings together educators at all levels (K-12 and higher ed) to exchange and discuss the latest methods, practices and results in computer science education.

I don’t often advertise it, but the Sand, Software, Sound site has many resources for educators and students alike. You can browse these resources by clicking on one of the WordPress topic buttons (Raspberry Pi, PERF, Courseware, etc.) above. You can also search for a topic or choose from one of the categories listed in the right sidebar.

Here are a few highlights.

I taught many computer-related subjects during my career and have posted course notes, slides and old projects. The four main sections are:

  • CS2 data structures: Undergraduate data structures course suitable for advanced placement students.
  • Computer design: Undergraduate computer architecture and design which uses a multi-level modeling approach.
  • VLSI systems: Graduate course on VLSI architecture, design and circuits which is suitable for undergraduate seniors.
  • Topics in computer architecture: Material for a special topics seminar about computer architecture (somewhat historical).

Please feel free to dig through these materials and make use of them.

Software and hardware performance analysis formed a major thread throughout my professional life. I recommend reading my series of tutorials on the Linux PERF tool set for software performance analysis:

The ARM11 microarchitecture summary is background material for the PERF tutorial. Program profiling is a good way to bring computer architecture to life and to teach students how to analyze and assess the execution speed of their programs.

There are two additional tutorials and getting started guides for teachers and students working on Raspberry Pi:

Music technology and computer-based music-making have been two of my chief interests over the years. The Arduino section of the site has several of my past projects using the Arduino for music-making. You should also check out my recent blog posts about the littleBits synth modules and littleBits Arduino. Please click on the tags and links at the bottom of each post in order to chase down material.

You might also enjoy my tutorial on software synthesizers for Linux and Raspberry Pi. The tutorial is a getting started guide for musicians of all stripes — music teachers and students are certainly welcome, too!

PERF tutorial part 3 is now on-line

Just wrapped up Part 3 of the Linux-tools PERF tutorial.

The tutorial now consists of three parts. Part 1 covers the most basic PERF commands and shows how to find program hot-spots using software performance events. Part 2 discusses hardware performance events and performance counters, and demonstrates how to measure hardware performance events using PERF counting mode. Part 2 introduces several derived performance metrics like instructions per second (IPC) and applies these metrics to the sample application programs.

Part 3 is the newest addition to the tutorial series. It builds on parts 1 and 2, showing how to use hardware performance events and counter sampling to profile an application program. Part 3 discusses sampling period and frequency, the sampling process, overhead, statistical accuracy/confidence and other practical concerns.

I hope you find the PERF tutorial to be useful in your work! Although I produced the example data on the ARM-based Raspberry Pi, the commands and techniques will also work on x86.

PERF tutorial part 2 now available

Part 2 of a three part tutorial about Linux-tools PERF is now available.

Part 1 of the series shows how to find hot execution spots in an application program. It demonstrates the basic PERF commands using software performance events such as CPU clock ticks and page faults.

Part 2 of the series — just released — introduces hardware performance counters and events. I show how to count hardware events with PERF and how to compute and apply a few basic derived measurements (e.g., instructions per cycle, cache miss rate) for analysis. Part 3 is in development and will show how to use sampling to profile a program and to isolate performance issues in code.

All three parts of the series use the same simple, easy to understand example: matrix multiplication. One version of the matrix multiplication program illustrates the impact of severe performance issues and what to look for in PERF measurements. The issues are mitigated in the second, improved version of the program. PERF measurements for the improved program are presented for comparison.

The test platform is the latest second generation Raspberry Pi 2 running Raspbian Wheezy 3.18.9-v7+. The Raspberry Pi 2 has a 900MHz quad-core ARM Cortex-A7 (ARMv7) processor with 1GByte of primary memory. Although the tutorial series demonstrates PERF on Cortex-A7, the same PERF commands and analytical techniques can be employed on other architectures like x86.

A special note for Raspberry Pi users. The current stable distribution of Raspbian Wheezy — 3.18.7-v7+ February 2015 — does not support PERF hardware events. Full PERF support was enabled in a later, intermediate release and full PERF support should be available in the next stable release of Raspbian Wheezy. In the meantime, Raspberry Pi 2 users may profile their programs using PERF software events as shown in Part 1 of the tutorial. First generation Raspberry Pi users are also restricted to software performance events.

Brave souls may try rpi-update to upgrade to the latest and possibly unstable release. I recommend waiting for the next stable release unless you really, really know what you are doing and are willing to chance an unstable kernel with potentially catastrophic consequences.