A short trip through ARM cores

Digging around in the kernel and PERF source code, I got lost in the labyrinth of ARM products. So, I took a little time to learn about ARM’s naming conventions.

ARM have always been good at separating architecture and implementation (code technology). Architecture is what the programmer sees, rather, the behavioral standard which includes the “instruction set architecture” or “ISA.” Architecture goes beyond instruction sets and includes operating system concerns such as virtual memory, interrupts, etc.

Core technology implements architectural features. Core technology is the underlying machine organization AKA the “micro-architecture.” Depending upon the actual design, Implementation touches on pipeline length and stages, caches, translation look-aside buffers (TLB), branch predictors and all of the other circuit stuff needed for efficient, performant execution.

ARM architecture names have the form “ARMvX”, where X is the version number. The original Raspberry Pi (Broadcom2835 with ARM1176JZF-S processor) implemented the ARMv6 architecture. The current Raspberry Pi 4 (Broadcom BCM2711 with quad Cortex-A72 cores) implements the ARMv8-A architecture. ARMv8 is a multi-faceted beast and a short summary is wholly inadequate to convey its full scope. I suggest reading through the ARMv8 architectural profile on the ARM Web site. Suffice it to say here, ARMv8.2 is the latest.

ARMv8 was and is a big deal. ARMv7 was 32-bit only. ARMv8 took the architecture into 64-bit operation while retaining ARMv7 32-bit functionality. ARMv8 added a 64-bit ISA and operating system features, separating operation into the AArch32 execution state and the (then) new AArch64 execution state. It preserves backward compatibility with ARMv7.

You probably noticed that the core name “ARM1176JZF-S” (above) is neither the most informative nor does it suggests this core’s place in the constellation of ARM products. In 2005, ARM introduced a new core technology naming scheme. The naming scheme categorizes cores by series:

  • Cortex-A: Application
  • Cortex-R: Realtime
  • Cortex-M: Embedded

The series letters spell out “ARM” — clever. Cores within a series are tailored for their intended deployment environment having the appropriate mix of performance (speed), real estate (space) and power envelope.

The following table is a partial, historical roadmap of recent ARM cores:

    32-bit cores       64-bit cores (ARMx8) 
------------ ----------------------------------
Cortex-A5 Cortex-A53 2012 In-order LITTLE
Cortex-A7 Cortex-A57 2012 Out-of-order big
Cortex-A8 Cortex-A72 2015 Out-of-order big
Cortex-A9 Cortex-A73 2016 Out-of-order big
Cortex-A12 Cortex-A55 2017 In-order LITTLE
Cortex-A15 Cortex-A76 2018 Out-of-order big
Cortex-A17 Cortex-A77 2019 Out-of-order big

The 64-bit cores in the second column are a significant architectural break from the 32-bit cores in the first column. I will focus on the ARMv8 (64-bit) cores.

ARM rolled out big.LITTLE multiprocessor configuration at roughly the same time as ARMv8. With big.LITTLE, ARM vendors can design multiprocessors that are a mixture of big cores and LITTLE cores. LITTLE cores are power-efficient implementations much like ARM’s previous core designs for embedded and mobile systems — systems which consume and dissipate as little power as possible. Big cores trade higher power for higher performance. [If you really want to make digital electronics go fast, you must expend energy.] The big.LITTLE approach allows a mix of power-efficient cores and high-performing cores in a multicore system. Thus, a cell phone can spend most of its time in low power cores saving battery while hitting the high power cores when compute performance is required by the user.

The big.LITTLE approach is enabled by common cache and coherent communication bus design. The little guys and the big guys communicate through a common infrastructure. Nice.

ARMv8 LITTLE cores are in-order superscalar designs. In-order designs are simpler than out-of-order superscalar. In-order cores usually have a shorter pipeline, have fewer execution units, and do not require a big register file for renaming and delayed retirement. Out-of-order superscalar designs pull out all of the stops for performance and exploit as much instruction level parallelism (ILP) as they can discover.

The Cortex-A53 was the first ARMv8 in-order LITTLE core. ARM introduced its successor, Cortex-A55, in 2017.

The big core era began with Cortex-A57. This was ARM’s first design that rivaled Intel and AMD out-of-order x86 cores. The Cortex-A72 replaced the A57. Thus, the Raspberry Pi 4 (BCM2711) uses an older ARM big core, the Cortex-A72. [Explains my enthusiasm for a $70 o-o-o superscalar.] ARM have churned out new big cores on an annual basis ever since from its Austin and Sophia design centers.

Before pushing ahead, it’s worth mentioning that the Yamaha Montage synthesizer has an 800MHz Texas Instruments Sitara ARM Cortex-A8 single core processor and a 40MHz Fujitsu MB9AF141NA with an ARM Cortex-M3 core. The four 64-bit A72 cores in the Raspberry Pi 4 have much more compute throughput than the single 32-bit A8 core in the Yamaha Montage. The Montage (MODX) processor provides user interface and control and is not really a compute engine. The SWP70 silicon is the tone generator.

In practice

I began this dive into naming and ARM cores when I needed to determine the actual ARM core support installed with Performance Events for Linux (PERF) and the underlying kernel.

The kernel creates system files which let a program query the characteristics of the hardware platform. You might be familiar with the /proc directory, for example. There are a few such directories associated with the kernel’s performance counter interface. (See the man page for perf_event_open().) The directory:

    /sys/bus/event_source/devices/XXX/events

lists the performance events supported by the processor, XXX. In the case of the Raspberry Pi 4, XXX is “armv7_cortex_a15”. I was expecting “armv8_cortex_a72”.
Supported performance events vary from core to core. Sure, there is some commonality (retired instructions, 0x08), but there are differences between cores in the same ancestral lineage! So, one must question which default events are defined with the current version of the Raspberry Pi OS.

I found the symbolic events perf_event_open() events like PERF_COUNT_HW_INSTRUCTIONS to be reasonably sane. However, beware of the branch, TLB and L2 cache events. One must be careful, in any case, since there really isn’t a precise specification for these events and actual hardware events have many nuances and subtleties which are rarely documented. [I’ve been there.] The perf_event_open() built-in symbolic events depend upon common understanding, which is the surest path to miscommunication and misinterpetation!

Copyright © 2020 Paul J. Drongowski

Performance events on Raspberry Pi 4: Tips

Performance measurement and tuning experiments with Raspberry Pi 4 are well-underway. Here are a few quick observations and tips.

Linux provides two entries into performance measurement: Performance Events for Linux (PERF) and the kernel performance counter interface (perf_event_open()). PERF is an easy-to-use tool suite and is the best place to start explorations. If you want to measure an application without modifying its code, this is for you.

PERF is built on the kernel performance counter interface. The interface consists of two calls: perf_event_open() and its associated ioctl() functions. The kernel interface is suitable for self-monitoring, that is, adding calls to an application in order to measure its internal operation. Performance counters provide two modes of operation: counting and sampling. Counting mode is most appropriate for self-monitoring. I’m currently writing code that makes self-monitoring a bit easier and hope to post the code when it’s ready.

In the meantime…

Installation

PERF and perf_event_open support are not usually installed with your typical Linux distribution. Originally, PERF was available solely as part of the Linux tools package. Well, it seems like somewhere along the way, Ubuntu and Debian diverged. Ubuntu installs PERF with Linux tools:

    sudo apt-get install linux-tools-common 
sudo apt-get install linux-tools-common-$(uname -r)

As PERF depends heavily upon kernel facilities and interfaces, you should install the version of PERF that matches the installed kernel.

Raspberry Pi OS (once known as Raspian) is a Debian distro. Shucks, wouldn’t you know it, Debian installs PERF differently:

    sudo apt install linux-perf

There are different packages for buster and stretch (the current versions of Raspberry Pi OS and Debian at the time of this writing).

    https://packages.debian.org/buster/linux-perf 
https://packages.debian.org/stretch/linux-perf

Installing on buster produces output like:

    XXX@raspberrypi:~ $ sudo apt install linux-perf 
password for XXX:
Reading package lists… Done
Building dependency tree
Reading state information… Done
The following additional packages will be installed:
linux-perf-4.9
Suggested packages:
linux-doc-4.9
The following NEW packages will be installed:
linux-perf linux-perf-4.9
0 upgraded, 2 newly installed, 0 to remove and 107 not upgraded.
Need to get 1,275 kB of archives.
After this operation, 2,735kB of additional space will be used.
Do you want to continue? [Y/n]

Versioning gotcha

And, of course, it’s never that simple. My version of Raspberry Pi OS (buster) is expecting PERF version 5.4. When you enter “sudo perf list” or any other PERF command on the command line, the shell runs the script /usr/bin/perf. The script checks the version of PERF against the kernel and complains when versions don’t match. The Debian install pulled version 4.9, not 5.4.

Rather than sort out versioning, I’ve been entering “perf_4.9” instead of “perf“. This work-around bypasses the perf script which checks versions. Since PERF is now fairly mature, it all seems to work. At some point, I’ll sort out the versioning situation and install 5.4. In the meantime, full steam ahead!

Getting started

Here’s a few PERF commands to get you started:

    perf stat --help 
perf list sw
perf stat
perf top -a
perf top -e cpu_clock
perf record
perf report

The stat approach uses counting mode to measure software and hardware events triggered by an application program (“<cmd>”). The top approach displays event counts dynamically in real-time like the ever-popular “top” utility program. The record and report approach uses sampling to produce performance reports and profiles.

For additional usage information, check out the Linux performance analysis tutorial. There are several other fine tutorials and helpful sites on the Web. Many of the tutorials show use on x86 (Intel and AMD) systems, not Raspberry Pi and ARM. For that, I recommend my own three part tutorial:

  • Part 1 demonstrates how to use PERF to identify and analyze the hottest execution spots in a program. Part 1 covers the basic PERF commands, options and software performance events.
  • Part 2 introduces hardware performance events and demonstrates how to measure hardware events across an entire application.
    Part 3 uses hardware performance event sampling to identify and analyze hot spots within an application program.

In addition to usage, I offer information and guidance concerning ARM micro-architecture. This information is especially helpful when you get into hardware performance events. Check out my summaries of the ARM11 and ARM Cortex-A72 micro-architectures. ARM11 covers Raspberry Pi models 1, 2, and 3 (BCM2835 and BCM2836), while the Cortex-A72 summary covers the Raspberry Pi 4 (BCM2711).

Other helpful on-line resources are:

Paranoia!

Performance measurement is fraught with security issues and holes. The kernel developers implemented a control flag file, /proc/sys/kernel/perf_event_paranoid which sets the level of access and vulnerability when taking measurements. Quoting the Linux man page:

    The perf_event_paranoid file can be set to restrict access 
to the performance counters.
2 allow only user-space measurements (default since
Linux 4.6).
1 allow both kernel and user measurements (default
before Linux 4.6).
0 allow access to CPU-specific data but not raw
tracepoint samples.
-1 no restrictions.
The existence of the perf_event_paranoid file is the
official method for determining if a kernel supports
perf_event_open().

If you’re operating in a fairly closed, single-user environment, then set the content of the file to 0 or -1.

Read the perf_event_open() man page

I recommend reading the perf_event_open() man page. If you’re just starting your journey into performance measurement, you will be overwhelmed by the detail at first. However, just let the information wash over you and know that it’s there. The tutorials don’t always mention the perf_event_paranoid flag and other low-level details. Reading the man page should help you across future stumbling blocks and will enhance your understanding of events, counting and sampling.

Want to learn more about Raspberry Pi 4 (Cortex-A72 / Broadcom BCM2711) performance tuning? Please read:

Copyright © 2020 Paul J. Drongowski

Welcome CS teachers and students!

[Be sure to visit Living Computers in Seattle. SIGCSE 2017 attendees are admitted free during the conference. I visited the museum today and it was a lot of fun! K-12 teachers will enjoy the hands on exhibits.]

The annual ACM Special Interest Group on Computer Science Education (SIGCSE 2017) Technical Symposium is next week (March 8 – 11) in Seattle, Washington. The symposium brings together educators at all levels (K-12 and higher ed) to exchange and discuss the latest methods, practices and results in computer science education.

I don’t often advertise it, but the Sand, Software, Sound site has many resources for educators and students alike. You can browse these resources by clicking on one of the WordPress topic buttons (Raspberry Pi, PERF, Courseware, etc.) above. You can also search for a topic or choose from one of the categories listed in the right sidebar.

Here are a few highlights.

I taught many computer-related subjects during my career and have posted course notes, slides and old projects. The four main sections are:

  • CS2 data structures: Undergraduate data structures course suitable for advanced placement students.
  • Computer design: Undergraduate computer architecture and design which uses a multi-level modeling approach.
  • VLSI systems: Graduate course on VLSI architecture, design and circuits which is suitable for undergraduate seniors.
  • Topics in computer architecture: Material for a special topics seminar about computer architecture (somewhat historical).

Please feel free to dig through these materials and make use of them.

Software and hardware performance analysis formed a major thread throughout my professional life. I recommend reading my series of tutorials on the Linux PERF tool set for software performance analysis:

The ARM11 microarchitecture summary is background material for the PERF tutorial. Program profiling is a good way to bring computer architecture to life and to teach students how to analyze and assess the execution speed of their programs.

There are two additional tutorials and getting started guides for teachers and students working on Raspberry Pi:

Music technology and computer-based music-making have been two of my chief interests over the years. The Arduino section of the site has several of my past projects using the Arduino for music-making. You should also check out my recent blog posts about the littleBits synth modules and littleBits Arduino. Please click on the tags and links at the bottom of each post in order to chase down material.

You might also enjoy my tutorial on software synthesizers for Linux and Raspberry Pi. The tutorial is a getting started guide for musicians of all stripes — music teachers and students are certainly welcome, too!

PERF tutorial part 3 is now on-line

Just wrapped up Part 3 of the Linux-tools PERF tutorial.

The tutorial now consists of three parts. Part 1 covers the most basic PERF commands and shows how to find program hot-spots using software performance events. Part 2 discusses hardware performance events and performance counters, and demonstrates how to measure hardware performance events using PERF counting mode. Part 2 introduces several derived performance metrics like instructions per second (IPC) and applies these metrics to the sample application programs.

Part 3 is the newest addition to the tutorial series. It builds on parts 1 and 2, showing how to use hardware performance events and counter sampling to profile an application program. Part 3 discusses sampling period and frequency, the sampling process, overhead, statistical accuracy/confidence and other practical concerns.

I hope you find the PERF tutorial to be useful in your work! Although I produced the example data on the ARM-based Raspberry Pi, the commands and techniques will also work on x86.

PERF tutorial part 2 now available

Part 2 of a three part tutorial about Linux-tools PERF is now available.

Part 1 of the series shows how to find hot execution spots in an application program. It demonstrates the basic PERF commands using software performance events such as CPU clock ticks and page faults.

Part 2 of the series — just released — introduces hardware performance counters and events. I show how to count hardware events with PERF and how to compute and apply a few basic derived measurements (e.g., instructions per cycle, cache miss rate) for analysis. Part 3 is in development and will show how to use sampling to profile a program and to isolate performance issues in code.

All three parts of the series use the same simple, easy to understand example: matrix multiplication. One version of the matrix multiplication program illustrates the impact of severe performance issues and what to look for in PERF measurements. The issues are mitigated in the second, improved version of the program. PERF measurements for the improved program are presented for comparison.

The test platform is the latest second generation Raspberry Pi 2 running Raspbian Wheezy 3.18.9-v7+. The Raspberry Pi 2 has a 900MHz quad-core ARM Cortex-A7 (ARMv7) processor with 1GByte of primary memory. Although the tutorial series demonstrates PERF on Cortex-A7, the same PERF commands and analytical techniques can be employed on other architectures like x86.

A special note for Raspberry Pi users. The current stable distribution of Raspbian Wheezy — 3.18.7-v7+ February 2015 — does not support PERF hardware events. Full PERF support was enabled in a later, intermediate release and full PERF support should be available in the next stable release of Raspbian Wheezy. In the meantime, Raspberry Pi 2 users may profile their programs using PERF software events as shown in Part 1 of the tutorial. First generation Raspberry Pi users are also restricted to software performance events.

Brave souls may try rpi-update to upgrade to the latest and possibly unstable release. I recommend waiting for the next stable release unless you really, really know what you are doing and are willing to chance an unstable kernel with potentially catastrophic consequences.