Performance counter kernel module

Kernel module source: Makefile, aprof.c, readme
User-space source: rpi_pmu.h, rpi_pmu.c

Although we may have an inkling about the behavior and performance of a program, nothing beats measurement for objectivity. The Linux time command reports the elapsed, user and system execution times for a program:

> time ./small

real 0m17.015s
user 0m16.510s
sys 0m0.020s

This is good enough to measure execution time.

If we want to speed up the program, we also need insight into the program’s dynamic behavior, namely, how well the program is using the underlying machine hardware, also known as the microarchitecture. We would like to know if the program is tripping over any performance issues like data cache misses, which slow down access to data in primary memory. That’s where hardware performance counters and events come into play. The performance counters count hardware events like data cache misses, providing important clues about slow-running code. We then need to interpret the clues like a detective and change the code to speed things up.

This page describes a kernel module and a small package of user-space functions that configure, control, clear and read the performance counters on the Raspberry Pi. We need a kernel module because the kernel must grant user-space access to the performance counters and their control register. The kernel module grants access when it is loaded. Then, any user-space program can use the performance counters to take measurements.

The ARM 1176 performance counters

The Raspberry Pi is based on the Broadcom BCM2835 processor. The BCM2835 has an ARM1176JZF-S core, a member of the ARM11 family. The performance monitoring unit (PMU) is part of the ARM 1176 c15 System Control Coprocessor. The ARM1176JZF-S Technical Reference Manual describes the core in detail, including the performance counters and events.

The ARM 1176 PMU has three performance counters:

  • The Cycle Counter Register (CCR) counts core clock cycles.
  • Count Register 0 (CR0) counts one of the 21 event types supported by the ARM 1176.
  • Count Register 1 (CR1) also counts one of the 21 supported event types.

Up to three events can be measured simultaneously although one of those events is fixed and is always core clock cycles.

The counters are configured and controlled through the Performance Monitor Control Register (PMCR). The fields in the PMCR are summarized in the table below. These fields control the operation of the performance counter registers. There are fields to select events (EvtCount0 and EvtCount1), to clear the counters (C and P), to scale cycle counting by 64 (D), to enable the counters (E), and to enable interrupts on counter overflow (ECC, EC0 and EC1).

Bits    Field name   Function
27:20   EvtCount0    Selects event for CR0
19:12   EvtCount1    Selects event for CR1
11      X            Export events to event bus
10      CCR          CCR overflow flag
9       CR1          CR1 overflow flag
8       CR0          CR0 overflow flag
7       SBZ          Should be zero (reserved)
6       ECC          Enable CCR overflow interrupt
5       EC1          Enable CR1 overflow interrupt
4       EC0          Enable CR0 overflow interrupt
3       D            Scale cycle count by 64
2       C            CCR reset
1       P            CR0 and CR1 reset
0       E            Enable all counters

Performance Monitor Control Register (PMCR) fields
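To make the field layout concrete, here is a minimal sketch of how a PMCR value could be composed from the table above. The macro and function names are hypothetical and are not taken from rpi_pmu.h. The sketch assumes we want both event counters and the cycle counter reset and enabled, cycle counting scaled by 64, stale overflow flags cleared, and all interrupt enables left at zero.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions come from the PMCR table; the names are hypothetical. */
#define PMCR_ENABLE        (1u << 0)   /* E: enable all counters        */
#define PMCR_RESET_EVENTS  (1u << 1)   /* P: reset CR0 and CR1          */
#define PMCR_RESET_CYCLES  (1u << 2)   /* C: reset CCR                  */
#define PMCR_DIV64         (1u << 3)   /* D: scale cycle count by 64    */
#define PMCR_OVERFLOW_MASK (7u << 8)   /* CR0, CR1, CCR overflow flags  */
#define PMCR_EVT1_SHIFT    12          /* EvtCount1 occupies bits 19:12 */
#define PMCR_EVT0_SHIFT    20          /* EvtCount0 occupies bits 27:20 */

/* Compose a PMCR value that selects two events, clears the counters and
   overflow flags, and starts counting. Bits 6:4 (interrupt enables)
   are deliberately left at zero. */
static uint32_t pmcr_start_value(uint32_t evt0, uint32_t evt1)
{
    assert(evt0 <= 0xFF && evt1 <= 0xFF);   /* event fields are 8 bits */
    return (evt0 << PMCR_EVT0_SHIFT) | (evt1 << PMCR_EVT1_SHIFT)
         | PMCR_OVERFLOW_MASK | PMCR_DIV64
         | PMCR_RESET_CYCLES | PMCR_RESET_EVENTS | PMCR_ENABLE;
}
```

For example, pmcr_start_value(0x7, 0xFF) yields 0x007FF70F. On the Pi, a value like this would then be written to the PMCR with an MCR instruction.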

Software should NOT enable interrupts because our kernel module does not include an interrupt handler. This implementation is somewhat unsafe. Later versions of ARM (like the Cortex-A8) put the interrupt enable bits into a separate control register and that register is completely inaccessible from user-space. User-space access to the interrupt enable bits could be used to implement a denial of service attack (DoS) or crash the system (the ultimate DoS attack). On the RPi, we are trusting ourselves to do the right thing.

The PMCR has three overflow status flags: the fields CCR, CR0, and CR1. When a counter overflows, the corresponding overflow status flag is set to one. These bits are sticky! Once set, a flag remains set even if software writes a zero into it. An overflow flag is cleared (reset to zero) by writing a one into the bit. Writing 0x700 into the PMCR therefore clears the entire register including the overflow bits: the ones in bits 10:8 clear the three flags and the zeros clear every other field. Stickiness is quite useful and desirable in this case, although the means to clear the bits may seem unorthodox.
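The write-one-to-clear behavior is easy to get wrong, so here is a small software model of it. This is only a sketch for reasoning about the flags, not hardware-accurate code: the function name is hypothetical, and the model ignores any special behavior of the other fields (for example, the reset bits).

```c
#include <stdint.h>

#define PMCR_OVERFLOW_MASK 0x700u   /* bits 10:8: CCR, CR1, CR0 flags */

/* Software model only: returns the PMCR contents after a write, given
   the contents before it. Ordinary fields take the written value.
   An overflow flag is cleared only where the written value has a one;
   writing a zero leaves the flag as it was (sticky). */
static uint32_t pmcr_model_write(uint32_t before, uint32_t written)
{
    uint32_t flags = before & PMCR_OVERFLOW_MASK & ~written;
    uint32_t rest  = written & ~PMCR_OVERFLOW_MASK;
    return rest | flags;
}
```

One consequence worth noticing: in this model, reading the register and writing the same value straight back clears any flag that was set, because the set flag is written back as a one. Code that wants to preserve the flags has to mask bits 10:8 to zero before writing.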

The PMCR, CCR, CR0 and CR1 registers are read and written by the ARM MRC and MCR instructions. These instructions are privileged and are intended for use by the operating system, that is, they must be executed in kernel mode. The ARM1176 allows a special exemption in the case of the PMU registers, however. The Secure User and Non-secure Access Validation Control Register (SUNSAV) controls user-space access to the PMU registers. When bit 0 of this register (named “V”) is set to 1, user-space access to the PMU registers is granted and an application program is allowed to execute MRC and MCR instructions which read/write the PMU registers (only). If the V bit is 0, access is not granted and any attempt to execute MRC or MCR from user space causes an undefined instruction exception.

The loadable kernel module, which I describe next, changes the V bit. The user-space measurement functions execute MRC and MCR to set up and read the performance counters. We discuss user-space code later.
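As a sketch of what "changing the V bit" amounts to, the helper below computes a new SUNSAV value with bit 0 set or cleared while leaving the other bits untouched. The names are hypothetical. Note that the module described on this page takes the simpler route of writing the constants 0x1 and 0x0 to the register; the read-modify-write form here is just one alternative.

```c
#include <stdint.h>

#define SUNSAV_V 0x1u   /* bit 0: grants user-space access to the PMU */

/* Return `reg` with the V bit set (allow != 0) or cleared (allow == 0),
   preserving every other bit. */
static uint32_t sunsav_with_v(uint32_t reg, int allow)
{
    return allow ? (reg | SUNSAV_V) : (reg & ~SUNSAV_V);
}
```

In kernel code, the argument would come from reading the register with MRC and the result would be written back with MCR.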

The kernel module: aprofile.ko

I decided to call the kernel module aprofile in honor of the venerable Linux profiling system OProfile. The source code is in a file named aprof.c and the kernel module binary file is named aprofile.ko. The binary kernel module is stored in the directory:


where the kernel knows where to find it. You need to do some one-time special preparation (e.g., download the kernel source headers, prepare symbolic information, etc.) in order to build a kernel module. You also need to customize the Makefile. These topics are described on the page devoted to the kernel module build process and they will not be described here. We only concentrate on the module design and code in the following discussion.

One of the neat things about Linux kernel modules is the ability to dynamically load and unload a module. The module does not have to be part of the Linux kernel at boot time. The module loads and unloads like a dynamically linked library. External references are resolved at load time. External references have version information associated with them. The kernel checks the version information at load time to see if the references are consistent (i.e., the module and kernel version information matches). If the references are consistent, the module is loaded. This is why you need to go through the one-time preparation steps before even attempting a build.

Kernel modules are manually loaded and unloaded using the modprobe command. (You must sudo!) The command:

sudo modprobe aprofile

loads the aprofile kernel module and the command:

sudo modprobe -r aprofile

unloads the aprofile kernel module. The lsmod command displays a list of the currently loaded modules. Run lsmod to see if aprofile has successfully loaded or unloaded.
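A measurement program may want to perform the same check itself before touching the counters. The sketch below scans a stream in the format of /proc/modules (the file that lsmod formats), where the module name is the first field on each line. The function name is hypothetical, and this is one possible approach rather than part of the rpi_pmu package.

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 if `name` is the first field of some line in a stream
   formatted like /proc/modules, and 0 otherwise. */
static int module_listed(FILE *modules, const char *name)
{
    char line[512];
    size_t len = strlen(name);
    int at_line_start = 1;   /* guards against matching mid-line chunks */

    while (fgets(line, sizeof line, modules) != NULL) {
        size_t n = strlen(line);
        if (at_line_start && strncmp(line, name, len) == 0 &&
            (line[len] == ' ' || line[len] == '\n' || line[len] == '\0'))
            return 1;
        at_line_start = (n > 0 && line[n - 1] == '\n');
    }
    return 0;
}
```

A caller would open /proc/modules with fopen("/proc/modules", "r"), pass the stream along with "aprofile", and refuse to take measurements if the result is 0.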

The kernel module functionality itself is very simple. In fact, this module is just about the proverbial “Hello World” example of kernel modules! It just needs to set the V bit in the Secure User and Non-secure Access Validation Control Register when the module loads and to clear this same bit when the module unloads.

The module begins with a bunch of includes. These includes bring in kernel header files. (Remember the special preparation?)

#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/init.h>
#include <linux/moduleparam.h>
#include <linux/workqueue.h>
#include <linux/time.h>
#include <asm/mutex.h>

Next, we define two inline routines to read and write the Secure User and Non-secure Access Validation Control Register. The inline keyword tells the compiler to generate these routines in-line without the usual call and return sequence.

// Read the Secure User and Non-secure Access Validation
// Control Register and return its value.
static inline unsigned long
armv6_sunsav_read(void)
{
  u32 val;
  asm volatile("mrc   p15, 0, %0, c15, c9, 0" : "=r"(val));
  return val;
}

// Write the Secure User and Non-secure Access Validation
// Control Register
static inline void
armv6_subsav_write(unsigned long val)
{
  asm volatile("mcr   p15, 0, %0, c15, c9, 0" : : "r"(val));
}

Both functions use in-line assembler to generate the MRC and MCR instructions that read and write, respectively, the access validation register. The arguments to the instructions look magical and are taken from the ARM1176 Technical Reference Manual. Suffice it to say, the instructions address the system control coprocessor (p15) and select register c15 with secondary register c9. The function armv6_sunsav_read() returns the current value of the access validation register and armv6_subsav_write() takes an argument which is the new value to be written into the access validation register.

After that, we define two routines. The first function, aprofile_init(), is called when the module is loaded. It writes a 1 into the V bit of the access validation register. The second routine, aprofile_exit(), is called when the module is unloaded. It writes a zero into the V bit of the access validation register.

static int __init aprofile_init(void)
{
  int err = 0 ;
  unsigned long sunsav = 0 ;
  printk ("aprofile module loaded\n") ;

  // Enable user-space access to the performance counters
  armv6_subsav_write(0x1) ;

  // Read the access validation register
  sunsav = armv6_sunsav_read() ;
  printk ("SUNSAV: %lu\n", sunsav) ;

  return( err ) ;
}

static void __exit aprofile_exit(void)
{
  unsigned long sunsav = 0 ;

  // Disable user-space access to the performance counters
  armv6_subsav_write(0x0) ;

  // Read the access validation register
  sunsav = armv6_sunsav_read() ;

  printk ("SUNSAV: %lu\n", sunsav) ;
  printk ("aprofile module unloading...\n") ;
}

These functions are a little bit verbose. They both write messages into the kernel log using the kernel printk routine. printk behaves like printf except that it writes the output to the kernel log buffer, which the system logger typically copies to the kernel log file, /var/log/kern.log. The easiest way to see the latest log messages is the command:

sudo dmesg | tail

Here are example log messages after loading and unloading aprofile.

[ 3846.372457] aprofile module loaded
[ 3846.372490] SUNSAV: 1
[ 4043.449866] SUNSAV: 0
[ 4043.449915] aprofile module unloading...

You should avoid writing messages to the log during normal operation. However, when you’re debugging a kernel module, printk is a vital tool.

The last several source lines in the kernel module tell the kernel to call aprofile_init() on load and to call aprofile_exit() on unload. The remaining lines give license, author and descriptive information for the module.


module_init(aprofile_init);
module_exit(aprofile_exit);
MODULE_AUTHOR("Paul Drongowski");
MODULE_DESCRIPTION("Simple ARMv6 profiler");

User-space code: rpi_pmu.h and rpi_pmu.c

I encapsulated the PMU user-space support code in two source files: rpi_pmu.h and rpi_pmu.c. A program needs to include rpi_pmu.h in order to use the support routines.

The include file borrows a small amount of code from the ARMv6 PERF implementation in:


Of this code, the event number definitions are the most important from the user’s perspective. These symbolic constants are passed to the function start_counting() in order to select and enable the events to be measured. Please note that four of these events are only supported on the ARM 1176. The PERF implementation does not currently define these events!

#define ARMV6_EVENT_ICACHE_MISS	    0x0
#define ARMV6_EVENT_IBUF_STALL	    0x1
#define ARMV6_EVENT_DDEP_STALL	    0x2
#define ARMV6_EVENT_ITLB_MISS	    0x3
#define ARMV6_EVENT_DTLB_MISS	    0x4
#define ARMV6_EVENT_BR_EXEC	    0x5
#define ARMV6_EVENT_INSTR_EXEC	    0x7
#define ARMV6_EVENT_DCACHE_HIT	    0x9
#define ARMV6_EVENT_EXPL_D_ACCESS   0x10
#define ARMV6_EVENT_WBUF_DRAINED    0x12
#define ARMV6_EVENT_NOP		    0x20

// Only the ARM1176 supports the following four events
#define ARMV6_EVENT_CALL_EXEC       0x23
#define ARMV6_EVENT_RET_EXEC        0x24
#define ARMV6_EVENT_RET_PREDICT     0x25

As we saw in the kernel module, rpi_pmu.h defines inline functions to read and write hardware registers. The functions armv6_pmcr_read() and armv6_pmcr_write() read and write the Performance Monitor Control Register (PMCR), respectively. Both functions use the normally privileged MRC and MCR instructions. These functions are followed by field and bit mask definitions to extract and set bits in the PMCR.

// Read the PMCR and return its value.
static inline unsigned long
armv6_pmcr_read(void)
{
  uint32_t val;
  asm volatile("mrc   p15, 0, %0, c15, c12, 0" : "=r"(val));
  return val;
}

// Write the PMCR using the specified value.
static inline void
armv6_pmcr_write(unsigned long val)
{
  asm volatile("mcr   p15, 0, %0, c15, c12, 0" : : "r"(val));
}

The functions armv6pmu_read_counter() and armv6pmu_write_counter() read and write (respectively) one of the performance counter registers. Both functions take an argument which selects a performance counter. The symbolic constants ARMV6_CYCLE_COUNTER, ARMV6_COUNTER0 and ARMV6_COUNTER1 choose the appropriate hardware register. (For brevity’s sake, I won’t reproduce all of the code here.) MRC and MCR instructions do the heavy lifting.

Finally, rpi_pmu.h declares four user-callable functions which are defined in rpi_pmu.c. The four functions are:

extern void start_counting(int evt0, int evt1) ;
extern void stop_counting(void) ;
extern void get_counts(uint64_t* cycles, uint64_t* evt0, uint64_t* evt1) ;
extern void print_counts(FILE* result_file) ;

The function start_counting() chooses the events to be measured, clears the counters and starts counting. It always clears and starts the core clock cycle counter since this event is essentially free. The support code remembers the event identifiers that are passed as arguments.

The function stop_counting() stops the counters. The support code remembers the final event counts and the counter overflow status.

The function get_counts() returns the final event counts through the call-by-reference parameters. Pass NULL if you are not interested in retrieving the final count for a particular event/counter.
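The NULL convention for call-by-reference parameters can be sketched as follows. This is not the code from rpi_pmu.c; the structure and function names are hypothetical and only illustrate how a routine can skip the results the caller does not want.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical snapshot of the values remembered when counting stops. */
struct pmu_snapshot {
    uint64_t cycles;
    uint64_t evt0;
    uint64_t evt1;
};

/* Copy each saved count to the caller, skipping any pointer that is NULL. */
static void copy_counts(const struct pmu_snapshot *snap,
                        uint64_t *cycles, uint64_t *evt0, uint64_t *evt1)
{
    if (cycles != NULL) *cycles = snap->cycles;
    if (evt0 != NULL)   *evt0 = snap->evt0;
    if (evt1 != NULL)   *evt1 = snap->evt1;
}
```

With the real interface, the equivalent call for a caller who only cares about cycles and the second event would look like get_counts(&cycles, NULL, &evt1).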

The function print_counts() writes the final event counts to a file in a human-readable form. Here’s sample output:

Performance Monitor events
Cycle Counter: 195723338 Processor cycles (scaled by 64)
Counter 0: 1309465988 Executed instruction
Counter 1: 3936359065 (OVERFLOW) CPU cycles

The output shows the number of events measured for each counter/event type. The events are identified by name. (Much better than a hex number.) The output also identifies any counter overflows. In this case, counter 1 had an overflow and the event count is not valid. The cycle count is always scaled by 64 because if it is not scaled, the 32-bit counter overflows in an absurdly short 6.14 seconds (2^32 cycles at a 700 MHz core clock).
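The arithmetic behind the scaling decision can be checked with a few lines of C. This is a sketch with hypothetical function names. The 700 MHz clock rate used in the examples is an assumption (it is the rate at which 2^32 cycles take about 6.14 seconds); substitute the actual clock if the Pi is running at a different speed.

```c
#include <stdint.h>

#define CYCLE_SCALE 64u   /* D bit set: the CCR ticks once per 64 cycles */

/* Convert a scaled cycle count read from the CCR into core clock cycles. */
static uint64_t scaled_to_cycles(uint32_t ccr_count)
{
    return (uint64_t)ccr_count * CYCLE_SCALE;
}

/* Seconds until a 32-bit counter wraps when it ticks once every `scale`
   cycles of a clock running at `clock_hz`. */
static double seconds_to_overflow(double clock_hz, unsigned scale)
{
    return 4294967296.0 * scale / clock_hz;   /* 2^32 ticks */
}
```

Under the 700 MHz assumption, seconds_to_overflow(700e6, 1) is about 6.14 seconds, while seconds_to_overflow(700e6, 64) is about 393 seconds, or roughly six and a half minutes. The scaled cycle count in the sample output, 195723338, corresponds to scaled_to_cycles(195723338) = 12526293632 core clock cycles.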

That’s it! Please see the performance analysis example pages to see the code in action.

Copyright © 2013 Paul J. Drongowski