PERF tutorial: Counting hardware performance events

This article is part 2 of a three-part series on the PERF (linux-tools) performance measurement and profiling system.

  • In part 1, I demonstrate how to use PERF to identify and analyze the hottest execution spots in a program. Part 1 covers the basic PERF commands, options and software performance events.
  • Part 2 introduces hardware performance events and demonstrates how to measure hardware events across an entire application.
  • Part 3 uses hardware performance event sampling to identify and analyze program hot-spots.

Each article builds on the usage and background information presented in the previous articles. So, if you’ve arrived here through a Web search, please scan the preceding articles for background. All articles use the same two example programs:

  • Naive: Naive matrix multiplication program (textbook algorithm).
  • Interchange: An improved matrix multiplication program (loop nest interchange algorithm).

The textbook algorithm has known performance issues due to its memory access pattern. The loop nest interchange algorithm has a different access pattern which reads/writes data more efficiently.

Author: P.J. Drongowski

Introduction

This article introduces performance events and counters, and shows how to use them to make basic measurements of application program behavior using PERF. I primarily use the perf stat command in this article.

PERF, which is part of the linux-tools package, is an infrastructure and suite of tools for collecting and analyzing performance data. You should install the version of PERF that matches the version of your Linux kernel. The data below were collected on a Raspberry Pi 2 (second generation, BCM2836 processor, ARMv7 Cortex-A7) using Raspbian Wheezy 3.18.9+v7+. Therefore, I installed linux-tools-3.18. Enter:

    uname -a

to identify the kernel on your test platform. Then, enter:

    sudo apt-get install linux-tools-X.Y

to install PERF, replacing X and Y with the major and minor version numbers of the kernel (e.g., linux-tools-3.18).

In the first article, I used PERF’s software events to make measurements. The underlying microprocessor hardware is equipped with a small number of performance counters which can be used to count hardware performance events. The number, relative frequency and kind of performance events can tell us quite a bit about how effectively a program is using the hardware. This hard information can help us tune the program to run more efficiently on the microprocessor.

Performance events

A performance event is an occurrence of a hardware condition. Pretty abstract, so let’s bring this down to Earth.

If you’re interested in performance, you probably know the clock speed of the microprocessor in your test platform. I’m working on a Raspberry Pi 2 with a 900MHz ARM Cortex-A7 quad-core processor. The CPU clock ticks at a 900MHz rate. Every clock tick is a new hardware event. The ARMv7 architecture defines a CPU clock performance event, and the number of processor cycles needed to execute a program tells us how long the program took to run.
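
For example, at 900MHz, a program that keeps one core busy for three seconds accumulates roughly:

    3 seconds * 900,000,000 cycles/second = 2,700,000,000 cpu-cycles events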

Here’s another example. Successful execution and completion of an instruction is an important event. The ARMv7 architecture defines a performance event for a successfully completed (“architecturally executed”) instruction. Another common term for successfully completed is “retired” and I will use “retired” instead of the awkward “architecturally executed.” The number of retired instructions tells us how many program instructions are successfully executed and retired.

The number and kind of performance events are very specific to the machine (micro-)architecture. The ARMv7 performance events are different from AMD x86 processor events, for example. In everyday practice, we need to consult the technical reference manual for the specific microprocessor in the test platform. I have included a list of the ARMv7 (Cortex-A7) performance events at the end of this article. The table is taken from the ARM Cortex-A7 MPCore Technical Reference Manual.

For performance tuning, the general notion of “successfully completed instruction” often suffices. Sometimes, however, a more detailed and nuanced understanding of a specific performance event is necessary. Contemporary microprocessors use many tricks and techniques to wring good performance out of a localized sequence of instructions in a program:

  • The processor may issue more than one instruction at a time for execution.
  • The processor may issue instructions out of program order.
  • The processor may predict control flow and speculatively begin executing instructions along the predicted program branch.

These techniques involve complicated hardware behavior and conditions and sometimes result in highly nuanced performance events.

PERF defines a set of common, symbolic hardware performance events. (See the following table.) These performance events are the ones that are most commonly used in practice. I included the corresponding ARMv7 Event ID so you can look up the event description in the full ARMv7 performance event table at the end of the article.

PERF hardware performance events

Event name                       ARMv7 Event ID
cpu-cycles OR cycles             0xFF
instructions                     0x08
cache-references                 0x04
cache-misses                     0x03
branch-instructions OR branches  0x0C
branch-misses                    0x10
bus-cycles                       0x1D
L1-dcache-loads                  0x04
L1-dcache-load-misses            0x03
L1-dcache-stores                 0x04
L1-dcache-store-misses           0x03
L1-icache-loads                  0x14
L1-icache-load-misses            0x01
LLC-loads                        0x16
LLC-load-misses                  0x17
LLC-stores                       0x16
LLC-store-misses                 0x17
dTLB-load-misses                 0x05
dTLB-store-misses                0x05
iTLB-load-misses                 0x02
branch-loads                     0x12
branch-load-misses               0x10

PERF can display a list of the available software and hardware performance events. Just enter the command:

    perf list

to obtain a list of the available symbolic events. You may also specify an event using its raw identifier. For example, “instructions” and “r008” are equivalent. Raw event identifiers let you measure performance events that do not have a PERF symbolic name.
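
For example, these two commands count the same event, since raw ID 0x08 is the ARMv7 “instruction architecturally executed” event listed at the end of this article:

    perf stat -e instructions ./naive
    perf stat -e r008 ./naive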

Sharp-eyed readers probably spotted a few PERF symbolic events that have the same physical Event ID, such as L1-dcache-loads and L1-dcache-stores (ARMv7 Event ID 0x04). PERF defines a set of common events, but there isn’t any guarantee that the host processor actually has equivalent events. ARMv7, for example, does not separate cache accesses by load and store operations. As a compromise, the L1-dcache-loads and L1-dcache-stores events are mapped to the ARMv7 data read/write L1 data cache event. You are likely to find similar compromises on other processor implementations, too.

Performance counters

Each processor core has a set of hardware performance counters which count performance events. PERF takes care of the grungy set-up work for you. However, you need to be aware of certain processor-specific limitations such as the number of available counters and event assignment restrictions. The Broadcom BCM2836 (ARMv7 Cortex-A7) has four performance counters per core. Any ARMv7 performance event can be assigned to any of the four performance counters. Other processors may have a different number of performance counters, and certain performance counters may be restricted to one or more specific performance events. Generally, you need to consult the appropriate processor reference manual for this information.

Performance counters (usually) support two usage modes: counting and sampling.

  • In counting mode, the counters are configured, cleared and started. Then, sometime later, the counters are stopped and the number of events is read and reported.
  • In sampling mode, the counters are configured, cleared and started. However, they are configured to generate a sampling interrupt after a certain number of events are counted. Upon interrupt, a sample is accumulated and the counter is restarted. After the workload completes, counter interrupts are turned off, the samples are aggregated, and summarized statistics are reported.

This article (part 2) concentrates on counting mode. Part 3 deals with sampling mode. Simple event counting is initiated using the perf stat command. Event sampling is handled using the perf record and perf report commands.
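
For reference, here are the basic command forms for each mode (minimal sketches using the example program; part 3 covers sampling in detail):

    perf stat -e cpu-cycles,instructions ./naive     # counting mode
    perf record -e cpu-cycles ./naive                # sampling mode
    perf report                                      # summarize the samples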

Counting mode has very low overhead because the hardware counters place no additional load on the CPU, but it can only measure events for the application as a whole. Sampling mode imposes additional overhead, but it lets you attribute performance events and issues to specific regions of code (e.g., hot-spots).

Hardware performance counters are fixed-length registers. The Cortex-A7 performance counters are 32 bits long. Microprocessors are fast, and a program can easily overflow a counter. Measure CPU cycles on the 900MHz Cortex-A7 in the Raspberry Pi 2 and the counter will overflow in a relatively short 4.77 seconds (2^32 / 900MHz). PERF takes care of this problem for you by “virtualizing” the counters to 64 bits — so much better than having to handle overflow yourself when directly accessing the counters from user space.

Make measurements

It’s easy to measure the number of performance events that occur while an application program runs. Simply enter a perf stat command specifying both the events to count and the application program to run:

    perf stat -e cpu-cycles,instructions ./naive

In this example command, PERF measures CPU clock cycles and retired instructions while running the naive matrix multiplication program. Once the application program (naive) completes, PERF reports the number of occurrences of each performance event:

    Performance counter stats for './naive':

        1,177,945,197   instructions     # 0.42  insns per cycle
        2,822,332,111   cycles

        3.178582734 seconds time elapsed

That’s all there is to it! I recommend putting your PERF commands into a script so that you can easily re-run performance experiments. Program performance tuning is usually performed after bugs and other kinks have been fixed. Tuning is an iterative process involving trial and error, and you should plan to run and re-run experiments many times.
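
A minimal script might look like this (the event list is just a starting point; adjust it for your own experiments):

    #!/bin/sh
    # Re-runnable PERF experiment for the naive matrix multiplication program
    perf stat -e cpu-cycles,instructions,cache-references,cache-misses ./naive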

Assess accuracy

In the preceding example, PERF measured over one billion instructions and roughly 2.8 billion CPU clock cycles while running the naive matrix multiplication program. How accurate are these measurements?

I disassembled and analyzed the nested loops that perform matrix multiplication. I counted the number of instructions in each loop and the number of times that each loop is executed:

      Outer loop instructions:          4,000
      Middle loop instructions:     2,250,000
      Inner loop instructions:  1,125,000,000

The estimated number of retired instructions is the sum of the instructions for each loop:

      Total instructions:       1,127,254,000
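
For reference, these totals decompose as follows (my reconstruction; the per-iteration instruction counts are inferred from the totals, not read from the disassembly):

      Outer loop:          500 iterations * 8 instructions =         4,000
      Middle loop:     250,000 iterations * 9 instructions =     2,250,000
      Inner loop:  125,000,000 iterations * 9 instructions = 1,125,000,000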

The naive program has two hot-spots: initialization and matrix multiplication. Matrix multiplication dominates, so even though PERF measures and aggregates the retired instruction events for both initialization and matrix multiplication, the PERF measurement should be only slightly greater than the estimate. This property holds:

    1,177,945,197 (actual) > 1,127,254,000 (estimated)

Estimating the number of CPU clock cycles is straightforward:

    3.178 * 900MHz = 2,860,200,000 clock cycles

Here, I multiply the elapsed execution time (as reported by PERF) times the CPU clock rate (900MHz) to obtain the estimated number of CPU clock events. PERF measured 2,822,332,111 cycles, which is quite close to the estimate.

Program source code

Please take a look at the source code for the naive matrix multiplication program. The symbolic constant MSIZE is defined to be 500. So, all matrices are square (500 by 500 4-byte floating point values).

    void multiply_matrices()
    {
      int i, j, k ;

      // Textbook algorithm
      for (i = 0 ; i < MSIZE ; i++) {
        for (j = 0 ; j < MSIZE ; j++) {
          float sum = 0.0 ;
          for (k = 0 ; k < MSIZE ; k++) {
              sum = sum + (matrix_a[i][k] * matrix_b[k][j]) ;
          }
          matrix_r[i][j] = sum ;
        }
      }
    }

The innermost subscript k changes the fastest and the program steps sequentially through the memory elements of matrix_a. However, the program steps across the columns of matrix_b, resulting in a large physical stride through memory. (Recall that C arrays are arranged in row major order in primary memory.) Access to matrix_b makes inefficient use of the L1 data cache and causes costly data translation lookaside buffer (TLB) misses.
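
A quick way to see the stride is to compute it directly (a minimal sketch, assuming matrix_b is declared as a global float matrix_b[MSIZE][MSIZE], as in the example program):

    #include <stdio.h>
    #include <stddef.h>

    #define MSIZE 500

    float matrix_b[MSIZE][MSIZE] ;

    int main(void)
    {
      // Row-major layout: stepping the row index k by one advances the
      // address by a full row: MSIZE * sizeof(float) = 500 * 4 = 2,000 bytes.
      ptrdiff_t stride = (char *) &matrix_b[1][0] - (char *) &matrix_b[0][0] ;
      printf("stride through matrix_b: %td bytes\n", stride) ;
      return 0 ;
    }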

Here is the source code for the loop nest interchange program.

    void multiply_matrices()
    {
      int i, j, k ;

      // Loop nest interchange algorithm
      for (i = 0 ; i < MSIZE ; i++) {
        for (k = 0 ; k < MSIZE ; k++) {
          for (j = 0 ; j < MSIZE ; j++) {
            matrix_r[i][j] = matrix_r[i][j] +
                (matrix_a[i][k] * matrix_b[k][j]) ;
          }
        }
      }
    }

The nesting of the two innermost loops has been interchanged. The index variables j and k now change most frequently, and the access pattern through the two operand matrices is sequential with a small physical stride (4 bytes). These changes improve access to memory data via the data cache. Data TLB behavior is improved, too, since the long strides are eliminated.
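
You can confirm this directly with perf stat, using the symbolic events from the table earlier in this article:

    perf stat -e dTLB-load-misses,L1-dcache-load-misses ./naive
    perf stat -e dTLB-load-misses,L1-dcache-load-misses ./interchange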

Analyze performance

The following table summarizes the raw number of performance events for the two matrix multiplication programs: naive and interchange.

Performance events (naive vs. interchange)

Event name              Naive          Interchange
cpu-cycles              2,822,332,111  2,120,243,900
instructions            1,177,945,197  1,175,914,639
cache-references        267,166,993    391,708,999
cache-misses            52,202,548     8,487,113
branch-instructions     130,211,304    129,407,649
branch-misses           1,083,610      855,409
bus-cycles              1,408,962,614  1,060,812,922
L1-dcache-loads         267,176,142    391,740,555
L1-dcache-load-misses   50,278,616     8,489,871
L1-dcache-stores        267,176,142    391,740,555
L1-dcache-store-misses  50,278,616     8,489,871
L1-icache-loads         600,918,882    605,368,186
L1-icache-load-misses   134,968        152,665
LLC-loads               100,988,063    17,502,494
LLC-load-misses         6,672,864      6,266,143
LLC-stores              100,988,063    17,502,494
LLC-store-misses        6,672,864      6,266,143
dTLB-load-misses        1,999,776      13,767
dTLB-store-misses       1,999,776      13,767
iTLB-load-misses        201            204
branch-loads            131,532,204    130,936,900
branch-load-misses      1,083,610      855,409

Here are a few observations.

  • The number of CPU cycles is lower for interchange, reflecting its shorter elapsed time.
  • The number of instructions and branches executed by both programs is roughly the same.
  • Branches are well predicted for both programs; mispredicted branches are not a performance issue.
  • Neither program suffers significant instruction cache or instruction TLB misses; instruction fetch is not a performance issue.
  • Although interchange performs more load operations, it has fewer load misses.
  • Interchange has substantially fewer data TLB misses.
  • Interchange has substantially fewer last level cache (LLC) loads although the number of LLC misses is about the same for both programs.

Interchange's improved execution time is due to a more efficient memory access pattern and the availability of array data in last level cache (the ARMv7/Cortex-A7 L2 cache).

I measured several additional ARMv7 performance events using raw event IDs, e.g., perf stat -e r013,r006,r007. The following table reports measurements for primary memory accesses and cache behavior.
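
For example, the full command for the naive program would be (my reconstruction; the raw event IDs correspond to the rows of the table below):

    perf stat -e r013,r006,r007,r015,r018 ./naive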

Event ID  ARMv7 event          Naive        Interchange
r013      Data memory access   269,518,307  393,461,707
r006      Data read executed   260,395,633  260,110,404
r007      Data write executed  5,955,347    130,579,573
r015      Data cache eviction  50,215,783   8,487,225
r018      L2 cache write-back  109,714      88,826

The sum of data read and write operations nearly equals the number of memory access events.

Ratios and rates

The raw number of events is difficult to interpret. Does 52,202,548 cache-misses indicate a performance issue or not? Numeric ratios often provide better insight than raw event counts alone.

Instructions per cycle (IPC) is a measure of overall performance: it reflects the efficiency of instruction execution.

The Cortex-A7 core has a partial dual-issue pipeline. It can issue up to two integer instructions per cycle, but only one NEON SIMD instruction per cycle, and other instruction types further restrict issue. Thus, the absolute maximum IPC for Cortex-A7 is 2; in practice, memory traffic and the issue restrictions make an IPC of 1 or less a more realistic target.

Here are formulas for the most common and useful ratios and rates including IPC.

Ratio or rate               Formula
Instructions per cycle      instructions / cycles
L1 cache miss ratio         L1-dcache-load-misses / L1-dcache-loads
L1 cache miss rate PTI      L1-dcache-load-misses / (instructions / 1000)
Data TLB miss ratio         dTLB-load-misses / cache-references
Data TLB miss rate PTI      dTLB-load-misses / (instructions / 1000)
Branch mispredict ratio     branch-misses / branch-instructions
Branch mispredict rate PTI  branch-misses / (instructions / 1000)

Event rates are expressed per thousand instructions, abbreviated "PTI." Rates expressed as events per thousand instructions appeal to intuition better than events per individual instruction. Feel free to define and apply rates and ratios of your own.
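
As a quick check, here is the IPC computation for the naive program using the raw counts reported earlier:

    IPC = instructions / cycles = 1,177,945,197 / 2,822,332,111 ≈ 0.42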

The following table summarizes the key ratios and rates for the naive and loop nest interchange matrix multiplication programs. IPC is improved for the interchange program due to a better cache hit ratio and fewer data TLB misses.

Measure                     Naive  Interchange
Elapsed time (seconds)      3.178  2.376
Instructions per cycle      0.42   0.55
L1 cache miss ratio         0.20   0.02
L1 cache miss rate PTI      44.32  7.22
Data TLB miss ratio         0.01   <0.01
Data TLB miss rate PTI      1.70   0.01
Branch mispredict ratio     0.01   0.01
Branch mispredict rate PTI  0.92   0.73

The innermost loop in each case is ten instructions long. Thus, a thousand instructions represents about 100 iterations of the innermost loop. The rates and ratios are reasonable goal values for a workload of this type -- a short, memory-intensive computational loop with floating point operations.

The differences (above) in rates and ratios between naive and interchange are not especially dramatic for 500 by 500 element matrices. Each matrix is stored in a two-dimensional array of 4 byte floating point values. Each array occupies a little less than 1MByte of primary memory. The L2 cache in the Broadcom BCM2836 is 512KBytes. Therefore, the L2 cache holds a large, active subset of the two operand arrays. The L2 cache has a 2 cycle load-to-use latency which is much shorter than access to primary memory. The L1 and L2 caches are not severely stressed and both the elapsed time and IPC values are somewhat favorable.

Before moving on to Part 3 of the PERF tutorial, I suggest trying larger arrays, say 1,000 by 1,000 elements. Make measurements with perf stat, compute rates and ratios, and compare results.
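
Only one change is needed, assuming the matrix dimension is controlled by the MSIZE symbolic constant shown earlier:

    #define MSIZE 1000   /* was 500 */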

ARM Cortex-A7 (ARMv7) performance events
Event ID Description
0x00 Software increment. The register is incremented only on writes to the Software Increment Register.
0x01 Instruction fetch that causes a refill at (at least) the lowest level of instruction or unified cache. Includes the speculative linefills in the count.
0x02 Instruction fetch that causes a TLB refill at (at least) the lowest level of TLB. Includes the speculative requests in the count.
0x03 Data read or write operation that causes a refill at (at least) the lowest level of data or unified cache. Counts the number of allocations performed in the Data Cache because of a read or a write.
0x04 Data read or write operation that causes a cache access at (at least) the lowest level of data or unified cache. This includes speculative reads.
0x05 Data read or write operation that causes a TLB refill at (at least) the lowest level of TLB. This does not include micro TLB misses because of PLD, PLI, CP15 Cache operation by MVA and CP15 VA to PA operations.
0x06 Data read architecturally executed. Counts the number of data read instructions accepted by the Load Store Unit. This includes counting the speculative and aborted LDR/LDM, and the reads because of the SWP instructions.
0x07 Data write architecturally executed. Counts the number of data write instructions accepted by the Load Store Unit. This includes counting the speculative and aborted STR/STM, and the writes because of the SWP instructions.
0x08 Instruction architecturally executed.
0x09 Exception taken. Counts the number of exceptions architecturally taken.
0x0A Exception return architecturally executed. The following instructions are reported on this event: LDM {..., pc}, RFE, DP S pc.
0x0B Change to ContextID retired. Counts the number of instructions architecturally executed writing into the ContextID Register.
0x0C Software change of PC.
0x0D Immediate branch architecturally executed (taken or not taken). This includes the branches which are flushed due to a previous load/store which aborts late.
0x0E Procedure return (other than exception returns) architecturally executed.
0x0F Unaligned load-store.
0x10 Branch mispredicted/not predicted. Counts the number of mispredicted or not-predicted branches executed. This includes the branches which are flushed because of a previous load/store which aborts late.
0x11 Cycle counter.
0x12 Branches or other change in program flow that could have been predicted by the branch prediction resources of the processor. This includes the branches which are flushed because of a previous load/store which aborts late.
0x13 Data memory access.
0x14 Instruction Cache access.
0x15 Data cache eviction.
0x16 Level 2 data cache access.
0x17 Level 2 data cache refill.
0x18 Level 2 data cache write-back. Data transfers made as a result of a coherency request from the Level 2 caches to outside of the Level 1 and Level 2 caches are not counted. Write-backs made as a result of CP15 cache maintenance operations are counted.
0x19 Bus accesses. Single transfer bus accesses on either of the ACE read or write channels might increment twice in one cycle if both the read and write channels are active simultaneously. Operations that utilise the bus that do not explicitly transfer data, such as barrier or coherency operations are counted as bus accesses.
0x1D Bus cycle.
0x60 Bus access, read.
0x61 Bus access, write.
0x86 IRQ exception taken.
0x87 FIQ exception taken.
0xC0 External memory request.
0xC1 Non-cacheable external memory request.
0xC2 Linefill because of prefetch.
0xC3 Prefetch linefill dropped.
0xC4 Entering read allocate mode.
0xC5 Read allocate mode.
0xC6 Reserved.
0xC7 ETM Ext Out[0].
0xC8 ETM Ext Out[1].
0xC9 Data Write operation that stalls the pipeline because the store buffer is full.
0xCA Data snooped from other processor. This event counts memory-read operations that read data from another processor within the local Cortex-A7 cluster, rather than accessing the L2 cache or issuing an external read. It increments on each transaction, rather than on each beat of data.

Copyright © 2015 Paul J. Drongowski
