This article is part 2 of a three-part series on the PERF (linux-tools) performance measurement and profiling system.
- In Part 1, I demonstrate how to use PERF to identify and analyze the hottest execution spots in a program. Part 1 covers the basic PERF commands, options and software performance events.
- Part 2 introduces hardware performance events and demonstrates how to measure hardware events across an entire application.
- Part 3 uses hardware performance event sampling to identify and analyze program hot-spots.
Each article builds on the usage and background information presented in the previous articles. So, if you’ve arrived here through a Web search, please scan the preceding articles for background. All articles use the same two example programs:
- Naive: Naive matrix multiplication program (textbook algorithm).
- Interchange: An improved matrix multiplication program (loop nest interchange algorithm).
The textbook algorithm has known performance issues due to its memory access pattern. The loop nest interchange algorithm has a different access pattern which reads/writes data more efficiently.
Author: P.J. Drongowski
Introduction
This article introduces performance events and counters, and shows how to use them to make basic measurements of application program behavior using PERF. I primarily use the perf stat command in this article.
PERF, which is part of the linux-tools package, is an infrastructure and suite of tools for collecting and analyzing performance data. You should install the version of PERF that matches the version of your Linux kernel. The data below were collected on a Raspberry Pi 2 (second generation, BCM2836 processor, ARMv7 Cortex-A7) using Raspbian Wheezy with kernel 3.18.9-v7+. Therefore, I installed linux-tools-3.18. Enter:
uname -a
to identify the kernel on your test platform. Then, enter:
sudo apt-get install linux-tools-X.Y
to install PERF, replacing X and Y with the major and minor version numbers of the kernel (e.g., linux-tools-3.18).
In the first article, I used PERF’s software events to make measurements. The underlying microprocessor hardware is equipped with a small number of performance counters which can be used to count hardware performance events. The number, relative frequency and kind of performance events can tell us quite a bit about how effectively a program is using the hardware. This hard information can help us tune the program to run more efficiently on the microprocessor.
Performance events
A performance event is an occurrence of a hardware condition. Pretty abstract, so let’s bring this down to Earth.
If you’re interested in performance, you probably know the clock speed of the microprocessor in your test platform. I’m working on a Raspberry Pi 2 with a 900MHz ARM Cortex-A7 quad-core processor. The CPU clock ticks at a 900MHz rate. Every clock tick is a new hardware event. The ARMv7 architecture defines a CPU clock performance event. The number of processor cycles needed to execute a program tells us how long it took for the program to run.
Here’s another example. Successful execution and completion of an instruction is an important event. The ARMv7 architecture defines a performance event for a successfully completed (“architecturally executed”) instruction. Another common term for successfully completed is “retired” and I will use “retired” instead of the awkward “architecturally executed.” The number of retired instructions tells us how many program instructions are successfully executed and retired.
The number and kind of performance events are very specific to the machine (micro-)architecture. The ARMv7 performance events are different from AMD x86 processor events, for example. In everyday practice, we need to consult the technical reference manual for the specific microprocessor in the test platform. I have included a list of the ARMv7 (Cortex-A7) performance events at the end of this article. The table is taken from the ARM Cortex-A7 MPCore Technical Reference Manual.
For performance tuning, the general notion of “successfully completed instruction” suffices for analysis. However, a more detailed and nuanced understanding of a specific performance event may be necessary. Contemporary microprocessors use many tricks and techniques to wring good performance out of a localized sequence of instructions in a program:
- The processor may issue more than one instruction at a time for execution.
- The processor may issue instructions out of program order.
- The processor may predict control flow and speculatively begin executing instructions along the predicted program branch.
These techniques involve complicated hardware behavior and conditions and sometimes result in highly nuanced performance events.
PERF defines a set of common, symbolic hardware performance events. (See the following table.) These performance events are the ones that are most commonly used in practice. I included the corresponding ARMv7 Event ID so you can look up the event description in the full ARMv7 performance event table at the end of the article.
Event name | ARMv7 Event ID |
---|---|
cpu-cycles OR cycles | 0xFF |
instructions | 0x08 |
cache-references | 0x04 |
cache-misses | 0x03 |
branch-instructions OR branches | 0x0C |
branch-misses | 0x10 |
bus-cycles | 0x1D |
L1-dcache-loads | 0x04 |
L1-dcache-load-misses | 0x03 |
L1-dcache-stores | 0x04 |
L1-dcache-store-misses | 0x03 |
L1-icache-loads | 0x14 |
L1-icache-load-misses | 0x01 |
LLC-loads | 0x16 |
LLC-load-misses | 0x17 |
LLC-stores | 0x16 |
LLC-store-misses | 0x17 |
dTLB-load-misses | 0x05 |
dTLB-store-misses | 0x05 |
iTLB-load-misses | 0x02 |
branch-loads | 0x12 |
branch-load-misses | 0x10 |
PERF can display a list of the available software and hardware performance events. Just enter the command:
perf list
to obtain a list of the available symbolic events. You may also specify an event using its raw identifier. For example, “instructions” and “r008” are equivalent. Raw event identifiers let you measure performance events that do not have a PERF symbolic name.
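For example, this command counts retired instructions twice in the same run, once by its PERF symbolic name and once by its raw ARMv7 ID; the two counts should agree closely:

perf stat -e instructions,r008 ./naive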
Sharp-eyed readers probably spotted a few PERF symbolic events that have the same physical Event ID, such as L1-dcache-loads and L1-dcache-stores (ARMv7 Event ID 0x04). PERF defines a set of common events, but there isn’t any guarantee that the host processor actually has equivalent events. ARMv7, for example, does not separate cache accesses by load and store operations. As a compromise, the L1-dcache-loads and L1-dcache-stores events are mapped to the ARMv7 data read/write L1 data cache event. You are likely to find similar compromises on other processor implementations, too.
Performance counters
Each processor core has a set of hardware performance counters which count performance events. PERF takes care of the grungy set-up work for you. However, you need to be aware of certain processor-specific limitations such as the number of available counters and event assignment restrictions. The Broadcom BCM2836 (ARMv7 Cortex-A7) has four performance counters per core. Any ARMv7 performance event can be assigned to any of the four performance counters. Other processors may have a different number of performance counters and certain performance counters may be restricted to one or more specific performance events. Generally, you need to consult the appropriate processor reference manual for this information.
Performance counters (usually) support two usage modes: counting and sampling.
- In counting mode, the counters are configured, cleared and started. Then, sometime later, the counters are stopped and the number of events is read and reported.
- In sampling mode, the counters are configured, cleared and started. However, they are configured to generate a sampling interrupt after a certain number of events are counted. Upon interrupt, a sample is accumulated and the counter is restarted. After the workload completes, counter interrupts are turned off, the samples are aggregated, and summarized statistics are reported.
This article (part 2) concentrates on counting mode. Part 3 deals with sampling mode. Simple event counting is initiated using the perf stat command. Event sampling is handled using the perf record and perf report commands.
Counting mode has very low overhead because the hardware counters do not place any additional load on the CPU; however, it can only measure events for the application as a whole. Sampling mode imposes more overhead, but it lets you attribute performance events and issues to specific regions of code (e.g., hot-spots).
Hardware performance counters are fixed-length registers. The Cortex-A7 performance counters are 32 bits long. Microprocessors run at high speed, and a program can easily overflow a counter. Measure CPU cycles on the 900MHz Cortex-A7 in the Raspberry Pi 2 and the counter will overflow in a relatively short 4.77 seconds (2^32 / 900MHz). PERF takes care of this problem for you by “virtualizing” the counters to 64 bits, which is much better than having to handle overflow yourself when accessing the counters directly from user space.
Make measurements
It’s easy to measure the number of performance events executed by an application program. Simply enter a perf stat command specifying both the events to count and the application program to run:
perf stat -e cpu-cycles,instructions ./naive
In this example command, PERF measures CPU clock cycles and retired instructions while running the naive matrix multiplication program. Once the application program (naive) completes, PERF reports the number of occurrences of each performance event:
Performance counter stats for './naive':

    1,177,945,197 instructions    #  0.42 insns per cycle
    2,822,332,111 cycles

      3.178582734 seconds time elapsed
That’s all there is to it! I recommend putting your PERF commands into a script so that you can easily re-run performance experiments. Program performance tuning is usually performed after bugs and other kinks have been fixed. Tuning is an iterative process involving trial and error, and you should plan to run and re-run experiments many times.
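Here is a minimal sketch of such a script. The file name, event list and program paths are only placeholders, so adjust them for your own experiments:

#!/bin/sh
# measure.sh -- re-runnable PERF counting experiments (illustrative names)
EVENTS=cpu-cycles,instructions,cache-references,cache-misses
for prog in ./naive ./interchange ; do
    echo "=== $prog ==="
    perf stat -e $EVENTS $prog
done

Note that perf stat writes its counter report to stderr, so add 2>&1 when redirecting the script's output to a log file.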
Assess accuracy
In the preceding example, PERF measured over one billion instructions and roughly 2.8 billion CPU clock cycles while running the naive matrix multiplication program. How accurate are these measurements?
I disassembled and analyzed the nested loops that perform matrix multiplication. I counted the number of instructions in each loop and the number of times that each loop is executed:
Outer loop instructions: 4,000
Middle loop instructions: 2,250,000
Inner loop instructions: 1,125,000,000
The estimated number of retired instructions is the sum of the instructions for each loop:
Total instructions: 1,127,254,000
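These per-loop estimates are consistent with the loop iteration counts. The inner loop body, for example, executes 500 * 500 * 500 = 125,000,000 times, so the estimate works out to:

1,125,000,000 / 125,000,000 = 9 instructions per inner loop iteration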
The naive program has two hot-spots: initialization and matrix multiplication. Matrix multiplication dominates, so even though PERF measures and aggregates the retired instruction events for both initialization and matrix multiplication, the PERF measurement should be only slightly greater than the estimate. This property holds:
1,177,945,197 (actual) > 1,127,254,000 (estimated)
Estimating the number of CPU clock cycles is straightforward:
3.178 * 900MHz = 2,860,200,000 clock cycles
Here, I multiply the elapsed execution time (as reported by PERF) times the CPU clock rate (900MHz) to obtain the estimated number of CPU clock events. PERF measured 2,822,332,111 cycles, which is quite close to the estimate.
Program source code
Please take a look at the source code for the naive matrix multiplication program. The symbolic constant MSIZE is defined to be 500. So, all matrices are square (500 by 500 4-byte floating point values).
void multiply_matrices()
{
    int i, j, k ;

    // Textbook algorithm
    for (i = 0 ; i < MSIZE ; i++) {
        for (j = 0 ; j < MSIZE ; j++) {
            float sum = 0.0 ;
            for (k = 0 ; k < MSIZE ; k++) {
                sum = sum + (matrix_a[i][k] * matrix_b[k][j]) ;
            }
            matrix_r[i][j] = sum ;
        }
    }
}
The innermost subscript k changes the fastest and the program steps sequentially through the memory elements of matrix_a. However, the program steps down the columns of matrix_b, resulting in a large physical stride through memory. (Recall that C arrays are arranged in row major order in primary memory.) Access to matrix_b makes inefficient use of the L1 data cache and causes costly data translation lookaside buffer (TLB) misses.
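To make the stride concrete, MSIZE is 500 and each element is a 4-byte float, so one row of matrix_b occupies:

500 elements * 4 bytes = 2,000 bytes

Each increment of k therefore advances the matrix_b address by 2,000 bytes, landing on a different cache line (and frequently a different page) on every access.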
Here is the source code for the loop nest interchange program.
void multiply_matrices()
{
    int i, j, k ;

    // Loop nest interchange algorithm
    for (i = 0 ; i < MSIZE ; i++) {
        for (k = 0 ; k < MSIZE ; k++) {
            for (j = 0 ; j < MSIZE ; j++) {
                matrix_r[i][j] = matrix_r[i][j] + (matrix_a[i][k] * matrix_b[k][j]) ;
            }
        }
    }
}
The nesting of the innermost loops has been changed. The index variables j and k change the most frequently and the access pattern through the two operand matrices is sequential using a small physical stride (4 bytes). These changes improve access to memory data via the data cache. Data TLB behavior is improved, too, since long strides are eliminated.
Analyze performance
The following table summarizes the raw number of performance events for the two matrix multiplication programs: naive and interchange.
Event name | Naive | Interchange |
---|---|---|
cpu-cycles | 2,822,332,111 | 2,120,243,900 |
instructions | 1,177,945,197 | 1,175,914,639 |
cache-references | 267,166,993 | 391,708,999 |
cache-misses | 52,202,548 | 8,487,113 |
branch-instructions | 130,211,304 | 129,407,649 |
branch-misses | 1,083,610 | 855,409 |
bus-cycles | 1,408,962,614 | 1,060,812,922 |
L1-dcache-loads | 267,176,142 | 391,740,555 |
L1-dcache-load-misses | 50,278,616 | 8,489,871 |
L1-dcache-stores | 267,176,142 | 391,740,555 |
L1-dcache-store-misses | 50,278,616 | 8,489,871 |
L1-icache-loads | 600,918,882 | 605,368,186 |
L1-icache-load-misses | 134,968 | 152,665 |
LLC-loads | 100,988,063 | 17,502,494 |
LLC-load-misses | 6,672,864 | 6,266,143 |
LLC-stores | 100,988,063 | 17,502,494 |
LLC-store-misses | 6,672,864 | 6,266,143 |
dTLB-load-misses | 1,999,776 | 13,767 |
dTLB-store-misses | 1,999,776 | 13,767 |
iTLB-load-misses | 201 | 204 |
branch-loads | 131,532,204 | 130,936,900 |
branch-load-misses | 1,083,610 | 855,409 |
Here are a few observations.
- The number of CPU cycles is lower for interchange, reflecting its shorter elapsed time.
- The number of instructions and branches executed by both programs is roughly the same.
- Branches are well-predicted for both programs; mispredicted branches are not a performance issue.
- Neither program misses significantly in the instruction cache or instruction TLB; instruction fetch is not a performance issue.
- Although interchange performs more load operations, it has fewer load misses.
- Interchange has substantially fewer data TLB misses.
- Interchange has substantially fewer last level cache (LLC) loads although the number of LLC misses is about the same for both programs.
Interchange's improved execution time is due to a more efficient memory access pattern and the availability of array data in last level cache (the ARMv7/Cortex-A7 L2 cache).
I measured several additional ARMv7 performance events using raw event IDs, e.g., perf stat -e r013,r006,r007. The following table reports measurements for primary memory accesses and cache behavior.
Event ID | ARMv7 event | Naive | Interchange |
---|---|---|---|
r013 | Data memory access | 269,518,307 | 393,461,707 |
r006 | Data read executed | 260,395,633 | 260,110,404 |
r007 | Data write executed | 5,955,347 | 130,579,573 |
r015 | Data cache eviction | 50,215,783 | 8,487,225 |
r018 | L2 cache write-back | 109,714 | 88,826 |
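Because the Cortex-A7 provides only four counters per core, the five raw events above fit comfortably if they are split across two runs. Commands along these lines collect the naive column; repeat them with ./interchange for the second column:

perf stat -e r013,r006,r007,r015 ./naive
perf stat -e r018 ./naive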
The sum of data read and write operations nearly equals the number of memory access events.
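For the naive program, for example:

260,395,633 data reads + 5,955,347 data writes = 266,350,980

which is a little over one percent below the 269,518,307 data memory accesses counted by r013.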
Ratios and rates
The raw number of events is difficult to interpret. Does 52,202,548 cache-misses indicate a performance issue or not? Numeric ratios often provide better insight than raw event counts alone.
Instructions per cycle (IPC) is a measure of overall performance; it reflects the efficiency of instruction execution.
The Cortex-A7 core has a partial dual-issue pipeline. The core can issue up to two integer instructions per cycle, but only one NEON SIMD instruction per cycle, and certain other instructions also restrict dual issue. Thus, the absolute maximum IPC for Cortex-A7 is 2, but memory traffic and issue restrictions lower the practical target IPC to 1 or less.
Here are formulas for the most common and useful ratios and rates including IPC.
Ratio or rate | Formula |
---|---|
Instructions per cycle | instructions / cycles |
L1 cache miss ratio | L1-dcache-load-misses / L1-dcache-loads |
L1 cache miss rate PTI | L1-dcache-load-misses / (instructions / 1000) |
Data TLB miss ratio | dTLB-load-misses / cache-references |
Data TLB miss rate PTI | dTLB-load-misses / (instructions / 1000) |
Branch mispredict ratio | branch-misses / branch-instructions |
Branch mispredict rate PTI | branch-misses / (instructions / 1000) |
Event rates are expressed per thousand instructions, abbreviated "PTI." Rates per thousand instructions are easier to grasp intuitively than rates per individual instruction. Feel free to define and apply rates and ratios of your own.
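For example, plugging the naive program's counts into the IPC formula:

1,177,945,197 instructions / 2,822,332,111 cycles = 0.42 instructions per cycle

which matches the "insns per cycle" figure that perf stat printed earlier.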
The following table summarizes the key ratios and rates for the naive and loop nest interchange matrix multiplication programs. IPC is improved for the interchange program due to a better cache hit ratio and fewer data TLB misses.
Event name | Naive | Interchange |
---|---|---|
Elapsed time (seconds) | 3.178 | 2.376 |
Instructions per cycle | 0.42 | 0.55 |
L1 cache miss ratio | 0.20 | 0.02 |
L1 cache miss rate PTI | 44.32 | 7.22 |
Data TLB miss ratio | 0.01 | <0.01 |
Data TLB miss rate PTI | 1.70 | 0.01 |
Branch mispredict ratio | 0.01 | 0.01 |
Branch mispredict rate PTI | 0.92 | 0.73 |
The innermost loop in each case is ten instructions long. Thus, a thousand instructions represents about 100 iterations of the innermost loop. The rates and ratios are reasonable goal values for a workload of this type -- a short, memory-intensive computational loop with floating point operations.
The differences (above) in rates and ratios between naive and interchange are not especially dramatic for 500 by 500 element matrices. Each matrix is stored in a two-dimensional array of 4 byte floating point values. Each array occupies a little less than 1MByte of primary memory. The L2 cache in the Broadcom BCM2836 is 512KBytes. Therefore, the L2 cache holds a large, active subset of the two operand arrays. The L2 cache has a 2 cycle load-to-use latency which is much shorter than access to primary memory. The L1 and L2 caches are not severely stressed and both the elapsed time and IPC values are somewhat favorable.
Before moving on to Part 3 of the PERF tutorial, I suggest trying larger arrays, say 1,000 by 1,000 elements. Make measurements with perf stat, compute rates and ratios, and compare results.
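With 1,000 by 1,000 matrices, each array occupies 1,000 * 1,000 * 4 bytes = 4,000,000 bytes, roughly eight times the size of the 512KByte L2 cache, so the operand arrays no longer fit in the L2 and the gap between naive and interchange should widen considerably. Assuming MSIZE is defined as a simple compile-time constant as described above, change its definition from 500 to 1000, rebuild both programs, and re-run your perf stat script.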
ARMv7 Cortex-A7 performance events
Event ID | Description |
---|---|
0x00 | Software increment. The register is incremented only on writes to the Software Increment Register. |
0x01 | Instruction fetch that causes a refill at (at least) the lowest level of instruction or unified cache. Includes the speculative linefills in the count. |
0x02 | Instruction fetch that causes a TLB refill at (at least) the lowest level of TLB. Includes the speculative requests in the count. |
0x03 | Data read or write operation that causes a refill at (at least) the lowest level of data or unified cache. Counts the number of allocations performed in the Data Cache because of a read or a write. |
0x04 | Data read or write operation that causes a cache access at (at least) the lowest level of data or unified cache. This includes speculative reads. |
0x05 | Data read or write operation that causes a TLB refill at (at least) the lowest level of TLB. This does not include micro TLB misses because of PLD, PLI, CP15 Cache operation by MVA and CP15 VA to PA operations. |
0x06 | Data read architecturally executed. Counts the number of data read instructions accepted by the Load Store Unit. This includes counting the speculative and aborted LDR/LDM, and the reads because of the SWP instructions. |
0x07 | Data write architecturally executed. Counts the number of data write instructions accepted by the Load Store Unit. This includes counting the speculative and aborted STR/STM, and the writes because of the SWP instructions. |
0x08 | Instruction architecturally executed. |
0x09 | Exception taken. Counts the number of exceptions architecturally taken. |
0x0A | Exception return architecturally executed. The following instructions are reported on this event: LDM {..., pc}, RFE, DP S pc. |
0x0B | Change to ContextID retired. Counts the number of instructions architecturally executed writing into the ContextID Register. |
0x0C | Software change of PC. |
0x0D | Immediate branch architecturally executed (taken or not taken). This includes the branches which are flushed due to a previous load/store which aborts late. |
0x0E | Procedure return (other than exception returns) architecturally executed. |
0x0F | Unaligned load-store. |
0x10 | Branch mispredicted/not predicted. Counts the number of mispredicted or not-predicted branches executed. This includes the branches which are flushed because of a previous load/store which aborts late. |
0x11 | Cycle counter. |
0x12 | Branches or other change in program flow that could have been predicted by the branch prediction resources of the processor. This includes the branches which are flushed because of a previous load/store which aborts late. |
0x13 | Data memory access. |
0x14 | Instruction Cache access. |
0x15 | Data cache eviction. |
0x16 | Level 2 data cache access |
0x17 | Level 2 data cache refill |
0x18 | Level 2 data cache write-back. Data transfers made as a result of a coherency request from the Level 2 caches to outside of the Level 1 and Level 2 caches are not counted. Write-backs made as a result of CP15 cache maintenance operations are counted. |
0x19 | Bus accesses. Single transfer bus accesses on either of the ACE read or write channels might increment twice in one cycle if both the read and write channels are active simultaneously. Operations that utilise the bus that do not explicitly transfer data, such as barrier or coherency operations are counted as bus accesses. |
0x1D | Bus cycle |
0x60 | Bus access, read |
0x61 | Bus access, write |
0x86 | IRQ exception taken. |
0x87 | FIQ exception taken. |
0xC0 | External memory request. |
0xC1 | Non-cacheable external memory request. |
0xC2 | Linefill because of prefetch. |
0xC3 | Prefetch linefill dropped. |
0xC4 | Entering read allocate mode. |
0xC5 | Read allocate mode. |
0xC6 | Reserved. |
0xC7 | ETM Ext Out[0]. |
0xC8 | ETM Ext Out[1]. |
0xC9 | Data Write operation that stalls the pipeline because the store buffer is full. |
0xCA | Data snooped from other processor. This event counts memory-read operations that read data from another processor within the local Cortex-A7 cluster, rather than accessing the L2 cache or issuing an external read. It increments on each transaction, rather than on each beat of data. |
Copyright © 2015 Paul J. Drongowski