PERF tutorial: Finding execution hot spots

This article is the first of a three-part series on the PERF (linux-tools) performance measurement and profiling system.

  • Part 1 demonstrates how to use PERF to identify and analyze the hottest execution spots in a program. It covers the basic PERF commands, options and software performance events.
  • Part 2 introduces hardware performance events and demonstrates how to measure hardware events across an entire application. It defines and discusses several useful rates and ratios for performance assessment and analysis.
  • Part 3 uses hardware performance event sampling to identify and analyze program hot spots.

Each article builds on the usage and background information presented in the previous articles. All articles use the same two example programs:

  • Naive: Naive matrix multiplication program (textbook algorithm).
  • Interchange: An improved matrix multiplication program (loop nest interchange algorithm).

The textbook algorithm has known performance issues due to its memory access pattern. The loop nest interchange algorithm has a different access pattern which reads/writes data more efficiently.
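To make the comparison concrete, here is a minimal sketch of the two loop orders. (This is only a sketch: the array declarations and multiply_matrices() follow the annotated output shown later in this article, while the interchange function and its name are my own illustration. The actual sources listed just below may differ in details.)

#define MSIZE 500

static float matrix_a[MSIZE][MSIZE] ;
static float matrix_b[MSIZE][MSIZE] ;
static float matrix_r[MSIZE][MSIZE] ;

/* Textbook order: the inner loop walks matrix_b down a column,
   striding MSIZE * sizeof(float) bytes between consecutive accesses. */
void multiply_matrices()
{
  int i, j, k ;

  for (i = 0 ; i < MSIZE ; i++) {
    for (j = 0 ; j < MSIZE ; j++) {
      float sum = 0.0 ;
      for (k = 0 ; k < MSIZE ; k++) {
        sum = sum + (matrix_a[i][k] * matrix_b[k][j]) ;
      }
      matrix_r[i][j] = sum ;
    }
  }
}

/* Loop nest interchange: swapping the j and k loops makes the inner
   loop sweep matrix_b and matrix_r row-wise (unit stride). matrix_r
   must be zeroed before this version runs. */
void multiply_matrices_interchange()
{
  int i, j, k ;

  for (i = 0 ; i < MSIZE ; i++) {
    for (k = 0 ; k < MSIZE ; k++) {
      float a = matrix_a[i][k] ;
      for (j = 0 ; j < MSIZE ; j++) {
        matrix_r[i][j] = matrix_r[i][j] + (a * matrix_b[k][j]) ;
      }
    }
  }
}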

Author: P.J. Drongowski

Source: Makefile, naive.c, readme
Common code: test_common.h, test_common.c, rpi_pmu.h, rpi_pmu.c

Introduction

Performance Events for Linux, called “PERF,” is the standard profiling infrastructure on Linux. You can analyze the performance of a program using the tools provided with PERF, or you can build your own tools on top of PERF.

PERF consists of two major subsystems:

  • A kernel SYSCALL that provides access to performance data including operating system software events and hardware performance events, and
  • A collection of user-space tools to collect, display and analyze performance data.

PERF is the standard means of access to the hardware performance counters. It supports both counting mode and sampling mode.
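As a taste of the SYSCALL side, here is a minimal counting-mode sketch written directly against perf_event_open(2), the same interface that the user-space tools wrap. (This is illustrative only, with minimal error handling; consult the perf_event_open man page for the full interface.)

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <linux/perf_event.h>

/* glibc provides no wrapper for this syscall, so supply one. */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
  return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags) ;
}

int main(void)
{
  struct perf_event_attr attr ;
  long long count = 0 ;
  volatile double x = 0.0 ;
  long i ;
  int fd ;

  memset(&attr, 0, sizeof(attr)) ;
  attr.type = PERF_TYPE_SOFTWARE ;
  attr.size = sizeof(attr) ;
  attr.config = PERF_COUNT_SW_CPU_CLOCK ;   /* the "cpu-clock" event */
  attr.disabled = 1 ;

  /* Count the event for this process on any CPU. */
  fd = perf_event_open(&attr, 0, -1, -1, 0) ;
  if (fd < 0) { perror("perf_event_open") ; return 1 ; }

  ioctl(fd, PERF_EVENT_IOC_ENABLE, 0) ;
  for (i = 0 ; i < 10000000 ; i++) x += 1.0 ;   /* code under measurement */
  ioctl(fd, PERF_EVENT_IOC_DISABLE, 0) ;

  read(fd, &count, sizeof(count)) ;   /* cpu-clock counts nanoseconds of CPU time */
  printf("cpu-clock: %lld\n", count) ;
  close(fd) ;
  return 0 ;
}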

The PERF kernel subsystem is part of the mainline kernel, so Raspbian Wheezy (3.6.11+) already has PERF support built in. However, to get the user-space tools, you need to install the linux-tools package using either Synaptic or apt-get. The following command does the job:

sudo apt-get install linux-tools-3.6

Be sure to install the version which matches the kernel. You can display information about the currently installed Linux distribution (including the kernel version) by entering: uname -a.

This note is an introduction to the PERF user-space tools. It shows how to use PERF to analyze the textbook matrix multiplication program (naive.c). Once you see how easy it is to use PERF, you can explore the command line options on your own. To get a complete list of PERF commands, enter:

perf --help

To get usage information about a particular PERF command (e.g., report), enter:

perf report --help

I recommend reading the PERF tutorial on the PERF Wiki, too.

What can you measure with PERF?

PERF supports a large number of predefined events. To get a list of supported events, enter:

perf list

PERF supports hardware events and software events. The hardware events are measured using the hardware performance counters. The available events are processor implementation specific. On the Raspberry Pi, the available events are those supported by the Broadcom BCM2835 processor, which uses the ARM1176JZF-S core. Please see the ARM1176JZF-S Technical Reference Manual (TRM) for more information about ARM11 hardware performance events. The Raspberry Pi 2 is the second generation Pi and has a Broadcom BCM2836 processor. The BCM2836 is a quad-core Cortex-A7 (ARMv7) machine. Please consult the ARM Cortex-A7 Technical Reference Manual for details about hardware performance counters and events.

Although this tutorial demonstrates PERF on the Raspberry Pi, the techniques are applicable to other architectures like x86. This part of the tutorial series uses Linux software performance events, which are common to all architectures.

Certain commonly used hardware events have PERF symbolic names such as:

cpu-cycles OR cycles
instructions
cache-references
cache-misses
branch-instructions OR branches
branch-misses
bus-cycles
stalled-cycles-frontend OR idle-cycles-frontend
stalled-cycles-backend OR idle-cycles-backend
ref-cycles

You may also specify a hardware performance event by its processor-specific event identifier, usually a hexadecimal value. (See the TRM for event identifiers and more.)
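You select a raw event with perf’s rNNN notation, where NNN is the hexadecimal event number. For example, using the perf stat command introduced below (the event number 0x7 here is only a placeholder; look up real ARM1176 event numbers in the TRM):

perf stat -e r7 ./naive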

PERF supports several pre-defined software events:

cpu-clock
task-clock
page-faults OR faults
context-switches OR cs
cpu-migrations OR migrations
minor-faults
major-faults
alignment-faults
emulation-faults

The cpu-clock event measures the passage of time, using the Linux CPU clock as the timing source. This tutorial focuses on the cpu-clock event because execution time is a good starting point for performance analysis, and the event is easy to use and understand. Hardware events are a little more difficult to use; we discuss them in Part 2 of this series.

Elapsed time and event counting

The first step in analysis is to establish baseline performance. The baseline gives us a way to assess the efficacy of any changes that we make to the source or the build process. Please don’t forget that the compiler performs many optimizations which are enabled by command line options. Sometimes the best way to improve performance is through flag mining, i.e., finding the combination of compiler options that yield the best performance.
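As a hypothetical flag-mining experiment, you might build the program at two optimization levels and compare the baselines measured with the perf stat command introduced below. (The file names and flags here are illustrative; use your own Makefile.)

> gcc -O2 -o naive naive.c test_common.c
> perf stat -e cpu-clock ./naive
> gcc -O3 -o naive naive.c test_common.c
> perf stat -e cpu-clock ./naive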

The cpu-clock event is exactly what we need to establish a baseline. The perf stat command:

perf stat -e cpu-clock ./naive

runs the textbook matrix multiplication program named naive and displays the following output:

 Performance counter stats for './naive':

      16515.108000 cpu-clock

      16.752186472 seconds time elapsed

It displays the number of cpu-clock events in milliseconds and the elapsed execution time. These times are the baseline for future comparisons.

We can measure more than one event as shown in the command and output below:

> perf stat -e cpu-clock,faults ./naive

 Performance counter stats for './naive':

      16508.392000 cpu-clock
               885 page-faults

      16.740473493 seconds time elapsed

By the way, the “>” before example commands is the shell’s command line prompt. Don’t enter this character when you try the example commands! In this example, PERF measures two software events: cpu-clock and page-faults.

What questions should I ask?

There are three over-arching practical questions in performance analysis.

  • Which parts of the program take the most execution time?
  • Does the number of software or hardware events indicate an actual performance issue to be fixed?
  • How can we fix the performance issue?

We need to be detectives in order to answer these questions.

The code regions which take the most execution time are called hot spots. Hot spots are the best places to tune and optimize because a little effort spent making a hot spot faster can have a big pay-off. It’s not worth spending engineering effort on a bit of code that executes only once or doesn’t take much time. Since this kind of analysis deals with execution time, you can easily guess that the cpu-clock event is part of the analytical method.

The second question asks whether the number of events such as page-faults indicates an actual performance issue to be fixed. In the second example above, are 885 page faults too many? To answer this question, we need some knowledge or intuition about the algorithms and data structures in the program. We at least need to know if the program is CPU-bound (does a lot of computation) or memory-bound (needs to read/write a lot of data in memory). We also need to know where the events are occurring in the program. If the events correlate with a known hot spot, then we may have found a real performance issue that needs to be fixed.

The last question is the hardest one to answer. Let’s assume that we have found a hot spot with a high number of page faults. We need to study the program’s algorithms and data structures and determine if the program has an unnecessarily large working set size and touches many pages within a short period of time. We might need to reduce or reorganize the data so that the data reside on fewer pages. Once the code changes are finished, we must measure the performance of the modified program and compare its execution time against the baseline to see if there is any speed-up.

Performance tuning is hard experimental work, so please don’t be surprised if the modified program runs slower than the baseline. If one hypothesis fails, then try another. Keep trying!

How do I find the hot spots?

PERF helps us to answer these questions through profiling. A profile is a table or annotated code listing that shows us execution time hot spots and/or the code regions where certain hardware or software events are occurring.

Let’s dive right in. There are two steps:

  1. Run the program and collect profile data.
  2. Analyze the data and display the profile.

The first step is handled by the perf record command. This command runs the program to be analyzed, collects profile data and stores it in a file named perf.data (the default file name). The second step is handled by the perf report and perf annotate commands. These two commands drill down into the profile data. You should start with the hottest functions and drill down to the hottest code regions (long-running source code statements and/or instructions).

Step 1: Collect profile data

I recommend compiling and building your program with the -ggdb option before collecting profile data. PERF can collect and display a profile without debug symbols, but profiles are more informative when PERF has debug symbols. PERF must have debug symbols in order to annotate source. Otherwise, PERF can produce only annotated disassemblies.
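For example, a representative build command might look like this (the tutorial’s actual Makefile may differ):

> gcc -O2 -ggdb -o naive naive.c test_common.c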

To collect profile data, enter a perf record command such as:

> perf record -e cpu-clock,faults ./naive
[ perf record: Woken up 14 times to write data ]
[ perf record: Captured and wrote 3.438 MB perf.data (~150220 samples) ]

In this case, PERF runs the program naive and collects profile data for two events: cpu-clock and page-faults. That’s it — step 1 is complete.

The perf record command supports many options including the ability to start collection on an existing thread or process. The --call-graph option records caller/callee information (stack chain/backtrace). There are options to collect data on all CPUs or a selected few cores when you are working on a multi-core machine. (Not the RPi, which is, unfortunately, single core!) You can specify the name of the output file, change the sampling period, suppress status messages, and more.
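For example (the process ID 1234 and the output file name run2.data are hypothetical):

> perf record -e cpu-clock -g -p 1234
> perf record -e cpu-clock -o run2.data ./naive

The first command attaches to an existing process and records caller/callee information (-g is the short form of --call-graph); the second writes its samples to run2.data instead of the default perf.data.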

Step 2: Display profiles (TUI)

The perf report command displays the profile data in perf.data (the default profile data file name). The perf report command also supports many command line options. Three of these options choose the user interface:

  • The --tui option chooses the Terminal-based User Interface (TUI). The TUI interface is a kind of text-based, poor man’s GUI using the arrow keys and single character commands to navigate and analyze profiles.
  • The --stdio option sends the profile output to the standard output. The standard I/O interface is appropriate when redirecting output to a file or embedding a perf report command in a shell script.
  • The --gtk option chooses the GTK interface. As of 14 June 2013, the GTK interface was in development and did not launch on the LXDE desktop; PERF started the TUI interface instead.

The TUI interface supports interactive navigation and exploration of profiles. Don’t let the textual interface put you off, however. Give it a try! PERF starts up the TUI interface by default if you run perf report on the shell command line.

Here’s just a small taste of the TUI interface. When you enter:

perf report

PERF reads the perf.data file and starts the TUI interface. It displays a top-level menu of the available events (samples):

Available samples
74K cpu-clock
75 page-faults

Use the up and down arrow keys to select one of the events. Hit enter to make the selection. PERF displays a list of functions starting from the hottest function with the most execution time to the coldest functions with the least execution time. Use the up and down arrow keys to select a function like multiply_matrices() which appears at the top of the list. Hit enter and PERF displays another menu:

Annotate multiply_matrices
Zoom into naive(2710) thread
Zoom into naive DSO
Browse map details
Exit

Choose “Annotate multiply_matrices” and hit enter. PERF then displays annotated source and/or disassembled machine code for the function. The annotations show the hottest code region (statements and/or instructions) in the function. Enter “h” for help or enter ESCAPE to return to the previous level. The command character “q” seems to behave like ESCAPE.

The source-level profile for multiply_matrices() shows the code region which takes the most execution time, expressed as a percentage of overall execution time. If we select the page-faults event instead and drill down to the list of hottest functions, initialize_matrices() appears at the top with the most page faults, expressed as a percentage of all observed page faults. (We explain why in a moment.) We may then drill down further to find the code regions with the most page faults.

This top-to-bottom drill down procedure is how we find hot spots and possible performance issues.

Step 2: Display profiles (stdio)

Returning to the shell command line: if you redirect the output from perf report to a file, PERF uses the standard I/O interface. I use the standard I/O interface in this tutorial because I want to capture PERF output into a file and then cut-and-paste the output into a text editor and WordPress.

If you enter perf report --stdio, you get a list of the hottest functions in both user space and the kernel. Since I measured two events, we get two lists: the hottest functions for cpu-clock and the hottest functions for page-faults. Ordinarily, we aren’t interested in kernel activity, so it’s often helpful to filter the kernel functions out of the results. The --sort option aggregates (“sums up”) and sorts the results by command (comm), dynamically shared object (dso), process identifier (pid), and more. The --sort comm,dso option aggregates the results by command and shared object (i.e., binary executable module) as shown in the example below.

> perf report --stdio --sort comm,dso
# ========
# captured on: Fri Jun 14 14:45:35 2013
# hostname : raspberrypi
# os release : 3.6.11+
# perf version : 3.6.0
# arch : armv6l
# nrcpus online : 1
# nrcpus avail : 1
# cpudesc : ARMv6-compatible processor rev 7 (v6l)
# total memory : 448776 kB
# cmdline : /usr/bin/perf_3.6 record -e cpu-clock,faults ./naive 
# event : name = cpu-clock, type = 1, config = 0x0, config1 = 0x0, config2 = 0x0
# event : name = page-faults, type = 1, config = 0x2, config1 = 0x0, config2 = 0
# HEADER_CPU_TOPOLOGY info available, use -I to display
# ========
#
# Samples: 74K of event 'cpu-clock'
# Event count (approx.): 74945
#
# Overhead  Command      Shared Object
# ........  .......  .................
#
    99.31%    naive  naive            
     0.51%    naive  libc-2.13.so     
     0.17%    naive  [kernel.kallsyms]
     0.01%    naive  ld-2.13.so       


# Samples: 75  of event 'page-faults'
# Event count (approx.): 887
#
# Overhead  Command      Shared Object
# ........  .......  .................
#
    83.54%    naive  naive            
     7.55%    naive  ld-2.13.so       
     7.44%    naive  libc-2.13.so     
     1.35%    naive  [kernel.kallsyms]
     0.11%    naive  perf_3.6         

The perf report command displays a “header” with all sorts of useful information such as the kernel version (3.6.11+), the machine architecture (armv6l), the number of CPUs (1 core), etc. It even shows the command which was issued to collect the data in perf.data. This information is really handy when you have a lot of data files and you can’t quite remember what’s in them.

The example above summarizes the data by command and shared object. The output shows that 99.31% of execution time is attributed to the matrix multiply program itself. The program calls libc-2.13.so, which is responsible for 0.51% of execution time. The program naive is the hottest overall spot in the system.

The next step is to drill down into the program naive and find the hottest functions. The --dsos option restricts output to the specified dynamically shared objects (DSO). The following command restricts output to naive, which is the main executable, and libc-2.13.so, which is the C runtime library.

> perf report --stdio --dsos=naive,libc-2.13.so 
# ========
# captured on: Fri Jun 14 14:45:35 2013
# hostname : raspberrypi
# os release : 3.6.11+
# perf version : 3.6.0
# arch : armv6l
# nrcpus online : 1
# nrcpus avail : 1
# cpudesc : ARMv6-compatible processor rev 7 (v6l)
# total memory : 448776 kB
# cmdline : /usr/bin/perf_3.6 record -e cpu-clock,faults ./naive 
# event : name = cpu-clock, type = 1, config = 0x0, config1 = 0x0, config2 = 0x0
# event : name = page-faults, type = 1, config = 0x2, config1 = 0x0, config2 = 0
# HEADER_CPU_TOPOLOGY info available, use -I to display
# ========
#
# Samples: 74K of event 'cpu-clock'
# Event count (approx.): 74813
#
# Overhead  Command  Shared Object                   Symbol
# ........  .......  .............  .......................
#
    99.32%    naive  naive          [.] multiply_matrices  
     0.34%    naive  libc-2.13.so   [.] random             
     0.15%    naive  naive          [.] initialize_matrices
     0.13%    naive  libc-2.13.so   [.] random_r           
     0.04%    naive  libc-2.13.so   [.] rand               
     0.02%    naive  naive          [.] 0x000006d8         
     0.00%    naive  libc-2.13.so   [.] _dl_addr           
     0.00%    naive  libc-2.13.so   [.] __printf_fp        
     0.00%    naive  libc-2.13.so   [.] fprintf            
     0.00%    naive  libc-2.13.so   [.] memchr             


# Samples: 63  of event 'page-faults'
# Event count (approx.): 807
#
# Overhead  Command  Shared Object                   Symbol
# ........  .......  .............  .......................
#
    91.82%    naive  naive          [.] initialize_matrices
     6.44%    naive  libc-2.13.so   [.] 0x000ca168         
     1.61%    naive  libc-2.13.so   [.] get_nprocs         
     0.12%    naive  libc-2.13.so   [.] strchr             

The function multiply_matrices() is the hottest function with 99.32% of the total execution time (i.e., the cpu-clock samples). The second hottest function is random() in the C runtime library. The function initialize_matrices() calls random() during the initialization phase. This example illustrates an interesting and important case, namely, an application may spend a significant amount of time in a runtime library and not in the application code itself. It’s a case that we may see in other real-world applications and we need to remain open to this possibility during analysis.

The perf report command also lists the functions with the most page-faults. The function initialize_matrices() incurs 91.82% of the page faults. The initialization phase touches three big arrays for the first time, and the first touch to a page causes a fault. This is normal behavior, so we can eliminate page faults as a performance factor: after first touch, the pages are in the process’s working set and do not affect matrix multiplication, which is the main execution time hot spot.

The program naive has three 500 by 500 two-dimensional arrays where each array element is a C language float. Thus, each array is 1,000,000 bytes in size and the three arrays together are 3,000,000 bytes. The arrays occupy (3,000,000 / 4,096) = 732 pages, so first touch should cause roughly 732 page faults, which is close to the approximate page-faults event count, 807. This is further evidence that the page faults are due to normal initialization behavior.

Finally, we get down to the code with perf annotate. Command line options restrict output to specific dynamically shared objects (DSO) and functions (symbol). The following example displays annotated C language source code for the function multiply_matrices().

> perf annotate --stdio --dsos=naive --symbol=multiply_matrices

 Percent |	Source code & Disassembly of naive
------------------------------------------------
         :
         :	Disassembly of section .text:
         :
         :	00008bd0 :
         :
         :	void multiply_matrices()
         :	{
    0.00 :	  8bd0:       push    {r4, r5, r6, r7, r8, r9, sl}
         :	  int i, j, k ;
         :
         :	  for (i = 0 ; i < MSIZE ; i++) {
    0.00 :	  8bd4:       mov     r6, #0
    0.00 :	  8bd8:       ldr     sl, [pc, #124]  ; 8c5c 
    0.00 :	  8bdc:       ldr     r4, [pc, #124]  ; 8c60 
    0.00 :	  8be0:       ldr     r8, [pc, #124]  ; 8c64 
         :
         :	void multiply_matrices()
    0.00 :	  8be4:       mov     r7, #2000       ; 0x7d0
    0.00 :	  8be8:       mul     r5, r7, r6
    0.00 :	  8bec:       mov     r0, #0
    0.00 :	  8bf0:       sub     r5, r5, #4
    0.00 :	  8bf4:       add     ip, r8, r5
    0.00 :	  8bf8:       add     r5, sl, r5
    0.01 :	  8bfc:       vldr    s15, [pc, #84]  ; 8c58 
    0.22 :	  8c00:       mov     r1, r5
    0.00 :	  8c04:       add     r2, r4, r0, lsl #2
    0.20 :	  8c08:       mov     r3, #0
         :
         :	  for (i = 0 ; i < MSIZE ; i++) {
         :	    for (j = 0 ; j < MSIZE ; j++) {
         :	      float sum = 0.0 ;
         :	      for (k = 0 ; k < MSIZE ; k++) {
         :	        sum = sum + (matrix_a[i][k] * matrix_b[k][j]) ;
    1.76 :	  8c0c:       add     r1, r1, #4
    0.58 :	  8c10:       vldr    s14, [r2]
    1.07 :	  8c14:       vldr    s13, [r1]
         :	  int i, j, k ;
         :
         :	  for (i = 0 ; i < MSIZE ; i++) {
         :	    for (j = 0 ; j < MSIZE ; j++) {
         :	      float sum = 0.0 ;
         :	      for (k = 0 ; k < MSIZE ; k++) {
    1.67 :	  8c18:       add     r3, r3, #1
   54.13 :	  8c1c:       cmp     r3, #500        ; 0x1f4
         :	        sum = sum + (matrix_a[i][k] * matrix_b[k][j]) ;
    2.39 :	  8c20:       mov     r9, r1
   36.60 :	  8c24:       vmla.f32        s15, s13, s14
         :	  int i, j, k ;
         :
         :	  for (i = 0 ; i < MSIZE ; i++) {
         :	    for (j = 0 ; j < MSIZE ; j++) {
         :	      float sum = 0.0 ;
         :	      for (k = 0 ; k < MSIZE ; k++) {
    1.34 :	  8c28:       add     r2, r2, #2000   ; 0x7d0
    0.00 :	  8c2c:       bne     8c0c 
         :	        sum = sum + (matrix_a[i][k] * matrix_b[k][j]) ;
         :	      }
         :	      matrix_r[i][j] = sum ;
    0.01 :	  8c30:       vmov    r3, s15
         :	void multiply_matrices()
         :	{
         :	  int i, j, k ;
         :
         :	  for (i = 0 ; i < MSIZE ; i++) {
         :	    for (j = 0 ; j < MSIZE ; j++) {
    0.00 :	  8c34:       add     r0, r0, #1
    0.00 :	  8c38:       cmp     r0, #500        ; 0x1f4
         :	      float sum = 0.0 ;
         :	      for (k = 0 ; k < MSIZE ; k++) {
         :	        sum = sum + (matrix_a[i][k] * matrix_b[k][j]) ;
         :	      }
         :	      matrix_r[i][j] = sum ;
    0.01 :	  8c3c:       str     r3, [ip, #4]!
         :	void multiply_matrices()
         :	{
         :	  int i, j, k ;
         :
         :	  for (i = 0 ; i < MSIZE ; i++) {
         :	    for (j = 0 ; j < MSIZE ; j++) {
    0.01 :	  8c40:       bne     8bfc 
         :
         :	void multiply_matrices()
         :	{
         :	  int i, j, k ;
         :
         :	  for (i = 0 ; i < MSIZE ; i++) {
    0.00 :	  8c44:       add     r6, r6, #1
    0.00 :	  8c48:       cmp     r6, #500        ; 0x1f4
    0.00 :	  8c4c:       bne     8be8 
         :	        sum = sum + (matrix_a[i][k] * matrix_b[k][j]) ;
         :	      }
         :	      matrix_r[i][j] = sum ;
         :	    }
         :	  }
         :	}
    0.00 :	  8c50:       pop     {r4, r5, r6, r7, r8, r9, sl}
    0.00 :	  8c54:       bx      lr

The perf annotate command intersperses source and assembly language code. It uses the line numbers in the debug information to associate source lines with the assembly language. (The profile data is attributed to instructions during collection and must be “back annotated” to source.) As you can see, parts of the function body are displayed multiple times due to code rearrangement by the compiler. Compiler optimizations transform and move instructions, complicating and obfuscating the association of instructions back to source, and sometimes making the disassembled instructions hard to read. (The TUI interface colors source and instructions differently, one more point in favor of TUI!) The --no-source option disables the source display, which makes it easier to read the annotated disassembly by itself. The example below shows just the annotated disassembly for multiply_matrices().

> perf annotate --stdio --dsos=naive --symbol=multiply_matrices --no-source

 Percent |	Source code & Disassembly of naive
------------------------------------------------
         :
         :	Disassembly of section .text:
         :
         :	00008bd0 :
    0.00 :	  8bd0:       push    {r4, r5, r6, r7, r8, r9, sl}
    0.00 :	  8bd4:       mov     r6, #0
    0.00 :	  8bd8:       ldr     sl, [pc, #124]  ; 8c5c 
    0.00 :	  8bdc:       ldr     r4, [pc, #124]  ; 8c60 
    0.00 :	  8be0:       ldr     r8, [pc, #124]  ; 8c64 
    0.00 :	  8be4:       mov     r7, #2000       ; 0x7d0
    0.00 :	  8be8:       mul     r5, r7, r6
    0.00 :	  8bec:       mov     r0, #0
    0.00 :	  8bf0:       sub     r5, r5, #4
    0.00 :	  8bf4:       add     ip, r8, r5
    0.00 :	  8bf8:       add     r5, sl, r5
    0.01 :	  8bfc:       vldr    s15, [pc, #84]  ; 8c58 
    0.22 :	  8c00:       mov     r1, r5
    0.00 :	  8c04:       add     r2, r4, r0, lsl #2
    0.20 :	  8c08:       mov     r3, #0
    1.76 :	  8c0c:       add     r1, r1, #4
    0.58 :	  8c10:       vldr    s14, [r2]
    1.07 :	  8c14:       vldr    s13, [r1]
    1.67 :	  8c18:       add     r3, r3, #1
   54.13 :	  8c1c:       cmp     r3, #500        ; 0x1f4
    2.39 :	  8c20:       mov     r9, r1
   36.60 :	  8c24:       vmla.f32        s15, s13, s14
    1.34 :	  8c28:       add     r2, r2, #2000   ; 0x7d0
    0.00 :	  8c2c:       bne     8c0c 
    0.01 :	  8c30:       vmov    r3, s15
    0.00 :	  8c34:       add     r0, r0, #1
    0.00 :	  8c38:       cmp     r0, #500        ; 0x1f4
    0.01 :	  8c3c:       str     r3, [ip, #4]!
    0.01 :	  8c40:       bne     8bfc 
    0.00 :	  8c44:       add     r6, r6, #1
    0.00 :	  8c48:       cmp     r6, #500        ; 0x1f4
    0.00 :	  8c4c:       bne     8be8 
    0.00 :	  8c50:       pop     {r4, r5, r6, r7, r8, r9, sl}
    0.00 :	  8c54:       bx      lr

The hottest part of multiply_matrices() is the assignment statement within the innermost loop:

sum = sum + (matrix_a[i][k] * matrix_b[k][j]) ;

Looking at the annotated disassembly, a large portion of the execution time is ascribed to the compare (cmp) instruction within the innermost loop:

8c1c:       cmp     r3, #500        ; 0x1f4

You shouldn’t read too much into the execution time attributed to individual instructions, however. The compare instruction assuredly does not consume 54.13% of total execution time; I explain in a moment why time attribution to instructions is only approximate.

How does PERF make a profile?

PERF uses statistical sampling to collect profile data. Although you do not need to know all of the details, a basic understanding of the underlying method will help you to collect and analyze profiles.

We mentioned earlier that the PERF cpu-clock event uses the Linux time clock to measure the passage of time. The time clock is programmed to generate an interrupt at regular time intervals. When an interval expires and an interrupt occurs, PERF determines what the CPU was doing at the time of the interrupt. PERF identifies the software component that was interrupted, it captures the CPU core number, program counter and other relevant information, and it writes all of this data to a buffer in the form of a data item called a sample. The buffer is eventually written to the perf.data file. User-space tools like perf report and perf annotate read the samples from the perf.data file, aggregate the samples into a table or profile, and display the finished table/profile.

Sampling does not require any changes to the source code. This is a major advantage of sampling over in situ instrumentation. Sampling can collect and display profiles even if the executable binary images are stripped and no symbolic (debug) information is available. Samples are attributed to instructions and it’s always possible to display an annotated disassembly of a program. Profiles are more informative when symbolic information is available, however, and back-annotation to source absolutely requires debug information (including source file line numbers).

Because PERF is a statistical sampling process, you need to collect enough individual samples to produce a statistically meaningful result and to reduce measurement error. Profiles show the distribution of events like CPU clock intervals with respect to program locations (specifically, instructions in executable binary modules). The program under test needs to run long enough in order to accumulate samples in the code regions of interest (the hot spots).

How long is “long enough?” It depends upon the characteristics of the workload (the program under test). Scientific applications like matrix multiply usually have tight inner loops with intense activity. The naive matrix multiplication program runs for about 18 seconds on the Raspberry Pi — long enough to accumulate about 75,000 cpu-clock samples. Most of these samples are attributed to the innermost loop of the matrix multiplication code. Database applications, however, tend to have a larger footprint with many code regions of low-to-moderate activity. A database application might have to run for a much longer period of time before hot spots are clearly discernible in the final profile.

It’s a judgment call. A practical rule of thumb is to require at least 100 to 500 samples in the code regions of interest before conducting analysis. Similarly, if the hot regions have several thousand samples each, then any cold region with fewer than 100 samples can safely be ignored and eliminated from analysis. Think statistically!

One way to increase run time (and samples) is to execute a critical loop multiple times. Another way is to run the same workload multiple times and merge the samples from all runs. Check out the perf record --append option which appends samples to the output file.
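For example, the second run below adds its samples to the perf.data file written by the first run:

> perf record -e cpu-clock ./naive
> perf record --append -e cpu-clock ./naive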

Another way to increase the number of samples is to increase the resolution at which samples are taken. For the cpu-clock event, this means using a shorter interval between successive samples — a higher sampling frequency. You set the perf record sampling frequency with its --freq option as shown in the example below.

> perf record -e cpu-clock --freq=8000 ./naive

[ perf record: Woken up 26 times to write data ]
[ perf record: Captured and wrote 6.348 MB perf.data (~277351 samples) ]

> perf report --stdio --show-nr-samples --dsos=naive
# dso: naive
# ...
#
# Samples: 165K of event 'cpu-clock'
# Event count (approx.): 165248
#
# Overhead  Samples    Command                   Symbol
# ........ ..........  .......  .......................
#
    99.84%     164990    naive  [.] multiply_matrices  
     0.14%        232    naive  [.] initialize_matrices
     0.02%         26    naive  [.] 0x000006e0         

The perf record command in this example collected twice as many samples as the first example (165,248 samples versus 74,429 total cpu-clock samples) thanks to the higher sampling frequency (8,000 samples per second versus the default 4,000).

Here’s another way to think about the relationship of run time and sampling frequency. The elapsed time for the first example run is about 16.75 seconds. The sampling frequency is 4,000 samples per second. So, perf record should have collected about (16.75 * 4,000) = 67,000 samples, which is close to the actual sample count, 74,429. The elapsed time of the last example is 20.78 seconds and the sampling frequency is 8,000 samples per second. The expected number of cpu-clock samples is (20.78 * 8,000) = 166,240, which is very near the actual sample count, 165,248.

The perf evlist command displays the kinds of events in a perf.data file. The -v option displays the event names and PERF configuration information for each type of event including the sampling frequency. The -F option displays the event names and the sampling frequency for each event.

> perf evlist -F
cpu-clock: sample_freq=8000

We need to be careful when setting the sampling frequency, especially on a single-core platform. A higher sampling frequency means higher data collection overhead since PERF needs to collect and store more samples. Consequently, the application program runs longer (i.e., the elapsed time is higher) because the application and PERF must share the same core, and PERF is taking the CPU away from the application. Further, the program and PERF must share the same microarchitectural resources: the caches, translation lookaside buffers (TLB) and branch history table. PERF, like any other active software component, changes the contents of these shared microarchitectural components and perturbs the behavior of the program under test. If the sampling frequency is too high, then the profiles no longer reflect the actual standalone behavior of the application program. The measurement process biases the final result in an undesirable way.

Finally, why don’t cpu-clock profiles identify slow instructions? The answer has to do with the way cpu-clock events are attributed back to binary code when handling a timer interrupt. First, there is a delay between the time when a sampling interrupt is requested and when the sampling interrupt is honored and handled. Second, the program counter value that is captured in a sample is the program address at which execution will restart after interrupt handling is complete; it is not the program location where the interrupt was first asserted. The combination of these two factors is called skid and it affects the attribution and distribution of samples in the final profile. In the presence of skid, samples are attributed to the general neighborhood around performance culprits such as long latency memory load operations. The cpu-clock event cannot be precisely attributed to slow instructions — just the hot code region containing the culprit. This is why you shouldn’t conclude that the compare (cmp) instruction:

8c1c:       cmp     r3, #500        ; 0x1f4

is slowing down the matrix multiplication program. It’s a victim of skid. The real culprit — a slow load from memory — is nearby.

Some machine implementations support the precise attribution of events to program counter values. Precise attribution improves profile quality and gives us accurate information about the behavior of individual instructions. AMD processors implement precise attribution through Instruction-Based Sampling (IBS) and Intel processors implement precise attribution via Precise Event-Based Sampling (PEBS). PERF supports both IBS and PEBS.

Copyright © 2013-2015 Paul J. Drongowski
