I hope you have enjoyed my series of articles about Raspberry Pi 4 performance events, measurement and tuning:
- ARM Cortex-A72 tuning: IPC
- Cortex-A72 tuning: Data access
- Cortex-A72 tuning: Branch mispredictions
- Cortex-A72 tuning: Memory access
- ARM Cortex-A72 fetch and branch processing
- ARM Cortex-A72 execution and load/store operations
- Raspberry Pi 4 Performance Events
Today, I want to wrap up the series with C code.
Please don’t forget my Performance Events for Linux tutorial and learn to make your own Raspberry Pi 4 (Broadcom BCM2711) performance measurements. The commands in the PERF tutorial apply to x86, AMD64 and other architectures, too.
Before getting too far, here is the link to the ZIP file with the code. 🙂 The main source files are:
- makefile: The make file (duh!)
- pe_assist.h: Performance event helper header
- pe_assist.c: Performance event helper functions
- pe_cortex_a72.h: A72-specific helper header
- pe_cortex_a72.c: A72-specific helper functions
- pe_test.c: pe_assist check-out test
- pe_matrix.c: pe_assist matrix multiply example
- a72_test.c: A72-specific check-out test
- a72_walk.c: A72-specific array walk kernel
- a72_matrix.c: A72-specific matrix multiply
- a72_misp.c: A72-specific branch mispredict kernel
- a72_chase.c: A72-specific pointer chasing kernel
There are a few surprises, too, such as earlier versions of code, etc.
The programs self-monitor, that is, they call perf_event_open() to configure, control and read the performance counters. perf_event_open() has many parameters, so I wrote helper functions assisting counter configuration, control and access. There are two flavors: architecture independent and Cortex-A72 specific. The architecture independent functions are defined in pe_assist.* and the A72-specific functions are defined in pe_cortex_a72.*. The architecture independent functions should work on x86, etc., too.
Aside from the two check-out tests, the rest of the source modules are workloads. These are the programs that I used to collect data for the articles about Cortex-A72 performance measurement, analysis and tuning. Feel free to bash away at everything!
Helper functions
As I mentioned above, I separated the helper functions into architecture independent and Cortex-A72 specific modules. The architecture independent helper functions handle Linux performance counter set-up, control and read back:
- peInitialize(): Initialize/reset the helper module
- peMakeGroup(): Make a counter group
- peAddLeader(): Add leader event to the group
- peAddEvent(): Add an event to the group
- peStartCounting(): Start the counter group
- peStopCounting(): Stop the counter group
- peResetCounters(): Reset the counters
- peReadCount(): Read an event count
- pePrintCount(): Print an event count
The interface is “lite” and uncomplicated. It’s just enough to get the job done. Sometimes during early days, there is a temptation to build the Taj Mahal. I prefer to build something simple and get experience before building up and out. This simple interface proved to be good enough.
perf_event_open() supports simple event counting and sampling. If you’re familiar with perf stat, you’ve already seen simple event counting, AKA counting mode. perf stat measures events across the entire run of an application program. Self-monitoring is similar except you insert measurement code into the application program around the critical code that you wish to measure. perf stat doesn’t require code modification or recompile, but it doesn’t let you focus on particular critical loops or whatever. Self-monitoring is a little bit more effort, but it allows focus.
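To make counting mode concrete, here is a minimal sketch of how one counter could be set up with the raw Linux interface. It is not taken from pe_assist.c; the function names sys_perf_event_open(), counting_attr() and open_instruction_counter() are made up for illustration, and the choice to exclude kernel and hypervisor events is an assumption.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* glibc has no wrapper for perf_event_open(), so call it through syscall(). */
long sys_perf_event_open(struct perf_event_attr *attr, pid_t pid, int cpu,
                         int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

/* Hypothetical helper: fill in the attributes for a counting-mode event. */
struct perf_event_attr counting_attr(uint32_t type, uint64_t config)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.type = type;           /* e.g. PERF_TYPE_HARDWARE */
    attr.size = sizeof(attr);
    attr.config = config;       /* which event to count */
    attr.disabled = 1;          /* stay stopped until explicitly enabled */
    attr.exclude_kernel = 1;    /* assumption: count user-space events only */
    attr.exclude_hv = 1;
    /* sample_period stays 0, which means counting rather than sampling. */
    return attr;
}

/* Open a retired-instruction counter for the calling thread on any CPU.
 * Returns a file descriptor, or -1 if the kernel refuses the request. */
int open_instruction_counter(void)
{
    struct perf_event_attr attr =
        counting_attr(PERF_TYPE_HARDWARE, PERF_COUNT_HW_INSTRUCTIONS);

    return (int)sys_perf_event_open(&attr, 0, -1, -1, 0UL);
}
```

The call can fail for reasons unrelated to the code, for example a restrictive perf_event_paranoid setting or a virtual machine without PMU access, so a caller should always check for -1.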
The usage model is straightforward:
- Initialize the module data structures.
- Create a performance event group.
- Add a leader event to the group.
- Add other events (up to 6 events for Cortex-A72) to the group.
- Start the counter group.
- Execute the workload or critical inner loops.
- Stop the counter group.
- Read and print the event counts.
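The steps above can be sketched with the underlying system calls. This sketch deliberately does not use the pe_assist functions, because their signatures are not shown here; open_event(), example_workload() and measure() are hypothetical names, and the two events (cycles as leader, instructions as member) are just an example.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Open one counting-mode event for the calling thread.
 * Pass group_fd = -1 to create a leader, or the leader's fd to join its group. */
int open_event(uint32_t type, uint64_t config, int group_fd)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.type = type;
    attr.size = sizeof(attr);
    attr.config = config;
    attr.disabled = (group_fd == -1); /* members follow the leader's state */
    attr.exclude_kernel = 1;          /* assumption: user-space only */
    attr.exclude_hv = 1;
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, group_fd, 0UL);
}

volatile uint64_t sink; /* volatile so the loop below is not optimized away */

/* Stand-in for the critical loop: sums 1..1000000 into sink. */
void example_workload(void)
{
    sink = 0;
    for (uint64_t i = 1; i <= 1000000; i++)
        sink += i;
}

/* Create a two-event group, run the workload between start and stop,
 * then read both counts. Returns 0 on success, -1 on any failure. */
int measure(void (*workload)(void), uint64_t *cycles, uint64_t *instructions)
{
    int leader = open_event(PERF_TYPE_HARDWARE, PERF_COUNT_HW_CPU_CYCLES, -1);
    if (leader < 0)
        return -1;
    int member = open_event(PERF_TYPE_HARDWARE, PERF_COUNT_HW_INSTRUCTIONS,
                            leader);
    if (member < 0) {
        close(leader);
        return -1;
    }

    ioctl(leader, PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP);
    ioctl(leader, PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP);  /* start */
    workload();
    ioctl(leader, PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP); /* stop */

    /* Without a read_format, each fd returns a single 64-bit count. */
    int ok = read(leader, cycles, sizeof(*cycles)) == (ssize_t)sizeof(*cycles)
          && read(member, instructions, sizeof(*instructions))
                 == (ssize_t)sizeof(*instructions);

    close(member);
    close(leader);
    return ok ? 0 : -1;
}
```

A caller would invoke measure(example_workload, &cycles, &instructions) and print the two counts when it returns 0.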
Since this sequence is a recurring pattern, I also wrote a few functions which target common types of measurements such as:
peMeasureInstructionEvents()
pePrintInstructionEvents()
peMeasureDataAccessEvents()
pePrintDataAccessEvents()
These functions configure the pre-defined “symbolic” events which the Linux kernel has preselected for the platform architecture. Thus, you should be able to use the pe_assist.* module on any Linux box.
The Cortex-A72 module, pe_cortex_a72.*, uses “raw” event identifiers for configuration. The available events are defined in pe_cortex_a72.h and they are specific to ARM Cortex-A72. I rely mainly on the Cortex-A72 events because then I know exactly which A72 events I am measuring. The Cortex-A72 module calls the low-level helper functions and it exports only targeted measurement functions:
a72MeasureInstructionEvents()
a72PrintInstructionEvents()
a72MeasureDataAccessEvents()
a72PrintDataAccessEvents()
a72MeasureTlbEvents()
a72PrintTlbEvents()
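The difference between symbolic and raw events comes down to two fields of the attribute structure. The following sketch shows both flavors side by side. The enum names are illustrative and may not match those in pe_cortex_a72.h; the event numbers are the standard Arm PMU event numbers for these events.

```c
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>

/* A few Arm PMU event numbers. The names are hypothetical and are not
 * copied from pe_cortex_a72.h. */
enum a72_event {
    A72_L1D_CACHE_REFILL = 0x03, /* L1 data cache refill */
    A72_L1D_CACHE        = 0x04, /* L1 data cache access */
    A72_INST_RETIRED     = 0x08, /* instruction architecturally executed */
    A72_BR_MIS_PRED      = 0x10, /* mispredicted branch */
    A72_CPU_CYCLES       = 0x11  /* processor cycle */
};

/* Common attribute set-up shared by both flavors. */
struct perf_event_attr event_attr(uint32_t type, uint64_t config)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.type = type;
    attr.size = sizeof(attr);
    attr.config = config;
    attr.disabled = 1;
    return attr;
}

/* Symbolic flavor: the kernel decides which hardware event this maps to. */
struct perf_event_attr symbolic_instructions(void)
{
    return event_attr(PERF_TYPE_HARDWARE, PERF_COUNT_HW_INSTRUCTIONS);
}

/* Raw flavor: the event number is passed to the PMU as given. */
struct perf_event_attr raw_a72_event(enum a72_event event)
{
    return event_attr(PERF_TYPE_RAW, (uint64_t)event);
}
```

The raw flavor is portable only to processors that implement the same event numbers, which is the trade-off for knowing exactly what is being counted.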
Take a peek inside one of the test programs and you’ll see how to call the helper modules.
Internal design
perf_event_open() is the Swiss Army knife of performance counter configuration and control. On Linux, all counter-related operations start with this single kernel call: it creates a counter and returns a file descriptor, which the program then controls with ioctl() and reads with read().
perf_event_open() allows control of individual counters, of course. However, it also provides a way to control a group of counters. One can save additional trips in and out of the kernel through counter groups. Instead of making six calls to start six counters, one only needs to make one ioctl() call on the group leader to start an entire group of six events.
A Cortex-A72 group consists of one to six event counters. Each group has a distinguished member: the leader event. You can start, stop and reset the entire group by referring to the leader. Because the group members usually share characteristics like the process (ID) to be measured, the CPU set, flags, etc., it makes sense to define all of these common properties for an entire group. This approach reduces the number of parameters to be passed around during configuration.
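Group-wide control and read-back might look like the sketch below. It assumes the leader was opened with attr.read_format = PERF_FORMAT_GROUP and no other read_format bits, in which case one read() on the leader returns a 64-bit event count followed by that many 64-bit values. The names group_control(), unpack_group() and read_group() are hypothetical and do not come from the helper modules.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>

#define MAX_GROUP_EVENTS 6 /* group size on Cortex-A72, per the text */

/* Apply one request (PERF_EVENT_IOC_ENABLE, _DISABLE or _RESET) to every
 * counter in the group by addressing the leader. Returns -1 on error. */
int group_control(int leader_fd, unsigned long request)
{
    return ioctl(leader_fd, request, PERF_IOC_FLAG_GROUP);
}

/* Unpack a PERF_FORMAT_GROUP record: buf[0] holds the number of events,
 * buf[1..] hold the counts. Returns the number of counts, or -1 if the
 * record is truncated or too large for the caller's array. */
int unpack_group(const uint64_t *buf, size_t words,
                 uint64_t *counts, size_t max_counts)
{
    if (words < 1)
        return -1;
    uint64_t nr = buf[0];
    if (nr > max_counts || nr + 1 > words)
        return -1;
    for (uint64_t i = 0; i < nr; i++)
        counts[i] = buf[1 + i];
    return (int)nr;
}

/* Read every count in the group with a single read() on the leader. */
int read_group(int leader_fd, uint64_t *counts, size_t max_counts)
{
    uint64_t buf[1 + MAX_GROUP_EVENTS];
    ssize_t n = read(leader_fd, buf, sizeof(buf));

    if (n < (ssize_t)sizeof(uint64_t))
        return -1;
    return unpack_group(buf, (size_t)n / sizeof(uint64_t), counts, max_counts);
}
```

With this arrangement a whole measurement costs one ioctl() to start, one to stop and one read(), no matter how many events are in the group.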
In keeping with the “lite” philosophy, the helper module keeps the common flags and such in a few variables and arrays. The group and leader definition functions establish group-wide values for the member events in the group. That’s all there is to it, so hack away! The “lite” approach was good enough 99% of the time, so you might not need to dip into the helper modules at all.
Copyright © 2021 Paul J. Drongowski