ARM Cortex-A72 tuning: Memory access

Today’s post characterizes read access time to three different levels of the Raspberry Pi 4 (Broadcom BCM2711) memory hierarchy. The ARM Cortex-A72 processor has a two-level cache structure: a Level 1 data (L1D) cache and a unified Level 2 (L2) cache. There is one L1D cache per core and all four cores share the L2 cache. Primary memory is the third and final level beyond the L2 cache.

The test program is a simple kernel (inner loop) that runs through a linked list, i.e., pointer chasing. Each linked list element is exactly one A72 cache line in size, 64 bytes. I have used pointer chasing on other non-ARM architectures (Alpha and AMD64 come to mind) and it’s a pretty simple and effective way to characterize memory access speed.
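The element declaration isn’t shown in this post, but a minimal sketch of a one-cache-line list element might look like the following. The nextLine field name matches the kernel shown below; the padding arithmetic is my assumption, not the author’s published code:

    /* Sketch of a linked list element that exactly fills one 64-byte
     * A72 cache line. Works for both 32-bit and 64-bit pointers. */
    typedef struct CacheLine {
        struct CacheLine *nextLine ;                 /* next element in the list */
        char pad[64 - sizeof(struct CacheLine *)] ;  /* fill out the cache line  */
    } CacheLine ;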

The trick is to adjust the number of linked elements so that the entire linked list fits within the memory level to be characterized. To facilitate run-to-run comparisons, there is an outer loop which repeatedly invokes list chasing, i.e., the entire list is walked multiple times per run.

There are two main run parameters:

  • The number of linked list elements (which determines the array size), and
  • The number of iterations (which is the number of times the full list is walked).

When the array size is doubled, the number of iterations is cut in half. This keeps the number of individual pointer chase operations (approximately) constant across runs.
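For example, 32 elements × 8,388,608 iterations and 65,536 elements × 4,096 iterations both perform 268,435,456 pointer chase operations per run, which matches the roughly 268 million L1D reads measured in the tables below.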

The following table summarizes the test run parameters and the memory level exercised by each run:

    #Elements  Iterations  Array Size  Mem Level
    ---------  ----------  ----------  ---------
           32     8388608         2KB  L1D cache
           64     4194304         4KB  L1D cache
          128     2097152         8KB  L1D cache
          256     1048576        16KB  L1D cache
          512      524288        32KB  L1D cache
         1024      262144        64KB  L2 cache
         2048      131072       128KB  L2 cache
         4096       65536       256KB  L2 cache
         8192       32768       512KB  L2 cache
        16384       16384         1MB  L2 cache
        32768        8192         2MB  RAM
        65536        4096         4MB  RAM

Here is the C code for the test kernel:

    initialize(number_of_elements) ;
    a72MeasureDataAccessEvents() ;

    start_clock() ;
    peStartCounting() ;
    for ( ; iterations > 0 ; iterations--) {
        for (CacheLine *p = listHead ; p != NULL ; p = p->nextLine) ;
    }
    peStopCounting() ;

    print_clock_time(stdout, get_clock_time()) ;
    a72PrintDataAccessEvents(stdout) ;

Both Linux clock() time and Cortex-A72 performance counter events are measured.

In my first experiments, the linked list elements were laid down in a linear sequential fashion and in a simple ping-pong scheme. I quickly discovered that the Cortex-A72’s aggressive data prefetch is too good: neither naive layout produced the expected number of L1D or L2 cache misses. The A72 speculatively reads the next cache line beyond a miss, so by the time execution reached the next list element (or the one after it), the needed element was already in cache or in flight.

Ideally, we want to fool the memory prefetcher and hit only the intended memory level, taking the full read access penalty each time we chase a pointer. I rewrote array/list initialization to lay down the list elements at (pseudo-)random positions in the array. The Fisher-Yates (Knuth) shuffle algorithm got the job done. Once list element layout was randomized, the pointer chasing test began producing the expected number of reads and misses.
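My initialization code isn’t shown here, but a minimal sketch of the idea, assuming the CacheLine element type above and a hypothetical helper name, could look like this:

    /* Sketch only: link the elements of 'array' in Fisher-Yates shuffled
     * order so the prefetcher cannot guess the next line. The helper name
     * and details are mine, not the author's published code. */
    #include <stdlib.h>

    CacheLine* shuffled_list(CacheLine* array, int n)
    {
        int* index = malloc(n * sizeof(int)) ;
        for (int i = 0 ; i < n ; i++) index[i] = i ;

        /* Fisher-Yates (Knuth) shuffle of the visit order */
        for (int i = n - 1 ; i > 0 ; i--) {
            int j = rand() % (i + 1) ;
            int t = index[i] ; index[i] = index[j] ; index[j] = t ;
        }

        /* Chain the elements in shuffled order and terminate the list */
        for (int i = 0 ; i < n - 1 ; i++)
            array[index[i]].nextLine = &array[index[i + 1]] ;
        array[index[n - 1]].nextLine = NULL ;

        CacheLine* head = &array[index[0]] ;
        free(index) ;
        return head ;
    }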

The following table summarizes each run by the number of retired instructions, CPU cycles, instructions per cycle (IPC) and execution time:

    Array  Mem  Retired Ins     CPU Cycles      IPC    Time (s)
    -----  ---  --------------  --------------  -----  --------
      2KB  L1D     847,249,436     855,660,776  0.990     0.609
      4KB  L1D     826,277,916   1,154,215,728  0.716     0.814
      8KB  L1D     815,792,156   1,114,379,370  0.732     0.806
     16KB  L1D     810,549,276   1,093,757,212  0.741     0.763
     32KB  L1D     807,927,836   1,382,324,229  0.584     0.975
     64KB  L2      806,617,116   5,074,763,198  0.159     3.446
    128KB  L2      805,961,756   5,643,312,493  0.143     3.805
    256KB  L2      805,634,076   6,621,262,142  0.122     4.452
    512KB  L2      805,470,236   7,163,843,161  0.112     4.813
      1MB  L2      805,388,316  27,563,140,814  0.029    18.421
      2MB  RAM     805,347,356  49,317,924,775  0.016    32.969
      4MB  RAM     805,326,876  54,865,753,267  0.015    36.645

No surprise, access to L1D cache is best, L2 is second best and primary memory is worst. In terms of CPU cycles, an L2 cache access takes about five times as long as an L1D cache access, and a primary memory access takes nearly 50 times as long. The effect on IPC is very significant.
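Dividing CPU cycles by the roughly 268 million pointer chases per run puts rough numbers on the cost of a single chase (loop overhead included): about 3 cycles for the 2KB run, about 19 to 27 cycles for the 64KB through 512KB runs, and about 184 to 204 cycles for the 2MB and 4MB runs.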

Taking a look at the L1D performance event counts:

    Array  Mem  IPC    Time (s)  L1D Reads    L1D Misses   Ratio
    -----  ---  -----  --------  -----------  -----------  ------
      2KB  L1D  0.990     0.609  268,435,785          484   0.000
      4KB  L1D  0.716     0.814  268,435,630        1,316   0.000
      8KB  L1D  0.732     0.806  268,435,639        1,149   0.000
     16KB  L1D  0.741     0.763  268,435,622        4,319  <0.001
     32KB  L1D  0.584     0.975  268,435,828   17,343,069   0.065
     64KB  L2   0.159     3.446  268,435,603  234,906,566   0.875
    128KB  L2   0.143     3.805  268,435,592  268,435,529   1.000
    256KB  L2   0.122     4.452  268,435,625  268,435,588   1.000
    512KB  L2   0.112     4.813  268,435,599  268,435,530   1.000
      1MB  L2   0.029    18.421  268,435,594  268,435,782   1.000
      2MB  RAM  0.016    32.969  268,435,579  268,435,960   1.000
      4MB  RAM  0.015    36.645  268,435,635  268,435,941   1.000

we see that pointer chasing exercises the L1D cache as designed. The L1D cache capacity is 32KB. The particular 32KB run shown here is cherry-picked: it has the shortest execution time of the 32KB runs. As I’ve seen on other architectures, measurements get a bit “weird” near cache capacity, and run statistics become inconsistent when a cache is nearly full. The shortest run best shows the break between L1D and L2 access.

Finally, here are the L2 cache performance event counts.

    Array  Mem  IPC    Time (s)  L2 Reads     L2 Misses    Ratio
    -----  ---  -----  --------  -----------  -----------  ------
      2KB  L1D  0.990     0.609        1,085           68   0.063
      4KB  L1D  0.716     0.814    8,490,994          228   0.000
      8KB  L1D  0.732     0.806    4,300,759          151   0.000
     16KB  L1D  0.741     0.763    2,102,562          163   0.000
     32KB  L1D  0.584     0.975   18,495,230        1,003  <0.001
     64KB  L2   0.159     3.446  235,483,730        1,517  <0.001
    128KB  L2   0.143     3.805  270,831,005        2,745  <0.001
    256KB  L2   0.122     4.452  269,203,020       31,340  <0.001
    512KB  L2   0.112     4.813  270,893,954      443,477   0.002
      1MB  L2   0.029    18.421  302,452,386  107,397,408   0.355
      2MB  RAM  0.016    32.969  286,244,127  227,010,870   0.793
      4MB  RAM  0.015    36.645  277,293,265  252,881,540   0.912

As expected, we see a dramatic breakpoint at 1MB, which is the capacity of the unified L2 cache.

Bottom line, these performance measurements reinforce the importance of cache-friendly algorithms and data access patterns. Start with the best algorithms for your application, measure cache events and then tune for minimum misses. Data access should hit most frequently in the Level 1 data cache, then L2 cache. Primary memory is fifty times (!) more expensive than L1D cache and reads out to primary memory should be as infrequent as possible. Your mantra should be, “Bring it into cache, compute the heck out of the in-cache data, then write the final results back to memory, and move on.”

Please check out the other articles in this series.

Don’t forget my Performance Events for Linux tutorial and learn to make your own Raspberry Pi 4 (Broadcom BCM2711) performance measurements.

Next time, I will wrap up this long series of articles with C code so you can perform your own experiments.

Copyright © 2021 Paul J. Drongowski

Linux clock and temperature: An interlude

I’m in the midst of investigating a performance anomaly which seemingly pops up at random. I wrote a pointer chasing program to exercise and measure cache miss performance events. As part of the testing regimen, I run the program several times in a row and compare run time, event counts, etc. and look for inconsistencies.

The program is usually well-behaved/consistent and produces the expected result. For example, when the program is configured to always hit in the level 1 data (L1D) cache, the program measures just a few L1D misses and the run time is short. However, occasionally a run is slow and has a slew of L1D misses. What’s up?

My first thought was re-scheduling, that is, the pointer chasing program starts on one core and is moved by the OS to another core. The cache on the new core is cold and more misses occur. The Linux taskset command launches and pins a program to a core. In fancier language, it sets the CPU affinity for a (running) process. If we pin the program to a particular core, the cache should stay warm.
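For example, the following command pins a run to core 2 (the program name here is just a placeholder):

    taskset -c 2 ./chase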

If you’re an old-timer and haven’t used taskset in a while, please be aware that a user must have CAP_SYS_NICE capability to change the affinity of a process. You can also set CAP_SYS_NICE capability for an application binary using the setcap utility:

    sudo setcap 'cap_sys_nice=eip' <application> 

You can check capabilities with getcap:

    getcap  <application>

The form of the capabilities string is in accordance with the cap_from_text call, so I recommend viewing its man page. The eip flags are case sensitive and specify the effective, inheritable and permitted sets, respectively.

As to the performance anomaly, setting the CPU (core) affinity did not resolve the issue. Long runs and misses kept popping up. My next thought was “maybe CPU clock throttling?”

There’s quite a bit of on-line material about Raspberry Pi clock throttling and I won’t repeat all of it here. Suffice it to say, the RPi 4 firmware has a so-called CPU scaling governor that kicks in at high temperatures. The governor tries to keep the CPU temperature below 80°C. Over-temperature occurs when the temperature rises above 85°C. The governor adjusts (throttles) the CPU clock to achieve the configured operating temperature goals.

We do know that Raspberry Pi 4 can run hot. My RPi4 has heat sinks installed, but no case fan. Heat vents out the top of the Canakit plastic enclosure. The heat sinks are warm to the touch, not super hot, really. However, it’s not a bad idea to take the Pi’s temperature.

The following command displays the RPi’s temperature:

    cat /sys/class/thermal/thermal_zone0/temp 

Divide the result by 1000 to obtain the temperature in degrees Celsius. The next command displays the current frequency (kHz):

    cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq

Divide the result by 1000 to obtain the frequency in MHz. This frequency is the Linux kernel’s requested frequency. The actual, possibly throttled frequency may be different.
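If you prefer to read these values programmatically, here is a small sketch in C. It is my own helper, not part of the test program; it simply reads the two sysfs files shown above:

    /* Sketch: read BCM2711 temperature and requested CPU frequency
     * from sysfs. My own illustration, not the author's code. */
    #include <stdio.h>

    static long read_sysfs_long(const char* path)
    {
        FILE* f = fopen(path, "r") ;
        long value = -1 ;
        if (f != NULL) {
            if (fscanf(f, "%ld", &value) != 1) value = -1 ;
            fclose(f) ;
        }
        return value ;
    }

    int main(void)
    {
        long mdeg = read_sysfs_long("/sys/class/thermal/thermal_zone0/temp") ;
        long khz  = read_sysfs_long("/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq") ;
        printf("Temperature: %.1f C\n", mdeg / 1000.0) ;
        printf("Requested frequency: %.0f MHz\n", khz / 1000.0) ;
        return 0 ;
    }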

The Pi’s vcgencmd command is even better! vcgencmd is the Swiss Army knife of system info utilities. The following command displays a list of vcgencmd subcommands:

    vcgencmd commands

Here are a few commands to get you started:

    vcgencmd measure_temp
    vcgencmd measure_clock arm
    vcgencmd measure_volts core
    vcgencmd get_throttled

See the vcgencmd man page for more.

You may run into permission issues with vcgencmd. I couldn’t tame the permissions and simply ran vcgencmd via sudo.

The get_throttled subcommand returns a bit mask. Here’s the magic decoder ring:

    Bit  Hex value  Meaning
    ---  ---------  -----------------------------------
      0          1  Under-voltage detected (< 4.64V)
      1          2  Arm frequency capped (temp > 80°C)
      2          4  Currently throttled
      3          8  Soft temperature limit active
     16      10000  Under-voltage has occurred
     17      20000  Arm frequency capping has occurred
     18      40000  Throttling has occurred
     19      80000  Soft temperature limit has occurred
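As an illustration of the table, here is a small decoder sketch (my own, not an official tool) that takes the hex value reported by get_throttled as a command line argument:

    /* Sketch: decode the get_throttled bit mask, e.g. pass "0x50000"
     * for throttled=0x50000. My own illustration of the table above. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char* argv[])
    {
        unsigned long mask = (argc > 1) ? strtoul(argv[1], NULL, 16) : 0 ;
        if (mask & 0x1)     puts("Under-voltage detected") ;
        if (mask & 0x2)     puts("Arm frequency capped") ;
        if (mask & 0x4)     puts("Currently throttled") ;
        if (mask & 0x8)     puts("Soft temperature limit active") ;
        if (mask & 0x10000) puts("Under-voltage has occurred") ;
        if (mask & 0x20000) puts("Arm frequency capping has occurred") ;
        if (mask & 0x40000) puts("Throttling has occurred") ;
        if (mask & 0x80000) puts("Soft temperature limit has occurred") ;
        return 0 ;
    }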

If all of that isn’t enough, you can install and run cpufrequtils:

    sudo apt install cpufrequtils
    cpufreq-info

After running the workload and measuring both CPU clock and temperature, throttling did not appear to be a problem. My current conjecture has to do with Linux fixes for the Spectre security vulnerability. In short, Spectre is a class of vulnerabilities exploiting observable side-effects of the machine micro-architecture in order to set up clandestine information channels that leak confidential data. One way to suppress data cache observables is to flush (clean) and invalidate the data caches during a context switch. If the data cache is invalidated, cache misses and program run time will go up. Stay tuned.

Even though I haven’t found the source of the performance anomaly, I welcomed the chance to learn about vcgencmd, etc. Off to investigate Linux hardware cache flushing…

Copyright © 2021 Paul J. Drongowski

ARM Cortex-A72 tuning: Branch mispredictions

Back on the old day job, I developed and tested software and hardware for program profiling. Testing may sound like drudge-work, but there are ways to make things fun!

Two questions arise while testing a profiling infrastructure — software plus hardware:

  • Does the hardware accurately count (or sample) performance events for a given specific workload?
  • Does the software accurately display the counts or samples?

Clearly, ya need working hardware before you can build working software.

Testing requires a solid, known-good (KG) baseline in order to decide if new test results are correct. Here’s one way to get a KG baseline — a combination of static analysis and measurement:

  • Static analysis: Analyze the post-compilation machine code and predict the expected number of instruction retires, cache reads, misses, etc.
  • Measurement: Run the code and count performance events.
  • Validation: Compare the measured results against the predicted results.

Thereafter, one can take new measurements from the system under test (SUT) and compare them against both the predicted results and the baseline measurements.

Applying this method to performance counter counting mode is straightforward. You might get a little “hair” in the counts due to run-to-run variability; however, the results should be well within a small measurement error. Performance counter sampling mode is more difficult to assess, and one must be sure to collect a statistically significant number of samples within critical workload code in order to have confidence in a result.

One way to make testing fun is to make it a game. I wrote kernel programs that exercised specific hardware events and analyzed the inner test loops. You could call these programs “test kernels.” The kernels are pathologically bad (or good!) code which triggers a large number of specific performance events. It’s kind of a game to write such bad code…

The expected number of performance events is predicted through machine code level complexity analysis known as program “microanalysis.” For example, the inner loops of matrix multiplication are examined and, knowing the matrix sizes, the number of retired instructions, cache reads, branches, etc. are computed in closed form, e.g.,

    (38 inner loop instructions) * (1,000,000,000 iterations) +
    (26 middle loop instructions) * (1,000,000 iterations) +
     (9 outer loop instructions) * (1,000 iterations)
    -----------------------------------------------------------
    38,026,009,000 retired instructions expected
    38,227,831,497 retired instructions measured

This formula is the closed form expression for the retired instruction count within the textbook matrix multiplication kernel. The microanalysis approach worked successfully on Alpha, Itanium, x86, x64 and (now) ARM. [That’s a short list of machines that I’ve worked on. 🙂 ]

With that background in mind, let’s write a program kernel to deliberately cause branch mispredictions and measure branch mispredict events.

The ARM Cortex-A72 core predicts conditional branch direction in order to aggressively prefetch and dispatch instructions along an anticipated program path before the actual branch direction is known. A branch mispredict event occurs when the core detects a mistaken prediction. Micro-ops on the wrong path must be discarded and the front-end must be steered down the correct program path. The Cortex-A72 mispredict penalty is 15 cycles.

What we need is a program condition that consistently fools the Cortex-A72 branch prediction hardware. Branch predictors try to remember a program’s tendency to take or not take a branch, and they are quite sensitive; even a 49%/51% split between taken and not taken has a beneficial effect on performance. So, we need a condition with a 50%/50% split and a random pattern of taken and not-taken directions.

Here’s the overall approach. We fill a large array with a random pattern of ‘0’ and ‘1’ characters. Then, we walk through the array and count the number of ‘1’ characters. The function initialize_test_array() fills the array with a (pseudo-)random pattern of ones and zeroes:

void initialize_test_array(int size, char* array,
                           int always_one, int always_zero)
{
    register char* r = array ;
    int s ;
    for (s = size ; s > 0 ; s--) {
        if (always_one) {
            *r++ = '1' ;
        } else if (always_zero) {
            *r++ = '0' ;
        } else {
            *r++ = ((rand() & 0x1) ? '1' : '0') ;
        }
    }
}

The function has options to fill the array with all ones or all zeroes in case you want to see what happens when the inner conditional branch is well-predicted. BTW, I made the array 20,000,000 characters long. The size is not especially important other than the desire to have a modestly long run time.

The function below, test_loop(), contains the inner condition itself:

int test_loop(int size, char* array)
{
    register int count = 0 ;
    register char* r = array ;
    int s ;
    for (s = size ; s > 0 ; s--) {
        if (*r++ == '1') count++ ;   // Should mispredict!
    }
    return( count ) ;
}

The C compiler translates the test for ‘1’ to a conditional branch instruction. Given an array with random ‘0’ and ‘1’ characters, we should be able to fool the hardware branch predictor. Please note that the compiler generates a conditional branch for the array/loop termination condition, s > 0. This conditional branch should be almost always predicted correctly.

The function run_the_test() runs the test loop:

void run_the_test(int iteration_count, int array_size, char* array)
{
    register int rarray_size = array_size ;
    register char* rarray = array ;
    int i ;
    for (i = iteration_count ; i-- ; ) {
        test_loop(rarray_size, rarray) ;
    }
}

It calls test_loop() many times as determined by iteration_count. Redundant iterations aren’t strictly necessary when taking measurements in counting mode. They are needed, however, in sampling mode in order to collect a statistically significant number of performance event samples. I set the iteration count to 200 — enough to get a reasonable run time when sampling.

The test driver code initializes the branch condition array, configures the ARM Cortex-A72 performance counters, starts the counters, runs the test loop, stops the counters and prints the performance event counts:

initialize_test_array(array_size, array, always_one, always_zero) ; 
a72MeasureInstructionEvents() ;
peStartCounting() ;
run_the_test(iteration_count, array_size, array) ;
peStopCounting() ;
a72PrintInstructionEvents(stdout) ;

The four counter configuration, control and display functions are part of a small utility module that I wrote. I will explain the utility module in a future post and will publish the code, too.

Finally, here are the measurements when scanning an array holding a random pattern of ‘0’ and ‘1’ characters:

    Instructions ret'd:       45,999,735,845
    Instructions spec'd:      98,395,483,123
    CPU cycles:               59,010,851,259
    Branch speculated:         8,012,669,711
    Branch mispredicted:       2,001,934,251
    Branch predicted:          8,012,669,710
    Instructions per cycle:            0.780
    Retired/spec'd ratio:              0.467
    Branches per 1000 (PTI):         174.189
    Branch mispredict ratio:           0.250

Please recall that there are two conditional branches in the inner test loop: a conditional branch to detect ‘1’ characters and a conditional branch to check the array/loop termination condition. The loop check, which accounts for half of all the conditional branches executed, should be predicted correctly almost all the time. The character test, however, should be incorrectly predicted 50% of the time. It’s like guessing coin flips — you’ll be right half the time on average. Overall, 25% of branch predictions should be incorrect, and yes, the measured branch mispredict ratio is 0.250 or 25%.
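A quick sanity check on the counts: a 20,000,000 character array scanned 200 times executes each inner conditional branch 4,000,000,000 times, or 8 billion conditional branches in total, which matches the 8,012,669,711 speculated branches. Half of the 4 billion character tests, 2 billion, should mispredict, which matches the 2,001,934,251 measured mispredicts.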

The number of speculated instructions is also very interesting. Cortex-A72 speculated twice as many ARMv8 instructions as it retired. Over half of the speculated instructions did not complete architecturally and were discarded. That’s what happens when a conditional branch is grossly mispredicted!
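Another back-of-envelope check: at 15 cycles per mispredict, 2,001,934,251 mispredicts cost roughly 30 billion cycles, about half of the 59 billion CPU cycles measured for the entire run.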

I hope you enjoyed this simple experiment. It makes the Cortex-A72 fetch and branch prediction behavior come alive. As a follow-up experiment, I suggest trying all-ones or all-zeroes.

Please check out the other articles in this series.

Don’t forget my Performance Events for Linux tutorial and learn to make your own Raspberry Pi 4 (Broadcom BCM2711) performance measurements.

Copyright © 2021 Paul J. Drongowski

ARM Cortex-A72 branch-related performance events:

    Number  Mnemonic          Event name
    ------  ----------------  -----------------------------------------
    0x08    INST_RETIRED      Instruction architecturally executed
    0x10    BR_MIS_PRED       Mispredicted or not predicted branches
    0x11    CPU_CYCLES        Processor cycles
    0x12    BR_PRED           Predictable branch speculatively executed
    0x1B    INST_SPEC         Operation speculatively executed
    0x76    PC_WRITE_SPEC     Software change of the PC (speculative)
    0x78    BR_IMMED_SPEC     Immediate branch (speculative)
    0x79    BR_RETURN_SPEC    Procedure return (speculative)
    0x7A    BR_INDIRECT_SPEC  Indirect branch (speculative)

Disassembled code for test_loop():

00010678 <test_loop>:
   10678:  e92d0830   push  {r4, r5, fp}
   1067c:  e28db008   add   fp, sp, #8
   10680:  e24dd014   sub   sp, sp, #20
   10684:  e50b0018   str   r0, [fp, #-24]   ; 0xffffffe8
   10688:  e50b101c   str   r1, [fp, #-28]   ; 0xffffffe4
   1068c:  e3a04000   mov   r4, #0
   10690:  e51b501c   ldr   r5, [fp, #-28]   ; 0xffffffe4
   10694:  e51b3018   ldr   r3, [fp, #-24]   ; 0xffffffe8
   10698:  e50b3010   str   r3, [fp, #-16]
   1069c:  ea000008   b     106c4 <test_loop+0x4c>
   106a0:  e1a03005   mov   r3, r5
   106a4:  e2835001   add   r5, r3, #1
   106a8:  e5d33000   ldrb  r3, [r3]
   106ac:  e3530031   cmp   r3, #49          ; 0x31
   106b0:  1a000000   bne   106b8 <test_loop+0x40>   ; Should mispredict!
   106b4:  e2844001   add   r4, r4, #1
   106b8:  e51b3010   ldr   r3, [fp, #-16]
   106bc:  e2433001   sub   r3, r3, #1
   106c0:  e50b3010   str   r3, [fp, #-16]
   106c4:  e51b3010   ldr   r3, [fp, #-16]
   106c8:  e3530000   cmp   r3, #0
   106cc:  cafffff3   bgt   106a0 <test_loop+0x28>   ; Correctly predicted
   106d0:  e1a03004   mov   r3, r4
   106d4:  e1a00003   mov   r0, r3
   106d8:  e24bd008   sub   sp, fp, #8
   106dc:  e8bd0830   pop   {r4, r5, fp}
   106e0:  e12fff1e   bx    lr

Cortex-A72 tuning: Data access

In my discussion about instructions per cycle as a performance metric, I compared the textbook implementation of matrix multiplication against the loop nest interchange version. The textbook program ran slower (28.6 seconds) than the interchange version (19.6 seconds). The interchange program executes 2.053 instructions per cycle (IPC) while the textbook version has a less than stunning 0.909 IPC.

Let’s see why this is the case.

Like many other array-oriented scientific computations, matrix multiplication is memory bandwidth limited. Matrix multiplication has two incoming data streams — one stream from each of the two operand matrices. There is one outgoing data stream for the matrix product. Thanks to data dependency, the incoming streams are more important than the outgoing matrix product stream. Thus, anything that we can do to speed up the flow of the incoming data streams will improve program performance.

Matrix multiplication is one of the most studied examples due to its simplicity, wide applicability and familiar mathematics. So, nothing in this note should be much of a surprise! Let’s pretend, for a moment, that we don’t know the final outcome of our analysis.

I measured retired instructions, CPU cycles and level 1 data (L1D) cache and level 2 (L2) cache read events:

    Event                    Textbook        Interchange
    -----------------------  --------------  --------------
    Retired instructions     38,227,831,497  60,210,830,509
    CPU cycles               42,068,324,320  29,279,037,884
    Instructions per cycle            0.909           2.056
    L1 D-cache reads         15,070,922,957  19,094,920,483
    L1 D-cache misses         1,096,278,643       9,576,935
    L2 cache reads            1,896,007,792     264,923,412
    L2 cache read misses        124,888,097     125,524,763

There is one big take-away here. The textbook program misses in the data cache far more often than interchange. The textbook L1D cache miss ratio is 0.073 (7.3%) while the interchange cache miss ratio is 0.001 (0.1%). As a consequence, the textbook program reads the slower level 2 (L2) cache more often to find necessary data.

If you noticed slightly different counts for the same event, good eye! The counts are from different runs. It’s normal to have small variations from run to run due to measurement error, unintended interference from system interrupts, etc. Results are largely consistent across runs.

The behavioral differences come down to the memory access pattern in each program. In C language, two dimensional arrays are arranged in row-major order. The textbook program touches one operand matrix in row-major order and touches the other operand matrix in column-major order. The interchange program touches both operand arrays in row-major order. Thanks to row-major order’s sequential memory access, the interchange program finds its data in level 1 data (L1D) cache more often than the textbook implementation.
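For reference, here is a sketch of the two loop orders. The element type and the function names are my assumptions; the measured programs may differ in detail:

    /* Sketch of the two loop orders. Assumes float elements, a C99
     * compiler and a product matrix c that is already zeroed. */
    void multiply_textbook(int n, float a[n][n], float b[n][n], float c[n][n])
    {
        /* The inner loop walks b[k][j] down a column, striding a full
         * row (n * sizeof(float) bytes) per access. */
        for (int i = 0 ; i < n ; i++)
            for (int j = 0 ; j < n ; j++)
                for (int k = 0 ; k < n ; k++)
                    c[i][j] += a[i][k] * b[k][j] ;
    }

    void multiply_interchange(int n, float a[n][n], float b[n][n], float c[n][n])
    {
        /* The i-k-j order reads both operands row by row, so successive
         * accesses fall within the same cache line. */
        for (int i = 0 ; i < n ; i++)
            for (int k = 0 ; k < n ; k++)
                for (int j = 0 ; j < n ; j++)
                    c[i][j] += a[i][k] * b[k][j] ;
    }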

There is another micro-architecture aspect to this situation, too. Here are the performance event counts for translation look-aside buffer (TLB) behavior:

    Event                    Textbook        Interchange
    -----------------------  --------------  --------------
    Retired instructions     38,227,830,517  60,210,830,503
    L1 D-cache reads         15,070,845,178  19,094,937,273
    L1 DTLB miss              1,001,149,440          17,556
    L1 DTLB miss LD           1,000,143,621          10,854
    L1 DTLB miss ST               1,005,819           6,702

Due to the chosen matrix dimensions, the textbook program makes long strides through one of the operand matrices, again, due to the column-major order data access pattern. The stride is big enough to touch different memory pages, thereby causing level 1 data TLB (DTLB) misses. The textbook program has a 0.066 (6.6%) DTLB miss ratio. The miss ratio is near zero for the interchange version.
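To put a number on it: assuming 4-byte matrix elements and 4KB pages, a column walk through an n×n matrix strides 4n bytes per access, so every access touches a different page once n reaches 1024.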

I hope this discussion motivates the importance of cache- and TLB-friendly algorithms and code. Please see my earlier articles if you need to brush up on ARM Cortex-A72 micro-architecture and performance events.

Check out my Performance Events for Linux tutorial and learn to make your own Raspberry Pi 4 (Broadcom BCM2711) performance measurements.

Here is a list of the ARM Cortex-A72 performance events that are most useful for measuring memory access (load, store and fetch) behavior. Please see the ARM Cortex-A72 MPCore Processor Technical Reference Manual (TRM) for the complete list of performance events.

    Number  Mnemonic             Name
    ------  -------------------  --------------------------------------
    0x01    L1I_CACHE_REFILL     Level 1 instruction cache refill
    0x02    L1I_TLB_REFILL       Level 1 instruction TLB refill
    0x03    L1D_CACHE_REFILL     Level 1 data cache refill
    0x04    L1D_CACHE            Level 1 data cache access
    0x05    L1D_TLB_REFILL       Level 1 data TLB refill
    0x08    INST_RETIRED         Instruction architecturally executed
    0x11    CPU_CYCLES           Processor cycles
    0x13    MEM_ACCESS           Data memory access
    0x14    L1I_CACHE            Level 1 instruction cache access
    0x15    L1D_CACHE_WB         Level 1 data cache Write-Back
    0x16    L2D_CACHE            Level 2 data cache access
    0x17    L2D_CACHE_REFILL     Level 2 data cache refill
    0x18    L2D_CACHE_WB         Level 2 data cache Write-Back
    0x19    BUS_ACCESS           Bus access
    0x40    L1D_CACHE_LD         Level 1 data cache access - Read
    0x41    L1D_CACHE_ST         Level 1 data cache access - Write
    0x42    L1D_CACHE_REFILL_LD  L1D cache refill - Read
    0x43    L1D_CACHE_REFILL_ST  L1D cache refill - Write
    0x46    L1D_CACHE_WB_VICTIM  L1D cache Write-back - Victim
    0x47    L1D_CACHE_WB_CLEAN   L1D cache Write-back - Cleaning
    0x48    L1D_CACHE_INVAL      L1D cache invalidate
    0x4C    L1D_TLB_REFILL_LD    L1D TLB refill - Read
    0x4D    L1D_TLB_REFILL_ST    L1D TLB refill - Write
    0x50    L2D_CACHE_LD         Level 2 data cache access - Read
    0x51    L2D_CACHE_ST         Level 2 data cache access - Write
    0x52    L2D_CACHE_REFILL_LD  L2 data cache refill - Read
    0x53    L2D_CACHE_REFILL_ST  L2 data cache refill - Write
    0x56    L2D_CACHE_WB_VICTIM  L2 data cache Write-back - Victim
    0x57    L2D_CACHE_WB_CLEAN   L2 data cache Write-back - Cleaning
    0x58    L2D_CACHE_INVAL      L2 data cache invalidate
    0x66    MEM_ACCESS_LD        Data memory access - Read
    0x67    MEM_ACCESS_ST        Data memory access - Write
    0x68    UNALIGNED_LD_SPEC    Unaligned access - Read
    0x69    UNALIGNED_ST_SPEC    Unaligned access - Write
    0x6A    UNALIGNED_LDST_SPEC  Unaligned access
    0x70    LD_SPEC              Speculatively executed - Load
    0x71    ST_SPEC              Speculatively executed - Store
    0x72    LDST_SPEC            Speculatively executed - Load or store

Copyright © 2021 Paul J. Drongowski