RPi MIDI bridge

[Update: See Send MIDI from USB-B to 5-pin.]

Here’s a vexing problem that many electronic musicians face.

Let’s say that you own a lot of gear, some of which uses the old school 5-pin DIN MIDI interface. For example, there are a ton of classic (and not so classic) tone modules and keyboards that have 5-pin MIDI IN and MIDI OUT ports.

Then, you buy a new mobile MIDI controller which only does MIDI over USB through a USB B device port. The M-Audio Keystation Mini 32 is an example. This design covers the most common case — hooking the controller to a computer having a USB A host port — but you can’t connect the controller directly to the 5-pin MIDI IN port on one of your old tone modules or keyboards. USB ain’t RS-232 and class-compliant MIDI over USB has its own protocols, too. So, you can’t just whip up a simple cross-over cable or signal converter.

There are two commercial solutions to this problem: the Kenton USB MIDI host and the iConnectivity iConnectMIDI4+. Neither solution is cheap; each costs more than many MIDI controllers!

Some people on the web have suggested an Arduino-based solution. However, here’s an easy riddle. What super low cost single-board computer has two USB host ports? Answer: The Raspberry Pi Model B.

The RPi Model B seems like a natural for this problem. It’s inexpensive, it has the necessary ports, and there are plenty of rugged cases available. Musicians will want to use this solution at the gig, so a good case is essential. There are two issues. First, the RPi can source only a limited amount of power to a USB device. Some MIDI controllers may draw too much current. Second, musicians don’t like to haul extra gear to a gig, so they won’t want to take a display and keyboard to a gig just to boot the RPi and run the software needed to bridge the two USB A ports. The solution must be stand-alone, plug-and-play, and consist only of the RPi itself, a power source, and a few cables.

Here’s what I have in mind for the hardware. The MIDI controller is connected to the RPi using a standard USB A to USB B cable. The MIDI controller draws power from the RPi. Some MIDI controllers have a dedicated power supply jack and in that case, a separate power adapter for the MIDI controller seems prudent. The other USB host port on the RPi is connected to an inexpensive commercial USB to 5-pin MIDI interface — the kind used to connect 5-pin equipment to computers. The commercial interface should be MIDI class-compliant and should not require special drivers. The state of the world being what it is, you are unlikely to find proprietary Linux drivers for such an interface anyway. The commercial MIDI interface provides the connection to the 5-pin DIN MIDI ports on your old piece of gear.

Musicians usually have an old USB MIDI interface like the Edirol/Roland UM-2EX in the studio. These interfaces are readily available on the web for not much more dosh than a cable. This approach doesn’t require custom hardware or shields like an Arduino-based solution.

Here’s what I have in mind for the software. Folks already bridge PC MIDI ports using MIDI-OX. Linux has the ALSA MIDI software. The amidi -l command lists the raw hardware MIDI ports, aconnect -l lists the ALSA sequencer clients and their ports, and the aconnect command connects a sender port to a receiver port. The trick will be discovering and connecting the MIDI ports after boot without manual intervention, i.e., the RPi boots and builds the bridge without a keyboard, display, a log in, etc.
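Putting those pieces together, a boot-time configuration fragment along these lines might build the bridge with no keyboard or display attached. The ALSA client numbers below are hypothetical placeholders; a real script would discover them by parsing aconnect -l output.

```shell
#!/bin/sh
# Hypothetical boot-time bridge script, e.g. invoked from /etc/rc.local.
# Client numbers 20 and 24 are placeholders -- check "aconnect -l" on
# your own RPi; a robust script would parse that output instead.

sleep 10                                # wait for both USB MIDI devices to enumerate

aconnect -l > /var/log/midi-ports.log   # record the discovered ports for debugging

# Bridge the controller (client 20) to the 5-pin interface (client 24),
# and the reverse direction for MIDI arriving from the 5-pin side.
aconnect 20:0 24:0
aconnect 24:0 20:0
```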

So, there it is! My hardware lab is currently in disarray so I can’t easily do a proof of concept implementation. However, if you have the RPi and the pieces/parts, please give this a try.

And now for something completely different

Been a while since the last post, eh?

After the intensive push to publish courseware, I took a little “dreadlock holiday” and spent the last few months devoted to music. (Ahhh, the privileges of retirement!) In particular, I decided to deep dive into the Yamaha MOX workstation which is now my main gigging instrument. I learned to create songs using the rather wonderful library of musical phrases that are built into the MOX. I bought an iPad to use some of the many apps developed by Yamaha, including the Yamaha Mobile Music Sequencer.

As always, in the spirit of sharing what I have learned, I have published a page about getting started with the Yamaha MOX synthesizer. It describes my own journey and it is meant to complement the MOX owner’s manual. I hope that it helps you out.

Finally, I also learned a lot more about “arranger” keyboards. These keyboards ain’t your father’s Wurlitzer any more and surely have a place in professional studios as well as the home recreation room. I’ll have more to say about arranger keyboards in a future post.

On the nerd front, I took a side trip into the basics of quantum mechanics and quantum computing.

Needless to say, this all took a bit of time.

Topics in computer architecture

A new courseware page — topics in computer architecture — is now available. This page covers a wide mix of topics from RISC vs. CISC to non-von Neumann languages and architectures. You might just find an interesting lecture or two for a special topics seminar about computer architecture.

Way back when in the early eighties, I had a chance to work with Al Davis at the University of Utah. Al was a proponent of data-driven dataflow computers. His ideas showed me that there are alternatives to centralized synchronous control, sequential programming and the von Neumann memory bottleneck. The topics page has links to information about some of this interesting work.

Looking back, I’m struck at how mired we are in the von Neumann architecture! Sure, there are a few projects underway to investigate alternatives like quantum computing, functional programming languages, and so forth. However, the world remains dominated by the centralized, sequential von Neumann model and programming languages in which parallelism is glued on through a (class) library or two. General purpose GPU (GPGPU) computation is not really that radical. Yes, GPGPU computation is data parallel, but the underlying model is the sequential execution of SIMD instructions. Predicated execution is a bit of a kludge to allow conditional execution in highly sequential SIMD programs. Further, data transfer from memory to execution units remains a limiting factor in performance.

I think we need to introduce students to a wide range of computing models. Hopefully, some young, clever soul will find a way to pull us out of our rut!

VLSI systems course

The syllabus and notes for my VLSI systems course are now available in the courseware section of the site.

This is a “Mead and Conway” course on VLSI systems and CMOS circuit design. Mead and Conway led the world with their approach to VLSI design. Instead of focusing on device electronics and physics, this approach spans system-level design down to layout and fabrication. It is particularly well-suited for computer science majors who may not have much background in electronics.

I approach VLSI system design like computer design. Heck, a computer is just another digital system to be implemented in CMOS! Thus, students build and test a series of successively more detailed models for a VLSI (digital) system, eventually implementing a circuit layout which is suitable for fabrication. At the time, student designs were manufactured by the MOSIS fabrication service (which is still in business, by the way).

Both the computer design and VLSI systems courses relied quite heavily on simulation. System- and register transfer-level models were written in a stylized C/C++. Logic- and switching transistor-level simulation was handled by IRSIM. Electronic circuit-level simulation was performed through SPICE. It’s good to see that IRSIM is open source and still available. And, of course, SPICE is still an industry standard.

Most academic courses have switched to VHDL or Verilog for system- and register transfer-level modeling. Both languages and their associated simulators can be applied at the logic level, too. I would recommend either option today. Depending upon tool support, students may also be able to synthesize their designs in field programmable gate array (FPGA) technology. Logisim and SmartSim look like terrific alternatives, too. I saw a Logisim demo at SIGCSE last March and it’s a pretty spiffy tool. SmartSim runs on the Raspberry Pi — what could be better than that?

Unfortunately, VHDL and Verilog are not the most approachable, easily understood languages. I’m not the only person with this opinion. Please see “FPGA Programming for the Masses” in the April 2013 issue of CACM. I’ve seen advertising that calls Verilog “C-like.” I disagree.

My ulterior motive is to reach and teach digital systems at the high school level. VHDL and Verilog, unfortunately, are not as easily learned as C/C++ or Java. I’m hoping to gain more hands-on experience when I get rolling with Papilio — maybe find a way to bridge Java/C/C++ to VHDL and open the door to a broader community of students.

Wanna design a computer?

The next installment in the courseware section — computer design — is now available.

This course shows how to design a computer starting with an abstract specification of the instruction set architecture (ISA) and ending with a gate-level implementation. The course teaches a method which successively translates a higher-level representation for the machine into a lower-level representation. For example, the ISA is translated to a datapath consisting of large-grain building blocks and a control graph annotated with register transfer statements. Then, the datapath and control graph are mapped into gate-level building blocks and control store. Different datapath and control styles (clocking, pipelining and microprogramming) are discussed. Computer science students should be comfortable with the representations — no scary electronics.

I taught this course for several years at Case Western Reserve University. It’s an undergraduate level course that assumes a prerequisite course in basic logic design. So, if you already know about gates, flip-flops and simple sequential logic, you’re ready to dive right in! Course material includes draft chapters for a book on computer design and slides.

I hope that you will be able to use this course for background knowledge once I begin to experiment with and write up Papilio projects. I would probably base class projects on Xilinx FPGAs (Papilio) if I were teaching this course today. Papilio makes hardware design personal and affordable. I would love to see more computer science students take up hardware design, especially in high school. Perhaps this course will help you out.

Courseware is here

Finally, the courseware section of the site is open for business!

The first course is undergraduate data structures. I taught this course twice at Tufts University. Topics include the most important elements of the old ACM 2001 “CS2” curriculum along with a basic introduction to software engineering. The course is compatible with the “new” ACM 2013 curriculum which distributes computer science topics across one or more courses. Most of the hours are devoted to data structures with software engineering and algorithms/complexity playing secondary roles.

The main web page provides a rationale for the course design. It lays out the syllabus for a one semester course (15 to 16 weeks). The syllabus links to a full set of lecture notes. These notes should be a gold-mine for any instructor — high school or college — who needs to develop and deliver a course on data structures. I tried to meld and blend the best information I could find on each topic, usually from multiple textbooks. Of course, if you are a student, please dive in and browse, too!

The main course page also has links to the projects. I drew inspiration from my background as a system programmer, so the projects are not your usual “search tree this, search tree that.” The projects include lexical analysis, expression evaluation and discrete event simulation among other subjects. For the more advanced projects, I provided a source code framework/infrastructure to the students to help get them started. They were required to design to an interface specification and then integrate their implementation into the framework. This approach allowed my students to attack larger problems with more substance and purpose.

The course uses C++. Quite often, however, the code examples were translated from Java. So, it shouldn’t be too hard to translate back to Java. The examples do not use STL, Boost or templates that would hinder translation to Java or reuse.

Please take a look! For your convenience, here are links to a sampling of topics: stacks,
linked lists, trees, debugging, abstract data types (ADT) and testing.

I will be rolling out additional courses during the month of August. These courses will be hardware-oriented: computer design, VLSI systems and computer architecture. If you’ve ever prepped a course, you know how time consuming it can be simply to organize course material. Blog posts will be a little less frequent until I complete the courseware section.

Science!

And shout that word like Thomas Dolby!

I’m in the process of transitioning the site to its new domain name, sandsoftwaresound.net. So, here’s just a few philosophical thoughts in the meantime.

I’ve been reading “Weird Life” by David Toomey. His book is very well written — one of the best bits of science writing that I’ve read. He teaches nonfiction writing at UMass and deserves a tip o’ the hat for bringing this aspect of biology to a mass audience.

In one paragraph, he summarizes the essential elements of an experimental science. A science is a body of theory which is testable and verifiable through experimentation. Robust theory allows us to make predictions and to design new experiments. BTW, if the theory is valid, then experiments should be reproducible, too.

I like program profiling and performance analysis as a field because it is one of the few areas in computer science that exemplify experimental science. Performance events and counters give us the means to observe and measure the behavior of our programs in interesting ways.

I especially love it when I can predict an outcome such as the number of operations performed by a program. On the flip side, I especially hate it when I cannot make predictions. This is the main reason that I dislike performance events with ill-defined behavior or hardware that doesn’t provide a stable time reference. This situation is like giving an astronomer a rusty telescope with a crummy warped lens. How can an experimental scientist make observations and take measurements with a broken instrument?

These limitations aside, I would still encourage teachers and students to study performance measurement and analysis. These activities are a fun and different way to think about programs and the underlying computational engines. If enough of us band together, we may even be able to convince computer manufacturers of the need for well-defined, reliable performance monitoring and measurement!

If you’re looking for another good read, please consider Lance Fortnow’s “The Golden Ticket.” It’s an introduction to the P versus NP problem and is also geared toward a broad audience. If a high school student (or anyone else!) thinks that computer science is “just programming,” please point them toward this book.

rpistat: RPi event monitoring tool

[At the time of this post, Performance Events for Linux (PERF) supports only software events on Raspberry Pi. Rpistat is a tool that counts hardware performance events in the style of perf stat and is a poor man’s substitute.]

So far, my example programs for the Raspberry Pi (matrix multiplication, pointer chasing) are instrumented with explicit code to measure and display hardware performance events. The instrumentation code executes privileged ARM instructions to read and write the performance counters and control register. The aprofile kernel module must be loaded in order to enable user-space access to the ARM1176 performance monitoring unit (PMU). If aprofile is not loaded, the privileged instructions trap to the OS and the process terminates abnormally.

This approach, which I call self-measurement, requires modification to the application source code. While self-measurement collects performance event data about specific code regions in the application, source code changes are intrusive, inconvenient and involve more work.

Performance Events for Linux (PERF) is a profiling infrastructure that provides a suite of performance analysis tools on Linux. The perf stat tool launches a workload and counts (selected) hardware or software performance events caused by the workload. No changes to the workload are required at either the source or binary levels.

Rpistat is similar to perf stat. It launches a workload and counts nine of the most commonly used ARM1176 hardware performance events while the workload executes. Running rpistat is easy. Just enter “rpistat” followed by the command you would normally use to launch the workload from the shell.

rpistat naive

In the example above, the workload is the naive matrix multiplication program. Rpistat writes a file named rpistat.txt when the workload completes. The file contains a basic performance report. Here is example output from rpistat.

***************************************************************
rpistat: ./naive
Thu Jun 27 09:58:08 2013
***************************************************************

System information
  Number of processors:     1

Performance events
  [ ... ] = scaled event count
  PTI     = per thousand instructions
  Total periods:      169

  Cycles:             11,759,598,287
  Instructions:       315,810,640  [1,241,209,259]
  IBUF stall cycles:  65,981,902  [259,324,219]
  Instr periods:      43
  CPI:                9.474  
  IBUF stall percent: 2.205  %

  DC cached accesses: 4,558,795  [18,343,722]
  DC misses:          933,837  [3,757,582]
  DC periods:         42
  DC miss ratio:      20.484 %

  MicroTLB misses:    224,886  [904,898]
  Main TLB misses:    172,973  [696,010]
  TLB periods:        42
  Micro miss rate:    0.729   PTI
  Main miss rate:     0.561   PTI

  Branches:           33,438,664  [134,550,814]
  Mispredicted BR:    366,383  [1,474,255]
  BR periods:         42
  Branch rate:        108.403 PTI
  Mispredict ratio:   1.096  %

The report includes raw event counts, scaled event counts, rates and ratios. The raw event counts are the actual number of events counted by the hardware performance counters. A rate tells us how often a given event is occurring in terms of events per thousand instructions (PTI). A ratio tells us what portion of events have a certain property, such as the percentage of non-sequential data cache accesses that result in a miss.

Rpistat uses a time division multiplexing scheme to periodically switch the performance counters across the nine events of interest. Rpistat makes a switch every 100 milliseconds (0.1s). The ARM1176 dedicates one performance counter to processor cycles. Rpistat exploits this counter to the max and measures processor cycles all the time (a 100% duty-cycle for this event). Rpistat switches through the other eight events in pairs called event sets. Each event set is measured for about 25% of the overall time. Rpistat keeps track of the active time for each event and scales the raw event counts up to full duty-cycle estimates. These scaled event counts are displayed within square brackets [ ... ] in the output file.

When rpistat switches between event sets, it first accumulates the current counts into 64-bit unsigned integers called virtual counters. Rpistat avoids overflow problems because the measurement period (0.1s) is much shorter than the time needed to overflow the 32-bit hardware counters. The long virtual counter length (64 bits) postpones any real overflow for a very long time, effectively eliminating the practical possibility of an overflow.

If you would like to know more about rpistat’s implementation including important design concerns and limitations, please read about it here.

How does rpistat compare against self-measurement for accuracy? Here’s a quick comparison between the two approaches when they are applied to the naive (textbook) matrix multiplication program.

Metric          Self-measurement        Rpistat
--------------  ----------------  -------------
Elapsed time               16.0s          16.6s
Scaled cycles        181,984,943    184,288,556
Instructions       1,190,897,822  1,245,571,856
DC access             17,588,053     18,587,014
DC miss                2,982,314      3,735,024
MicroTLB miss            809,154        904,166
Main TLB miss            612,688        661,667
CPI                     9.78 CPI       9.47 CPI
DC access rate         14.77 PTI      14.92 PTI
DC miss rate            2.50 PTI       2.99 PTI
DC miss ratio              17.0%          20.1%
MicroTLB rate           0.68 PTI       0.73 PTI
Main TLB rate           0.51 PTI       0.53 PTI

Self-measurement vs. rpistat

The event counts, rates and ratios are within normal run-to-run variability in every case.

The matrix multiplication program has fairly consistent (uniform) dynamic behavior throughout its lifetime. Workloads with distinct processing phases may not fare as well. YMMV. Please see the rpistat page for more information and links to source (rpistat.c).

Memory access time

In my last post, I used a simple pointer chasing loop to characterize the different levels of the BCM2835 (Raspberry Pi) memory hierarchy. I ran the pointer chasing loop for a range of linked list sizes where the list size determines the storage level to be exercised. For example, a linked list occupying 16KB or less causes the chase loop to exercise the level 1 (L1) data cache. I measured hardware performance events for each test case and summarized the results in a table. The results clearly show three different timing tiers: L1 data cache, access to primary memory (no TLB miss) and access to primary memory with a Main TLB miss. A miss in the Main TLB is quite slow because the hardware page table walker must read page mapping information from primary memory in order to update the TLB and to complete the memory operation in flight.

The one characteristic missing from the analysis, however, is the estimated access time for each level. Sure, the execution times clearly break the test cases into tiers, but elapsed execution time does not really measure the access (latency) time itself.

In the next experiment, I estimate the access time. (See the source code in latency.c.) The approach is essentially unchanged from the first experiment: use pointer chasing on lists of different sizes to hit in particular levels of the memory hierarchy. Like the first experiment, the program (named latency) successively traverses the same linked list many times (default: 1000 traversals). Part way through each traversal, the loop chooses a chase operation and measures its execution time. The sampled execution time is saved temporarily in an array until all traversals are complete. Then, the program writes the execution times to a file named samples.dat. The samples are aggregated into a histogram which shows the distribution of the times.

The sampling technique lets each full traversal get underway before taking a measurement in order to “warm up” the pipeline, cache and TLBs. Taking a single sample per traversal avoids major cache and TLB pollution due to the bookkeeping needed to take and save samples. The measured execution times should faithfully represent access time (plus some measurement bias).

The program measures execution time in cycles. It configures and enables the ARM1176 Cycles Counter Register (CCR) as a free-running cycles counter. The program resets and enables the CCR before starting a traversal, thus avoiding a counter overflow. (A full traversal completes before the CCR overflows.) The pointer chasing loop reads the CCR before chasing the next pointer and reads the CCR after chasing the pointer. The difference between the after and before counts is the estimated execution time of the chase operation. The chase operation is implemented as a single ARM load instruction, so the instruction execution time is a biased estimate of the memory access time.

Here is the code for the function sample_cycles() and the pointer chasing loop within. The function takes two arguments: a pointer to the head of the linked list and an integer value that chooses the chase operation to be sampled.

uint32_t sample_cycles(CacheLine* linked_list, int n)
{
  register CacheLine* item ;
  register uint32_t before, after, cycles ;
  register int count = n ;

  cycles = 0 ;
  for(item = linked_list ; item != NULL ; ) {
    if (count-- == 0) {
      before = armv6_read_ccr() ;
      item = item->ptrCacheLine[0] ;
      after = armv6_read_ccr() ;
      cycles = after - before ;
    } else {
      item = item->ptrCacheLine[0] ;
    }
  }

  return( cycles ) ;
}

On each iteration, the loop body checks and decrements the sampling count maintained in the variable count. If the sampling count is not zero, then the loop quickly performs an ordinary chase operation. If the count is zero, then the code snaps the before and after CCR values by calling armv6_read_ccr(), an in-line function that performs an ARM MRC instruction to read CCR. The difference is computed and is saved in the variable cycles until it can be returned by the function after traversal. The chase loop exits when it reaches the NULL pointer at the end of the linked list.

Local variables are allocated to general registers in order to avoid extraneous cache and TLB references that would perturb measurements. I checked the compiled machine code to make sure that the compiler accepted the register hints and actually allocated the local variables to registers. I also made sure that the function armv6_read_ccr() was expanded in-line.

The latency program needs to have user-space access to the ARM1176 performance counters. You must load the aprofile kernel module before running latency. Otherwise, you will get an “Illegal instruction” exception when the program attempts to configure the performance counters.

Here is a histogram showing the distribution of execution time samples when the linked list is 8KB in size. The 8KB test case consistently hits in the L1 data cache.

9  1000 ################################################

The first column is the execution time in cycles. The second column is the number of samples with the value shown in the first column. All 1,000 samples are 9 cycles.

Chapter 16 of the ARM1176JZF-S Technical Reference Manual (TRM) describes instruction cycle timings. A load from L1 data cache has a 3 cycle load-to-use latency (access) time. Taking 3 cycles as the actual access time, I estimate a measurement bias of (9 cycles – 3 cycles) = 6 cycles due to the pipelined execution of the ARM MRC instructions that read the CCR and the load instruction that chases the linked list pointer.

The 32KB and 64KB test cases form the middle timing tier. Both test cases are within the coverage of the Main TLB (288KB) and memory accesses hit in either the MicroTLB or Main TLB. The distributions are similar. Here is the distribution for the 64KB case.

 9   49 ##### 
17    1 # 
61  320 ##########################
62  571 ################################################
63   11 # 
64    4 # 
65    1 # 
69    1 # 
70   27 ### 

There is a strong peak in the narrow range of [61:62] cycles. The nonbiased estimate for access time to primary memory with a TLB hit is (62 cycles – 6 cycles bias) = 56 cycles.

The test cases at 256KB and larger form the third timing tier. These cases exceed the Main TLB coverage. (The 256KB case almost occupies the entire Main TLB and the performance event data place this case in the third and longest running tier.) Load operations miss in the Main TLB forcing a hardware page table walk. The table walker performs at least one additional primary memory read operation in order to obtain page mapping information for the TLB. Here is the distribution for the 1MB case.

113    4 # 
116  224 ##############################################
117   23 ###### 
118   59 ############## 
119  129 ###########################
120    5 ## 
121  224 ##############################################
122  185 ######################################
125    4 # 
130    9 ### 
132    7 ## 
133   18 ##### 
135    5 ## 

The samples are clustered in the range of [116:122] cycles. The nonbiased estimate for access time to primary memory with a Main TLB miss is (122 cycles – 6 cycles bias) = 116 cycles. This is approximately twice the access time of a load that hits in either the data MicroTLB or the Main TLB.

The following table summarizes the access time for each level of the memory hierarchy.

Level           Access time  Condition
--------------  -----------  --------------
CPU register        1 cycle
L1 data cache      3 cycles  Data cache hit
Primary memory    56 cycles  No TLB miss
Primary memory   116 cycles  Main TLB miss

Memory hierarchy access times (latencies)

Where is the L2 cache? The BCM2835 dedicates the L2 cache to the Broadcom VideoCore GPU. Memory operations from the CPU are routed around the L2 cache. The L2 cache is not a factor here and is not part of the analysis.

From these estimates, we can see why the textbook (naive) matrix multiplication program is so slow. The textbook algorithm does not walk sequentially through one of its operand arrays. The long memory stride causes a large number of Main TLB misses. Memory access is twice as slow when an access misses in the Main TLB, thereby slowing down program execution and increasing elapsed time.

Memory hierarchy and performance events

My latest set of experiments with Raspberry Pi use the BCM2835 performance counters to exercise and observe the behavior of the RPi memory system. The experiments use a simple microbenchmark that traverses a linked list. This technique is often called pointer chasing because list traversal chases the pointers through the linked list from beginning to end. A pointer chasing loop is simple, is easy to instrument, and has consistently repeatable behavior.

Here’s the approach. We exercise a single specific level in the memory hierarchy (such as the level 1 data cache) simply by adjusting the size of the linked list. We just need to choose a size such that the physical size of the list is less than or equal to the capacity of a level — 16KB in the case of the level 1 (L1) data cache. Storage for the linked list is allocated as a contiguous block of memory (an array of bytes), so it’s easier to think of the linked list size in terms of its contiguous footprint in memory, the array size. Each linked list item is the same size as a cache line: 32 bytes.

The first experiment measures seven kinds of performance events: cycles, executed instructions, L1 data cache accesses, L1 data cache misses, data MicroTLB misses, Main TLB misses and explicit external data accesses.

Before we take a look at the results of the first experiment, here’s a few things to keep in mind.

  • The Broadcom BCM2835 128KB L2 cache is dedicated to the VideoCore GPU. Memory requests from the CPU are routed around the L2 and go directly to primary memory. L2 cache is not a factor in our analysis.
  • Coverage is the amount of primary memory accessible through a translation lookaside buffer (TLB) without incurring a TLB miss.
  • The MicroTLB has ten fully associative (page) entries. Its coverage is (10 entries * 4,096 bytes) = 40KB.
  • The Main TLB handles MicroTLB misses and has 8 fully associative entries plus 64 2-way associative entries. Its coverage is (72 entries * 4,096 bytes) = 288KB.
  • A hardware page table walker handles Main TLB misses. A page table walk requires at least one additional read in primary memory in order to look up page mapping information.

These memory system characteristics affect the results and should be directly observable.

The following table summarizes basic measurements for ten test runs. The first column is the size of the dynamically allocated block of memory that holds the linked list elements. The second column is the size of the block expressed as 32-byte cache lines. There is one list item per cache line, so this is also the number of linked list items. The third column is the number of list traversals (one traversal per iteration). I adjusted the number of iterations for each run in order to perform exactly the same number of pointer chasing operations per test case. This allows us to make meaningful run-to-run comparisons.

 Size  Cache lines  Iterations   Time    CPI
-----  -----------  ----------  -----  -----
  4MB         128K        1024  25.76  20.51
  2MB          64K        2048  25.74  20.52
  1MB          32K        4096  25.98  20.70
512KB          16K        8192  25.56  20.41
256KB           8K       16384  24.62  19.71
128KB           4K       32768  19.99  16.24
 64KB           2K       65536  11.75   9.61
 32KB           1K      131072  10.49   8.77
 16KB          512      262144   2.66   2.30
  8KB          256      524288   2.39   2.04

The fourth column is the total elapsed time per run. There are roughly three distinct timing tiers:

  • Tier A: 4MB, 2MB, 1MB, 512KB, 256KB, 128KB
  • Tier B: 32KB, 64KB
  • Tier C: 8KB, 16KB

The 128KB case is somewhat “transitional,” but I placed it in Tier A. The cycles per instruction (CPI column) ratios are consistent with respect to the elapsed times.

The following table summarizes the performance event counts for the same test cases (block/array size and iterations).

                           MicroTLB  Main TLB  External
 Size DC access  DC miss      miss      miss    access
----- ---------  -------  --------  --------  --------
  4MB  24696257  5107400   1049399    258783   7433309 A
  2MB  24927252  5119529   1051719    259531   7509503 A
  1MB  24878797  5145341   1065404    273960   7521771 A
512KB  24668739  5086410   1050024    259104   7461179 A
256KB  22999900  4803510   1057814    308477   7263176 A
128KB  20051408  4258682    965191    181398   6136462 A
 64KB  11972871  2683992    620354     91207   3785568 B
 32KB  10096551  2288319    548311     71756   3231861 B
 16KB   2621836   594034    136900      8265    804582 C
  8KB   3480732   446481     72712      2757    480707 C

Keeping the basic measurements in mind, we can see that the performance event counts for Tier C are consistent with L1 data cache access. We are measuring on a single core computer, so the data cache and TLB misses in Tier C are mostly due to other system activity (like OS timer interrupts). Tier B cases hit in either the MicroTLB or Main TLB; these are mainly references to primary memory without any TLB miss, since the Tier B array sizes are within the MicroTLB and/or Main TLB coverage. Tier A cases miss in the TLBs, and each Tier A reference requires an additional primary memory read by the page walker. Tier A has the longest elapsed execution time.

Next up, we’ll measure the memory access (latency) time to primary memory.