A sweet one-liner for histograms

Here is a shell script one-liner that is just too sweet to go unrecognized.

I’ve written a program (latency.c) to measure the execution time of individual chase operations in a linked list pointer chasing loop. The program writes the execution times into a file named samples.dat. The distribution of the execution times should show the access time to different levels of the ARM1176 memory hierarchy.

I still needed a way to view the distribution in a histogram. So, I wrote a short C program (histogram.c) to postprocess the execution times. The program produces a histogram-like table — not a chart. Bummer.

A quick search of the Web brought up some rather sophisticated data visualization tools. I experimented with a few of them, but still couldn’t get want I wanted.

Then, this little gem came up from Small Labs, Inc. (Search on “command line histogram.)

history | awk '{h[$2]++}END{for(i in h){print h[i],i|"sort -rn|head -20"}}' |awk '!max{max=$1;}{r="";i=s=60*$1/max;while(i-->0)r=r"#";printf "%15s %5d %s %s",$2,$1,r,"\n";}'

This one-liner displays a histogram of recent command lines (history) from the most to least frequently used. It displays only the 20 most frequent commands (head -20).

This one-liner looks quite promising, but it needs a few changes. First, it needs to read data from the file samples.dat. It needs to read only one item from each line of the file; history produces two items per line. Finally, we can discard some of the white space in the output and make the lines in the chart narrower.

Here is the revised one-line written in the form of a bash shell script. We’re going to analyze several files and it’s appropriate to put the one-liner into a shell command file.

#!/bin/bash

cat samples.dat | awk '{h[$1]++}END{for(i in h){print h[i],i|"sort -rn|head -20"}}' |awk '!max{max=$1;}{r="";i=s=60*$1/max;while(i-->0)r=r"#";printf "%6s %5d %s %s",$2,$1,r,"\n";}'

Here is some sample output.

    62  517 ###########################################################
    61  301 ################################### 
     9  119 ############## 
    70   12 ## 
    63   11 ## 
    65    9 ## 
    64    4 # 
    17    4 # 
   336    2 # 
   168    2 # 
   154    2 # 
    71    1 # 
    67    1 # 
    60    1 # 
   525    1 # 
   521    1 # 
   390    1 # 
   389    1 # 
   386    1 # 
   378    1 # 

We can easily see the most frequent values, but this doesn’t really show the distribution in a useful way. So, let’s just sort the output on the leading numeric values (sort -n). The numbers in the first column are access/latency times in cycles, by the way.

     9  119 ##############
    17    4 # 
    60    1 # 
    61  301 ###################################
    62  517 ###########################################################
    63   11 ## 
    64    4 # 
    65    9 ## 
    67    1 # 
    70   12 ## 
    71    1 # 
   154    2 # 
   168    2 # 
   336    2 # 
   378    1 # 
   386    1 # 
   389    1 # 
   390    1 # 
   521    1 # 
   525    1 # 

Ah, now we can see the level 1 data cache hits (9 cycles including measurement bias) and reads to primary memory (61 and 62 cycles). The bi-modal nature of the distribution is revealed.

Before leaving this topic, I’d like to give a shout-out to Dave Christie at AMD. Dave always gave me a good-natured kidding about these old school histograms. Dave, BTW, is one of the true unsung technical heroes at AMD. All the best!