Hash tables
P.J. Drongowski
8 December 2004

Keys and selection

  * In search, a key is used to select a key-value pair in order to
    retrieve the value associated with the key
  * A key can be used as a kind of address
    + Memory addresses are unsigned binary integers that select a
      particular memory cells to store and retrieve binary data
    + Similarly, an array index selects an element in an array
  * The notion of key can be generalized to include symbolic names,
    text, etc.
    + Some languages, like PERL, support selection of array elements
      using strings/symbols
    + PostScript dictionaries use symbolic keys to select key-value pairs
  * A hash table is one approach to the implementation of a dictionary
    that supports use of symbolic keys

Hash table (overview)

  * Conceptually, a hash table consists of an array of C buckets, where
    C is the "capacity" of the array
  * Each bucket can hold one or more key-value pairs (aka data items)
  * If there is a one-to-one mapping from a key to an array bucket, a key
    directly selects a key-value pair in a bucket in constant time O(1)
  * If a matching key-value pair is not found, then a sentinel value
    is returned
  * Drawbacks
    + If the mapping is not one-to-one, then multiple keys select the same
      bucket and the collisions must be resolved in some way
    + The number of buckets C is independent of the number of data items, N,
      and many more buckets may be needed than the data items to be stored
    + The bucket array usually requires integer indices not symbolic keys;
      symbolic keys must be mapped to integer indices

Hash table (operational)

  * A hash table data structure consists of:
    + A bucket array
    + A "hash function"
  * The hash function is a mapping from a symbolic key to bucket array index
  * Operations
    + Insert(key, value)
      - Apply the hash function to the key and store the key-value pair in
        the selected bucket
    + Find(key)
      - Apply the hash function to the key, find the key-value pair in the
        selected bucket and return the associated value
    + Remove(key)
      - Apply the hash function to the key and delete the key-value pair
        from the selected bucket
  * The operations resolve collisions when multiple keys map to the same bucket
    + The "load factor" of the hash table is the ratio N/C, where N is the
      number of data items to store and C is the array capacity
    + Design goal is to achieve load factor of one or less

Hash function

  * A hash function should be easy to compute and execute quickly
  * A hash function should minimize collisions; Ideally, each key maps
    to a unique bucket array index
  * A hash function should uniformly distribute key-value pairs across
    the bucket array
  * A hash function often consists of two steps:
    + Computing an integer "hash code" ("hash value")
    + Mapping the hash code into the range of bucket array indices
      ("compression map")

Hash code

  * Maps a symbolic key to an integer
  * The integer does not need to be in the range of bucket indices -- that
    will be handled by the compression map
  * Same symbolic key value should always map to same bucket
  * The hash code should avoid collisions as much as possible
  * Approaches
    + Use the address of the key in memory
      - Problem: Different key objects in memory may have same symbolic
        value, but different addresses
    + Cast key to integer
      - Works for char, short int, int, even float
      - Problem: Cannot directly cast string to int
      - Potential problem: Long int and double
        * May be too long (greater than 32-bits)
        * Key may be truncated causing more collisions due to foldback
        * Could compute sum of high-order word and low-order word to retain
          information potentially lost through truncation
    + Sum characters in symbolic key
      - Ignores order of characters in symbolic key
      - Problem: "stop" "tops" "pots" "spot" all map to same integer
    + Polynomial hash code
      - If characters in symbolic key are x , ... x   , evaluate:
                                           0       k-1
              k-1      k-2
           x a    + x a    + ... + x   a + x
            0        1              k-2     k-1

      - Use Horner's rule for evaluating polynomials:

           x    + a(x    + a(x    + ... + a(x  + a(x  + ax ))...))
            k-1      k-2      k-3            2      1     0

      - Polynomial attempts to take order of characters into account
      - Creates "space" between indices depending upon character position
      - Overflows due to computer word limitations may cause collisions
      - Good values of constant a are 33, 37, 39, 41 for English text
      - Polynomial evaluation is somewhat slow, so compute polynomial for
        only the first few characters of a symbolic key
    + Cyclic shift hash code
      - Replace the constant a with a cyclic shift of the partial sum
      - Must carefully choose the amount to shift
      - See example C++ code below
      - For English text, shifts of 5 or 6 bits produce good results

Compression map

  * Must map the hash code into the allowed range of values for bucket
    array indices
  * Division method
    + Formula:
         | hash_code | mod K    OR    abs(hash_code) % K
    + Choose a prime number for K to reduce collisions by spreading out
      the distribution of hashed keys
    + K is the same as the table capacity!
    + Radke suggests using primes of the form 4k+3 (e.g., 811)
  * Multiply add and divide (MAD) method
    + Formula:
         | a * hash_code + b | mod K
    + K is prime, and a and b are nonnegative integers chosen randomly such
      that (a mod K) does not equal zero
    + Eliminate repeated patterns in the set of hash codes

Similarity to pseudo-random number generation

  * Similarity to pseudo-random number generation is not accidental!
  * Two majors goals when generating uniformly distributed random numbers are:
    + A uniform distribution of the random numbers over a range
    + Avoidance of repeated patterns

Collision handling

  * Separate chaining
    + Each buckets points to a list (array, ...) to store all items that
      map to the bucket
    + Even distribution of hashed pairs will keep the length of the list short
    + If load factor is small, average lookup time will approach O(1)
  * Open addressing -- store items only in bucket array
    + Conserves space
    + Collision handling is more complicated
    + Methods for collision handling
      - Linear probing
        * All buckets must initially be marked as empty
        * Insert: Try next bucket in array until an empty bucket or a
          "deleted empty" bucket is found
        * Find: Search successive buckets until key or empty bucket is found;
          skip "deleted empty" buckets
        * Remove: Delete item and mark bucket specially as "deleted empty"
        * Tends to cluster pairs into "runs" increasing search time
      - Quadratic probing
        * Like linear probing
        * Try to eliminate clustering / runs
        * Add offsets from the sequence 1, 2, 4, 9, ..., k^2 instead of one
        * Causes secondary clustering
        * Behaves badly when C is not a prime number -- may not find an
          empty bucket!
      - Secondary hashing
        * Add offsets which are computed using as a second hash function
        * The second hash function tries to uniformly distribute the
          offset values in order to eliminate clustering
        


Sources:
  Data Structures and Algorithms in Java, Goodrich and Tamassia, 2001
  Data Structures and Problem Solving Using Java, Weiss, 2002
  Data Structures and other Objects Using C++, Main and Savitch, 2005


// *****************************************************************
// Cyclic shift hash function
//   Uses cyclic shift to compute hash code
//   Uses division method to perform compression map
// *****************************************************************

unsigned long int HashTable::hash_and_map(char *symbol)
{
  char *s ;
  // 32-bit platform dependency!
  unsigned long int h ;

  h = 0 ;
  for (s = symbol ; *s != '\0' ; s++) {
    h = (h << 5) | ((h >> 27) & 0x7FFFFFF) ;
    h += (unsigned long int) *s ;
  }

  return( h % table_size ) ;
}