Hash tables P.J. Drongowski 8 December 2004 Keys and selection * In search, a key is used to select a key-value pair in order to retrieve the value associated with the key * A key can be used as a kind of address + Memory addresses are unsigned binary integers that select a particular memory cells to store and retrieve binary data + Similarly, an array index selects an element in an array * The notion of key can be generalized to include symbolic names, text, etc. + Some languages, like PERL, support selection of array elements using strings/symbols + PostScript dictionaries use symbolic keys to select key-value pairs * A hash table is one approach to the implementation of a dictionary that supports use of symbolic keys Hash table (overview) * Conceptually, a hash table consists of an array of C buckets, where C is the "capacity" of the array * Each bucket can hold one or more key-value pairs (aka data items) * If there is a one-to-one mapping from a key to an array bucket, a key directly selects a key-value pair in a bucket in constant time O(1) * If a matching key-value pair is not found, then a sentinel value is returned * Drawbacks + If the mapping is not one-to-one, then multiple keys select the same bucket and the collisions must be resolved in some way + The number of buckets C is independent of the number of data items, N, and many more buckets may be needed than the data items to be stored + The bucket array usually requires integer indices not symbolic keys; symbolic keys must be mapped to integer indices Hash table (operational) * A hash table data structure consists of: + A bucket array + A "hash function" * The hash function is a mapping from a symbolic key to bucket array index * Operations + Insert(key, value) - Apply the hash function to the key and store the key-value pair in the selected bucket + Find(key) - Apply the hash function to the key, find the key-value pair in the selected bucket and return the associated value + Remove(key) - Apply the hash function to the key and delete the key-value pair from the selected bucket * The operations resolve collisions when multiple keys map to the same bucket + The "load factor" of the hash table is the ratio N/C, where N is the number of data items to store and C is the array capacity + Design goal is to achieve load factor of one or less Hash function * A hash function should be easy to compute and execute quickly * A hash function should minimize collisions; Ideally, each key maps to a unique bucket array index * A hash function should uniformly distribute key-value pairs across the bucket array * A hash function often consists of two steps: + Computing an integer "hash code" ("hash value") + Mapping the hash code into the range of bucket array indices ("compression map") Hash code * Maps a symbolic key to an integer * The integer does not need to be in the range of bucket indices -- that will be handled by the compression map * Same symbolic key value should always map to same bucket * The hash code should avoid collisions as much as possible * Approaches + Use the address of the key in memory - Problem: Different key objects in memory may have same symbolic value, but different addresses + Cast key to integer - Works for char, short int, int, even float - Problem: Cannot directly cast string to int - Potential problem: Long int and double * May be too long (greater than 32-bits) * Key may be truncated causing more collisions due to foldback * Could compute sum of high-order word and low-order word to retain information potentially lost through truncation + Sum characters in symbolic key - Ignores order of characters in symbolic key - Problem: "stop" "tops" "pots" "spot" all map to same integer + Polynomial hash code - If characters in symbolic key are x , ... x , evaluate: 0 k-1 k-1 k-2 x a + x a + ... + x a + x 0 1 k-2 k-1 - Use Horner's rule for evaluating polynomials: x + a(x + a(x + ... + a(x + a(x + ax ))...)) k-1 k-2 k-3 2 1 0 - Polynomial attempts to take order of characters into account - Creates "space" between indices depending upon character position - Overflows due to computer word limitations may cause collisions - Good values of constant a are 33, 37, 39, 41 for English text - Polynomial evaluation is somewhat slow, so compute polynomial for only the first few characters of a symbolic key + Cyclic shift hash code - Replace the constant a with a cyclic shift of the partial sum - Must carefully choose the amount to shift - See example C++ code below - For English text, shifts of 5 or 6 bits produce good results Compression map * Must map the hash code into the allowed range of values for bucket array indices * Division method + Formula: | hash_code | mod K OR abs(hash_code) % K + Choose a prime number for K to reduce collisions by spreading out the distribution of hashed keys + K is the same as the table capacity! + Radke suggests using primes of the form 4k+3 (e.g., 811) * Multiply add and divide (MAD) method + Formula: | a * hash_code + b | mod K + K is prime, and a and b are nonnegative integers chosen randomly such that (a mod K) does not equal zero + Eliminate repeated patterns in the set of hash codes Similarity to pseudo-random number generation * Similarity to pseudo-random number generation is not accidental! * Two majors goals when generating uniformly distributed random numbers are: + A uniform distribution of the random numbers over a range + Avoidance of repeated patterns Collision handling * Separate chaining + Each buckets points to a list (array, ...) to store all items that map to the bucket + Even distribution of hashed pairs will keep the length of the list short + If load factor is small, average lookup time will approach O(1) * Open addressing -- store items only in bucket array + Conserves space + Collision handling is more complicated + Methods for collision handling - Linear probing * All buckets must initially be marked as empty * Insert: Try next bucket in array until an empty bucket or a "deleted empty" bucket is found * Find: Search successive buckets until key or empty bucket is found; skip "deleted empty" buckets * Remove: Delete item and mark bucket specially as "deleted empty" * Tends to cluster pairs into "runs" increasing search time - Quadratic probing * Like linear probing * Try to eliminate clustering / runs * Add offsets from the sequence 1, 2, 4, 9, ..., k^2 instead of one * Causes secondary clustering * Behaves badly when C is not a prime number -- may not find an empty bucket! - Secondary hashing * Add offsets which are computed using as a second hash function * The second hash function tries to uniformly distribute the offset values in order to eliminate clustering Sources: Data Structures and Algorithms in Java, Goodrich and Tamassia, 2001 Data Structures and Problem Solving Using Java, Weiss, 2002 Data Structures and other Objects Using C++, Main and Savitch, 2005 // ***************************************************************** // Cyclic shift hash function // Uses cyclic shift to compute hash code // Uses division method to perform compression map // ***************************************************************** unsigned long int HashTable::hash_and_map(char *symbol) { char *s ; // 32-bit platform dependency! unsigned long int h ; h = 0 ; for (s = symbol ; *s != '\0' ; s++) { h = (h << 5) | ((h >> 27) & 0x7FFFFFF) ; h += (unsigned long int) *s ; } return( h % table_size ) ; }