Hash tables
P.J. Drongowski
8 December 2004
Keys and selection
* In search, a key is used to select a key-value pair in order to
retrieve the value associated with the key
* A key can be used as a kind of address
+ Memory addresses are unsigned binary integers that select a
particular memory cells to store and retrieve binary data
+ Similarly, an array index selects an element in an array
* The notion of key can be generalized to include symbolic names,
text, etc.
+ Some languages, like PERL, support selection of array elements
using strings/symbols
+ PostScript dictionaries use symbolic keys to select key-value pairs
* A hash table is one approach to the implementation of a dictionary
that supports use of symbolic keys
Hash table (overview)
* Conceptually, a hash table consists of an array of C buckets, where
C is the "capacity" of the array
* Each bucket can hold one or more key-value pairs (aka data items)
* If there is a one-to-one mapping from a key to an array bucket, a key
directly selects a key-value pair in a bucket in constant time O(1)
* If a matching key-value pair is not found, then a sentinel value
is returned
* Drawbacks
+ If the mapping is not one-to-one, then multiple keys select the same
bucket and the collisions must be resolved in some way
+ The number of buckets C is independent of the number of data items, N,
and many more buckets may be needed than the data items to be stored
+ The bucket array usually requires integer indices not symbolic keys;
symbolic keys must be mapped to integer indices
Hash table (operational)
* A hash table data structure consists of:
+ A bucket array
+ A "hash function"
* The hash function is a mapping from a symbolic key to bucket array index
* Operations
+ Insert(key, value)
- Apply the hash function to the key and store the key-value pair in
the selected bucket
+ Find(key)
- Apply the hash function to the key, find the key-value pair in the
selected bucket and return the associated value
+ Remove(key)
- Apply the hash function to the key and delete the key-value pair
from the selected bucket
* The operations resolve collisions when multiple keys map to the same bucket
+ The "load factor" of the hash table is the ratio N/C, where N is the
number of data items to store and C is the array capacity
+ Design goal is to achieve load factor of one or less
Hash function
* A hash function should be easy to compute and execute quickly
* A hash function should minimize collisions; Ideally, each key maps
to a unique bucket array index
* A hash function should uniformly distribute key-value pairs across
the bucket array
* A hash function often consists of two steps:
+ Computing an integer "hash code" ("hash value")
+ Mapping the hash code into the range of bucket array indices
("compression map")
Hash code
* Maps a symbolic key to an integer
* The integer does not need to be in the range of bucket indices -- that
will be handled by the compression map
* Same symbolic key value should always map to same bucket
* The hash code should avoid collisions as much as possible
* Approaches
+ Use the address of the key in memory
- Problem: Different key objects in memory may have same symbolic
value, but different addresses
+ Cast key to integer
- Works for char, short int, int, even float
- Problem: Cannot directly cast string to int
- Potential problem: Long int and double
* May be too long (greater than 32-bits)
* Key may be truncated causing more collisions due to foldback
* Could compute sum of high-order word and low-order word to retain
information potentially lost through truncation
+ Sum characters in symbolic key
- Ignores order of characters in symbolic key
- Problem: "stop" "tops" "pots" "spot" all map to same integer
+ Polynomial hash code
- If characters in symbolic key are x , ... x , evaluate:
0 k-1
k-1 k-2
x a + x a + ... + x a + x
0 1 k-2 k-1
- Use Horner's rule for evaluating polynomials:
x + a(x + a(x + ... + a(x + a(x + ax ))...))
k-1 k-2 k-3 2 1 0
- Polynomial attempts to take order of characters into account
- Creates "space" between indices depending upon character position
- Overflows due to computer word limitations may cause collisions
- Good values of constant a are 33, 37, 39, 41 for English text
- Polynomial evaluation is somewhat slow, so compute polynomial for
only the first few characters of a symbolic key
+ Cyclic shift hash code
- Replace the constant a with a cyclic shift of the partial sum
- Must carefully choose the amount to shift
- See example C++ code below
- For English text, shifts of 5 or 6 bits produce good results
Compression map
* Must map the hash code into the allowed range of values for bucket
array indices
* Division method
+ Formula:
| hash_code | mod K OR abs(hash_code) % K
+ Choose a prime number for K to reduce collisions by spreading out
the distribution of hashed keys
+ K is the same as the table capacity!
+ Radke suggests using primes of the form 4k+3 (e.g., 811)
* Multiply add and divide (MAD) method
+ Formula:
| a * hash_code + b | mod K
+ K is prime, and a and b are nonnegative integers chosen randomly such
that (a mod K) does not equal zero
+ Eliminate repeated patterns in the set of hash codes
Similarity to pseudo-random number generation
* Similarity to pseudo-random number generation is not accidental!
* Two majors goals when generating uniformly distributed random numbers are:
+ A uniform distribution of the random numbers over a range
+ Avoidance of repeated patterns
Collision handling
* Separate chaining
+ Each buckets points to a list (array, ...) to store all items that
map to the bucket
+ Even distribution of hashed pairs will keep the length of the list short
+ If load factor is small, average lookup time will approach O(1)
* Open addressing -- store items only in bucket array
+ Conserves space
+ Collision handling is more complicated
+ Methods for collision handling
- Linear probing
* All buckets must initially be marked as empty
* Insert: Try next bucket in array until an empty bucket or a
"deleted empty" bucket is found
* Find: Search successive buckets until key or empty bucket is found;
skip "deleted empty" buckets
* Remove: Delete item and mark bucket specially as "deleted empty"
* Tends to cluster pairs into "runs" increasing search time
- Quadratic probing
* Like linear probing
* Try to eliminate clustering / runs
* Add offsets from the sequence 1, 2, 4, 9, ..., k^2 instead of one
* Causes secondary clustering
* Behaves badly when C is not a prime number -- may not find an
empty bucket!
- Secondary hashing
* Add offsets which are computed using as a second hash function
* The second hash function tries to uniformly distribute the
offset values in order to eliminate clustering
Sources:
Data Structures and Algorithms in Java, Goodrich and Tamassia, 2001
Data Structures and Problem Solving Using Java, Weiss, 2002
Data Structures and other Objects Using C++, Main and Savitch, 2005
// *****************************************************************
// Cyclic shift hash function
// Uses cyclic shift to compute hash code
// Uses division method to perform compression map
// *****************************************************************
unsigned long int HashTable::hash_and_map(char *symbol)
{
char *s ;
// 32-bit platform dependency!
unsigned long int h ;
h = 0 ;
for (s = symbol ; *s != '\0' ; s++) {
h = (h << 5) | ((h >> 27) & 0x7FFFFFF) ;
h += (unsigned long int) *s ;
}
return( h % table_size ) ;
}