Searching
P.J. Drongowski
5 December 2004
Dictionary search
* Search is an important part of many computer applications
* Dictionaries are a particular kind of search problem
* A dictionary contain key-value pairs
* Dictionary operations
+ New key-value pairs can be added to the dictionary (insert)
+ Old key-value pairs can be deleted (remove)
+ Given a key, the dictionary will return the associated value (find
or lookup)
* Example: PostScript dictionary
+ /key value def Defines a key-value pair
+ dict key undef Removes a key-value pair from the dictionary
+ /key value store Stores a new value for the key
+ /key load Retrieves value and pushes it on operand stack
* A key may be any type -- integer or string being the most common datatypes
* We will use string-valued keys in the examples to come and we will not
explicitly store a value with a key -- the value just comes along for
the ride!
How can we implement a dictionary?
* So far, we have been studying individual data structures and algorithms
* How does a software engineer choose an algorithm for a particular
application?
* Engineers define design criteria and make choices based on the criteria
* Basic performance criteria
+ Speed
- How much time is available?
- Is there a response time deadline?
- How much execution time may the solution take?
+ Space
- How may key-value pairs must be stored?
- How big are the keys? The values? (sizeof)
- How much space is needed for infrastructure? For bookkeeping?
+ Power (important for portable/battery powered devices)
- How much heat will the solution generate?
- How much stored energy will the solution consume?
* Other criteria - the "-abilities"
+ Maintainability
- How easy is it to isolate and fix bugs?
+ Extensibility
- How easy is it to add new features to the system, especially ones
that were not easily anticipated?
+ Reliability and availability
- Is the system available 24x7?
- Often requires redundancy or extra space for error checking codes
+ Scalability
- Does the system scale with problem size? With the execution platform?
* Application-specific criteria
+ Relative frequency of insert, remove and find operations
+ Amount of preparatory work need (like sorting by key)
How can we quantify space and speed?
* Space
+ Usually easier to quantify than speed
+ Need to know number of key-value pairs to store
+ Need to know size of typical key-value pair
- Can be difficult in practice (string size variations)
- May need to characterize existing data using statistics
+ Size of a fixed-sized C/C++ data structure can be found using sizeof
* Speed
+ Build and measure prototype
- Variation occurs due to differences in compiler, processor design
and speed, etc.
- Can be time consuming to prototype and measure -- must run
many experiments
+ Static analysis (analysis of algorithms)
- Execution time depends upon the number of data items to be processed
- Use mathematics to formally describe the execution time of an algorithm
- Big-O notation expresses execution time as a function of the
dominant term
Simple examples of search
* Sequential search using an unsorted array of data items
+ Stores just data (low space overhead)
+ Linear lookup (search)
- Must search from beginning to end (or the found dataitem)
- Worst case is N where N is number of key-value pairs in array
* Unsuccessful search must test every data item in the array
* Successful search *may* require each data item to be tested
- Average case is (N / 2)
+ Insert
- Must find insertion point
- Must shuffle elements to create space
- Worst case - N operations
+ Remove
- Must find key-value pair to remove
- Must shuffle elements to reclaim space
- Worst case - N operations
* Binary search using a sorted array of data items
+ Lookup can be performed as binary search
- Unsuccessful search takes floor(log N)+1 iterations (worst case)
- Successful search takes floor(log N) iterations (worst case)
- Average case is only one iteration less than worst case
+ Insert
- Must maintain sorted order
- Requires shuffling - worst case N operations
+ Remove
- Must maintain sorted order
- Requires shuffling - worst case N operations
* Interpolation search using a sorted array of data items [Weiss]
+ Data must be sorted and must be uniformly distributed in the array
+ Time access a data item is much higher than computation time
+ Guess the approximate position of a data item in the array
x - a[low]
next = low + [ -------------- * (high - low - 1) ]
a[high]-a[low]
where x is the value to be found and a[] is the array of data items
+ Cost of computation could be quite high (floating point, division)
+ Average case is O(log log N) when items are uniformly distributed
+ Worst case is O(N) -- it may be necessary to examine all data items
* Linked list data structure / algorithm
+ Lookup must be performed as sequential search
- Unsuccessful search takes N iterations, O(N) (worst case)
- Successful search takes N iterations, O(N) (worst case)
- Average case is (N / 2)
+ Insert
- If insert is to front of list, insert takes constant time, O(1)
- If insert is to middle of list, must find insertion point, O(N)
- No shuffling required to insert an item
+ Remove
- Must find data item (key-value pair) to remove, O(N)
- Removal of the data item takes constant time, when item is known
- No shuffling required when an item is removed
What can we observe?
* Array implementation
+ Search on sorted keys is very fast
+ Preparatory work may be required to sort the data
+ Insert/remove is painful due to shuffling
+ A good solution when insert/remove are infrequent
* Linked list implementation
+ Can speed up insertion (no shuffling)
+ Sequential search is required
* Are there other approaches to searching? Sure!
+ Binary search trees (BST)
+ Hash tables
+ Balanced trees
+ ...
------------------------------------------
Big-O complexity of find operations
------------------------------------------
Sequential search (unsorted array)
Case #items Big-O
------------ ------ -------
Worst case N O(N)
Average case N / 2 O(N)
Binary search (sorted array)
Case #items Big-O
------------ -------------- --------
Worst case floor(log N)+1 O(log N)
Average case N / 2 O(N)
Linked list
Case #items Big-O
------------ ------ -------
Worst case N O(N)
Average case N / 2 O(N)
*******************************************************
Sequential search through unsorted array
*******************************************************
// Function: Linear search
// Parameters:
// words: List of words
// number_of_words: Number of words in the list
// target: Word to lookup in the list
// Returns:
// The index of the word in the array if found, else -1
int linear_search(char *words[], int number_of_words, char *target)
{
int i ;
int compare_result ;
for (i = 0 ; i < number_of_words ; i++) {
number_of_comparisons++ ;
if (strcmp(target, words[i]) == 0) {
return( i ) ;
}
}
return( -1 ) ;
}
*******************************************************
Binary search using sorted array (iterative loop)
*******************************************************
// Function: binary_search (iterative loop version)
// Parameters:
// words: List of words
// number_of_words: Number of words in the list
// target: Word to lookup in the list
// Returns:
// The index of the word in the array if found, else -1
int binary_search(char *words[], int number_of_words, char *target)
{
int low, high, middle ;
int compare_result ;
low = 0 ;
high = number_of_words - 1 ;
while( low <= high ) {
middle = (low + high) / 2 ;
compare_result = strcmp(target, words[middle]) ;
number_of_comparisons++ ;
if (compare_result == 0) {
return( middle ) ;
} else if (compare_result < 0) {
high = middle - 1 ;
} else {
low = middle + 1 ;
}
}
return( -1 ) ;
}
*******************************************************
Binary search using sorted array (recursive)
*******************************************************
// Function: binary_search (recursive version)
// Parameters:
// words: List of words
// number_of_words: Number of words in the list
// target: Word to lookup in the list
// low: Low index of range to search
// high: High index of range to search
// Returns:
// The index of the word in the array if found, else -1
int binary_search(char *words[], int number_of_words,
char *target, int low, int high)
{
int middle ;
int compare_result ;
if (low > high) return( -1 ) ;
middle = (low + high) / 2 ;
compare_result = strcmp(target, words[middle]) ;
number_of_comparisons++ ;
if (compare_result == 0) {
return( middle ) ;
} else if (compare_result < 0) {
return(
binary_search(words, number_of_words, target, low, middle - 1)
) ;
} else {
return(
binary_search(words, number_of_words, target, middle + 1, high)
) ;
}
}