Searching P.J. Drongowski 5 December 2004 Dictionary search * Search is an important part of many computer applications * Dictionaries are a particular kind of search problem * A dictionary contain key-value pairs * Dictionary operations + New key-value pairs can be added to the dictionary (insert) + Old key-value pairs can be deleted (remove) + Given a key, the dictionary will return the associated value (find or lookup) * Example: PostScript dictionary + /key value def Defines a key-value pair + dict key undef Removes a key-value pair from the dictionary + /key value store Stores a new value for the key + /key load Retrieves value and pushes it on operand stack * A key may be any type -- integer or string being the most common datatypes * We will use string-valued keys in the examples to come and we will not explicitly store a value with a key -- the value just comes along for the ride! How can we implement a dictionary? * So far, we have been studying individual data structures and algorithms * How does a software engineer choose an algorithm for a particular application? * Engineers define design criteria and make choices based on the criteria * Basic performance criteria + Speed - How much time is available? - Is there a response time deadline? - How much execution time may the solution take? + Space - How may key-value pairs must be stored? - How big are the keys? The values? (sizeof) - How much space is needed for infrastructure? For bookkeeping? + Power (important for portable/battery powered devices) - How much heat will the solution generate? - How much stored energy will the solution consume? * Other criteria - the "-abilities" + Maintainability - How easy is it to isolate and fix bugs? + Extensibility - How easy is it to add new features to the system, especially ones that were not easily anticipated? + Reliability and availability - Is the system available 24x7? - Often requires redundancy or extra space for error checking codes + Scalability - Does the system scale with problem size? With the execution platform? * Application-specific criteria + Relative frequency of insert, remove and find operations + Amount of preparatory work need (like sorting by key) How can we quantify space and speed? * Space + Usually easier to quantify than speed + Need to know number of key-value pairs to store + Need to know size of typical key-value pair - Can be difficult in practice (string size variations) - May need to characterize existing data using statistics + Size of a fixed-sized C/C++ data structure can be found using sizeof * Speed + Build and measure prototype - Variation occurs due to differences in compiler, processor design and speed, etc. - Can be time consuming to prototype and measure -- must run many experiments + Static analysis (analysis of algorithms) - Execution time depends upon the number of data items to be processed - Use mathematics to formally describe the execution time of an algorithm - Big-O notation expresses execution time as a function of the dominant term Simple examples of search * Sequential search using an unsorted array of data items + Stores just data (low space overhead) + Linear lookup (search) - Must search from beginning to end (or the found dataitem) - Worst case is N where N is number of key-value pairs in array * Unsuccessful search must test every data item in the array * Successful search *may* require each data item to be tested - Average case is (N / 2) + Insert - Must find insertion point - Must shuffle elements to create space - Worst case - N operations + Remove - Must find key-value pair to remove - Must shuffle elements to reclaim space - Worst case - N operations * Binary search using a sorted array of data items + Lookup can be performed as binary search - Unsuccessful search takes floor(log N)+1 iterations (worst case) - Successful search takes floor(log N) iterations (worst case) - Average case is only one iteration less than worst case + Insert - Must maintain sorted order - Requires shuffling - worst case N operations + Remove - Must maintain sorted order - Requires shuffling - worst case N operations * Interpolation search using a sorted array of data items [Weiss] + Data must be sorted and must be uniformly distributed in the array + Time access a data item is much higher than computation time + Guess the approximate position of a data item in the array x - a[low] next = low + [ -------------- * (high - low - 1) ] a[high]-a[low] where x is the value to be found and a[] is the array of data items + Cost of computation could be quite high (floating point, division) + Average case is O(log log N) when items are uniformly distributed + Worst case is O(N) -- it may be necessary to examine all data items * Linked list data structure / algorithm + Lookup must be performed as sequential search - Unsuccessful search takes N iterations, O(N) (worst case) - Successful search takes N iterations, O(N) (worst case) - Average case is (N / 2) + Insert - If insert is to front of list, insert takes constant time, O(1) - If insert is to middle of list, must find insertion point, O(N) - No shuffling required to insert an item + Remove - Must find data item (key-value pair) to remove, O(N) - Removal of the data item takes constant time, when item is known - No shuffling required when an item is removed What can we observe? * Array implementation + Search on sorted keys is very fast + Preparatory work may be required to sort the data + Insert/remove is painful due to shuffling + A good solution when insert/remove are infrequent * Linked list implementation + Can speed up insertion (no shuffling) + Sequential search is required * Are there other approaches to searching? Sure! + Binary search trees (BST) + Hash tables + Balanced trees + ... ------------------------------------------ Big-O complexity of find operations ------------------------------------------ Sequential search (unsorted array) Case #items Big-O ------------ ------ ------- Worst case N O(N) Average case N / 2 O(N) Binary search (sorted array) Case #items Big-O ------------ -------------- -------- Worst case floor(log N)+1 O(log N) Average case N / 2 O(N) Linked list Case #items Big-O ------------ ------ ------- Worst case N O(N) Average case N / 2 O(N) ******************************************************* Sequential search through unsorted array ******************************************************* // Function: Linear search // Parameters: // words: List of words // number_of_words: Number of words in the list // target: Word to lookup in the list // Returns: // The index of the word in the array if found, else -1 int linear_search(char *words[], int number_of_words, char *target) { int i ; int compare_result ; for (i = 0 ; i < number_of_words ; i++) { number_of_comparisons++ ; if (strcmp(target, words[i]) == 0) { return( i ) ; } } return( -1 ) ; } ******************************************************* Binary search using sorted array (iterative loop) ******************************************************* // Function: binary_search (iterative loop version) // Parameters: // words: List of words // number_of_words: Number of words in the list // target: Word to lookup in the list // Returns: // The index of the word in the array if found, else -1 int binary_search(char *words[], int number_of_words, char *target) { int low, high, middle ; int compare_result ; low = 0 ; high = number_of_words - 1 ; while( low <= high ) { middle = (low + high) / 2 ; compare_result = strcmp(target, words[middle]) ; number_of_comparisons++ ; if (compare_result == 0) { return( middle ) ; } else if (compare_result < 0) { high = middle - 1 ; } else { low = middle + 1 ; } } return( -1 ) ; } ******************************************************* Binary search using sorted array (recursive) ******************************************************* // Function: binary_search (recursive version) // Parameters: // words: List of words // number_of_words: Number of words in the list // target: Word to lookup in the list // low: Low index of range to search // high: High index of range to search // Returns: // The index of the word in the array if found, else -1 int binary_search(char *words[], int number_of_words, char *target, int low, int high) { int middle ; int compare_result ; if (low > high) return( -1 ) ; middle = (low + high) / 2 ; compare_result = strcmp(target, words[middle]) ; number_of_comparisons++ ; if (compare_result == 0) { return( middle ) ; } else if (compare_result < 0) { return( binary_search(words, number_of_words, target, low, middle - 1) ) ; } else { return( binary_search(words, number_of_words, target, middle + 1, high) ) ; } }