Searching
P.J. Drongowski
5 December 2004

Dictionary search

  * Search is an important part of many computer applications
  * Dictionaries are a particular kind of search problem
  * A dictionary contain key-value pairs
  * Dictionary operations
    + New key-value pairs can be added to the dictionary (insert)
    + Old key-value pairs can be deleted (remove)
    + Given a key, the dictionary will return the associated value (find
      or lookup)
  * Example: PostScript dictionary
    + /key value def       Defines a key-value pair
    + dict key undef       Removes a key-value pair from the dictionary
    + /key value store     Stores a new value for the key
    + /key load            Retrieves value and pushes it on operand stack
  * A key may be any type -- integer or string being the most common datatypes
  * We will use string-valued keys in the examples to come and we will not
    explicitly store a value with a key -- the value just comes along for
    the ride!

How can we implement a dictionary?

  * So far, we have been studying individual data structures and algorithms
  * How does a software engineer choose an algorithm for a particular
    application?
  * Engineers define design criteria and make choices based on the criteria
  * Basic performance criteria
    + Speed
      - How much time is available?
      - Is there a response time deadline?
      - How much execution time may the solution take?
    + Space
      - How may key-value pairs must be stored?
      - How big are the keys? The values?    (sizeof)
      - How much space is needed for infrastructure? For bookkeeping?
    + Power (important for portable/battery powered devices)
      - How much heat will the solution generate?
      - How much stored energy will the solution consume?
  * Other criteria - the "-abilities"
    + Maintainability
      - How easy is it to isolate and fix bugs?
    + Extensibility
      - How easy is it to add new features to the system, especially ones
        that were not easily anticipated?
    + Reliability and availability
      - Is the system available 24x7?
      - Often requires redundancy or extra space for error checking codes
    + Scalability
      - Does the system scale with problem size? With the execution platform?
  * Application-specific criteria
    + Relative frequency of insert, remove and find operations
    + Amount of preparatory work need (like sorting by key)

How can we quantify space and speed?

  * Space
    + Usually easier to quantify than speed
    + Need to know number of key-value pairs to store
    + Need to know size of typical key-value pair
      - Can be difficult in practice (string size variations)
      - May need to characterize existing data using statistics
    + Size of a fixed-sized C/C++ data structure can be found using sizeof
  * Speed
    + Build and measure prototype
      - Variation occurs due to differences in compiler, processor design
        and speed, etc.
      - Can be time consuming to prototype and measure -- must run
        many experiments
    + Static analysis (analysis of algorithms)
      - Execution time depends upon the number of data items to be processed
      - Use mathematics to formally describe the execution time of an algorithm
      - Big-O notation expresses execution time as a function of the
        dominant term

Simple examples of search

  * Sequential search using an unsorted array of data items
    + Stores just data (low space overhead)
    + Linear lookup (search)
      - Must search from beginning to end (or the found dataitem)
      - Worst case is N where N is number of key-value pairs in array
        * Unsuccessful search must test every data item in the array
        * Successful search *may* require each data item to be tested
      - Average case is (N / 2)
    + Insert
      - Must find insertion point
      - Must shuffle elements to create space
      - Worst case - N operations
    + Remove
      - Must find key-value pair to remove
      - Must shuffle elements to reclaim space
      - Worst case - N operations

  * Binary search using a sorted array of data items
    + Lookup can be performed as binary search
      - Unsuccessful search takes floor(log N)+1 iterations (worst case)
      - Successful search takes floor(log N) iterations (worst case)
      - Average case is only one iteration less than worst case
    + Insert
      - Must maintain sorted order
      - Requires shuffling - worst case N operations
    + Remove
      - Must maintain sorted order
      - Requires shuffling - worst case N operations

  * Interpolation search using a sorted array of data items [Weiss]
    + Data must be sorted and must be uniformly distributed in the array
    + Time access a data item is much higher than computation time
    + Guess the approximate position of a data item in the array

                         x - a[low]
        next = low + [ -------------- * (high - low - 1) ]
                       a[high]-a[low]

      where x is the value to be found and a[] is the array of data items

    + Cost of computation could be quite high (floating point, division)
    + Average case is O(log log N) when items are uniformly distributed
    + Worst case is O(N) -- it may be necessary to examine all data items

  * Linked list data structure / algorithm
    + Lookup must be performed as sequential search
      - Unsuccessful search takes N iterations, O(N) (worst case)
      - Successful search takes N iterations, O(N) (worst case)
      - Average case is (N / 2)
    + Insert
      - If insert is to front of list, insert takes constant time, O(1)
      - If insert is to middle of list, must find insertion point, O(N)
      - No shuffling required to insert an item
    + Remove
      - Must find data item (key-value pair) to remove, O(N)
      - Removal of the data item takes constant time, when item is known
      - No shuffling required when an item is removed

What can we observe?

  * Array implementation
    + Search on sorted keys is very fast
    + Preparatory work may be required to sort the data
    + Insert/remove is painful due to shuffling
    + A good solution when insert/remove are infrequent
  * Linked list implementation
    + Can speed up insertion (no shuffling)
    + Sequential search is required
  * Are there other approaches to searching? Sure!
    + Binary search trees (BST)
    + Hash tables
    + Balanced trees
    + ...

------------------------------------------
Big-O complexity of find operations
------------------------------------------

  Sequential search (unsorted array)

      Case          #items   Big-O
      ------------  ------  -------
      Worst case       N      O(N)
      Average case   N / 2    O(N)

  Binary search (sorted array)

      Case              #items       Big-O
      ------------  --------------  --------
      Worst case    floor(log N)+1  O(log N)
      Average case   N / 2    O(N)

  Linked list

      Case          #items   Big-O
      ------------  ------  -------
      Worst case       N      O(N)
      Average case   N / 2    O(N)


*******************************************************
Sequential search through unsorted array
*******************************************************

// Function: Linear search
// Parameters:
//   words: List of words
//   number_of_words: Number of words in the list
//   target: Word to lookup in the list
// Returns:
//   The index of the word in the array if found, else -1

int linear_search(char *words[], int number_of_words, char *target)
{
  int i ;
  int compare_result ;

  for (i = 0 ; i < number_of_words ; i++) {
    number_of_comparisons++ ;
    if (strcmp(target, words[i]) == 0) {
      return( i ) ;
    }
  }

  return( -1 ) ;
}

*******************************************************
Binary search using sorted array (iterative loop)
*******************************************************

// Function: binary_search (iterative loop version)
// Parameters:
//   words: List of words
//   number_of_words: Number of words in the list
//   target: Word to lookup in the list
// Returns:
//   The index of the word in the array if found, else -1

int binary_search(char *words[], int number_of_words, char *target)
{
  int low, high, middle ;
  int compare_result ;

  low = 0 ;
  high = number_of_words - 1 ;

  while( low <= high ) {
    middle = (low + high) / 2 ;
    compare_result = strcmp(target, words[middle]) ;
    number_of_comparisons++ ;
    if (compare_result == 0) {
      return( middle ) ;
    } else if (compare_result < 0) {
      high = middle - 1 ;
    } else {
      low = middle + 1 ;
    }
  }
  return( -1 ) ;
}

*******************************************************
Binary search using sorted array (recursive)
*******************************************************

// Function: binary_search (recursive version)
// Parameters:
//   words: List of words
//   number_of_words: Number of words in the list
//   target: Word to lookup in the list
//   low: Low index of range to search
//   high: High index of range to search
// Returns:
//   The index of the word in the array if found, else -1

int binary_search(char *words[], int number_of_words,
  char *target, int low, int high)
{
  int middle ;
  int compare_result ;

  if (low > high) return( -1 ) ;

  middle = (low + high) / 2 ;

  compare_result = strcmp(target, words[middle]) ;
  number_of_comparisons++ ;
  if (compare_result == 0) {
    return( middle ) ;
  } else if (compare_result < 0) {
    return(
      binary_search(words, number_of_words, target, low, middle - 1)
    ) ;
  } else {
    return(
      binary_search(words, number_of_words, target, middle + 1, high)
    ) ;
  }
}