COMP15 Project #8: Dictionary ADT (hash table)

COMP 15 Project #8: Dictionary ADT (hash table)

COMP15 Spring 20XX
Due: Monday, May 2, 11:00PM

Overview

This project is a follow-up to Project #7 in which we used a binary search tree to implement a dictionary abstract data type (ADT.) In this project, we will keep the same basic interface to the dictionary ADT, but will use a hash table to implement the dictionary. We will again measure the execution speed of insertion and look up operations -- to increase the speed of the hash table implementation and to compare the hash table method with the binary search tree implementation.

Objectives

The objectives for this project are:

To implement a simple dictionary ADT using a hash table.
To learn to instrument an application to measure its execution speed.
To learn to tune up the performance of an application.
To gain further experience with software reuse and testing.

What needs to be done

Several files are provided:

hash_node.h, which is the declaration and definition of the HashNode class and its interface.
dictionary.h, which is the declaration of the Dictionary interface.
dictionary.cpp, which is the definition of the Dictionary class.
hash_search.cpp, a preliminary skeleton for testing and measuring the Dictionary class.
stats.h and stats.cpp, a class for collecting and displaying simple descriptive statistics.
timer.h and timer.cpp, a class for measuring elapsed wall-clock time and for displaying execution time measurements.
makefile, which is to be used for building the test programs.

All files are available in the directory /g/15/class/project8.

You will need to implement the dictionary ADT using the files dictionary.h and dictionary.cpp. The implementation must use a hash table to maintain the contents of the dictionary. Dictionary items are represented using HashNode objects which must be allocated and deallocated dynamically. To simplify, a HashNode elides the value part of a key-value pair and only stores the symbolic key in the form of a C-style string. The key symbol must be dynamically allocated and deallocated, too.

It is not necessary to implement a copy constructor or overloaded assignment operator for the Dictionary class.

A new function, printStats, has been added to the Dictionary class interface. This function computes and displays statistics on hash table utilization and the length of the table's lists.

You will need to reuse the test and measurement driver from Project #7. To avoid confusion when building and making comparisons with the binary search tree method, the test driver for Project #8 is contained in the file hash_search.cpp. If you were careful in Project #7, you should be able to reuse your existing test driver and just add a call to compute and display hash table statistics. Please see the sample output below.

    Number of words in dictionary: 354984
    Measurements for reading dictionary word list
      Elapsed time: 0.343066 seconds
      User time:    0.320000 seconds
      System time:  0.040000 seconds
    Measurements for inserting words into dictionary
      Elapsed time: 14.136600 seconds
      User time:    14.200000 seconds
      System time:  0.230000 seconds
    Number of words to lookup: 30065
    Measurements for reading list of words to be found
      Elapsed time: 0.024554 seconds
      User time:    14.230000 seconds
      System time:  0.230000 seconds
    Starting lookups
    Lookups finished
    Measurements for finding words in dictionary
      Elapsed time: 1.043691 seconds
      User time:    15.260000 seconds
      System time:  0.240000 seconds
    Statistics for comparisons per find operation
      Samples:      30065
      Minimum:      1.000000
      Maximum:      405.000000
      Mean:         173.646133
      Variance:     39344.784834
      Std dev:      198.355199
      Coeff of var: 1.142296
    Hash table utilization statistics
      Hash table entries:      1031
      Hash table entries used: 1031
      Hash table utilization:  1.000000
    Hash table list length statistics
      Samples:      1031
      Minimum:      291.000000
      Maximum:      407.000000
      Mean:         344.309408
      Variance:     118783.291980
      Std dev:      344.649520
      Coeff of var: 1.000988

The command line

The test and measurement driver takes two command line arguments: the name of the file containing words to be place into the dictionary and the name of the file containing words to be looked-up. The usage synopsis for the driver is:

    Usage: hash_search wordfile lookupfile

The test and measurement driver must check for the correct number of command line arguments.

Test data files

Two test data files are provided:

moby.scrambled, a list of 350,000+ words to be loaded into the dictionary.
lookup, a list of 30,000+ words to be looked-up in the the dictionary.

Each file is organized in the following way. The first line in the file is the number of words in the file. The remaining lines in the file contain exactly one word.

These are the same files that were used in Project #7. When doing performance studies, it's important to use the same data sets and to take measurements under the same conditions in order to make meaningful comparisons.

Additional requirements

Here are some additional requirements to be satisfied:

You must use the supplied makefile to build your solution. This makefile produces four executable programs:
- hash_search, the program to test and measure the dictionary ADT.
- hashtest, a simple smoke test for the dictionary ADT.
- statstest, a simple smoke test for the statistics collection module.
- timertest, a simple smoke test for the execution timer module.
You must thoroughly and systematically test the dictionary ADT. You may extend the simple smoke test for the dictionary ADT.
Add comments to the source code to describe your approach and solution.

Don't forget to fill out any header comments with your name, section and e-mail address.

Submitting your work

Use the provide system to submit your finished program:

    /local/bin/provide comp15 p8 makefile hash_node.h ...

The files to be submitted are:

    hash_node.h       Declaration/definition of a hash table node
    dictionary.h      Declaration of the dictionary ADT interface
    dictionary.cpp    Definition of the dictionary ADT
    hash_search.cpp   Test and measurement driver
    stats.h           Declaration of the statistics interface
    stats.cpp         Definition of the statistics module
    timer.h           Declaration of the timing interface
    timer.cpp         Definition of the timer module
    makefile          makefile to build everything

Do not submit the test files moby.scrambled and lookup.

Be sure you are completely satisfied with your work before submitting! Build and test your program on comp15.cs.tufts.edu before submitting your work. This system is the target build and test platform for all course projects. Projects which do not build correctly receive a score of zero.

Extra credit

Space-time efficiency of programs is important and as engineers, we must make trade-offs between execution time and memory space. One metric for space-time efficiency is the product of execution time and memory space. For this assignment, the efficiency metric is the product of CPU time (user+system) and hash table size.

Extra credit will be awarded for the top five programs (one program per student) with the most efficient find operation. Extra credit will be awarded on a graduated scale: 1 point (10% extra) for the top program, 0.8 point for second place, 0.6 point for third place, etc. To qualify for extra credit, your program must compute and display the product of CPU time for the lookup (find) phase of operation and hash table size (i.e., the number of entries in the hash array itself.)

    (User CPU + System CPU) * table_size

The ExecutionTimer function printProcessTimes saves the user CPU time and system CPU time. After calling printProcessTimes, you may retrieve the current user CPU time and system CPU time by calling the functions getUserTime and getSystemTime, respectively. Remember, you need to compute the user and system CPU time for the lookup phase, so you will need to "snap" the CPU times before and after the lookup phase of operation.

The decision of the judges is final. Contest void where prohibited by law.

Informal interface specification: Dictionary

There is one new method function: printStats. This function computes and prints statistics on hash table utilization and the length of the hash table lists. These statistics can be used to judge the effectiveness of the hash table design.

Function: Dictionary()
  Purpose: 
    Constructor for the Dictionary class. Initializes the
    dictionary to its empty state.
  Parameters:
    None
  Returns:
    Nothing
  Error conditions:
    None

Function: ~Dictionary()
  Purpose: 
    Deallocates (deletes) any items in the dictionary
  Parameters:
    None
  Returns:
    Nothing
  Error conditions:
    None

Function: insert(char *new_symbol)
  Purpose: 
    Inserts a new entry into the dictionary with the
    key specified by new_symbol
  Parameters:
    new_symbol: The key (a C-style string)
  Returns:
    Nothing
  Error conditions:
    Memory space exhausted

Function: find(char *target)
  Purpose: 
    Performs a lookup to find an item in the dictionary
    with the specified key
  Parameters:
    target: The key (a C-style string)
  Returns:
    True if the item was found, otherwise false
  Error conditions:
    Memory space exhausted

Function: remove(char *target)
  Purpose: 
    Finds and removes the item with the specified key
    from the dictionary
  Parameters:
    target: The key (a C-style string)
  Returns:
    Nothing
  Error conditions:
    None

Function: print
  Purpose: 
    Prints (dumps) the contents of the dictionary; For
    debugging purposes only
  Parameters:
    None
  Returns:
    Nothing
  Error conditions:
    None

Function: printStats
  Purpose: 
    Prints hash table utilization statistics
  Parameters:
    None
  Returns:
    Nothing
  Error conditions:
    None

Function: clear_number_of_comparisons
  Purpose: 
    Resets the running count of symbol (string) comparisons
  Parameters:
    None
  Returns:
    Nothing
  Error conditions:
    None

Function: inc_number_of_comparisons
  Purpose: 
    Increments the running count of symbol (string) comparisons
  Parameters:
    None
  Returns:
    Nothing
  Error conditions:
    None

Function: get_number_of_comparisons
  Purpose: 
    Retrieves the running count of symbol (string) comparisons
  Parameters:
    None
  Returns:
    Current running count of symbol comparisons
  Error conditions:
    None