Homework 5: Hash Tables

In this assignment, you'll be exploring hash tables with various collision resolution strategies. Your task is to implement these approaches and measure their performance using profiling techniques.

Objectives

  1. Implement hash tables using both open addressing and chaining collision resolution strategies.
  2. Design and implement efficient resizing and rehashing mechanisms to maintain O(1) average time complexity.
  3. Profile and benchmark hash table implementations using JMH to measure time and space performance.
  4. Compare and analyze the performance tradeoffs between open addressing and chaining approaches.
  5. Optimize hash table performance through load factor tuning and prime number table sizing.

Important Dates

Event    Day  Date    Time
Release  Sat  Oct 26  09:00 AM ET
Due      Sat  Nov 02  05:00 PM ET

Homework Description

In this assignment, you'll implement two key techniques for building hash maps: open addressing and chaining. To put these concepts into practice, we've provided a simple search engine that uses your hash map implementations to index and search data.

You will use this search engine to evaluate the performance of your hash map implementations, comparing the results of open addressing and chaining strategies. Then, you'll share your findings with us!

Note: you are not allowed to use (1) Java Collections (built-in data structures) or (2) data structures that have not yet been covered in the course. The one exception is the auxiliary data structure used to implement the chaining strategy in ChainingHashMap: there, it is okay to use Java's ArrayList or LinkedList (or their equivalents, or other data structures we developed in the course or in earlier homework).

Starter Code

Download the starter code from the following link: hw5.zip.

The starter code is a zip file. You must unzip it; after unzipping, you will get a folder called hw5-starter. Open this folder as an IntelliJ project.

Tasks

This assignment consists of two main tasks:

  1. Implementation and Profiling: Implement and profile two hash map variants: ChainingHashMap and OpenAddressingHashMap.
  2. Discussion: Provide insights on profiling techniques and implementation details for both hash maps.

Implement Hash Tables

Your first task is to implement two hash table classes, ChainingHashMap and OpenAddressingHashMap, which both implement the Map interface.

You have flexibility in your implementation:

  • For chaining, choose an auxiliary data structure (e.g., ArrayList or LinkedList) that best suits your needs.
  • For open addressing, select a probing strategy that minimizes collisions (see the probing sketch after this list).
  • Decide on a threshold for rehashing the table based on load factor to optimize performance.
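
The difference between open addressing variants comes down to how the probe sequence is computed. Below is a minimal sketch of two common strategies, linear and quadratic probing; the class and helper names are illustrative, not part of the starter code:

// A sketch of probe-sequence computation for open addressing.
final class Probing {

  // Map an arbitrary hashCode to a non-negative home slot in [0, capacity).
  static int homeSlot(Object key, int capacity) {
    return (key.hashCode() & 0x7fffffff) % capacity;
  }

  // Linear probing: the i-th probe steps one slot at a time.
  static int linearProbe(Object key, int i, int capacity) {
    return (homeSlot(key, capacity) + i) % capacity;
  }

  // Quadratic probing: growing jumps reduce primary clustering.
  static int quadraticProbe(Object key, int i, int capacity) {
    return (homeSlot(key, capacity) + i * i) % capacity;
  }
}

Linear probing is simpler but suffers from primary clustering; quadratic probing spreads collided keys further apart at the cost of a slightly more involved probe sequence.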

Your task is to figure out how to implement and optimize these hash tables. Additionally, report which one performs better (ChainingHashMap or OpenAddressingHashMap).

Requirements:

  • Use a plain (generic) array as the underlying data structure for both ChainingHashMap and OpenAddressingHashMap.
  • Manually resize the table to maintain an acceptable load factor.
  • When selecting a table size (capacity), start with an initial capacity of 5. As you resize, choose a prime number that is larger than double the previous capacity (a sketch of one such growth scheme follows this list).
  • All critical map operations must run in average (or amortized) O(1) expected time.
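
The capacity rule above takes only a little arithmetic. Here is a minimal sketch, assuming an example rehash threshold of 0.75 (the actual threshold is yours to tune); the class and method names are illustrative:

// A sketch of load-factor checking and prime-capacity growth.
final class Resizing {

  static final double LOAD_FACTOR_THRESHOLD = 0.75;  // example value; tune it

  // Rehash once the fraction of filled slots exceeds the threshold.
  static boolean needsRehash(int numEntries, int capacity) {
    return (double) numEntries / capacity > LOAD_FACTOR_THRESHOLD;
  }

  // Smallest prime strictly greater than double the current capacity,
  // e.g. 5 -> 11 -> 23 -> 47 -> ...
  static int nextCapacity(int capacity) {
    int candidate = 2 * capacity + 1;  // odd, and already > 2 * capacity
    while (!isPrime(candidate)) {
      candidate += 2;  // primes greater than 2 are odd
    }
    return candidate;
  }

  private static boolean isPrime(int n) {
    if (n < 2) {
      return false;
    }
    for (int d = 2; (long) d * d <= n; d++) {
      if (n % d == 0) {
        return false;
      }
    }
    return true;
  }
}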

You can use the provided unit tests to ensure your implementation is correct. Feel free to add more tests to MapTest but do not modify the provided test cases.

Profile Hash Table Performance

For your second task, you'll profile the performance of your hash tables. To that end, you'll use your hash tables to power a search engine. We'll guide you through this process in the sections that follow; please read carefully until you reach the profiling section.

JHUgle Search Engine

We've developed a simple search engine called JHUgle, which can process text files containing lists of websites and their corresponding keywords. JHUgle uses this data to create an index that maps each keyword to a collection of URLs where it appears. Once the index is created, you can run the search engine and query it.

Try out our JHUgle program by running Driver.main in the hw5 package. If you're on a Windows machine, please open Config.java and follow the instructions marked with a TODO before running the driver program.

=== JHUgle Search Engine Running! === 
>>> Add your query below >>> 
> 

In the prompt (seen as > above), enter red and then enter ? as shown below. Notice the program will respond with a list of URLs.

> red
> ?
https://en.wikipedia.org/wiki/Cat
http://www.foobar.com/baz
http://www.cars.com/
> 

Enter ! to stop the search engine.

Data files

You can control what data file is fed to JHUgle through Config.DATA_FILENAME:

public class Config {

  // Update this to any other data file for benchmarking experiments.
  public static String DATA_FILENAME = "urls.txt";

  // Other members are not shown to save space!

}

There are several data files in src/main/resources:

src/main/resources   
  apache.txt     (large-ish)
  jhu.txt        (small)
  joanne.txt     (small)
  newegg.txt     (large-ish)
  random164.txt  (big!)
  urls.txt       (tiny)

These files could be many lines long, but a very short example named urls.txt looks as follows:

http://www.cars.com/
car red truck wheel blue fast brown silver gray
https://en.wikipedia.org/wiki/Cat
cat feline small fur red brown fast gray
http://www.foobar.com/baz
foo bar baz lorem ipsum red blue yellow silver

Notice how each URL is followed by a list of space-separated keywords. The idea is to associate each URL with a set of keywords. In turn, you can search and retrieve all the URLs that were associated with a keyword (or a combination of keywords).
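
Conceptually, building the index is a matter of walking the file two lines at a time. Here is a sketch of that idea using JDK collections purely for illustration (the real index is built by the provided SimpleSearchEngine, and the restriction on Java Collections still applies to your hash map implementations):

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// A conceptual sketch of the keyword -> URLs inverted index.
final class InvertedIndexSketch {

  static Map<String, Set<String>> buildIndex(String[] lines) {
    Map<String, Set<String>> index = new HashMap<>();
    for (int i = 0; i + 1 < lines.length; i += 2) {
      String url = lines[i];  // a URL line...
      for (String word : lines[i + 1].split(" ")) {  // ...then its keywords
        index.computeIfAbsent(word, k -> new HashSet<>()).add(url);
      }
    }
    return index;
  }
}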

Queries

The ultimate goal of JHUgle is to enable users to query for URLs that contain specific sets of words. These word sets will be specified using logical expressions. For example, when searching based on the urls.txt data file, if you ask for pages containing "red AND fast", JHUgle should return the first two URLs, excluding the third one since it doesn't include "fast" even though it contains "red".

Similarly, if you query for "car OR baz", the second URL would be omitted because it lacks both of these words. Note that search queries can involve multiple operands (words) and operators (AND, OR), making JHUgle a powerful tool for finding relevant URLs.

To make it easier to enter queries, we'll use a simple notation called postfix (where the operator follows its operands). We'll also use symbols for operations instead of words, so that the search engine can distinguish between query words and operators. For example, the two queries above would be entered as:

  • red fast && (logical AND)
  • car baz || (logical OR)

When you run JHUgle, it will prompt you with a > symbol to enter your query one word or logical operation at a time. However, keep in mind that the input is cumulative and builds up into a single long query.

Two more operations are implemented in JHUgle:

  1. The ? operation: When you enter this command, JHUgle will print the URLs corresponding to your most recent query expression (based on a single keyword input or an operation result). Each URL will be displayed on its own line with no additional formatting.

  2. The ! operation: To exit JHUgle, simply type !. The program will respond by quitting without producing any further output.

Here is a sample run that showcases the full functionality of the program, using the urls.txt input file. The results illustrate all aspects of the program's operation.

=== JHUgle Search Engine Running! ===
>>> Add your query below >>> 
> ?
> baz
> red
> ?
https://en.wikipedia.org/wiki/Cat
http://www.foobar.com/baz
http://www.cars.com/
> &&
> ?
http://www.foobar.com/baz
> !

Process finished with exit code 0

In this snippet, > is the prompt JHUgle prints for the user. No output is printed in response to the initial ? operation because no query data has been entered yet. The second ? displays the URLs that contain the word "red", the most recently entered keyword. The third ? yields the URLs that contain both "baz" and "red", the result of the logical && operation.
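
The query behavior just described boils down to a stack of URL sets: each keyword pushes its set of URLs, and each operator pops two sets and pushes their intersection (&&) or union (||). Here is a sketch of one evaluation step, again using JDK collections for illustration only; it mirrors the behavior described above, not JHUgle's actual (obfuscated) source:

import java.util.Deque;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// A sketch of one step of postfix query evaluation.
final class QueryEvalSketch {

  static void apply(Deque<Set<String>> stack, String token,
                    Map<String, Set<String>> index) {
    if (token.equals("&&") || token.equals("||")) {
      Set<String> right = stack.pop();
      Set<String> left = new HashSet<>(stack.pop());
      if (token.equals("&&")) {
        left.retainAll(right);  // intersection: URLs containing both words
      } else {
        left.addAll(right);     // union: URLs containing either word
      }
      stack.push(left);
    } else {
      // A keyword: push its URL set (empty if the word was never indexed).
      stack.push(index.getOrDefault(token, new HashSet<>()));
    }
  }
}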

The JHUgle source code is hidden

Open the Jhugle.java file and notice it is a simple wrapper class around SimpleSearchEngine.

public class Jhugle {

  // A supremely simple search engine based on a hash table.
  private static SimpleSearchEngine sse;

}

The SimpleSearchEngine is provided through the obfuscated JAR file hw5lib.jar. It is imported as follows:

import obfuscate.SimpleSearchEngine;

Notice the constructor of Jhugle that takes in a Map:

/**
  * Construct a JHUgle search engine.
  *
  * @param map any implementation of Map.
  */
public Jhugle(Map<String, Set<String>> map) {
  sse = new SimpleSearchEngine(map);
}

This implementation follows a design pattern known as dependency injection: it allows clients of Jhugle to choose the implementation of the Map ADT used by the (simple) search engine.
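
For instance, a client could inject one of your implementations directly (a sketch, assuming ChainingHashMap implements Map<String, Set<String>>):

Jhugle engine = new Jhugle(new ChainingHashMap<>());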

Here is how the Driver program provides a Map to the constructor of Jhugle:

Jhugle jhUgle = new Jhugle(Config.getMap());

Notice getMap() is a public method in the Config.java file:

/**
  * Instantiate a Map to be used for JHUgle's search engine.
  *
  * @return an implementation of Map ADT.
  */
public static Map getMap() {
  // Update this to any other implementation of Map for benchmarking experiments.
  return new JdkHashMap();
}

Currently, it returns a JdkHashMap (our wrapper around the HashMap built into Java). You can update this to any other implementation of Map (ChainingHashMap or OpenAddressingHashMap) for profiling.

The Driver allows you to experiment with different data files as well! However, you must not make any direct changes to the Driver program. Instead, update the corresponding DATA_FILENAME field in the Config.java file.
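
For example, to profile ChainingHashMap on the newegg.txt data set, the two edits to Config.java would look roughly like this (a sketch; it assumes ChainingHashMap has a no-argument constructor, matching the raw-type style of the provided getMap):

public class Config {

  // Benchmark against the newegg.txt data set.
  public static String DATA_FILENAME = "newegg.txt";

  public static Map getMap() {
    // Swap in the implementation under test.
    return new ChainingHashMap();
  }

  // Other members are not shown to save space!

}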

Profiling

Open the file src/test/hw5/JmhRuntimeTest.java to run a benchmark test that profiles the performance of building JHUgle with different data files and hash tables. This program not only measures time, but also tracks memory usage (space). The test relies on Config.getMap for dependency injection into the JHUgle program.

Important: Do not run the buildSearchEngine method directly by clicking the play button next to it. Instead, run the main method, which attaches a garbage collector profiler (GcProfiler) to this benchmark.
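
To see why main matters: a typical JMH driver attaches the profiler programmatically before running the benchmarks. The sketch below uses JMH's built-in GCProfiler for illustration; the starter's GcProfiler and option set may differ:

import org.openjdk.jmh.profile.GCProfiler;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public final class JmhDriverSketch {

  public static void main(String[] args) throws RunnerException {
    Options options = new OptionsBuilder()
        .include("JmhRuntimeTest")      // select the benchmark class
        .addProfiler(GCProfiler.class)  // adds the gc.* rows to the report
        .build();
    new Runner(options).run();
  }
}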

When you run the JmhRuntimeTest.main method, it may take several minutes and produce a large amount of output. At the end, you should see a report summarizing the results, such as the one shown below (note that the actual results will vary depending on your computer).

Benchmark                                                                 (fileName)  Mode  Cnt           Score   Error   Units
JmhRuntimeTest.buildSearchEngine                                          apache.txt  avgt    2         263.017           ms/op
JmhRuntimeTest.buildSearchEngine:+c2k.gc.alloc.rate                       apache.txt  avgt    2             NaN          MB/sec
JmhRuntimeTest.buildSearchEngine:+c2k.gc.churn.G1_Eden_Space              apache.txt  avgt    2         876.189          MB/sec
JmhRuntimeTest.buildSearchEngine:+c2k.gc.churn.G1_Eden_Space.norm         apache.txt  avgt    2   255234506.796            B/op
JmhRuntimeTest.buildSearchEngine:+c2k.gc.churn.G1_Old_Gen                 apache.txt  avgt    2           0.392          MB/sec
JmhRuntimeTest.buildSearchEngine:+c2k.gc.churn.G1_Old_Gen.norm            apache.txt  avgt    2      114593.563            B/op
JmhRuntimeTest.buildSearchEngine:+c2k.gc.churn.G1_Survivor_Space          apache.txt  avgt    2          11.503          MB/sec
JmhRuntimeTest.buildSearchEngine:+c2k.gc.churn.G1_Survivor_Space.norm     apache.txt  avgt    2     3349146.084            B/op
JmhRuntimeTest.buildSearchEngine:+c2k.gc.count                            apache.txt  avgt    2          22.000          counts
JmhRuntimeTest.buildSearchEngine:+c2k.gc.maximumCommittedAfterGc          apache.txt  avgt    2   792264704.000           bytes
JmhRuntimeTest.buildSearchEngine:+c2k.gc.maximumUsedAfterGc               apache.txt  avgt    2   152084192.000           bytes
JmhRuntimeTest.buildSearchEngine:+c2k.gc.time                             apache.txt  avgt    2         134.000              ms
JmhRuntimeTest.buildSearchEngine                                             jhu.txt  avgt    2           0.356           ms/op
JmhRuntimeTest.buildSearchEngine:+c2k.gc.alloc.rate                          jhu.txt  avgt    2             NaN          MB/sec
JmhRuntimeTest.buildSearchEngine:+c2k.gc.churn.G1_Eden_Space                 jhu.txt  avgt    2        1142.128          MB/sec
JmhRuntimeTest.buildSearchEngine:+c2k.gc.churn.G1_Eden_Space.norm            jhu.txt  avgt    2      450302.027            B/op
JmhRuntimeTest.buildSearchEngine:+c2k.gc.churn.G1_Old_Gen                    jhu.txt  avgt    2           0.004          MB/sec
JmhRuntimeTest.buildSearchEngine:+c2k.gc.churn.G1_Old_Gen.norm               jhu.txt  avgt    2           1.639            B/op
JmhRuntimeTest.buildSearchEngine:+c2k.gc.count                               jhu.txt  avgt    2          66.000          counts
JmhRuntimeTest.buildSearchEngine:+c2k.gc.maximumCommittedAfterGc             jhu.txt  avgt    2   346357760.000           bytes
JmhRuntimeTest.buildSearchEngine:+c2k.gc.maximumUsedAfterGc                  jhu.txt  avgt    2    25008220.000           bytes
JmhRuntimeTest.buildSearchEngine:+c2k.gc.time                                jhu.txt  avgt    2          25.500              ms
JmhRuntimeTest.buildSearchEngine                                          joanne.txt  avgt    2           0.133           ms/op
JmhRuntimeTest.buildSearchEngine:+c2k.gc.alloc.rate                       joanne.txt  avgt    2             NaN          MB/sec
JmhRuntimeTest.buildSearchEngine:+c2k.gc.churn.G1_Eden_Space              joanne.txt  avgt    2        1149.724          MB/sec
JmhRuntimeTest.buildSearchEngine:+c2k.gc.churn.G1_Eden_Space.norm         joanne.txt  avgt    2      169090.703            B/op
JmhRuntimeTest.buildSearchEngine:+c2k.gc.churn.G1_Old_Gen                 joanne.txt  avgt    2           0.016          MB/sec
JmhRuntimeTest.buildSearchEngine:+c2k.gc.churn.G1_Old_Gen.norm            joanne.txt  avgt    2           2.378            B/op
JmhRuntimeTest.buildSearchEngine:+c2k.gc.count                            joanne.txt  avgt    2          55.000          counts
JmhRuntimeTest.buildSearchEngine:+c2k.gc.maximumCommittedAfterGc          joanne.txt  avgt    2   411271168.000           bytes
JmhRuntimeTest.buildSearchEngine:+c2k.gc.maximumUsedAfterGc               joanne.txt  avgt    2    24917504.000           bytes
JmhRuntimeTest.buildSearchEngine:+c2k.gc.time                             joanne.txt  avgt    2          23.500              ms
JmhRuntimeTest.buildSearchEngine                                          newegg.txt  avgt    2         138.387           ms/op
JmhRuntimeTest.buildSearchEngine:+c2k.gc.alloc.rate                       newegg.txt  avgt    2             NaN          MB/sec
JmhRuntimeTest.buildSearchEngine:+c2k.gc.churn.G1_Eden_Space              newegg.txt  avgt    2         876.438          MB/sec
JmhRuntimeTest.buildSearchEngine:+c2k.gc.churn.G1_Eden_Space.norm         newegg.txt  avgt    2   134330890.763            B/op
JmhRuntimeTest.buildSearchEngine:+c2k.gc.churn.G1_Old_Gen                 newegg.txt  avgt    2           0.343          MB/sec
JmhRuntimeTest.buildSearchEngine:+c2k.gc.churn.G1_Old_Gen.norm            newegg.txt  avgt    2       52606.947            B/op
JmhRuntimeTest.buildSearchEngine:+c2k.gc.churn.G1_Survivor_Space          newegg.txt  avgt    2           4.779          MB/sec
JmhRuntimeTest.buildSearchEngine:+c2k.gc.churn.G1_Survivor_Space.norm     newegg.txt  avgt    2      732113.874            B/op
JmhRuntimeTest.buildSearchEngine:+c2k.gc.count                            newegg.txt  avgt    2          25.500          counts
JmhRuntimeTest.buildSearchEngine:+c2k.gc.maximumCommittedAfterGc          newegg.txt  avgt    2   681345024.000           bytes
JmhRuntimeTest.buildSearchEngine:+c2k.gc.maximumUsedAfterGc               newegg.txt  avgt    2    77338920.000           bytes
JmhRuntimeTest.buildSearchEngine:+c2k.gc.time                             newegg.txt  avgt    2         102.000              ms
JmhRuntimeTest.buildSearchEngine                                       random164.txt  avgt    2         812.268           ms/op
JmhRuntimeTest.buildSearchEngine:+c2k.gc.alloc.rate                    random164.txt  avgt    2             NaN          MB/sec
JmhRuntimeTest.buildSearchEngine:+c2k.gc.churn.G1_Eden_Space           random164.txt  avgt    2         570.721          MB/sec
JmhRuntimeTest.buildSearchEngine:+c2k.gc.churn.G1_Eden_Space.norm      random164.txt  avgt    2   510282020.571            B/op
JmhRuntimeTest.buildSearchEngine:+c2k.gc.churn.G1_Survivor_Space       random164.txt  avgt    2          14.406          MB/sec
JmhRuntimeTest.buildSearchEngine:+c2k.gc.churn.G1_Survivor_Space.norm  random164.txt  avgt    2    13169615.238            B/op
JmhRuntimeTest.buildSearchEngine:+c2k.gc.count                         random164.txt  avgt    2          13.000          counts
JmhRuntimeTest.buildSearchEngine:+c2k.gc.maximumCommittedAfterGc       random164.txt  avgt    2  3645702144.000           bytes
JmhRuntimeTest.buildSearchEngine:+c2k.gc.maximumUsedAfterGc            random164.txt  avgt    2  1746337820.000           bytes
JmhRuntimeTest.buildSearchEngine:+c2k.gc.time                          random164.txt  avgt    2        1202.000              ms
JmhRuntimeTest.buildSearchEngine                                            urls.txt  avgt    2           0.037           ms/op
JmhRuntimeTest.buildSearchEngine:+c2k.gc.alloc.rate                         urls.txt  avgt    2             NaN          MB/sec
JmhRuntimeTest.buildSearchEngine:+c2k.gc.churn.G1_Eden_Space                urls.txt  avgt    2         668.453          MB/sec
JmhRuntimeTest.buildSearchEngine:+c2k.gc.churn.G1_Eden_Space.norm           urls.txt  avgt    2       27033.600            B/op
JmhRuntimeTest.buildSearchEngine:+c2k.gc.churn.G1_Old_Gen                   urls.txt  avgt    2           0.004          MB/sec
JmhRuntimeTest.buildSearchEngine:+c2k.gc.churn.G1_Old_Gen.norm              urls.txt  avgt    2           0.156            B/op
JmhRuntimeTest.buildSearchEngine:+c2k.gc.count                              urls.txt  avgt    2          46.500          counts
JmhRuntimeTest.buildSearchEngine:+c2k.gc.maximumCommittedAfterGc            urls.txt  avgt    2   291635200.000           bytes
JmhRuntimeTest.buildSearchEngine:+c2k.gc.maximumUsedAfterGc                 urls.txt  avgt    2    24581236.000           bytes
JmhRuntimeTest.buildSearchEngine:+c2k.gc.time                               urls.txt  avgt    2          17.000              ms

While there are many metrics displayed here, don't worry too much about most of them! Instead, focus on two key metrics: time and space (specifically, maximumUsedAfterGc).

Here is the time it takes to build our search engine using a JdkHashMap:

Benchmark                            (fileName)  Mode  Cnt           Score   Error   Units
JmhRuntimeTest.buildSearchEngine     apache.txt  avgt    2         263.017           ms/op
JmhRuntimeTest.buildSearchEngine        jhu.txt  avgt    2           0.356           ms/op
JmhRuntimeTest.buildSearchEngine     joanne.txt  avgt    2           0.133           ms/op
JmhRuntimeTest.buildSearchEngine     newegg.txt  avgt    2         138.387           ms/op
JmhRuntimeTest.buildSearchEngine  random164.txt  avgt    2         812.268           ms/op
JmhRuntimeTest.buildSearchEngine       urls.txt  avgt    2           0.037           ms/op

And the space it takes to build the search engine:

Benchmark                                                       (fileName)  Mode  Cnt           Score   Error   Units
JmhRuntimeTest.buildSearchEngine:+c2k.gc.maximumUsedAfterGc     apache.txt  avgt    2   152084192.000           bytes
JmhRuntimeTest.buildSearchEngine:+c2k.gc.maximumUsedAfterGc        jhu.txt  avgt    2    25008220.000           bytes
JmhRuntimeTest.buildSearchEngine:+c2k.gc.maximumUsedAfterGc     joanne.txt  avgt    2    24917504.000           bytes
JmhRuntimeTest.buildSearchEngine:+c2k.gc.maximumUsedAfterGc     newegg.txt  avgt    2    77338920.000           bytes
JmhRuntimeTest.buildSearchEngine:+c2k.gc.maximumUsedAfterGc  random164.txt  avgt    2  1746337820.000           bytes
JmhRuntimeTest.buildSearchEngine:+c2k.gc.maximumUsedAfterGc       urls.txt  avgt    2    24581236.000           bytes

Discussion

Use your README file to explain why one of your implementations (OpenAddressingHashMap or ChainingHashMap) is a better choice for the search engine. Describe the approaches you tried, such as probing strategies and rehashing strategies, and why you chose them. Note any specific tweaks that improved or worsened performance. Record failed or abandoned attempts, if any.

Summarize how you developed, evaluated, and improved your hash tables. Include all benchmarking data, results, and analysis that led to your final decision on which implementation to use for the search engine.

Additionally, explain why you chose your final Map implementation as the best one. Analyze your benchmark data and draw conclusions. What results were surprising, and which were expected?

Submission

In preparation, you must (1) run Checkstyle to ensure your code is compliant (if not, fix the styling issues), (2) create a zip file containing the src folder (and everything therein), and (3) submit the zip file to Gradescope.

You can submit as many times as you like until the deadline. We will only grade the last submission.

Rubric

Teaching assistants will evaluate your submission using the following criteria:

Please note, we reserve the right to change this rubric without advance notice, even after the submission deadline. However, the chances of doing so are slim, and any change would not be significant.

  • Spec 1: The OpenAddressingHashMap constructor is correctly and efficiently implemented. [1 point]

  • Spec 2: The implementation of OpenAddressingHashMap.insert is correct and efficient, demonstrating a thorough comprehension of the open addressing technique and its operational intricacies. [3 points]

  • Spec 3: The implementation of OpenAddressingHashMap.remove is correct and efficient, demonstrating a thorough comprehension of the open addressing technique and its operational intricacies. [2 points]

  • Spec 4: The implementation of OpenAddressingHashMap.put is correct and efficient, allowing clients to update the value associated with a key. [1 point]

  • Spec 5: The implementations of the get and has operations in OpenAddressingHashMap are correct and efficient, ensuring fast lookups and membership tests. [1 point]

  • Spec 6: The implementation of OpenAddressingHashMap.size is correct and efficient. [0.5 points]

  • Spec 7: The iterator of OpenAddressingHashMap is implemented correctly and efficiently. [2 points]

  • Spec 8: The ChainingHashMap constructor is correctly and efficiently implemented. [1 point]

  • Spec 9: The implementation of ChainingHashMap.insert is correct and efficient, demonstrating a thorough comprehension of the chaining technique and its operational intricacies. [2 points]

  • Spec 10: The implementation of ChainingHashMap.remove is correct and efficient, demonstrating a thorough comprehension of the chaining technique and its operational intricacies. [2 points]

  • Spec 11: The implementation of ChainingHashMap.put is correct and efficient, allowing clients to update the value associated with a key. [1 point]

  • Spec 12: The implementations of the get and has operations in ChainingHashMap are correct and efficient, ensuring fast lookups and membership tests. [1 point]

  • Spec 13: The implementation of ChainingHashMap.size is correct and efficient. [0.5 points]

  • Spec 14: The iterator of ChainingHashMap is implemented correctly and efficiently. [2 points]

  • Spec 15: The submission contains (in the README.md file) a description of the different ways they developed, evaluated, and improved their hash tables. [1 point]

  • Spec 16: The submission contains (in the README.md file) answers to the discussion question about benchmarking. The points made in the answer are adequate and backed by experimental data. [1 point]

  • Spec 17: The code is well-organized, readable, and follows coding standards. [1 point]

The total score for this homework is 23 points.