Homework 5: Hash Tables
In this assignment, you'll be exploring hash tables with various collision resolution strategies. Your task is to implement these approaches and measure their performance using profiling techniques.
Objectives
- Implement hash tables using both open addressing and chaining collision resolution strategies.
- Design and implement efficient resizing and rehashing mechanisms to maintain O(1) average time complexity.
- Profile and benchmark hash table implementations using JMH to measure time and space performance.
- Compare and analyze the performance tradeoffs between open addressing and chaining approaches.
- Optimize hash table performance through load factor tuning and prime number table sizing.
Important Dates
Event | Day | Date | Time |
---|---|---|---|
Release | Sat | Oct 26 | 09:00 AM ET |
Due | Sat | Nov 02 | 05:00 PM ET |
Homework Description
In this assignment, you'll implement two key techniques for building hash maps: open addressing and chaining. To put these concepts into practice, we've provided a simple search engine that uses your hash map implementations to index and map data.
You will use this search engine to evaluate the performance of your hash map implementations, comparing the results of open addressing and chaining strategies. Then, you'll share your findings with us!
Note: you are not allowed to use (1) Java Collections (built-in data structures) or (2) data structures that have not yet been covered in the course. The exception is the auxiliary data structure used to implement the chaining strategy in `ChainingHashMap`. It is okay to use Java's `ArrayList` or `LinkedList` (or their equivalents, or other data structures we developed in the course or in earlier homework).
Starter Code
Download the starter code from the following link: hw5.zip.
The starter code is a zip file named `hw5-starter.zip`. You must unzip this file. After unzipping, you will get a folder called `hw5-starter`. Open this folder as an IntelliJ project.
Tasks
This assignment consists of two main tasks:
- Implementation and Profiling: Implement and profile two hash map variants: `ChainingHashMap` and `OpenAddressingHashMap`.
- Discussion: Provide insights on profiling techniques and implementation details for both hash maps.
Implement Hash Tables
Your first task is to implement two hash table classes, `ChainingHashMap` and `OpenAddressingHashMap`, both of which implement the `Map` interface.
You have flexibility in your implementation:
- For chaining, choose an auxiliary data structure (e.g., ArrayList or LinkedList) that best suits your needs.
- For open addressing, select a probing strategy that minimizes collisions.
- Decide on a threshold for rehashing the table based on load factor to optimize performance (the classical probe-count estimates below can guide this choice).
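As a point of reference when choosing that threshold: under the standard uniform hashing assumptions, the classical expected probe counts at load factor $\alpha = n/m$ are (for chaining, and for linear probing per Knuth):

```latex
\text{chaining, unsuccessful search: } 1 + \alpha
\qquad
\text{linear probing, successful search: } \frac{1}{2}\left(1 + \frac{1}{1-\alpha}\right)
\qquad
\text{linear probing, unsuccessful search: } \frac{1}{2}\left(1 + \frac{1}{(1-\alpha)^{2}}\right)
```

As $\alpha$ approaches 1, open addressing degrades much faster than chaining, which is why open addressing tables are commonly rehashed around $\alpha \approx 0.5$, while chaining tables can tolerate $\alpha$ near 0.75 or even above 1.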
Your task is to figure out how to implement and optimize these hash tables. Additionally, report which one performs better (`ChainingHashMap` or `OpenAddressingHashMap`).
Requirements:
- Use a plain (generic) array as the underlying data structure for both `ChainingHashMap` and `OpenAddressingHashMap`.
- Manually resize the table to maintain an acceptable load factor.
- When selecting a table size (capacity), start at 5. As you resize, choose a prime number larger than double the previous size (see the sketch after this list).
- All critical map operations must run in average (or amortized) O(1) expected time.
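To make the generic-array and prime-resizing requirements concrete, here is a minimal standalone sketch; the class and method names are ours, not part of the starter code. Note that Java forbids creating generic arrays directly, hence the unchecked cast:

```java
import java.util.LinkedList;

// Illustrative helper (hypothetical names): allocating the generic backing
// array and growing the capacity to a prime above double.
final class HashTableSizing {

  // Java disallows "new LinkedList<V>[capacity]" directly; the idiomatic
  // workaround is a raw allocation plus an unchecked cast.
  @SuppressWarnings("unchecked")
  static <V> LinkedList<V>[] newChainArray(int capacity) {
    return (LinkedList<V>[]) new LinkedList[capacity];
  }

  // Smallest prime strictly greater than double the current capacity,
  // e.g. 5 -> 11 -> 23 -> 47 -> 97 -> ...
  static int nextPrimeCapacity(int currentCapacity) {
    int candidate = 2 * currentCapacity + 1;  // first odd number above double
    while (!isPrime(candidate)) {
      candidate += 2;  // even numbers > 2 are never prime
    }
    return candidate;
  }

  private static boolean isPrime(int n) {
    if (n < 2) {
      return false;
    }
    for (int i = 2; (long) i * i <= n; i++) {
      if (n % i == 0) {
        return false;
      }
    }
    return true;
  }
}
```

Whether you store `LinkedList` chains, `ArrayList` chains, or (for open addressing) entries directly in the array is exactly the design decision left to you.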
You can use the provided unit tests to ensure your implementation is correct. Feel free to add more tests to `MapTest`, but do not modify the provided test cases.
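For instance, one worthwhile extra test forces several rehashes and checks that no mapping is lost. Here is a sketch; `createMap` is a hypothetical factory, so adapt it to however `MapTest` constructs the map under test:

```java
// Hypothetical extra test: 1000 insertions trigger multiple rehashes
// (starting from capacity 5), and every mapping must survive them.
@Test
public void insertManyKeysSurvivesRehashing() {
  Map<String, String> map = createMap();  // hypothetical factory method
  for (int i = 0; i < 1000; i++) {
    map.insert("key" + i, "value" + i);
  }
  assertEquals(1000, map.size());
  for (int i = 0; i < 1000; i++) {
    assertTrue(map.has("key" + i));
    assertEquals("value" + i, map.get("key" + i));
  }
}
```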
Profile Hash Table Performance
For your second task, you'll profile the performance of your hash tables. To that end, you'll use your hash tables to power a search engine. We'll guide you through this process in the sections that follow. Please take the time to read carefully until you reach the profiling section.
JHUgle Search Engine
We've developed a simple search engine called JHUgle, which can process text files containing lists of websites and their corresponding keywords. JHUgle uses this data to create an index that maps each keyword to a collection of URLs where it appears. Once the index is created, you can run the search engine and query it.
Try out our JHUgle program by running `Driver.main` in the `hw5` package. If you're on a Windows machine, please open `Config.java` and follow the instructions marked with a `TODO` before running the driver program.
=== JHUgle Search Engine Running! ===
>>> Add your query below >>>
>
At the prompt (shown as `>` above), enter `red` and then enter `?`, as shown below. Notice the program responds with a list of URLs.
> red
> ?
https://en.wikipedia.org/wiki/Cat
http://www.foobar.com/baz
http://www.cars.com/
>
Enter `!` to stop the search engine.
Data files
You can control which data file is fed to JHUgle through `Config.DATA_FILENAME`:
public class Config {
// Update this to any other data file for benchmarking experiments.
public static String DATA_FILENAME = "urls.txt";
// Other members are not shown to save space!
}
There are several data files in `src/main/resources`:
src/main/resources
apache.txt (large-ish)
jhu.txt (small)
joanne.txt (small)
newegg.txt (large-ish)
random164.txt (big!)
urls.txt (tiny)
These files can be many lines long, but a very short example, named `urls.txt`, looks as follows:
http://www.cars.com/
car red truck wheel blue fast brown silver gray
https://en.wikipedia.org/wiki/Cat
cat feline small fur red brown fast gray
http://www.foobar.com/baz
foo bar baz lorem ipsum red blue yellow silver
Notice how each URL is followed by a list of space-separated keywords. The idea is to associate each URL with a set of keywords. In turn, you can search and retrieve all the URLs that were associated with a keyword (or a combination of keywords).
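To illustrate the indexing described above, here is a minimal sketch that folds a file in this format into a keyword-to-URLs map. This is our illustration only: the actual indexing is done by the provided `SimpleSearchEngine`, which will use your `Map` implementation rather than Java's `HashMap`:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch: lines alternate URL, keywords; map each keyword to its URLs.
static Map<String, Set<String>> buildIndex(Path dataFile) throws IOException {
  Map<String, Set<String>> index = new HashMap<>();
  List<String> lines = Files.readAllLines(dataFile);
  for (int i = 0; i + 1 < lines.size(); i += 2) {
    String url = lines.get(i).trim();                          // a URL line
    String[] keywords = lines.get(i + 1).trim().split("\\s+"); // its keyword line
    for (String keyword : keywords) {
      index.computeIfAbsent(keyword, k -> new HashSet<>()).add(url);
    }
  }
  return index;
}
```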
Queries
The ultimate goal of JHUgle is to enable users to query for URLs that contain specific sets of words. These word sets are specified using logical expressions. For example, when searching based on the `urls.txt` data file, if you ask for pages containing "red AND fast", JHUgle should return the first two URLs, excluding the third one since it doesn't include "fast" even though it contains "red".
Similarly, if you query for "car OR baz", the second URL would be omitted because it lacks both of these words. Note that search queries can involve multiple operands (words) and operators (AND, OR), making JHUgle a powerful tool for finding relevant URLs.
To make it easier to enter queries, we'll use a simple notation called postfix (where the operator follows its operands). We'll also use symbols instead of words for the operations, so that the search engine can distinguish query words from operators. For example, the two queries above would be entered as:
- `red fast &&` (logical AND)
- `car baz ||` (logical OR)
When you run JHUgle, it prompts you with a `>` symbol to enter your query one word or logical operation at a time. Keep in mind that the input is cumulative and builds up into a single long query.
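To see how such a cumulative postfix query can be evaluated, here is a sketch using a stack of URL sets. This is our illustration; the hidden `SimpleSearchEngine` may work differently, and `index` stands for the keyword-to-URLs map:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Each keyword pushes its URL set; "&&" and "||" pop two sets and push
// their intersection or union, so "red fast &&" leaves one set: the URLs
// containing both "red" and "fast".
static Set<String> evaluateToken(Map<String, Set<String>> index,
                                 Deque<Set<String>> stack, String token) {
  if (token.equals("&&") || token.equals("||")) {
    Set<String> right = stack.pop();
    Set<String> left = stack.pop();
    Set<String> result = new HashSet<>(left);
    if (token.equals("&&")) {
      result.retainAll(right);  // AND = set intersection
    } else {
      result.addAll(right);     // OR = set union
    }
    stack.push(result);
  } else {
    // A plain keyword pushes its URL set (empty if the word is unknown).
    stack.push(index.getOrDefault(token, new HashSet<>()));
  }
  return stack.peek();
}
```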
Two more operations are implemented in JHUgle:
- The `?` operation: When you enter this command, JHUgle prints the URLs corresponding to your most recent query expression (based on a single keyword input or an operation result). Each URL is displayed on its own line with no additional formatting.
- The `!` operation: To exit JHUgle, simply type `!`. The program quits without producing any further output.
Here is a sample run that showcases the full functionality of the program, using the `urls.txt` input file as a reference. The results demonstrate all aspects of the program's operation.
=== JHUgle Search Engine Running! ===
>>> Add your query below >>>
> ?
> baz
> red
> ?
https://en.wikipedia.org/wiki/Cat
http://www.foobar.com/baz
http://www.cars.com/
> &&
> ?
http://www.foobar.com/baz
> !
Process finished with exit code 0
In this snippet, `>` is the prompt JHUgle prints for the user. No output is printed in response to the initial `?` operation because no query data has been entered yet. The second `?` displays the URLs that contain the word "red". The third `?` yields the URLs that contain both "baz" and "red", the result of the logical `&&` operation.
The JHUgle source code is hidden
Open the `Jhugle.java` file and notice it is a simple wrapper class around `SimpleSearchEngine`.
public class Jhugle {
// A supremely simple search engine based on a hash table.
private static SimpleSearchEngine sse;
}
The `SimpleSearchEngine` class is provided through the obfuscated JAR file `hw5lib.jar`. It is imported as follows:
import obfuscate.SimpleSearchEngine;
Notice the constructor of `Jhugle` takes in a `Map`:
/**
* Construct a JHUgle search engine.
*
* @param map any implementation of Map.
*/
public Jhugle(Map<String, Set<String>> map) {
sse = new SimpleSearchEngine(map);
}
This implementation follows a design pattern known as dependency injection. It allows clients of `Jhugle` to change which implementation of the Map ADT is used in the (simple) search engine.
Here is how the `Driver` program provides a Map to the constructor of `Jhugle`:
Jhugle jhUgle = new Jhugle(Config.getMap());
Notice `getMap()` is a public method in the `Config.java` file:
/**
* Instantiate a Map to be used for JHUgle's search engine.
*
* @return an implementation of Map ADT.
*/
public static Map getMap() {
// Update this to any other implementation of Map for benchmarking experiments.
return new JdkHashMap();
}
Currently, it returns a `JdkHashMap` (our wrapper around the HashMap built into Java). You can update this to any other implementation of Map (`ChainingHashMap` or `OpenAddressingHashMap`) for profiling.
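For example, switching the benchmarks over to your chaining implementation is a one-line change (assuming `ChainingHashMap`, like `JdkHashMap`, can be constructed with no arguments):

```java
public static Map getMap() {
  // Swapped in for benchmarking experiments; restore JdkHashMap as needed.
  return new ChainingHashMap<>();
}
```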
The Driver allows you to experiment with different data files as well! However, you must not make any direct changes to the Driver program. Instead, update the `DATA_FILENAME` field in the `Config.java` file.
Profiling
Open the file `src/test/hw5/JmhRuntimeTest.java` to run a benchmark test that profiles the performance of building JHUgle with different data files and hash tables. This program not only measures time but also tracks memory usage (space). The test relies on `Config.getMap` for dependency injection into the JHUgle program.
Important: Do not run the `buildSearchEngine` method directly by clicking the play button next to it. Instead, run the `main` method, which attaches a garbage collector profiler (`GcProfiler`) to this benchmark.
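For context, the wiring in `main` typically looks something like the following JMH boilerplate. This is a sketch only; the actual options in the starter code may differ, and `GcProfiler` is the profiler class mentioned above (import it from wherever the starter code defines it):

```java
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public static void main(String[] args) throws Exception {
  Options options = new OptionsBuilder()
      .include(JmhRuntimeTest.class.getSimpleName())  // run this benchmark class
      .addProfiler(GcProfiler.class)                  // attach the GC (space) profiler
      .build();
  new Runner(options).run();
}
```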
When you run the `JmhRuntimeTest.main` method, it may take several minutes and produce a large amount of output. At the end, you should see a report summarizing the results, such as the one shown below (note that the actual results will vary depending on your computer).
Benchmark (fileName) Mode Cnt Score Error Units
JmhRuntimeTest.buildSearchEngine apache.txt avgt 2 263.017 ms/op
JmhRuntimeTest.buildSearchEngine:+c2k.gc.alloc.rate apache.txt avgt 2 NaN MB/sec
JmhRuntimeTest.buildSearchEngine:+c2k.gc.churn.G1_Eden_Space apache.txt avgt 2 876.189 MB/sec
JmhRuntimeTest.buildSearchEngine:+c2k.gc.churn.G1_Eden_Space.norm apache.txt avgt 2 255234506.796 B/op
JmhRuntimeTest.buildSearchEngine:+c2k.gc.churn.G1_Old_Gen apache.txt avgt 2 0.392 MB/sec
JmhRuntimeTest.buildSearchEngine:+c2k.gc.churn.G1_Old_Gen.norm apache.txt avgt 2 114593.563 B/op
JmhRuntimeTest.buildSearchEngine:+c2k.gc.churn.G1_Survivor_Space apache.txt avgt 2 11.503 MB/sec
JmhRuntimeTest.buildSearchEngine:+c2k.gc.churn.G1_Survivor_Space.norm apache.txt avgt 2 3349146.084 B/op
JmhRuntimeTest.buildSearchEngine:+c2k.gc.count apache.txt avgt 2 22.000 counts
JmhRuntimeTest.buildSearchEngine:+c2k.gc.maximumCommittedAfterGc apache.txt avgt 2 792264704.000 bytes
JmhRuntimeTest.buildSearchEngine:+c2k.gc.maximumUsedAfterGc apache.txt avgt 2 152084192.000 bytes
JmhRuntimeTest.buildSearchEngine:+c2k.gc.time apache.txt avgt 2 134.000 ms
JmhRuntimeTest.buildSearchEngine jhu.txt avgt 2 0.356 ms/op
JmhRuntimeTest.buildSearchEngine:+c2k.gc.alloc.rate jhu.txt avgt 2 NaN MB/sec
JmhRuntimeTest.buildSearchEngine:+c2k.gc.churn.G1_Eden_Space jhu.txt avgt 2 1142.128 MB/sec
JmhRuntimeTest.buildSearchEngine:+c2k.gc.churn.G1_Eden_Space.norm jhu.txt avgt 2 450302.027 B/op
JmhRuntimeTest.buildSearchEngine:+c2k.gc.churn.G1_Old_Gen jhu.txt avgt 2 0.004 MB/sec
JmhRuntimeTest.buildSearchEngine:+c2k.gc.churn.G1_Old_Gen.norm jhu.txt avgt 2 1.639 B/op
JmhRuntimeTest.buildSearchEngine:+c2k.gc.count jhu.txt avgt 2 66.000 counts
JmhRuntimeTest.buildSearchEngine:+c2k.gc.maximumCommittedAfterGc jhu.txt avgt 2 346357760.000 bytes
JmhRuntimeTest.buildSearchEngine:+c2k.gc.maximumUsedAfterGc jhu.txt avgt 2 25008220.000 bytes
JmhRuntimeTest.buildSearchEngine:+c2k.gc.time jhu.txt avgt 2 25.500 ms
JmhRuntimeTest.buildSearchEngine joanne.txt avgt 2 0.133 ms/op
JmhRuntimeTest.buildSearchEngine:+c2k.gc.alloc.rate joanne.txt avgt 2 NaN MB/sec
JmhRuntimeTest.buildSearchEngine:+c2k.gc.churn.G1_Eden_Space joanne.txt avgt 2 1149.724 MB/sec
JmhRuntimeTest.buildSearchEngine:+c2k.gc.churn.G1_Eden_Space.norm joanne.txt avgt 2 169090.703 B/op
JmhRuntimeTest.buildSearchEngine:+c2k.gc.churn.G1_Old_Gen joanne.txt avgt 2 0.016 MB/sec
JmhRuntimeTest.buildSearchEngine:+c2k.gc.churn.G1_Old_Gen.norm joanne.txt avgt 2 2.378 B/op
JmhRuntimeTest.buildSearchEngine:+c2k.gc.count joanne.txt avgt 2 55.000 counts
JmhRuntimeTest.buildSearchEngine:+c2k.gc.maximumCommittedAfterGc joanne.txt avgt 2 411271168.000 bytes
JmhRuntimeTest.buildSearchEngine:+c2k.gc.maximumUsedAfterGc joanne.txt avgt 2 24917504.000 bytes
JmhRuntimeTest.buildSearchEngine:+c2k.gc.time joanne.txt avgt 2 23.500 ms
JmhRuntimeTest.buildSearchEngine newegg.txt avgt 2 138.387 ms/op
JmhRuntimeTest.buildSearchEngine:+c2k.gc.alloc.rate newegg.txt avgt 2 NaN MB/sec
JmhRuntimeTest.buildSearchEngine:+c2k.gc.churn.G1_Eden_Space newegg.txt avgt 2 876.438 MB/sec
JmhRuntimeTest.buildSearchEngine:+c2k.gc.churn.G1_Eden_Space.norm newegg.txt avgt 2 134330890.763 B/op
JmhRuntimeTest.buildSearchEngine:+c2k.gc.churn.G1_Old_Gen newegg.txt avgt 2 0.343 MB/sec
JmhRuntimeTest.buildSearchEngine:+c2k.gc.churn.G1_Old_Gen.norm newegg.txt avgt 2 52606.947 B/op
JmhRuntimeTest.buildSearchEngine:+c2k.gc.churn.G1_Survivor_Space newegg.txt avgt 2 4.779 MB/sec
JmhRuntimeTest.buildSearchEngine:+c2k.gc.churn.G1_Survivor_Space.norm newegg.txt avgt 2 732113.874 B/op
JmhRuntimeTest.buildSearchEngine:+c2k.gc.count newegg.txt avgt 2 25.500 counts
JmhRuntimeTest.buildSearchEngine:+c2k.gc.maximumCommittedAfterGc newegg.txt avgt 2 681345024.000 bytes
JmhRuntimeTest.buildSearchEngine:+c2k.gc.maximumUsedAfterGc newegg.txt avgt 2 77338920.000 bytes
JmhRuntimeTest.buildSearchEngine:+c2k.gc.time newegg.txt avgt 2 102.000 ms
JmhRuntimeTest.buildSearchEngine random164.txt avgt 2 812.268 ms/op
JmhRuntimeTest.buildSearchEngine:+c2k.gc.alloc.rate random164.txt avgt 2 NaN MB/sec
JmhRuntimeTest.buildSearchEngine:+c2k.gc.churn.G1_Eden_Space random164.txt avgt 2 570.721 MB/sec
JmhRuntimeTest.buildSearchEngine:+c2k.gc.churn.G1_Eden_Space.norm random164.txt avgt 2 510282020.571 B/op
JmhRuntimeTest.buildSearchEngine:+c2k.gc.churn.G1_Survivor_Space random164.txt avgt 2 14.406 MB/sec
JmhRuntimeTest.buildSearchEngine:+c2k.gc.churn.G1_Survivor_Space.norm random164.txt avgt 2 13169615.238 B/op
JmhRuntimeTest.buildSearchEngine:+c2k.gc.count random164.txt avgt 2 13.000 counts
JmhRuntimeTest.buildSearchEngine:+c2k.gc.maximumCommittedAfterGc random164.txt avgt 2 3645702144.000 bytes
JmhRuntimeTest.buildSearchEngine:+c2k.gc.maximumUsedAfterGc random164.txt avgt 2 1746337820.000 bytes
JmhRuntimeTest.buildSearchEngine:+c2k.gc.time random164.txt avgt 2 1202.000 ms
JmhRuntimeTest.buildSearchEngine urls.txt avgt 2 0.037 ms/op
JmhRuntimeTest.buildSearchEngine:+c2k.gc.alloc.rate urls.txt avgt 2 NaN MB/sec
JmhRuntimeTest.buildSearchEngine:+c2k.gc.churn.G1_Eden_Space urls.txt avgt 2 668.453 MB/sec
JmhRuntimeTest.buildSearchEngine:+c2k.gc.churn.G1_Eden_Space.norm urls.txt avgt 2 27033.600 B/op
JmhRuntimeTest.buildSearchEngine:+c2k.gc.churn.G1_Old_Gen urls.txt avgt 2 0.004 MB/sec
JmhRuntimeTest.buildSearchEngine:+c2k.gc.churn.G1_Old_Gen.norm urls.txt avgt 2 0.156 B/op
JmhRuntimeTest.buildSearchEngine:+c2k.gc.count urls.txt avgt 2 46.500 counts
JmhRuntimeTest.buildSearchEngine:+c2k.gc.maximumCommittedAfterGc urls.txt avgt 2 291635200.000 bytes
JmhRuntimeTest.buildSearchEngine:+c2k.gc.maximumUsedAfterGc urls.txt avgt 2 24581236.000 bytes
JmhRuntimeTest.buildSearchEngine:+c2k.gc.time urls.txt avgt 2 17.000 ms
While there are many metrics displayed here, don't worry too much about most of them! Instead, focus on two key metrics: time and space (specifically, `maximumUsedAfterGc`).
The time it takes to build our search engine using a `JdkHashMap` is reflected in the following rows:
Benchmark (fileName) Mode Cnt Score Error Units
JmhRuntimeTest.buildSearchEngine apache.txt avgt 2 263.017 ms/op
JmhRuntimeTest.buildSearchEngine jhu.txt avgt 2 0.356 ms/op
JmhRuntimeTest.buildSearchEngine joanne.txt avgt 2 0.133 ms/op
JmhRuntimeTest.buildSearchEngine newegg.txt avgt 2 138.387 ms/op
JmhRuntimeTest.buildSearchEngine random164.txt avgt 2 812.268 ms/op
JmhRuntimeTest.buildSearchEngine urls.txt avgt 2 0.037 ms/op
And the space it takes to build the search engine:
Benchmark (fileName) Mode Cnt Score Error Units
JmhRuntimeTest.buildSearchEngine:+c2k.gc.maximumUsedAfterGc apache.txt avgt 2 152084192.000 bytes
JmhRuntimeTest.buildSearchEngine:+c2k.gc.maximumUsedAfterGc jhu.txt avgt 2 25008220.000 bytes
JmhRuntimeTest.buildSearchEngine:+c2k.gc.maximumUsedAfterGc joanne.txt avgt 2 24917504.000 bytes
JmhRuntimeTest.buildSearchEngine:+c2k.gc.maximumUsedAfterGc newegg.txt avgt 2 77338920.000 bytes
JmhRuntimeTest.buildSearchEngine:+c2k.gc.maximumUsedAfterGc random164.txt avgt 2 1746337820.000 bytes
JmhRuntimeTest.buildSearchEngine:+c2k.gc.maximumUsedAfterGc urls.txt avgt 2 24581236.000 bytes
Discussion
Use your `README` file to explain why one of your implementations (`OpenAddressingHashMap` or `ChainingHashMap`) is the better choice for the search engine. Describe the approaches you tried to implement each hash table, such as probing strategies and rehashing strategies, and why you chose them. Note any specific tweaks that improved or worsened performance, and record failed or abandoned attempts, if any.
In summary, describe all the different ways you developed, evaluated, and improved your hash tables. Include all the benchmarking data, results, and analysis that contributed to your final decision on which implementation to use for the search engine.
Moreover, explain why you chose your final Map implementation as the best one. Provide an analysis of your benchmark data and your conclusions. What results were surprising, and which were expected?
Submission
In preparation, you must (1) run Checkstyle to ensure your code is compliant (if not, fix the styling issues), (2) create a zip file containing the `src` folder (and everything in it), and (3) submit the zip file to Gradescope.
You can submit as many times as you like until the deadline. We will only grade the last submission.
Rubric
Teaching assistants will evaluate your submission using the following criteria:
Please note, we reserve the right to change this rubric without advance notice, even after the submission deadline. However, the chances of doing so are slim, and in any case, any change would not be significant.
- Spec 1: The `OpenAddressingHashMap` constructor is correctly and efficiently implemented. [1 point]
- Spec 2: The implementation of `OpenAddressingHashMap.insert` is correct and efficient, demonstrating a thorough comprehension of the open addressing technique and its operational intricacies. [3 points]
- Spec 3: The implementation of `OpenAddressingHashMap.remove` is correct and efficient, demonstrating a thorough comprehension of the open addressing technique and its operational intricacies. [2 points]
- Spec 4: The implementation of `OpenAddressingHashMap.put` is correct and efficient, allowing the value associated with a key to be updated. [1 point]
- Spec 5: The implementations of the `get` and `has` operations in `OpenAddressingHashMap` are correct and efficient, ensuring fast lookups and membership tests. [1 point]
- Spec 6: The implementation of `OpenAddressingHashMap.size` is correct and efficient. [0.5 points]
- Spec 7: The iterator of `OpenAddressingHashMap` is implemented correctly and efficiently. [2 points]
- Spec 8: The `ChainingHashMap` constructor is correctly and efficiently implemented. [1 point]
- Spec 9: The implementation of `ChainingHashMap.insert` is correct and efficient, demonstrating a thorough comprehension of the chaining technique and its operational intricacies. [2 points]
- Spec 10: The implementation of `ChainingHashMap.remove` is correct and efficient, demonstrating a thorough comprehension of the chaining technique and its operational intricacies. [2 points]
- Spec 11: The implementation of `ChainingHashMap.put` is correct and efficient, allowing the value associated with a key to be updated. [1 point]
- Spec 12: The implementations of the `get` and `has` operations in `ChainingHashMap` are correct and efficient, ensuring fast lookups and membership tests. [1 point]
- Spec 13: The implementation of `ChainingHashMap.size` is correct and efficient. [0.5 points]
- Spec 14: The iterator of `ChainingHashMap` is implemented correctly and efficiently. [2 points]
- Spec 15: The submission contains (in the `README.md` file) a description of the different ways the student developed, evaluated, and improved their hash tables. [1 point]
- Spec 16: The submission contains (in the `README.md` file) answers to the discussion question about benchmarking. The points made in the answer are adequate and backed by experimental data. [1 point]
- Spec 17: The code is well-organized, readable, and follows coding standards. [1 point]
The total score for this homework is 23 points.