Homework 3: Sorting & Searching
In this homework, you will implement and analyze a few sorting and searching algorithms.
Objectives
- Demonstrate the ability to design and implement a data structure that tracks operations performed on an underlying collection.
- Apply algorithmic thinking to develop an efficient sorting solution, considering optimization techniques to minimize computational resources.
- Develop problem-solving skills by implementing heuristics to improve the performance of linear search and analyzing their effectiveness.
- Evaluate and compare the performance of various sorting algorithms, drawing conclusions about their strengths and weaknesses.
- Utilize profiling tools (JMH) to analyze the efficiency of Set implementations and identify areas for improvement.
- Demonstrate understanding of coding standards and style guidelines by consistently applying them throughout the project.
Important Dates
Event | Day | Date | Time |
---|---|---|---|
Release | Thu | Sep 26 | 09:00 AM ET |
Due | Thu | Oct 03 | 05:00 PM ET |
Homework Description
In this assignment, you will implement the Insertion Sort algorithm with the "minimizing swaps" optimization. Additionally, you will develop another implementation of the IndexedList
interface to analyze the number of array operations performed by various sorting algorithms, allowing for in-depth analysis of the performance of different sorting techniques using a custom driver program.
You will also develop and incorporate two heuristics - "move-to-front" and "transpose-sequential-search" - to enhance the efficiency of the "find" operation in implementations of the Set ADT. To evaluate the effectiveness of these heuristics, you will utilize Java Microbenchmark Harness (JMH) to profile the performance of the Set implementations provided and compare it with that of the Set implementation incorporating the heuristics.
Note, you are not allowed to use (1) Java Collections (built-in data structures), (2) data structures that are not covered in the course yet.
Starter Code
Download the starter code from the following link: hw3.zip
The starter code is a zip file named hw3-starter.zip
. You must unzip this file. After unzipping, you will get a folder called hw3-starter
. Open this folder as an IntelliJ project.
Tasks
This assignment consists of four main tasks:
- Implementing a
MeasuredIndexedList
class that tracks the number of access and mutation operations performed on anIndexedList
. - Implementing the Insertion Sort algorithm with the "minimizing swaps" optimization.
- Implementing the "move-to-front" and "transpose-sequential-search" heuristics to improve the performance of linear search.
- Answering discussion questions related to the design of the
MeasuredIndexedList
, the performance of sorting algorithms, the analysis of the descending selection sort algorithm, and the effectiveness of the heuristics in the Set ADT.
The Measured IndexedList
Your first task for this assignment is to develop a new kind of IndexedList
implementation that keeps track of how many access and mutate operations have been performed on it. It also counts the number of occurrences of a particular value in the IndexedList
. Check out the Measured
interface first (it is inside the sort
package), reproduced here in compressed form:
public interface Measured<T> {
void reset();
int accesses();
int mutations();
int count(T t);
}
This describes what we expect of an object that can collect statistics about itself. For example, after a Measured
object has been "in use" for a while, we can check how many access and mutate operations it has been asked to perform, through the accesses
and mutations
methods, respectively. We can also tell it to "forget" what has happened before and start counting both kinds of operations from zero again using the reset
method.
-
Accessor methods include any that are looking up data stored in the IndexedList. This should not include
length()
which is meta-data about the IndexedList. -
Mutator methods include any that change the data stored in the IndexedList (after creation/initialization).
- Consider a freshly constructed Measured object. It would return $0$ for both
accesses
andmutations
. - Now, imagine we call the
length
operation followed by three calls to theget
operation. At this point, our object would return $3$ foraccesses
but still $0$ formutations
. - If we now call the
put
operation twice, the object would return $2$ formutations
but still $3$ foraccesses
. (You don't have to check whether a put operation actually changes the value or not since that's not how the put operation was initially written.)
- Consider a freshly constructed Measured object. It would return $0$ for both
-
The reset operation should set both the number of accesses and mutations back to $0$.
-
The
count
method returns the number of occurrences of the parameter value. Since it will always need to inspect every value in the array, it should also naturally update theaccesses
value accordingly.
Testing MeasuredIndexedList
You must write unit tests for Measured IndexedList. Your focus should be on the "Measured" aspect of it (i.e., reset
, accesses
, mutations
, count
). As you test these operations, naturally, you will need to call IndexedList
interface methods (i.e., put
, get
, length
) to trigger the various possible outcomes.
The MeasuredIndexedListTest.java
is where you need to add unit tests. We provide a skeleton code, with a few tests, as part of the starter. However, you need to add many more tests in there.
Implementing MeasuredIndexedList
You need to develop a class MeasuredIndexedList
that extends our dear old ArrayIndexedList
(already provided in the starter code) and implements the Measured
interface; yes, both at the same time. When a MeasuredIndexedList
is created, you initialize internal counters to keep track of the number of access and mutate operations it has been asked to perform so far; both counts start at zero. You will need to override the accessor and mutator methods of the class so that the relevant counter is incremented each time that type of operation succeeds (i.e., is executed fully, given valid parameter values).
Hints:
-
Make sure you make proper use of
super
. You should not be duplicating any code fromArrayIndexedList
and should be making good use of inheritance. -
Pay attention to where and when you are calling
super
methods and how that might/should affect the counts. For example, with theArrayIndexedList
, we report the number of successful reads and writes. So if some operation causes an exception, that operation shouldn't affect the statistics. -
Don't forget that your constructor for
MeasuredIndexedList
will also have to invoke theArrayIndexedList
constructor! However, this operation is neither an accessor nor a mutator.
Sorting
Your second task for this homework is to explore some of the basic sorting algorithms and their analysis. These algorithms are quadratic in terms of their asymptotic performance, but they differ in their actual performance.
We'll focus on the following three algorithms:
- Bubble Sort (with the "stop early if no swaps" optimization)
- Selection Sort
- Insertion Sort (with the "minimizing swaps" optimization)
The provided files (inside the sort
package) contain a basic framework for evaluating sorting algorithms. You'll need a working MeasuredIndexedList
class, and you'll need to understand the following interface as well (again compressed):
public interface SortingAlgorithm<T extends Comparable<T>> {
void sort(IndexedList<T> list);
String name();
}
Let's look at this interface
-
Given an object of
SortingAlgorithm
, we can (a) ask it to sort a givenIndexedList
and (b) ask it for its name (e.g., "Insertion Sort"). -
The use of
extends
inside the angle brackets means that any typeT
we want to sort must implement the interfaceComparable
as well. It obviously can't just be any old type; it must be a type for which the expression "a is less than b" actually makes sense. UsingComparable
in this form is Java's way of saying that we can order the objects; we've covered this related lecture notes (you should read up the details).
Implement Insertion Sort
You must implement Insertion Sort in the provided InsertionSort.java
file. Note that the InsertionSort
class implements the SortingAlgorithm
.
Implement the Insertion Sort with the "minimizing swaps" optimization.
Minimizing swaps: Visit the Insertion Sort entry on Wikipedia. In the section on its algorithm, find the passage that reads as "After expanding the swap operation in-place as ..., a slightly faster version can be produced that moves A[i]
to its position in one go and only performs one assignment in the inner loop body." You should base your implementation off (what you learned in the lecture and) the pseudo-code which follows the passage above.
We expect you to understand the optimization well enough to implement it by reading the Wikipedia entry. Therefore, please refrain from asking questions from the teaching staff on how the optimization works or whether you have implemented it correctly. Figuring it out on your own is part of the homework!
We also provide a working GnomeSort
implementation of SortingAlgorithm
, a working BubbleSort
and SelectionSort
, and a NullSort
that doesn't do anything. GnomeSort
is an intentionally inefficient sorting algorithm that we can use for comparison.
Hacking Linear Search
For this part of the homework, you will implement the "move-to-front" and "transpose-sequential-search" heuristics to improve the performance of the "find" operation in an implementation of the Set ADT. These heuristics are described in the lecture notes on Set ADT.
These heuristics are implemented as two new classes, MoveToFrontLinkedSet
and TransposeArraySet
, respectively. Both classes implement the Set
interface. The MoveToFrontLinkedSet
class uses a linked list to store the elements, while the TransposeArraySet
class uses an array. Both classes should use the LinkedSet
and ArraySet
classes from the set
package as their superclasses.
Important Notes:
- You will need to override relevant method(s) from LinkedSet/ArraySet, respectively. (The changes must be kept as minimal as possible; do you need to override any of the core operations or the helper methods?)
- You many not make any changes to
Set.java
,LinkedSet.java
andArraySet.java
files. - You may add as many private fields/methods as you need to
MoveToFrontLinkedSet
andTransposeArraySet
. However, you may not add any public members in there.
Once you are done, run the tests in SetTest
for MoveToFrontLinkedSet
and TransposeArraySet
. Make sure they all pass as a minimal assurance that your implementation does not break the code contract of the Set ADT.
Next, add a main
method to each of MoveToFrontLinkedSet
and TransposeArraySet
. Inside the main
method, add a sequence of operations to "insert", "remove" and "search" (use has
operation) for some sample values. It is entirely up to you what values and what sequence of operations you use. The goal is to observe the behavior of the heuristics in action.
Important Note: Your implementation of the heuristics must adhere to the specification of each and work as expected when the
has
operation is called. If not, you must debug and resolve the issue. However, do not update the implementation of heuristics nor howremove
andinsert
work if you find the heuristics don't play well with those methods.
Discussion
In the final part of this homework, you will write a five-part discussion in the README.md
file. The discussion will cover the design of the MeasuredIndexedList
, the performance of sorting algorithms, the analysis of the descending selection sort algorithm, the effectiveness of the heuristics in the Set ADT, and the performance of the Set implementations using JMH.
Part I
In the README.md
file, under Part I, discuss from a design perspective whether or not iterating over a MeasuredIndexedList
should affect the accesses and mutation counts.
Note that you are not asked to change the ArrayIndexedListIterator
. However, if you wanted to include the next()
and hasNext()
methods in the statistics measured, can you inherit from ArrayIndexedList
and override the relevant methods, or not? Explain.
In answering this question, please refrain from getting help from the teaching staff. We want you to figure it out on your own.
Part II
We provide a SortingAlgorithmDriver.java
file that runs the various sorting algorithms on several input data and reports on its statistics. You are expected to run this driver program and analyze the results.
Important Notes:
-
You can only run
SortingAlgorithmDriver.java
after implementingMeasuredIndexedList
. Otherwise, by running it, you will get aNullPointerException
. -
Make sure you store the homework files in a location on your computer where the path to it does not contain space or special characters.
-
Make sure the
main/resources
folder is marked as the project "resources." (See the documentation here.) This should be the case if you open the project starter correctly.
The following is an example output from the driver program (note InsertionSort
was not implemented yet):
Data file Algorithm Sorted? Accesses Mutations Seconds
ascending.data Null Sort false 0 0 0.000004
ascending.data Gnome Sort true 15,230,058 5,074,020 0.305117
ascending.data Selection Sort true 16,003,980 7,980 0.191780
ascending.data Bubble Sort true 12,055,014 5,074,020 0.231695
ascending.data Insertion Sort false 0 0 0.000002
descending.data Null Sort false 0 0 0.000002
descending.data Gnome Sort true 47,988,000 15,996,000 0.232058
descending.data Selection Sort true 16,000,000 4,000 0.095581
descending.data Bubble Sort true 31,992,000 15,996,000 0.181071
descending.data Insertion Sort false 0 0 0.000002
random.data Null Sort false 0 0 0.000002
random.data Gnome Sort true 24,145,478 8,045,828 0.117493
random.data Selection Sort true 16,003,992 7,992 0.083146
random.data Bubble Sort true 24,019,776 8,045,828 0.174053
random.data Insertion Sort false 0 0 0.000002
The driver is configured to read the first 4000 entries from each data file (ascending.data
, descending.data
, random.data
) and sort them using various sorting algorithms. These data files are available in the src/main/resources
folder.
The report includes a "Sorted?" column that indicates whether the array was sorted correctly. Additionally, it reports how many operations of the underlying MeasuredIndexedList
were used to perform the sort ("Accesses" and "Mutations" columns). Finally, the report also indicates how long it took to sort the array ("Seconds" column). Please note that this number will vary widely across machines. It's hard to use time as an actual benchmark; you can only use it for relative comparisons on the device running the experiment.
You might want to run SortingAlgorithmDriver
several times and ignore outliers to compensate for variations in load. Feel free to change the value for SIZE
so you can experiment with more or fewer data values. Moreover, you can make your own data file and add them (their name) to getDataFiles
method, so the driver run the sorting algorithm on it.
Don't make any changes to the driver program other than changing the size of the data!
Once you have run the driver program several times, you should have a good idea of how the algorithms compare in terms of the number of operations they perform. Please note there is a discrepancy between the result from this experiment and what is expected from each sorting algorithm based on what we know about their implementation and performance. In other words, the numbers outputted by SortingAlgorithmDriver
(in some cases) don't match our expectations!
There is an intentional mistake in the way the earlier experiment is set up and implemented, resulting in the discrepancy mentioned above. The mistake may be in the code or data or both.
You must use the measurements to find the discrepancy and then use a debugger to step through the code/data to catch that mistake. Once you notice the mistake, you should describe it in the README.md
file under Part II. You should also describe the series of experiments you ran, what data you collected, and your conclusions about these algorithms' relative performance. Include concrete examples in your discussion (e.g., time measurements, data set, size, etc.).
Important Notes:
-
You are not asked to correct the mistake. In fact, you should not make any changes to
SortingAlgorithmDriver
or any of the data files. Instead, you should describe the error and how you have come to find it. -
Please refrain from checking with us on whether your finding is the answer we are looking for. You should discover this on your own without the help of the teaching staff. Moreover, avoid posting your hypothesis/finding on the course discussion board so that other students can make the same connection on their own. If you post a public note/question about this on the discussion board, you will get a penalty on your evaluation for this homework.
Part III
In Part III of the discussions, you will analyze the following descending selection sort algorithm by counting the number of RAM steps it takes and expressing that as a function of the size of the input (which is the length of the array to be sorted here). More specifically, you need to determine how many comparisons $C(n)$ and assignments $A(n)$ are performed by this implementation of selection sort in the worst-case. Both of these should be polynomials of degree 2 since you know that the asymptotic complexity of selection sort is $\Omicron(n^2)$.
You are asked to analyze the algorithm below, not the implementation in
SelectionSort.java
. You must reference the line numbers below in your writeup for this problem, so that we can follow your analysis when grading.
Here's the code:
1 public static void selectionSort(int[] a) {
2 int max, temp;
3 for (int i = 0; i < a.length - 1; i++) {
4 max = i;
5 for (int j = i + 1; j < a.length; j++) {
6 if (a[j] > a[max]) {
7 max = j;
8 }
9 }
10 temp = a[i];
11 a[i] = a[max];
12 a[max] = temp;
13 }
14 }
Don't just state the polynomials. Your writeup has to show how you derived them, line by line!
Part IV
Once you have implemented the MoveToFrontLinkedSet
and TransposeArraySet
classes, you should test them to ensure they work as expected. In particular, run the main
method we asked you to add to each class. This method should insert, remove, and search for some sample values. You should run the main
method in debug mode and execute the sequence of operations step by step. Use the debugger tool (or jGRASP/Java Visualizer plugins) to observe the state of the underlying array/linked list for each implementation. Make sure your implementation works as expected of each heuristics.
As you run your main method(s), take a closer look at the values after removal! How does the removal logic contribute to achieving the objective of these heuristics? The goal of these heuristics is to enhance the performance of the has
operation. Do you think the removal, combined with how the heuristics work, will help achieve this goal? Write your answer in the README.md
file.
Part V
In earlier parts of this homework, you've seen us measure the performance various sorting algorithms over a given dataset by calculating the elapsed time (in milliseconds). Unfortunately, this measure is usually not accurate even when run on the same machine. As you execute a Java program, a lot goes unnoticed to you, under the hood. From starting the Java Virtual Machine (JVM) to various hacks and optimizations to compete with the performance of compiled languages such as C and C++. (You can read this Wikipedia article on Java Performance to know more about it.) This overhead is often referred to as the "warm-up" time. For an accurate measure of time, you have to allow for warm-up, repeat the experiment, remove outliers, etc. The Java Microbenchmark Harness, or simply JMH, is program that helps you do just that. JMH is a set of tools introduced in OpenJDK 11. It is aimed to help with performance analysis (profiling) and benchmarking of Java programs. Similar to JUnit, the JMH provides several annotations which we can use to write code (test) for measuring performance.
Before continuing, please install IntelliJ Plugin for JMH Java Microbenchmark Harness.
In this part of the homework, you will use JMH to profile the performance of the Set
implementations we have provided. You will compare the performance of ArraySet
and LinkedSet
with the performance of MoveToFrontLinkedSet
and TransposeArraySet
. The goal is to answer the following question: Do heuristics improve linear search?
Notice the JmhRuntimeTest.java
file inside src/test/hw3/
. This file contains several methods that are annotated with annotations unfamiliar to you. Please ignore the annotations; we won't ask you to write such methods on your own. For this homework, you need to be able to run these methods. Notice you can run these methods individually (like unit tests, there must be a play icon next to each). You can also run all the methods in JmhRuntimeTest
; In IntelliJ, right-click on the file and select run in the Project panel. It must produce a lot of output (including, potentially some warnings in red). After several minutes, you will get a report similar to this:
Benchmark Mode Cnt Score Error Units
JmhRuntimeTest.arraySet avgt 2 0.385 ms/op
JmhRuntimeTest.linkedSet avgt 2 0.910 ms/op
JmhRuntimeTest.moveToFront avgt 2 0.892 ms/op
JmhRuntimeTest.transposeSequence avgt 2 0.411 ms/op
Don't compare the measurements presented above to what you will see when you run your code. The measures depend on the computer on which the code is executed. Also, at the time we conducted this experiment, the MoveToFrontLinkedSet
and TransposeArraySet
were not implemented! For your information, the measurements reported under "Score" are average elapsed time in milliseconds.
If you are unable to run
JmhRuntimeTest
, it is most likely that you have not installed the JDK as it was instructed in the course logistics. You may also have multiple versions of JDK installed, and IntelliJ is confused as to where to find JMH. You can try to fix this by going toFile -> Project Structure -> Project -> Project SDK
and selecting the correct JDK.
Notice each method that is annotated with @Benchmark
is designed so that it first instantiates an implementation of Set ADT. It then passes the set
object to the experiment
method. The experiment
method, as it stands now, simply adds all the values from the data
field into the set
argument passed to it. Next, the data
field is instantiated in the setUp
method. As the setUp
method stands now, it simply adds integers from $0$ to $999$ to data
, and then it shuffles it. The outcome is a random permutation of natural numbers smaller than $1000$.
Task: You must design an experiment in which the heuristics would improve the performance of Array/Linked Set.
You may not make any changes to the methods that are annotated with @Benchmark
. You may change the setUp
or the experiment
method (or both). If you need more parameters to control your experiment, you can add them as private fields to the JmhRuntimeTest
class.
In your README
, please describe the experiment and the setup which you have used. Please discuss the results from using the heuristic "improvements" to the Set
implementations. Were there any noticeable differences compared to the implementations we provided (ArraySet
and LinkedSet
)? If they didn't perform as expected, why do you think that was the case? Were your data sets big enough? Suppose you didn't see any measurable differences. What do you think it would take to measure the relative performance of each implementation?
Submission
In preparation, you must (1) run checkstyle to ensure your code is checkstyle compliant (if not, fix the styling issues). (2) Create a zip file containing the src
folder (and all therein). (3) Submit the zip file to Gradescope.
You can submit as many times as you like until deadline. We will only grade the last submission.
Rubric
Teaching assistants will evaluate your submission using the following criteria:
Please note, we reserve the right to change this rubric without advance notice, even after the submission deadline. However, the chances of doing so are slim, and at any rate, should not be a significant change.
-
Spec 1: The constructor of
MeasuredIndexedList
correctly sets both the number of accesses and mutations to 0 and initializes the underlyingArrayIndexedList
. [0.5 point] -
Spec 2: The
reset
method ofMeasuredIndexedList
correctly sets both the number of accesses and mutations back to 0. [0.5 point] -
Spec 3: The "access" counts are correctly updated for the
get
andlength
methods, and theaccesses
method ofMeasuredIndexedList
accurately returns the number of access operations performed. [1 point] -
Spec 4: The "mutate" counts are correctly updated for the
put
method, and themutations
method ofMeasuredIndexedList
accurately returns the number of mutate operations performed. [1 point] -
Spec 5: The
count
method ofMeasuredIndexedList
correctly returns the number of occurrences of the parameter value and updates theaccesses
count accordingly. [0.75 point] -
Spec 6: The
MeasuredIndexedListTest
class contains focused, appropriately named unit tests for thereset
,accesses
,mutations
, andcount
methods ofMeasuredIndexedList
. [1.5 points] -
Spec 7: The
MeasuredIndexedListTest
covers edge cases and potential exceptions for the measured operations. [0.75 point] -
Spec 8: The
InsertionSort
class correctly implements the Insertion Sort algorithm with the "minimizing swaps" optimization. [2 points] -
Spec 9: The
MoveToFrontLinkedSet
class correctly and efficiently implements the "move-to-front" heuristic for thefind
operation in a linked list based Set. The class extendsLinkedSet
and overrides only the necessary methods. [2 points] -
Spec 10: The
TransposeArraySet
class correctly and efficiently implements the "transpose-sequential-search" heuristic for thefind
operation in an array based Set. The class extendsArraySet
and overrides only the necessary methods. [2 points] -
Spec 11: The
main
method is added toMoveToFrontLinkedSet
andTransposeArraySet
to demonstrate the behavior of the heuristics with a sequence of insert, remove and search operations. [1 point] -
Spec 12: Part I of the discussion in
README.md
thoughtfully considers the design implications of tracking iterator operations inMeasuredIndexedList
. [1 point] -
Spec 13: Part II of the discussion in
README.md
describes the discrepancy found in theSortingAlgorithmDriver
results, the debugging process used to identify the mistake, and draws reasonable conclusions about the relative performance of the sorting algorithms based on corrected data. [3 points] -
Spec 14: Part III of the discussion in
README.md
correctly analyzes the provided descending selection sort algorithm, determining the number of comparisons $C(n)$ and assignments $A(n)$ performed in the worst-case, expressed as degree 2 polynomials. The derivation process is shown line by line. [2 points] -
Spec 15: Part IV of the discussion in
README.md
considers the effect of the removal logic inMoveToFrontLinkedSet
andTransposeArraySet
on the performance of thehas
operation, based on observations from running themain
methods in debug mode. [1 points] -
Spec 16: Part V of the discussion in
README.md
describes an experiment designed using JMH to measure the performance impact of the "move-to-front" and "transpose-sequential-search" heuristics compared to the baseArraySet
andLinkedSet
implementations. The setup is explained, results are discussed, and thoughtful analysis is provided. [2 points] -
Spec 17: The code is well-organized, easy to read, and complies with the provided checkstyle rules and coding standards. [1 point]
The total score for this homework is 23 points.