Homework 3: Sorting & Searching

In this homework, you will implement and analyze a few sorting and searching algorithms.

Objectives

  • Demonstrate the ability to design and implement a data structure that tracks operations performed on an underlying collection.
  • Apply algorithmic thinking to develop an efficient sorting solution, considering optimization techniques to minimize computational resources.
  • Develop problem-solving skills by implementing heuristics to improve the performance of linear search and analyzing their effectiveness.
  • Evaluate and compare the performance of various sorting algorithms, drawing conclusions about their strengths and weaknesses.
  • Utilize profiling tools (JMH) to analyze the efficiency of Set implementations and identify areas for improvement.
  • Demonstrate understanding of coding standards and style guidelines by consistently applying them throughout the project.

Important Dates

EventDayDateTime
ReleaseThuSep 2609:00 AM ET
DueThuOct 0305:00 PM ET

Homework Description

In this assignment, you will implement the Insertion Sort algorithm with the "minimizing swaps" optimization. Additionally, you will develop another implementation of the IndexedList interface to analyze the number of array operations performed by various sorting algorithms, allowing for in-depth analysis of the performance of different sorting techniques using a custom driver program.

You will also develop and incorporate two heuristics - "move-to-front" and "transpose-sequential-search" - to enhance the efficiency of the "find" operation in implementations of the Set ADT. To evaluate the effectiveness of these heuristics, you will utilize Java Microbenchmark Harness (JMH) to profile the performance of the Set implementations provided and compare it with that of the Set implementation incorporating the heuristics.

Note, you are not allowed to use (1) Java Collections (built-in data structures), (2) data structures that are not covered in the course yet.

Starter Code

Download the starter code from the following link: hw3.zip

The starter code is a zip file named hw3-starter.zip. You must unzip this file. After unzipping, you will get a folder called hw3-starter. Open this folder as an IntelliJ project.

Tasks

This assignment consists of four main tasks:

  1. Implementing a MeasuredIndexedList class that tracks the number of access and mutation operations performed on an IndexedList.
  2. Implementing the Insertion Sort algorithm with the "minimizing swaps" optimization.
  3. Implementing the "move-to-front" and "transpose-sequential-search" heuristics to improve the performance of linear search.
  4. Answering discussion questions related to the design of the MeasuredIndexedList, the performance of sorting algorithms, the analysis of the descending selection sort algorithm, and the effectiveness of the heuristics in the Set ADT.

The Measured IndexedList

Your first task for this assignment is to develop a new kind of IndexedList implementation that keeps track of how many access and mutate operations have been performed on it. It also counts the number of occurrences of a particular value in the IndexedList. Check out the Measured interface first (it is inside the sort package), reproduced here in compressed form:

public interface Measured<T> {
  void reset();
  int accesses();
  int mutations();
  int count(T t);
}

This describes what we expect of an object that can collect statistics about itself. For example, after a Measured object has been "in use" for a while, we can check how many access and mutate operations it has been asked to perform, through the accesses and mutations methods, respectively. We can also tell it to "forget" what has happened before and start counting both kinds of operations from zero again using the reset method.

  • Accessor methods include any that are looking up data stored in the IndexedList. This should not include length() which is meta-data about the IndexedList.

  • Mutator methods include any that change the data stored in the IndexedList (after creation/initialization).

    • Consider a freshly constructed Measured object. It would return $0$ for both accesses and mutations.
    • Now, imagine we call the length operation followed by three calls to the get operation. At this point, our object would return $3$ for accesses but still $0$ for mutations.
    • If we now call the put operation twice, the object would return $2$ for mutations but still $3$ for accesses. (You don't have to check whether a put operation actually changes the value or not since that's not how the put operation was initially written.)
  • The reset operation should set both the number of accesses and mutations back to $0$.

  • The count method returns the number of occurrences of the parameter value. Since it will always need to inspect every value in the array, it should also naturally update the accesses value accordingly.

Testing MeasuredIndexedList

You must write unit tests for Measured IndexedList. Your focus should be on the "Measured" aspect of it (i.e., reset, accesses, mutations, count). As you test these operations, naturally, you will need to call IndexedList interface methods (i.e., put, get, length) to trigger the various possible outcomes.

The MeasuredIndexedListTest.java is where you need to add unit tests. We provide a skeleton code, with a few tests, as part of the starter. However, you need to add many more tests in there.

Implementing MeasuredIndexedList

You need to develop a class MeasuredIndexedList that extends our dear old ArrayIndexedList (already provided in the starter code) and implements the Measured interface; yes, both at the same time. When a MeasuredIndexedList is created, you initialize internal counters to keep track of the number of access and mutate operations it has been asked to perform so far; both counts start at zero. You will need to override the accessor and mutator methods of the class so that the relevant counter is incremented each time that type of operation succeeds (i.e., is executed fully, given valid parameter values).

Hints:

  • Make sure you make proper use of super. You should not be duplicating any code from ArrayIndexedList and should be making good use of inheritance.

  • Pay attention to where and when you are calling super methods and how that might/should affect the counts. For example, with the ArrayIndexedList, we report the number of successful reads and writes. So if some operation causes an exception, that operation shouldn't affect the statistics.

  • Don't forget that your constructor for MeasuredIndexedList will also have to invoke the ArrayIndexedList constructor! However, this operation is neither an accessor nor a mutator.

Sorting

Your second task for this homework is to explore some of the basic sorting algorithms and their analysis. These algorithms are quadratic in terms of their asymptotic performance, but they differ in their actual performance.

We'll focus on the following three algorithms:

  • Bubble Sort (with the "stop early if no swaps" optimization)
  • Selection Sort
  • Insertion Sort (with the "minimizing swaps" optimization)

The provided files (inside the sort package) contain a basic framework for evaluating sorting algorithms. You'll need a working MeasuredIndexedList class, and you'll need to understand the following interface as well (again compressed):

public interface SortingAlgorithm<T extends Comparable<T>> {
  void sort(IndexedList<T> list);
  String name();
}

Let's look at this interface

  • Given an object of SortingAlgorithm, we can (a) ask it to sort a given IndexedList and (b) ask it for its name (e.g., "Insertion Sort").

  • The use of extends inside the angle brackets means that any type T we want to sort must implement the interface Comparable as well. It obviously can't just be any old type; it must be a type for which the expression "a is less than b" actually makes sense. Using Comparable in this form is Java's way of saying that we can order the objects; we've covered this related lecture notes (you should read up the details).

Implement Insertion Sort

You must implement Insertion Sort in the provided InsertionSort.java file. Note that the InsertionSort class implements the SortingAlgorithm.

Implement the Insertion Sort with the "minimizing swaps" optimization.

Minimizing swaps: Visit the Insertion Sort entry on Wikipedia. In the section on its algorithm, find the passage that reads as "After expanding the swap operation in-place as ..., a slightly faster version can be produced that moves A[i] to its position in one go and only performs one assignment in the inner loop body." You should base your implementation off (what you learned in the lecture and) the pseudo-code which follows the passage above.

We expect you to understand the optimization well enough to implement it by reading the Wikipedia entry. Therefore, please refrain from asking questions from the teaching staff on how the optimization works or whether you have implemented it correctly. Figuring it out on your own is part of the homework!

We also provide a working GnomeSort implementation of SortingAlgorithm, a working BubbleSort and SelectionSort, and a NullSort that doesn't do anything. GnomeSort is an intentionally inefficient sorting algorithm that we can use for comparison.

For this part of the homework, you will implement the "move-to-front" and "transpose-sequential-search" heuristics to improve the performance of the "find" operation in an implementation of the Set ADT. These heuristics are described in the lecture notes on Set ADT.

These heuristics are implemented as two new classes, MoveToFrontLinkedSet and TransposeArraySet, respectively. Both classes implement the Set interface. The MoveToFrontLinkedSet class uses a linked list to store the elements, while the TransposeArraySet class uses an array. Both classes should use the LinkedSet and ArraySet classes from the set package as their superclasses.

Important Notes:

  1. You will need to override relevant method(s) from LinkedSet/ArraySet, respectively. (The changes must be kept as minimal as possible; do you need to override any of the core operations or the helper methods?)
  2. You many not make any changes to Set.java, LinkedSet.java and ArraySet.java files.
  3. You may add as many private fields/methods as you need to MoveToFrontLinkedSet and TransposeArraySet. However, you may not add any public members in there.

Once you are done, run the tests in SetTest for MoveToFrontLinkedSet and TransposeArraySet. Make sure they all pass as a minimal assurance that your implementation does not break the code contract of the Set ADT.

Next, add a main method to each of MoveToFrontLinkedSet and TransposeArraySet. Inside the main method, add a sequence of operations to "insert", "remove" and "search" (use has operation) for some sample values. It is entirely up to you what values and what sequence of operations you use. The goal is to observe the behavior of the heuristics in action.

Important Note: Your implementation of the heuristics must adhere to the specification of each and work as expected when the has operation is called. If not, you must debug and resolve the issue. However, do not update the implementation of heuristics nor how remove and insert work if you find the heuristics don't play well with those methods.

Discussion

In the final part of this homework, you will write a five-part discussion in the README.md file. The discussion will cover the design of the MeasuredIndexedList, the performance of sorting algorithms, the analysis of the descending selection sort algorithm, the effectiveness of the heuristics in the Set ADT, and the performance of the Set implementations using JMH.

Part I

In the README.md file, under Part I, discuss from a design perspective whether or not iterating over a MeasuredIndexedList should affect the accesses and mutation counts.

Note that you are not asked to change the ArrayIndexedListIterator. However, if you wanted to include the next() and hasNext() methods in the statistics measured, can you inherit from ArrayIndexedList and override the relevant methods, or not? Explain.

In answering this question, please refrain from getting help from the teaching staff. We want you to figure it out on your own.

Part II

We provide a SortingAlgorithmDriver.java file that runs the various sorting algorithms on several input data and reports on its statistics. You are expected to run this driver program and analyze the results.

Important Notes:

  1. You can only run SortingAlgorithmDriver.java after implementing MeasuredIndexedList. Otherwise, by running it, you will get a NullPointerException.

  2. Make sure you store the homework files in a location on your computer where the path to it does not contain space or special characters.

  3. Make sure the main/resources folder is marked as the project "resources." (See the documentation here.) This should be the case if you open the project starter correctly.

The following is an example output from the driver program (note InsertionSort was not implemented yet):

Data file         Algorithm        Sorted?  Accesses     Mutations    Seconds     

ascending.data    Null Sort        false    0            0            0.000004    
ascending.data    Gnome Sort       true     15,230,058   5,074,020    0.305117    
ascending.data    Selection Sort   true     16,003,980   7,980        0.191780    
ascending.data    Bubble Sort      true     12,055,014   5,074,020    0.231695    
ascending.data    Insertion Sort   false    0            0            0.000002    

descending.data   Null Sort        false    0            0            0.000002    
descending.data   Gnome Sort       true     47,988,000   15,996,000   0.232058    
descending.data   Selection Sort   true     16,000,000   4,000        0.095581    
descending.data   Bubble Sort      true     31,992,000   15,996,000   0.181071    
descending.data   Insertion Sort   false    0            0            0.000002    

random.data       Null Sort        false    0            0            0.000002    
random.data       Gnome Sort       true     24,145,478   8,045,828    0.117493    
random.data       Selection Sort   true     16,003,992   7,992        0.083146    
random.data       Bubble Sort      true     24,019,776   8,045,828    0.174053    
random.data       Insertion Sort   false    0            0            0.000002 

The driver is configured to read the first 4000 entries from each data file (ascending.data, descending.data, random.data) and sort them using various sorting algorithms. These data files are available in the src/main/resources folder.

The report includes a "Sorted?" column that indicates whether the array was sorted correctly. Additionally, it reports how many operations of the underlying MeasuredIndexedList were used to perform the sort ("Accesses" and "Mutations" columns). Finally, the report also indicates how long it took to sort the array ("Seconds" column). Please note that this number will vary widely across machines. It's hard to use time as an actual benchmark; you can only use it for relative comparisons on the device running the experiment.

You might want to run SortingAlgorithmDriver several times and ignore outliers to compensate for variations in load. Feel free to change the value for SIZE so you can experiment with more or fewer data values. Moreover, you can make your own data file and add them (their name) to getDataFiles method, so the driver run the sorting algorithm on it.

Don't make any changes to the driver program other than changing the size of the data!

Once you have run the driver program several times, you should have a good idea of how the algorithms compare in terms of the number of operations they perform. Please note there is a discrepancy between the result from this experiment and what is expected from each sorting algorithm based on what we know about their implementation and performance. In other words, the numbers outputted by SortingAlgorithmDriver (in some cases) don't match our expectations!

There is an intentional mistake in the way the earlier experiment is set up and implemented, resulting in the discrepancy mentioned above. The mistake may be in the code or data or both.

You must use the measurements to find the discrepancy and then use a debugger to step through the code/data to catch that mistake. Once you notice the mistake, you should describe it in the README.md file under Part II. You should also describe the series of experiments you ran, what data you collected, and your conclusions about these algorithms' relative performance. Include concrete examples in your discussion (e.g., time measurements, data set, size, etc.).

Important Notes:

  1. You are not asked to correct the mistake. In fact, you should not make any changes to SortingAlgorithmDriver or any of the data files. Instead, you should describe the error and how you have come to find it.

  2. Please refrain from checking with us on whether your finding is the answer we are looking for. You should discover this on your own without the help of the teaching staff. Moreover, avoid posting your hypothesis/finding on the course discussion board so that other students can make the same connection on their own. If you post a public note/question about this on the discussion board, you will get a penalty on your evaluation for this homework.

Part III

In Part III of the discussions, you will analyze the following descending selection sort algorithm by counting the number of RAM steps it takes and expressing that as a function of the size of the input (which is the length of the array to be sorted here). More specifically, you need to determine how many comparisons $C(n)$ and assignments $A(n)$ are performed by this implementation of selection sort in the worst-case. Both of these should be polynomials of degree 2 since you know that the asymptotic complexity of selection sort is $\Omicron(n^2)$.

You are asked to analyze the algorithm below, not the implementation in SelectionSort.java. You must reference the line numbers below in your writeup for this problem, so that we can follow your analysis when grading.

Here's the code:

 1   public static void selectionSort(int[] a) {
 2       int max, temp;
 3       for (int i = 0; i < a.length - 1; i++) {
 4          max = i;
 5          for (int j = i + 1; j < a.length; j++) {
 6            if (a[j] > a[max]) {
 7              max = j;
 8            }
 9          }
10          temp = a[i];
11          a[i] = a[max]; 
12          a[max] = temp;
13       }
14   }

Don't just state the polynomials. Your writeup has to show how you derived them, line by line!

Part IV

Once you have implemented the MoveToFrontLinkedSet and TransposeArraySet classes, you should test them to ensure they work as expected. In particular, run the main method we asked you to add to each class. This method should insert, remove, and search for some sample values. You should run the main method in debug mode and execute the sequence of operations step by step. Use the debugger tool (or jGRASP/Java Visualizer plugins) to observe the state of the underlying array/linked list for each implementation. Make sure your implementation works as expected of each heuristics.

As you run your main method(s), take a closer look at the values after removal! How does the removal logic contribute to achieving the objective of these heuristics? The goal of these heuristics is to enhance the performance of the has operation. Do you think the removal, combined with how the heuristics work, will help achieve this goal? Write your answer in the README.md file.

Part V

In earlier parts of this homework, you've seen us measure the performance various sorting algorithms over a given dataset by calculating the elapsed time (in milliseconds). Unfortunately, this measure is usually not accurate even when run on the same machine. As you execute a Java program, a lot goes unnoticed to you, under the hood. From starting the Java Virtual Machine (JVM) to various hacks and optimizations to compete with the performance of compiled languages such as C and C++. (You can read this Wikipedia article on Java Performance to know more about it.) This overhead is often referred to as the "warm-up" time. For an accurate measure of time, you have to allow for warm-up, repeat the experiment, remove outliers, etc. The Java Microbenchmark Harness, or simply JMH, is program that helps you do just that. JMH is a set of tools introduced in OpenJDK 11. It is aimed to help with performance analysis (profiling) and benchmarking of Java programs. Similar to JUnit, the JMH provides several annotations which we can use to write code (test) for measuring performance.

Before continuing, please install IntelliJ Plugin for JMH Java Microbenchmark Harness.

In this part of the homework, you will use JMH to profile the performance of the Set implementations we have provided. You will compare the performance of ArraySet and LinkedSet with the performance of MoveToFrontLinkedSet and TransposeArraySet. The goal is to answer the following question: Do heuristics improve linear search?

Notice the JmhRuntimeTest.java file inside src/test/hw3/. This file contains several methods that are annotated with annotations unfamiliar to you. Please ignore the annotations; we won't ask you to write such methods on your own. For this homework, you need to be able to run these methods. Notice you can run these methods individually (like unit tests, there must be a play icon next to each). You can also run all the methods in JmhRuntimeTest; In IntelliJ, right-click on the file and select run in the Project panel. It must produce a lot of output (including, potentially some warnings in red). After several minutes, you will get a report similar to this:

Benchmark                         Mode  Cnt  Score   Error  Units
JmhRuntimeTest.arraySet           avgt    2  0.385          ms/op
JmhRuntimeTest.linkedSet          avgt    2  0.910          ms/op
JmhRuntimeTest.moveToFront        avgt    2  0.892          ms/op
JmhRuntimeTest.transposeSequence  avgt    2  0.411          ms/op

Don't compare the measurements presented above to what you will see when you run your code. The measures depend on the computer on which the code is executed. Also, at the time we conducted this experiment, the MoveToFrontLinkedSet and TransposeArraySet were not implemented! For your information, the measurements reported under "Score" are average elapsed time in milliseconds.

If you are unable to run JmhRuntimeTest, it is most likely that you have not installed the JDK as it was instructed in the course logistics. You may also have multiple versions of JDK installed, and IntelliJ is confused as to where to find JMH. You can try to fix this by going to File -> Project Structure -> Project -> Project SDK and selecting the correct JDK.

Notice each method that is annotated with @Benchmark is designed so that it first instantiates an implementation of Set ADT. It then passes the set object to the experiment method. The experiment method, as it stands now, simply adds all the values from the data field into the set argument passed to it. Next, the data field is instantiated in the setUp method. As the setUp method stands now, it simply adds integers from $0$ to $999$ to data, and then it shuffles it. The outcome is a random permutation of natural numbers smaller than $1000$.

Task: You must design an experiment in which the heuristics would improve the performance of Array/Linked Set.

You may not make any changes to the methods that are annotated with @Benchmark. You may change the setUp or the experiment method (or both). If you need more parameters to control your experiment, you can add them as private fields to the JmhRuntimeTest class.

In your README, please describe the experiment and the setup which you have used. Please discuss the results from using the heuristic "improvements" to the Set implementations. Were there any noticeable differences compared to the implementations we provided (ArraySet and LinkedSet)? If they didn't perform as expected, why do you think that was the case? Were your data sets big enough? Suppose you didn't see any measurable differences. What do you think it would take to measure the relative performance of each implementation?

Submission

In preparation, you must (1) run checkstyle to ensure your code is checkstyle compliant (if not, fix the styling issues). (2) Create a zip file containing the src folder (and all therein). (3) Submit the zip file to Gradescope.

You can submit as many times as you like until deadline. We will only grade the last submission.

Rubric

Teaching assistants will evaluate your submission using the following criteria:

Please note, we reserve the right to change this rubric without advance notice, even after the submission deadline. However, the chances of doing so are slim, and at any rate, should not be a significant change.

  • Spec 1: The constructor of MeasuredIndexedList correctly sets both the number of accesses and mutations to 0 and initializes the underlying ArrayIndexedList. [0.5 point]

  • Spec 2: The reset method of MeasuredIndexedList correctly sets both the number of accesses and mutations back to 0. [0.5 point]

  • Spec 3: The "access" counts are correctly updated for the get and length methods, and the accesses method of MeasuredIndexedList accurately returns the number of access operations performed. [1 point]

  • Spec 4: The "mutate" counts are correctly updated for the put method, and the mutations method of MeasuredIndexedList accurately returns the number of mutate operations performed. [1 point]

  • Spec 5: The count method of MeasuredIndexedList correctly returns the number of occurrences of the parameter value and updates the accesses count accordingly. [0.75 point]

  • Spec 6: The MeasuredIndexedListTest class contains focused, appropriately named unit tests for the reset, accesses, mutations, and count methods of MeasuredIndexedList. [1.5 points]

  • Spec 7: The MeasuredIndexedListTest covers edge cases and potential exceptions for the measured operations. [0.75 point]

  • Spec 8: The InsertionSort class correctly implements the Insertion Sort algorithm with the "minimizing swaps" optimization. [2 points]

  • Spec 9: The MoveToFrontLinkedSet class correctly and efficiently implements the "move-to-front" heuristic for the find operation in a linked list based Set. The class extends LinkedSet and overrides only the necessary methods. [2 points]

  • Spec 10: The TransposeArraySet class correctly and efficiently implements the "transpose-sequential-search" heuristic for the find operation in an array based Set. The class extends ArraySet and overrides only the necessary methods. [2 points]

  • Spec 11: The main method is added to MoveToFrontLinkedSet and TransposeArraySet to demonstrate the behavior of the heuristics with a sequence of insert, remove and search operations. [1 point]

  • Spec 12: Part I of the discussion in README.md thoughtfully considers the design implications of tracking iterator operations in MeasuredIndexedList. [1 point]

  • Spec 13: Part II of the discussion in README.md describes the discrepancy found in the SortingAlgorithmDriver results, the debugging process used to identify the mistake, and draws reasonable conclusions about the relative performance of the sorting algorithms based on corrected data. [3 points]

  • Spec 14: Part III of the discussion in README.md correctly analyzes the provided descending selection sort algorithm, determining the number of comparisons $C(n)$ and assignments $A(n)$ performed in the worst-case, expressed as degree 2 polynomials. The derivation process is shown line by line. [2 points]

  • Spec 15: Part IV of the discussion in README.md considers the effect of the removal logic in MoveToFrontLinkedSet and TransposeArraySet on the performance of the has operation, based on observations from running the main methods in debug mode. [1 points]

  • Spec 16: Part V of the discussion in README.md describes an experiment designed using JMH to measure the performance impact of the "move-to-front" and "transpose-sequential-search" heuristics compared to the base ArraySet and LinkedSet implementations. The setup is explained, results are discussed, and thoughtful analysis is provided. [2 points]

  • Spec 17: The code is well-organized, easy to read, and complies with the provided checkstyle rules and coding standards. [1 point]

The total score for this homework is 23 points.