CS331 - Data Structures and Algorithms

Version 1

Course webpage for CS331

Project

Project due date: 05/12

Organization

Like the programming assignments, the project will be submitted through GitHub. Note that the project is much more open-ended than the labs, so you should start early to make sure you can finish in time.

Description

In this project you will implement a sorting algorithm called radix sort. In contrast to most other sorting algorithms, this algorithm does not compare the full keys of the elements to be sorted, i.e., it is not a comparison-based sort. In radix sort, each key is split into a sequence of parts. Some examples are:

  • For integers in base 10 we may split a number into its digits. For instance, the key 19015 corresponds to the sequence [1,9,0,1,5]

  • For strings, we can interpret them as sequences of characters. For instance, the key 'Peter' corresponds to the sequence ['P','e','t','e','r']

  • An 8-byte integer can be interpreted as a sequence of 8 1-byte integers
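The three examples above can be sketched in Python as follows. The helper names digits_base10 and int_to_byte_sequence are made up for this illustration and are not part of the project skeleton; the sketch also assumes non-negative integers and a most-significant-first order.

```python
# Illustrative helpers only (hypothetical names, not part of the project skeleton).

def digits_base10(number):
    """Split a non-negative integer into its base-10 digits, most significant first."""
    return [int(ch) for ch in str(number)]

def int_to_byte_sequence(number, width=8):
    """Split a non-negative integer into `width` 1-byte integers, most significant first."""
    return list(number.to_bytes(width, byteorder='big'))

print(digits_base10(19015))       # [1, 9, 0, 1, 5]
print(list('Peter'))              # ['P', 'e', 't', 'e', 'r']
print(int_to_byte_sequence(258))  # [0, 0, 0, 0, 0, 0, 1, 2]
```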

The algorithm runs multiple iterations. In each iteration, the keys are sorted based on one position i in their sequence. Since the number of possible values at each position is relatively small, we can allocate a bucket for each possible value and insert each key k into the bucket for k[i]. This requires one pass over the input array. Afterwards, the elements of the buckets are written back into the array, starting with the elements of the first bucket, then all elements of the second bucket, and so on. Buckets can be implemented using lists.
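A single iteration could look roughly like the sketch below. It assumes bytes keys (so each position holds a value between 0 and 255) and that every key has a position i. The function name bucket_pass is hypothetical, and for simplicity the sketch returns a new list instead of writing back into the input array.

```python
def bucket_pass(keys, i, num_buckets=256):
    """Reorder `keys` by the value at position i (every key must have a position i)."""
    # One bucket (a list) per possible value at a position
    buckets = [[] for _ in range(num_buckets)]
    # One pass over the input: append each key to the bucket for key[i]
    for key in keys:
        buckets[key[i]].append(key)
    # Concatenate the buckets in order: first bucket, second bucket, ...
    result = []
    for bucket in buckets:
        result.extend(bucket)
    return result

keys = [b'cab', b'abc', b'bca', b'aaa']
print(bucket_pass(keys, 2))  # [b'bca', b'aaa', b'cab', b'abc']
```

Note that b'bca' stays in front of b'aaa': both have the same value at position 2, so they keep their relative order from the input.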

Radix sort either starts from the first element of each key (i=0) and increases i in each iteration, or starts from the largest possible position and decreases i in each iteration. The former is called the most-significant digit (MSD) variant of radix sort, because if the keys are numbers then the first digit is the most significant digit. Similarly, the latter variant is called least-significant digit (LSD). The variant described above is LSD.

Because each iteration passes over the keys in left-to-right order in the array, the order established in the previous iterations is retained for all keys k1 and k2 such that k1[i] = k2[i]. It follows that after we have iterated over all positions, the array will be sorted.
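The argument above can be tried out with a minimal LSD sketch. It is only one possible way to write it, and it assumes bytes keys that all have exactly the same length; the function name lsd_radix_sort and the width parameter are chosen for this example.

```python
def lsd_radix_sort(keys, width):
    """Sort bytes keys that all have exactly `width` positions (LSD variant)."""
    keys = list(keys)  # work on a copy, leave the input untouched
    # Start at the last position and move towards position 0
    for i in range(width - 1, -1, -1):
        buckets = [[] for _ in range(256)]
        # Keys are visited left to right, so ties at position i keep
        # the order established by the earlier iterations
        for key in keys:
            buckets[key[i]].append(key)
        keys = [key for bucket in buckets for key in bucket]
    return keys

words = [b'dog', b'cat', b'cow', b'ant', b'dot']
print(lsd_radix_sort(words, 3))  # [b'ant', b'cat', b'cow', b'dog', b'dot']
```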

Let w be the maximum length of keys in the input and n the number of elements to be sorted. Radix sort runs in $O(w * n)$ time.
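One way to see where this bound comes from: each of the w iterations makes one pass over the n keys and one pass over the buckets. Writing b for the number of buckets (b is a symbol introduced here for illustration, e.g. b = 256 for byte-valued positions), a rough accounting is:

```latex
T(n, w) = \sum_{i=0}^{w-1} O(n + b) = O\bigl(w \cdot (n + b)\bigr) = O(w \cdot n) \quad \text{when } b \text{ is a constant.}
```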

You are free to implement any radix sort variant of your choice. For example, you may use the variant that first counts the number of occurrences of each possible value during an iteration, which makes it possible to sort in-place. For more information about radix sort see the references below.
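As a rough illustration of the counting idea, the sketch below performs one iteration by first counting how often each value occurs at position i and then computing where each group of keys starts. It is a simplification: it writes into a separate output list rather than sorting truly in-place, it assumes bytes keys that all have a position i, and the name counting_pass is hypothetical.

```python
def counting_pass(keys, i, num_values=256):
    """One radix sort iteration at position i, using counts instead of buckets."""
    # Count the occurrences of each possible value at position i
    counts = [0] * num_values
    for key in keys:
        counts[key[i]] += 1
    # starts[v] is the index where the first key with value v belongs
    starts = [0] * num_values
    total = 0
    for v in range(num_values):
        starts[v] = total
        total += counts[v]
    # Place each key at the next free slot of its group (keeps ties in order)
    output = [None] * len(keys)
    for key in keys:
        v = key[i]
        output[starts[v]] = key
        starts[v] += 1
    return output

print(counting_pass([b'cab', b'abc', b'bca', b'aaa'], 2))
# [b'bca', b'aaa', b'cab', b'abc']
```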

Supported key types

Your implementation only needs to support strings consisting of ASCII characters as keys. Python has a bytes type that is well suited for this purpose: https://docs.python.org/3.9/library/stdtypes.html#bytes. Bytes literals are string literals prefixed with b. Index-based access to characters returns integers (ASCII codes if your string consists only of ASCII characters). This will be handy for implementing radix sort.

mybytes = b'Testing'
print(mybytes[0])
84

You can translate between regular strings and the bytes type like this:

mystr = 'Testing a string'
mybytes = mystr.encode('ascii')
print(f"mystr: <{mystr}>")
print(f"mybytes: <{mybytes}>")
mystr: <Testing a string>
mybytes: <b'Testing a string'>

Because you only need to support ASCII for the project, it may be necessary to translate Unicode strings to ASCII. This can be achieved as shown below. Note that for sorting books there is a helper function book_to_words (described below) that you can use to ensure that your implementation is consistent with our test cases.

mystr = 'Testing ®À⃠⊙'
mybytes = mystr.encode('ascii','replace')
print(f"mystr: <{mystr}>")
print(f"mybytes: <{mybytes}>")
mystr: <Testing ®À⃠⊙>
mybytes: <b'Testing ????'>

Sorting books

In contrast to the labs, we do not provide skeleton code for the project with the exception of a method that sorts books:

def radix_a_book(book_url='https://www.gutenberg.org/files/84/84-0.txt'):
    pass

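This method takes as input the URL of a book (a txt file) and should return a Python list with all words of the book sorted alphabetically. For the sake of this project, we define words to be the result of converting the book's text to ASCII and splitting it on whitespace (see the helper function book_to_words below).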

  • Note that one challenge is that the keys (the words in the book) are not all of the same length. Depending on the variant of radix sort that you use, you may have to pad the keys so that they all have the same length!
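One possible way to pad keys is sketched below. It assumes that the words never contain the NUL byte b'\x00', so that the padding can be removed again after sorting; the helper name pad_keys is made up for this example. Padding on the right with the smallest possible byte value makes a shorter word sort before a longer word that starts with it, which matches how Python orders bytes objects (b'the' < b'then').

```python
def pad_keys(keys, pad=b'\x00'):
    """Right-pad every key with `pad` up to the length of the longest key."""
    width = max((len(key) for key in keys), default=0)
    return [key.ljust(width, pad) for key in keys], width

words = [b'the', b'a', b'then']
padded, width = pad_keys(words)
print(padded)  # [b'the\x00', b'a\x00\x00\x00', b'then']
print(width)   # 4

# After sorting, the padding can be stripped again
# (assumption: no real word ends in a NUL byte)
print([key.rstrip(b'\x00') for key in padded])  # [b'the', b'a', b'then']
```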

Books on https://www.gutenberg.org/ are encoded as Unicode text. To be able to sort them using your ASCII string radix sort, you have to transform the book into ASCII and then split it on whitespace. To avoid any ambiguity in the requirements, use the following helper method to create the list of words:

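import urllib.request

def book_to_words(book_url='https://www.gutenberg.org/files/84/84-0.txt'):
    # Download the book and decode the raw bytes (UTF-8 by default)
    booktxt = urllib.request.urlopen(book_url).read().decode()
    # Replace every non-ASCII character with '?' to obtain an ASCII bytes object
    bookascii = booktxt.encode('ascii','replace')
    # Split on ASCII whitespace; the result is a list of bytes objects
    return bookascii.split()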