DS&A - Data Structures & Algorithms - Choose the Right Data Types

Choose the Right Data Types

Apr 302018

One of the most important and overlooked parts of designing algorithms is choosing the right data types. We often assume that the types of data an algorithm will use are determined by the inputs and the output — but it might help to use a temporary collection of some data, if that collection has useful operations. Those operations are determined by the collection’s type.

In this post we’ll look at a couple of examples where using the right data types makes the problem simpler.

Find all duplicates

A “duplicate” is a value which has already occurred. Here’s the problem we’ll solve:

Problem: given a list, print the indices and values of all duplicates.
- Input: a list.
- Output (to be printed): the index and value of each duplicate, on separate lines.
Example: given the list [19, 12, 19, 12], the algorithm should print
```
index = 2, value = 19
index = 3, value = 12
```

Detecting duplicates is often important to avoid doing the same work twice.

Bad solution

The value at index i is a duplicate if it occurs at another index before i. We can test this directly, and filter for duplicates:

# subproblem: test if value occurs in lst before index
def value_occurs_before(lst, index, value):
    for i in range(index):
        if lst[i] == value:
            return True
    return False

def print_duplicates(lst):
    for i, value in enumerate(lst):
        if value_occurs_before(lst, i, value):
            # duplicate detected
            print('index = {}, value = {}'.format(i, value))

This works, but there is a simpler way.

Good solution

The easiest way to detect duplicates is to keep a set of all the values we’ve “already seen”. Each time we see a value, if it’s already in the set then it’s a duplicate; otherwise, we should add it to the set (because we’ve just “seen” it).

Here’s the solution:

def print_duplicates(lst):
    seen = set()
    for i, value in enumerate(lst):
        if value in seen:
            # duplicate detected
            print('index = {}, value = {}'.format(i, value))
        else:
            seen.add(value)

This code is shorter, simpler, and more efficient.

Moral of the story

To detect duplicates, keep a set of values you’ve “already seen”. If a value is in the set, you’ve already seen it before; otherwise, add new values to the set when you see them. I call this the “set of things seen” plan.(1)

Sorting out-of-place

Consider the out-of-place selection sort algorithm, which repeatedly finds the smallest unsorted value, removes it, and appends it to the output list. To avoid destroying the original list, it works on a copy.

Bad solution

Here’s an implementation of this algorithm:

# subproblem: find the index of the smallest unsorted value
def find_index_of_min(lst):
    min_index = 0
    min_value = lst[0]
    for i, value in enumerate(lst):
        if value < min_value:
            min_index = i
            min_value = value
    return min_index

def selection_sort_out_of_place(input_list):
    # make a copy so we don't destroy the original data
    input_list = list(input_list)
    output_list = []
    
    # sort values into output_list, one at a time
    while input_list:
        min_index = find_index_of_min(input_list)
        value = input_list.pop(min_index)
        output_list.append(value)
    
    return output_list

This is one of the simplest sorting algorithms; the main subproblem is finding and removing the minimum value from a list. This requires a search, because it’s not one of the basic operations supported by a list.

Good solution

Instead of a list, we should use a collection type which makes it easy to “find and remove the minimum”. This is exactly what the poll operation on a min-priority queue does! If we use a priority queue instead of a list, then out-of-place sorting is trivial:

from queue import PriorityQueue

def priority_queue_sort(input_list):
    # put the values into a priority queue
    priority_queue = PriorityQueue()
    for value in input_list:
        priority_queue.put(value)
    
    output_list = []
    
    # poll values into output_list, one at a time
    while not priority_queue.empty():
        value = priority_queue.get()
        output_list.append(value)
    
    return output_list

This is not really a simpler algorithm, because the sorting still has to be done by the priority queue. But the code is simpler.

From another perspective, both algorithms use a priority queue, it’s just that the first one uses an unsorted list as a priority queue data structure, whereas the second one uses Python’s built-in PriorityQueue class.

Unsurprisingly, using the built-in priority queue is simpler, and more efficient.(2)

Moral of the story

Choose data types that do what you want, rather than data types that store what you want. Not every sequence has to be a list.

Before writing an algorithm to solve a subproblem, check whether there is any abstract data type that does what you want as a basic operation. This saves you from implementing your own data structure without realising it.

Footnotes

This plan is what’s needed to make DFS and BFS work on graphs.
Python’s PriorityQueue class is implemented as a heap, which is a very efficient priority queue data structure.

Tags:
algorithms, collections, plans, abstract data types

Choose the Right Data Types

Find all duplicates

Bad solution

Good solution

Moral of the story

Sorting out-of-place

Bad solution

Good solution

Moral of the story

Footnotes

There are no published comments.