Choose the Right Data Types
Apr 302018One of the most important and overlooked parts of designing algorithms is choosing the right data types. We often assume that the types of data an algorithm will use are determined by the inputs and the output — but it might help to use a temporary collection of some data, if that collection has useful operations. Those operations are determined by the collection’s type.
In this post we’ll look at a couple of examples where using the right data types makes the problem simpler.
Find all duplicates
A “duplicate” is a value which has already occurred. Here’s the problem we’ll solve:
- Problem: given a list, print the indices and values of all duplicates.
- Input: a list.
- Output (to be printed): the index and value of each duplicate, on separate lines.
- Example: given the list
[19, 12, 19, 12]
, the algorithm should printindex = 2, value = 19 index = 3, value = 12
Detecting duplicates is often important to avoid doing the same work twice.
Bad solution
The value at index i
is a duplicate if it occurs at another index before i
.
We can test this directly, and filter for duplicates:
# subproblem: test if value occurs in lst before index
def value_occurs_before(lst, index, value):
for i in range(index):
if lst[i] == value:
return True
return False
def print_duplicates(lst):
for i, value in enumerate(lst):
if value_occurs_before(lst, i, value):
# duplicate detected
print('index = {}, value = {}'.format(i, value))
This works, but there is a simpler way.
Good solution
The easiest way to detect duplicates is to keep a set of all the values we’ve “already seen”. Each time we see a value, if it’s already in the set then it’s a duplicate; otherwise, we should add it to the set (because we’ve just “seen” it).
Here’s the solution:
def print_duplicates(lst):
seen = set()
for i, value in enumerate(lst):
if value in seen:
# duplicate detected
print('index = {}, value = {}'.format(i, value))
else:
seen.add(value)
This code is shorter, simpler, and more efficient.
Moral of the story
To detect duplicates, keep a set of values you’ve “already seen”. If a value is in the set, you’ve already seen it before; otherwise, add new values to the set when you see them. I call this the “set of things seen” plan.(1)
Sorting out-of-place
Consider the out-of-place selection sort algorithm, which repeatedly finds the smallest unsorted value, removes it, and appends it to the output list. To avoid destroying the original list, it works on a copy.
Bad solution
Here’s an implementation of this algorithm:
# subproblem: find the index of the smallest unsorted value
def find_index_of_min(lst):
min_index = 0
min_value = lst[0]
for i, value in enumerate(lst):
if value < min_value:
min_index = i
min_value = value
return min_index
def selection_sort_out_of_place(input_list):
# make a copy so we don't destroy the original data
input_list = list(input_list)
output_list = []
# sort values into output_list, one at a time
while input_list:
min_index = find_index_of_min(input_list)
value = input_list.pop(min_index)
output_list.append(value)
return output_list
This is one of the simplest sorting algorithms; the main subproblem is finding and removing the minimum value from a list. This requires a search, because it’s not one of the basic operations supported by a list.
Good solution
Instead of a list, we should use a collection type which makes it easy to “find and remove the minimum”. This is exactly what the poll operation on a min-priority queue does! If we use a priority queue instead of a list, then out-of-place sorting is trivial:
from queue import PriorityQueue
def priority_queue_sort(input_list):
# put the values into a priority queue
priority_queue = PriorityQueue()
for value in input_list:
priority_queue.put(value)
output_list = []
# poll values into output_list, one at a time
while not priority_queue.empty():
value = priority_queue.get()
output_list.append(value)
return output_list
This is not really a simpler algorithm, because the sorting still has to be done by the priority queue. But the code is simpler.
From another perspective, both algorithms use a priority queue, it’s just that the first one uses an unsorted list as a priority queue data structure, whereas the second one uses Python’s built-in PriorityQueue
class.
Unsurprisingly, using the built-in priority queue is simpler, and more efficient.(2)
Moral of the story
Choose data types that do what you want, rather than data types that store what you want. Not every sequence has to be a list.
Before writing an algorithm to solve a subproblem, check whether there is any abstract data type that does what you want as a basic operation. This saves you from implementing your own data structure without realising it.
There are no published comments.