Choose the Right Data Structures
May 072018In an earlier post we saw the importance of choosing the right data types when designing an algorithm. But we don’t just choose a type — whether we’re aware of it or not, we are also choosing a data structure which implements that type.
Both choices can have a big impact on how efficient an algorithm is; let’s consider some examples.
Smallest missing integer
Our first example is an exercise from Codility:
- Problem: given a list of integers, find the smallest positive integer that does not occur in it.
- Input: a list of integers.
- Output: the smallest integer missing from the list.
- Example: for the list
[1, 2, 3]
, the correct output is4
. - Example: for the list
[3, 1, 5, 4]
, the correct output is2
.
This is a straightforward problem with a straightforward solution. Unfortunately, the straightforward solution is a bad one!
Bad solution
We can solve this problem directly by testing if each number 1, 2, 3… is missing or not.
Since we try them in order, the first missing number we find must be the smallest one.
There is a subproblem of testing whether a number is missing from the list; fortunately, Python lists support the not in
operator.
Here’s the solution:
def smallest_missing_integer(numbers):
i = 1
while True:
if i not in numbers:
return i
i += 1
This solution is bad because it’s unnecessarily slow.
To see why, consider how the not in
operator works; the only way to test if a number is missing from a list is by linear search.
That is, although Python makes not in
look like a basic operation, what it’s actually doing is more like this:
def not_in(lst, target):
for value in lst:
if value == target:
return False
return True
def smallest_missing_integer(numbers):
i = 1
while True:
if not_in(numbers, i):
return i
i += 1
This code works exactly the same way, but now it’s clear that numbers
is being searched over and over again.
The not_in
function is O(n), but the algorithm calls it from inside a loop,(1)
so the overall complexity is O(n2).
Codility will reject this solution, because it’s not efficient enough.
Good solution
The problem isn’t the not in
operator — it’s the data structure.
Testing whether or not numbers
contains i
is only slow because numbers
is a list.
In contrast, the complexity of the contains operation on a Python set is O(1).(2) Here’s the solution using a set:
def smallest_missing_integer(numbers):
numbers_set = set(numbers)
i = 1
while True:
if i not in numbers_set:
return i
i += 1
The code is almost exactly the same as before, but now there is no searching. This algorithm is O(n), which is efficient enough to satisfy Codility.
Moral of the story
Beware of O(n) operations disguised as basic operations, like in
or not in
on a list.
Set data structures are designed so that the contains operation is faster than searching a list. If you want to test whether some values are in (or not in) a collection, use a set.
Most common value
Our next example is another simple problem:
- Problem: given a list, find the value which occurs the most times.
- Input: a list.
- Output: the value with the most occurrences in the list.
- Example: the most common value in
[3, 2, 1, 3]
is3
. - Example: the most common value in
["f", "o", "o"]
is"o"
.
This is a realistic problem; in statistics, the most common value in a list of numbers is called the mode.
Bad solution
We can solve this problem by directly applying “plans”: for each value, count the occurrences, and keep track of which value has the most occurrences so far.
# subproblem: count occurrences of a value in a list
def count_occurrences(lst, value):
counter = 0
for x in lst:
if x == value:
counter += 1
return counter
def most_common_value(lst):
# find the highest count
most_common = lst[0]
highest_count = count_occurrences(lst, most_common)
for value in lst:
count = count_occurrences(lst, value)
if count > highest_count:
most_common = value
highest_count = count
return most_common
This works, but it’s a bit complicated, and inefficient.
The count_occurrences
function is O(n), but the algorithm calls it from inside a loop, so it’s O(n2).
From another perspective, every time this algorithm sees a value, it counts the occurrences, even if they’ve already been counted before.
There is a simpler, faster solution.
Good solution
We can count every value in one pass through the list, using a dictionary to store the counters. Each time we see a value, we’ve either seen it before or we haven’t:
- If the value is already in the dictionary, increase its counter by 1.
- Otherwise, this is the first occurrence, so set its counter to 1.
After we’ve counted the occurrences of every value, we can search the dictionary for the one with the highest counter. Here’s the solution:
def most_common_value(lst):
# count occurrences of every value
occurrences = dict()
for value in lst:
if value in occurrences:
occurrences[value] += 1
else:
occurrences[value] = 1
# find the highest count
most_common = lst[0]
highest_count = occurrences[most_common]
for value, count in occurrences.items():
if count > highest_count:
most_common = value
highest_count = count
return most_common
The total amount of code is about the same as before, but now there is no “loop within a loop”. There are two loops, but both are O(n), so the algorithm is O(n) in total.(3)
Python-specific detail
Python’s collections module has a class called Counter
, which is a dictionary for counting occurrences.
The following code works the same way as above, but is simpler:
from collections import Counter
def most_common_value(lst):
# count occurrences of every value
occurrences = Counter(lst)
# ...
Moral of the story
If you need to count multiple things, count them all in the same loop. Do as much “work” as you can in one pass.
Instead of calling the same function again and again with the same inputs to find the same results, store those results in a dictionary.(4)
Substitution cipher
Our final example is a simple encryption scheme which works by substituting letters for other letters. The substitutions are defined by a “plaintext alphabet” and a “ciphertext alphabet”; for example, the plaintext alphabet is usually
"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
and the ciphertext alphabet could be
"CEKTOBYZJAPDLVMRXHWNQSIUFG"
Using these alphabets, "A"
is encrypted as "C"
, "B"
is encrypted as "E"
, and so on.
Symbols which aren’t in the plaintext alphabet, like " "
or "."
, are not substituted.
- Problem: encrypt a message.
- Input: a plaintext alphabet.
- Input: a ciphertext alphabet.
- Input: a message.
- Output: the encrypted message.
- Example: with the ciphertext alphabet
"NRJMAUEFTXBHVPQOZILKYDGWCS"
and the usual plaintext alphabet, the message
is encrypted asFEW FALSE IDEAS HAVE MORE FIRMLY GRIPPED THE MINDS OF SO MANY INTELLIGENT MEN THAN THE ONE THAT, IF THEY JUST TRIED, THEY COULD INVENT A CIPHER THAT NO ONE COULD BREAK. - DAVID KAHN
UAG UNHLA TMANL FNDA VQIA UTIVHC EITOOAM KFA VTPML QU LQ VNPC TPKAHHTEAPK VAP KFNP KFA QPA KFNK, TU KFAC XYLK KITAM, KFAC JQYHM TPDAPK N JTOFAI KFNK PQ QPA JQYHM RIANB. - MNDTM BNFP
This isn’t a secure cryptographic scheme, but there are real reasons to substitute some symbols with others.
For example, we might want to replace symbols like "é"
with "e"
when the font doesn’t support them.
Bad solution
We can encrypt a symbol by finding its index in the plaintext alphabet, and getting the substitute from the same index in the ciphertext alphabet. If the symbol is not in the plaintext alphabet, no substitution will be made.
Here’s the solution:
def encrypt(plaintext_alphabet, ciphertext_alphabet, message):
output = ""
for symbol in message:
# if symbol is in the plaintext alphabet, substitute it
i = plaintext_alphabet.find(symbol)
if i >= 0:
symbol = ciphertext_alphabet[i]
output += symbol
return output
This works, but it is not efficient.
You may have noticed that the find
method is a linear search, and we’re doing this repeatedly with the same few symbols.
Following the advice above, it would be better to store the plaintext-to-ciphertext substitutions in a dictionary.
But there’s a worse, more subtle problem. Every time we add a symbol to the output string, the whole string is copied to make a new string. If n is the message length, then the algorithm is O(n2) just because of string concatenation.
Good solution
Let’s fix both problems; we can begin by building a dictionary of substitutions. Then, to build the output string, we should put the parts in a list and join them all at once. Here’s the improved solution:
def encrypt(plaintext_alphabet, ciphertext_alphabet, message):
# build a dictionary of substitutions
substitutes = dict()
for i, symbol in enumerate(plaintext_alphabet):
substitutes[symbol] = ciphertext_alphabet[i]
# collect the output in a list
output = []
for symbol in message:
# if symbol is in the plaintext alphabet, substitute it
if symbol in substitutes:
symbol = substitutes[symbol]
output.append(symbol)
# join the parts all at once
return ''.join(output)
The code is now more complicated, but we replaced the find
method with a dictionary’s get operation, and we replaced the string concatenation with a list’s append operation.
Both of these are O(1) on Python’s built-in data structures, and the algorithm is now O(n) — a big improvement.
Moral of the story
If you repeatedly search one sequence to find an index in another, build a dictionary instead. Dictionary data structures are designed to make the get operation efficient.
Never build a long string by concatenation. Instead, collect the parts in a list, then join them all together at once.
Footnotes
- The first missing integer is at most n + 1, so the main loop runs O(n) times in the worst case.
- Python’s sets are a kind of hash map. Another efficient set data structure specifically for positive integers is a bit set.
- The first loop is O(n) assuming that accessing a key in the dictionary is O(1). This is true for Python dictionaries, which are a kind of hash map.
- That is, assuming there are no side-effects of calling the function.
There are no published comments.