DS&A - Data Structures & Algorithms - Choose the Right Data Structures

Choose the Right Data Structures

May 072018

In an earlier post we saw the importance of choosing the right data types when designing an algorithm. But we don’t just choose a type — whether we’re aware of it or not, we are also choosing a data structure which implements that type.

Both choices can have a big impact on how efficient an algorithm is; let’s consider some examples.

Smallest missing integer

Our first example is an exercise from Codility:

Problem: given a list of integers, find the smallest positive integer that does not occur in it.
- Input: a list of integers.
- Output: the smallest integer missing from the list.
Example: for the list [1, 2, 3], the correct output is 4.
Example: for the list [3, 1, 5, 4], the correct output is 2.

This is a straightforward problem with a straightforward solution. Unfortunately, the straightforward solution is a bad one!

Bad solution

We can solve this problem directly by testing if each number 1, 2, 3… is missing or not. Since we try them in order, the first missing number we find must be the smallest one. There is a subproblem of testing whether a number is missing from the list; fortunately, Python lists support the not in operator. Here’s the solution:

def smallest_missing_integer(numbers):
    i = 1
    while True:
        if i not in numbers:
            return i
        i += 1

This solution is bad because it’s unnecessarily slow.

To see why, consider how the not in operator works; the only way to test if a number is missing from a list is by linear search. That is, although Python makes not in look like a basic operation, what it’s actually doing is more like this:

def not_in(lst, target):
    for value in lst:
        if value == target:
            return False
    return True

def smallest_missing_integer(numbers):
    i = 1
    while True:
        if not_in(numbers, i):
            return i
        i += 1

This code works exactly the same way, but now it’s clear that numbers is being searched over and over again. The not_in function is O(n), but the algorithm calls it from inside a loop,(1) so the overall complexity is O(n²). Codility will reject this solution, because it’s not efficient enough.

Good solution

The problem isn’t the not in operator — it’s the data structure. Testing whether or not numbers contains i is only slow because numbers is a list.

In contrast, the complexity of the contains operation on a Python set is O(1).(2) Here’s the solution using a set:

def smallest_missing_integer(numbers):
    numbers_set = set(numbers)
    i = 1
    while True:
        if i not in numbers_set:
            return i
        i += 1

The code is almost exactly the same as before, but now there is no searching. This algorithm is O(n), which is efficient enough to satisfy Codility.

Moral of the story

Beware of O(n) operations disguised as basic operations, like in or not in on a list.

Set data structures are designed so that the contains operation is faster than searching a list. If you want to test whether some values are in (or not in) a collection, use a set.

Most common value

Our next example is another simple problem:

Problem: given a list, find the value which occurs the most times.
- Input: a list.
- Output: the value with the most occurrences in the list.
Example: the most common value in [3, 2, 1, 3] is 3.
Example: the most common value in ["f", "o", "o"] is "o".

This is a realistic problem; in statistics, the most common value in a list of numbers is called the mode.

Bad solution

We can solve this problem by directly applying “plans”: for each value, count the occurrences, and keep track of which value has the most occurrences so far.

# subproblem: count occurrences of a value in a list
def count_occurrences(lst, value):
    counter = 0
    for x in lst:
        if x == value:
            counter += 1
    return counter

def most_common_value(lst):
    # find the highest count
    most_common = lst[0]
    highest_count = count_occurrences(lst, most_common)
    
    for value in lst:
        count = count_occurrences(lst, value)
        if count > highest_count:
            most_common = value
            highest_count = count
    
    return most_common

This works, but it’s a bit complicated, and inefficient. The count_occurrences function is O(n), but the algorithm calls it from inside a loop, so it’s O(n²).

From another perspective, every time this algorithm sees a value, it counts the occurrences, even if they’ve already been counted before.

There is a simpler, faster solution.

Good solution

We can count every value in one pass through the list, using a dictionary to store the counters. Each time we see a value, we’ve either seen it before or we haven’t:

If the value is already in the dictionary, increase its counter by 1.
Otherwise, this is the first occurrence, so set its counter to 1.

After we’ve counted the occurrences of every value, we can search the dictionary for the one with the highest counter. Here’s the solution:

def most_common_value(lst):
    # count occurrences of every value
    occurrences = dict()
    for value in lst:
        if value in occurrences:
            occurrences[value] += 1
        else:
            occurrences[value] = 1
    
    # find the highest count
    most_common = lst[0]
    highest_count = occurrences[most_common]
    
    for value, count in occurrences.items():
        if count > highest_count:
            most_common = value
            highest_count = count
    
    return most_common

The total amount of code is about the same as before, but now there is no “loop within a loop”. There are two loops, but both are O(n), so the algorithm is O(n) in total.(3)

Python-specific detail

Python’s collections module has a class called Counter, which is a dictionary for counting occurrences. The following code works the same way as above, but is simpler:

from collections import Counter

def most_common_value(lst):
    # count occurrences of every value
    occurrences = Counter(lst)
    
    # ...

Moral of the story

If you need to count multiple things, count them all in the same loop. Do as much “work” as you can in one pass.

Instead of calling the same function again and again with the same inputs to find the same results, store those results in a dictionary.(4)

Substitution cipher

Our final example is a simple encryption scheme which works by substituting letters for other letters. The substitutions are defined by a “plaintext alphabet” and a “ciphertext alphabet”; for example, the plaintext alphabet is usually

"ABCDEFGHIJKLMNOPQRSTUVWXYZ"

and the ciphertext alphabet could be

"CEKTOBYZJAPDLVMRXHWNQSIUFG"

Using these alphabets, "A" is encrypted as "C", "B" is encrypted as "E", and so on. Symbols which aren’t in the plaintext alphabet, like " " or ".", are not substituted.

Problem: encrypt a message.
- Input: a plaintext alphabet.
- Input: a ciphertext alphabet.
- Input: a message.
- Output: the encrypted message.

Example: with the ciphertext alphabet "NRJMAUEFTXBHVPQOZILKYDGWCS" and the usual plaintext alphabet, the message

FEW FALSE IDEAS HAVE MORE FIRMLY GRIPPED THE MINDS OF SO
MANY INTELLIGENT MEN THAN THE ONE THAT, IF THEY JUST TRIED,
THEY COULD INVENT A CIPHER THAT NO ONE COULD BREAK.
- DAVID KAHN

is encrypted as

UAG UNHLA TMANL FNDA VQIA UTIVHC EITOOAM KFA VTPML QU LQ
VNPC TPKAHHTEAPK VAP KFNP KFA QPA KFNK, TU KFAC XYLK KITAM,
KFAC JQYHM TPDAPK N JTOFAI KFNK PQ QPA JQYHM RIANB.
- MNDTM BNFP

This isn’t a secure cryptographic scheme, but there are real reasons to substitute some symbols with others. For example, we might want to replace symbols like "é" with "e" when the font doesn’t support them.

Bad solution

We can encrypt a symbol by finding its index in the plaintext alphabet, and getting the substitute from the same index in the ciphertext alphabet. If the symbol is not in the plaintext alphabet, no substitution will be made.

Here’s the solution:

def encrypt(plaintext_alphabet, ciphertext_alphabet, message):
    output = ""
    
    for symbol in message:
        # if symbol is in the plaintext alphabet, substitute it
        i = plaintext_alphabet.find(symbol)
        if i >= 0:
            symbol = ciphertext_alphabet[i]
        
        output += symbol
    
    return output

This works, but it is not efficient.

You may have noticed that the find method is a linear search, and we’re doing this repeatedly with the same few symbols. Following the advice above, it would be better to store the plaintext-to-ciphertext substitutions in a dictionary.

But there’s a worse, more subtle problem. Every time we add a symbol to the output string, the whole string is copied to make a new string. If n is the message length, then the algorithm is O(n²) just because of string concatenation.

Good solution

Let’s fix both problems; we can begin by building a dictionary of substitutions. Then, to build the output string, we should put the parts in a list and join them all at once. Here’s the improved solution:

def encrypt(plaintext_alphabet, ciphertext_alphabet, message):
    # build a dictionary of substitutions
    substitutes = dict()
    for i, symbol in enumerate(plaintext_alphabet):
        substitutes[symbol] = ciphertext_alphabet[i]
    
    # collect the output in a list
    output = []
    
    for symbol in message:
        # if symbol is in the plaintext alphabet, substitute it
        if symbol in substitutes:
            symbol = substitutes[symbol]
        
        output.append(symbol)
    
    # join the parts all at once
    return ''.join(output)

The code is now more complicated, but we replaced the find method with a dictionary’s get operation, and we replaced the string concatenation with a list’s append operation. Both of these are O(1) on Python’s built-in data structures, and the algorithm is now O(n) — a big improvement.

Moral of the story

If you repeatedly search one sequence to find an index in another, build a dictionary instead. Dictionary data structures are designed to make the get operation efficient.

Never build a long string by concatenation. Instead, collect the parts in a list, then join them all together at once.

Footnotes

The first missing integer is at most n + 1, so the main loop runs O(n) times in the worst case.
Python’s sets are a kind of hash map. Another efficient set data structure specifically for positive integers is a bit set.
The first loop is O(n) assuming that accessing a key in the dictionary is O(1). This is true for Python dictionaries, which are a kind of hash map.
That is, assuming there are no side-effects of calling the function.

Tags:
efficiency, data structures, algorithms

Choose the Right Data Structures

Smallest missing integer

Bad solution

Good solution

Moral of the story

Most common value

Bad solution

Good solution

Python-specific detail

Moral of the story

Substitution cipher

Bad solution

Good solution

Moral of the story

Footnotes

There are no published comments.