The ‘Hash Map’ Data Structure
Feb 192018Computers read data from memory by its address — and memory addresses are numbers — so arrays are a natural data structure for sequence types which support ‘getting’ values by their numeric index. When we ‘get’ from an array, the computer can use the index to calculate the memory address of the value we want.
To implement a dictionary, we need a way to ‘get’ values by their keys.
An array is an efficient data structure because its indices are a compact range of numbers; but the keys in a dictionary generally aren’t numbers, so it’s not obvious how a key like "foo"
can be used to find where in memory the value associated with it is.
If we can’t directly calculate where the value is, we’d have to search for it, and this could be quite slow if there is a lot of data to search through.
What we really need is a way to map the dictionary’s keys to integers in a compact range; then we could use those integers as array indices to efficiently ‘get’ values by their keys. This is the core idea of the hash map data structure (also called a hash table). We assume there is such a function which takes a key as input, and outputs an integer (called the hash of the key).
Try the following in Python; what other types support the hash
operation?
>>> hash('foo')
-9176760052727676461
>>> hash('bar')
-2018549587826568091
>>> hash('baz')
8935266257754080596
Try the following in BlueJ’s code pad; what other types support the hashCode
operation?
>>> "foo".hashCode()
101574 (int)
>>> "bar".hashCode()
97299 (int)
>>> "baz".hashCode()
97307 (int)
Hashes have many uses; here, we just want to use them as indices in an array. However, it’s easy to see that real hashes are not within a compact range suitable for array indices. For example, if our array has length 10, then its indices are from 0 to 9. No problem; we can just take the hash’s last decimal digit, which will be in the desired range.(1) With this, we can use an array to store the dictionary’s values. Here’s our first attempt:
(For simplicity, let’s not worry about the other dictionary operations like ‘remove’ right now.)
After trying out this interactive model for a while, you should notice that it behaves like a dictionary most of the time, but not always.
The problem is that different keys can map to the same array index; if we ‘set’ a value for the key "blue"
, then we would incorrectly ‘get’ that value for the key "pink"
.
Similarly, if we ‘set’ a new value for the key "pink"
, it would overwrite the value for the key "blue"
.
Dictionaries are not supposed to behave like this!
This problem is called hash collision. To resolve it, we need to store the keys too.(2) Then when we ‘get’ the value for a key, we can check whether the key at that index really is the one we’re looking for. Let’s store them in another array — here’s our second attempt:
Now the ‘get’ operation won’t return a wrong value from a different key.
Unfortunately, it’s still impossible for this scheme to represent a dictionary with both "blue"
and "pink"
as keys.
From here, we must either store multiple key/value pairs at each array index (i.e. have an array of lists), or choose different indices when collisions occur. Both strategies are common, and there are many different ways to fill in the details. For this post, we’ll just consider one simple option: when an index is occupied by a different key, try the next index, then the next, and so on.
Try setting values for both "blue"
and "pink"
with the model below:
This model now correctly accounts for collisions, and includes almost all of the necessary ideas for implementing hash maps. There are still a few more details to fill in before it is a fully correct implementation of the dictionary ADT:
- We need to handle inserting a new key when the array is full. As we did for array-lists, we can create larger arrays and copy the keys and values across.(3)
- Checking whether the hash map contains a key is straightforward.
- Removing a key/value pair from the hash map is somewhat tricky, since splitting up a sequential run of keys could cause the ‘get’ or ‘set’ algorithms to behave incorrectly. We may need to move some other key/value pairs to different indices.
- Iterating over the keys, values or pairs is as simple as iterating over the arrays and ignoring the indices where no key exists.
Warning
We can only use the hash map data structure if the keys are hashable, i.e. they support a ‘hash’ operation with the following characteristics:
- The input is a key, and the output is an integer;
- The hash of the same key will never change;
- If two keys are equal, then their hashes are equal.
Not every type satisfies these assumptions.
It is desirable for different keys to have different hashes, but most hash functions do not guarantee this.
Footnotes
- More generally, we can use the modulo operation —
%
in most programming languages — to get indices for any array length. - Of course, we also need to store the keys so we can implement the ‘iterate over the keys’ operation.
Another more subtle reason is so we can tell whether a key is absent, or present with the value
null
. - This should use the ‘set’ operation rather than directly copying, since the indices must be calculated using the hash function. It is also very much desirable to do this before the arrays are full — about 75% full is a good heuristic, because probing stops when it finds an index with no key, and this happens quickly if there are many such indices.
There are no published comments.