Swissing a table
Moving step by step from a simple hash table to a “swiss table” — the design behind Go’s new map implementation.
We have an annual hackathon at Ravelin in the lead up to Christmas each year. This year I chose to investigate “Swiss tables”, the new hash table idea behind recent improvements to Go’s maps.
Why am I interested in this? I’ve written a number of specialist hash tables, and I wondered if these approaches would lead to any performance improvements. In particular, I’m responsible for a proprietary graph database we have at Ravelin, and the time it takes to add roughly 1 billion entries to one of these hash tables is crucial for the restart time for this database. I’d like to make the database restart faster if possible.
This blog post explains what Swiss tables are by working in stages from a very simple hash table to a full blown, but simplified, “Swiss” table. I’ve done some simple benchmarks at each point to illustrate the impact of each step.
You can find the code supporting this blog at https://github.com/philpearl/hashblog. I’ve taken code and ideas from both the Go source code and the Cockroach DB Swiss table implementation at https://github.com/cockroachdb/swiss.
The starting point
We’ll start with a basic generic hash table using a fixed-size table and a simple open addressing scheme.
The table is just an array of (key, value) pairs. I’ve chosen 32768 (2 to the power 15) as the size of the array.
To put something in this table, we take the key and pass it to a hash function. The hash function generates an integer, which we call “the hash”, for the key using a mathematical process. The hash is always the same for a given key, and very likely different for different keys.
We use this hash as an index into our array. Or rather, since the table is only 32768 entries long and our hash is potentially much bigger than that, we use the remainder after dividing the integer by 32768.
If that slot in the table is empty, that’s the slot we use. If not, we move forward in the table slot-by-slot until we find an empty slot. We use the first empty slot we find.
When finding something in the table we do the same thing. We use the hash of the key as the initial index. If the slot is in use we check the key matches. If not, we check the next slot. And carry on until we find a match or an empty slot. If we hit an empty slot then the key is not present in the table.
Both the Get and Set operations on this table are supported by the following find method, which finds an appropriate table entry to get or set. Note this is a very simple hash table indeed, and doesn’t support entries where the key is the zero value!
```go
func (st *SimpleTable[K, V]) find(key K) (*entry[K, V], bool) {
	index := hash(key) % simpleTableSize
	var zero K
	for {
		e := &st.entries[index]
		if e.key == key {
			// Found our entry
			return e, true
		}
		if e.key == zero {
			// Empty slot - this means the key is not
			// present in the table
			return e, false
		}
		index = (index + 1) % simpleTableSize
	}
}
```
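The hash function itself isn’t shown above. As a minimal stand-in (not necessarily what the repo uses), Go 1.24’s hash/maphash package can hash any comparable key directly:

```go
package main

import (
	"fmt"
	"hash/maphash"
)

// A process-wide random seed, as maphash requires.
var seed = maphash.MakeSeed()

// hash is a stand-in for the hash function used by find: it maps any
// comparable key to a uint64 via hash/maphash.
func hash[K comparable](key K) uint64 {
	return maphash.Comparable(seed, key)
}

func main() {
	const simpleTableSize = 1 << 15 // 32768
	// The same key always hashes to the same value...
	fmt.Println(hash("gopher") == hash("gopher"))
	// ...and the remainder modulo the table size is a valid slot index.
	fmt.Println(hash("gopher")%simpleTableSize < simpleTableSize)
}
```

The seed means hash values differ between process runs, which is fine for an in-memory table (and is what Go’s own maps do to resist collision attacks).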
That’s a very basic hash table. We can try ideas from Go’s new map one by one and see if they improve things.
A better probe sequence
If our initial target slot is not available, we can do something more sophisticated than just moving to the next slot. We can take larger steps. As long as we come up with a sequence of steps that is repeatable and hits every possible spot exactly once, we can step forward however we like.
The sequence of steps we use to find the next free slot is called the “probe sequence”. In our initial hash table the probe sequence is very simple: we just move on to the next slot in the table. This is called “linear probing”.
Why try a more complicated probe sequence? The thought is that just moving forward one slot at a time is more likely to run into a larger sequence of used slots, and to tend to make large sequences of used slots ever-larger. And that could lead us to waste a lot of time checking keys. But as far as I can tell it’s just practical experience that says this is the case, and there is no mathematical proof as yet that it’s always better.
The probe sequence used in modern Go maps is as follows.
- We start looking at the slot determined by the hash, call this h.
- If that slot is full, and not a match, we move on to h+1 (1 step forward)
- If that’s full & not a match we look at h+3 (2 steps forward)
- .. then h+6 (3 steps forward)
- .. then h+10 (4 steps forward)
- etc.
All of these are modulo the table length (that is, as before, we look at the remainder after dividing by 32768).
The code for the probe sequence is as follows. Note it’s the offset in the probeSeq struct that we use to index into the hash table.
```go
type probeSeq struct {
	mask   hashValue
	offset hashValue
	index  hashValue
}

func makeProbeSeq(hash, mask hashValue) probeSeq {
	return probeSeq{
		mask:   mask,
		offset: hash & mask,
		index:  0,
	}
}

// next advances the probe sequence to the next offset.
//
// Steps increase by 1 each time, so the sequence is:
//
//	h, h+1, h+3, h+6, h+10, ..., all modulo the table size.
func (s probeSeq) next() probeSeq {
	s.index++
	s.offset = (s.offset + s.index) & s.mask
	return s
}
```
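One nice property of this triangular-step sequence: when the table size is a power of two, it visits every slot exactly once before repeating, so we never miss an empty slot. A quick standalone sketch to check that, using the same step rule:

```go
package main

import "fmt"

// Check that the triangular probe sequence (steps of 1, 2, 3, ...)
// visits every slot of a power-of-two table exactly once.
func main() {
	const size = 16 // any power of two; the real table uses 32768
	mask := uint64(size - 1)
	offset := uint64(7) & mask // arbitrary starting hash
	seen := make(map[uint64]bool, size)
	for i := uint64(1); i <= size; i++ {
		seen[offset] = true
		offset = (offset + i) & mask
	}
	fmt.Println(len(seen) == size) // true: all 16 slots visited
}
```

This full-coverage guarantee only holds for power-of-two sizes, which is one reason the table size is a power of two and the modulo is done with a bitmask.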
And the find method now becomes the following. The difference is that we set up and increment the probe sequence in the for loop and use seq.offset as the index into the hash table.
```go
func (st *SimpleTableProbe[K, V]) find(key K) (*entry[K, V], bool) {
	h := hash(key)
	var zero K
	for seq := makeProbeSeq(h, hashValue(simpleTableSize-1)); ; seq = seq.next() {
		e := &st.entries[seq.offset]
		if e.key == key {
			return e, true
		}
		if e.key == zero {
			// Empty slot - this means the key is not
			// present in the table
			return e, false
		}
	}
}
```
Does that improve things?
I’ve written a simple benchmark that creates a hash table where the keys are strings and the values are ints. It then inserts n entries, with n varying between 10 and 32768.
32768 is the number of entries in the table, so when we approach having that number of keys in the table we’d expect it to run very slowly as there will be a lot of cases where a lot of probing is needed to find an empty slot. Hash tables aren’t usually allowed to be more than one half to three-quarters full!
Below are the results of the benchmark. SimpleTable and SimpleTableProbe (the one with the new probing sequence) take about the same amount of time per insert until we get a bit pathological and completely fill the table. At that point SimpleTableProbe is a lot faster.
| Size | SimpleTable (sec/op) | SimpleTableProbe (sec/op) | vs base |
|---|---|---|---|
| 10 | 6.215n ± 2% | 6.262n ± 1% | +0.75% |
| 100 | 6.335n ± 2% | 6.155n ± 0% | -2.85% |
| 1000 | 6.363n ± 0% | 6.248n ± 1% | -1.80% |
| 2000 | 7.098n ± 1% | 6.861n ± 0% | -3.33% |
| 4000 | 7.483n ± 0% | 7.399n ± 0% | -1.11% |
| 8000 | 8.966n ± 0% | 9.014n ± 4% | ~ |
| 16000 | 11.46n ± 0% | 11.26n ± 0% | -1.75% |
| 24000 | 14.76n ± 1% | 15.24n ± 0% | +3.25% |
| 32768 | 229.75n ± 0% | 58.70n ± 0% | -74.45% |
| geomean | 11.86n | 10.12n | -14.69% |
Groups
Each slot in our hash table can hold a single key and value. What happens if instead we put a group of eight keys and values in each slot?
I’m going to guess this inevitably makes things slower, but it’s a step towards Swissing our table.
We define a group to be 8 entries as follows. We’ll also make the table smaller so it contains 32768/8 = 4096 groups, so that this table has the same number of entries and same memory footprint as the previous tables.
```go
const groupSize = 8

type group[K comparable, V any] [groupSize]entry[K, V]
```
And our find method now becomes the following. The table becomes a table of 4096 groups rather than 32768 entries. The probe sequence is used to select a group rather than an individual entry. We search the current group for the key or an empty slot before moving on to the next group in the probe sequence.
```go
func (m *GroupTable[K, V]) find(key K) (*entry[K, V], bool) {
	h := hash(key)
	var zero K
	for seq := makeProbeSeq(h, hashValue(groupTableSize-1)); ; seq = seq.next() {
		g := &m.groups[seq.offset]
		// Is the key in this group?
		for i := range g {
			e := &g[i]
			if e.key == key {
				return e, true
			}
			if e.key == zero {
				// Empty slot - this means the key is not
				// present in the table
				return e, false
			}
		}
	}
}
```
Could this be faster than the non-grouped table? Perhaps the groups will provide us with some cache-locality? Let’s see what happens.
The tale of the tape
The GroupTable is slower than SimpleTableProbe at all points in our test. This grouping doesn’t seem like a good idea! But next we’re going to layer on some additional ideas that should improve things.
| Size | SimpleTableProbe (sec/op) | GroupTable (sec/op) | vs base |
|---|---|---|---|
| 10 | 6.262n ± 1% | 6.558n ± 1% | +4.74% |
| 100 | 6.155n ± 0% | 6.506n ± 1% | +5.72% |
| 1000 | 6.248n ± 1% | 7.734n ± 0% | +23.77% |
| 2000 | 6.861n ± 0% | 9.341n ± 0% | +36.15% |
| 4000 | 7.399n ± 0% | 11.680n ± 0% | +57.85% |
| 8000 | 9.014n ± 4% | 15.770n ± 0% | +74.95% |
| 16000 | 11.26n ± 0% | 19.54n ± 1% | +73.58% |
| 24000 | 15.24n ± 0% | 25.72n ± 0% | +68.77% |
| 32768 | 58.70n ± 0% | 69.50n ± 0% | +18.41% |
| geomean | 10.12n | 13.94n | +37.76% |
Group control bytes
Perhaps we can speed up the groups if we add a mechanism to more quickly identify potential matches and empty entries?
We add eight control bytes to each group of eight keys and values. Each control byte represents one entry in the group. The top bit of the control byte is 1 if the slot is empty and 0 if it is full. The bottom 7 bits are the bottom 7 bits of the hash for the entry.
We use these bytes as a (hopefully) fast way to find candidate matches or empty entries.
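As a concrete sketch of that split, using an arbitrary example hash value: the control byte, usually called h1, takes the bottom 7 bits, and the rest of the hash, h2, drives the probe sequence.

```go
package main

import "fmt"

func main() {
	h := uint64(0xABCDEF0102030756) // an arbitrary example hash
	h1 := byte(h & 0x7F)            // control byte: low 7 bits, high bit clear
	h2 := h >> 7                    // remaining bits select the group
	fmt.Printf("h1=%#x h2=%#x\n", h1, h2)
	// The empty marker 0x80 has its high bit set, so a stored h1
	// (always below 0x80) can never be mistaken for an empty slot.
	fmt.Println(h1 < 0x80)
}
```

Because h1 and h2 use disjoint bits of the hash, two keys that share a control byte are still likely to land in different groups.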
With this implementation we need to set the control bytes as well as the entries, and I’ve decided to abandon having a common find method for Get and Set. That may not have been necessary, but the find method looked a little clunky, and I thought a little duplication of code between the Get and Set methods was perhaps a better compromise. So, instead of looking at the find methods, here’s the implementation of the Set method.
```go
func (m *GroupTableCtrl[K, V]) Set(key K, value V) {
	h := hash(key)
	// h1 is the control byte value (bottom 7 bits of hash, top bit
	// clear to indicate the slot in the group is in use).
	//
	// h2 is used for probing at the group level. We don't use the
	// bottom 7 bits of the hash so that entries with the same h1
	// are more likely to be in different groups.
	h1 := byte(h & 0x7F)
	h2 := (h >> 7)
	for seq := makeProbeSeq(h2, hashValue(groupTableSize-1)); ; seq = seq.next() {
		g := &m.groups[seq.offset]
		// Is the key in this group?
		for i, ctrl := range g.ctrl {
			switch ctrl {
			case h1:
				if e := &g.entries[i]; e.key == key {
					e.value = value
					return
				}
			case 0x80:
				// Empty slot - this means the key is not
				// present in the table
				g.ctrl[i] = h1
				g.entries[i] = entry[K, V]{key: key, value: value}
				return
			}
		}
	}
}
```
The scores on the doors
GroupTableCtrl is better than GroupTable at all levels. It is slower than SimpleTableProbe in most cases, but is handily faster in our pathological full table case.
So this is beginning to look promising. If we can optimise things a bit further perhaps we can make some improvements over our simple table?
| Size | SimpleTableProbe (sec/op) | GroupTable (sec/op) | vs base | GroupTableCtrl (sec/op) | vs base |
|---|---|---|---|---|---|
| 10 | 6.262n ± 1% | 6.558n ± 1% | +4.74% | 5.933n ± 0% | -5.25% |
| 100 | 6.155n ± 0% | 6.506n ± 1% | +5.72% | 6.002n ± 0% | -2.48% |
| 1000 | 6.248n ± 1% | 7.734n ± 0% | +23.77% | 7.064n ± 0% | +13.06% |
| 2000 | 6.861n ± 0% | 9.341n ± 0% | +36.15% | 9.047n ± 0% | +31.86% |
| 4000 | 7.399n ± 0% | 11.680n ± 0% | +57.85% | 12.115n ± 0% | +63.73% |
| 8000 | 9.014n ± 4% | 15.770n ± 0% | +74.95% | 13.770n ± 1% | +52.76% |
| 16000 | 11.26n ± 0% | 19.54n ± 1% | +73.58% | 16.25n ± 1% | +44.32% |
| 24000 | 15.24n ± 0% | 25.72n ± 0% | +68.77% | 18.83n ± 0% | +23.56% |
| 32768 | 58.70n ± 0% | 69.50n ± 0% | +18.41% | 33.18n ± 0% | -43.48% |
| geomean | 10.12n | 13.94n | +37.76% | 11.62n | +14.83% |
Clever bit tricks
This is, I believe, what finally makes the table “Swiss”. There are some clever tricks we can do to check the control bytes in fewer operations.
We have 8 control bytes. A uint64 is 8 bytes long. What if we treat our 8 control bytes as a single uint64?
h1 is the bottom 7 bits of our hash value. If you remember, any group entry whose control byte matches this h1 is a candidate match. Can we quickly create a uint64 that has h1 in every byte, then do comparisons with that?
If we multiply our h1 by 0x0101_0101_0101_0101, that’s what we end up with, a uint64 with h1 in every byte.
If we XOR this with the 8 control bytes (also converted to a uint64) then the resulting bytes will be zero if they match h1 — let’s call this matchesAreZero.
We can then subtract 0x0101_0101_0101_0101 from that. Any byte that was zero will now have its high bit set. Some other bytes may also have the high bit set. If we AND NOT that with matchesAreZero then we’ll clear some bits, but none from the bytes that were all zero. So that will reduce false-positives.
We AND that with 0x8080_8080_8080_8080 to end up with candidate bytes with the high bit set.
Let’s try a worked example. Imagine our control bytes for a group are 0x3838_5676_3723_1280 and our hash is 0xABCD_EF01_0203_0756.
h1 is the bottom 7 bits of 0xABCD_EF01_0203_0756, or 0x56. We can see that in byte 5 (counting from the low-order) of the control.
- We multiply 0x56 by 0x0101_0101_0101_0101 to get 0x5656_5656_5656_5656.
- We XOR this with the control bytes. 0x3838_5676_3723_1280 XOR 0x5656_5656_5656_5656 is 0x6E6E_0020_6175_44D6. We call this value matchesAreZero. Note byte 5 is 00.
- We subtract 0x0101_0101_0101_0101 from matchesAreZero, with the aim of setting the top bit in any byte that is zero. The result is 0x6D6C_FF1F_6074_43D5. Note byte 5 is FF. But note also that the top bit of byte 0 is set.
- To reduce false positives we AND this with the inverse of matchesAreZero. This clears some bits from any byte that is not 0. The result is 0x0100_FF1F_0000_0301.
- We AND with 0x8080_8080_8080_8080 so that just the top bits of each byte remain set. The result is 0x0000_8000_0000_0000.
- We use bits.TrailingZeros64 / 8 to find the first candidate slot in the group. This yields 5.
Phew! That’s a complex process for a human to understand, but it’s fast to execute for a computer.
The algorithm gives some false positives as shown, but not too many, and we go on to check the keys so they’re not important.
The code for our Set method is now as follows.
```go
func (m *SwissTable[K, V]) Set(key K, value V) {
	h := hash(key)
	h1 := byte(h & 0x7F)
	h2 := (h >> 7)
	// Expand h1 to a 64-bit value where each byte is h1. This allows
	// us to compare against all control bytes in a group
	// simultaneously.
	h1Expanded := uint64(h1) * 0x0101010101010101
	for seq := makeProbeSeq(h2, hashValue(groupTableSize-1)); ; seq = seq.next() {
		g := &m.groups[seq.offset]
		// Find possible matches for this entry in the group.
		// findMatches returns a bitmask where each byte with a
		// matching control byte has its high bit set.
		matches := g.ctrl.findMatches(h1Expanded)
		for matches != 0 {
			i := bits.TrailingZeros64(matches) / 8
			if e := &g.entries[i]; e.key == key {
				e.value = value
				return
			}
			// Clear the lowest set bit and continue
			matches &= matches - 1
		}
		// Check for empty slot in group. This returns a bitmask where
		// each byte that is empty has its high bit set.
		if empties := g.ctrl.findEmpty(); empties != 0 {
			i := bits.TrailingZeros64(empties) / 8
			// Empty slot - this means the key is not present in the
			// table
			g.entries[i] = entry[K, V]{key: key, value: value}
			g.ctrl[i] = h1
			return
		}
	}
}
```
My God, it’s full of Toblerone!
Our current champion SimpleTableProbe beats the Swiss table until it gets about 1/4 full, then the Swiss table begins to pull ahead. The Swiss table even does reasonably well when fully populated, indicating that it could be used with fuller tables than the previous implementations, potentially saving quite a bit of memory despite needing additional control bytes.
| Size | SimpleTableProbe (sec/op) | Swiss (sec/op) | vs base |
|---|---|---|---|
| 10 | 6.262n ± 1% | 7.477n ± 1% | +19.40% |
| 100 | 6.155n ± 0% | 7.359n ± 0% | +19.56% |
| 1000 | 6.248n ± 1% | 8.524n ± 0% | +36.42% |
| 2000 | 6.861n ± 0% | 8.851n ± 0% | +29.00% |
| 4000 | 7.399n ± 0% | 8.854n ± 0% | +19.65% |
| 8000 | 9.014n ± 4% | 8.848n ± 0% | -1.85% |
| 16000 | 11.260n ± 0% | 9.167n ± 0% | -18.59% |
| 24000 | 15.24n ± 0% | 10.11n ± 0% | -33.66% |
| 32768 | 58.70n ± 0% | 16.64n ± 0% | -71.65% |
| geomean | 10.12n | 9.262n | -8.45% |
SIMD
SIMD stands for Single Instruction, Multiple Data. It’s a feature of CPUs. The idea is that you can use a single CPU instruction to perform an operation on multiple pieces of data at once.
Our control bytes are 8 pieces of data. We’d like to check them all against our h1 value all at once. So SIMD seems like it might apply.
The Go map implementation uses SIMD instructions to speed up the GOARCH=amd64 (Intel) version. It does not yet use SIMD on ARM chips, so it’s not involved on my modern Apple laptop. As of Go 1.25 it’s not possible for mere mortals to use SIMD instructions efficiently without writing vast swathes of assembly.
Go 1.26 introduces a library package of SIMD compiler intrinsics for Intel chips, which means you can efficiently use them with normal Go code. Unfortunately it doesn’t allow access to all the instructions used in the map implementation. I think I can build an Intel Swiss map implementation using intrinsics, but only on CPUs that support AVX512 instructions. And the machines I have easy access to do not.
So I think I’m going to save my SIMD investigations for later.
But the SIMD approach does sound interesting. Intel SIMD supports acting on 16 bytes at once, so it may make sense to increase the group size to 16 with a SIMD implementation. It will be interesting to see what effect that has.
Conclusion
The approaches behind Swiss tables are definitely interesting, but at least with my amateurish efforts, very simple hash tables, and only looking at insertion times, the benefits are not that enormous.
The biggest benefit appears to be that you can run with a fuller hash table with very little penalty. That would make it an opportunity to save memory rather than an enormous speed boost.
If I maintain enthusiasm I intend to follow up this post with 2 further efforts: one trying to use SIMD and another looking at approaches to growing a hash table.