Bad Go: guessing

Just write a benchmark

This is the 5th in a series of posts about Bad Go - a clickbaity title for posts about Go code that I’ve found frustrating because it could just be that little bit better. Better in my mind means more performant with less impact on GC, without being more complex or harder to read.

In this post I’ll rant about a problem I’ve seen in a number of blog posts. I’ve seen it more in blog posts than in actual code, but I’m sure people do this with real coding too. They just don’t shout about it.

The problem is guessing about performance. This is not good Go. Go has testing and benchmarking built in. Good Go includes benchmarks to prove that performance-related changes actually improve things. If you make a performance change and don’t have a benchmark that indicates it’s an improvement, that’s bad Go.

Bad Blog

Let’s write a bad blog post. We’ll write a merge sort, then write another one that we’ll claim is better with no proof! We’ll claim it’s better because it runs in parallel.

Here’s our initial merge sort. It takes a slice of integers, splits it in half then calls itself to sort each half. It then merges the two sorted halves together into a final sorted list.

func mergeSort(in []int) []int {
	if len(in) <= 1 {
		return in
	}
	mid := len(in) / 2
	sorted1 := mergeSort(in[:mid])
	sorted2 := mergeSort(in[mid:])
	return merge(sorted1, sorted2)
}

// merge zips two sorted slices together
func merge(in1, in2 []int) []int {
	size1, size2 := len(in1), len(in2)
	out := make([]int, size1+size2)

	var idx1, idx2, index int
	for idx1 < size1 && idx2 < size2 {
		if in1[idx1] < in2[idx2] {
			out[index] = in1[idx1]
			idx1++
		} else {
			out[index] = in2[idx2]
			idx2++
		}
		index++
	}
	for idx1 < size1 {
		out[index] = in1[idx1]
		idx1++
		index++
	}
	for idx2 < size2 {
		out[index] = in2[idx2]
		idx2++
		index++
	}
	return out
}

The code breaks the problem down into smaller and smaller problems, until at the bottom it’s sorting a list of length 1. Then it combines pairs of those length-1 solutions together by zipping together the two sorted lists, then zipping together pairs of larger sorted lists, then pairs of even larger sorted lists until it eventually zips together the sorted versions of the initial split and comes up with a final sorted answer.
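
To make that concrete, here’s how the calls play out for a small four-element slice:

// mergeSort([7, 3, 2, 293])
//   mergeSort([7, 3])
//     mergeSort([7])  -> [7]
//     mergeSort([3])  -> [3]
//     merge([7], [3]) -> [3, 7]
//   mergeSort([2, 293])
//     mergeSort([2])    -> [2]
//     mergeSort([293])  -> [293]
//     merge([2], [293]) -> [2, 293]
//   merge([3, 7], [2, 293]) -> [2, 3, 7, 293]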

Let’s start with a quick test to see if it actually sorts the []int. We’ll sort a short list of integers and check the result is what we expect (the assert here is from the testify package).

func TestSort(t *testing.T) {
	vals := []int{
		7, 3, 2, 293, 1, 34, 4, 99,
	}

	sorted := mergeSort(vals)
	assert.Equal(t, []int{1, 2, 3, 4, 7, 34, 99, 293}, sorted)
}

It passes! Our code can at least sort this one list.

So parallel!

We want a parallel version. Since the algorithm splits the initial list in two, the obvious thing to do is to sort those two halves in parallel. Then when we split each of those halves and sort them we’ll do each of those further halves in parallel too, and so on right down until we have length 1 lists that don’t need sorting.

As we’re sorting each half of our slice on a separate goroutine, we’ll pass back the sorted data in a channel. Then our merge step will take two sorted streams of ints from incoming channels and produce a merged stream in another output channel. Each layer in the process will take two sorted channels from the layer below and merge them into a further sorted channel.

Here’s our code for the parallel version.

func mergeSort2(in []int) []int {
	ch := make(chan int, len(in))
	mergeSortCh(in, ch)

	out := make([]int, 0, len(in))
	for i := range ch {
		out = append(out, i)
	}
	return out
}

func mergeSortCh(in []int, sorted chan int) {
	defer close(sorted)
	if len(in) <= 1 {
		if len(in) == 1 {
			sorted <- in[0]
		}
		return
	}
	mid := len(in) / 2
	sorted1 := make(chan int, mid)
	sorted2 := make(chan int, len(in)-mid)

	// Sort each half concurrently
	go mergeSortCh(in[:mid], sorted1)
	go mergeSortCh(in[mid:], sorted2)

	// Merge the channels
	mergeCh(sorted1, sorted2, sorted)
}

// mergeCh takes two sorted channels of ints in1, in2 and outputs a sorted stream of ints into out
func mergeCh(in1, in2, out chan int) {
	i1, ok1 := <-in1
	i2, ok2 := <-in2
	for ok1 && ok2 {
		if i1 < i2 {
			out <- i1
			i1, ok1 = <-in1
		} else {
			out <- i2
			i2, ok2 = <-in2
		}
	}
	for ok1 {
		out <- i1
		i1, ok1 = <-in1
	}
	for ok2 {
		out <- i2
		i2, ok2 = <-in2
	}
}

Nearly all our sorting and merging can take place on goroutines so we must be able to make good use of all our processors. My laptop has 4 cores with hyperthreading so this version will be 4 to 8 times faster.

You believe me, don’t you?

I need proof

OK, OK, let’s stop our imaginary bad blog post and go back to my normal, entirely imitable style. Of course we’re going to write a benchmark!

Regular readers know a benchmark is just a special kind of Go test that runs the code in question N times. The benchmarking framework takes care of deciding on N and doing all the measuring. For our benchmark we’ll take a random permutation of the ints from 0 to 999, and sort it. We’ll sort it with the standard library sort.Ints, with our initial mergeSort, and with the new parallel mergeSort2.

One tricky point is that sort.Ints sorts the []int in place, so to make sure we actually perform the sort on each iteration of the benchmark we need to copy our unsorted slice each time. We’ll do this for the standard library sort and for both merge sorts so we’re comparing like with like.

func BenchmarkSort(b *testing.B) {
	vals := rand.Perm(1000)
	toSort := make([]int, len(vals))

	b.Run("std", func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			copy(toSort, vals)
			sort.Ints(toSort)
		}
	})

	b.Run("merge", func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			copy(toSort, vals)
			mergeSort(toSort)
		}
	})

	b.Run("merge2", func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			copy(toSort, vals)
			mergeSort2(toSort)
		}
	})
}

Here are our results. Our first merge sort isn’t too bad - a little faster than the standard library sort, but with loads of allocations. The parallel sort is pretty bad: it takes over 6 times as long as the standard library sort, and makes twice as many allocations as even the first merge sort.

std-8    10000   105966 ns/op      32 B/op     1 allocs/op
merge-8  12502    95599 ns/op   81536 B/op   999 allocs/op
merge2-8  1761   664894 ns/op  299926 B/op  2011 allocs/op

So why doesn’t adding goroutines and channels make this faster? Well, for one thing every single int is pushed onto a channel at each layer of the merge, and last time I measured it, sending something on a channel takes ~90ns. With 1000 ints to sort that’s at least 90,000ns before we’ve even started to sort. And starting a goroutine takes ~350ns; with this algorithm we start nearly 2000 of them to sort 1000 ints, so that’s another 700,000ns. We’ve burned at least 790,000ns of CPU before we’ve done anything useful. We do spread this work across processors, but this isn’t a sensible approach. Channels and goroutines are great, but they’re not magic.
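
Those ~90ns and ~350ns figures come from measuring rather than guessing. If you want to check them on your own machine, rough micro-benchmarks along these lines will do it (the benchmark names are mine, and the numbers will vary with hardware):

func BenchmarkChannelSend(b *testing.B) {
	// Hand b.N ints to another goroutine over an unbuffered channel.
	ch := make(chan int)
	done := make(chan struct{})
	go func() {
		for range ch {
		}
		close(done)
	}()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		ch <- i
	}
	close(ch)
	<-done
}

func BenchmarkGoroutineStart(b *testing.B) {
	// Start b.N goroutines and wait for them all to run.
	var wg sync.WaitGroup
	for i := 0; i < b.N; i++ {
		wg.Add(1)
		go func() {
			wg.Done()
		}()
	}
	wg.Wait()
}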

Bonus round

If we’d never measured anything we’d never have known whether this version was faster or slower. And we’d never have seen that the first version of our merge sort makes quite so many allocations. Perhaps removing those allocations would be a better first step if we want to improve things?

How could we do that? Well, if you stare hard at this algorithm you’ll eventually spot that it splits the original slice into pairs of ints and orders those, then takes pairs of pairs and zips them together so it has several sequences of 4 ordered ints, then takes pairs of those, and so on. That’s awkward to do in place, but we can do it with two slices of ints, zipping ordered runs from one slice into the other, then swapping the two over.

Here’s our code. It looks very different from the original merge sort, but if you think about it long enough you’ll realise that from the data’s point of view it’s doing exactly the same thing. We’ve basically just unrolled the recursive function calls so we can keep better track of where the data is going.

Our two slices of ints are current and next, and we keep swapping between them. stride controls the size of the ordered lists we’re zipping together, starting with 1 then doubling each time.

func merge3(in []int) []int {
	ll := len(in)
	if ll <= 1 {
		return in
	}

	current := in
	next := make([]int, len(in))
	stride := 1
	for stride < ll {
		end := ll - stride
		i := 0
		for i < end {
			// zip these two together
			idx1, end1 := i, i+stride
			idx2, end2 := end1, i+stride*2
			if end2 > ll {
				end2 = ll
			}

			for idx1 < end1 && idx2 < end2 {
				if current[idx1] < current[idx2] {
					next[i] = current[idx1]
					idx1++
				} else {
					next[i] = current[idx2]
					idx2++
				}
				i++
			}
			for idx1 < end1 {
				next[i] = current[idx1]
				idx1++
				i++
			}
			for idx2 < end2 {
				next[i] = current[idx2]
				idx2++
				i++
			}
		}
		// copy across any tail that wasn't part of a merge on this pass, so it
		// isn't left behind when we swap the slices over
		copy(next[i:], current[i:])
		stride *= 2
		current, next = next, current
	}

	return current
}
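
Before we benchmark it, a quick check that merge3 actually sorts. It’s the same kind of test as before (TestSort3 is just my name for it):

func TestSort3(t *testing.T) {
	vals := []int{
		7, 3, 2, 293, 1, 34, 4, 99,
	}

	sorted := merge3(vals)
	assert.Equal(t, []int{1, 2, 3, 4, 7, 34, 99, 293}, sorted)
}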

If we feed this into our benchmark we’ll see it does reasonably well. It takes about half as long as the standard library or our initial merge sort, and only makes one allocation.
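
Hooking it into the benchmark is just one more sub-benchmark alongside the others, matching the merge3-8 line in the results:

	b.Run("merge3", func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			copy(toSort, vals)
			merge3(toSort)
		}
	})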

std-8    10000   105966 ns/op      32 B/op     1 allocs/op
merge-8  12502    95599 ns/op   81536 B/op   999 allocs/op
merge2-8  1761   664894 ns/op  299926 B/op  2011 allocs/op
merge3-8 23721    50860 ns/op    8192 B/op     1 allocs/op

By running a benchmark we’re able to spot a weakness in our original implementation: that it makes a large number of allocations. We were able to guess at an improvement and then confirm that the guess really does improve things. So please write benchmarks when you’re doing performance work. And newsletter compilers, please don’t include performance-oriented blog posts that don’t provide evidence for their suggestions.

If you’re still itching to parallelise a merge sort there are reasonable approaches that result in genuine improvements, and there are good blog posts describing how to do it. This one is pretty good!

If you’ve enjoyed this you might enjoy the previous posts in the series. The first two posts are about slices of pointers and pointer returns from functions. The third is about frivolous use of fmt.Sprintf. And the fourth tells you why you should size slices where you can.