Good Go: Contributing to encoding/json
In my last post I whinged endlessly about encoding/json and in particular about the poor performance of anything with a custom marshaler. I decided to try to do something about it and I raised a proposal for a new marshaler interface.
As I half expected I wasn’t immediately welcomed as the new Go Messiah and my proposal has been effectively rejected (stop press - actually rejected). There’s very little inclination amongst the Go maintainers to change any of the interfaces in the standard libraries, and for some reason changing encoding/json or encoding/xml is viewed with particular horror.
So I could cry and sulk a bit (and of course I will), but I could also see what else I could do to improve things. When I looked at this before there were 3 allocations for every field with a custom JSON marshaler. One is imposed by the poor interface. I wondered what I could do about the other two?
Now, I happen to have been to a Go contributor’s workshop, and I have made a simple contribution to Go before, so I do have the Go git repo on my laptop. So step one was to pull master, build the Go toolchain and run the encoding/json tests and benchmarks.
After a bit of poking about I found there was no benchmark for marshaling a struct with a custom marshaler. Following the principle that you cannot improve what you do not measure, adding a new benchmark was step one.
This is the new benchmark. It uses an Encoder as this can be reused, which reduces the noise from the test and shows the overhead due to the custom marshaler in stark relief.
func BenchmarkEncodeMarshaler(b *testing.B) {
	b.ReportAllocs()
	m := struct {
		A int
		B RawMessage
	}{}
	b.RunParallel(func(pb *testing.PB) {
		enc := NewEncoder(ioutil.Discard)
		for pb.Next() {
			if err := enc.Encode(&m); err != nil {
				b.Fatal("Encode:", err)
			}
		}
	})
}
And here are the results running on Go 1.13-ish. As expected from my previous post there are 3 allocations on every run.
BenchmarkEncodeMarshaler-8 8169886 130 ns/op 100 B/op 3 allocs/op
Next we run a profile to see where those allocations are. We run the benchmark again with memory profiling enabled, then start the profiling tool and select the alloc_objects index. This index counts every time an object is allocated over the run, so it is useful for showing us where most of the allocations occur. The -run ^$ on the test command ensures we don't run any tests which might pollute the profiling data. ^$ is a regular expression: ^ matches the beginning of the test name and $ matches the end, so -run ^$ won't match any tests, and just our benchmark will run.
go test -run ^$ -bench BenchmarkEncodeMarshaler -memprofile mem.prof
go tool pprof json.test mem.prof
(pprof) sample_index=alloc_objects
Let’s look for the functions that allocate the most objects. The top command does this for us.
(pprof) top
Showing nodes accounting for 21529479, 100% of 21529577 total
Dropped 4 nodes (cum <= 107647)
Showing top 10 nodes out of 13
flat flat% sum% cum cum%
9437472 43.83% 43.83% 9437472 43.83% reflect.packEface
9404990 43.68% 87.52% 9404990 43.68% encoding/json.compact
2687017 12.48% 100% 2687017 12.48% encoding/json.RawMessage.MarshalJSON
0 0% 100% 21529479 100% encoding/json.(*Encoder).Encode
0 0% 100% 21529479 100% encoding/json.(*encodeState).marshal
0 0% 100% 21529479 100% encoding/json.(*encodeState).reflectValue
There are only 3 functions with significant allocations. One is within the RawMessage.MarshalJSON call and is the one we know is forced by the interface. One is in the compact function in the json library, and one is within the reflect package.
Let’s start with compact. We can look at an annotated source listing to show us where the allocations are in that function.
(pprof) list encoding/json.compact
Total: 21529577
ROUTINE ======================== encoding/json.compact in /Users/phil/Projects/go/src/encoding/json/indent.go
9404990 9404990 (flat, cum) 43.68% of Total
. . 12: return compact(dst, src, false)
. . 13:}
. . 14:
. . 15:func compact(dst *bytes.Buffer, src []byte, escape bool) error {
. . 16: origLen := dst.Len()
9404990 9404990 17: var scan scanner
. . 18: scan.reset()
. . 19: start := 0
. . 20: for i, c := range src {
. . 21: if escape && (c == '<' || c == '>' || c == '&') {
. . 22: if start < i {
So all the allocations here are creating the scan variable. This must escape to the heap and hence cause an allocation. (I’m not going to go into the details of why it escapes - suffice it to say that some shit goes down)
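(If you do want a peek, the compiler will report its escape analysis decisions if you ask. The command below is just one way to see them - the exact messages vary from one Go version to the next.)
go build -gcflags='-m' encoding/json 2>&1 | grep indent.go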
Why are we calling compact? The code below shows it is called after each call to a MarshalJSON method, and the comment implies it is there because the library doesn’t trust marshalers to produce valid, maximally compact JSON. Whatever the reasoning, the call to compact will happen for every field in the JSON with a custom marshaler, and each call will cause an allocation - so it’s worth addressing.
b, err := m.MarshalJSON()
if err == nil {
	// copy JSON into buffer, checking validity.
	err = compact(&e.Buffer, b, opts.escapeHTML)
}
What can we do about this? Well, our problem is not that we get a few allocations when we marshal some JSON, it’s that our programs do this marshaling over and over again, and the allocations occur over and over again. If we find some way to re-use these scanners from one Marshal call to the next then we’ll have addressed the problem.
The encoding/json API doesn’t provide a mechanism for a program to maintain any state from one marshaling call to the next, so within encoding/json we need to maintain some scanners for re-use without help from the application. sync.Pool exists for this precise purpose. We’ll get scanners from the pool and return them to the pool when we’re done. sync.Pool is very fast, so we should make savings over allocating a new scanner each time and having it garbage collected. We do need to ensure the scanner is fully reset each time - nothing will zero its fields for us.
We’ll create a pool for scanners, and some helper functions to get and release them which will also ensure the scanner is reset.
var scannerPool = sync.Pool{
	New: func() interface{} {
		return &scanner{}
	},
}

func newScanner() *scanner {
	scan := scannerPool.Get().(*scanner)
	// scan.reset by design doesn't set bytes to zero
	scan.bytes = 0
	scan.reset()
	return scan
}

func freeScanner(scan *scanner) {
	// Avoid hanging on to too much memory in extreme cases.
	if len(scan.parseState) > 1024 {
		scan.parseState = nil
	}
	scannerPool.Put(scan)
}
We’ll then change compact to take a scanner from the pool rather than declare a scanner value.
func compact(dst *bytes.Buffer, src []byte, escape bool) error {
	origLen := dst.Len()
	scan := newScanner()
	defer freeScanner(scan)
Note that defer performance will greatly improve in Go 1.14, so we can happily use it here without worrying about the overhead. (Also note that there are some other small changes involved in this PR, excluded here for brevity.)
If we run the benchmark again at this point and run the old and new results through benchstat, this is what we get. Quite a nice speed improvement, and importantly we’ve removed the allocation.
name old time/op new time/op delta
EncodeMarshaler-8 118ns ± 2% 104ns ± 1% -12.21% (p=0.001 n=7+7)
name old alloc/op new alloc/op delta
EncodeMarshaler-8 100B ± 0% 36B ± 0% -64.00% (p=0.000 n=8+8)
name old allocs/op new allocs/op delta
EncodeMarshaler-8 3.00 ± 0% 2.00 ± 0% -33.33% (p=0.000 n=8+8)
A word of advice about running benchmarks. When measuring performance, shut down absolutely everything you can on your machine to stop it interfering with the running time of the benchmark. This should reduce the variation between runs. Oh, and run the benchmark multiple times (I used -count 8), and use benchstat to summarise and compare the results.
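For reference, the whole loop looks something like this (the file names are just what I happened to use, and benchstat lives in golang.org/x/perf):
go get golang.org/x/perf/cmd/benchstat
go test -run ^$ -bench BenchmarkEncodeMarshaler -count 8 > old.txt
# ...make the change, then...
go test -run ^$ -bench BenchmarkEncodeMarshaler -count 8 > new.txt
benchstat old.txt new.txt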
If we re-run our profiling, top shows compact is no longer causing allocations.
(pprof) top
Showing nodes accounting for 12075339, 100% of 12075851 total
Dropped 3 nodes (cum <= 60379)
Showing top 10 nodes out of 12
flat flat% sum% cum cum%
9650470 79.92% 79.92% 9650470 79.92% reflect.packEface
2424869 20.08% 100% 2424869 20.08% encoding/json.RawMessage.MarshalJSON
0 0% 100% 12075851 100% encoding/json.(*Encoder).Encode
0 0% 100% 12075339 100% encoding/json.(*encodeState).marshal
0 0% 100% 12075339 100% encoding/json.(*encodeState).reflectValue
This change (and one other that gets rid of the other allocation) will be part of Go 1.14! How exciting!
How was the experience of contributing these changes? Well, it’s satisfying to get the changes in, but the process takes some getting used to if you’re an emotional weakling like me. Particularly if you are used to slapping changes into git then pushing them to production the same day after a cursory review from a busy colleague. Expect lots of picky code review and very high standards for every aspect of each change. And quite a few rounds of feedback as you try to feel out what’s good for the Go codebase. Some sequences have certainly been shortened in this explanation of how these changes came about! You can see the full extent of the process and my many mis-steps here. Many thanks to Daniel for his patient & expert review, and particularly for his enthusiasm when the changes were finally ready to be merged.