GC is bad and you should feel bad

Don't feel bad

Some time ago I wrote about how I went to great lengths to move allocations off the Go heap and into memory allocated directly from the OS, in order to reduce the overhead of GC. This was in a new Graph Database I was developing at Ravelin to catch bad people more efficiently. At the time I wasn’t entirely certain the GC CPU overhead was a terrible thing, but it was untidy and I didn’t want to risk putting the new code into production without getting rid of it.

Well, over the past couple of days I’ve pulled the same trick with another component at Ravelin. In our ML fraud scoring pipeline we have a huge (>5GB) table of pre-calculated probabilities that we kept in a Go map[string]X in RAM. Recently we’d seen high GC pause times, corresponding with high 95th percentile response time peaks on our API. The GC pause time suddenly jumped around 6pm on January 7th, and we’re still not sure why. Rather than fret any longer, we took the decision to simply move this data away from where the GC could see it.
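(If you want to see pauses like this for yourself, you don’t need fancy tooling. The sketch below is one low-tech option: poll runtime.MemStats from inside the process and print the most recent stop-the-world pause. It’s purely an illustration, not how our monitoring is actually wired up.)

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// reportPauses polls runtime.MemStats and prints the most recent GC
// stop-the-world pause. PauseNs is a circular buffer of the last 256
// pauses; (NumGC+255)%256 is the index of the most recent one.
func reportPauses() {
	var ms runtime.MemStats
	for range time.Tick(10 * time.Second) {
		runtime.ReadMemStats(&ms)
		last := ms.PauseNs[(ms.NumGC+255)%256]
		fmt.Printf("GCs=%d lastPause=%v totalPause=%v\n",
			ms.NumGC, time.Duration(last), time.Duration(ms.PauseTotalNs))
	}
}

func main() {
	go reportPauses()
	// ... the rest of the service ...
	select {}
}
```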

Once we’d moved the allocations off-heap into a custom hash-map backed by mmap-allocated memory (yep, go ahead and hate me: it’s riddled with “unsafe”, “syscall” and pointer arithmetic, and no, I couldn’t “just use postgres” for this), the GC pauses practically disappeared.
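To give a flavour of what “mmap allocated memory” means in practice, here’s a minimal sketch: grab anonymous memory straight from the OS with syscall.Mmap and lay your own types over it with unsafe. This is just the trick in miniature, not our actual hash-map, and it only builds on Linux/macOS.

```go
package main

import (
	"fmt"
	"syscall"
	"unsafe"
)

// mmapBytes asks the OS for an anonymous, private mapping of n bytes.
// Memory obtained this way lives outside the Go heap, so the GC never
// scans it — and never frees it; you must Munmap it yourself.
func mmapBytes(n int) ([]byte, error) {
	return syscall.Mmap(-1, 0, n,
		syscall.PROT_READ|syscall.PROT_WRITE,
		syscall.MAP_ANON|syscall.MAP_PRIVATE)
}

func main() {
	buf, err := mmapBytes(1 << 20) // 1 MiB off-heap
	if err != nil {
		panic(err)
	}
	defer syscall.Munmap(buf)

	// Overlay a fixed-size record on the mapping via pointer casting.
	// (record is a made-up type for illustration only.)
	type record struct {
		key  uint64
		prob float64
	}
	r := (*record)(unsafe.Pointer(&buf[0]))
	r.key, r.prob = 42, 0.97
	fmt.Println(r.key, r.prob)
}
```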

Even more pleasingly, it removed over 100ms from the peak 95th percentile response time of the component, meaning improved response times for our clients and less risk of timeouts.

The Go GC isn’t bad. It’s actually very good. But if you have large multi-GB data structures involving pointers (and in particular, strings used as map keys contain pointers!) then you’re probably paying a hefty penalty as Go periodically goes to check that every single one of those strings is still in use. And if this is happening to you now and you don’t think it is a problem, it may suddenly become a problem one grey and wet January morning.
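If you want to convince yourself of the cost, here’s a toy comparison — nothing like our real table, the numbers will vary wildly on your machine — between a map with string keys and one whose keys and values contain no pointers at all. The runtime can skip scanning the buckets of the pointer-free map, so a forced collection comes back far faster.

```go
package main

import (
	"fmt"
	"runtime"
	"strconv"
	"time"
)

// timeGC forces a full collection and reports how long the call took.
// runtime.GC blocks until the cycle completes, so the duration is a rough
// proxy for how much marking work the live heap demands.
func timeGC() time.Duration {
	start := time.Now()
	runtime.GC()
	return time.Since(start)
}

func main() {
	const n = 1 << 22 // ~4M entries; purely illustrative

	// String keys: every key carries a pointer the collector has to chase.
	withStrings := make(map[string]float64, n)
	for i := 0; i < n; i++ {
		withStrings[strconv.Itoa(i)] = float64(i)
	}
	fmt.Println("GC with map[string]float64:", timeGC())
	runtime.KeepAlive(withStrings)
	withStrings = nil
	runtime.GC() // reclaim the first map before the next measurement

	// Integer keys and values contain no pointers, so the map's buckets
	// don't need scanning at all.
	withInts := make(map[int64]float64, n)
	for i := int64(0); i < n; i++ {
		withInts[i] = float64(i)
	}
	fmt.Println("GC with map[int64]float64: ", timeGC())
	runtime.KeepAlive(withInts)
}
```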

Oh, and we’re hiring. If you like code and you hate bad people, come and work with us!