The why of Go strings
A story about strings that might even be true.
Regular readers of my blog will all be aware that in Go a string is in fact a struct with a pointer to an area of memory containing the byte content of the string and an integer Len that tells you how many bytes make up the string.
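If you'd like to see that struct for yourself, here is a minimal sketch. The stringHeader type and the headerLen helper are my own illustrative names, and the layout they assume (a pointer followed by a length) is how the standard Go toolchain happens to lay a string out, not something the language specification promises. Treat it as a peek under the hood rather than code to ship.

```go
package main

import (
	"fmt"
	"unsafe"
)

// stringHeader is a hypothetical mirror of the layout the standard Go
// toolchain uses for a string value: a pointer to the bytes, then a length.
// This is an implementation detail, not a guarantee of the language.
type stringHeader struct {
	Data unsafe.Pointer
	Len  int
}

// headerLen reads the length word straight out of the string value,
// assuming the layout described above.
func headerLen(s string) int {
	h := (*stringHeader)(unsafe.Pointer(&s))
	return h.Len
}

func main() {
	s := "h\u00e9llo" // the accented e takes two bytes in UTF-8

	// len and the header agree, and both count bytes rather than characters.
	fmt.Println(len(s), headerLen(s))

	// A string value is the same size as the two-word struct.
	fmt.Println(unsafe.Sizeof(s) == unsafe.Sizeof(stringHeader{}))
}
```

In everyday code you never need any of this: len(s) gives you the same number without touching unsafe.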
Ever wondered why? Probably not, as it seems “obvious” that you need both the length of the string and the bytes of the string to make a string. But it wasn’t always this way.
Pain
The elder progenitors of Go are also deeply implicated in the invention of C. C is a fantastic language that has underpinned most of computing for 50 years. In C there is (or at least was) no such thing as a string type. More of a string convention. A string is represented by a pointer to its first character. The next character is in the next memory location. And so on until the end of the string, which is marked by a zero.
There is no associated length field. To discover the length you call a function strlen, which is in the C standard library. This starts at the first character of your string, then moves forward in memory character by character until it finds a zero. The length is the number of bytes it passed over to reach this zero.
Zero
What if you’d failed to put a zero at the end of the string? Well, strlen would just keep going until it found a zero or the program crashed. Thus the history of strings in C is an astonishingly dangerous, precipitous and complex history of knowing exactly where your zeros are.
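To make the walk concrete, here is a rough simulation in Go, with a byte slice standing in for raw memory. The name cStrlen is made up for this sketch and it is not the C library's implementation. The important difference is that a Go slice knows where it ends, so the sketch can report a missing zero instead of wandering off into neighbouring memory.

```go
package main

import "fmt"

// cStrlen mimics the walk C's strlen performs: start at the first byte and
// count forward until a zero byte turns up. mem stands in for raw memory.
// ok is false when the slice ends before any zero is found; real strlen has
// no such limit and would simply keep reading.
func cStrlen(mem []byte) (n int, ok bool) {
	for n < len(mem) {
		if mem[n] == 0 {
			return n, true
		}
		n++
	}
	return n, false
}

func main() {
	terminated := []byte{'G', 'o', 0, 'x'}
	fmt.Println(cStrlen(terminated))

	unterminated := []byte{'G', 'o', '!'}
	fmt.Println(cStrlen(unterminated))
}
```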
Want to copy a string? You better know how long it is. Want to compare two strings? That’s two lengths that have to be right or your program might crash. Passed a string by a third party? Do you trust them enough to risk seeing how long it is?
strnlen
So not long into the history of C someone invented strnlen. This is like strlen, but you can specify a maximum length. Also strncpy and strncmp, for copying and comparing strings with some control over how far you are prepared to look. Various other safer variations followed, but all are difficult to use correctly.
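One example of "difficult to use correctly": strncpy stops after n bytes whether or not it has written a terminating zero. The sketch below imitates that behaviour on Go byte slices. cStrncpy and terminated are names invented for this illustration, and the caller is assumed to pass an n no larger than the destination.

```go
package main

import (
	"bytes"
	"fmt"
)

// cStrncpy imitates C's strncpy(dst, src, n) on byte slices. It copies bytes
// from src up to its zero terminator, but never more than n bytes, and pads
// any remaining space up to n with zeros. If src holds n or more bytes before
// its terminator, no zero is written at all.
// Assumption for this sketch: n <= len(dst).
func cStrncpy(dst, src []byte, n int) {
	i := 0
	for ; i < n && i < len(src) && src[i] != 0; i++ {
		dst[i] = src[i]
	}
	for ; i < n; i++ {
		dst[i] = 0
	}
}

// terminated reports whether buf contains a zero byte anywhere.
func terminated(buf []byte) bool {
	return bytes.IndexByte(buf, 0) >= 0
}

func main() {
	dst := make([]byte, 4)

	// Source shorter than the buffer: the tail is padded with zeros.
	cStrncpy(dst, []byte("Go\x00"), len(dst))
	fmt.Println(terminated(dst))

	// Source longer than the buffer: the buffer is filled and left
	// without a terminator, so a later strlen would run off the end.
	cStrncpy(dst, []byte("Gopher\x00"), len(dst))
	fmt.Println(terminated(dst))
}
```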
Consequences
Not having the length closely associated with the string has had terrible consequences. The most minor are fixed-size form fields. Ever wondered why your address must be less than 100 characters or your password 8 characters or fewer? Part of the answer is that it was just too hard to allow arbitrary length without risk that the program would crash.
The bigger problems are the horrible security flaws that mistakes in handling string lengths open up. Say an attacker sends you a string that’s longer than you expect and you copy it without checking. The string may overflow the memory region it is being copied into, and may overwrite some other piece of memory. If you are unlucky and the attacker is very cunning then they may be able to cause your program to execute arbitrary code. Which is bad.
So that’s why Go strings have a length field that’s set when the string is created. You always know the length of a string, and the compiler and runtime between them ensure the underlying memory buffer is adequate and that you don’t step off it. If you do manage to get things wrong the worst that happens is a panic. Nothing gets overwritten with unexpected values.
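Here is a small sketch of what that looks like in practice. The byteAt helper is a name I've made up for the example; it turns the panic into an error purely so the program can carry on and print it. The built-in copy is the real thing: it stops at the shorter of the two lengths, so an over-long source cannot spill past the destination.

```go
package main

import "fmt"

// byteAt returns s[i]. An out-of-range index makes the runtime panic; this
// illustrative helper recovers from that panic and reports it as an error.
func byteAt(s string, i int) (b byte, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered: %v", r)
		}
	}()
	return s[i], nil
}

func main() {
	s := "gopher"

	// copy is bounded by both lengths, so the extra bytes are simply
	// not copied rather than written past the end of dst.
	dst := make([]byte, 3)
	n := copy(dst, s)
	fmt.Println(n, string(dst))

	// Stepping one past the end panics instead of reading stray memory.
	_, err := byteAt(s, len(s))
	fmt.Println(err)
}
```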
To be clear, I don’t think Go introduced this idea for handling strings, and I’m certain Go didn’t invent the idea of keeping track of the length. But this is one of the things that makes Go a “safer” language for writing servers.
First published here