Buffers and Allocation (Go and Python)

Over a year ago, Rob Pike wrote:

I was asked a few weeks ago, "What was the biggest surprise you encountered rolling out Go?" I knew the answer instantly: Although we expected C++ programmers to see Go as an alternative, instead most Go programmers come from languages like Python and Ruby. Very few come from C++.

I am one of these unexpected converts. I've been using Python for over 10 years, much of that time as a "professional" web developer, but Go has displaced Python for nearly all of my recreational development. This happened in a very short period of time, and in spite of the fact that I've given a real go at many other heavily fancied languages over that period. I picked up Go shortly after the release of 1.0 in late March of 2012, and have been reasonably active in the growing community ever since.

Because of this, I read with interest the blog posts from developers who are undergoing the same transition that I did. I even wrote one of my own.

One of Go's great strengths, which has been widely reported anecdotally but perhaps not yet fully explored, is the ease with which developers new to Go are able to write functioning, well-performing Go code. For instance, bit.ly remarked:

We’ve had developers start from zero experience with Go to writing working, production, code in a day. That is why we’re so excited about its place in our stack.

As a consequence, many developers who are past the learning stage and onto the stage of giving advice to other Python developers have less Go experience than I do. This is absolutely a testament to Go's virtues, not an accusation of hubris. Aditya gave a great Go-themed talk to the NYC Python Meetup after having used Go for only 6 months.

Still, I find these articles to be interesting on a philosophical level. What are the concepts that Python developers most often have a hard time with? What are the underlying causes for this confusion? Is it generally due to something which Python has that Go lacks, or vice versa? The most recent to cross my path was What Python developers need to know before migrating to Go(lang).

Python developers tend to struggle with a handful of concepts, and that struggle often surfaces as specific complaints. If you are approaching Go from Python, the one thing I want you to be aware of is the fanatical extent to which Go is concerned with allocation. This shows in the tools it provides to control allocation and in the way its semantics and standard library inform the use of those tools throughout the wider ecosystem.

To quote, well, myself:

In perhaps the seminal work on the subject of performance in Go, Russ Cox profiles a naively written Go program which commits both cardinal performance sins and, through excising hashes in favor of structs and lists and reducing allocations through classic techniques like caching/memoization, he manages to speed up a program by 15x while reducing its memory usage by nearly 2.5x. But, while Go has implicit allocation for single value variables, it has widely used idiomatic ways to pre-allocate slices and pointers to assist in reuse and to reduce copying.

C is a language with a reputation for being stingy with memory usage. Because there is no garbage collection, lots of C libraries avoid allocating anything and only operate on buffers whose memory is controlled entirely by the caller. This leaves the user open to many, many sources of bugs, but it does make absolutely clear when allocation will occur. Alex Gaynor called this BYOB: bring your own buffer.

In Go, you can control the capacity of any slice using the third argument to the make builtin, which gives you a level of control over allocation roughly similar to the *alloc family of functions in C. The builtin append is very fast when the capacity of your slice is sufficient, but must reallocate and copy the underlying array when you run out of capacity. To avoid buffer overflows, it uses reallocation strategies similar to those of Java Vectors and Python lists, and checking the capacity against the length is likewise a constant-time operation. Unlike Python, though, you can preallocate and potentially avoid copies.
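As a rough illustration of the difference preallocation makes, here is a minimal sketch. The helper names squares and countGrowths are made up for this example; only make, append, len and cap come from the language itself. The sketch counts how often append has to move to a larger backing array, with and without a capacity hint.

```go
package main

import "fmt"

// squares (hypothetical helper) returns the first n squares. The slice starts
// with length 0 and capacity n, so none of the appends needs to grow it.
func squares(n int) []int {
	out := make([]int, 0, n) // third argument to make is the capacity
	for i := 0; i < n; i++ {
		out = append(out, i*i)
	}
	return out
}

// countGrowths (hypothetical helper) appends n ints to a slice created with
// the given initial capacity and reports how many times the capacity changed,
// i.e. how often append had to reallocate and copy.
func countGrowths(n, initialCap int) int {
	s := make([]int, 0, initialCap)
	growths := 0
	for i := 0; i < n; i++ {
		before := cap(s)
		s = append(s, i)
		if cap(s) != before {
			growths++
		}
	}
	return growths
}

func main() {
	s := squares(5)
	fmt.Println(len(s), cap(s)) // 5 5

	// The exact number of growths without a hint depends on the runtime's
	// growth strategy; with a sufficient hint it is zero.
	fmt.Println("growths without preallocation:", countGrowths(1000, 0))
	fmt.Println("growths with preallocation:", countGrowths(1000, 1000))
}
```

When the final size is only roughly known, a capacity hint that is slightly off still works: append simply falls back to growing the slice once the hint is exhausted.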

Strings, as in Python, are immutable, which means that any operation that changes a string's contents must allocate a new string and copy into it. Because of this, many standard APIs in Go primarily use the []byte type to avoid unnecessary allocations, and provide *String variants for convenience. The regexp library is a prime example of this.
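To make the regexp example concrete, here is a small sketch of the two flavours side by side. The pattern and the helper names firstWordBytes and firstWordString are invented for illustration; Find and FindString are the actual regexp methods. The []byte variant hands back a slice of the input buffer, which the example demonstrates by writing through it.

```go
package main

import (
	"fmt"
	"regexp"
)

// word is an illustrative pattern: one or more lowercase ASCII letters.
var word = regexp.MustCompile(`[a-z]+`)

// firstWordBytes (hypothetical helper) returns the first match as a sub-slice
// of buf. No new storage is allocated for the matched text.
func firstWordBytes(buf []byte) []byte {
	return word.Find(buf)
}

// firstWordString (hypothetical helper) is the string-flavoured equivalent.
func firstWordString(s string) string {
	return word.FindString(s)
}

func main() {
	buf := []byte("123 hello world")
	m := firstWordBytes(buf)
	fmt.Printf("%s\n", m) // hello

	// m shares memory with buf, so a write through m is visible in buf.
	m[0] = 'H'
	fmt.Printf("%s\n", buf) // 123 Hello world

	fmt.Println(firstWordString("123 hello world")) // hello
}
```

If the data already arrives as a []byte, say from a file or a socket, staying with the []byte methods avoids the copy that a conversion to string and back would cost.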

Others operate on interfaces like io.Reader and io.Writer, which can be used with a []byte without significant extra allocation via bytes.Buffer, or with a buffered stream, e.g. from a socket or a file, without holding the entire stream in memory at any one time, similar to what the underused Python file API allows. The encoding/json package, for instance, provides the Marshal/Unmarshal functions, which both operate on []byte, but also provides Encoder and Decoder types, which take an io.Writer and an io.Reader respectively.

Even where interfaces work with strings for convenience's sake, the underlying code will use []byte for efficiency's sake. This is often exposed in alternate APIs, such as strconv's Itoa doppelganger strconv.AppendInt, which appends the string representation of an integer to a []byte without proxying through the string type or using reflection like fmt.Sprint would. If you are formatting many numbers into a buffer, say, to export a large CSV file, the savings in copies and allocations could add up to quite a lot.
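For the CSV case, a sketch along these lines shows the idea. The appendCSVRow helper and the buffer size of 64 are assumptions made for the example; strconv.AppendInt is the real standard library function. Every number is formatted straight into the caller's buffer, with no intermediate strings.

```go
package main

import (
	"fmt"
	"strconv"
)

// appendCSVRow (hypothetical helper) appends values to dst as a single
// comma-separated line and returns the extended slice, append-style.
func appendCSVRow(dst []byte, values []int64) []byte {
	for i, v := range values {
		if i > 0 {
			dst = append(dst, ',')
		}
		// Formats v in base 10 directly into dst.
		dst = strconv.AppendInt(dst, v, 10)
	}
	return append(dst, '\n')
}

func main() {
	// One buffer, allocated up front and reused for every row.
	buf := make([]byte, 0, 64)
	buf = appendCSVRow(buf, []int64{1, -20, 300})
	buf = appendCSVRow(buf, []int64{4, 5, 6})
	fmt.Print(string(buf))
	// 1,-20,300
	// 4,5,6
}
```

In a real exporter the buffer would typically be written to an io.Writer and then reset with buf = buf[:0], so the same backing array serves the whole file.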

The repustate article records these observations (among others):

Going between []byte and string. regexp uses []byte (they’re mutable). It makes sense, but it’s annoying all the same having to cast & re-cast some variables.

Different assignment operator is used depending on whether you are inside & outside of function ie. = vs :=

Writing to a file, there’s File.Write([]byte) and File.WriteString(string) – a bit of a departure for Python developers who are used to the Python zen of having one way to do something.

If you’re using JSON and your JSON is a mix of types, goooooood luck. You’ll have to create a custom struct that matches the format of your JSON blob, and then Unmarshall the raw json into an instance of your custom struct. Much more work than just obj = json.loads(json_blob) like we’re used to in Python land.

These points all suggest a failure to grasp the extent to which Go is concerned with allocation, and the advantages that widespread use of fast buffers confers upon Go over many of the dynamic languages out there today. Many Python programmers are left thinking that Go is a bit awkward but its execution speed is worth it, not realizing that much of what makes it fast is knowledge that can be exported in part back to Python. A reference to json.loads rather than its File API equivalent json.load shows just how prevalent loading from strings rather than from buffers is in Python; it's a window into that mentality. The reality is that copying and allocation are going to be the enemy of speed in most languages, and fully appreciating the impact of constructs you've been missing always happens after you've learned to ape their usage.

Sep 4 2013

jmoiron plays the blues
