Breaking the Loop

In the absence of Voldemort, writing reusable abstractions on slices in Go has come to focus on the index values, which always have the same type (int) regardless of the slice. Though the pattern pre-dates it, accepting a closure that received one or more offsets as data was popularised by the introduction of sort.Slice:

func Slice(slice interface{}, less func(i, j int) bool)

This pattern neatly avoids reflection for item access, so the closure you write can can directly access your slice, avoiding overhead and remaining typesafe. When the abstraction you're writing must control the iteration because it is complex and non-linear, like sorting or binary search, then hiding the loop details are a significant part of the abstraction. When this isn't the case, this pattern offers less.

A recent blog post at pace.dev introduced a simple abstraction for batching operations in Go. I can't stress enough that this is a great post. It is a superb demonstration of writing simple and useful abstractions in Go, covering the design and testing process in a way too often left out.

It defines this batch function, using the closure style, and in the process it abstracts a pretty straightforward loop:

func batch(count, batchSize int, eachFn func(start, end int) error) error

I wanted to explore this a bit, and compare a contrasting style for writing iterators that doesn't break the caller's control of the loop, inspired more by database/sql.

In this style, the goal is to allow the caller to control the iteration directly, with standard control flow using for, by providing an iterator that implements this interface:

type Iterator interface {
    // Next returns false when there are no more items
    Next() bool
}

This allows loops to be written like this:

for iter.Next() {
    // your code here
}

In order for the iterator to actually be useful, it has to make data available. In the sql iterator case, a Rows can Scan into your local variables, giving you access to the values in the current row. In the batching case, we only need the indexes. This can be done in several ways, but a method is fine for demonstration:

for it.Next() {
    start, end := it.Bounds()
    batch := slice[start:end]
    ...
}

Now that we understand our desired interface for the iterator, we can create a concrete type for the iterator and a batch function for convenience:

type BatchIter struct {
    count, batchSize, current int
}

func (b *BatchIter) Next() bool
func (b *BatchIter) Bounds() (start, end int)

func Batch(count, batchSize) *BatchIter

And to use it:

it := Batch(len(slice), 10)
for it.Next() {
    // ...
}

The code ends up looking pretty similar, so what are the differences between the approaches?

The most obvious difference is that we don't need to define a special Abort error sentinel to stop iteration early; we can rely on normal Go control flow to do that with the break statement:

for it.Next() {
    if uhOh() {
        break
    }
}

I consider this significant, because in general we expect Go programmers to know Go, but our interface must be learned. The resulting code is not too different, but this suggests that the Abort signal is a re-implementation of control flow that already existed, waiting to be used.

This difference is due to direct versus indirect iteration control, and it can manifest itself in other ways.

While it's equally easy to write something that only processes even batches with either approach, other operations like scanning are more difficult when control is indirect. This is unlikely to be a problem for batch, but it pops up frequently in parsers. Suppose you wanted to skip 5 batches; in the closure version, that state would have to be defined outside of the "loop" and then maintained within:

var skip int

batch(count, batchSize, func(start, end int) bool {
    if skip > 0 {
        skip--
        return true
    } 
    // processing code
    if needToSkip() {
        skip = 5
    }
})

The equivalent code with direct iterator control allows you to embed all that logic very naturally using Go syntax:

it := Batch(count, batchSize)

for it.Next() {
    // processing code
    if needToSkip() {
        for i:=0; i<5; i++ { 
            it.Next() 
        }
    }
}

Since we have an iterator type now, we could implement a Skip function that did this for any iterator:

// Skip advances iter n times
func Skip(iter Iterator, n int)

Doing this more cleanly with the closure version would be possible by defining a new Skip sentinel and modifying Batch, more clearly demonstrating that Abort is special cased control flow.

At work, we have several data storage systems written in Go, and the iterator approach has proven its value many times. Often, the right high level interface does not match the optimal way to read from storage, and bridging that gap with batches and iterators has allowed us to achieve good performance without sacrificing too much sanity. An implementation of BatchIter as described above has lived in our codebase for a long time.

We've come to adopt concrete iterator types that implicitly implement this basic interface most of the time, in preference to the closure style. On top of the control flow concerns, types end up being more easily composable in several key ways that functions are not. As a rule of thumb, if there's only one for that you're abstracting, it's more flexible to let the caller own it.

The closure style has become increasingly popular over time, to the point where it's the first thing many programmers think of when an abstraction related to a loop is needed. Before you reach for it, consider how you might implement your abstraction without breaking the for loop for your users.

Feb 17 2020

jmoiron plays the blues

Breaking the Loop