NodeJS, events, and closures

I've been writing a bit in nodejs recently, writing an extended version of etsy's brilliant statsd which handles logging as well called logd. While writing it, I made a couple of mistakes that are probably quite common even to people who know JavaScript quite well but have not used it much outside the browser environment.

NodeJS is an event based framework around google's V8 engine that essentially allows you to write transparently asynchronous server-side code in JavaScript. It's probably worth it to explain what that all means, not only because I enjoy being didactic, but because some readers of my blog are non-technical and this is something many web-only js programmers might intuitively understand but fail to grasp the full implications of.

In a regular synchronous environment, programming is linear; execution follows the program text, and the statement on one line is executed after the statement on the previous line. The problem with this execution model occurs when you have to wait on a resource, like a database or a network operation. Typically we'd like to continue processing other things while we are waiting on these resources. Consider an advertisement being loaded off an ad server to be displayed on a page; we'd like to build the page while this process goes on in the background somehow.

The classical ways to approach this involve leaning on the operating system's scheduler to provide us some type of concurrency, like a thread or a process. So typically, you'd create a thread, and fetch the advert in the thread, go about your business creating the page in your original thread, and at the end wait on that thread's results. This puts the onus of scheduling that task either on your VM or your OS, and they generally do an alright job at that.

But there are other models to achieve synchronicity, and node's model uses cooperative multitasking. This avoids the overhead of creating threads (which can be costly), and simplifies scheduling since all of your handlers are probably quite fast as they themselves use event-based IO and cannot wait on slow resources.

In this way, nodejs executes your script, which defines different resources it will use and handlers for the events associated with those resources, and then enters a loop waiting on those events. The handlers can define additional resources to use, and additional event handlers, and so on. So for instance, your script can bind a socket to a local interface and listen, but it cannot wait on incoming messages. Instead, you define a handler function which receives a message as a parameter, and node executes this function when that event occurs.

In this way you can define how to handle various events (generally I/O in nature) and failure scenarios. Other languages have these types of systems, but JavaScript is particularly suited for this style of programming for a few reasons.

Firstly, this is actually the way most people employ JavaScript on the web; the browser runs a loop in which events are trigged by actions on the page. So when a user clicks a link, a "click" link is fired, and generally the way to capture that and do something is to listen for a click event from that link and then define some code to run in response to that event.

JavaScript itself also features a very natural anonymous function syntax; nameless function object literals which can be set to variables or passed to functions. It also has an implicit parameters variable and an implicit this variable, the first of which allows event handling functions to be called successfully even if their parameter list does not match the callers parameter list, the latter allowing the event loop to bind useful data to a common variable. In the DOM, libraries like jQuery generally bind the DOM element which triggered the event to this, but it's easy to see how this allows for widespread handler reuse, as there is a standard way for handlers to inspect the context of their execution.

So given all this suitability, how did I run into problems?

Well, it's actually rather easy to forget that code is not imperative, and introduce subtle synchronicity bugs. Consider the following simplified pseudo-code:

for (var i=0; i < logfiles.length; i++) {
    var path = logfiles[i];

    database.fetch(path, function() {
        database.delete('expired.' + this + '.' + path, function(err) { 
            if (err) console.log(err); 
        });
    });
}

This code loops through a list of logfiles, fetches something from those files from the database, and then deletes something in the database. The database.fetch operation takes an anonymous function as a callback; ie, when it's finished executing, it will call that function. This code is nice because it looks very nearly identical to what normal synchronous code looks like, but each database fetch and delete call happen asynchronously, and execution is allowed to continue both in this loop and in other event handlers.

The code unfortunately contains a bug that's quite crippling. The anonymous function passed as an event handler to fetch has access to the outside scope it's defined in, which contains the variable path, but because execution can continue in the outer loop, that scope can change while waiting for the database to do its thing. This means that path does not necessarily mean what you think it means by the time the event handler calling delete is called. In fact, it's very likely that the loop will continue to loop through the paths and complete by the time the first fetch is completed and its event handler fired.

The result is that when the handler passed to fetch is finally called, path is generally set to the final logfile, as the loop has already gone through all of its paces and exited on the final path. So although you might have had 10 paths, you call delete on the same final path each time.

Fortunately or unfortunately, this is often sidestepped because while other languages have syntactic support for a for each construct, JavaScript does not, and mimics this with a prototype function on the built in Array object. So another common way of writing that code might be:

logfiles.forEach(function() {
    var path = this;

    database.fetch(path, function() {
        database.delete('expired.' + this + '.' + path, function(err) { 
            if (err) console.log(err); 
        });
    });
});

This code does not suffer the same bug as above because each item in the array logfiles executes a new copy of the handler, which in turns creates a new local copy of path in its scope. The outer anonymous function forms a closure over the internal one, and its path variable is never modified within its context.

Many programmers might use this form because it's more convenient but miss the closure aspect of the outer anonymous function. Since Array.forEach in nodejs is synchronous anyway, if you wanted to keep the loop version (for instance, you needed i), you could form your own closure around the callback to fetch, locking the values of both path and i in the outer functions parameters:

for (var i=0; i < logfiles.length; i++) {
    var path = logfiles[i];

    database.fetch(path, function(path, i) {
        return function() {
            database.delete('expired.' + this + '.' + path + '.' + i, function(err) { 
                if (err) console.log(err); 
            });
        } 
    }(path, i));
}

Despite knowing all this, when switching back and forth between different environments and different languages, it's easy to forget until this type of bug bites you very hard. While writing logd, a bug of this nature did bite me. Twice. Ouch.

Aug 6 2011

jmoiron plays the blues

NodeJS, events, and closures