Is johnny-cache for you?

I've been pleasantly surprised with the amount of interest in johnny-cache since Jeremy and I released it this past weekend. A lot of the comments revealed that perhaps the documentation is missing an important discussion on the repercussions of using Johnny. They are also pretty positive about the name :)

"Is johnny-cache for you?" is the most important question that is not answered by the documentation. Using Johnny is really adopting a particular caching strategy. This strategy isn't always a win; it can impact performance negatively:

any real database read is first a cache miss, then a cache write
any database write is a cache write
any write to any table invalidates all cache depending on that table
there are extra cache reads on every request to load the current generations
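
The costs above all fall out of generation-based invalidation. Here is a minimal sketch of that idea, not Johnny's actual implementation: a plain dict stands in for memcached, and the names `GenerationalCache`, `read`, `write`, and `fake_db` are made up for illustration.

```python
import hashlib


class GenerationalCache:
    """Toy model of table-generation caching; a dict stands in for memcached."""

    def __init__(self):
        self.store = {}      # hypothetical stand-in for a memcached client
        self.db_reads = 0    # counts queries that really reach the "database"

    def _generation(self, table):
        # The extra cache read needed to load a table's current generation.
        return self.store.setdefault("gen:" + table, 0)

    def _key(self, sql, tables):
        # The key embeds the generation of every table the query depends on.
        gens = ",".join(f"{t}={self._generation(t)}" for t in sorted(tables))
        return "q:" + hashlib.md5(f"{gens}|{sql}".encode()).hexdigest()

    def read(self, sql, tables, run_query):
        key = self._key(sql, tables)
        if key in self.store:
            return self.store[key]   # cached read: the database is not touched
        self.db_reads += 1           # a real read is first a cache miss...
        result = run_query(sql)
        self.store[key] = result     # ...and then a cache write
        return result

    def write(self, table):
        # A database write is also a cache write: bumping the generation
        # orphans every cached result that depends on this table.
        self.store["gen:" + table] = self._generation(table) + 1


cache = GenerationalCache()
fake_db = lambda sql: ["row for " + sql]   # hypothetical query runner

cache.read("SELECT * FROM post", ["post"], fake_db)   # miss, then cached
cache.read("SELECT * FROM post", ["post"], fake_db)   # served from cache
cache.write("post")                                   # invalidates the above
cache.read("SELECT * FROM post", ["post"], fake_db)   # miss again
print(cache.db_reads)  # 2
```

Because a join's key includes the generation of each table involved, a write to any one of them sends the whole join back to the database.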

The major positive impact is:

any cached read doesn't hit your database

This turns out to be a pretty exceptional positive for a pretty large class of applications. Loading from memcached is going to smoke even your db's query cache with respect to latency while giving you cheap and easy horizontal scalability. It's not often you get these two hand in hand.

Every time you do a query that hits cache, your database doesn't have to accept a connection, allocate cursors, examine your query, execute it, and return the result. This is a fairly heavy load to lift off of your database servers.

If you were using something akin to MySQL's query cache before, you can pretty much turn it off. Not only do you get that memory back for loading indexes, performing queries, etc., but you can now horizontally scale your query cache with ease.

Pre-Django 1.2, splitting db reads and db writes at the application level was a real pain. Scaling reads across a pool of read-only databases is no picnic, either. For a read-heavy application, Johnny can alleviate so much read traffic that you can potentially just scale reads in memcached. Even if you need to horizontally scale reads across a read-only pool, the databases in it now have a shared query cache, such that a read on one slave saves reads on another.

Still, writes eventually happen, and when they do, Johnny will blow away the cache depending on the table written to. The implications of this are that Johnny's effectiveness is reduced if you:

have "logical" write operations that hit many tables
write heavily to one table that is then featured in many joins
have very few tables

An unappreciated caveat to this is that the relative frequency of your writes and reads matters quite a bit. Take a simple one page, one query, one table scenario where you are receiving about 1 write per second. This might seem like too often for Johnny to be useful, but each write only forces the next read to miss. If you serve 30 pages per second, 29 of every 30 reads still hit cache, which is nearly 97% of the time.
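
To play with that arithmetic for your own traffic, here is a small sketch. It assumes one query per page and that each write causes exactly one subsequent read to miss; `hit_rate` is just a name picked for this example.

```python
def hit_rate(reads_per_second, writes_per_second):
    """Fraction of reads served from cache when each write forces one miss."""
    # At most one miss per invalidation, and never more misses than reads.
    misses = min(reads_per_second, writes_per_second)
    return (reads_per_second - misses) / reads_per_second


print(f"{hit_rate(30, 1):.1%}")  # 96.7%
```

Real applications run several queries per page across several tables, so treat this as an upper bound for a single query rather than a prediction.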

Typical webapps are going to read far more often than they write, and serve a few pages far more often than the other pages on the site. For these apps, Johnny will probably work quite well. Even in cases where it doesn't fly, it's probably a good starting point.

But due to the magic of the internet, I don't have to rely solely on hypothetical and anecdotal evidence. Someone running such an application tried Johnny out and wrote a nice little blog post about his results. His chart even suggests that his application is quite write-heavy. It also looks pretty similar to what we saw when we pushed the primordial version of Johnny live last year. The post itself is pretty fascinating; the Reader's Digest version is that he already had some caching in place, but installed Johnny, set it up, and his query count still dropped pretty dramatically (illustrated). Note that it wasn't just cache hits that dropped; Johnny can cache some queries that MySQL can't, and there are other classes of queries that are impossible to cache but are easily avoided. Despite that initial positive result, he noticed that his CPU utilization and context-switching increased, likely because memcached and MySQL (and perhaps even his app server) were running on the same box.

So, where to take Johnny from here? Johnny is version '0.1' not because we think it's barely ready for use, but because we felt like we released the smallest piece of software that could actually be of use.

The first improvement would be a way to allow application authors to keep Johnny from caching result sets from tables that receive very heavy write traffic, like a log table. Although monkeypatching was really the only way to achieve the level of integration and simplicity we needed, you always have to acknowledge that there will be cases where people only want to use your code some of the time, or maybe most of the time, but not all of the time. Some kind of model annotation or table blacklist might suffice here, but I want to think through this and its invalidation implications a bit more before deciding on how to do it.

Another improvement I want is increased access to the generational keys Johnny maintains. I recognize cases where you might want to use Johnny's invalidation to consistently cache higher level objects like HTML fragments or even entire pages. Consider something like an @invalidate_on_model(Post) decorator for an RSS feed of latest blog posts that would only have to be generated upon the first read, and invalidates automatically when the Post's table is altered (or after some optional timeout). I'm still trying to work out how to increase this idea's usefulness when you introduce pagination.
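
A rough sketch of what such a decorator could look like. None of this exists in Johnny: the dicts stand in for Johnny's generation keys and for memcached, `invalidate_on_table` takes a table name instead of a model to keep the example free of Django, `table_written` is a hypothetical hook, and the optional timeout is left out.

```python
import functools

_generations = {}   # hypothetical stand-in for per-table generation keys
_fragments = {}     # hypothetical stand-in for memcached


def table_written(table):
    """Hypothetical hook: call whenever `table` is written to."""
    _generations[table] = _generations.get(table, 0) + 1


def invalidate_on_table(table):
    """Cache a function's result until `table` is next written to."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args):
            # The generation is part of the key, so a write leaves old
            # entries unreachable (a real cache would eventually evict them).
            key = (func.__name__, args, _generations.get(table, 0))
            if key not in _fragments:
                _fragments[key] = func(*args)
            return _fragments[key]
        return wrapper
    return decorator


renders = {"count": 0}   # tracks how often the feed is actually rebuilt


@invalidate_on_table("blog_post")
def latest_posts_feed():
    renders["count"] += 1
    return f"<rss>render #{renders['count']}</rss>"


latest_posts_feed()          # generated on first read
latest_posts_feed()          # cached, not re-rendered
table_written("blog_post")   # a write to the table invalidates the feed
print(latest_posts_feed())   # <rss>render #2</rss>
```

Pagination is where this gets awkward: every page of the feed depends on the same table, so a single new post throws away all of them at once.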

Towards answering the question that is the title of and reason for this post, I'd like to either build in or provide separately something that utilizes Johnny's hit/miss signals to give per-page and per-table statistics about cache hits and misses.

Every application has its own set of circumstances and requirements, and probably its own optimal caching strategy, but if you're a perfectionist with a deadline, Johnny might just get you a whole lot of bang for fairly little buck.

Mar 2 2010

jmoiron plays the blues