
Johnny Cache

Feb. 28th 2010 12:11:56

I've been waiting a long time to write about this. Johnny Cache is now released upon the world. It's a drop-in caching library/framework for Django that will cache all of your querysets forever in a consistent and safe manner. You can install it via pip install johnny-cache.

Conceptually, Johnny Cache started when I wrote the 'Queryset Caching' post last May. That was written after the ideas for how to implement such a cache had coalesced into a plan, but before an implementation had been created. A proof of concept was developed that summer and put into production on a fairly large site. The code that went into that version was probably not releasable; but a lot of work had gone into the code (and the testing suite), and I couldn't bear to start over on a clean implementation.

This January rolled along, and a few events converged that convinced me I needed to increase my open source footprint. Johnny had proven to be central to the scaling capability of the application where it was in use, and I felt that it would be a real benefit to the community to rewrite it. I created a repository for an MIT licensed project and threw up the easy part; a thread-local caching mechanism similar to the locmem backend, but cleared after every request.
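To make the "thread-local, cleared after every request" idea concrete, here is a minimal sketch. It is not Johnny's actual code: the class name `RequestLocalCache` and the `handle_request` helper are hypothetical stand-ins for a cache backend and a middleware hook.

```python
import threading


class RequestLocalCache(object):
    """Hypothetical per-thread cache, emptied at the end of each request."""

    def __init__(self):
        self._local = threading.local()

    def _store(self):
        # Each thread lazily gets its own private dict.
        if not hasattr(self._local, "data"):
            self._local.data = {}
        return self._local.data

    def get(self, key, default=None):
        return self._store().get(key, default)

    def set(self, key, value):
        self._store()[key] = value

    def clear(self):
        self._store().clear()


cache = RequestLocalCache()


def handle_request(work):
    """Stand-in for a middleware hook: run the view, then clear the cache."""
    try:
        return work()
    finally:
        cache.clear()
```

Because the storage hangs off `threading.local()`, values set while one thread serves a request are never visible to another thread, and the `finally` clause ensures nothing survives past the request that created it.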

After I gave Jeremy the URL to the hg repository, he banged out a nice framework for how the queryset caching mechanism would work in 1.2 (the patch point had changed), and declared it "pretty much done." It was fantastic; ~100 lines or so of clean, concise python code. And of course, it didn't work at all. Nearly 80 commits, 2000 lines of tests, fixtures, documentation, edge case handling, 1.1 support, and actual implementation later (his version handled generation keys but completely left out queryset keys based on the generations), we had something we were confident would finally work as advertised. Johnny's documentation explains what the project is, and what it does, but I want to reflect a bit more on the process of its development.
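For readers who haven't met the terms, a rough sketch of how generation keys and queryset keys can fit together follows. This is an illustration of the general technique under my own assumptions, not the code in Johnny: the dict `_cache` stands in for memcached, and every function name here is made up.

```python
import hashlib
import itertools

_cache = {}  # stands in for memcached in this sketch
_counter = itertools.count(1)


def table_generation(table):
    """Return the current generation for a table, creating it if missing."""
    key = "gen:" + table
    if key not in _cache:
        _cache[key] = next(_counter)
    return _cache[key]


def invalidate(table):
    """Call on any write to the table: bump its generation."""
    _cache["gen:" + table] = next(_counter)


def query_key(sql, params, tables):
    """Build a queryset key from the SQL plus the generations of its tables."""
    gens = ",".join("%s=%s" % (t, table_generation(t)) for t in sorted(tables))
    raw = "%s|%r|%s" % (gens, params, sql)
    return "q:" + hashlib.md5(raw.encode("utf-8")).hexdigest()


def cached_query(sql, params, tables, run):
    """Return cached rows if present, otherwise run the query and store it."""
    key = query_key(sql, params, tables)
    if key not in _cache:
        _cache[key] = run(sql, params)
    return _cache[key]
```

The point of the design is that old results are never deleted explicitly. A write bumps the table's generation, every later read computes a different key, and the stale entries are simply never asked for again.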

This is the first project with more than one developer that I've really worked on using a distributed VCS. Both of us are really used to centralized repositories, and I don't think we quite embraced the "every change is a branch" philosophy. Other than that, working in a distributed manner wasn't really that helpful, because most of our machines don't have unfettered access to the internet, so it can be difficult to share revisions between us without hitting the central repository. This is, perhaps, one of the true draws of sites like bitbucket and github. I've found that DVCS is great when developers are working in isolation (like forking beaker to write a new auth backend), but I prefer a centralized development repository when working on the same issues. Mercurial does either quite well.

This is also the first time I've used the Sphinx documentation system that python.org et al use. I've used tools like JavaDoc and Doxygen in the past, and frankly the documentation they produce is almost always worthless, even for a library. They don't highlight the important pieces properly, they don't provide room for exposition, etc. I'd shied away from Sphinx in the past for a few reasons: the startup cost seemed a bit steep (it really isn't), and I was confused about the steps between writing the documentation and getting the results you desire. Finally, I didn't really want to write lots of extra documentation that would live outside my code.

My feelings on writing "extra" documentation have substantially changed, however, and Sphinx offers a best-of-both-worlds hybrid, with commands that will automatically pull documentation strings or automatically document modules, functions, classes, etc, but more or less leave the entire form and function of the documentation up to the actual ReST documents themselves. When I need to explain something, I explain it. When I want to include docstrings, or highlight a specific function as a method of doing some higher level action within the context of my app/library, I can do that with ease. It was a delight.

This was also the first project on which I've used a significant amount of TDD. The implementation details for Django 1.1 and Django 1.2 were radically different (1.2 had to support multiple databases, itself a major change), but in the end we wanted the software to operate in more or less the same transparent manner. Having a thorough testing suite caught tons of subtle bugs in behavior (including lots of regressions) that we might never have found otherwise. There were a few database-specific behavioral bugs that would have passed under one db but not another, which could very well have gone unnoticed were it not so easy to set up different environments and test.

I still feel like TDD works much better when it's easy to define correctness, but perhaps it's the case that any software development is much better when that's the case. The test suite for johnny-cache is extensive, and at least a third of the tests were written before the code that passes them. A culture of healthy fear grew up around the tests; if there were acknowledged holes in the testing suite, the code that supposedly covered those gaps was presumed to be buggy, and this defensive posture helped us find a few bugs.

Finally, the difference between "hacking" and "shipping" is pretty apparent when you do the legwork to get proper documentation written, set up the right distribution channels, and want that code to be used by people and to be, in some way, a representation of your ideas and abilities. There are still lots of options for future development on Johnny; we've only got the basic operation of the cache up, but there are tons of things you might want to do once you start to understand how your app is utilizing Johnny.

comments

Matthew Schinckel 19:34 Feb. 28th

Johnny-cache looks awesome, and makes my django project fly: without any changes, the number of queries per request for my most frequently loaded page drops to one: and that is for sessions.

However, there seems to be an issue with DELETE queries to the database. This happens even with the django admin interface.

It is an issue with line 323 in cache.py: I fixed it by adding a check so that SQL statements starting with DELETE, as well as UPDATE, are not cached.

Matthew Schinckel 20:21 Feb. 28th

I've patched and forked onto bitbucket:

http://bitbucket.org/schinckel/johnny-cache/

Jeremy 21:09 Feb. 28th

Thanks Matthew, I merged your changes and added a test for that...looks good. :) Thanks, will let Jason know too.

Matthew Schinckel 22:07 Feb. 28th

I've also fixed the issue you mention in a TODO with python 2.5, and keyword arguments not being able to be used after *args.

Instead of method(*args, db=db)

Use method(*args, **{'db': db})

Both of these changes are in the fork I mentioned in the previous comment.
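A small self-contained illustration of that workaround; the function `method` and the values passed to it are made up for the example:

```python
def method(*args, **kwargs):
    """Toy function that just reports what it was called with."""
    return args, kwargs


args = (1, 2)
db = "default"

# The Python 2.5 parser rejects a keyword argument after *args:
#     method(*args, db=db)
# Unpacking a dict passes the same keyword and parses on 2.5 as well:
result = method(*args, **{'db': db})
```

On Python 2.6 and later both spellings are accepted and behave identically, so the dict form costs nothing but a little readability.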

Jason 23:07 Feb. 28th

Thanks for the patch Matthew!

I can't believe we missed something so basic o_O; We spent a few weeks working on the tests just making sure we crossed all our t's and dotted all of our i's.

A new version should be up on pypi shortly with the fixes.

Andy 18:47 March 1st

Looks like cache will be invalidated as soon as any row of a table is updated.

For frequently updated tables that would basically make the cache not usable. For those cases, it'd be better to cache on the object-level and only invalidate the cache when a specific object is updated.

Any plan to extend Johnny Cache to cover that use case?

Manuel Saelices 20:22 March 1st

Per-object caching was implemented by Mike Malone, and it covers the objects.get(...) method (only when you filter by "id" or "pk").

See this code:

http://github.com/mmalone/django-caching/blob/master/app/managers.py

You can integrate this kind of individual caching easily like we've integrated in our CachingManager here:

https://tracpub.yaco.es/cmsutils/browser/trunk/cache.py

I hope you like these snippets.

Sean 21:03 March 1st

Wonderful job! Thanks so much for sharing with us.

jmoiron 21:48 March 1st

@andy:

First off, Manuel is right. If you want per-object cache, mmalone's excellent per-object cache is what you want. I've mixed it with johnny on a past project without any issues.

Per-object caching for querysets presents a lot of problems for invalidation. How do you handle inserts or updates that happen to objects that are not part of your result set but would become part of it after the write? If you want to maintain 100% consistency, you will eventually have to re-implement the whole query mechanics of your database; this is very complex and defeats the purpose; your database is faster at answering "What is in this query?" than a python/memcached hack would be.

Now, it's highly possible that the volume of writes you receive would make Johnny a bad option for your site. The reality is no caching strategy works for everyone. I plan on adding a 'table/model blacklist' to Johnny in the future that would safely avoid writing to the cache for certain tables. If that isn't enough, and you can live with a degree of 'stale-ness' to your caching, you might want to look at django-cache-machine, which does something very close to what you suggest.

T 15:06 March 2nd

I'm looking forward to using this, but I'm having trouble installing. I need to install memcached from http://memcached.org/ right?

And then do I install johnny in site-packages as normal, or should it live as another app where my main django apps live? thanks, --T

Jason Moiron 16:19 March 2nd

@T:

Install memcached from memcached.org or, if your OS has packages, use those.

You can just pip install johnny-cache (or setup.py install if pulling from hg) into site-packages as normal, and follow the configuration advice in the documentation:

http://packages.python.org/johnny-cache/#usage

Make sure that you use johnny's memcached backend and your host/port pairs are right; see django's cache documentation for more info:

http://docs.djangoproject.com/en/dev/topics/cache/#memcached

Andy 03:16 March 3rd

@Manuel Thanks. I didn't know about mmalone's per-object cache.

@jmoiron Actually I was just referring to straight per-object cache rather than per-object cache for QuerySet. Sorry if I didn't make myself clear. django-cache-machine looks interesting. How do they handle the invalidation issue you brought up in your post? Both johnny-cache and django-cache-machine cache QuerySet. How would you compare them?

Jason Moiron 12:13 March 3rd

@andy

cache-machine is pretty cool; it's a different strategy but I think it's one that will work in many scenarios. It handles on-delete invalidation of querysets, but it basically doesn't do insert/update forced invalidation; it depends on a cache expiry period for eventual consistency. cache-machine is way less invasive (you use a mixin on your model class, and a custom model manager superclass), so you could throw it on models that can handle that type of consistency and keep it off others that can't. This is quite a bit deeper a design consideration, I think, compared to Johnny's, since Johnny's won't change the functionality of your application.

Some types of apps, and some tables, are going to get a ton of distance out of caching infinitely. Some things (like a shopping cart, or a CMS) can't deal with eventual consistency. Other things are way more tolerant and would get more mileage out of cache-machine; it's a set of considerations that every developer has to weigh for every project's necessities.

There's no reason at all you couldn't mix johnny and cache-machine, or other techniques. Johnny does a good job of handling a lot of the lower level legwork of fetching things from the database, but won't save you on other potentially intensive operations like hitting web services, rendering large templates, etc. Its caching machinery happens pretty deep in the ORM; the only negative to combining them is the potential for considerable duplication.

Anonymous 10:32 March 4th

The latest release of Django 1.2 beta 1 does not work with Johnny Cache because the method empty_iter() has been moved from query.py to compiler.py in the django.db.models.sql module. So, line 192 of johnny/cache.py should not be return query.empty_iter(), rather, it should use empty_iter from compiler.py. This can be imported at the top of the method, just after importing query.

Jason Moiron 13:35 March 4th

@Anonymous

I've verified this and made the fix. None of our tests go down that code path; do you have an example of how to access it?

jav 05:14 March 6th

Breaks django-registration (current stable 0.7 for django 1.1) for me, login+registration does not work anymore. Didn't troubleshoot further.

jav 07:40 March 6th

Ignore the above. Seems while developing I dropped the database, syncdb'd, and received stale information from memcached causing a big fuss.

Anonymous 12:00 March 6th

@Jason Moiron

Sure, you can reach that branch of your code by running the following query (assuming you have a 'Model' which contains a 'field'):

Model.objects.filter(field__in = [])

Jason Moiron 16:11 March 6th

@Anonymous

Thanks! Added a test that goes down that branch. Works just fine with the empty_iter in compiler.py.

Benjamin Wohlwend 04:41 March 7th

Very nice project. It reduced the number of queries per page for a CMS site from ~20 to zero.

What version of Python are you targeting? I'm still on 2.4 (thanks to RHEL 5), which doesn't like the "x if y else z" syntax. I patched that and now it works perfectly. Patch is at http://pastebin.org/103116
