Spent a few hours today trying to figure out how to cache QuerySets in Django in some kind of transparent manner. We are working on a multi-tiered caching system, sort of like the one the Pownce guys released a few weeks ago, except with a bit more there there.

To recap the issues with caching sets of objects, the main problem is invalidation. If you have a QuerySet that has items A, B, and C, and then you update C, you have to somehow determine which cache keys have C in them so you can expire them. You could try to keep a list of those keys somewhere, but that brings synchronization and concurrency issues. If you keep the list in the cache itself, it could expire or be evicted, and then you can no longer find the entries you need to invalidate, leaving you with a cache full of stale data.
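To make the bookkeeping concrete, here is a minimal sketch of the "keep a list" idea. Everything in it is made up for illustration: the dict standing in for the cache, the `keys_by_object` index, and both function names. It ignores the concurrency problems entirely, which is exactly what makes the real thing hard.

```python
from collections import defaultdict

# Stand-ins for a real cache backend and a reverse index (both hypothetical).
cache = {}
keys_by_object = defaultdict(set)  # object id -> cache keys whose value contains it


def cache_queryset(key, objects):
    """Store a result set and remember which objects it contains."""
    cache[key] = list(objects)
    for obj in objects:
        keys_by_object[obj].add(key)


def invalidate_object(obj):
    """Expire every cached result set that contains the updated object."""
    for key in keys_by_object.pop(obj, set()):
        cache.pop(key, None)  # the key may already be gone; that is fine


cache_queryset("q1", ["A", "B", "C"])
cache_queryset("q2", ["C", "D"])
cache_queryset("q3", ["A"])
invalidate_object("C")  # drops q1 and q2, leaves q3 alone
```

If `keys_by_object` itself lived in the cache and got evicted, `invalidate_object` would quietly do nothing and q1 and q2 would stay around with stale data.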

One method to avoid this, described by Tobias Lutke, is to add versioning information to your cache tags. UUID versions are nice because they get rid of possible race conditions with serials (there's no atomic increment operation in your cache). The idea would be that if you had QuerySet [A, B, C] from table foo, you'd apply a cache tag like foo-queryset-(hash)-(UUID), where the hash is a hash of that QuerySet. When you get a write against the foo table, change the UUID and let the old QuerySets expire.
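A rough sketch of that scheme, as I understand it. The dict-backed cache, the function names, and hashing the query text (rather than the QuerySet itself) are my assumptions here, not anything from the original description.

```python
import hashlib
import uuid

# Stand-in for a real cache backend; a plain dict for illustration.
cache = {}


def table_version(table):
    """Return the current version token for a table, creating one if missing."""
    key = f"{table}-version"
    if key not in cache:
        cache[key] = uuid.uuid4().hex
    return cache[key]


def queryset_key(table, query_text):
    """Build a key shaped like foo-queryset-(hash)-(UUID)."""
    digest = hashlib.md5(query_text.encode("utf-8")).hexdigest()
    return f"{table}-queryset-{digest}-{table_version(table)}"


def get_or_compute(table, query_text, compute):
    """Return the cached result for this query, running compute() on a miss."""
    key = queryset_key(table, query_text)
    if key not in cache:
        cache[key] = compute()
    return cache[key]


def invalidate_table(table):
    """On any write to the table, swap the version so old keys are never read again."""
    cache[f"{table}-version"] = uuid.uuid4().hex
```

Nothing is ever deleted on a write. The old entries are simply orphaned and have to age out of the cache on their own, which in this dict-based sketch never happens; a real backend would need timeouts or eviction for that.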

For a QuerySet cache, this is undesirable for several reasons. It is simple, but invalidation takes time, MySQL's query cache already does something very similar, and you'd be discarding every cached QuerySet on that table each time a write hits it.

What's worse, in the Django world, is that if you cache other queries that contain items related to that query, you could end up with stale data in those too. It seems that caching Django QuerySets in a way that avoids stale data is very difficult, and that it is much easier to put a front-end (or even app-level) cache of full pages or larger objects in front of them.

Caching objects larger than a QuerySet has a few benefits. It carries the obvious drawback of stale data, but at least that staleness is fairly simply defined: it is bounded by some cache timeout, which needn't be large to save lots of work, since the object is large and might represent many queries. I can't really wrap my head around what is going to be best for my app, but I have a feeling that the answer for anyone is going to be "it depends." Since we are allergic to gathering statistics at work, we might not know what the best strategy is for some time.
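For comparison, the timeout approach for larger objects is about as simple as caching gets. This is only a sketch: `cached_page`, the dict-backed store, and the injectable `now` clock are all hypothetical names of mine, not part of any framework.

```python
import time

# Stand-in for a real cache backend: key -> (expiry time, value).
_page_cache = {}


def cached_page(key, render, timeout=60, now=time.monotonic):
    """Return the cached value for key, re-rendering once it is older than timeout seconds."""
    current = now()
    entry = _page_cache.get(key)
    if entry is not None and entry[0] > current:
        return entry[1]  # still fresh enough; may be stale by up to `timeout` seconds
    value = render()  # the expensive part, possibly many queries
    _page_cache[key] = (current + timeout, value)
    return value
```

There is no invalidation logic at all here. The only promise is that nothing served is more than `timeout` seconds old, and whether that is acceptable is the "it depends" part.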

May 19 2009