
jmoiron.net

On WSGI, CouchDB

posted January 30th, 2008 @ 01:25:41

- tags: development, python

pythons on a couch

I've been thinking about WSGI and CouchDB recently, while on the subject of digital inflexibility. First, I want to clarify a few things about what I mean by flexibility with respect to an application, and how the current crop of frameworks approaches this problem. If you want to follow this musing well, I highly suggest reading "What PHP Deployment Gets Right" by Ian Bicking; or just his entire blog, and most of the crosstalk on the web about REST, Web Services, and the evolution of the WWW.

How do modern frameworks (Rails, Django, TurboGears, or other "canonical" ones) deal with the problem of flexibility? They don't, for the most part. For one, flexibility is hard both programmatically and conceptually. For another, they replace flexibility with simplicity, which is almost always a tradeoff that results in quality. They achieve this simplicity by strictly dividing tasks and then conquering each task by building up a structure around how one is supposed to go about solving that task.

These are not negative qualities at all; one of the things that Rails has gotten right is that design ought to be opinionated. So your task when you go to develop in your now "classic" framework is to set up your REST API (your "Routes" or "urls.py") and your controllers, design your models, and set up your views, and you've got the whole MVC ready to go. The problem is rigidity, repetition, and BigDesignUpFront.

The solution is flexibility. Joel defends BigDesignUpFront, and when you are working with a team on some critical make-or-break software for your company, BDUF might be well worth it for its benefits in fleshing out potential problems, helping with schedule and cost estimation, etc. But for prototyping, exploring technology, or exploring a problem space, BDUF is deadly. For "agile" or TDD, popular buzzwords that are worth far less than their hype but still provide useful insight for all developers, this is potentially damaging. Coupled with rigidity (in the form of SQL) and repetition (even in DRY-espousing frameworks like Django) this is tough to overcome, especially when looking at migrating lots of data to a new application framework.

The first way to overcome inflexibility I want to talk about is WSGI, whose design is inspired in part (or so I understand it) by Java Servlets. At its core, WSGI is a specification for how web servers and Python applications communicate; but more interesting (and far more necessary in the statically typed world of Java), it also defines specifically how Python applications are called by the web server. This means that Python applications, given that they abide by the spec, are free to call other WSGI applications themselves with impunity and expect them to work.

The way it's implemented, you need only define the __call__ method to receive two arguments (the request environment and a start_response callable) and return an iterable in order to qualify as a WSGI application. These are incredibly weak requirements on applications, and make many middlewares truly plug and play. What's more, the effort was originally to define a standard that the existing plethora of Python frameworks could all use so that their component pieces would be interoperable with each other. WSGI is still pretty new, and opinionated frameworks like Django are probably not eager to ditch their middleware integration layers for pure WSGI interfaces anytime soon (although Django does work with WSGI, I think that's more of an interface between a web server and a Django application taken as a whole), but the proposition of using, say, Django's caching middleware for any Python web application written to conform to WSGI is really exciting.
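To make those weak requirements concrete, here is a minimal sketch, assuming Python 3 (where response bodies are byte strings). `HelloApp`, `UppercaseMiddleware`, and `call_wsgi` are names I made up for illustration; only `wsgiref.util.setup_testing_defaults` comes from the standard library.

```python
from wsgiref.util import setup_testing_defaults


class HelloApp:
    """A WSGI application: a callable taking (environ, start_response)."""

    def __call__(self, environ, start_response):
        body = b"hello from " + environ.get("PATH_INFO", "/").encode("utf-8")
        start_response("200 OK", [("Content-Type", "text/plain"),
                                  ("Content-Length", str(len(body)))])
        return [body]  # any iterable of byte strings will do


class UppercaseMiddleware:
    """Hypothetical middleware: wraps any WSGI app and upper-cases its body."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        # The middleware is itself a WSGI app that calls another WSGI app.
        return [chunk.upper() for chunk in self.app(environ, start_response)]


def call_wsgi(app, path="/"):
    """Tiny harness standing in for a real web server."""
    environ = {"PATH_INFO": path}
    setup_testing_defaults(environ)  # fills in the other required keys
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status

    body = b"".join(app(environ, start_response))
    return captured["status"], body
```

Calling `call_wsgi(UppercaseMiddleware(HelloApp()), "/blog")` exercises the whole stack without a server; the middleware never needs to know what it is wrapping.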

This gives you flexibility in designing your own "framework" built of hand-chosen component pieces. Pylons is essentially a framework built upon Python Paste that helps you choose these WSGI middleware components, but I've found some of the areas (particularly the URI routing) to be a little less flexible than I'd like (and, sadly, the documentation is a far cry from Django's). Accepting the dogma of one framework or another does come with a practical advantage; you avoid writing the necessary glue between components. But as the glue itself is agonized over, standardized and simplified, it becomes just another component.

It also gives you another interesting flexibility: the ability to attach applications written completely differently (even in different frameworks) to different URIs at the same site, all of them using the same middleware. This blog works as a Django application; why change it? But my Gallery might be better implemented using other technologies (and I discuss this below); with everyone on board using WSGI, it'd be trivial to attach a different application to handle the '/gallery/' URI space but keep both applications using the same caching, gzip, and authentication middleware. This idea is extremely powerful, because it allows one to select the proper tool for the job and align with whatever tool chain most closely reflects the problem at hand.
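A rough sketch of that idea, using only plain Python callables: `blog_app` and `gallery_app` stand in for two differently built applications, and `PrefixDispatcher`, `HeaderMiddleware`, and `request` are hypothetical helpers of mine, not part of any framework. A real dispatcher would also adjust `SCRIPT_NAME` and `PATH_INFO` for the mounted application.

```python
def blog_app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"blog"]


def gallery_app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"gallery"]


class PrefixDispatcher:
    """Route by URI prefix to independent WSGI apps (illustrative only)."""

    def __init__(self, default, mounts):
        self.default = default
        self.mounts = mounts  # e.g. {"/gallery/": gallery_app}

    def __call__(self, environ, start_response):
        path = environ.get("PATH_INFO", "/")
        for prefix, app in self.mounts.items():
            if path.startswith(prefix):
                return app(environ, start_response)
        return self.default(environ, start_response)


class HeaderMiddleware:
    """Shared middleware: adds one response header to whatever it wraps."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        def tagged_start_response(status, headers, exc_info=None):
            headers = list(headers) + [("X-Shared-Middleware", "yes")]
            return start_response(status, headers, exc_info)
        return self.app(environ, tagged_start_response)


def request(app, path):
    """Stand-in for a web server: returns (headers dict, body bytes)."""
    seen = {}

    def start_response(status, headers, exc_info=None):
        seen.update(headers)

    body = b"".join(app({"PATH_INFO": path}, start_response))
    return seen, body


# One site, two applications, one middleware stack.
site = HeaderMiddleware(PrefixDispatcher(blog_app, {"/gallery/": gallery_app}))
```

Both `request(site, "/gallery/2008/")` and `request(site, "/archive/")` pass through the same `HeaderMiddleware`, even though different applications answer them.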

What about flexibility at the genesis of the application? Web applications these days deal mostly with the storage and presentation of data. Certainly, the current crop of frameworks reinforces this idea; ditch Django's ORM or ActiveRecord and see what's left with respect to creation of a data-driven website. This is where CouchDB, or what I perceive CouchDB to offer, enters the equation.

As a metaphor, let's look at programming languages and type binding as a method of describing and manipulating data. In a statically typed programming language, the structure of data is described explicitly and is enforced by the compiler. You go about defining what a widget is, and then create instances of widget. Methods that would manipulate widgets must receive a widget as their input.

Where statically typed languages provide subtypes, supertypes, and other ways to make the definition of what qualifies as a widget more malleable, databases struggle at this. You describe data (in the form of tables, relations, etc.) up front, just as before, with each field being a strict type and each table describing some strictly typed record. To extend these definitions, you typically define new tables that add to the previously defined record types, and changing the type of existing data in place is painful at best.

If you want to act on all widgets, you must be cognizant of other widget-like tables. Even if your new widget is exactly like the old one, grouping both is either manual or inane and always slow. So how do you do migration? You dump the database, add or massage the types of the new table columns you will be adding, and then re-import. Some frameworks provide tools around this process, but the fact that it is necessary at all points to something fundamentally broken.

The document-oriented approach CouchDB takes is much more like having a large, flat, "duck typed" table where you can store anything. You define views of your large data soup that pick out items based on specific characteristics of those items, not on their structure. Want all "things" published on some day? It isn't a problem; everything is a thing. A quick stab at structure is to add a type field that allows you to filter out "things" that match a type string. These things are guaranteed, upon delivery, only to have matched that type string and nothing more. This is a weak guarantee, but weak guarantees buy us flexibility.

In Python, functions are often described as taking objects that allow certain actions on them; for instance, iterables. Requiring only that an object be iterable is a very weak requirement, far weaker than acting on "anything of type foo". In practice, many functions merely require that the objects they manipulate contain certain methods or attributes, not that they satisfy some larger unused type structure. This is a tradeoff, to be sure, but it's a tradeoff towards both simplicity and flexibility.
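A small illustration of that weak requirement, with made-up names (`total_length`, `Words`): the function below asks only that its argument be iterable and that each element support `len()`.

```python
def total_length(items):
    """Sum the lengths of whatever `items` yields; no type is demanded."""
    return sum(len(item) for item in items)


class Words:
    """Not a list, not a tuple; just something that can be iterated."""

    def __init__(self, text):
        self.text = text

    def __iter__(self):
        return iter(self.text.split())
```

A list, a generator, and a `Words` instance all work with `total_length`, because none of them is asked for more than iteration.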

As a concrete example of how this can be useful, let's take my ever-languishing gallery application. The goal is to keep in the database my images as well as their EXIF tags such that I could easily perform a search like "Find me all images with this aperture" or "Find me all images taken with this camera." Because I have images taken from at least 3 different cameras (not to mention pictures my friends or family take that I might want to include), and camera makers all add their own types of tags in the "MakerNote" section, I can't have a single per-image "tags" table.

As it stands now, my proto-SQL database has 3 simplistic tables to handle this: a gallery_image consisting of id, title, description, etc; a gallery_image_tag, which is supposed to represent a single EXIF tag consisting of an id, title, desc, etc; and a gallery_image_tags, which allows me to tie the two together so I can get "an image and all of its EXIF tags" in one query. This is straightforward (albeit painfully unoptimized) using the Django ORM, but it's a horribly rigid design that sees me making potentially dozens of database updates for each uploaded image.
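To show where the "dozens of updates" come from, here is a sketch of that three-table layout using the standard library's `sqlite3` instead of the Django ORM. The table names follow the ones above, but the exact columns are my guess: in particular the `value` column and the `image_id`/`tag_id` link columns are assumptions, and `store_image` and `images_with_tag` are hypothetical helpers.

```python
import sqlite3

SCHEMA = """
CREATE TABLE gallery_image (id INTEGER PRIMARY KEY, title TEXT, description TEXT);
CREATE TABLE gallery_image_tag (id INTEGER PRIMARY KEY, title TEXT, value TEXT);
CREATE TABLE gallery_image_tags (
    image_id INTEGER REFERENCES gallery_image(id),
    tag_id   INTEGER REFERENCES gallery_image_tag(id)
);
"""


def store_image(conn, title, exif):
    """Insert one image plus its EXIF tags; return the number of INSERTs."""
    cur = conn.cursor()
    cur.execute("INSERT INTO gallery_image (title, description) VALUES (?, ?)",
                (title, ""))
    image_id = cur.lastrowid
    writes = 1
    for name, value in exif.items():
        # One row for the tag itself, one row to link it to the image.
        cur.execute("INSERT INTO gallery_image_tag (title, value) VALUES (?, ?)",
                    (name, str(value)))
        tag_id = cur.lastrowid
        cur.execute("INSERT INTO gallery_image_tags (image_id, tag_id) VALUES (?, ?)",
                    (image_id, tag_id))
        writes += 2
    conn.commit()
    return writes


def images_with_tag(conn, name, value):
    """Titles of all images carrying the given EXIF tag name and value."""
    rows = conn.execute(
        """SELECT i.title FROM gallery_image AS i
           JOIN gallery_image_tags AS link ON link.image_id = i.id
           JOIN gallery_image_tag AS t ON t.id = link.tag_id
           WHERE t.title = ? AND t.value = ?
           ORDER BY i.id""", (name, value))
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
```

Each EXIF tag costs two inserts in this layout, so an image with a few dozen tags means a few dozen writes, and every search has to join all three tables.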

In CouchDB, I could simply designate my images as having a type field of "image", and then dump in the tags as key/value pairs. Creating views over this database for the searches described above is as trivial as creating a view that returns all images; while the 'all images' view would map documents based on their satisfaction of (type == "image"), the more complex views are just as simple (camera_model_name == ...).
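As a rough imitation in plain Python (CouchDB's own views are map functions stored in the database, not Python code like this), documents can be modeled as dicts and a view as a function that emits key/value rows. The sample documents, field names such as `aperture`, and the `run_view` helper are all invented for illustration.

```python
docs = [
    {"_id": "img1", "type": "image", "title": "dusk",
     "camera_model_name": "Camera A", "aperture": 2.8},
    {"_id": "img2", "type": "image", "title": "pier",
     "camera_model_name": "Camera B", "aperture": 8.0,
     "vendor_specific_tag": "only this camera writes it"},
    {"_id": "post1", "type": "post", "title": "On WSGI, CouchDB"},
]


def all_images(doc):
    """View: every document whose type field is "image"."""
    if doc.get("type") == "image":
        yield doc["_id"], doc


def by_camera(doc):
    """View: images keyed by camera model, if they carry that tag at all."""
    if doc.get("type") == "image" and "camera_model_name" in doc:
        yield doc["camera_model_name"], doc["_id"]


def run_view(map_fn, documents, key=None):
    """Apply a map function to every document, optionally filtering by key."""
    rows = [row for doc in documents for row in map_fn(doc)]
    if key is not None:
        rows = [row for row in rows if row[0] == key]
    return sorted(rows, key=lambda row: row[0])
```

`run_view(by_camera, docs, key="Camera B")` is no harder to write than `run_view(all_images, docs)`, and the extra vendor-specific tag on `img2` needed no schema change.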

Looking to the future, it is also far easier to modify the CouchDB database to allow for new features. Let's look at some potentially interesting features: an algorithm that gauges the color temperature of a photo to group "like" photographs together, such that you can view a "gallery" of dusk pictures, dark pictures, or black and white pictures algorithmically rather than by manual tags; or facial recognition that determines whether or not a picture is a portrait. I could run these algorithms on my database images in batch mode and then simply update each document with its temperature score or its boolean portrait status without ever explicitly modifying any structure. As the temperature scores or portrait statuses are tabulated, they are added to each document and the "gallery" views incorporate them automatically.
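Sketched in the same dict-as-document style (again plain Python, not CouchDB's API), the batch job might look like this. The `avg_rgb` field and the `warmth` scorer are crude stand-ins I invented; a real color temperature algorithm would be far more involved.

```python
docs = [
    {"_id": "img1", "type": "image", "avg_rgb": (200, 120, 60)},
    {"_id": "img2", "type": "image", "avg_rgb": (60, 110, 210)},
    {"_id": "post1", "type": "post"},
]


def warmth(doc):
    """Hypothetical scorer: red minus blue as a very rough warmth measure."""
    red, _green, blue = doc["avg_rgb"]
    return red - blue


def annotate(documents, field, scorer):
    """Batch job: add `field` to every image document; nothing to migrate."""
    updated = 0
    for doc in documents:
        if doc.get("type") == "image":
            doc[field] = scorer(doc)
            updated += 1
    return updated


def gallery(documents, field, predicate):
    """View: ids of images that carry `field` and satisfy `predicate`."""
    return [doc["_id"] for doc in documents
            if doc.get("type") == "image"
            and field in doc
            and predicate(doc[field])]
```

Before the batch run, `gallery(docs, "temperature", lambda t: t > 0)` is simply empty; after `annotate(docs, "temperature", warmth)` the same view starts returning the warm images, with no table altered along the way.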

Software developers have this kind of wish-list view of the future, where writing a web gallery can quickly turn into pushing the forefront of computer vision technology or spawn a Perl-to-Python compilation project. Sometimes these whims manifest themselves as something very interesting or inspiring, and wherever they aren't too critical they should be possible!
