Python serialization

Python ships with lots of built-in serialization options. It has had a standard module called pickle for a very long time, and it exposes the interpreter's own marshaling routines, which are only meant for trusted data, in the marshal module. Since the release of Python 2.6 in October 2008, it has also had a standard, well-tested json library.
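
As a quick illustration of those three standard-library options, here is a minimal sketch that round-trips one small dictionary through json, pickle, and marshal. It assumes a current Python 3 interpreter (the benchmark below ran on 2.7, where cPickle was the fast pickle implementation), and the `record` payload is made up for the example.

```python
import json
import marshal
import pickle

# Hypothetical payload; any mix of dicts, lists, strings, and numbers works.
record = {"title": "example feed", "entries": [1, 2, 3], "ok": True}

# json produces text; pickle and marshal produce bytes.
as_json = json.dumps(record)
as_pickle = pickle.dumps(record, protocol=pickle.HIGHEST_PROTOCOL)
as_marshal = marshal.dumps(record)

# All three decode back to an equal dictionary.
from_json = json.loads(as_json)
from_pickle = pickle.loads(as_pickle)
from_marshal = marshal.loads(as_marshal)
```

Both pickle and marshal should only ever be fed data you trust; of the three, json is the only one that is reasonable to use on input from the outside world.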

json's simple conceptual mapping to hash tables, smaller overhead, faster parsing, and ease of use in dynamic languages have seen it take over from XML on the web in recent years, so lots of work has been done trying to make it fast. Simplejson, which was incorporated into the stdlib as the standard json module, has seen further development aimed at keeping up with standards and adding compiled speedups. cjson, which took a lot of heat a few years ago for being relatively lax about accepting broken input, has been the benchmark for raw speed, but its bad reputation, perceived flakiness, and zombied development status make it a tough sell for production. Finally, ujson, a project from esn.me, is under active development and shows benchmarks comparable to cjson.

At work, we ingest, process, store, load, and display tons of feeds, and serialization is a major consumer of CPU across our architecture. We recently wrote a faster replacement for feedparser (which was by far our biggest CPU hog) called speedparser and released it as open source. I decided to take the test corpus I used when developing speedparser and run a quick, somewhat unscientific benchmark of the various serialization options mentioned above. In addition to those, I've also included msgpack, which has a python module written in C with a pickle-style api and is more or less compatible with json's data types, and bson, the binary-encoded json format used by mongodb.

The Setup

The data is the feedparser output from speedparser's test corpus, 4096 json documents of varying sizes (including some empty documents), the likes of which might be encountered by any general serialization job. I used the built-in python json module to decode each file, and then created files containing data encoded by marshal, pickle (via cPickle), and msgpack as well.

The total sizes in bytes of the various formats were different. Although this has obvious performance implications, I am more concerned with how fast the various formats can serialize and de-serialize the same data, not necessarily the same number of bytes.

The python version is 2.7.2 built with GCC 4.6.1 on a 64bit linux machine. The versions of the non-built-in packages are:

msgpack 0.1.10
ujson 1.9
cjson 1.0.5
simplejson 2.2.1
bson from pymongo 2.0.1 (with _cbson ext)

The test was run multiple times with very little variance. The results given are not a mean, but a sample of one of the runs.

Data size (in kilobytes):

json	msgpack	marshal	pickle	bson
350168 (342M)	309772 (303M)	330660 (323M)	369060 (361M)	324452 (317M)

I was actually quite surprised pickle was the largest, though it is a format that can represent considerably richer objects than json can.
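
To get a feel for how the same data ends up at different sizes, here is a small sketch that encodes one object with the stdlib formats and compares byte counts. The `document` below is a made-up stand-in for a parsed feed, and it assumes Python 3 with the highest pickle protocol, so the ordering it prints will not necessarily match the table above.

```python
import json
import marshal
import pickle

# Hypothetical stand-in for one parsed feed document.
document = {
    "feed": {"title": "example", "link": "http://example.com/"},
    "entries": [
        {"id": i, "title": "entry %d" % i, "tags": ["a", "b"]}
        for i in range(100)
    ],
}

# Each encoder returns bytes so the lengths are directly comparable.
encoders = {
    "json": lambda obj: json.dumps(obj).encode("utf-8"),
    "pickle": lambda obj: pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL),
    "marshal": marshal.dumps,
}

sizes = {name: len(encode(document)) for name, encode in encoders.items()}

# Print smallest first.
for name, size in sorted(sizes.items(), key=lambda item: item[1]):
    print("%-8s %6d bytes" % (name, size))
```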

I ran a simple test script which pre-loads every file for each type, does a decode pass, then pre-loads the decoded data from each file and does an encode pass. Encoding and decoding were measured separately so that any differences between the various serialization formats would be more visible. As you would imagine from the data set size, this takes a decent amount of memory.
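
The original script isn't reproduced here, but the shape of that test can be sketched roughly as follows: build the encoded inputs up front, time a decode pass, then time an encode pass over the decoded objects. This is only an approximation under my own assumptions. It uses a tiny in-memory `corpus` instead of files, sticks to the stdlib codecs, and the `timed` helper and `codecs` table are names invented for the example.

```python
import json
import marshal
import pickle
import time

# Hypothetical corpus; the real test used thousands of parsed feeds on disk.
corpus = [
    {"id": i, "title": "entry %d" % i, "tags": ["x", "y"], "score": i * 0.5}
    for i in range(1000)
]

# name -> (encode function, decode function)
codecs = {
    "json": (json.dumps, json.loads),
    "pickle": (pickle.dumps, pickle.loads),
    "marshal": (marshal.dumps, marshal.loads),
}

def timed(func, items):
    """Apply func to every item; return (elapsed seconds, list of results)."""
    start = time.perf_counter()
    results = [func(item) for item in items]
    return time.perf_counter() - start, results

results = {}
for name, (encode, decode) in codecs.items():
    # Pre-build the encoded inputs so the decode pass times decoding only.
    encoded = [encode(doc) for doc in corpus]
    decode_time, decoded = timed(decode, encoded)
    # The encode pass then runs over the already-decoded objects.
    encode_time, _ = timed(encode, decoded)
    results[name] = (decode_time, encode_time)
    print("%-8s decode %.4fs  encode %.4fs  total %.4fs"
          % (name, decode_time, encode_time, decode_time + encode_time))
```

Swapping in third-party libraries should only be a matter of adding their encode and decode callables to the `codecs` table.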

Results:

library	decoding	encoding	total
cjson	2.97s	3.57s	6.54s
json	6.83s	3.33s	10.16s
simplejson	4.25s	2.22s	6.47s
ujson	2.75s	1.63s	4.38s
cPickle	3.86s	4.81s	8.67s
marshal	2.76s	1.30s	4.06s
msgpack	0.72s	2.75s	3.47s
bson	2.58s	2.18s	4.76s

Some conclusions from this. The fastest decoder by some margin is msgpack, but since its data set was the smallest and it is designed for fast decoding and fast wire transport, this wasn't too surprising. The fastest encoder by less of a margin is marshal, followed by ujson, which performed very well in both encoding and decoding. bson also has a very strong all-around display, decoding and encoding documents quite quickly, although the interface is a bit awkward. cjson decodes very quickly even after 4 years of neglect, but under-performs during encoding, being beaten even by the builtin json library.

Pickle, the go-to for python object serialization, doesn't come out looking that good here, but if you want to serialize data with more complex types, there isn't really another option. Adding processors onto any of these other formatters for serialization and deserialization is going to slow them down significantly. Marshal was a curiosity; its documentation should scare you away from it, but if you need internal-only (trusted-data) serialization in python, speed is a major concern, and you want to avoid leaving the stdlib, it's the fastest built-in for both encoding and decoding.
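
To make the "adding processors" point concrete, here is a minimal sketch of what it takes to get a datetime through the stdlib json module compared to pickle. The `"__datetime__"` tag and the two hook functions are a convention made up for this example, not something any of the libraries define.

```python
import json
import pickle
from datetime import datetime

event = {"name": "fetch", "at": datetime(2011, 11, 2, 12, 30)}

# pickle round-trips the datetime with no extra work.
assert pickle.loads(pickle.dumps(event)) == event

def encode_extra(obj):
    """Called by json.dumps for any object it cannot serialize itself."""
    if isinstance(obj, datetime):
        # Hypothetical tagging convention for this example only.
        return {"__datetime__": obj.isoformat()}
    raise TypeError("cannot serialize %r" % (obj,))

def decode_extra(mapping):
    """Called by json.loads for every decoded JSON object."""
    if "__datetime__" in mapping:
        return datetime.fromisoformat(mapping["__datetime__"])
    return mapping

text = json.dumps(event, default=encode_extra)
restored = json.loads(text, object_hook=decode_extra)
```

Both hooks are Python-level callbacks, and `decode_extra` runs for every object in the document, which is where the extra cost comes from.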

On this evidence alone, I'd say that if you need speed and require json, ujson looks like it's the king, provided that you are comfortable with its test suite and maturity. It also provides pretty stable performance on both the encode and decode side of things, and json is a pretty stable format that is human readable and has plenty of tooling support. If decoding is particularly important for your application (reading blobs from a db or cache, perhaps), msgpack is worth taking a look at. The msgpack format is a binary one, which is a drawback for human readability, but it will also take up less memory (since it serializes about 11% smaller than json), transfer over the wire faster, and decode more than 3x as quickly as any of the json libraries tested here.

The standard caveats for these types of benchmarks apply. What is shown is a simple breakdown of the encoding and decoding performance characteristics of each library. Important things that should factor into your serialization decisions are ignored, like the robustness of each format to errors, or the behavior of the decoders under special data conditions or malformed input.

Note: Since the time of this writing, I've discovered that the memory performance of simplejson can be dramatically better than the other json libraries during decoding, thanks to patches from Antoine Pitrou in 2010 which make it reuse string objects when keys are repeated within a json document. For many real-life situations, where json can decode a list of embedded objects which share structure, this can be massive: for a few-megabyte json feed from flickr, ujson et al used 110MB to decode, whereas simplejson used half that.
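
The idea behind that key reuse can be sketched in pure Python with the stdlib json module. This is not how simplejson implements it (that work happens inside its decoder); it is only an illustration of the technique, and `key_memo` and `share_keys` are names invented for the example.

```python
import json

# Hypothetical feed: many objects that all share the same three keys.
text = json.dumps(
    [{"id": i, "title": "t%d" % i, "link": "l%d" % i} for i in range(1000)]
)

# Maps each key string to the first string object seen with that value.
key_memo = {}

def share_keys(pairs):
    """Build a dict whose keys are reused from key_memo when already seen."""
    return {key_memo.setdefault(key, key): value for key, value in pairs}

# object_pairs_hook receives each JSON object as a list of (key, value) pairs.
items = json.loads(text, object_pairs_hook=share_keys)

first_keys = list(items[0])
last_keys = list(items[-1])
```

With the memo in place, every decoded object refers to the same three key strings instead of carrying its own copies, which is where the memory saving on repetitive documents comes from.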

Nov 2 2011

jmoiron plays the blues