Gevent and Requests are both absolutely fantastic libraries, for different reasons: gevent is technically brilliant, and requests is brilliant from a design perspective. Despite this, the combination of the two sent me all the way down the rabbit hole and into my own little hell. Before I go on, let me distill the lesson of this episode into a convenient little quip:

If you are optimizing something, start by profiling it.

For about 3 of the past 4 weeks, I've been working on a pretty major project to replace a large Twisted application we have with a small gevent application, and it's been a huge success. The resultant code is fully usable in synchronous contexts (like a REPL or most unit testing frameworks), was way faster in testing, and, since it could leverage more libraries for its work, came in at about 1/5th the size.

It's been a huge success, that is, until Thursday, when I went to put it into production.

This application handles ~250 jobs per second across a number of different machines. Each Twisted-based worker process seems to have a capacity for around 30 jobs per second, and operates at a load of ~20 at around 60% CPU utilization. The new gevent-based workers could only handle about 5-10 per second each, and pegged the CPU at full tilt boogie.

The logging I had enabled showed that we were resource starved; the new app was unable to pull jobs off of the queue at a fast enough rate. This sent me on a long chase, separating the job-fetch portion out into its own network daemon, before I finally found the root of that issue and pulled the fetching back in process: I was using channel.get, which issued around 4000 gets per 100 messages received, rather than channel.consume, which pre-fetches and made this step almost instant.
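
Roughly, the difference looks like the sketch below; I'm using pika here purely for illustration (it isn't necessarily the client in question, and the queue name is made up):

import pika

# Sketch only: pika and the "jobs" queue are stand-ins for illustration.
conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = conn.channel()

# Polling with basic_get: one synchronous round-trip to the broker per call,
# and most of those calls come back empty, which is how you end up issuing
# ~4000 gets for every 100 messages actually received.
method, header, body = channel.basic_get(queue="jobs", auto_ack=True)
if method is None:
    pass  # queue was empty; we'd have to poll again

# Consuming with a pre-fetch window: the broker pushes messages ahead of
# time, so the worker almost never waits on the wire for its next job.
def handle(ch, method, properties, body):
    pass  # hand the job off for processing

channel.basic_qos(prefetch_count=100)
channel.basic_consume(queue="jobs", on_message_callback=handle, auto_ack=True)
channel.start_consuming()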

With the resource starvation problem nicked, I prepared to run my triumphant production test. 5-10 jobs/sec.

I added more logging, this time around every requests.get and around all potentially time-intensive input processing (json via ujson, xml via lxml), and noticed that the worker stopped during the non-cooperative, CPU-bound steps. This was, of course, entirely expected; but the durations of these stops were 100x longer than anything I'd experienced outside of this context.

Now convinced that gevent had a major fault running code that calls into C, I attempted to produce a synthetic example that would confirm my findings in a way I could easily share and debug. I wrote a little program which read 50 xml feeds and ran speedparser on all of them; first serially, in a normal loop, and then "asynchronously" via a gevent pool. I ran my test, expecting this gevent bug to show itself, but the timings were nearly identical. "Maybe," I thought, "this only happens with network traffic."
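
For the curious, the synthetic test was roughly this shape (a sketch: the file paths and pool size are invented, and I'm assuming speedparser's feedparser-style parse() entry point):

import time
import speedparser
from gevent.pool import Pool

# 50 locally-saved feeds; the paths are made up for the sketch.
feeds = [open("feeds/%02d.xml" % i).read() for i in range(50)]

start = time.time()
for feed in feeds:
    speedparser.parse(feed)             # plain, serial loop
print("serial: %.2fs" % (time.time() - start))

start = time.time()
Pool(20).map(speedparser.parse, feeds)  # "asynchronously" via a gevent pool
print("pooled: %.2fs" % (time.time() - start))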

So I wrote a gevent-based wsgi application which served these feeds from memory, loaded the feeds off of that instead, and the timings still ended up nearly identical. Exasperated, and with nowhere else to turn, I decided to write a cProfile mode for my app and see what kinds of functions we were calling.
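
The "cProfile mode" amounted to something like the following sketch, where run_batch() is a made-up stand-in for the worker's main loop:

import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
run_batch()                     # hypothetical: churn through a pile of jobs
profiler.disable()
profiler.dump_stats("worker.pstats")

# Then poke at the dump afterwards.
stats = pstats.Stats("worker.pstats")
stats.sort_stats("cumulative").print_stats(25)
stats.print_callees("_parse")   # produces the kind of view shown further down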

The first thing I noticed was that, somewhere down the line, I had been using response.text.encode("utf-8"), and text in turn was using chardet to detect the character encoding. I recalled from my many hours of performance testing while writing speedparser that chardet can be an absolute dog, so, excitedly assuming I'd had the problem nicked again, I replaced that with the un-encoded response.content and gave it a run. 2-3 jobs/sec.
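
For illustration (the URL is made up), the difference between the two comes down to this: text decodes the body, falling back to chardet detection when the server doesn't declare a charset, while content is just the raw bytes:

import requests

resp = requests.get("http://example.com/feed.xml")   # made-up URL

# May trigger chardet detection over the whole body, then a re-encode:
slow = resp.text.encode("utf-8")

# The raw byte string, with no detection and no re-encoding:
fast = resp.content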

Deflated, embarrassed, and depressed, I went back to my profiler. json and xml were still taking far too long... In my speed tests, ujson was decoding over 100MB/sec, but my printed timing information showed it sometimes taking up to a second for even small (50 kB) requests. Furthermore, the separation between the log timestamps showed that nothing else was being done during this time.

When I went back into the pstats output, I noticed that cProfile disagreed; very little time was spent in ujson. Since ujson is a bit of a black box to the profiler, I turned my attention to speedparser, which uses considerably more python code, and while examining the callees I noticed something very peculiar indeed:

Function                called...
                        ncalls  tottime  cumtime
.../feed.py:48(_parse)  ->  14    0.001    0.323  .../feedlib.py:55(speedparse_safe)
                            14    0.005    0.024  .../feedlib.py:98(postprocess)
                            14    0.000   11.976  .../requests/models.py:759(content)

Why was my completely non-cooperative CPU-bound feed parser (which itself processes ~60MB/sec) spending 12 seconds in the requests library over 14 calls? Digging deeper, I discovered that content is actually a property which returns a lazily-loaded iterator, which itself reads chunk-wise from the raw socket. In 1-byte chunks. Since saving memory is not a requirement for me, I wrote my own Response which just reads the whole thing in at once, and once again went to the server to test it out. 65-70 jobs/sec.

Note: Since the publication of this blog entry, requests has changed the generator to read responses in 10k chunks. This was still a performance problem for me, and I wrote a small wrapper to load the content in 500k chunks instead.
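
With the stream and iter_content APIs in current versions of requests, a wrapper like this gets the same effect without a custom Response (a sketch; the function name and chunk constant are mine):

import requests

CHUNK = 500 * 1024   # 500k chunks, per the note above

def fetch_body(url):
    # stream=True defers the body download, so the chunk size is ours to pick.
    resp = requests.get(url, stream=True)
    return b"".join(resp.iter_content(chunk_size=CHUNK))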

What I suspect was happening, but cannot confirm, is that the generator returned by response.content made it into a GIL-holding section of one of these C libraries, which then yielded (via the generator's use of read) back to gevent's hub, which then went loco for a bit until it chanced upon the right greenthread, read more, and repeated, until the content was totally loaded. This kept any other code from executing during these CPU-bound critical sections, but had the undesired side effect of adding potentially thousands of random scheduling delays. It also kept the log clean of any other goings-on during that interval, which led me to mistakenly believe that this code was running unnaturally slowly. The profiler was right: that code actually spent very little time running; it spent most of its time waiting on read calls.
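
To put a rough number on that suspicion: a 50 kB body read one byte at a time is up to ~50,000 chances to bounce through the hub before the parser can continue. A toy version of that access pattern (host, port, and function name are all made up):

from gevent import socket

def read_one_byte_at_a_time(host, port):
    # Each recv() that finds no buffered data yields to gevent's hub and
    # waits to be rescheduled, potentially thousands of times per body.
    sock = socket.create_connection((host, port))
    chunks = []
    while True:
        b = sock.recv(1)
        if not b:
            break
        chunks.append(b)
    return b"".join(chunks)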

So now that this is all in the past, what could I have done to discover this in less than 50 hours of debugging and hair-pulling? Despite it seeming like the proverbial howitzer when a simple pistol would do, I really should reach for the profiler earlier in the future. The profiling code and subsequent pstats exploration actually only took about 2 hours; and indeed, my only other major breakthrough (the amqp bug) was discovered using a python-based gevent profiler which sadly slowed down my app a bit too much to be practical.

In addition, it's made me appreciate how optimizations can really bite you unknowingly. The lazy chunked-read behavior of response.content is superior to reading the whole response every time, provided the resource you're concerned about is memory. Laziness is generally considered a virtue (certainly, by perl and haskell hackers), as it enables you to occasionally skip work altogether. Ironically, it was the combination of requests' laziness and gevent's cleverness that led to this incredible performance nightmare.

Apr 23 2012