I have been programming in Python for quite some time now, and I've been doing it professionally for over 2 years. Despite this, I am not nearly as proficient at the language as I could be, probably because I am using it professionally and the time I could spend learning the language goes to solving problems instead.
I find myself struggling sometimes to figure out what the "pythonic" way of doing something is. Whenever I realize a new solution to a design problem or just a regular coding problem, I mull over whether or not it fits the language I am writing in first. When I do this in Python, when I use what the language gives me rather than trying to force it to provide the solution I originally thought of, the results are almost always clearer and faster.
The Python daemon I wrote at work in the last few months deals a lot with system state. The Object Oriented Paradigm really shines here, because there is much to share via inheritance, and by extension much to gain. The states I deal with are usually just kept in lists, with objects built around the lists mostly to provide the necessary knowledge on how to create them and occasionally to provide convenient transformations or functionality. There are many places in my program where I want to filter some of these state objects based on an arbitrary parameter; say I want all network interfaces with a last-measured latency under 100 ms, or a list of writable data partitions with over 100 MB free.
The way I was taught to do this, quite frankly, was terrible. If you have some special list, you are taught to write all sorts of special crap for every transformation you want to allow on that list, and that is that. If you have an object that represents a collection of something, like BagOfFruit, under this school of thought, you'd create some methods like BagOfFruit.filterByColor, or even better, the Bag superclass will have implemented a 'functional' style filter that takes a comparison function.
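As a rough sketch of what that school of thought produces: only BagOfFruit and filterByColor come from the description above, while the Fruit class, its fields, and the name of the generic filter method are assumptions made up for illustration.

```python
class Fruit:
    """Hypothetical item type, invented for this illustration."""

    def __init__(self, name, color):
        self.name = name
        self.color = color


class Bag:
    """Generic collection with a 'functional' style filter method."""

    def __init__(self, items):
        self.items = list(items)

    def filter(self, predicate):
        # Build a new bag of the same type from the items that pass.
        kept = []
        for item in self.items:
            if predicate(item):
                kept.append(item)
        return type(self)(kept)


class BagOfFruit(Bag):
    def filterByColor(self, color):
        # One special-purpose method per question you might want to ask.
        return self.filter(lambda fruit: fruit.color == color)


bag = BagOfFruit([
    Fruit("apple", "red"),
    Fruit("lime", "green"),
    Fruit("cherry", "red"),
])
red_names = [fruit.name for fruit in bag.filterByColor("red").items]
```

Every new question about the bag means another method on the class, which is the cost being complained about here.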
The "right" answer, in Python anyway, is a lot simpler. You have at your fingertips one of the most delightfully malleable built-in generic collection objects in the world of programming languages. Want to filter some items from a list? There are a bunch of easy, short, and agile ways to do that in Python code: just grab that list from the object and go. There are four ways I can think of, off the top of my head, to filter items from a list in Python, and most of them look and read better than adding methods everywhere:
The old-fashioned way:
mylist = []
for item in oldlist:
    if item.foo < threshold:
        mylist.append(item)
The functional way:
mylist = filter(lambda item: item.foo < threshold, oldlist)
The itertools way:
mylist = list(itertools.ifilter(lambda item: item.foo < threshold, oldlist))
And the new way:
mylist = [item for item in oldlist if item.foo < threshold]
The old way is the way you'd think to solve this problem if you were a programmer who did not know Python. You think about what you know you can do and what you need to return, and you go about creating it. You need a list, so you make one, and then you add to it everything that meets your conditions. Even though this is a very manual way of doing this in Python, it's still a very useful level of abstraction over C/C++: there are no indices or bounds to manage by hand.
The functional way is so named because map, filter, reduce and their ilk were created historically for programmers used to that paradigm. Functional programmers deal mostly in data transformations (that's what functions do, since there's no state: everything's a transform), and as such they already had "patterns" on how to deal with many of these "I have a list and I want to do something with it" problems. Unfortunately, if you don't know Python, and you aren't from a functional programming background (which is probably true), you'd have to reach for the documentation on 'filter' to know what was really going on in this code.
itertools is a module in the Python standard library that provides iterator-returning versions of many built-in functions. Although it's not the case in this example, if you were not planning on going through every item in the original list, and if that list was very large, using ifilter to filter your original list might save a lot of time, because items are only tested as they are requested.
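A small sketch of that laziness, under some loud assumptions: it is written for Python 3, where itertools.ifilter no longer exists and the built-in filter already returns an iterator, and the Item class, the counting predicate, and the numbers are all made up for the demonstration.

```python
import itertools


class Item:
    """Hypothetical stand-in for a state object with a numeric attribute."""

    def __init__(self, foo):
        self.foo = foo


def make_counting_predicate(threshold):
    """Return a predicate plus a list recording every value it was asked about."""
    seen = []

    def predicate(item):
        seen.append(item.foo)
        return item.foo < threshold

    return predicate, seen


oldlist = [Item(n) for n in range(100000)]

# Lazy: nothing is tested until something pulls from the iterator,
# and islice stops pulling after three matches.
lazy_pred, lazy_seen = make_counting_predicate(50000)
lazy = filter(lazy_pred, oldlist)
first_three = list(itertools.islice(lazy, 3))

# Eager: building the full list tests every item up front.
eager_pred, eager_seen = make_counting_predicate(50000)
everything = [item for item in oldlist if eager_pred(item)]

print(len(lazy_seen), len(eager_seen))
```

The lazy version only touches as many items as it needs to produce the three results that were asked for, while the comprehension tests the whole list.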
Now, on to the "answer." The new way is the best for lots of reasons. It's short, uses the list literal syntax in the creation of mylist, contains only semantics about the creation of the list (no bookkeeping or comparison function creation), and is more flexible than filter, since you can trivially store transformations of 'item' in the resultant list (you can do this by composing filter & map, but if you do it in the straightforward way, then have fun iterating over the whole list twice).
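To illustrate the flexibility point, here is a minimal sketch that filters and transforms in one expression. The Interface class and its name and latency_ms attributes are hypothetical, loosely modeled on the network interface example earlier.

```python
class Interface:
    """Hypothetical state object; the attribute names are assumptions."""

    def __init__(self, name, latency_ms):
        self.name = name
        self.latency_ms = latency_ms


interfaces = [
    Interface("eth0", 12.5),
    Interface("wlan0", 240.0),
    Interface("eth1", 99.9),
]

# Filter and transform in a single comprehension: one pass, no lambdas.
fast_names = [iface.name for iface in interfaces if iface.latency_ms < 100]

# The same result by composing filter and map.
fast_names_functional = list(
    map(lambda iface: iface.name,
        filter(lambda iface: iface.latency_ms < 100, interfaces))
)
```

Under Python 2, which this post was written against, filter returns a full list that map then walks again. In Python 3 both are lazy, so the composed version no longer builds an intermediate list, but the comprehension is still the shorter of the two.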
But, back to my original question, which way is fastest? I wrote up a quick little test of dubious scientific quality, and here were the results:
running oldway() 100000 times ... 2.91310501099
running functional() 100000 times ... 3.14215993881
running itertools way (iter_) 100000 times ... 3.90754389763
running newway() 100000 times ... 2.10518980026
Not only is the list comprehension cleaner semantically than the other three, but it is a lot faster. What if you wanted to iterate over the filtered list and do some more complex operations on the filtered set? This is presumably where itertools would be the Right Way (tm), but it looks like it's almost twice as slow as the comprehension. Indeed, when I added code to the comprehension to save the list, iterate over the whole saved list (but do nothing), and then return the filtered list, it still ran 100000 times in only 2.61785793304 seconds.
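For anyone who wants to rerun the comparison, a harness along these lines should do. This is a sketch and not the original test script: the Item class, the list size, the threshold, and the repeat count are all assumptions. It targets Python 3, where itertools.ifilter is gone and the built-in filter is already lazy, so the functional and itertools variants collapse into one.

```python
import timeit


class Item:
    """Hypothetical object with the numeric attribute being filtered on."""

    def __init__(self, foo):
        self.foo = foo


oldlist = [Item(n) for n in range(100)]
threshold = 50


def oldway():
    mylist = []
    for item in oldlist:
        if item.foo < threshold:
            mylist.append(item)
    return mylist


def functional():
    # list() is needed on Python 3, where filter returns an iterator.
    return list(filter(lambda item: item.foo < threshold, oldlist))


def newway():
    return [item for item in oldlist if item.foo < threshold]


def run(func, number=1000):
    """Time `number` calls of func and print the total in seconds."""
    elapsed = timeit.timeit(func, number=number)
    print("running %s() %d times ... %s" % (func.__name__, number, elapsed))
    return elapsed


timings = {func.__name__: run(func) for func in (oldway, functional, newway)}
```

The absolute numbers will depend on the machine and the interpreter version, so treat any ranking it prints as a local measurement and not a general result.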
The real importance of this all isn't that list comprehensions are the fastest, it's that their semantic purity is not a performance tradeoff. They really are the most pythonic way to approach this particular problem, and they happen to be the fastest. I ran into a similar problem with some timestamp printing code in a small logging library I wrote:
import time

# Offset between UTC and local time in seconds, computed once at import time.
tz_adjust = (time.gmtime()[3] - time.localtime()[3]) * 3600

def default():
    t = time.time() - tz_adjust
    # Two decimal places of the epoch time give the hundredths of a second.
    ms = ("%.2F" % t).split(".")[1]
    return time.strftime('%H:%M:%S', time.gmtime(t)) + '.' + ms
Since this function would format the timestamp for every logged message, you had better believe that I tested the crap out of it to make sure it was the fastest I could manage. I was dismayed that the time module didn't give me anything better to work with than time.time as far as getting microseconds; as you can see from the code, not only do I have to convert the value (which is seconds since the epoch with 6 decimal digits of microseconds) in order to get the time, but I had to pre-calculate the timezone offset since time.time() apparently always returns its value in GMT/UTC.
Later on, I found out (writing something else where I needed to deal with dates) that the datetime module has a function/object called datetime.datetime.now(), which returns an object that has the current local time and microseconds in easily accessible attributes! I re-wrote my function as follows:
import datetime

def default2():
    dt = datetime.datetime.now()
    return "%02d:%02d:%02d.%s" % (dt.hour, dt.minute, dt.second, str(dt.microsecond)[:2])
It ran faster, it was less code, no more fooling around with timezones. What more could I ask? Correctness would be nice. This code has a bug in the way it displays hundredths of a second: str(dt.microsecond) drops leading zeros, so 5000 microseconds (0.005 s) is printed as .50 instead of .00. While thinking it over, I remembered something I had noticed while goofing with the old code: in Python, even if you think a numeric task is going to be complicated, it's almost always faster to stick with integer operations than to convert to a string. I only wanted 2 digits (hundredths), and the conversion would be as easy as dt.microsecond/10000, so I wrote default3(), a version of default2() that uses integer division. Here are the times for default, default2, and default3 over 1000000 calls:
- default: 14.675798892974854
- default2: 10.18687105178833
- default3: 9.7911970615386963
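The bug in default2 is easiest to see with a fixed timestamp. In this sketch the two formatters take the datetime as an argument so that the output is reproducible; the function names and the sample date are made up, and `//` is used because `/` on integers is true division in Python 3.

```python
import datetime


def format_with_str_slice(dt):
    # default2's approach: take the first two characters of the microsecond string.
    return "%02d:%02d:%02d.%s" % (
        dt.hour, dt.minute, dt.second, str(dt.microsecond)[:2])


def format_with_int_division(dt):
    # Integer division yields hundredths; %02d restores any leading zero.
    return "%02d:%02d:%02d.%02d" % (
        dt.hour, dt.minute, dt.second, dt.microsecond // 10000)


# 5000 microseconds is 0.005 s, so the hundredths field should read "00".
dt = datetime.datetime(2009, 1, 1, 12, 30, 45, 5000)
buggy = format_with_str_slice(dt)      # "12:30:45.50"
fixed = format_with_int_division(dt)   # "12:30:45.00"
```

The string slice only gives the right answer when the microsecond value happens to have all six digits.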
My code now basically looked like this: return "%02d:%02d:%02d.%02d" % (dt.hour, dt.minute, dt.second, dt.microsecond/10000). It was far cleaner than the original, actually correct unlike the second one, and the run time dropped by around 33% (from about 14.7 to about 9.8 seconds). Unfortunately, in both of these examples, the multitude of possibilities obscured the "right" solution; hopefully, with Python3k moving forward, and my Python skills moving with it, some of the standard library can be merged so that a good mental coverage of it will be easier.