
Planet Python

Last update: November 21, 2009 09:42 AM

November 21, 2009


Steve Holden

Links for 2009-11-20 [del.icio.us]

November 21, 2009 08:00 AM


Heikki Toivonen

Turbogears2 on Dreamhost

It has been almost two years since I tested Turbogears 1 on Dreamhost. Back then it was quite difficult for me to get it running. But some additional personal experience and improvements in Turbogears2 have made it a breeze. I tested with Turbogears 2.0 although I upgraded to 2.1a2 at some point.

First you need to get virtualenv installed, which is pretty simple after you have downloaded and unpacked the source tarball: python2.5 virtualenv.py $HOME. (I wanted it installed in $HOME, but you could use alternative locations as well.) This will install setuptools, but somehow not virtualenv. Then you just do easy_install virtualenv. You will also need PasteDeploy so do: easy_install PasteDeploy.

Next steps might be different for installing a Turbogears2 egg/application, but I used these instructions to install the wiki-20 tutorial in development mode. (To install a properly packaged app you probably just need to do: easy_install app_tarball; paster make-config yourapp production.ini and follow the instructions from FastCGI onwards.)

After that you just follow tg2 automatic installation instructions.

Then use paster quickstart to create a new project template. cd to the created directory, and run python setup.py develop to download any missing dependencies and set things up for debugging and development.

Edit as instructed in the tutorial. Then paster setup-app development.ini.

After that it is time to create the production ini: paster make-config Wiki20 production.ini.

Next step is getting this running with FastCGI. Create wiki20.fcgi in the webroot directory:

#!/home/your-username/path/to/tg2env/bin/python
 
from fcgi import WSGIServer # you could also use flup etc.
 
from paste.deploy import loadapp
real_app = loadapp('config:/home/your-username/path/to/production.ini')
 
def myapp(environ, start_response):
    environ['SCRIPT_NAME'] = ''  # get rid of the .fcgi in urls
    return real_app(environ, start_response)
 
server = WSGIServer(myapp)
server.run()

There are a couple of points of note here:

Next we’ll need a .htaccess file:

# Enable Dreamhost stats
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /
RewriteCond %{REQUEST_URI} ^/(stats|failed_auth\.html).*$ [NC]
RewriteRule . - [L]
</IfModule> 
# FastCGI
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /
RewriteRule ^wiki20\.fcgi/ - [L]
RewriteRule ^(.*)$ wiki20.fcgi/$1 [L]
</IfModule>

Now when you go to your site, the first request is going to take a while because the app has to load, but after that things will be snappy as long as the app stays in memory.
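The SCRIPT_NAME trick in the myapp wrapper of wiki20.fcgi can be tried without FastCGI or TurboGears at all. The following is only a sketch for Python 3, assuming the intent is to reset SCRIPT_NAME to an empty string; demo_app, strip_script_name and call_wsgi are made-up names, and demo_app merely stands in for the application that loadapp() would return.

```python
# Minimal sketch (Python 3): a WSGI wrapper that blanks SCRIPT_NAME.
# demo_app, strip_script_name and call_wsgi are illustrative names only;
# they are not part of TurboGears, Paste or any FastCGI library.

def demo_app(environ, start_response):
    """Stand-in for the real application returned by loadapp()."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    body = "SCRIPT_NAME=%r PATH_INFO=%r" % (
        environ.get("SCRIPT_NAME"), environ.get("PATH_INFO"))
    return [body.encode("utf-8")]


def strip_script_name(app):
    """Wrap app so that generated URLs do not contain the .fcgi script name."""
    def wrapper(environ, start_response):
        environ["SCRIPT_NAME"] = ""
        return app(environ, start_response)
    return wrapper


def call_wsgi(app, environ):
    """Call a WSGI app once and return (status, body bytes)."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    body = b"".join(app(environ, start_response))
    return captured["status"], body


status, body = call_wsgi(
    strip_script_name(demo_app),
    {"SCRIPT_NAME": "/wiki20.fcgi", "PATH_INFO": "/FrontPage"},
)
print(status, body.decode("utf-8"))
```

The wrapped application sees an empty SCRIPT_NAME while PATH_INFO is left alone, which is why links built by the framework should no longer include wiki20.fcgi.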

November 21, 2009 04:51 AM


Simon Wittber

Giant, Python Powered Robots.

These are the robots I've been working on for the last 12 months. They each weigh about 11 tonnes and have a 17 meter reach.

The control system is written in Python, with small sections of C which run in hard-real-time to guarantee safety. The robots work cooperatively, semi-autonomously, with drive-by-wire style assistance when under manual control.

Update: added a close-up of the business end. This claw weighs just over 1 tonne, and gets hurled around at up to 3.5 meters per second.

November 21, 2009 03:37 AM


Carl Trachte

Pycon 2010 pre-favorites - the Carl T. edition

Catherine Devlin posted the talks she'd like to attend at Pycon.  It was fun reading her picks - she has a great sense of humor, although I'm bummed that everyone now knows about the submarine robot talk, and I'll have to fight for room in the lecture hall to get a spot.

These are my picks (not in order - the truth is I could spend all day talking about a lot of these talks - they are all good - too short a life - too many good talks):

1) Ha, go to the depths of the sea to your Octopus' Garden with your submarine robot, Catherine.  I'm heading skyward with robots in space.

2) Jython in the Military is near and dear to me.  Besides, bossman is giving the talk - never hurts to show up and support the team.

3) While we're on the subject of Jython, Extending Java Applications with Jython by Wierzbicki is one that has potential to make me less ignorant.


4) My posts on this blog have all been about Python 3.x - Chun's talk, Python 3:  The Next Generation and Smith's talk about Python 3.x string formatting are two I could seriously benefit from.  When Anthony Baxter gave a tutorial on Python 3 at OSCON a couple years back, I just about died when he said he wasn't going to cover new string formatting.  He gave me this look like, "Wow, who are you, are you nuts?"  And I gave him this look like, "A total Python nobody, and . . . as a matter of fact, yeah."  Anyway, the string formatting one is pretty important to me.


5) Changes to Unittest and Intro to Unittest - yeah, four years after being exposed to testing, I'm still taking baby steps - sue me.


6) The Diversity Suite - one diversity talk and two technical ones - talks Diversity as a Dependency, Infrastructure Construction in Africa, Hackerlab.  
Anna R. is giving the diversity one - I attended her women in computing one a few years back and learned a bit, plus she was cool enough to let me dink around with her brand new minicomputer - a novelty at the time.
There is so little about Africa out there on the web relative to Europe and the Far East - I want to see what's going on.
Hackerlab (actually Think Globally, Hack Locally - Teaching Python in Your Community) is Leigh's talk.  Leigh is a smart security person.  In addition she's upbeat and happy.  I, by contrast, am morose and depressive.  I am hoping to get smarter and more upbeat through the process of osmosis.

7) The Python Language Suite - The Mighty Dictionary, Deconstruction of an Object, Decorators, The Command Line.  I'm not as strong on some of the basics as I'd like to be.  No shortage of places to learn and get up to speed.


8) Dealing with unsightly data in the real world.  Wow, the story of my life - "Singing my life with his words, Killing me softly with his talk . . ."


There's a zillion other talks I'd like to attend; actually, I probably will end up attending only half of these.  I'm not a web programmer - there's a ton of talks on that end of things.  Check it out.

November 21, 2009 12:50 AM


Mikeal Rogers

JSON Performance in Python

In part of my ongoing performance work in our CouchDB+Python application I’ve decided to sit down and profile JSON performance in the different open source libraries available for Python.

I ran this test profiling json (pure Python simplejson) available in Python stdlib, simplejson compiled with C speedups, cjson, and jsonlib2, with a large JSON document. The test decodes and encodes a large JSON object 100 times. It then runs that test 100 times in each library in succession in order to find the average encode/decode time for each library and minimize other environmental factors that may occur. These numbers were taken on my MacBook Air running Mac OS X 10.6.1 with the default Python 2.6.

The time represents in milliseconds how long it takes to encode/decode this JSON object 100 times.
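For anyone who wants to repeat this kind of measurement, here is a rough sketch of the method rather than the actual test script, which is not shown here. It is written for Python 3 and times only the stdlib json module; make_document and time_codec are hypothetical helpers, and the generated document is invented purely for illustration.

```python
import json
import timeit


def make_document(n_items=200):
    """Build a made-up nested document; the original test document is not shown."""
    return {
        "rows": [
            {"id": i, "key": "doc-%d" % i,
             "value": {"tags": ["a", "b", "c"], "score": i * 0.5}}
            for i in range(n_items)
        ]
    }


def time_codec(dumps, loads, document, rounds=100, repeats=5):
    """Return the mean time in ms for `rounds` encodes and `rounds` decodes."""
    encoded = dumps(document)
    encode_runs = timeit.repeat(lambda: dumps(document),
                                number=rounds, repeat=repeats)
    decode_runs = timeit.repeat(lambda: loads(encoded),
                                number=rounds, repeat=repeats)
    return {
        "encode_ms": 1000.0 * sum(encode_runs) / len(encode_runs),
        "decode_ms": 1000.0 * sum(decode_runs) / len(decode_runs),
    }


document = make_document()
timings = time_codec(json.dumps, json.loads, document, rounds=100, repeats=5)
print("json: encode %.2f ms, decode %.2f ms per 100 passes"
      % (timings["encode_ms"], timings["decode_ms"]))
```

Passing another library's encode and decode functions to time_codec gives numbers that can be compared on the same machine; absolute values will differ from the chart.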

[Chart: JSONPerf, encode/decode times in milliseconds for each library]

I honestly didn’t expect the stdlib json to be this far behind.

Among the other C based libraries there isn’t a clear winner. cjson is the best decoder but the slowest encoder, simplejson compiled with C speedups is the fastest encoder but the slowest decoder while jsonlib2 is somewhere in the middle for both cases.

Also, annoyingly, cjson doesn’t implement the same API as the other libraries (dump and load functions are named encode and decode) making it much more difficult for a library to include support for all available libraries. Now rather than just being able to add a user defined json module I’ll need to add support for user defined parsing and encoding functions to couchdb-pythonviews, couchquery, and couchdb-wsgi.
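One way a library could paper over the naming difference is to accept any JSON module and look up the pair of functions it offers. This is only a sketch of that idea in Python 3, not what couchdb-pythonviews, couchquery or couchdb-wsgi actually do; get_json_codec is a made-up helper, and fake_cjson is a stand-in object so the example runs without cjson installed.

```python
import json
import types


def get_json_codec(module):
    """Return (encode, decode) callables for a JSON module.

    Handles modules that expose dumps/loads (json, simplejson) and
    modules that expose encode/decode (the cjson naming).
    """
    if hasattr(module, "dumps") and hasattr(module, "loads"):
        return module.dumps, module.loads
    if hasattr(module, "encode") and hasattr(module, "decode"):
        return module.encode, module.decode
    raise TypeError("%r does not look like a JSON module" % (module,))


# Stand-in with cjson-style names, so the sketch runs without cjson installed.
fake_cjson = types.SimpleNamespace(encode=json.dumps, decode=json.loads)

for candidate in (json, fake_cjson):
    encode, decode = get_json_codec(candidate)
    assert decode(encode({"ok": True})) == {"ok": True}
```

Letting the user hand over the two functions directly, as the post suggests, is more flexible still, since it also covers libraries with entirely different names.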

November 21, 2009 12:07 AM


Alex Gaynor

Things College Taught me that the "Real World" Didn't

A while ago Eric Holscher blogged about things he didn't learn in college. I'm going to take a different spin on it, looking at both things that I did learn in school that I wouldn't have learned elsewhere (henceforth defined as my job, or open source programming), as well as things I learned elsewhere instead of at college.

Things I learned in college:

Big O notation, and algorithm analysis. This is the biggest one. I've had little cause to consider this in my open source or professional work; stuff is either fast or slow and that's usually enough. Rigorous algorithm analysis doesn't come up all the time, but every once in a while it pops up, and it's handy.

C++. I imagine that I eventually would have learned it myself, but my impetus to learn it was that's what was used for my CS2 class, so I started learning with the class then dove in head first. Left to my own devices I may very well have stayed in Python/Javascript land.

Finite automata and pushdown automata. I actually did lexing and parsing before I ever started looking at these in class (see my blog posts from a year ago) using PLY; however, this semester I've actually been learning about the implementation of these things (although sadly for class projects we've been using Lex/Yacc).


Things I learned in the real world:

Compilers. I've learned everything I know about compilers from reading papers out of my own interest and hanging around communities like Unladen Swallow and PyPy (and even contributing a little).

Scalability. Interestingly, this is a concept related to algorithm analysis/big O; however, it is something I've really learned from talking about this stuff with guys like Mike Malone and Joe Stump.

APIs, Documentation. These are the core of software development (in my opinion), and I've definitely learned these skills in the open source world. You don't know what a good API or documentation is until it's been used by someone you've never met and it just works for them, and they can understand it perfectly. One of the few required, advanced courses at my school is titled, "Software Design and Documentation" and I'm deathly afraid it's going to waste my time with stuff like UML, instead of focusing on how to write APIs that people want to use and documentation that people want to read.


So these are my short lists. I've tried to highlight items that cross the boundaries between what people traditionally expect are topics for school and topics for the real world. I'd be curious to hear what other people's experience with topics like these are.

November 21, 2009 12:03 AM

November 20, 2009


IronPython-URLs

IronPython 2.6 Release Candidate 3

IronPython 2.6 is the up-and-coming version of IronPython targeting compatibility with Python 2.6. As well as the new features in Python 2.6, IronPython 2.6 has several important new features specific to IronPython. These include:

IronPython 2.6 Release Candidate 3 has just been released. The hope is that this will be the last release candidate before the final release:
We’re pleased to announce the third and hopefully final release candidate of IronPython 2.6. Release Candidate 3 only includes Silverlight-related changes pertaining to some incompatibilities between 2.6 RC1 and RC2. Those who utilize IronPython for non-Silverlight scenarios will happily find virtually no churn from RC2. We strongly encourage everyone interested in Silverlight to test out this release ASAP because we plan on releasing IronPython 2.6 final in a week if no major new regressions are detected.


November 20, 2009 11:20 PM

Two Articles: IronPython 2.0 and WPF Error

Two more articles from Ibrahim Kivanc, the Turkish blogger who has written several articles on IronPython and Silverlight. Both of these articles are in English.

IronPython 2.0 version now runs on DLR (Dynamic Language Runtime). DLR is a platform on .NET which is host Dynamicly typed languages on it. Now Dynamic Languages Communicate eachother and C#,VB, COM Objects, .NET Libraries.

IronPython, with 2.0 version runs on DLR (Dynamic Language Runtime); it’s a platform like CLR architecture. It’s host for Dynamic Languages on .NET. With this architecture Dynamic Languages now faster then running on CLR and easily communicate with other .NET objects!
In my opinion IronPython Studio is not stable enough for production use. It does have the advantage of being integrated in Visual Studio so some people can't resist trying it out. (You can read my write-up of IronPython Studio at: IronPython Tools and IDEs.)

If you insist on trying out IronPython Studio there are various minor trials and tribulations for you to overcome; using the WPF designer is one of them. In this article Ibrahim explains how to jump over this particular hurdle and get the WPF designer working in IronPython Studio.


November 20, 2009 10:04 PM


Steve Holden

Starting 2010 With a Bang

Holden Web's first one-day workshop was, thanks to Jacob Kaplan Moss, a sell-out success. As a result, and partially due to some excellent feedback from the New York City Python Meetup group, we will be running the same workshop in New York on January 22, again with Jacob presenting. We are also offering a one-day IronPython workshop presented by Michael Foord on January 21.

Since the three-day Introduction to Python classes have been well-received in Virginia we are also offering that class in New York on January 18-20.

To try and make things easier for those attending and smooth out our administration we are using Eventbrite for the first time. I would really like to know how easy people find it to get information about our classes and to enroll for them. Anyone wanting specific information not mentioned in the course outlines is, of course, welcome to contact us for further details.

If you would like to take one of these classes simply follow the links above (or click here for a list of all our current offerings, then just go to the ones you are interested in) and click the Order Now button which should be clearly visible. Once you have entered the details click the Review Your Order button, and you have fifteen minutes to check that you have entered the correct information before you click the Pay Now button. It really couldn't be much easier, I hope.

We are also very interested to know what other events you would like us to run. This is the front end of a new venture for Holden Web, and your opinions and requirements (places you'd like to attend presentations as well as other topics) will help us to move in the right direction. So feel free to contact us with your suggestions, or make them in comments below. Thanks in advance for the feedback.

November 20, 2009 09:41 PM


Logilab

First Pylint Bug Day on Nov 25th, 2009 !

Since we never stop being overloaded here at Logilab, and since we got some encouraging feedback after the "Pylint needs you" post, we decided to take some time to bring more "community" into pylint.

And the easiest thing to do, sooner rather than later, is an IRC/Jabber synchronized bug day, which will be held on Wednesday, November 25. We're based in France, so the main developers will be there between around 8am and 7pm UTC+1. If a few of you are around Paris at that time and wish to come to Logilab to sprint with us, contact us and we'll try to make this possible.

The focus for this bug killing day could be:

We will of course also try to kill a hella-lotta bugs, but the main idea is to help whoever wants to contribute to pylint... and plan for the next bug-killing day !

As we are in the process of moving to another place, we can't organize a sprint yet, but we should have some room available for the next time, so stay tuned :)

November 20, 2009 06:38 PM


Dave Beazley

Python Thread Deadlock Avoidance

One danger of writing programs based on threads is the potential for deadlock--a problem that almost invariably shows up if you happen to write thread code that tries to acquire more than one mutex lock at once. For example:

import threading

a_lock = threading.Lock()
b_lock = threading.Lock()

def foo():
    with a_lock:
         ...
         with b_lock:
              # Do something
              ...

t1 = threading.Thread(target=foo)
t1.start()

Code like that looks innocent enough until you realize that some other thread in the system also has a similar idea about locking--but acquires the locks in a slightly different order:

def bar():
    with b_lock:
         ...
         with a_lock:
              # Do something (maybe)
              ...

Sure, the code might be lucky enough to work "most" of the time. However, you will suffer a thousand sorrows if both threads try to acquire those locks at about the same time and you have to figure out why your program is mysteriously nonresponsive.
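If you want to see the bad interleaving without actually hanging your interpreter, it can be forced. The sketch below is my own contrivance for Python 3 (it needs threading.Barrier and the timeout argument of Lock.acquire, neither of which exists in Python 2.6): each thread grabs its first lock, waits until the other thread has done the same, and then gives up on the second lock after a short timeout instead of blocking forever. The worker function and the two barrier names are invented for the demonstration.

```python
import threading

a_lock = threading.Lock()
b_lock = threading.Lock()

# Barriers force the unlucky interleaving every time (demonstration only).
both_hold_first = threading.Barrier(2)
both_tried_second = threading.Barrier(2)
results = {}


def worker(name, first, second):
    with first:
        both_hold_first.wait()            # now each thread holds one lock
        got_second = second.acquire(timeout=0.2)
        results[name] = got_second
        if got_second:
            second.release()
        both_tried_second.wait()          # keep holding `first` until both tried


t1 = threading.Thread(target=worker, args=("foo", a_lock, b_lock))
t2 = threading.Thread(target=worker, args=("bar", b_lock, a_lock))
t1.start()
t2.start()
t1.join()
t2.join()
print(results)
```

Both entries in results come out False: neither thread could get its second lock while the other was holding it. With plain blocking acquires, that same situation is the deadlock.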

Computer scientists love to spend time thinking about such problems--especially if it means they can make up some diabolical problem about philosophers that they can put on an operating systems exam. However, I'll spare you the details of that.

The problem of deadlock is not something that I would normally spend much time thinking about, but I recently saw some material talking about improved thread support in C++0x. For example, this article has some details. In particular, it seems that C++0x offers a new locking operation std::lock() that can acquire multiple mutex locks all at once while avoiding deadlock. For example:

std::unique_lock<std::mutex> lock_a(a.m,std::defer_lock);
std::unique_lock<std::mutex> lock_b(b.m,std::defer_lock);
std::lock(lock_a,lock_b);      // Lock both locks
...
... do something involving data protected by both locks
...

I don't actually know how C++0x implements its lock() operation, but I do know that one way to avoid deadlock is to put some kind of ordering on all of the locks in a program. If you then strictly enforce a policy that all locks have to be acquired in increasing order, you can avoid deadlock. Just as an example, if you had two locks A and B, you could assign a unique number to each lock such as A=1 and B=2. Then, in any part of the program that wanted to acquire both lock A and B, you just make a rule that A always has to be acquired first (because its number is lower). In such a scheme, the thread bar() shown earlier would simply be illegal. That lock() operation in C++ is almost certainly doing something similar to this--that is, it knows enough about the locks so that they can be acquired without deadlock.

All of this got me thinking--I wonder how hard it would be to implement the lock() operation in Python? Not hard as it turns out. First step is to change the name--given that acquire() is the typical method used to acquire a lock, let's just call the operation acquire() to make it more clear. You can define acquire() as a context-manager and simply order locks according to their id() value like this:

class acquire(object):
    def __init__(self,*locks):
        self.locks = sorted(locks, key=lambda x: id(x))
    def __enter__(self):
        for lock in self.locks:
            lock.acquire()
    def __exit__(self,ty,val,tb):
        for lock in reversed(self.locks):
            lock.release()
        return False

Okay, that was easy enough to do, but does it work? Let's try it on the classic dining philosophers problem (look it up if you need a refresher):

import threading

# The philosopher thread
def philosopher(left, right):
    while True:
        with acquire(left,right):
             print threading.currentThread(), "eating"

# The chopsticks
NSTICKS = 5
chopsticks = [threading.Lock() 
              for n in xrange(NSTICKS)]

# Create all of the philosophers
phils = [threading.Thread(target=philosopher,
                          args=(chopsticks[n],chopsticks[(n+1) % NSTICKS]))
         for n in xrange(NSTICKS)]

# Run all of the philosophers
for p in phils:
    p.start()

If you try this code, you'll find that the philosophers run all day with no deadlock. Just as an experiment, you can try changing the philosopher() implementation to one that acquires the locks separately:

def philosopher(left, right):
    while True:
        with left:
             with right:
                 print threading.currentThread(), "eating"

Yep, almost instantaneously deadlock. So, as you can see, our acquire() operation seems to be working.
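The listings above are Python 2 and loop forever, which makes them awkward to check automatically. Here is a bounded variant for Python 3, offered as a sketch only: the acquire class is the same idea as before, while MEALS, meals_eaten and the extra index argument to philosopher() are additions of mine so that the run terminates and can be verified.

```python
import threading


class acquire(object):
    """Acquire several locks in a fixed global order (sorted by id())."""

    def __init__(self, *locks):
        self.locks = sorted(locks, key=id)

    def __enter__(self):
        for lock in self.locks:
            lock.acquire()

    def __exit__(self, ty, val, tb):
        for lock in reversed(self.locks):
            lock.release()
        return False


NSTICKS = 5
MEALS = 1000                      # bounded, so the program finishes
chopsticks = [threading.Lock() for _ in range(NSTICKS)]
meals_eaten = [0] * NSTICKS       # each philosopher only writes its own slot


def philosopher(n, left, right):
    for _ in range(MEALS):
        with acquire(left, right):
            meals_eaten[n] += 1


phils = [threading.Thread(target=philosopher,
                          args=(n, chopsticks[n], chopsticks[(n + 1) % NSTICKS]),
                          daemon=True)
         for n in range(NSTICKS)]

for p in phils:
    p.start()
for p in phils:
    p.join(timeout=30)

finished = not any(p.is_alive() for p in phils)
print(finished, meals_eaten)
```

Every philosopher should finish all of its meals. The threads are daemonic and joined with a timeout only so that a mistake in the ordering logic would show up as finished being False rather than as a hung process.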

There's still one last aspect of this experiment that needs to be addressed. One potential problem with our acquire() operation is that it doesn't prevent a user from using it in a nested manner as before. For example, someone might write code like this:

with acquire(a_lock,b_lock):
     ...
     with acquire(c_lock, d_lock):
          ...

Catching such cases at the time of definition would be difficult (if not impossible). However, we could make the acquire() context manager keep a record of all previously acquired locks using a list placed in thread local storage. Here's a new implementation--and just for kicks, I'm going to switch it over to a context manager defined by a generator (mainly because I can and generators are cool):

import threading
from contextlib import contextmanager

local = threading.local()
@contextmanager
def acquire(*locks):
    locks = sorted(locks, key=lambda x: id(x))   
    acquired = getattr(local,"acquired",[])
    # Check to make sure we're not violating the order of locks already acquired   
    if acquired:
        if max(id(lock) for lock in acquired) >= id(locks[0]):
            raise RuntimeError("Lock Order Violation")
    acquired.extend(locks)
    local.acquired = acquired
    try:
        for lock in locks:
            lock.acquire()
        yield
    finally:
        for lock in reversed(locks):
            lock.release()
        del acquired[-len(locks):]

If you use this version, you'll find that the philosophers work just fine as before. However, now consider this slightly modified version with the nested acquires:

# The philosopher thread                                                                                             
def philosopher(left, right):
    while True:
        with acquire(left):
            with acquire(right):
                print threading.currentThread(), "eating"

Unlike the previous version that had nested with statements and deadlocked, this one runs. However, one of the philosophers crashes with a nasty traceback:

Exception in thread Thread-5:
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/threading.py", line 522, in __bootstrap_inner
    self.run()
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/threading.py", line 477, in run
    self.__target(*self.__args, **self.__kwargs)
  File "hier4.py", line 53, in philosopher
    with acquire(right):
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/contextlib.py", line 16, in __enter__
    return self.gen.next()
  File "hier4.py", line 35, in acquire
    raise RuntimeError("Lock Order Violation")
RuntimeError: Lock Order Violation

Very good. That's exactly what we wanted.
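The same check can be exercised deterministically in a single thread. This is a sketch of a Python 3 port of the generator-based acquire(), with two locks named low and high (my naming) chosen by sorting on id() so that we know in advance which nesting order is legal.

```python
import threading
from contextlib import contextmanager

local = threading.local()


@contextmanager
def acquire(*locks):
    locks = sorted(locks, key=id)
    acquired = getattr(local, "acquired", [])
    # Refuse to take a lock that sorts at or below one this thread already holds.
    if acquired and max(id(lock) for lock in acquired) >= id(locks[0]):
        raise RuntimeError("Lock Order Violation")
    acquired.extend(locks)
    local.acquired = acquired
    try:
        for lock in locks:
            lock.acquire()
        yield
    finally:
        for lock in reversed(locks):
            lock.release()
        del acquired[-len(locks):]


low, high = sorted([threading.Lock(), threading.Lock()], key=id)

# Nesting in increasing id() order is allowed.
in_order_ok = False
with acquire(low):
    with acquire(high):
        in_order_ok = True

# Nesting in decreasing order is rejected before the second lock is touched.
try:
    with acquire(high):
        with acquire(low):
            pass
    violation = None
except RuntimeError as exc:
    violation = str(exc)

print(in_order_ok, violation)
```

Note that the outer context manager still releases its lock when the inner one raises, so the thread-local record ends up empty again.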

So, what's the moral of this story? First of all, I don't think you should use this as a license to go off and write a bunch of multithreaded code that relies on nested lock acquisitions. Sure, the context manager might catch some potential problems, but it won't change the fact that you'll still want to blow your head off after debugging some other horrible problem that comes up with your overly clever and/or complicated design.

I think the main take-away is an appreciation for Python's context-manager feature. There's so much more you can do with a context manager than simply closing a file or releasing an individual lock.

Disclaimer: I didn't do a hugely exhaustive internet search to see if anyone else had implemented anything similar to this in Python. If you know of some links to related work, tell me. I'll add them here.

November 20, 2009 03:04 PM


Ned Batchelder's blog

On business English

I've noticed in the office, people often refer to meetings with just an adjective phrase, letting "meeting" be implied: "I have to go, I have a 2:00". Why don't we carry this to its logical extreme? Let's just say things like,

I have to get ready for the weekly.

Lunch has to be quick, I have to get back for a boring.

I think I'm going to skip the pointless.

November 20, 2009 10:43 AM


Spyced

Automatic project structure inference

David MacIver has an interesting blog entry up about determining logical project structure via commit logs. I was very interested because one of Cassandra's oldest issues is creating categories for our JIRA instance. (I've never been a big fan of JIRA, but you work with the tools you have. Or the ones the ASF inflicts on you, in this case.)

The desire to add extra work to issue reporting for a young project like Cassandra strikes me as slightly misguided in the first place. I have what may be an excessive aversion to overengineering, and I like to see a very clear benefit before adding complexity to anything, even an issue tracker. Still, I was curious to see what David's clustering algorithm made of things. And after pestering him to show me how to run his code I figure I owe it to him to show my results.

In general it did a pretty good job, particularly with the mid-sized groups of files. The large groups are just noise; the small groups, well, it's not exactly a revelation that Filter and FilterTest go together. I'd be tempted to play with it more but with only about two months and 250 commits in the apache repo there's not really all that much data there. (Cassandra's first two years were in an internal Facebook repository.) Working with data that exists as a side effect of natural activity is fascinating.

November 20, 2009 05:29 AM


Catherine Devlin

PyCon pre-favorites

When I look over the PyCon 2010 talk list, I'd like to be at about half of them (a physical impossibility, until I master self-multiplexing). Still, these are the ones that I'll move heaven and earth to be at. What about you - what are your favorites?

Extending Java Applications with Jython
I'm hopeful that this can really move Jython from my "stuff I think is cool" box to my "stuff I use every day" box.
IronPython Tooling
This is going to cover development environments and tools for debugging and profiling... pretty much a necessity in the .NET world. I also hope to use the video of this talk in the future in talking to the hordes of programmers around here who live and breathe Visual Studio.
Python in the Browser
Silverlight is way too cool to leave to the C# kids.
Think Globally, Hack Locally - Teaching Python in Your Community
As a local group-leader type geek, I'd love to start some of these Hack Nights.
Dude, Where's My Database?
There were so many proposals for descriptions of non-relational databases - but this one really stands out because it looks at the huge picture, classifying databases by their broad category and highlighting what makes each category beneficial for particular purposes.
Sprox: data driven web development
I confess - I've fallen behind the TurboGears world lately. Nobody's demanded a dynamic web app of me for a while, and TG has moved too fast for me to keep track of it. When last I was involved, Sprox was just emerging. I hope this talk will help me catch up.
Revisioned Databases for MultiUser Editing
Revisioned databases are an interesting concept, and seeing how one was actually developed should warm my datageek heart.
Easy command-line applications with cmd and cmd2
Interactive command-line interfaces were good enough for ZORK, and they're good enough for you! cmd and cmd2 make them crazy-easy. (I'll get in trouble if I don't go to this one, since I'm the speaker.)
Dealing with unsightly data in the real world
Gathering data from disparate, chaotic sources is a big part of pretty much everybody's life. I'm eager for any new insights.
An Underwater Python: Tortuga the Python Powered Robot
because, deep down inside, people everywhere are the same; we all want to be loved, and Python-powered robot submarines.

November 20, 2009 03:44 AM


Imaginary Landscape

Permission Based File Serving

One issue I've run into a couple times while working with Django is the need to serve files to users based on permissions. The first situation occurred with a store we were building that would allow for electronic versions of books to be sold. These books would typically be ...

November 20, 2009 12:38 AM

November 19, 2009


Alex Gaynor

Another Pair of Unladen Swallow Optimizations

Today a patch of mine was committed to Unladen Swallow. In the past weeks I've described some of the optimizations that have gone into Unladen Swallow; specifically, I looked at removing the allocation of an argument tuple for C functions. One of the "on the horizon" things I mentioned was extending this to functions with a variable arity (that is, the number of arguments they take can change). This has been implemented for functions that take a finite range of argument numbers (that is, they don't take *args, they just have a few arguments with defaults). This support was used to optimize a number of builtin functions (dict.get, list.pop, getattr for example).

However, there were still a number of functions that weren't updated for this support. I initially started porting any functions I saw, but it wasn't a totally mechanical translation so I decided to do a little profiling to better direct my efforts. I started by using the cProfile module to see what functions were called most frequently in Unladen Swallow's Django template benchmark. Imagine my surprise when I saw that unicode.encode was called over 300,000 times! A quick look at that function showed that it was a perfect contender for this optimization: it was currently designated as a METH_VARARGS, but in fact its argument count was a finite range. After about a dozen lines of code to change the argument parsing, I ran the benchmark again, comparing it to a control version of Unladen Swallow, and it showed a consistent 3-6% speedup on the Django benchmark. Not bad for 30 minutes of work.
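For readers who haven't used cProfile to count calls this way, here is a small sketch in Python 3. It is not the Django template benchmark; render() is a made-up workload that simply encodes a string many times, and call_counts() is a hypothetical helper that tallies the call counts recorded by the profiler.

```python
import cProfile
import pstats


def render(n):
    """Stand-in workload: encode a string n times."""
    text = "caf\u00e9"
    out = []
    for _ in range(n):
        out.append(text.encode("utf-8"))
    return out


def call_counts(func, *args):
    """Profile func(*args) and return {function name: number of calls}."""
    profiler = cProfile.Profile()
    profiler.runcall(func, *args)
    stats = pstats.Stats(profiler)
    counts = {}
    # Each entry maps (file, line, name) to (primitive calls, total calls,
    # own time, cumulative time, callers).
    for (_filename, _lineno, name), (_cc, ncalls, _tt, _ct, _callers) \
            in stats.stats.items():
        counts[name] = counts.get(name, 0) + ncalls
    return counts


counts = call_counts(render, 300)
for name, ncalls in sorted(counts.items(), key=lambda item: item[1],
                           reverse=True)[:3]:
    print(ncalls, name)
```

Built-in methods such as str.encode show up in the table alongside Python functions, which is what makes this a quick way to spot heavily used C functions.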

Another optimization I want to look at, which hasn't landed yet, is one to optimize various operations. Right now Unladen Swallow tracks various data about the types seen in the interpreter loop; however, for various operators this data isn't actually used. What this patch does is check at JIT compilation time whether the operator site is monomorphic (that is, there is only one pair of types ever seen there), and if it is, and it is one of a few pairings that we have optimizations for (int + int, list[int], float - float for example), then optimized code is emitted. This optimized code checks that the types of both arguments are the expected ones; if they are, the optimized code is executed, otherwise the VM bails back to the interpreter (various literature has shown that a single compiled optimized path is better than compiling both the fast and slow paths). For simple algorithmic code this optimization can show huge improvements.
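The guard-then-bail pattern is easier to see in miniature. What follows is a pure Python toy for Python 3, not Unladen Swallow's implementation (which emits machine code through LLVM); specialize() and its counters are invented for illustration, and the "bailout" here is just a call to the generic operator.

```python
import operator


def specialize(generic_op, fast_op, left_type, right_type):
    """Build a guarded operator for one expected pair of operand types."""
    counters = {"fast": 0, "bailout": 0}

    def guarded(left, right):
        # Guard: exact type checks, mirroring a monomorphic operator site.
        if type(left) is left_type and type(right) is right_type:
            counters["fast"] += 1
            return fast_op(left, right)
        # Guard failed: fall back to the fully generic path.
        counters["bailout"] += 1
        return generic_op(left, right)

    return guarded, counters


add_ints, counters = specialize(operator.add, int.__add__, int, int)
print(add_ints(2, 3))        # takes the specialized path
print(add_ints("a", "b"))    # guard fails, generic path handles it
print(counters)
```

In a real JIT the win comes from the fast path being straight-line machine code with no dynamic dispatch; in this toy both paths cost about the same, so it shows only the control flow.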

The PyPy project has recently blogged about the results of some benchmarks from the Computer Language Shootout run on PyPy, Unladen Swallow, and CPython. In these benchmarks Unladen Swallow showed that for highly algorithmic code (read: mathy) it could use some work; hopefully patches like this can help improve the situation markedly. Once this patch lands I'm going to rerun these benchmarks to see how Unladen Swallow improves. I'm also going to add in some of the more macro benchmarks Unladen Swallow uses to see how it compares with PyPy in those. Either way, seeing the tremendous improvements PyPy and Unladen Swallow have over CPython gives me tremendous hope for the future.

November 19, 2009 11:51 PM


Menno's Musings

Setting PYTHON_EGG_CACHE when deploying Python apps using FastCGI

I recently sorted out an issue with the IMAPClient Trac instance that's been bugging me for a while.

The problem was that whenever the web server logs were rotated, logrotate would restart Lighttpd. The web server restart would in turn restart the Trac (FastCGI) processes. Unfortunately, the Trac processes would fail to start with the following error:

pkg_resources.ExtractionError: Can't extract file(s) to egg cache

The following error occurred while trying to extract file(s) to the Python egg
cache:

  [Errno 13] Permission denied: '/root/.python-eggs'

The Python egg cache directory is currently set to:

  /root/.python-eggs

Bang, no IMAPClient web site (the rest of the site was ok). To band-aid the problem when it happened (and I noticed!) I would issue a sudo /etc/init.d/lighttpd restart and everything would be fine again.

After some investigation I found that running /etc/init.d/lighttpd restart as root always triggered the problem, whereas restarting using sudo always worked. My guess is that restarting when logged in as root was leaving $HOME at /root even after Lighttpd had dropped to its unprivileged user account. The unprivileged user isn't allowed to write to /root, so Trac blows up. setuptools seems to use $HOME instead of looking up the actual home directory of the current user.

The fix for me was to set the PYTHON_EGG_CACHE environment variable for the FastCGI processes to somewhere they are allowed to write to. This is done with the bin-environment option if you're using Lighttpd like me.
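If changing the web server configuration isn't an option, it should also be possible to set the variable from inside the FastCGI launcher script, before anything triggers egg extraction. This is an alternative to the bin-environment approach, not what I did here, and it assumes setuptools reads PYTHON_EGG_CACHE from the process environment at extraction time. The helper name and the directory below are made up for the example (Python 3 syntax).

```python
import os
import tempfile


def use_egg_cache(directory):
    """Create the directory if needed and point PYTHON_EGG_CACHE at it."""
    os.makedirs(directory, exist_ok=True)
    os.environ["PYTHON_EGG_CACHE"] = directory
    return directory


# Hypothetical location: in a real deployment, pick a directory owned by the
# unprivileged user that the FastCGI processes run as.
cache_dir = use_egg_cache(os.path.join(tempfile.gettempdir(), "example-egg-cache"))
```

The call has to happen before the application is imported, since that is when eggs would be unpacked.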

I imagine similar problems can happen with any Python app deployed using FastCGI.

November 19, 2009 05:48 PM


BioPython News

Introducing (and expanding) the Biopython Cookbook


Hi all,

You may have noticed we’re trying out using the wiki for Biopython cookbook entries. It’s a new idea, so at the moment there are only a few ‘recipes’ on offer. If you have some tricks you find yourself using time and again to solve a problem, why not share them? Similarly, if you find yourself coming up against a problem you can’t seem to solve easily with Biopython’s tools, send a message to one of the mailing lists proposing it as a cookbook example and someone just might solve it for you!

There are also several short examples in the main “Biopython Tutorial and Cookbook” (pdf version) which might be worth copying/moving to the wiki. What would you pick from here?

Feedback from Biopython newcomers would be especially valuable! :)

November 19, 2009 04:30 PM

Dropping Python 2.3 Support


As announced here, any last minute requests to postpone dropping support for Python 2.3 from the next release of Biopython must be posted to the main Biopython mailing list no later than Friday, May 8.

November 19, 2009 04:30 PM

Working with FASTQ files in Biopython when speed matters


Biopython 1.51 onward includes support for Sanger, Solexa and Illumina 1.3+ FASTQ files in Bio.SeqIO, which allows a lot of neat tricks very concisely. For example, the tutorial (PDF) has examples finding and removing primer or adaptor sequences.

However, because the Bio.SeqIO interface revolves around SeqRecord objects there is often a speed penalty. For example for FASTQ files, the quality string gets turned into a list of integers on parsing, and then re-encoded back to ASCII on writing.
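To make that round trip concrete, here is a rough sketch of the decoding and re-encoding for the Sanger variant only, where each character stores a PHRED score as chr(score + 33). The function names are invented for this example and are not Biopython's; the Solexa and Illumina 1.3+ variants use an offset of 64 (and Solexa a different score scale), which this sketch does not cover.

```python
SANGER_OFFSET = 33  # Sanger FASTQ stores each PHRED score as chr(score + 33)


def decode_quality(qual_string, offset=SANGER_OFFSET):
    """Turn an ASCII quality string into a list of integer scores."""
    return [ord(char) - offset for char in qual_string]


def encode_quality(scores, offset=SANGER_OFFSET):
    """Turn a list of integer scores back into an ASCII quality string."""
    return "".join(chr(score + offset) for score in scores)


scores = decode_quality("!5I")
round_trip = encode_quality(scores)
```

Doing this for every base of every read is the work that a string-only approach avoids.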

The new Bio.SeqIO.convert(…) function in Biopython 1.52 onwards makes converting from FASTQ to FASTA, or between the FASTQ variants about five times faster. It can do this because it doesn’t bother with creating any objects – it just uses Python strings.

You can use the same approach in your own scripts. For example, suppose you have a Solexa FASTQ file where you want to trim all the reads, taking just the first 21 bases (say). Why might you want to do this? Well, in Solexa/Illumina there is a general decline in read quality along the sequence, so it can make sense to trim, and some algorithms like to have all the input reads the same length. Here is how I would write this using the standard Bio.SeqIO functions:

from Bio import SeqIO
records = (rec[:21] for rec in SeqIO.parse(open("untrimmed.fastq"), "fastq-solexa"))
handle = open("trimmed21.fastq", "w")
count = SeqIO.write(records, handle, "fastq-solexa")
handle.close()
print "Trimmed %i FASTQ records" % count

This works, and is very simple and general. The same template can be used on any file formats supported by Bio.SeqIO. However, it might be a bit slow for large next generation sequence files.

Instead, we can get a little more low level – and work directly with strings. This requires you to know more about the details of the FASTQ file format. Parsing FASTQ files is surprisingly complicated (with nasty things like line wrapping technically allowed), so we’ll still get Biopython to do that bit – but not bother with constructing SeqRecord objects and decoding the FASTQ quality strings. On the other hand, doing the FASTQ output explicitly isn’t actually too bad once you know how things work:

from Bio.SeqIO.QualityIO import FastqGeneralIterator
trim = 21
handle = open("trimmed21.fastq", "w")
for title, seq, qual in FastqGeneralIterator(open("untrimmed.fastq")) :
    handle.write("@%s\n%s\n+\n%s\n" % (title, seq[:trim], qual[:trim]))
handle.close()

Again, the solution is a very short script – but this time it is much less flexible, and not nearly as clear what is going on. On the bright side, it is many times faster. Deciding on this trade-off is down to you, but I hope this blog post has highlighted the potential usefulness of the FastqGeneralIterator function in Bio.SeqIO.QualityIO, which you might otherwise have overlooked. To find out more, please read the built in documentation (also available online):

>>> from Bio.SeqIO.QualityIO import FastqGeneralIterator
>>> help(FastqGeneralIterator)
...

Please sign up to the Biopython mailing list if you want to discuss this topic further.

Peter

November 19, 2009 04:30 PM

Biopython 1.51 beta released


A beta release for Biopython 1.51 is now available for download and testing.

In the two months since Biopython 1.50 was released, we have introduced support for writing features in GenBank files using Bio.SeqIO, extended SeqIO’s support for the FASTQ format to include files created by Illumina 1.3+, added a new set of application wrappers for alignment programs, and made numerous tweaks and bug fixes.

All the new features have been tested by the dev team, but it’s possible there are cases that we haven’t been able to foresee and test, especially for the GenBank feature writer (as there are just so many possible odd fuzzy feature locations).

Note that as previously announced, Biopython no longer supports Python 2.3, and our deprecated parsing infrastructure (Martel and Bio.Mindy) has been removed.

Source distributions and Windows installers are available from the downloads page on the Biopython website (biopython.org).

We are interested in getting feedback on the beta release as a whole, but especially on the new features and the Biopython Tutorial and Cookbook (PDF).

So, gather your courage, download the release, try it out and let us know what works and what doesn’t through the mailing lists (or bugzilla).

November 19, 2009 04:30 PM

Clever tricks with NCBI Entrez EInfo (& Biopython)


Constructing complicated NCBI Entrez searches can be tricky, but it turns out one of the Entrez Programming Utilities called Entrez EInfo can help.

For example, suppose you want to search for mitochondrial genomes from a given taxon – either just in the Entrez web interface, or for use in a script with ESearch (where you might also download them with EFetch).

I knew from past experience about using name[ORGN] in Entrez to search for an organism name – but how would you specify just mitochondria? I actually worked this out from the NCBI help and exploring the Entrez website’s advanced search – but it took a while.

There is an easier way to find out the search fields available in Entrez! Just recently I came across an interesting blog post from Neil Saunders (written a couple of weeks ago) showing how Entrez EInfo provides information about the search fields in XML format, and how you can use Ruby to process this.

Biopython can do this too of course – using Bio.Entrez this took just a few lines of Python:

>>> from Bio import Entrez
>>> data = Entrez.read(Entrez.einfo(db="genome"))
>>> for field in data["DbInfo"]["FieldList"] :
...     print "%(Name)s, %(FullName)s, %(Description)s" % field
...
ALL, All Fields, All terms from all searchable fields
UID, UID, Unique number assigned to each sequence
FILT, Filter, Limits the records
WORD, Text Word, Free text associated with record
TITL, Title, Words in definition line
KYWD, Keyword, Nonstandardized terms provided by submitter
AUTH, Author, Author(s) of publication
JOUR, Journal, Journal abbreviation of publication
VOL, Volume, Volume number of publication
ISS, Issue, Issue number of publication
PAGE, Page Number, Page number(s) of publication
ORGN, Organism, Scientific and common names of organism, and all higher levels of taxonomy
ACCN, Accession, Accession number of sequence
PACC, Primary Accession, Does not include retired secondary accessions
GENE, Gene Name, Name of gene associated with sequence
PROT, Protein Name, Name of protein associated with sequence
ECNO, EC/RN Number, EC number for enzyme or CAS registry number
PDAT, Publication Date, Date sequence added to GenBank
MDAT, Modification Date, Date of last update
SUBS, Substance Name, CAS chemical name or MEDLINE Substance Name
PROP, Properties, Classification by source qualifiers and molecule type
SQID, SeqID String, String identifier for sequence
GPRJ, Genome Project, Genome Project
SLEN, Sequence Length, Length of sequence
FKEY, Feature key, Feature annotated on sequence
RTYP, Replicon type, Replicon type
RNAM, Replicon name, Replicon name
ORGL, Organelle, Organelle

That gives us a list of all the fields we can currently search on in the Genome database (and you could use the same code for any of the other NCBI databases in Entrez – they probably all have different searchable fields). Very handy! The ORGN and ORGL fields are discussed below.

So for my particular search, using “ORGL” to filter on organelle looks sensible, and after a bit of trial and error on the website I ended up with mitochondrion[ORGL] as a useful filter (not mitochondrial, or mitochondria).

I already knew about using “ORGN” to filter on the organism, either by species name or with a suitably formatted NCBI taxon ID (which you can get by searching or browsing the Entrez taxonomy database), e.g. txid9443[ORGN] gives primates.

Putting these together, to get all the primate mitochondria in the Entrez genome database you could use:

txid9443[ORGN] AND mitochondrion[ORGL]

Note that you have to use “AND” in upper case.
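If you build such queries in a script, a tiny helper keeps the field tags and the upper case “AND” consistent. This is just a sketch; build_entrez_term is a made-up name, not part of Bio.Entrez.

```python
def build_entrez_term(filters):
    """Join (value, field) pairs into one Entrez query using upper case AND."""
    return " AND ".join("%s[%s]" % (value, field) for value, field in filters)


term = build_entrez_term([("txid9443", "ORGN"), ("mitochondrion", "ORGL")])
```

The resulting string can then be passed as the term argument to Bio.Entrez.esearch along with the database name.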

I think we’ll have to add something along these lines to the Biopython Tutorial and Cookbook (PDF)… Update: That’s done now and will be included with our next release :)

Entrez rocks! (although their documentation could use a few more examples).

Peter

November 19, 2009 04:30 PM

Simpler, optimized format conversion with Biopython


As per Peter’s recent post we are using this space to show off a couple of the new features in Biopython 1.52 before it is released. In this post we’ll look at the new convert() function that both Bio.SeqIO and Bio.AlignIO will get in Biopython 1.52.

No one has ever complained that bioinformatics just doesn’t have enough file formats – you probably frequently find yourself converting sequence files to suit particular applications with Bio.SeqIO. At the moment this is usually a two step process, something like this:

>>> records = SeqIO.parse(in_handle, "genbank")
>>> SeqIO.write(records, out_handle, "fasta")

As of Biopython 1.52, you’ll be able to achieve the same result in a single step:

>>> SeqIO.convert(in_handle, "genbank", out_handle, "fasta")

In fact, it’s even easier than that because the convert function will accept filename strings as well as file handles for both input and output.

Adding the convert function to Bio.SeqIO and Bio.AlignIO will make your scripts more readable and might even save you a couple of lines of code, but perhaps more importantly it allows for the conversion process to be optimized for the two formats being used. In most cases the same old parse and write functions are called internally, however some key conversions are much faster…

In the above example we are moving from a GenBank file, which probably includes multiple features for each sequence, to a FASTA file, which doesn’t include features. If we used the two step process above we’d be spending time reading each sequence’s features into memory just to skip them when they get passed to the write function. Bio.SeqIO.convert() knows that the sequences in the input file are destined to be written to a FASTA file so it can skip over the features and save a bit of time in doing the conversion.

Obviously, any optimization is most important when it’s used on very large files, like those produced in next generation sequencing projects. For example, when converting between the FASTQ file format variants with the “SeqIO two step”, a significant amount of time is taken creating SeqRecord objects for each record in the input file and decoding the quality scores into numbers. However, none of the attributes or methods of the SeqRecord object are required to do the conversion. For this reason Bio.SeqIO.convert() deals with each record as simple strings. This makes it about five times faster – on a par with converters written in C.
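The string-only idea is easy to picture for the FASTQ to FASTA case. The sketch below is not Biopython's implementation: fastq_to_fasta is a name made up for this example, and it assumes simple four-line FASTQ records with no line wrapping, which real files are not guaranteed to follow (Python 3 syntax).

```python
import io


def fastq_to_fasta(in_handle, out_handle):
    """Copy title and sequence lines only; quality lines are never decoded.

    Assumes four-line FASTQ records without line wrapping.
    """
    count = 0
    while True:
        title = in_handle.readline()
        if not title:
            break  # end of file
        seq = in_handle.readline()
        in_handle.readline()  # '+' separator line, ignored
        in_handle.readline()  # quality string, skipped without decoding
        out_handle.write(">%s\n%s\n" % (title[1:].rstrip(), seq.rstrip()))
        count += 1
    return count


fastq = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\n!!!!\n")
fasta = io.StringIO()
converted = fastq_to_fasta(fastq, fasta)
```

No objects are built per record and no quality score is ever turned into a number, which is where the time saving comes from.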

The Biopython 1.52 edition of the Biopython Tutorial & Cookbook (PDF) will cover these new convert functions in more detail.

November 19, 2009 04:29 PM

Indexing sequence files with Biopython


The forthcoming release of Biopython 1.52 will include a couple of nice improvements to the Bio.SeqIO module, and here we’re going to introduce the new index function. This will of course be covered in the Biopython Tutorial & Cookbook (PDF) once this code is released.

Suppose you have a large sequence file with many many individual sequences in it. This could be next generation sequence data for example, maybe a FASTQ, FASTA or QUAL file. Or, it might be a big annotation rich file, such as the whole of UniProt, or a chunk of GenBank.

The Bio.SeqIO.parse(…) function lets you iterate over all the records in a file, one by one. This allows you to process each sequence in turn, keeping only one in memory at a time. This approach is very valuable for dealing with big files.

However, sometimes you can’t just loop over the records in the order found in the file. You may require random access. In Python, the natural API here is a dictionary – for example looking up the records via their ID string. This is how the existing Bio.SeqIO.to_dict(…) function works. It is just a helper function to build an in memory dictionary of a collection of SeqRecord objects. However, because everything is kept in memory (RAM), this only works on small or medium sized files. :(

So, what should you do when you have a very large file, and it is no longer possible to load everything into memory at once? Well, you might consider using a BioSQL database – this is probably quite a sensible option for something like GenBank data. However, for next generation sequencing data this is probably overkill. This is where the new Bio.SeqIO.index(…) function comes in. It behaves like a Python dictionary, but doesn’t keep everything in memory at once!

What Bio.SeqIO.index(…) does is first quickly scan the file looking for the position of each new record, and records this information against the record ID string. This still requires a reasonable amount of memory, but works fine even for millions of entries in a file. Then, when you ask for a particular record, it jumps to the relevant part of the file, and then parses the data into a SeqRecord.
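The scan-then-seek idea can be sketched in plain Python for a FASTA file. This is only an illustration of the technique, not the code inside Bio.SeqIO.index: build_offset_index and fetch_record are names invented here, the demo file is made up, and the "record" returned is raw text rather than a SeqRecord (Python 3 syntax).

```python
import os
import tempfile


def build_offset_index(path):
    """Map each FASTA record ID to the byte offset of its '>' line."""
    index = {}
    with open(path, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                # The ID is the first word after the '>' marker.
                record_id = line[1:].split(None, 1)[0].decode("ascii")
                index[record_id] = offset
    return index


def fetch_record(path, index, record_id):
    """Seek straight to one record and return its raw text."""
    with open(path, "rb") as handle:
        handle.seek(index[record_id])
        lines = [handle.readline()]
        for line in iter(handle.readline, b""):
            if line.startswith(b">"):
                break  # start of the next record
            lines.append(line)
    return b"".join(lines).decode("ascii")


# Small made-up file so the sketch can be run as-is.
demo = b">alpha first\nACGT\nTTGA\n>beta\nGGCC\n>gamma\nAAAA\n"
with tempfile.NamedTemporaryFile(suffix=".fasta", delete=False) as tmp:
    tmp.write(demo)
    demo_path = tmp.name
demo_index = build_offset_index(demo_path)
beta_text = fetch_record(demo_path, demo_index, "beta")
os.remove(demo_path)
```

Only the IDs and integer offsets stay in memory, which is why this kind of index scales to files with millions of records.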

For example, consider the FASTQ file SRR014849.fastq (available compressed at the NCBI). The Bio.SeqIO.index(…) function takes a minimum of two arguments, the filename and the file format:


>>> from Bio import SeqIO
>>> data = SeqIO.index("SRR014849.fastq", "fastq")
>>> len(data)
94696
>>> data.keys()[:3]
['SRR014849.80961', 'SRR014849.80960', 'SRR014849.80963']
>>> print data["SRR014849.290838"].seq
TGAGAATTTTTATTTTCAAGGGTTGGAACCGAAGGGTTTGAATTCAAACCCTTTCGGTTCCAACCCGACAAGTCATCGATGTTG
>>> print data["SRR014849.290838"].letter_annotations["phred_quality"]
[20, 18, 24, 25, 33, 26, 37, 33, 22, 11, 1, 22, 37, 33, 22, 10, 25, 31, 26, 36, 32, 16, 33, 26, 31, 25, 33, 25, 31, 25, 22, 33, 26, 35, 30, 12, 35, 31, 17, 19, 30, 20, 33, 26, 27, 31, 27, 6, 36, 32, 15, 33, 29, 10, 27, 32, 24, 30, 25, 30, 20, 24, 13, 35, 31, 13, 27, 26, 28, 32, 24, 28, 28, 27, 22, 22, 26, 28, 26, 24, 27, 30, 22, 27]

In this case the FASTQ file is only 24MB (so the example is easy to try at home), but the same approach works just as well on a 1GB FASTQ file :)

This will be covered in more detail by the new edition of the Biopython Tutorial & Cookbook, and once you have Biopython 1.52 or later installed don’t forget about the built in help:

>>> from Bio import SeqIO
>>> help(SeqIO.index)
...

Biopython 1.52 will index most of the file formats that can already be parsed with the SeqIO library – but not any multiple alignment formats. There is a table on the Bio.SeqIO wiki page.

Enjoy!

Peter

P.S. Note that random access to all the records in a sequence file isn’t the only potential benefit of using Bio.SeqIO.index(…). Suppose you are only interested in a few entries in a large file. You could use a for loop with Bio.SeqIO.parse(…), but this will mean every single record gets parsed and turned into a SeqRecord. If instead you used Bio.SeqIO.index(…) then only the few records you care about get parsed and turned into SeqRecord objects. This can save a lot of time, especially for an annotation rich format like SwissProt or GenBank.

November 19, 2009 04:29 PM

Biopython 1.51 released


We are pleased to announce the release of Biopython 1.51. This new stable release enhances version 1.50 (released in April) by extending the functionality of existing modules, adding a set of application wrappers for popular alignment programs and fixing a number of minor bugs.

In particular, the SeqIO module can now write GenBank files that include features, and deal with FASTQ files created by Illumina 1.3+. Support for this format allows interconversion between FASTQ files using the Solexa, Sanger and Illumina variants, following conventions agreed upon with the BioPerl and EMBOSS projects.

Biopython 1.51 is the first stable release to include the Align.Applications module which allows users to define command line wrappers for popular alignment programs including ClustalW, Muscle and T-Coffee.

Bio.Fasta and the application tools ApplicationResult and generic_run() have been marked as deprecated – Bio.Fasta has been superseded by SeqIO’s support for the FASTA format, and we provide documentation for using the subprocess module from the Python Standard Library as a more flexible approach to calling applications.

As always the Tutorial and Cookbook has been updated to document all the changes.

Thank you to everyone who tested our 1.51 beta or submitted bugs since our last stable release, and to all our contributors.

Sources and Windows Installer are available from the downloads page.

November 19, 2009 04:29 PM