
Planet Python

Last update: August 20, 2014 07:47 PM

August 20, 2014


Omaha Python Users Group

August 20 Meeting Details

Location - a conference room at Gordmans in Aksarben thanks to Aaron Keck.

Meeting starts at 7pm, Wednesday, 8/20/14

Call 402-651-5215 if you have last minute communications.

Parking and entry details:

The building is the northwest corner of 67th and Frances and the Gordmans entrance is next to the “g” sign, about midway along the building.  There’s parking directly out front, but it sometimes fills up in the evenings.  The garage around back is open to the public after 5:30 or 6 as well.

The building doors lock at 5, so Aaron will be standing by to badge people in starting around 6:45.  If you're running late, or early, just shoot him an email and he can meet you.

Agenda:
- Interesting Python tips and tricks we have discovered recently
- Bring your questions/problems you need help solving
- Scheduling topics and discussions for the next few meetings.

August 20, 2014 01:40 PM


Leonardo Giordani

Python 3 OOP Part 3 - Delegation: composition and inheritance

Previous post

Python 3 OOP Part 2 - Classes and members

The Delegation Run

If classes are objects what is the difference between types and instances?

When I talk about "my cat" I am referring to a concrete instance of the "cat" concept, which is a subtype of "animal". So, despite being both objects, while types can be specialized, instances cannot.

Usually an object B is said to be a specialization of an object A when:

- B has all the features of A
- B can provide new features
- B can perform some or all the tasks of A in a different way

These requirements are very general and valid for any system, and the key to achieving them with the maximum reuse of already existing components is delegation. Delegation means that an object shall perform only what it knows best, and leave the rest to other objects.

Delegation can be implemented with two different mechanisms: composition and inheritance. Sadly, very often only inheritance is listed among the pillars of OOP techniques, forgetting that it is an implementation of the more generic and fundamental mechanism of delegation; perhaps a better nomenclature for the two techniques could be explicit delegation (composition) and implicit delegation (inheritance).

Please note that, again, when talking about composition and inheritance we are focusing on a behavioural or structural form of delegation. One way to think about the difference between the two is to consider whether the object knows who can satisfy your request (composition) or whether the object is itself the one that satisfies the request (inheritance).

Please, please, please do not forget composition: in many cases, composition can lead to simpler systems, with benefits on maintainability and changeability.

Usually composition is said to be a very generic technique that needs no special syntax, while inheritance and its rules are strongly dependent on the language of choice. Actually, the strong dynamic nature of Python softens the boundary line between the two techniques.

Inheritance Now

In Python a class can be declared as an extension of one or more different classes, through the class inheritance mechanism. The child class (the one that inherits) has the same internal structure as the parent class (the one that is inherited), and in the case of multiple inheritance the language has very specific rules to manage possible conflicts or redefinitions among the parent classes. A very simple example of inheritance is

``` python
class SecurityDoor(Door):
    pass
```

where we declare a new class SecurityDoor that, at the moment, is a perfect copy of the Door class. Let us investigate what happens when we access attributes and methods. First we instantiate the class

``` python
>>> sdoor = SecurityDoor(1, 'closed')
```

The first thing we can check is that class attributes are still global and shared

``` python
>>> SecurityDoor.colour is Door.colour
True
>>> sdoor.colour is Door.colour
True
```

This shows us that Python tries to resolve instance members not only by looking into the class the instance comes from, but also by investigating the parent classes. In this case sdoor.colour becomes SecurityDoor.colour, which in turn becomes Door.colour. SecurityDoor is a Door.

If we investigate the content of __dict__ we can catch a glimpse of the inheritance mechanism in action

``` python
>>> sdoor.__dict__
{'number': 1, 'status': 'closed'}
>>> sdoor.__class__.__dict__
mappingproxy({'__doc__': None, '__module__': '__main__'})
>>> Door.__dict__
mappingproxy({'__dict__': <attribute '__dict__' of 'Door' objects>,
    'colour': 'yellow',
    'open': <function Door.open at 0xb687e224>,
    '__init__': <function Door.__init__ at 0xb687e14c>,
    '__doc__': None,
    'close': <function Door.close at 0xb687e1dc>,
    'knock': <classmethod object at 0xb67ff6ac>,
    '__weakref__': <attribute '__weakref__' of 'Door' objects>,
    '__module__': '__main__',
    'paint': <classmethod object at 0xb67ff6ec>})
```

As you can see, the content of __dict__ for SecurityDoor is very narrow compared to that of Door. The inheritance mechanism takes care of the missing elements by climbing up the class tree. Where does Python get the parent classes? A class always contains a __bases__ tuple that lists them

``` python
>>> SecurityDoor.__bases__
(<class '__main__.Door'>,)
```

So an example of what Python does to resolve a class method call through the inheritance tree is

``` python
>>> sdoor.__class__.__bases__[0].__dict__['knock'].__get__(sdoor)
<bound method type.knock of <class '__main__.SecurityDoor'>>
>>> sdoor.knock
<bound method type.knock of <class '__main__.SecurityDoor'>>
```

Please note that this is just an example that does not consider multiple inheritance.

Let us try now to override some methods and attributes. In Python you can override (redefine) a parent class member simply by redefining it in the child class.

``` python
class SecurityDoor(Door):
    colour = 'gray'
    locked = True

    def open(self):
        if not self.locked:
            self.status = 'open'
```

As you might expect, the overridden members now are present in the __dict__ of the SecurityDoor class

``` python
>>> SecurityDoor.__dict__
mappingproxy({'__doc__': None,
    '__module__': '__main__',
    'open': <function SecurityDoor.open at 0xb6fcf89c>,
    'colour': 'gray',
    'locked': True})
```

So when you override a member, the one you put in the child class is used instead of the one in the parent class simply because the former is found before the latter while climbing the class hierarchy. This also shows you that Python does not implicitly call the parent implementation when you override a method. So, overriding is a way to block implicit delegation.

If we want to call the parent implementation we have to do it explicitly. In the previous example we could write

``` python
class SecurityDoor(Door):
    colour = 'gray'
    locked = True

    def open(self):
        if self.locked:
            return
        Door.open(self)
```

You can easily test that this implementation is working correctly.

``` python
>>> sdoor = SecurityDoor(1, 'closed')
>>> sdoor.status
'closed'
>>> sdoor.open()
>>> sdoor.status
'closed'
>>> sdoor.locked = False
>>> sdoor.open()
>>> sdoor.status
'open'
```

This form of explicit parent delegation is heavily discouraged, however.

The first reason is the very high coupling that results from explicitly naming the parent class again when calling the method. Coupling, in computer science lingo, means to link two parts of a system so that changes in one of them directly affect the other, and it is usually avoided as much as possible. In this case, if you decide to use a new parent class you have to manually propagate the change to every method that calls the old one. Moreover, since in Python the class hierarchy can be changed dynamically (i.e. at runtime), this form of explicit delegation could be not only annoying but also wrong.

The second reason is that in general you need to deal with multiple inheritance, where you do not know a priori which parent class implements the original form of the method you are overriding.

To solve these issues, Python supplies the super() built-in function, which climbs the class hierarchy and returns an object through which the correct parent implementation can be reached. The syntax for calling super() is

``` python
class SecurityDoor(Door):
    colour = 'gray'
    locked = True

    def open(self):
        if self.locked:
            return
        super().open()
```

The output of super() is not exactly the Door class. It returns a super object whose representation is <super: <class 'SecurityDoor'>, <SecurityDoor object>>. This object however acts like the parent class, so you can safely ignore its custom nature and use it just like you would use the Door class in this case.
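
We can look at the proxy directly. Here is a short sketch session, consistent with the classes above (not from the original post; the memory address will vary):

``` python
>>> sdoor = SecurityDoor(1, 'closed')
>>> super(SecurityDoor, sdoor)
<super: <class 'SecurityDoor'>, <SecurityDoor object>>
>>> super(SecurityDoor, sdoor).open
<bound method Door.open of <__main__.SecurityDoor object at 0xb67e162c>>
```

The two-argument form used here is equivalent to what the zero-argument super() resolves to inside a method of SecurityDoor.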

Enter the Composition

Composition means that an object knows another object, and explicitly delegates some tasks to it. While inheritance is implicit, composition is explicit: in Python, however, things are far more interesting than this =).

First of all let us implement classic composition, which simply makes an object part of the other as an attribute

``` python
class SecurityDoor:
    colour = 'gray'
    locked = True

    def __init__(self, number, status):
        self.door = Door(number, status)

    def open(self):
        if self.locked:
            return
        self.door.open()

    def close(self):
        self.door.close()
```

The primary goal of composition is to relax the coupling between objects. This little example shows that a SecurityDoor is now an object in its own right and no longer a Door, which means that the internal structure of Door is not copied. For this very simple example both Door and SecurityDoor are not big classes, but in a real system objects can be very complex; this means that their allocation consumes a lot of memory, and if a system contains thousands or millions of objects that could be an issue.

The composed SecurityDoor has to redefine the colour attribute since the concept of delegation applies only to methods and not to attributes, doesn't it?

Well, no. Python provides a very high degree of indirection for object manipulation, and attribute access is one of the most useful. As you already discovered, accessing attributes is ruled by a special method called __getattribute__() that is called whenever an attribute of the object is accessed. Overriding __getattribute__(), however, is overkill; it is a very complex method and, being called on every attribute access, any change makes the whole thing slower.

The method we have to leverage to delegate attribute access is __getattr__(), which is a special method that is called whenever the requested attribute is not found in the object. So basically it is the right place to dispatch all attribute and method access our object cannot handle. The previous example becomes

``` python
class SecurityDoor:
    locked = True

    def __init__(self, number, status):
        self.door = Door(number, status)

    def open(self):
        if self.locked:
            return
        self.door.open()

    def __getattr__(self, attr):
        return getattr(self.door, attr)
```

Using __getattr__() blurs the line between inheritance and composition, since after all the former is a form of automatic delegation of every member access.

``` python
class ComposedDoor:
    def __init__(self, number, status):
        self.door = Door(number, status)

    def __getattr__(self, attr):
        return getattr(self.door, attr)
```

As this last example shows, delegating every member access through __getattr__() is very simple. Pay attention to getattr(), which is different from __getattr__(). The former is a built-in that is equivalent to the dotted syntax, i.e. getattr(obj, 'someattr') is the same as obj.someattr, but you have to use it here since the name of the attribute is contained in a string.

Composition provides a superior way to manage delegation since it can selectively delegate the access, and even mask some attributes or methods, while inheritance cannot. In Python you also avoid the memory problems that might arise when you put many objects inside another: Python handles everything through references, i.e. through pointers to the memory position of the thing, so the size of an attribute is constant and very limited.
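
For instance, here is a sketch of such masking (my own illustration, not from the original post; the SealedDoor name is hypothetical): a composed class that delegates everything to its inner Door except open(), which it blocks entirely.

``` python
class SealedDoor:
    def __init__(self, number, status):
        self.door = Door(number, status)

    def open(self):
        # Mask the delegated method: a sealed door can never be opened
        raise AttributeError('SealedDoor cannot be opened')

    def __getattr__(self, attr):
        # Everything else (number, status, close, ...) is delegated
        return getattr(self.door, attr)
```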

Movie Trivia

Section titles come from the following movies: The Cannonball Run (1981), Apocalypse Now (1979), Enter the Dragon (1973).

Sources

You will find a lot of documentation in this Reddit post. Most of the information contained in this series comes from those sources.

Feedback

Feel free to use the blog Google+ page to comment on the post. The GitHub issues page is the best place to submit corrections.

August 20, 2014 01:00 PM


Python Anywhere

Slides for Giles Thomas' EuroPython talk now online

Our founder, Giles Thomas, gave a high-level introduction to our load-balancing system as a talk at this summer's EuroPython. There's a video up on PyVideo: An HTTP request's journey through a platform-as-a-service. And here are the slides [PDF].

August 20, 2014 12:21 PM


Leonardo Giordani

Python 3 OOP Part 2 - Classes and members

Previous post

Python 3 OOP Part 1 - Objects and types

Python Classes Strike Again

The Python implementation of classes has some peculiarities. The bare truth is that in Python the class of an object is an object itself. You can check this by issuing type() on the class

``` python
>>> a = 1
>>> type(a)
<class 'int'>
>>> type(int)
<class 'type'>
```

This shows that the int class is an object, an instance of the type class.

This concept is not as difficult to grasp as it may seem at first sight: in the real world we deal with concepts as if they were things: for example, we can talk about the concept of "door", telling people what a door looks like and how it works. In this case the concept of door is the topic of our discussion, so in our everyday experience the type of an object is an object itself. In Python this can be expressed by saying that everything is an object.

If the class of an object is itself an instance it is a concrete object and is stored somewhere in memory. Let us leverage the inspection capabilities of Python and its id() function to check the status of our objects. The id() built-in function returns the memory position of an object.

In the first post we defined this class

``` python
class Door:
    def __init__(self, number, status):
        self.number = number
        self.status = status

    def open(self):
        self.status = 'open'

    def close(self):
        self.status = 'closed'
```

First of all, let's create two instances of the Door class and check that the two objects are stored at different addresses

``` python
>>> door1 = Door(1, 'closed')
>>> door2 = Door(1, 'closed')
>>> hex(id(door1))
'0xb67e148c'
>>> hex(id(door2))
'0xb67e144c'
```

This confirms that the two instances are separate and unrelated. Please note that your values are very likely to be different from the ones I got. Being memory addresses, they change at every execution. The second instance was given the same attributes as the first instance to show that the two are different objects regardless of the values of the attributes.

However if we use id() on the class of the two instances we discover that the class is exactly the same

``` python
>>> hex(id(door1.__class__))
'0xb685f56c'
>>> hex(id(door2.__class__))
'0xb685f56c'
```

Well, this is very important. In Python, a class is not just the schema used to build an object. Rather, the class is a shared living object, whose code is accessed at run time.

As we already tested, however, attributes are not stored in the class but in every instance, due to the fact that __init__() works on self when creating them. Classes, however, can be given attributes like any other object; with a terrific effort of imagination, let's call them class attributes.

As you can expect, class attributes are shared among the class instances just like their container

``` python
class Door:
    colour = 'brown'

    def __init__(self, number, status):
        self.number = number
        self.status = status

    def open(self):
        self.status = 'open'

    def close(self):
        self.status = 'closed'
```

Pay attention: the colour attribute here is not created using self, so it is contained in the class and shared among instances

``` python
>>> door1 = Door(1, 'closed')
>>> door2 = Door(2, 'closed')
>>> Door.colour
'brown'
>>> door1.colour
'brown'
>>> door2.colour
'brown'
```

So far things are no different from the previous case. Let's see if changes to the shared value reflect on all instances

``` python
>>> Door.colour = 'white'
>>> Door.colour
'white'
>>> door1.colour
'white'
>>> door2.colour
'white'
>>> hex(id(Door.colour))
'0xb67e1500'
>>> hex(id(door1.colour))
'0xb67e1500'
>>> hex(id(door2.colour))
'0xb67e1500'
```

Raiders of the Lost Attribute

Any Python object is automatically given a __dict__ attribute, which contains its list of attributes. Let's investigate what this dictionary contains for our example objects:

``` python
>>> Door.__dict__
mappingproxy({'open': <function Door.open at 0xb68604ac>,
    'colour': 'brown',
    '__dict__': <attribute '__dict__' of 'Door' objects>,
    '__weakref__': <attribute '__weakref__' of 'Door' objects>,
    '__init__': <function Door.__init__ at 0xb7062854>,
    '__module__': '__main__',
    '__doc__': None,
    'close': <function Door.close at 0xb686041c>})
>>> door1.__dict__
{'number': 1, 'status': 'closed'}
```

Leaving aside the difference between a dictionary and a mappingproxy object, you can see that the colour attribute is listed among the Door class attributes, while status and number are listed for the instance.

How come we can access door1.colour, if that attribute is not listed for that instance? This is a job performed by the magic __getattribute__() method; in Python the dotted syntax automatically invokes this method, so when we write door1.colour, Python executes door1.__getattribute__('colour'). That method performs the attribute lookup, i.e. finds the value of the attribute by looking in different places.

The standard implementation of __getattribute__() first searches the internal dictionary (__dict__) of the object, then the type of the object itself; in this case door1.__getattribute__('colour') executes first door1.__dict__['colour'] and then, since the latter raises a KeyError exception, door1.__class__.__dict__['colour'].

``` python
>>> door1.__dict__['colour']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'colour'
>>> door1.__class__.__dict__['colour']
'brown'
```

Indeed, if we compare the two objects with the is operator, we can confirm that door1.colour and Door.colour are exactly the same object

``` python
>>> door1.colour is Door.colour
True
```

When we try to assign a value to a class attribute directly on an instance, we just put in the __dict__ of the instance a value with that name, and this value masks the class attribute since it is found first by __getattribute__(). As you can see from the examples of the previous section, this is different from changing the value of the attribute on the class itself.

``` python
>>> door1.colour = 'white'
>>> door1.__dict__['colour']
'white'
>>> door1.__class__.__dict__['colour']
'brown'
>>> Door.colour = 'red'
>>> door1.__dict__['colour']
'white'
>>> door1.__class__.__dict__['colour']
'red'
```

Revenge of the Methods

Let's play the same game with methods. First of all you can see that, just like class attributes, methods are listed only in the class __dict__. Chances are that they behave the same as attributes when we get them

``` python
>>> door1.open is Door.open
False
```

Whoops. Let us further investigate the matter

``` python
>>> Door.__dict__['open']
<function Door.open at 0xb68604ac>
>>> Door.open
<function Door.open at 0xb68604ac>
>>> door1.open
<bound method Door.open of <__main__.Door object at 0xb67e162c>>
```

So, the class function is listed in the members dictionary as a function. So far, so good. The same happens when taking it directly from the class; here Python 2 needed to introduce unbound methods, which are no longer present in Python 3. Taking it from the instance returns a bound method.

Well, a function is a procedure you name and define with the def statement. When you refer to a function as part of a class in Python 3 you get a plain function, without any difference from a function defined outside a class.

When you get the function from an instance, however, it becomes a bound method. The name method simply means "a function inside an object", according to the usual OOP definitions, while bound signals that the method is linked to that instance. Why does Python bother with methods being bound or not? And how does Python transform a function into a bound method?

First of all, if you try to call a class function you get an error

``` python
>>> Door.open()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: open() missing 1 required positional argument: 'self'
```

Yes. Indeed the function was defined to require an argument called 'self', and calling it without an argument raises an exception. This perhaps means that we can give it one instance of the class and make it work

``` python
>>> Door.open(door1)
>>> door1.status
'open'
```

Python does not complain here, and the method works as expected. So Door.open(door1) is the same as door1.open(), and this is the difference between a plain function coming from a class and a bound method: the bound method automatically passes the instance as the first argument to the underlying function.
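
A short sketch session, consistent with the Door class above (not from the original post), makes the equivalence concrete:

``` python
>>> door1 = Door(1, 'closed')
>>> Door.open(door1)    # plain function: instance passed explicitly
>>> door1.status
'open'
>>> door1.close()       # bound method: instance passed automatically
>>> door1.status
'closed'
```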

Again, under the hood, __getattribute__() is at work: when we call door1.open(), Python actually calls door1.__class__.open(door1). However, door1.__class__.open is a plain function, so there is something more that converts it into a bound method that Python can safely call.

When you access a member of an object, Python calls __getattribute__() to satisfy the request. This magic method, however, conforms to a procedure known as the descriptor protocol. For read access, __getattribute__() checks if the object has a __get__() method and calls the latter. So the conversion of a function into a bound method happens through such a mechanism. Let us review it by means of an example.

``` python
>>> door1.__class__.__dict__['open']
<function Door.open at 0xb68604ac>
```

This syntax retrieves the function defined in the class; the function knows nothing about objects, but it is an object (remember "everything is an object"). So we can look inside it with the dir() built-in function

``` python
>>> dir(door1.__class__.__dict__['open'])
['__annotations__', '__call__', '__class__', '__closure__', '__code__',
 '__defaults__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__',
 '__format__', '__ge__', '__get__', '__getattribute__', '__globals__',
 '__gt__', '__hash__', '__init__', '__kwdefaults__', '__le__', '__lt__',
 '__module__', '__name__', '__ne__', '__new__', '__qualname__',
 '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__',
 '__str__', '__subclasshook__']
>>> door1.__class__.__dict__['open'].__get__
<method-wrapper '__get__' of function object at 0xb68604ac>
```

As you can see, a __get__ method is listed among the members of the function, and Python recognizes it as a method-wrapper. This method shall connect the open function to the door1 instance, so we can call it passing the instance alone

``` python
>>> door1.__class__.__dict__['open'].__get__(door1)
<bound method Door.open of <__main__.Door object at 0xb67e162c>>
```

and we get exactly what we were looking for. This complex syntax is what happens behind the scenes when we call a method of an instance.
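
To see the descriptor protocol in isolation, here is a minimal sketch of a custom descriptor (my own illustration; the Logged and Monitored names are hypothetical): any class attribute whose type defines __get__() is looked up through the same mechanism that binds methods.

``` python
class Logged:
    def __init__(self, value):
        self.value = value

    def __get__(self, instance, owner):
        # Called by __getattribute__() whenever the attribute is read
        print('__get__ called on', owner.__name__)
        return self.value

class Monitored:
    colour = Logged('brown')

# >>> m = Monitored()
# >>> m.colour
# __get__ called on Monitored
# 'brown'
```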

When Methods met Classes

Using type() on functions defined inside classes reveals some other details on their internal representation

``` python
>>> Door.open
<function Door.open at 0x...>
>>> door1.open
<bound method Door.open of <__main__.Door object at 0xb6f9834c>>
>>> type(Door.open)
<class 'function'>
>>> type(door1.open)
<class 'method'>
```

As you can see, Python tells the two apart recognizing the first as a function and the second as a method, where the second is a function bound to an instance.

What if we want to define a function that operates on the class instead of operating on the instance? As we may define class attributes, we may also define class methods in Python, through the classmethod decorator. Class methods are functions that are bound to the class and not to an instance.

``` python
class Door:
    colour = 'brown'

    def __init__(self, number, status):
        self.number = number
        self.status = status

    @classmethod
    def knock(cls):
        print("Knock!")

    def open(self):
        self.status = 'open'

    def close(self):
        self.status = 'closed'
```

Such a definition makes the method callable on both the instance and the class

``` python
>>> door1.knock()
Knock!
>>> Door.knock()
Knock!
```

and Python identifies both as (bound) methods

``` python
>>> door1.__class__.__dict__['knock']
<classmethod object at 0xb67ff6ac>
>>> door1.knock
<bound method type.knock of <class '__main__.Door'>>
>>> Door.knock
<bound method type.knock of <class '__main__.Door'>>
>>> type(Door.knock)
<class 'method'>
>>> type(door1.knock)
<class 'method'>
```

As you can see the knock() function accepts one argument, which is called cls just to remind you that it is not an instance but the class itself. This means that inside the function we can operate on the class, and the class is shared among instances.

``` python
class Door:
    colour = 'brown'

    def __init__(self, number, status):
        self.number = number
        self.status = status

    @classmethod
    def knock(cls):
        print("Knock!")

    @classmethod
    def paint(cls, colour):
        cls.colour = colour

    def open(self):
        self.status = 'open'

    def close(self):
        self.status = 'closed'
```

The paint() classmethod now changes the class attribute colour which is shared among instances. Let's check how it works

``` python
>>> door1 = Door(1, 'closed')
>>> door2 = Door(2, 'closed')
>>> Door.colour
'brown'
>>> door1.colour
'brown'
>>> door2.colour
'brown'
>>> Door.paint('white')
>>> Door.colour
'white'
>>> door1.colour
'white'
>>> door2.colour
'white'
```

The class method can be called on the class, but this affects both the class and the instances, since the colour attribute of instances is taken at runtime from the shared class.

``` python
>>> door1.paint('yellow')
>>> Door.colour
'yellow'
>>> door1.colour
'yellow'
>>> door2.colour
'yellow'
```

Class methods can be called on instances too, however, and their effect is the same as before. The class method is bound to the class, so it works on the latter regardless of the actual object that calls it (the class or an instance).

Movie Trivia

Section titles come from the following movies: The Empire Strikes Back (1980), Raiders of the Lost Ark (1981), Revenge of the Nerds (1984), When Harry Met Sally (1989).

Sources

You will find a lot of documentation in this Reddit post. Most of the information contained in this series comes from those sources.

Feedback

Feel free to use the blog Google+ page to comment on the post. The GitHub issues page is the best place to submit corrections.

Next post

Python 3 OOP Part 3 - Delegation: composition and inheritance

August 20, 2014 12:00 PM


Machinalis

Making the case for Jython

Introduction

Jython is an implementation of Python that runs on top of the Java Virtual Machine. Why is it different? Why should I care about it? This blogpost will try to answer those questions by introducing a real life example.

Why Jython?

I had the privilege of working in Java for almost 15 years before I jumped on the Python bandwagon, so for me the value of Jython is pretty obvious. This might not be the case for you if you've never worked with either language, so let me tell you (and show you) what makes Jython awesome and useful.

According to the Jython site, these are the features that make Jython stand out over other JVM based languages:

  • Dynamic compilation to Java bytecodes - leads to highest possible performance without sacrificing interactivity.
  • Ability to extend existing Java classes in Jython - allows effective use of abstract classes.
  • Optional static compilation - allows creation of applets, servlets, beans, ...
  • Bean Properties - make use of Java packages much easier.
  • Python Language - combines remarkable power with very clear syntax. It also supports a full object-oriented programming model which makes it a natural fit for Java’s OO design.

I think the first, second and fifth bullets require special attention.

For some reason, a lot of people believe that the JVM is slow. This might have been true in the early years of the platform, but the JVM's performance has increased a lot since then. A lot has been written on this subject, but the following Wikipedia article summarizes the situation pretty well.

As mentioned above, it is possible to use Java classes in Jython. Although this statement is true, it fails to convey what I think is the most important aspect of Jython: there are A LOT of high-quality, mature Java libraries out there. The possibility of mixing all these libraries with the flexibility and richness of Python is invaluable. Let me give you a taste of this power.

Until the introduction of the new Date and Time API of Java 8, the only way to handle time properly in Java was to use Joda-Time. Joda-Time is an incredible powerful and flexible library for handling date and time on Java (or any JVM language for that matter). Although there are similar libraries in Python, I still haven’t come across one that can give Joda-Time a run for its money. The following shows a Jython shell session using Joda-Time:

Jython 2.7b2 (default:a5bc0032cf79+, Apr 22 2014, 21:20:17)
[Java HotSpot(TM) 64-Bit Server VM (Oracle Corporation)] on java1.8.0_05
Type "help", "copyright", "credits" or "license" for more information.
>>> from org.joda.time import DateTime, Period, Duration
>>> from java.util import Locale
>>> date_time = DateTime()
>>> date_time
2014-07-14T20:06:11.074-03:00
>>> date_time.getMonthOfYear()
7
>>> date_time.withYear(2000)
2000-07-14T20:06:11.074-03:00
>>> date_time.monthOfYear().getAsText()
u'July'
>>> date_time.monthOfYear().getAsShortText(Locale.FRENCH);
u'juil.'
>>> date_time.dayOfMonth().roundFloorCopy();
2014-07-14T00:00:00.000-03:00
>>> date_time.plus(Period.days(1))
2014-07-15T20:06:11.074-03:00
>>> date_time.plus(Duration(24L*60L*60L*1000L));
2014-07-15T20:06:11.074-03:00

This was just a quick example of the simplest features of Joda-Time. Although most of the features of Joda-Time are present in python-dateutil (with the exception of unusual chronologies), this is just an example. There are other popular Java libraries without a Python counterpart (I'll show you one in the next section).

As I mentioned before, I switched to Python recently. There was a lot involved in that decision, but the language itself played a major role. The possibility of combining this fantastic language with the power of the JVM and all the Java libraries and tools readily available is an interesting proposition.

Let me show you a real life example that I think summarizes perfectly why Jython matters.

Redacting names on comments

Not too long ago, we had to redact names from comments coming from social media sites. Our first idea was to use NLTK's NERTagger. This class depends on the Stanford Named Entity Recognizer (NER), which is a Java library. The integration is done by invoking a java shell command and analyzing its output. Not only is this far from ideal, it might create some problems if your data isn't just a piece of large text (which is our case).

This limitation is not caused by the NER API but by the way NLTK interacts with it. Wouldn’t it be nice if we could just write Python code that uses this API? Let’s do just that.

We cannot show you the data we had to work with, but I wrote an IPython Notebook to generate fake comments and save them on a CSV file so our script can work with them.

After the comments have been read, all we need to do is have the classifier tag the tokens, so we can redact the person names from the comments:

# The full script (available on GitHub) gets these names from the
# Stanford NER jars, presumably via imports such as:
#   from edu.stanford.nlp.ie.crf import CRFClassifier
#   from edu.stanford.nlp.ling.CoreAnnotations import AnswerAnnotation
classifier = CRFClassifier.getClassifierNoExceptions(
    'stanford-ner-2014-01-04/classifiers/english.all.3class.distsim.crf.ser.gz'
)

for row in dict_reader:
    redacted_text = row['text']
    classify_result = classifier.classify(row['text'])

    for sentence in classify_result:
        for word in sentence:
            token = word.originalText()
            tag = word.get(AnswerAnnotation)

            # Replace every token tagged as a person's name
            if tag == 'PERSON':
                redacted_text = redacted_text.replace(token, '****')

    row['redacted_text'] = redacted_text

This is an excerpt from a Python script available on GitHub that redacts names from text coming from a CSV file. All we need to run it is a JRE, the Jython 2.7 distribution and the Stanford NER jars. Then we run the following from the command line:

java -Dpython.path=stanford-ner-2014-01-04/stanford-ner.jar -jar jython-standalone-2.7-b2.jar redact_name_entities.py comments_df.csv comments_df_redacted.csv

Although we cannot run the code directly from Python (cPython, that is), we didn’t need to write a single line of Java to get access to the full power of Stanford NER API.

Conclusions

I hope by now you have an idea of just how important Jython is. It has some limitations, like the inability to use modules written in C, or the fact that it is only compatible with Python 2.7, but I think its advantages far outweigh the shortcomings.

Although we haven’t had the chance to work with .NET, I think the same rationale can be applied to IronPython when it comes to interacting with Microsoft’s framework.

August 20, 2014 11:57 AM


Leonardo Giordani

Python 3 OOP Part 1 - Objects and types

About this series

Object-oriented programming (OOP) has been the leading programming paradigm for several decades now, starting from the initial attempts back in the 60s to some of the most important languages used nowadays. Being a set of programming concepts and design methodologies, OOP can never be said to be "correctly" or "fully" implemented by a language: indeed there are as many implementations as languages.

So one of the most interesting aspects of OOP languages is to understand how they implement those concepts. In this post I am going to try and start analyzing the OOP implementation of the Python language. Due to the richness of the topic, however, I consider this attempt just like a set of thoughts for Python beginners trying to find their way into this beautiful (and sometimes peculiar) language.

This series of posts wants to introduce the reader to the Python 3 implementation of Object Oriented Programming concepts. The content of this and the following posts will not be completely different from that of the previous "OOP Concepts in Python 2.x" series, however. The reason is that while some of the internal structures change a lot, the global philosophy doesn't, since Python 3 is an evolution of Python 2 and not a new language.

So I chose to split the previous series and to adapt the content to Python 3 instead of posting a mere list of corrections. I find this way to be more useful for new readers, who would otherwise be forced to read the previous series.

Print

One of the most noticeable changes introduced by Python 3 is the transformation of the print keyword into the print() function. This is indeed a very small change, compared to other modifications made to the internal structures, but it is the most visually striking one, and will be the source of 80% of your syntax errors when you start writing Python 3 code.

Remember that print is now a function, so write print(a) and not print a.

Back to the Object

Computer science deals with data and with procedures to manipulate that data. Everything, from the earliest Fortran programs to the latest mobile apps is about data and their manipulation.

So if data are the ingredients and procedures are the recipes, it seems (and can be) reasonable to keep them separate.

Let's do some procedural programming in Python

``` python
# This is some data
data = (13, 63, 5, 378, 58, 40)

# This is a procedure that computes the average
def avg(d):
    return sum(d)/len(d)

print(avg(data))
```

As you can see the code is quite good and general: the procedure (function) operates on a sequence of data, and it returns the average of the sequence items. So far, so good: computing the average of some numbers leaves the numbers untouched and creates new data.

The observation of the everyday world, however, shows that complex data mutate: an electrical device is on or off, a door is open or closed, the content of a bookshelf in your room changes as you buy new books.

You can still manage it keeping data and procedures separate, for example

``` python
# These are two numbered doors, initially closed
door1 = [1, 'closed']
door2 = [2, 'closed']

# This procedure opens a door
def open_door(door):
    door[1] = 'open'

open_door(door1)
print(door1)
```

I described a door as a structure containing a number and the status of the door (as you would do in languages like LISP, for example). The procedure knows how this structure is made and may alter it.

This also works like a charm. Some problems arise, however, when we start building specialized types of data. What happens, for example, when I introduce a "lockable door" data type, which can be opened only when it is not locked? Let's see

``` python
# These are two standard doors, initially closed
door1 = [1, 'closed']
door2 = [2, 'closed']

# This is a lockable door, initially closed and unlocked
ldoor1 = [1, 'closed', 'unlocked']

# This procedure opens a standard door
def open_door(door):
    door[1] = 'open'

# This procedure opens a lockable door
def open_ldoor(door):
    if door[2] == 'unlocked':
        door[1] = 'open'

open_door(door1)
print(door1)

open_ldoor(ldoor1)
print(ldoor1)
```

Everything still works, no surprises in this code. However, as you can see, I had to find a different name for the procedure that opens a lockable door, since its implementation differs from the procedure that opens a standard door. But, wait... I'm still opening a door, the action is the same, and it just changes the status of the door itself. So why should I have to remember that a lockable door must be opened with open_ldoor() instead of open_door() if the verb is the same?

Chances are that this separation between data and procedures doesn't perfectly fit some situations. The key problem is that the "open" action is not actually using the door; rather it is changing its state. So, just like the volume control buttons of your phone, which are on your phone, the "open" procedure should stick to the "door" data.

This is exactly what leads to the concept of object: an object, in the OOP context, is a structure holding data and procedures operating on them.

What About Type?

When you talk about data you immediately need to introduce the concept of type. This concept may have two meanings that are worth being mentioned in computer science: the behavioural and the structural one.

The behavioural meaning represents the fact that you know what something is by describing how it acts. This is the foundation of so-called "duck typing" (here "typing" means "to give a type" and not "to type on a keyboard"): if it acts like a duck, it is a duck.

The structural meaning identifies the type of something by looking at its internal structure. So two things that act in the same way but are internally different are of different type.

Both points of view can be valid, and different languages may implement and emphasize one meaning of type or the other, and even both.

Class Games

Objects in Python may be built describing their structure through a class. A class is the programming representation of a generic object, such as "a book", "a car", "a door": when I talk about "a door" everyone can understand what I'm saying, without the need of referring to a specific door in the room.

In Python, the type of an object is represented by the class used to build the object: that is, in Python the word type has the same meaning as the word class.

For example, one of the built-in classes of Python is int, which represents an integer number

``` python
>>> a = 6
>>> print(a)
6
>>> print(type(a))
<class 'int'>
>>> print(a.__class__)
<class 'int'>
```

As you can see, the built-in function type() returns the content of the magic attribute __class__ (magic here means that its value is managed by Python itself offstage). The type of the variable a, or its class, is int. (This is a very inaccurate description of this rather complex topic, so remember that at the moment we are just scratching the surface).

Once you have a class you can instantiate it to get a concrete object (an instance) of that type, i.e. an object built according to the structure of that class. The Python syntax to instantiate a class is the same as that of a function call

``` python
>>> b = int()
>>> type(b)
<class 'int'>
```

When you create an instance, you can pass some values, according to the class definition, to initialize it.

``` python
>>> b = int()
>>> print(b)
0
>>> c = int(7)
>>> print(c)
7
```

In this example, the int class creates an integer with value 0 when called without arguments, otherwise it uses the given argument to initialize the newly created object.

Let us write a class that represents a door to match the procedural examples done in the first section

``` python
class Door:
    def __init__(self, number, status):
        self.number = number
        self.status = status

    def open(self):
        self.status = 'open'

    def close(self):
        self.status = 'closed'
```

The class keyword defines a new class named Door; everything indented under class is part of the class. The functions you write inside the object are called methods and don't differ at all from standard functions; the nomenclature changes only to highlight the fact that those functions now are part of an object.

Methods of a class must accept as first argument a special value called self (the name is a convention but please never break it).

The class can be given a special method called __init__() which is run when the class is instantiated, receiving the arguments passed when calling the class; the general name of such a method, in the OOP context, is constructor, even if the __init__() method is not the only part of this mechanism in Python.

The self.number and self.status variables are called attributes of the object. In Python, methods and attributes are both members of the object and are accessible with the dotted syntax; the difference between attributes and methods is that the latter can be called (in Python lingo you say that a method is a callable).

As you can see the __init__() method shall create and initialize the attributes since they are not declared elsewhere. This is very important in Python and is strictly linked with the way the language handles the type of variables. I will detail those concepts when dealing with polymorphism in a later post.

The class can be used to create a concrete object

``` python
>>> door1 = Door(1, 'closed')
>>> type(door1)
<class '__main__.Door'>
>>> print(door1.number)
1
>>> print(door1.status)
closed
```

Now door1 is an instance of the Door class; type() returns the class as __main__.Door since the class was defined directly in the interactive shell, that is in the current main module.

To call a method of an object, that is to run one of its internal functions, you just access it as an attribute with the dotted syntax and call it like a standard function.

``` python
>>> door1.open()
>>> print(door1.number)
1
>>> print(door1.status)
open
```

In this case, the open() method of the door1 instance has been called. No arguments have been passed to the open() method, but if you review the class declaration, you see that it was declared to accept an argument (self). When you call a method of an instance, Python automatically passes the instance itself to the method as the first argument.

You can create as many instances as needed and they are completely unrelated to each other. That is, the changes you make on one instance do not reflect on another instance of the same class.
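
A quick sketch session, consistent with the Door class above (not from the original post), shows this independence:

``` python
>>> door1 = Door(1, 'closed')
>>> door2 = Door(2, 'closed')
>>> door1.open()
>>> print(door1.status)
open
>>> print(door2.status)
closed
```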

Recap

Objects are described by a class, which can generate one or more instances that are unrelated to each other. A class contains methods, which are functions, and they accept at least one argument called self, which is the actual instance on which the method has been called. A special method, __init__(), deals with the initialization of the object, setting the initial values of the attributes.

Movie Trivia

Section titles come from the following movies: Back to the Future (1985) , What About Bob? (1991), Wargames (1983).

Sources

You will find a lot of documentation in this Reddit post. Most of the information contained in this series comes from those sources.

Feedback

Feel free to use the blog Google+ page to comment on the post. The GitHub issues page is the best place to submit corrections.

Next post

Python 3 OOP Part 2 - Classes and members

August 20, 2014 11:00 AM


Kushal Das

10 years and continuing

Ten years ago I started a Linux Users Group in Durgapur as I thought that was the only way to go forward. Almost no one in the colleges knew much about Linux, apart from a couple of users in each college. "Learn and teach others", the motto, was very much true from day one and it still holds the perfect place in the group.

The group started with help from a lot of people who were from different places, mostly the ilug-kolkata chapter. Sankarshan, Runa, Sayamindu, Indranil, Soumyadip they all helped in many different ways. Abhijit Majumder, who is currently working as Assistant Professor in IIT Mumbai, donated the money for the domain name in the first year.

After one year, I moved to Bangalore for my job and gave a talk at foss.in about that first year's journey of the group. The focus of the group also changed from just being a user group to a group of like-minded contributors.

Then from 2008 I started the summer training program; the 7th edition is currently going on. This program has helped the group keep producing a steady stream of contributors. People from different countries participated in the sessions, and they became contributors to many upstream projects.

I have to admit that we are close to the Fedora Project and Python, as many of us work on and use these two projects every day.

We managed to have a couple of meetings before, in 2006 and 2007. We will be meeting again from 29th August to 2nd September at NIT Durgapur; most of the active members are coming down to Durgapur. The daytimes we will spend in a few talks and workshops, and in the evenings we will be busy with developer sprints.

Suchakra Sharma made the new logo and t-shirt design for the event.

dgplug logo

The event page is up and the talk schedule is also up with help from Sanisoft. We are using their beautiful conference scheduler application for the same. Come and meet us in Durgapur.

August 20, 2014 06:40 AM

August 19, 2014


Lee Braiden

When Agile goes wrong

I just received this message from customer support for a product I pay a subscription for (Oh, why not: this is Todoist.com): Unfortunately we don’t have a roadmap for future features. To stay flexible and add features based on requests, we work on a few options, implement them and then decide what’s next so unfortunately […]

August 19, 2014 06:11 PM


Martijn Faassen

New HTTP 1.1 RFCs versus WSGI

Recently new HTTP 1.1 RFCs were published that obsolete the old HTTP 1.1 RFCs. They are extensively rewritten.

Unfortunately the WSGI PEP 3333 refers to something only made explicit in the old version of the RFCs, but which is much harder to find in the new versions of the RFCs. I thought I'd leave a report of my investigations here so that others who may run into this in the future can find it.

WSGI is a protocol that's like HTTP but isn't quite HTTP. In particular WSGI defines its own iterator-based way to send larger responses out in smaller parts. It therefore cannot deal with so-called "hop-by-hop" headers, which try to control this behavior at the HTTP level. The WSGI spec says a WSGI application must not generate such headers.

This is relevant when you're dealing with a WSGI-over-HTTP proxy. This is a special WSGI application that talks to an underlying HTTP server. It presents itself as a normal WSGI application.

The underlying HTTP server could very well be sending out headers such as Transfer-Encoding: chunked. The WSGI spec does not allow a WSGI application to send them out, though, so a WSGI proxy must strip these headers out.

So what headers are to be stripped out? The WSGI spec refers to section 13.5.1 of the now-obsolete RFC 2616.

This nicely lists the hop-by-hop headers:

- Connection
- Keep-Alive
- Proxy-Authenticate
- Proxy-Authorization
- TE
- Trailers
- Transfer-Encoding
- Upgrade

That RFC also says:

"All other headers defined by HTTP/1.1 are end-to-end headers."

and then confusingly:

"Other hop-by-hop headers MUST be listed in a Connection header, (section 14.10) to be introduced into HTTP/1.1 (or later)."

So which one is it? I guess that's one of the reasons this text got rewritten.

In the new rewritten version of HTTP 1.1, this list is gone. Instead it specifies for some headers (such as TE and Upgrade) that these should be added to the Connection field. An HTTP proxy can then strip out the headers listed in Connection, and then also strip out Connection itself.

Confusingly, while the new RFC 7230 refers to the concept of 'hop-by-hop' early on, and also says this in the change notes in A.2:

"Also, "hop-by-hop" header fields are required to appear in the Connection header field; just because they're defined as hop- by-hop in this specification doesn't exempt them."

it doesn't actually say any headers are hop-by-hop anywhere else. Instead it mandates some headers should be added to Connection.

But wait: Transfer-Encoding is not to be listed in the Connection header, as it's not hop-by-hop. At least, not anymore. I've seen it described as 'hopX-by-hopY', but not in the RFC. This is, I think, because an HTTP proxy could let these through without having to remove them. But not a WSGI over HTTP proxy: it MUST remove Transfer-Encoding, as WSGI applications have no such concept.
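
For concreteness, here is a minimal sketch (my own, not from the PEP) of the header filtering such a WSGI-over-HTTP proxy might do, dropping the RFC 2616 hop-by-hop set plus anything named in the Connection field:

``` python
# Hop-by-hop headers as listed in RFC 2616, section 13.5.1
HOP_BY_HOP = frozenset([
    'connection', 'keep-alive', 'proxy-authenticate', 'proxy-authorization',
    'te', 'trailers', 'transfer-encoding', 'upgrade',
])

def strip_hop_by_hop(headers):
    # Collect any extra hop-by-hop names declared in Connection fields
    listed = set()
    for name, value in headers:
        if name.lower() == 'connection':
            listed.update(token.strip().lower() for token in value.split(','))
    # Keep only end-to-end headers
    return [(name, value) for name, value in headers
            if name.lower() not in HOP_BY_HOP
            and name.lower() not in listed]
```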

I think the WSGI PEP should be updated in terms of the new HTTP RFC. It should make explicit that some headers such as Transfer-Encoding must not be specified by a WSGI app, and that no headers that must be listed in Connection can be specified by a WSGI app, or something like that.

Relevant mailing list thread:

http://lists.w3.org/Archives/Public/ietf-http-wg/2014JulSep/thread.html#msg1710

August 19, 2014 11:37 AM

August 18, 2014


Mike C. Fletcher

Seem to need hidden-markov-models for text extraction...

So in order to "seed" listener with text-as-it-would-be-spoken for coding, I've built up a tokenizer that will parse through a file and attempt to produce a model of what would have been said to dictate that text. The idea being that we want to generate a few hundred megabytes of sample statements that can then be used to generate a "python coding" or "javascript coding" language model. Thing is, this is actually a pretty grotty/nasty problem, particularly dealing with "run together" words, such as `asstring` or `mkdtemp`. You can either be very strict, and only allow specifically defined words (missing out on a lot of statements) or you can attempt to pull apart the words.

If you attempt to pull apart the words you run into dragons. Any big English dictionary has almost all short character sequences defined, so "shutil" can be broken up into "shu" and "til" rather than "sh" and "util"... and really there's nothing wrong with the "shu" "til" guess other than that it's not statistically likely.

Interestingly, this is a big part of voice dictation, guessing what sequence of actions generated a sequence of observed events.  When doing voice dictation you're looking at "what spoken words would have generated this sequence of phonemes" while in corpus preparation you're trying to guess "what set of target words and combining rules would have generated this sequence of characters".

I'm beginning to think that I should use the same basic strategy as I used for translating IPA to ARPABet, namely do a very strict pass to produce a statistical guess as to what words are common/likely, then use that to do a second pass where we attempt to guess the splits. So if we see the words OpenGL and Context dozens of times, then when we do the next pass and see OpenGLContext we likely see it as OpenGL Context, not "Open" "GLC" "on" "text".
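
A minimal sketch of that second pass (my own illustration; the counts are hypothetical): treat the first pass's word counts as a unigram model and pick the segmentation with the highest summed log-probability.

``` python
import math

def best_split(token, counts, max_len=12):
    # Log-probabilities penalise splits into many rare short words, so
    # 'openglcontext' prefers open+gl+context over open+glc+on+text.
    total = float(sum(counts.values()))
    n = len(token)
    # best[i] holds (score, words) for the best segmentation of token[:i]
    best = [(0.0, [])] + [(float('-inf'), None)] * n
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = token[start:end]
            if word in counts and best[start][1] is not None:
                score = best[start][0] + math.log(counts[word] / total)
                if score > best[end][0]:
                    best[end] = (score, best[start][1] + [word])
    return best[n][1]  # None if the token cannot be fully segmented

# Hypothetical counts gathered by a strict first pass
counts = {'open': 50, 'gl': 20, 'context': 30, 'glc': 1, 'on': 40, 'text': 25}
print(best_split('openglcontext', counts))  # ['open', 'gl', 'context']
```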

Or maybe I should just go with strictness in all things and explicitly ask the user to deal with all unrecognized words (or at least just do a guess and keep those words separate). I only see a few in any given project anyway. If you could just say "say that word as 'open g l context'", that is, spell out the words you would expect to say to get the result, then you could re-run the extraction in strict mode and get exactly what you wanted.

August 18, 2014 08:43 PM


Nigel Babu

Arrrgh! Tracebacks and Exceptions

My colleague asked me to take a look at a logging issue on a server last week. He noticed that the error logs had way too little information about exceptions. In this particular instance, we had switched to Nginx + gunicorn instead of our usual Nginx + Apache + mod_wsgi (yeah, we're weird). I took a quick look this morning and everything looked exactly like it should. I've read more gunicorn docs today than I ever have before, I think.

Eventually, I asked my colleague Tryggvi for help. I needed a third person to tell me if I was making an obvious mistake. He asked me if I had tried running gunicorn without supervisor, which I hadn't. I tried that locally first, and it worked! I was all set to blame supervisor for my woes and tried it on production. Nope. No luck. As any good sysadmin would do, I checked if the versions matched, and they did. CKAN itself has its dependencies frozen, which led to more confusion in my brain. It didn't make sense.

I started looking at the exception in more detail; there was a note about email not working, and then the actual traceback. Well, since I didn't actually have a mail server on my local machine, I commented those configs out, and now I just had the right traceback. A few minutes later, it dawned on me. It's a Pylons "feature": the full traceback is printed to stdout if and only if there's no email handling. Our default configs have an email configured, our servers have postfix installed on them, and all the errors go to an email alias that's way too noisy to be useful (Sentry. Soon). I went and commented out the relevant bits of configuration and voilà, it works!

Palm Face

Image source: Unknown, but provided by Tryggvi :)

August 18, 2014 03:45 PM


Katie Cunningham

What is a tech reader?

If you write a tech book, eventually, you’ll be asked to find tech readers. It may not even be for your book! I’ve been asked to find tech readers for other people’s books several times, especially if the book is geared towards beginners. I teach, so naturally, I know more than a few new coders.

But… what is a tech reader?

The job

Simply put, the tech reader is the person who reads the book with an eye to technical accuracy. Grammar, layout, spelling: None of these are your bag. You make sure that the explanations make sense. You run the code to make sure it works. You point out if the author has glossed over something major, or if they’re using something that hasn’t been used yet.

The level of experience you need varies. A book should always have some tech readers who are the intended audience (so, possibly, beginners), but you also need experts to read the book. A beginner will notice more easily than an expert when an explanation gets confusing, but an expert is more likely to point out when you're incorrect, or leading someone down the wrong path.

For example, Doug Hellmann was one of my expert technical readers for Teach Yourself Python in 24 Hours. Back in the day, I had planned a chapter on pickles (because I had to have 24 chapters and was having trouble with what should go in the middle of the book). He was the one that suggested that I should probably just teach JSON instead.

My beginners chimed in when they felt I was going too fast, and were experts at noticing when I was using something I hadn’t explained yet (like showing a for loop while teaching lists, even though I wasn’t covering for loops until the next chapter). Stuff like that is difficult for an expert to pick up on because, well, for loops and such come naturally to us.

The pay

I’m going to be frank about this: The pay is not great. Some places will toss a bit of money at you (around a few hundred dollars) while others will offer you a copy of the book. It not only varies by publishing house but by individual book. Some books simply end up with more of a budget for readers. A book that needs both beginners and experts will have more money allotted to technical readers than one aimed just at experts.

So why do it?

If the pay sucks but it’s going to take you a while to do, why would you bother to be a tech reader?

The biggest one: If you want to be a writer, this is a great way to get your name on the list. You get to chat with the editors and other authors as well as show off your chops. There are other ways to get your foot in the door, but when it comes to breaking into the writing scene, I recommend knocking on all the doors you can find.

Sometimes, you just want to do a good turn for someone in the community. I have tech-read books because I consider someone a friend, and I want their book to be as awesome as possible.

It’s also a wonderful learning opportunity, if you happen to be a novice. Not only do you have a book, but you have access to the author. Not clear on a point? Shoot off an email! During the tech review, my job was to basically sit around and wait for my readers to email me with questions.

Finally, it helps create a better book. We always need more tech books. I know, it seems like there’s already a ton of tech books out there. Unlike a novel, though, tech books have a very short lifespan. Within a few years, they’re out of date, and a few years after that, they’re often useless. We need a stream of new books and updated books to help spread ideas and bring new developers into the fold.

How do I become a tech reader?

If you’re already friends with an author, then I would suggest telling them that you would like to be a tech reader. Most of us keep a list on hand for when our editor inevitably asks us to gather some people (I know I do).

If you see a booth for a publisher at a conference, talk to the people staffing it. I assure you, these people are not interns. They’re usually editors, authors, and community managers, and if they don’t know of a project you can help on right now, they can pass your information on to someone who does.

Finally, if you don’t go to conferences and don’t know any active authors, try Twitter. Every major publishing company has a dozen contact emails you can try, but I’ve found the people manning the Twitter accounts to be the most responsive. Most will follow you back so you can have a private conversation.

Just… don’t do what I did and complain about the quality of tech books. It worked for me, but you should probably start off with politeness rather than being a grouchy cuss.

August 18, 2014 02:44 PM


Ian Ozsvald

Python Training courses: Data Science and High Performance Python coming in October

I’m pleased to say that via our company ModelInsight we’ll be running two Python-focused training courses in October. The goal is to give you strong new research & development skills; they’re aimed at folks in companies but would suit folks in academia too. UPDATE: training courses are ready to buy (1 Day Data Science, 2 Day High Performance).

UPDATE: we have a <5 min anonymous survey which helps us learn your needs for Data Science training in London; please click through and answer the few questions so we know what training you need.

“Highly recommended – I attended in Aalborg in May” – @ThomasArildsen, on the upcoming Python DataSci/HighPerf training courses

These and future courses will be announced on our London Python Data Science Training mailing list; sign up for occasional announcements about our upcoming courses (no spam, just occasional updates, and you can unsubscribe at any time).

Intro to Data science with Python (1 day) on Friday 24th October

Students: Basic to Intermediate Pythonistas (you can already write scripts and you have some basic matrix experience)

Goal: Solve a complete data science problem (building a working and deployable recommendation engine) by working through the entire process – using numpy and pandas, applying test driven development, visualising the problem, deploying a tiny web application that serves the results (great for when you’re back with your team!)

High Performance Python (2 day) on Thursday+Friday 30th+31st October

Students: Intermediate Pythonistas (you need higher performance for your Python code)

Goal: learn techniques for high performance computing with Python, a mix of background theory and lots of hands-on pragmatic exercises

The High Performance course is built off of many years teaching and talking at conferences (including PyDataLondon 2013, PyCon 2013, EuroSciPy 2012) and in companies along with my High Performance Python book (O’Reilly). The data science course is built off of techniques we’ve used over the last few years to help clients solve data science problems. Both courses are very pragmatic, hands-on and will leave you with new skills that have been battle-tested by us (we use these approaches to quickly deliver correct and valuable data science solutions for our clients via ModelInsight). At PyCon 2012 my students rated me 4.64/5.0 for overall happiness with my High Performance teaching.

“@ianozsvald [..] Best tutorial of the 4 I attended was yours. Thanks for your time and preparation!” @cgoering

We’d also like to know which other courses you’d like to see; we can partner with trainers as needed to deliver new courses in London. We’re focused on Python, data science, high performance and pragmatic engineering. Drop me an email (via ModelInsight) and let me know if we can help.

Do please join our London Python Data Science Training mailing list to be kept informed about upcoming training courses.


Ian applies Data Science as an AI/Data Scientist for companies in ModelInsight, sign-up for Data Science tutorials in London. Historically Ian ran Mor Consulting. He also founded the image and text annotation API Annotate.io, co-authored SocialTies, programs Python, authored The Screencasting Handbook, lives in London and is a consumer of fine coffees.

August 18, 2014 11:05 AM


Montreal Python User Group

August organisation Meeting

The summer is slowly ending and it's time for us to plan our next season. The Montreal-Python team will meet next Wednesday, August 27th, to organise and talk about what we would like to do this fall.

If you have ideas, or if you would like to give a hand, please come join us!

Where

The meeting will be held at the Ajah offices at 1124 Marie-Est suite 11 (https://goo.gl/maps/74aWY)

When

Wednesday, August 27 at 7:00 pm

Schedule and Plan

See you there, and if you have any comments or questions, please don't hesitate to write to us at: mtlpyteam@googlegroups.com

August 18, 2014 04:00 AM

August 17, 2014


Nigel Babu

OKFestival Fringe Events

The writeup of the OKFestival is very incomplete, because I haven’t mentioned the fringe events! I attended two fringe events and they both were very good.

First, I attended CKANCon right before OKFestival. It was informal and co-located with CSVConf. My best takeaway has been talking to people from the wider community around CKAN. I often feel blind-sided because we don’t have a good view of CKAN. I want to know how a user of a portal built on CKAN feels about the UX. After all, the actual users of open data portals are citizens who get data that they can do awesome things with. I had a good conversation with folks from DKAN about their work and I’ve been thinking about how we can make that better.

I finally met Max! (And I was disappointed he didn’t have a meatspace sticker. :P)

The other event I attended was Write the Docs. Ali and Florian came to Berlin to attend the event. It was a total surprise running into them at the Mozilla Berlin office. The discussions at the event were spectacular. The talks by Paul Adams and Jessica Rose were great and a huge learning experience. I missed parts of oncletom’s talk, but the bit I did catch sounded very different to my normal view of documentation.

We had a few discussions around localization and QA of docs which were pretty eye opening. At one of the sessions, Paul, Ali, Fabian and I discussed rules of documentation, which turned out pretty good! It was an exercise in patience narrowing them down!

I was nearly exhausted and unable to think clearly by the time Write the Docs started, but managed to push through it! Huge thanks to (among others) Mikey and Kristof for organizing the event!

August 17, 2014 03:00 PM


Graham Dumpleton

Transparent object proxies in Python.

This is a quick rejoinder to a specific comment made in Armin Ronacher's recent blog post titled 'The Python I Would Like To See'. In that post Armin gives the following example of something that is possible with old style Python classes.

>>> original = 42
>>> class FooProxy:
...     def __getattr__(self, x):
...         return getattr(original, x)
...
>>> proxy = FooProxy()
>>> proxy
42
>>> 1 + proxy
43
>>> proxy + 1
43

August 17, 2014 03:57 PM


Nick Coghlan

Why Python 4.0 won't be like Python 3.0

Newcomers to python-ideas occasionally make reference to the idea of "Python 4000" when proposing backwards incompatible changes that don't offer a clear migration path from currently legal Python 3 code. After all, we allowed that kind of change for Python 3.0, so why wouldn't we allow it for Python 4.0?

I've heard that question enough times now (including the more concerned phrasing "You made a big backwards compatibility break once, how do I know you won't do it again?"), that I figured I'd record my answer here, so I'd be able to refer people back to it in the future.

What are the current expectations for Python 4.0?

My current expectation is that Python 4.0 will merely be "the release that comes after Python 3.9". That's it. No profound changes to the language, no major backwards compatibility breaks - going from Python 3.9 to 4.0 should be as uneventful as going from Python 3.3 to 3.4 (or from 2.6 to 2.7). I even expect the stable Application Binary Interface (as first defined in PEP 384) to be preserved across the boundary.

At the current rate of language feature releases (roughly every 18 months), that means we would likely see Python 4.0 some time in 2023, rather than seeing Python 3.10.

So how will Python continue to evolve?

First and foremost, nothing has changed about the Python Enhancement Proposal process - backwards compatible changes are still proposed all the time, with new modules (like asyncio) and language features (like yield from) being added to enhance the capabilities available to Python applications. As time goes by, Python 3 will continue to pull further ahead of Python 2 in terms of the capabilities it offers by default, even if Python 2 users have access to equivalent capabilities through third party modules or backports from Python 3.

Competing interpreter implementations and extensions will also continue to explore different ways of enhancing Python, including PyPy's exploration of JIT-compiler generation and software transactional memory, and the scientific and data analysis community's exploration of array oriented programming that takes full advantage of the vectorisation capabilities offered by modern CPUs and GPUs. Integration with other virtual machine runtimes (like the JVM and CLR) is also expected to improve with time, especially as the inroads Python is making in the education sector are likely to make it ever more popular as an embedded scripting language in larger applications running in those environments.

For backwards incompatible changes, PEP 387 provides a reasonable overview of the approach that was used for years in the Python 2 series, and still applies today: if a feature is identified as being excessively problematic, then it may be deprecated and eventually removed.

However, a number of other changes have been made to the development and release process that make it less likely that such deprecations will be needed within the Python 3 series.

From (mostly) English to all written languages

It's also worth noting that Python 3 wasn't expected to be as disruptive as it turned out to be. Of all the backwards incompatible changes in Python 3, many of the serious barriers to migration can be laid at the feet of one little bullet point in PEP 3100: "Make all strings be Unicode, and have a separate bytes() type. The new string type will be called 'str'."

PEP 3100 was the home for Python 3 changes that were considered sufficiently non-controversial that no separate PEP was considered necessary. The reason this particular change was considered non-controversial was because our experience with Python 2 had shown that the authors of web and GUI frameworks were right: dealing sensibly with Unicode as an application developer means ensuring all text data is converted from binary as close to the system boundary as possible, manipulated as text, and then converted back to binary for output purposes.

Unfortunately, Python 2 doesn't encourage developers to write programs that way - it blurs the boundaries between binary data and text extensively, and makes it difficult for developers to keep the two separate in their heads, let alone in their code. So web and GUI framework authors have to tell their Python 2 users "always use Unicode text. If you don't, you may suffer from obscure and hard to track down bugs when dealing with Unicode input".

Python 3 is different: it imposes a much greater separation between the "binary domain" and the "text domain", making it easier to write normal application code, while making it a bit harder to write code that works with system boundaries where the distinction between binary and text data can be substantially less clear. I've written in more detail elsewhere regarding what actually changed in the text model between Python 2 and Python 3.
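To make that boundary-handling pattern concrete, here is a minimal sketch (illustrative only; UTF-8 is assumed as the wire encoding):

raw = b'caf\xc3\xa9'          # binary data arriving at the system boundary
text = raw.decode('utf-8')    # convert to text as close to the boundary as possible
text = text.upper()           # manipulate it purely as text
out = text.encode('utf-8')    # convert back to binary only for output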

This revolution in Python's Unicode support is taking place against a larger background migration of computational text manipulation from the English-only ASCII (officially defined in 1963), through the complexity of the "binary data + encoding declaration" model (including the C/POSIX locale and Windows code page systems introduced in the late 1980's) and the initial 16-bit only version of the Unicode standard (released in 1991) to the relatively comprehensive modern Unicode code point system (first defined in 1996, with new major updates released every few years).

Why mention this point? Because this switch to "Unicode by default" is the most disruptive of the backwards incompatible changes in Python 3, and unlike the others (which were more language specific), it is one small part of a much larger industry wide change in how text data is represented and manipulated. The language specific issues have been cleared out by the Python 3 transition, the barrier to entry for new language features is much higher than in the early days of Python, and no other industry wide migration on the scale of the switch from "binary data with an encoding" to Unicode for text modelling is currently in progress. Given all that, I can't see any kind of change coming up that would require a Python 3 style backwards compatibility break and parallel support period. Instead, I expect we'll be able to accommodate any future language evolution within the normal change management processes, and any proposal that can't be handled that way will just get rejected as imposing an unacceptably high cost on the community and the core development team.

August 17, 2014 05:30 AM

August 16, 2014


Grzegorz Śliwiński

mirakuru 0.2 released

Last Thursday, we released a new minor version of mirakuru. Mirakuru is a helpful tool that lets you add superpowers to your tests, or to other scripts that need other processes to run.
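The announcement itself doesn't show code, but the idea looks roughly like this (a sketch using the TCPExecutor API from mirakuru's documentation; exact names in 0.2 may differ):

from mirakuru import TCPExecutor

# Start a helper process and block until its TCP port accepts connections.
executor = TCPExecutor('my_server --port 3000', host='localhost', port=3000)
executor.start()
try:
    pass  # ... run tests against the now-ready process ...
finally:
    executor.stop()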

The changes introduced in mirakuru 0.2 are:

Read more… (1 min remaining to read)

August 16, 2014 07:21 PM


Richard Tew

MIPS32 support

My satellite receiver has the odd bug, but otherwise it is generally a pretty good piece of hardware and software.  However, the bugs, and the inability to fix them, are something I wish I could do something about.  The firmware is written in MIPS assembly language, which runs on an Ali 3602 chip.  Scanning through it, you can see interesting things like the license for Linux-NTFS, something which was reported to the GPL violations mailing list several years back (with no action taken).

Anyway, I can't afford the main interactive disassembler out there, IDA.  So I have my own, Peasauce, which comparatively has only a token set of features.  Up until now it only disassembled m68k machine code, but I've just finished adding basic MIPS support.  It's nowhere near perfect, but it's a start.  And it shows how much work goes into IDA.


The next architecture is likely to be ARM, although like MIPS it gets more complicated.  There are ARM and ARM Thumb instructions, which are differently sized and mix together to some extent; likewise there are MIPS32 and MIPS16, which are differently sized and mix together to some extent.  But that's a problem for another occasion.  The work on this could be endless, if I had the time.

August 16, 2014 03:19 PM


Stefan Behnel

Faster Python calls in Cython 0.21

I spent some time during the last two weeks reducing the call overhead for Python functions and methods in Cython. It was already quite low compared to CPython before, about 30-40% faster, but profiling then made me stumble over the fact that method calls in CPython really just do one thing: they repack the argument tuple and prepend the 'self' object to it. However, that is done right after Cython has carefully packed up exactly that argument tuple in the first place, so by simply inlining what PyMethodObject does, we can avoid packing tuples twice.
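In pure Python terms, that repacking is what a bound method object adds to every call (a toy illustration, not Cython's actual code):

class C(object):
    def method(self, a, b):
        return a + b

obj = C()
bound = obj.method        # method object: the function C.method plus 'self'

# Calling the bound method is equivalent to prepending 'self' to the
# argument tuple and calling the plain function:
assert bound(1, 2) == C.method(obj, 1, 2) == 3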

Avoiding the creation of a PyMethodObject at all may also appear to be an interesting goal, but doing that is totally not easy (it happens during attribute lookup) and it's also most likely not worth it, as method objects are created from a freelist, which makes their instantiation very fast. Method objects also hold actual state that the caller must receive: the underlying function and the self object. So getting rid of them would severely complicate things without a major gain to expect.

Another obvious optimisation, however, is that Python code calls into C implemented functions quite often, and if those are implemented as specialised functions that take exactly one or no argument (METH_O/METH_NOARGS), then the tuple packing and unpacking can be avoided completely. Together with the method call optimisation, this means that Cython can now call very simple methods without creating an argument tuple, and less simple ones without redundantly creating a second argument tuple.
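As a rough illustration of that packing cost, compare a METH_O builtin such as ord() with a pure-Python stand-in, which always receives a packed argument tuple (timings are illustrative and will vary):

import timeit

# C-level METH_O function: receives its single argument directly.
print(timeit.timeit("ord('a')"))

# Pure-Python equivalent: every call packs the argument into a tuple first.
print(timeit.timeit("ord_py('a')", setup="def ord_py(c):\n    return ord(c)"))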

I implemented these optimisations and they immediately blew up the method call micro benchmarks in Python's benchmark suite from about 1/3 to 2-3 times faster than CPython 3.5 (pre). Those are only simple micro benchmarks, so any real world code will benefit substantially less overall. However, it turned out that a couple of benchmarks in the suite that are based on real production code ended up losing 5-15% of their total runtime. That's quite remarkable, given that the code they call actually does something (much) more heavyweight than the call overhead itself. I'm still tuning it a bit, but so far I'm really happy with this result.

August 16, 2014 11:31 AM


Graeme Cross

Resources for creating the “What’s New” talks

The monthly Melbourne Python User Group meeting has a regular section covering “What’s new in Python”.

Javier asked me what resources I use to compile “What’s new” when I present, so here is the list of resources I use.

August 16, 2014 04:32 AM


Armin Ronacher

The Python I Would Like To See

It's no secret that I'm not a fan of Python 3 or where the language is currently going. This has led to a bunch of emails flying my way over the last few months about questions about what exactly I would prefer Python would do. So I figured I might share some of my thoughts publicly to maybe leave some food for thought for future language designers :)

Python is definitely a language that is not perfect. However I think what frustrates me about the language are largely problems that have to do with tiny details in the interpreter and less the language itself. These interpreter details however are becoming part of the language and this is why they are important.

I want to take you on a journey that starts with a small oddity in the interpreter (slots) and ends up with the biggest mistake in the language design. If the reception is good there will be more posts like this.

In general though these posts will be an exploration about design decisions in the interpreter and what consequences they have on both the interpreter and the resulting language. I believe this is more interesting from a general language design point of view than as a recommendation about how to go forward with Python.

Language vs Implementation

I added this particular paragraph after I wrote the initial version of this article because I think it has been largely missed that Python as a language and CPython as the interpreter are not nearly as separate as developers might believe. There is a language specification but in many cases it just codifies what the interpreter does or is even lacking.

In this particular case this obscure implementation detail of the interpreter changed or influenced the language design and also forced other Python implementations to adopt it. For instance PyPy does not know anything about slots (I presume) but it still has to operate as if slots were part of the interpreter.

Slots

By far my biggest problem with the language is the stupid slot system. I do not mean the __slots__ but the internal type slots for special methods. These slots are a "feature" of the language which is largely missed because it is something you rarely need to be concerned with. That said, the fact that slots exist is in my opinion the biggest problem of the language.

So what's a slot? A slot is the side effect of how the interpreter is implemented internally. Every Python programmer knows about "dunder methods": things like __add__. These methods start with two underscores, the name of the special method, and two underscores again. As each developer knows, a + b is something like a.__add__(b).

Unfortunately that is a lie.

Python does not actually work that way internally, at least not nowadays. Instead, here is roughly how the interpreter works:

  1. When a type gets created the interpreter finds all descriptors on the class and will look for special methods like __add__.

  2. For each special method the interpreter finds it puts a reference to the descriptor into a predefined slot on the type object.

    For instance the special method __add__ corresponds to two internal slots: tp_as_number->nb_add and tp_as_sequence->sq_concat.

  3. When the interpreter wants to evaluate a + b it will invoke something like TYPE_OF(a)->tp_as_number->nb_add(a, b) (more complicated than that because __add__ actually has multiple slots).

So on the surface a + b does something like type(a).__add__(a, b) but even that is not correct as you can see from the slot handling. You can easily verify that yourself by implementing __getattribute__ on a metaclass and attempting to hook a custom __add__ in. You will notice that it's never invoked.
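Here is a quick sketch of that experiment (Python 3 syntax; names invented for the demo):

class Meta(type):
    def __getattribute__(cls, name):
        print('metaclass lookup:', name)
        return type.__getattribute__(cls, name)

class A(object, metaclass=Meta):
    def __add__(self, other):
        return 42

a, b = A(), A()
a + b        # goes through the nb_add slot: no 'metaclass lookup' is printed
A.__add__    # explicit attribute access does trigger the metaclass hook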

The slot system in my mind is absolutely ridiculous. It's an optimization that helps for some very specific types in the interpreter (like integers) but it actually makes no sense for other types.

To demonstrate this, consider this completely pointless type (x.py):

class A(object):
    def __add__(self, other):
        return 42

Since we have an __add__ method the interpreter will set this up in a slot. So how fast is it? When we do a + b we will use the slots, so here is what it clocks in at:

$ python3 -mtimeit -s 'from x import A; a = A(); b = A()' 'a + b'
1000000 loops, best of 3: 0.256 usec per loop

If we do however a.__add__(b) we bypass the slot system. Instead the interpreter is looking in the instance dictionary (where it will not find anything) and then looks in the type's dictionary where it will find the method. Here is where that clocks in at:

$ python3 -mtimeit -s 'from x import A; a = A(); b = A()' 'a.__add__(b)'
10000000 loops, best of 3: 0.158 usec per loop

Can you believe it: the version without slots is actually faster. What magic is that? I'm not entirely sure what the reason for this is, but it has been like this for a long, long time. In fact, old style classes (which did not have slots) were much faster than new style classes for operators, and had more features.

More features? Yes, because old style classes could do this (Python 2.7):

>>> original = 42
>>> class FooProxy:
...  def __getattr__(self, x):
...   return getattr(original, x)
...
>>> proxy = FooProxy()
>>> proxy
42
>>> 1 + proxy
43
>>> proxy + 1
43

Yes. We have fewer features today than we had in Python 2, in exchange for a more complex type system, because the code above cannot be done with new style classes. And it's actually worse than that if you consider how lightweight old style classes were:

>>> import sys
>>> class OldStyleClass:
...  pass
...
>>> class NewStyleClass(object):
...  pass
...
>>> sys.getsizeof(OldStyleClass)
104
>>> sys.getsizeof(NewStyleClass)
904

Where do Slots Come From?

This raises the question why slots exist. As far as I can tell the slot system exists because of legacy more than anything else. When the Python interpreter was created initially, builtin types like strings and others were implemented as global and statically allocated structs which held all the special methods a type needs to have. This was before __add__ was a thing. If you check out a Python from 1990 you can see how objects were built back then.

This for instance is how integers looked:

static number_methods int_as_number = {
    intadd, /*tp_add*/
    intsub, /*tp_subtract*/
    intmul, /*tp_multiply*/
    intdiv, /*tp_divide*/
    intrem, /*tp_remainder*/
    intpow, /*tp_power*/
    intneg, /*tp_negate*/
    intpos, /*tp_plus*/
};

typeobject Inttype = {
    OB_HEAD_INIT(&Typetype)
    0,
    "int",
    sizeof(intobject),
    0,
    free,       /*tp_dealloc*/
    intprint,   /*tp_print*/
    0,          /*tp_getattr*/
    0,          /*tp_setattr*/
    intcompare, /*tp_compare*/
    intrepr,    /*tp_repr*/
    &int_as_number, /*tp_as_number*/
    0,          /*tp_as_sequence*/
    0,          /*tp_as_mapping*/
};

As you can see, even in the first version of Python that was ever released, tp_as_number was a thing. Unfortunately at one point the repo probably got corrupted for old revisions, so in those very old releases of Python important things (such as the actual interpreter) are missing, and we need to look a little bit into the future to see how these objects were implemented. By 1993 this is what the interpreter's add opcode callback looked like:

static object *
add(v, w)
    object *v, *w;
{
    if (v->ob_type->tp_as_sequence != NULL)
        return (*v->ob_type->tp_as_sequence->sq_concat)(v, w);
    else if (v->ob_type->tp_as_number != NULL) {
        object *x;
        if (coerce(&v, &w) != 0)
            return NULL;
        x = (*v->ob_type->tp_as_number->nb_add)(v, w);
        DECREF(v);
        DECREF(w);
        return x;
    }
    err_setstr(TypeError, "bad operand type(s) for +");
    return NULL;
}

So when were __add__ and others implemented? From what I can see they appear in 1.1. I actually managed to get a Python 1.1 to compile on OS X 10.9 with a bit of fiddling:

$ ./python -v
Python 1.1 (Aug 16 2014)
Copyright 1991-1994 Stichting Mathematisch Centrum, Amsterdam

Sure. It likes to crash and not everything works, but it gives you an idea of what Python was like back then. For instance there was a huge split between types implemented in C and Python:

$ ./python test.py
Traceback (innermost last):
  File "test.py", line 1, in ?
    print dir(1 + 1)
TypeError: dir() argument must have __dict__ attribute

As you can see, no introspection of builtin types such as integers. In fact, while __add__ was supported for custom classes, it was a feature exclusive to custom classes:

>>> (1).__add__(2)
Traceback (innermost last):
  File "<stdin>", line 1, in ?
TypeError: attribute-less object

So this is the heritage we still have in Python today. The general layout of a Python type has not changed, but it has been patched on top of for many, many years.

A Modern PyObject

So today many would argue the difference between a Python object implemented in the C interpreter and a Python object implemented in actual Python code is very minimal. In Python 2.7 the biggest difference seemed to be that the default __repr__ reported class for types implemented in Python and type for types implemented in C. In fact this difference in the repr indicated whether a type was statically allocated (type) or dynamically allocated on the heap (class). It did not make a practical difference and is entirely gone in Python 3. Special methods are replicated to slots and vice versa. For the most part, the difference between Python and C classes seems to have disappeared.

However they are still very different unfortunately. Let's have a look.

As every Python developer knows, Python classes are "open". You can look into them, see all the state they store, and detach and reattach methods on them even after the class declaration has finished. This dynamic nature is not available for interpreter classes. Why is that?

There is no technical restriction in itself on why you could not attach another method to, say, the dict type. The reason the interpreter does not let you do that actually has less to do with protecting programmer sanity than with the fact that builtin types are not allocated on the heap. To understand the wide-ranging consequences of this you need to understand how the Python language starts the interpreter.
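You can see the restriction directly in the interpreter; the attribute name below is made up, the error message is CPython's:

>>> class MyDict(dict):
...     pass
...
>>> MyDict.greet = lambda self: 'hello'   # heap type: works fine
>>> dict.greet = lambda self: 'hello'     # static builtin type: refused
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: can't set attributes of built-in/extension type 'dict'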

The Damn Interpreter

In Python the interpreter startup is a very expensive process. Whenever you start the Python executable you invoke a huge machinery that does pretty much everything. Among other things it will bootstrap the internal types, set up the import machinery, import some required modules, work with the OS to handle signals and to accept the command line parameters, set up internal state, etc. When it's finally done it will run your code and shut down. Python has been doing it like this for 25 years now.

In pseudocode, this is how it looks:

/* called once */
bootstrap()

/* these three could be called in a loop if you prefer */
initialize()
rv = run_code()
finalize()

/* called once */
shutdown()

The problem with this is that Python's interpreter has a huge amount of global state. In fact, you can only have one interpreter. A much better design would be to set up the interpreter and run something on it:

interpreter *iptr = make_interpreter();
interpreter_run_code(iptr);
finalize_interpreter(iptr);

This is in fact how many other dynamic languages work. For instance this is how Lua implementations operate, how JavaScript engines work, etc. The clear advantage is that you can have two interpreters. What a novel concept.

Who needs multiple interpreters? You would be surprised. Even Python needs them, or at least thought they were useful. For instance, sub interpreters exist so that an application embedding Python can have things run independently (think of web applications implemented in mod_python; they want to run in isolation). Sub interpreters work within the one interpreter, but there is still a huge amount of global state. The biggest piece of global state is also the most controversial one: the global interpreter lock. Python already decided on this one-interpreter concept, so there is lots of data shared between sub interpreters. As that data is shared, there needs to be a lock around all of it, and that lock is on the actual interpreter. What data is shared?

If you look at the code I pasted above you can see these huge structs sitting around. These structs are actually sitting around as global variables. In fact the interpreter exposes those type structs directly to the Python code. This is enabled by the OB_HEAD_INIT(&Typetype) macro which gives the struct the necessary header so that the interpreter can work with it. For instance, the refcount of the type lives in that header.

Now you can see where this is going. These objects are shared between sub interpreters. So imagine you could modify this object in your Python code: two completely independent pieces of Python code that have nothing to do with each other could change each other's state. Imagine this was in JavaScript, and the Facebook tab could change the implementation of the builtin array type, with the Google tab immediately seeing the effects.

This design decision from 1990 or so still has ripples that can be felt today.

On the bright side, the immutability of builtin types has generally been accepted as a good feature by the community. The problems of mutable builtin types have been demonstrated by other programming languages, and it's not something we missed much.

There is more though.

What's a VTable?

So Python types coming from C are largely immutable. What else is different though? The other big difference also has to do with the open nature of classes in Python. Classes implemented in Python have their methods as "virtual". While there is no "real" C++ style vtable (all methods are stored in the class dictionary and found through a lookup algorithm), it boils down to pretty much the same thing. The consequences are quite clear: when you subclass something and override a method, there is a good chance another method will be indirectly modified in the process, because it calls into the one you overrode.

A good example is collections. Lots of collections have convenience methods. As an example, a dictionary in Python has two methods to retrieve an object from it: __getitem__() and get(). When you implement a class in Python you will usually implement one through the other by doing something like return self.__getitem__(key) in get(key).
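A sketch of that pattern for a mapping implemented in Python (a hypothetical class, just to show the reuse):

class MyMapping(object):
    def __init__(self, data):
        self._data = dict(data)

    def __getitem__(self, key):
        return self._data[key]

    def get(self, key, default=None):
        # get() is implemented through __getitem__, so a subclass that
        # overrides __getitem__ indirectly changes get() as well.
        try:
            return self.__getitem__(key)
        except KeyError:
            return default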

For types implemented by the interpreter that is different. The reason is again the difference between slots and the dictionary. Say you want to implement a dictionary in the interpreter. Your goal is to reuse code still, so you want to call __getitem__ from get. How do you go about this?

A Python method in C is just a C function with a specific signature. That is the first problem. That function's first purpose is to handle the Python level parameters and convert them into something you can use on the C layer. At the very least you need to pull the individual arguments from a Python tuple or dict (args and kwargs) into local variables. So a common pattern is that dict__getitem__ internally does just the argument parsing and then calls into something like dict_do_getitem with the actual parameters. You can see where this is going: dict__getitem__ and dict_get would both call into dict_do_getitem, which is an internal static function. You cannot override that.

There really is no good way around this, and the reason is related to the slot system: there is no good way for the interpreter to internally issue a call through the vtable without going crazy. The reason for that, in turn, is related to the global interpreter lock. When you are a dictionary your API contract to the outside world is that your operations are atomic. That contract completely goes out of the window when your internal call goes through a vtable, because that call might now go through Python code which needs to manage the global interpreter lock itself, or you will run into massive problems.

Imagine the pain of a dictionary subclass overriding an internal dict_get which would kick off a lazy import. You throw all your guarantees out of the window. Then again, maybe we should have done that a long time ago.

For Future Reference

In recent years there has been a clear trend of making Python more complex as a language. I would like to see the inverse of that trend.

I would like to see an internal interpreter design based on interpreters that work independently of each other, with local base types and more, similar to how JavaScript works. This would immediately open up the door again for embedding and for concurrency based on message passing. CPUs won't get any faster :)

Instead of having slots and dictionaries as a vtable thing, let's experiment with just dictionaries. Objective-C as a language is entirely based on messages, and it has made huge advances in making its calls fast. From what I can see, those calls are much faster than Python's calls in the best case. Strings are interned in Python anyway, making comparisons very fast. I bet you it's not slower, and even if it was a tiny bit slower, it's a much simpler system that would be easier to optimize.
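A toy version of what "just dictionaries" could look like, purely as a sketch (all names invented):

def send(obj, message, *args):
    # Walk the MRO and dispatch purely via dictionary lookup: no type slots.
    for klass in type(obj).__mro__:
        handler = klass.__dict__.get(message)
        if handler is not None:
            return handler(obj, *args)
    raise TypeError('%s does not respond to %r' % (type(obj).__name__, message))

class Point(object):
    def __init__(self, x, y):
        self.x, self.y = x, y

    def add(self, other):
        return Point(self.x + other.x, self.y + other.y)

p = send(Point(1, 2), 'add', Point(3, 4))   # a Point(4, 6)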

You should have a look through the Python codebase to see how much extra logic is required to handle the slot system. It's pretty incredible.

I am very much convinced the slot system was a bad idea and should have been ripped out a long time ago. The removal might even have benefited PyPy, because I'm pretty sure they need to go out of their way to make their interpreter work like the CPython one to achieve compatibility.

August 16, 2014 12:00 AM


Anatoly Techtonik

ANN: sha1 1.0 - command line tool to calculate file hash

Windows doesn't have a native tool to calculate SHA-1, so I've made one that can be easily installed and used from the command line on any operating system, thanks to Python:
python -m pip install sha1
python -m sha1 <filename>
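For the curious, the core of such a tool is only a few lines of standard library code (a minimal sketch, not necessarily how the sha1 package is implemented):

import hashlib
import sys

def sha1_of_file(path, chunk_size=1 << 16):
    digest = hashlib.sha1()
    with open(path, 'rb') as f:
        # Read in chunks so large files don't have to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()

if __name__ == '__main__':
    print(sha1_of_file(sys.argv[1]))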

August 16, 2014 12:13 AM

August 15, 2014


Muharem Hrnjadovic

Software Engineer in Test (local or remote)

Monetas is looking for experienced software engineers to join the Q/A team. Responsibilities:

Desired Skills and Experience

If successful, you will be testing the software in the Open Transactions [1] and voting pools [2,3] eco-system.

NB: we are a distributed organisation and applications by remote candidates are welcome. We do not sponsor relocations to Switzerland at this time however.

[1] http://opentransactions.org/
[2] http://opentransactions.org/wiki/index.php?title=Category:Voting_Pools
[3] http://www.cryptocoinsnews.com/news/open-transactions-multisig-voting-pools/2014/05/23

To apply please send your CV to: careers AT monetas DOT net.


August 15, 2014 02:07 PM


V.S. Babu

How to beat workday blues?

Let us face it: all of us feel like we've achieved or done very little after spending a long day away from family. Then you look back and find that you could at least have spent some of that time with family!

I've been observing my work habits a lot and I think I have found out something that works for me.

I am summarizing these as a NOT-TODO list of 3 items. I am a software engineer by profession and by passion.

Has this worked for me? Absolutely, much better than when I was not following these rules.

August 15, 2014 11:44 AM