Investigating CPython’s Optimisation Trickery for Lichen
For those of us old enough to remember how the Python programming language was twenty or so years ago, nostalgic for a simpler and kinder time, looking to escape the hard reality of today’s feature enhancement arguments, controversies, general bitterness and recriminations, it can be informative to consider what was appealing about Python all those years ago. Some of us even take this slightly further and attempt to formulate our own take on the language, casting aside things that do not appeal or that seem superfluous, needlessly confusing, or redundant.
My own Python variant, called Lichen, strips away quite a few things from today’s Python but probably wouldn’t seem so different to twentieth century Python. Since my primary objective with Lichen is to facilitate static analysis so that observations can be made about program behaviour before running the program, certain needlessly-dynamic features have been eliminated. Usually, when such statements about feature elimination are made, people seize upon them to claim that the resulting language is statically typed, but this is deliberately not the case here. Indeed, “duck typing” is still as viable as ever in Lichen.
Ancient Dynamism
An example of needless dynamism in Python can arguably be found with the __getattr__ and __setattr__ methods, introduced as far back as Python 1.1. They allow accesses to attributes via instances to be intercepted and values supposedly provided by these attributes to be computed on demand. In effect, these methods support virtual or dynamic attributes that are not really present on an object. Here’s an extract from one of the Python 1.2 demonstration programs (Demo/pdist/client.py):
def __getattr__(self, name):
    if name in self._methods:
        method = _stub(self, name)
        setattr(self, name, method) # XXX circular reference
        return method
    raise AttributeError, name
In this code, if an instance of the Client class (from which this method is taken) is used to access an attribute called hello, then this method will see if the string “hello” is found in the instance’s _methods attribute, and if so it produces a special object that is then returned as the value for the hello attribute. Otherwise, it raises an exception to indicate that this virtual attribute is not recognised. (Here, the setattr call stores the special object as a genuine attribute in order to save this method from being called again for the same attribute.)
Admittedly, this is quite neat, and it quickly becomes tempting to use such facilities everywhere – this is very much the story of Python and its development – but such things make reasoning about programs more difficult. We cannot know what attributes the instances of this Client class may have without running the program. Indeed, to find out in this case, running the program is literally unavoidable since the _methods attribute is actually populated using the result of a message received over the network!
But even in simpler cases, it is intuitively clear that finding out the supported attributes of instances whose class offers such a method might involve a complicated exercise of looking at practically all the code in a program. Despite all the hard work, this exercise will nevertheless produce unreliable or imprecise results. It says something about the fragility of such facilities that properties were later added in Python 2.2 to offer a more declarative alternative.
(It also says something about Python 3 that the cornucopia of special mechanisms for dynamically exposing attributes is apparently still present, despite much having been said about Python 3 remedying such Python 1 and 2 design artefacts.)
Hidden Motives
With static analysis, we might expect to be able to deduce which attributes are provided by class instances without running a program, potentially allowing us to determine the structure of program objects and to detect errors around their use. But another objective with Lichen is to see how constraints on the language may be used to optimise the performance of programs. I will not deny that performance has always been an interest of mine with respect to Python and its implementations, and I imagine that many compiler and virtual machine implementers have been motivated by such concerns throughout the years.
The deductions made during static analysis can potentially allow us to generate executable programs that perform the same work more efficiently. For example, if it is known that a collection of method calls on an object identify that object as being of a certain type, we can then employ more efficient ways of calling those methods. So, for the following code…
while number:
    digits.append(_hexdigits[number % base])
    number = number / base
…if we can assert that digits is a list, then considering that we might normally generate code for the append method call as something like this…
__load_via_class(digits, append)(...)
…where the __load_via_class operation has to go and find the append method via the class of digits (or, in some cases, even look for the append attribute on the object first), we might instead be able to generate code like this…
__fn_list_append(digits, ...)
…where __fn_list_append is a genuine C function and the digits instance is passed directly to it, together with the elided arguments. When we can get this kind of thing to happen, it can obviously be very satisfying. Various Python implementations and tools also attempt to make method calls efficient in their own ways, some possibly relying on run-time caches that short-circuit the exercise of finding the method.
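To give a flavour of what such a specialised function might look like at the C level, here is a minimal sketch; the object layout and the single-item signature are illustrative assumptions of mine, not Lichen’s actual runtime structures:

#include <stdlib.h>

/* Hypothetical list layout, purely for illustration. */
typedef struct
{
    void **items;    /* references to the stored objects */
    size_t length;   /* number of elements in use */
    size_t capacity; /* number of allocated slots */
} list_object;

/* A specialised append that generated code can call directly once
   static analysis has established that the receiver is a list.
   (Error handling is omitted to keep the sketch short.) */
void __fn_list_append(list_object *self, void *item)
{
    if (self->length == self->capacity)
    {
        self->capacity = self->capacity ? self->capacity * 2 : 4;
        self->items = realloc(self->items, self->capacity * sizeof(void *));
    }
    self->items[self->length++] = item;
}

The significant thing is that no attribute lookup happens at all: the generated code jumps straight to the function.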
Magic Numbers
It can be informative to compare the performance of code generated by the Lichen toolchain and the performance of the same program running in the CPython virtual machine, Python and Lichen being broadly compatible (but not identical). As I noted in my summary of 2017, the performance of generated programs was rather disheartening to see at first. I needed to employ profiling to discover where the time was being spent in my generated code that seemed not to be a comparable burden on CPython.
The practicalities of profiling are definitely beyond the scope of this article, but what I did notice was just how much time was being spent allocating space in memory for integers used by programs. I recalled that Python does some special things with integers itself, and so I set about looking for the details of its memory allocation strategies.
It turns out that integers are allocated in a simplified fashion for performance reasons, instead of using the more general allocator that is compatible with garbage collection. And not just that: a range of “small” integers is also allocated in advance when programs run, so that no time is wasted repeatedly allocating objects for numbers that would likely see common use. The details of this can be found in the Objects/intobject.c file in CPython 1.x and 2.x source distributions. Even CPython 1.0 employs this technique.
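To illustrate the idea rather than CPython’s actual code, here is a minimal sketch of such a scheme; the range boundaries and names are my own assumptions, and the range that CPython itself reserves has varied between versions:

#include <stdlib.h>

/* A preallocated pool of "small" integer objects; the range here is an
   arbitrary choice for the sketch. */
#define SMALL_INT_MIN (-5)
#define SMALL_INT_MAX 257

typedef struct
{
    long value;
} int_object;

static int_object small_ints[SMALL_INT_MAX - SMALL_INT_MIN];
static int small_ints_ready = 0;

static void init_small_ints(void)
{
    int i;
    for (i = 0; i < SMALL_INT_MAX - SMALL_INT_MIN; i++)
        small_ints[i].value = SMALL_INT_MIN + i;
    small_ints_ready = 1;
}

int_object *new_int(long value)
{
    if (!small_ints_ready)
        init_small_ints();

    /* Commonly-used small values come from the preallocated pool... */
    if (value >= SMALL_INT_MIN && value < SMALL_INT_MAX)
        return &small_ints[value - SMALL_INT_MIN];

    /* ...while anything else is allocated on demand as usual.
       (Allocation failure handling is omitted here.) */
    int_object *obj = malloc(sizeof(int_object));
    obj->value = value;
    return obj;
}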
At first, I thought I had discovered the remedy for my performance problems, but replicating similar allocation arrangements in my run-time code demonstrated that such a happy outcome was not to be so easily achieved. As I looked around for other special treatment performed by CPython, I took a closer look at the bytecode interpreter (found in Python/ceval.c), which is the mechanism that takes the compiled form of Python programs (the bytecode) and evaluates the instructions encoded in this form.
My test programs involved simple loops like this:
i = 0
while i < 100000:
    f(i)
    i += 1
And I had a suspicion that apart from allocating new integers, the operations involved in incrementing them were more costly than they were in CPython. Now, in Python versions from 1.1 onwards, the special operator methods are supported for things like the addition, subtraction, multiplication and division operators. This could conceivably lead to integer addition being supported by the following logic in one of the simpler cases:
# c = a + b
c = a.__add__(b)
But from Python 1.5 onwards, some interesting things appear in the CPython source code:
case BINARY_ADD:
    w = POP();
    v = POP();
    if (PyInt_Check(v) && PyInt_Check(w)) {
        /* INLINE: int + int */
        register long a, b, i;
        a = PyInt_AS_LONG(v);
        b = PyInt_AS_LONG(w);
        i = a + b;
Here, when handling the bytecode for the BINARY_ADD instruction, which is generated when the addition operator (plus, “+”) is used in programs, there is a quick test for two integer operands. If the conditions of this test are fulfilled, the result is computed directly (with some additional tests being performed for overflows not shown above). So, CPython was special-casing integers in two ways: with allocation tricks, and with “fast paths” in the interpreter for cases involving integers.
The Tag Team
My response to this was similarly twofold: find an efficient way of allocating integers, and introduce faster ways of handling integers when they are presented to operators. One option that the CPython implementers actually acknowledge in their source code is that of employing a different representation for integers entirely. CPython may have too much legacy baggage to make this transition, and Python 3 certainly didn’t help the implementers to make the break, it would seem, but I have a bit more flexibility.
The option in question is the so-called tagged pointer approach, where instead of having a dedicated object for each integer, with a pointer being used to reference that object, the integers themselves are represented by a value that would normally act as a pointer. But this value is not actually a valid pointer at all since it has its lowest bit set. On some processor architectures this violates a restriction imposed on pointer values; on other systems it can be adopted as a self-imposed restriction, merely ruling out the positioning of objects at odd-numbered addresses.
So, we might have the following example representations on a 32-bit architecture:
hex value     31..............................0 (bits)
0x12345678 == 0b00010010001101000101011001111000 => pointer
0x12345679 == 0b00010010001101000101011001111001 => integer
Clearing bit 0 and shifting the other bits one position to the right yields the actual integer value, which in the above case will be 152709948. It is conceivable that in future I might sacrifice another bit for encoding other non-pointer values, especially since various 32-bit architectures require word-aligned addresses, where words are positioned on boundaries that are multiples of four bytes, meaning that the lowest two bits would have to be zero for a pointer to be valid.
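Here is a minimal sketch of this encoding in C, assuming a platform where genuine objects always reside at even addresses; the function names are my own, not those of Lichen’s actual runtime:

#include <stdint.h>
#include <stdio.h>

typedef uintptr_t value_t; /* either a real pointer or a tagged integer */

/* Encode an integer by shifting it left one place and setting bit 0. */
static value_t tag_int(intptr_t i)
{
    return ((uintptr_t) i << 1) | 1;
}

/* Decode by shifting right one place, which discards the tag bit;
   this assumes arithmetic shifting of signed values, as found on
   common platforms. */
static intptr_t untag_int(value_t v)
{
    return (intptr_t) v >> 1;
}

/* A value with its lowest bit set cannot be a valid pointer here. */
static int is_tagged_int(value_t v)
{
    return (v & 1) != 0;
}

int main(void)
{
    value_t v = tag_int(152709948);
    printf("%#lx\n", (unsigned long) v);      /* prints 0x12345679 */
    if (is_tagged_int(v))
        printf("%ld\n", (long) untag_int(v)); /* prints 152709948 */
    return 0;
}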
Albeit with some additional cost incurred when handling pointers, such an approach lets us distinguish integers from other types rapidly and conveniently, which makes the second part of our strategy more efficient as well. Here, we need to identify and handle integers for the arithmetic operators, but unlike CPython, where this happens to be done in an interpreter loop, we have no such loop. Instead, we generate code for such operators that simply invokes some existing functions (written in the Lichen language and compiled to C code, another goal being to write as much of the language system in Lichen itself, not in C).
It would be rather wasteful to generate tests for integers in addition to these operator function calls every time such a call is made, but the tests can certainly reside within those functions instead. So, here is what we might do for the addition operator:
def add(a, b):
    if is_int(a) and is_int(b):
        return int_add(a, b)
    return binary_op(a, b, lambda a: a.__add__, lambda b: b.__radd__)
This code leaves me with a bit of explaining to do! Last things first: the final statement is the general case of dispatching to the operands and calling an appropriate operator method, with the binary_op function performing the logic in conjunction with the operands and some lambda functions that defer access to the special methods until they are really needed. It is probably best just to trust me that this does the job!
Before the generic operator method dispatch, however, is the test of the operands to see if they are both integers, and this should be vaguely familiar from the CPython source code. A special function is then called to add them efficiently. Note that we couldn’t use the addition (plus, “+”) operator because this code is meant to be handling that, and it would most likely send us on an infinitely recursive loop that never gets round to performing the addition! (I don’t treat the operator as a special case in this code, either. This code is compiled exactly like any other code written in the language.)
The is_int function is what I call “native”, meaning that it is implemented using low-level operations, in this case ones that test the representation of the argument to see if it has its lowest bit set, returning a true value if so. Meanwhile, int_add is largely equivalent to the addition operation seen in the CPython source code above, just with different details involved.
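Again purely as a sketch building on the tagged representation above, the native level might look something like the following; the names mirror is_int and int_add from the Lichen code, but the bodies, and especially the overflow fallback, are hypothetical rather than the actual runtime:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef uintptr_t value_t; /* either a real pointer or a tagged integer */

/* Hypothetical slow path for results that do not fit in a tagged word;
   a real runtime might raise an exception or promote to a wider type. */
static value_t int_add_slow(intptr_t x, intptr_t y)
{
    (void) x; (void) y;
    abort();
}

/* True if the lowest bit is set, meaning the value encodes an integer. */
static int is_int(value_t v)
{
    return (v & 1) != 0;
}

/* Add two tagged integers, checking that the result still fits in the
   word minus the one bit reserved for the tag. */
static value_t int_add(value_t a, value_t b)
{
    intptr_t x = (intptr_t) a >> 1; /* untag the operands */
    intptr_t y = (intptr_t) b >> 1;
    intptr_t r = x + y;             /* cannot overflow a full word */

    if (r > INTPTR_MAX / 2 || r < INTPTR_MIN / 2)
        return int_add_slow(x, y);

    return ((uintptr_t) r << 1) | 1; /* retag the result */
}

int main(void)
{
    value_t a = ((uintptr_t) 2 << 1) | 1; /* tagged 2 */
    value_t b = ((uintptr_t) 3 << 1) | 1; /* tagged 3 */
    if (is_int(a) && is_int(b))
        printf("%ld\n", (long) ((intptr_t) int_add(a, b) >> 1)); /* prints 5 */
    return 0;
}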
Progress and Reflections
Such adjustments made quite a difference to the performance of my generated code. They also make some sense: integers are used a lot in programs, not only for general arithmetic but also for counters, index values for things like lists, tuples, strings and other collections, plus a range of other mundane things whose performance can be overlooked until it proves to be suboptimal. Python has something of a reputation for having slow implementations, but CPython’s trickery here optimises in favour of fast results where they can be obtained, falling back on the slower, general mechanisms when these are required.
But I discovered that this is not the only optimisation trickery CPython does, as another program with interesting representation choices and some wildly varying running times was to demonstrate. More on that in the next article on this topic!