Paul Boddie's Free Software-related blog

Paul's activities and perspectives around Free Software

Tuple Performance Optimisations in CPython and Lichen

One of the nice things about the Python programming language is the selection of built-in data types it offers for common activities. Things like lists (collections of values) and dictionaries (key-value mappings) are very convenient and do not need much further explanation, but there is also the concept of the tuple, which is also a collection of values, like a list, but whose size and values are fixed for its entire lifespan, unlike a list. Here is what a very simple tuple looks like:

(123, "abc")

Normally, the need for a data type like this becomes apparent when programming and needing to return multiple values from a function. In languages that do not support such a convenient way of bundling things together, some extra thought is usually required to send data back to the caller, and in languages like C the technique of mutating function arguments and thus communicating such data via a function’s parameters is often used instead.

Lichen, being a Python-like language, also supports tuples. The parsing of source code, employing various existing Python libraries, involves the identification of tuple literals: occurrences of tuples written directly in the code, as seen above. For such values to have any meaning, they must be supported by a particular program representation of the tuple plus the routines that provide each tuple with their familiar characteristics. In other words, we need to provide a way of translating these values into code that makes them tangible things within a running program.

Here, Lichen differs from various Python implementations somewhat. CPython, for instance, defines practically all of the nature of its tuples in C language code (the “C” in CPython), with the pertinent file in CPython versions from 1.0 all the way to 2.7 being found as Objects/tupleobject.c within the source distribution. Meanwhile, Jython defines its tuples in Java language code, with the pertinent file being found as src/org/python/core/PyTuple.java (without the outermost src directory in very early versions). Lichen, on the other hand, implements the general form of tuples in the Lichen language itself.

Tuples All The Way Down

This seems almost nonsensical! How can Lichen’s tuples be implemented in the language itself? What trickery is involved to pull off such an illusion? Well, it might be worth clarifying what kind of code is involved and which parts of the tuple functionality are really provided by the language framework generally. The Lichen code for tuples is found in lib/__builtins__/tuple.py and has the following outline:

class tuple(sequence, hashable):
    "Implementation of tuple."

    def __init__(self, args=None):
        "Initialise the tuple."

    def __hash__(self):
        "Return a hashable value for the tuple."

    def __add__(self, other):
        "Add this tuple to 'other'."

    def __str__(self):
        "Return a string representation."

    def __bool__(self):
        "Tuples are true if non-empty."

    def __iter__(self):
        "Return an iterator."

    def __get_single_item__(self, index):
        "Return the item at the normalised (positive) 'index'."

Here, the actual code within each method has been omitted, but the outline itself defines the general structure of the data type, described by a class, representing the behaviour of each tuple. As in Python, a collection of special methods are provided to support standard operations: __hash__ supports the hash built-in function and is used when using tuples as dictionary keys; __bool__ permits the truth value testing of tuples so that they may be considered as “true” or not; and so on.

Since this definition of classes (data types) is something that needs to be supported generally, it makes sense to use the mechanisms already in place to allow us to define the tuple class in this way. Particularly notable here is the way that the tuple class inherits from other “base classes” (sequence and hashable). Indeed, why should the tuple class be different from any other class? It still needs to behave like any other class with regard to supporting things like methods, and in Lichen its values (or instances) are fundamentally just like instances of other classes.

It would, of course, be possible for me to define the tuple class in C (it being the language to which Lichen programs are compiled), but another benefit of just using the normal process of parsing and compiling the code written in the Lichen language is that it saves me the bother of having to work with such a lower-level representation and the accompanying need to update it carefully when changing its functionality. The functionality itself, being adequately expressed as Lichen code, would need to be hand-compiled to C: a tedious exercise indeed.

One can turn such questions around and ask why tuples are special things in various Python implementations. A fairly reasonable response is that CPython, at least, has evolved its implementation of types and objects over the years, starting out as a “scripting language” offering access to convenient data structures implemented in C and a type system built using those data structures. It was not until Python 2.2 that “type/class unification” became addressed, meaning that the built-in types implemented at the lowest levels – tuples amongst them – could then be treated more like “user-defined classes”, these classes being implemented in Python code.

Although the outline of a tuple class can be defined in the Lichen language itself, and although operations defining tuple behaviour are provided as Lichen code, this does not mean that everything can be implemented at this level. For example, programs written in Lichen do not manage the memory their objects use but instead delegate this task to “native” code. Moreover, some of the memory being managed may have representations that only sensibly exist at a lower level. We can start to investigate this by considering the method returning the size or length of a tuple, invoked when the len built-in function is called on a tuple:

    def __len__(self):
        "Return the length of the tuple."
        return list_len(self.__data__)

Here, the method delegates practically everything to another function, presenting the __data__ attribute of the instance involved in the method call (self). This other function actually isn’t implemented in Lichen: it is a native function that knows about memory and some low-level structures that support the tuple and list abstractions. It looks like this:

__attr __fn_native_list_list_len(__attr self, __attr _data)
{
    unsigned int size = _data.seqvalue->size;
    return __new_int(size);
}

And what it does is to treat the __data__ attribute as a special sequence structure, obtaining its size and passing that value back as an integer usable within Lichen code. The sequence structure is defined as part of the support library for compiled Lichen programs, along with routines to allocate such structures and to populate them. Other kinds of values are also represented at the native level, such as integers and character strings.

To an extent, such native representations are not so different from the special data types implemented in C within CPython and in Java within Jython. However, the Lichen implementation seeks to minimise the amount of native code dedicated to providing abstractions. Where functionality supporting a basic abstraction such as a tuple does not need to interact directly with native representations or perform “machine-level” operations, it is coded in Lichen, and this code can remain happily oblivious to the nature of the data passing through it.

There are interesting intellectual challenges involved here. One might wonder how minimal the layer of native code might need to be, for instance. With a suitable regime in place for translating Lichen code into native operations, might it be possible to do memory management, low-level arithmetic, character string operations, system calls, and more, all in the same language, not writing any (or hardly writing any) native code by hand? It is an intriguing question but also a distraction, and that leads me back towards the main topic of the article!

The Benchmarking Game

Quite a few years ago now, there was a project being run to benchmark different programming languages in order to compare their performance. It still exists, it would seem. But in the early days of this initiative, the programs were fairly simple translations between languages and the results relatively easy to digest. Later on, however, there seemed to be a choice of results depending on the hardware used to create them, and the programs became more complicated, perhaps as people saw their favourite language falling down the result tables and felt that they needed to employ a few tricks to boost their language’s standing.

I have been interested in Python implementations and their performance for a long time, and one of the programs that I have used from time to time has been the “binary trees” benchmark. You can find a more complicated version of this on the Python Interpreters Benchmarks site as well as on the original project’s site. It would appear that on both these sites, different versions are being run even for the same language implementation, presumably to showcase optimisations.

I prefer to keep things simple, however. As the Wikipedia page notes, the “binary trees” benchmark is presumably a test of memory allocation performance. What I discovered when compiling a modified version of this program, one that I had originally obtained without the adornments of multiprocessing module and generator usage, was perhaps more interesting in its own right. The first thing I found was that my generated C program was actually slower than the original program run using CPython: it took perhaps 140% of the CPython running time (48 seconds versus 34 seconds).

My previous article described various realisations that I had around integer performance optimisations in CPython. But when I first tried to investigate this issue, I was at a loss to explain it. It could be said that I had spent so much effort getting the toolchain and supporting library code into some kind of working order that I had little energy left for optimisation investigations, even though I had realised one of my main objectives and now had the basis for such investigations available to me. Perhaps a quick look at the “binary trees” code is in order, so here is an extract:

def make_tree(item, depth):
    if depth > 0:
        item2 = 2 * item
        depth -= 1
        return (item, make_tree(item2 - 1, depth), make_tree(item2, depth))
    else:
        return (item, None, None)

So, here we have some tuples in action, and in the above function, recursion takes place – the function calls itself – to make the tree, hence the function name. Consequently, we have a lot of tuples being created and can now understand what the Wikipedia page was claiming about the program. The result of this function gets presented to another function which unpacks the return value, inspects it, and then calls itself recursively, too:

def check_tree(tree):
    (item, left, right) = tree
    if left is not None:
        return item + check_tree(left) - check_tree(right)
    else:
        return item

I did wonder about all these tuples, and in the struggle to get the language system into a working state, I had cobbled together a working tuple representation, in which I didn’t really have too much confidence. But I wondered about what the program would look like in the other languages involved in the benchmarking exercise and whether tuples (or some equivalent) were also present in whichever original version that had been written for the exercise, possibly in a language like Java or C. Sure enough, the Java versions (simple version) employ class instances and not things like arrays or other anonymous data structures comparable to tuples.

So I decided to change the program to also use classes and to give these tree nodes a more meaningful form:

class Node:
    def __init__(self, item, left, right):
        self.item = item
        self.left = left
        self.right = right

def make_tree(item, depth):
    if depth > 0:
        item2 = 2 * item
        depth -= 1
        return Node(item, make_tree(item2 - 1, depth), make_tree(item2, depth))
    else:
        return Node(item, None, None)

In fact, this is a somewhat rudimentary attempt at introducing object orientation since we might also make the function a method. Meanwhile, in the function handling the return value of the above function, the tuple unpacking was changed to instead access the attributes of the returned Node instances seen above.

def check_tree(tree):
    if tree.left is not None:
        return tree.item + check_tree(tree.left) - check_tree(tree.right)
    else:
        return tree.item

Now, I expected this to be slower in CPython purely because there is more work being done, and instance creation is probably more costly than tuple creation, but I didn’t expect it to be four times slower (at around 2 minutes 15 seconds), which it was! And curiously, running the same program compiled by Lichen was actually quicker (22 seconds), which is about 65% of the original version’s running time in CPython, half the running time of the original version compiled by Lichen, and nearly an sixth of the revised version’s running time in CPython.

One may well wonder why CPython is so much slower when dealing with instances instead of tuples, and this may have been a motivation for using tuples in the benchmarking exercise, but what was more interesting to me at this point was how the code generated by the Lichen toolchain was managing to be faster for instances, especially since tuples are really just another kind of object in the Lichen implementation. So why were tuples slower, and could there be a way of transferring some of the performance of general objects to tuples?

Unpacking More Performance

The “binary trees” benchmark is supposed to give memory allocation a workout, but after the integer performance investigation, I wasn’t about to fall for the trick of blaming the allocator (provided by the Boehm-Demers-Weiser garbage collection library) whose performance is nothing I would complain about. Instead, I considered how CPython might be optimising tuple operations and paid another visit to the interpreter source code (found in Python/ceval.c in the sources for all the different releases of Python 1 and 2) and searched for tuple-related operations.

My experiments with Python over the years have occasionally touched upon the bytecode employed by CPython to represent compiled programs, each bytecode instruction being evaluated by the CPython interpreter. I already knew that some program operations were supported by specific bytecodes, and sure enough, it wasn’t long before I encountered a tuple-specific example: the UNPACK_SEQUENCE instruction (and its predecessors in Python 1.5 and earlier, UNPACK_TUPLE and UNPACK_LIST). This instruction is generated when source code like the following is used:

(item, left, right) = tree

The above would translate to something like this:

              0 LOAD_FAST                0 (tree)
              3 UNPACK_SEQUENCE          3
              6 STORE_FAST               1 (item)
              9 STORE_FAST               2 (left)
             12 STORE_FAST               3 (right)

In CPython, you can investigate the generated bytecode instructions using the dis module, putting the code of interest in a function, and running the dis.dis function on the function object, which is how I generated the above output. Here, UNPACK_SEQUENCE makes an appearance, accessing the items in the tree sequence one by one, pushing them onto the evaluation stack, CPython’s interpreter being a stack-based virtual machine. And sure enough, the interpreter capitalises on the assumption that the operand of this instruction will most likely be a tuple, testing it and then using tuple-specific operations to get at the tuple’s items.

Meanwhile, the translation of the same source code by the Lichen toolchain was rather less optimal. In the translation code, the unpacking operation from the input program is rewritten as a sequence of assignments, and something like the following was being generated:

item = tree[0]
left = tree[1]
right = tree[2]

This in turn gets processed, rewriting the subscript operations (indicated by the bracketing) to the following:

item = tree.__getitem__(0)
left = tree.__getitem__(1)
right = tree.__getitem__(2)

This in turn was being translated to C for the output program. But this is not particularly efficient: it uses a generic mechanism to access each item in the tree tuple, since it is possible that the only thing we may generally assert about tree is that it may provide the __getitem__ special method. The resulting code has to perform extra work to eventually arrive at the code that actually extracts an item from the tuple, and it will be doing this over and over again.

So, the first thing to try was to see if there was any potential for a speed-up by optimising this unpacking operation. I changed the generated C code emitted for the operations above to use the native tuple-accessing functions instead and re-ran the program. This was promising: the running time decreased from 48 seconds to 23 seconds; I had struck gold! But it was all very well demonstrating the potential. What now needed to be done was to find a general way of introducing something similarly effective that would work automatically for all programs.

Of course, I could change the initial form of the unpacking operations to use the __getitem__ method directly, but this is what was being produced anyway, so there would be no change whatsoever in the resulting program. However, I had introduced a Lichen-specific special method, used within the standard library, that accesses individual items in a given sequence instance. (It should be noted that in Python and Lichen, the __getitem__ method can accept a slice object and thus return a collection of values, not just one.) Here is what the rewritten form of the unpacking would now look like:

item = tree.__get_single_item__(0)
left = tree.__get_single_item__(1)
right = tree.__get_single_item__(2)

Compiling the program and running it gave a time of 34 seconds. We were now at parity with CPython. Ostensibly, the overhead in handling different kinds of item index (integers or slice objects) was responsible for 30% of the original version’s running time. What other overhead might there be, given that 34 seconds is still rather longer than 23 seconds? What does this other special method do that my quick hack does not?

It is always worth considering what the compiler is able to know about the program in these cases. Looking at the __get_single_item__ method for a tuple reveals something of interest:

    def __get_single_item__(self, index):
        "Return the item at the normalised (positive) 'index'."
        self._check_index(index)
        return list_element(self.__data__, index)

In the above, the index used to obtain an item is checked to see if it is valid for the tuple. Then, the list_element native function (also used on tuples) obtains the item from the low-level data structure holding all the items. But is there a need to check each index? Although we do need to make sure that accesses to do not try and read “off the end” of the collection of items, accessing items that do not exist, we do not actually need to “normalise” the index.

Such normalisation is the process of interpreting negative index values as referring to items relative to the end of the collection, with -1 referring to the last item, -2 to the next last item, and so on, all the way back to -n referring to the first item (with n being the number of items in the collection). However, the code being generated does not use negative index values, and if we introduce a test to make sure that the tuple is large enough, then we should be able to get away with operations that use the provided index values directly. So I resolved to introduce another special method for this purpose, now rewriting the code as follows:

__builtins__.sequence._test_length(tree, 3)
item = tree.__get_single_item_unchecked__(0)
left = tree.__get_single_item_unchecked__(1)
right = tree.__get_single_item_unchecked__(2)

The _test_length function will raise an exception if the length is inappropriate. Meanwhile, the newly-introduced special method is implemented in a base class of both tuples and lists, and it merely employs a call to list_element for the provided index. Compiling the code with these operations now being generated and running the result yielded a running time of 27 seconds. Some general changes to the code generation, not specific to tuples, brought this down to 24 seconds (and the original version down to 44 seconds, with the object-based version coming down to 16 seconds).

So, the progression in performance looks like this:

Program Version (Lichen Strategy) Lichen CPython
Objects 135 seconds
Tuples (__getitem__) 48 seconds

44 seconds
Tuples (__get_single_item__) 34 seconds  34 seconds
Tuples (__get_single_item_unchecked__) 27 seconds

24 seconds
Objects 22 seconds

16 seconds

Here, the added effect of these other optimisations is also shown where measured.

Conclusions

As we saw with the handling of integers in CPython, optimisations also exist to tune tuple performance in that implementation of Python, and these also exist in other implementations such as Jython (see the unpackSequence method in the org.python.core.Py class, found in the org/python/core/Py.java file). Taking advantage of guarantees about accesses to tuples that are written explicitly into the program source, the generated code can avoid incurring unnecessary overhead, thus considerably speeding up the running time of programs employing tuple unpacking operations.

One might still be wondering why the object-based version of the program is faster than the tuple-based version for Lichen. This is most likely due to the ability of the compiler to make the attribute accesses on the tree object efficient based on deductions it has performed. Fewer low-level operations are performed to achieve the same result, and time is saved as a consequence. One might also wonder why the object-based version is slower when run by CPython. That would probably be due to the flexible but costly way objects are represented and accessed in that language implementation, and this was indeed one of my motivations for exploring other language design and implementation approaches with Lichen.