Paul Boddie's Free Software-related blog


Archive for July, 2018

Tuple Performance Optimisations in CPython and Lichen

Sunday, July 22nd, 2018

One of the nice things about the Python programming language is the selection of built-in data types it offers for common activities. Things like lists (collections of values) and dictionaries (key-value mappings) are very convenient and do not need much further explanation, but there is also the concept of the tuple, which is also a collection of values, like a list, but whose size and values are fixed for its entire lifespan, unlike a list. Here is what a very simple tuple looks like:

(123, "abc")

Normally, the need for a data type like this becomes apparent when one needs to return multiple values from a function. In languages that do not support such a convenient way of bundling things together, some extra thought is usually required to send data back to the caller, and in languages like C the technique of mutating function arguments, thus communicating such data via a function’s parameters, is often used instead.
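
To make this concrete, here is a minimal Python sketch (not specific to Lichen) of returning multiple values as a tuple, with the function and names invented purely for illustration:

def divide(dividend, divisor):
    "Return the quotient and remainder as a tuple."
    return (dividend // divisor, dividend % divisor)

# The caller unpacks the returned tuple into separate names.
(quotient, remainder) = divide(123, 10)
print(quotient, remainder)  # 12 3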

Lichen, being a Python-like language, also supports tuples. The parsing of source code, employing various existing Python libraries, involves the identification of tuple literals: occurrences of tuples written directly in the code, as seen above. For such values to have any meaning, they must be supported by a particular program representation of the tuple plus the routines that provide each tuple with their familiar characteristics. In other words, we need to provide a way of translating these values into code that makes them tangible things within a running program.

Here, Lichen differs from various Python implementations somewhat. CPython, for instance, defines practically all of the nature of its tuples in C language code (the “C” in CPython), with the pertinent file in CPython versions from 1.0 all the way to 2.7 being found as Objects/tupleobject.c within the source distribution. Meanwhile, Jython defines its tuples in Java language code, with the pertinent file being found as src/org/python/core/PyTuple.java (without the outermost src directory in very early versions). Lichen, on the other hand, implements the general form of tuples in the Lichen language itself.

Tuples All The Way Down

This seems almost nonsensical! How can Lichen’s tuples be implemented in the language itself? What trickery is involved to pull off such an illusion? Well, it might be worth clarifying what kind of code is involved and which parts of the tuple functionality are really provided by the language framework generally. The Lichen code for tuples is found in lib/__builtins__/tuple.py and has the following outline:

class tuple(sequence, hashable):
    "Implementation of tuple."

    def __init__(self, args=None):
        "Initialise the tuple."

    def __hash__(self):
        "Return a hashable value for the tuple."

    def __add__(self, other):
        "Add this tuple to 'other'."

    def __str__(self):
        "Return a string representation."

    def __bool__(self):
        "Tuples are true if non-empty."

    def __iter__(self):
        "Return an iterator."

    def __get_single_item__(self, index):
        "Return the item at the normalised (positive) 'index'."

Here, the actual code within each method has been omitted, but the outline itself defines the general structure of the data type, described by a class, representing the behaviour of each tuple. As in Python, a collection of special methods are provided to support standard operations: __hash__ supports the hash built-in function and is used when using tuples as dictionary keys; __bool__ permits the truth value testing of tuples so that they may be considered as “true” or not; and so on.
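
As a quick Python illustration of the operations these methods support (Lichen behaves in the same general way), consider the following:

t = (123, "abc")

# __hash__ supports the hash built-in and use as a dictionary key.
d = {t: "value"}
print(hash(t))
print(d[(123, "abc")])   # an equal tuple finds the same entry

# __bool__ (__nonzero__ in Python 2) supports truth value testing.
if t and not ():
    print("non-empty tuples are true, empty ones false")

# __len__ supports the len built-in.
print(len(t))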

Since this definition of classes (data types) is something that needs to be supported generally, it makes sense to use the mechanisms already in place to allow us to define the tuple class in this way. Particularly notable here is the way that the tuple class inherits from other “base classes” (sequence and hashable). Indeed, why should the tuple class be different from any other class? It still needs to behave like any other class with regard to supporting things like methods, and in Lichen its values (or instances) are fundamentally just like instances of other classes.

It would, of course, be possible for me to define the tuple class in C (it being the language to which Lichen programs are compiled), but another benefit of just using the normal process of parsing and compiling code written in the Lichen language is that it saves me the bother of working with such a lower-level representation and of carefully updating it whenever the functionality changes. Otherwise, the functionality, although adequately expressed as Lichen code, would need to be hand-compiled to C: a tedious exercise indeed.

One can turn such questions around and ask why tuples are special things in various Python implementations. A fairly reasonable response is that CPython, at least, has evolved its implementation of types and objects over the years, starting out as a “scripting language” offering access to convenient data structures implemented in C and a type system built using those data structures. It was not until Python 2.2 that “type/class unification” was addressed, meaning that the built-in types implemented at the lowest levels – tuples amongst them – could then be treated more like “user-defined classes”, these classes being implemented in Python code.

Although the outline of a tuple class can be defined in the Lichen language itself, and although operations defining tuple behaviour are provided as Lichen code, this does not mean that everything can be implemented at this level. For example, programs written in Lichen do not manage the memory their objects use but instead delegate this task to “native” code. Moreover, some of the memory being managed may have representations that only sensibly exist at a lower level. We can start to investigate this by considering the method returning the size or length of a tuple, invoked when the len built-in function is called on a tuple:

    def __len__(self):
        "Return the length of the tuple."
        return list_len(self.__data__)

Here, the method delegates practically everything to another function, presenting the __data__ attribute of the instance involved in the method call (self). This other function actually isn’t implemented in Lichen: it is a native function that knows about memory and some low-level structures that support the tuple and list abstractions. It looks like this:

__attr __fn_native_list_list_len(__attr self, __attr _data)
{
    unsigned int size = _data.seqvalue->size;
    return __new_int(size);
}

And what it does is to treat the __data__ attribute as a special sequence structure, obtaining its size and passing that value back as an integer usable within Lichen code. The sequence structure is defined as part of the support library for compiled Lichen programs, along with routines to allocate such structures and to populate them. Other kinds of values are also represented at the native level, such as integers and character strings.

To an extent, such native representations are not so different from the special data types implemented in C within CPython and in Java within Jython. However, the Lichen implementation seeks to minimise the amount of native code dedicated to providing abstractions. Where functionality supporting a basic abstraction such as a tuple does not need to interact directly with native representations or perform “machine-level” operations, it is coded in Lichen, and this code can remain happily oblivious to the nature of the data passing through it.

There are interesting intellectual challenges involved here. One might wonder how minimal the layer of native code might need to be, for instance. With a suitable regime in place for translating Lichen code into native operations, might it be possible to do memory management, low-level arithmetic, character string operations, system calls, and more, all in the same language, not writing any (or hardly writing any) native code by hand? It is an intriguing question but also a distraction, and that leads me back towards the main topic of the article!

The Benchmarking Game

Quite a few years ago now, there was a project being run to benchmark different programming languages in order to compare their performance. It still exists, it would seem. But in the early days of this initiative, the programs were fairly simple translations between languages and the results relatively easy to digest. Later on, however, there seemed to be a choice of results depending on the hardware used to create them, and the programs became more complicated, perhaps as people saw their favourite language falling down the result tables and felt that they needed to employ a few tricks to boost their language’s standing.

I have been interested in Python implementations and their performance for a long time, and one of the programs that I have used from time to time has been the “binary trees” benchmark. You can find a more complicated version of this on the Python Interpreters Benchmarks site as well as on the original project’s site. It would appear that on both these sites, different versions are being run even for the same language implementation, presumably to showcase optimisations.

I prefer to keep things simple, however. As the Wikipedia page notes, the “binary trees” benchmark is presumably a test of memory allocation performance. What I discovered when compiling a modified version of this program, one that I had originally obtained without the adornments of the multiprocessing module and generator usage, was perhaps more interesting in its own right. The first thing I found was that my generated C program was actually slower than the original program run using CPython: it took perhaps 140% of the CPython running time (48 seconds versus 34 seconds).

My previous article described various realisations that I had around integer performance optimisations in CPython. But when I first tried to investigate this issue, I was at a loss to explain it. It could be said that I had spent so much effort getting the toolchain and supporting library code into some kind of working order that I had little energy left for optimisation investigations, even though I had realised one of my main objectives and now had the basis for such investigations available to me. Perhaps a quick look at the “binary trees” code is in order, so here is an extract:

def make_tree(item, depth):
    if depth > 0:
        item2 = 2 * item
        depth -= 1
        return (item, make_tree(item2 - 1, depth), make_tree(item2, depth))
    else:
        return (item, None, None)

So, here we have some tuples in action, and in the above function, recursion takes place – the function calls itself – to make the tree, hence the function name. Consequently, we have a lot of tuples being created and can now understand what the Wikipedia page was claiming about the program. The result of this function gets presented to another function which unpacks the return value, inspects it, and then calls itself recursively, too:

def check_tree(tree):
    (item, left, right) = tree
    if left is not None:
        return item + check_tree(left) - check_tree(right)
    else:
        return item

I did wonder about all these tuples, and in the struggle to get the language system into a working state, I had cobbled together a working tuple representation, in which I didn’t really have too much confidence. But I wondered about what the program would look like in the other languages involved in the benchmarking exercise and whether tuples (or some equivalent) were also present in whatever original version had been written for the exercise, possibly in a language like Java or C. Sure enough, the Java versions (simple version) employ class instances and not things like arrays or other anonymous data structures comparable to tuples.

So I decided to change the program to also use classes and to give these tree nodes a more meaningful form:

class Node:
    def __init__(self, item, left, right):
        self.item = item
        self.left = left
        self.right = right

def make_tree(item, depth):
    if depth > 0:
        item2 = 2 * item
        depth -= 1
        return Node(item, make_tree(item2 - 1, depth), make_tree(item2, depth))
    else:
        return Node(item, None, None)

In fact, this is a somewhat rudimentary attempt at introducing object orientation since we might also make the function a method. Meanwhile, in the function handling the return value of the above function, the tuple unpacking was changed to instead access the attributes of the returned Node instances seen above.

def check_tree(tree):
    if tree.left is not None:
        return tree.item + check_tree(tree.left) - check_tree(tree.right)
    else:
        return tree.item

Now, I expected this to be slower in CPython purely because there is more work being done, and instance creation is probably more costly than tuple creation, but I didn’t expect it to be four times slower (at around 2 minutes 15 seconds), which it was! And curiously, running the same program compiled by Lichen was actually quicker (22 seconds), which is about 65% of the original version’s running time in CPython, half the running time of the original version compiled by Lichen, and nearly a sixth of the revised version’s running time in CPython.

One may well wonder why CPython is so much slower when dealing with instances instead of tuples, and this may have been a motivation for using tuples in the benchmarking exercise, but what was more interesting to me at this point was how the code generated by the Lichen toolchain was managing to be faster for instances, especially since tuples are really just another kind of object in the Lichen implementation. So why were tuples slower, and could there be a way of transferring some of the performance of general objects to tuples?

Unpacking More Performance

The “binary trees” benchmark is supposed to give memory allocation a workout, but after the integer performance investigation, I wasn’t about to fall for the trick of blaming the allocator (provided by the Boehm-Demers-Weiser garbage collection library) whose performance is nothing I would complain about. Instead, I considered how CPython might be optimising tuple operations and paid another visit to the interpreter source code (found in Python/ceval.c in the sources for all the different releases of Python 1 and 2) and searched for tuple-related operations.

My experiments with Python over the years have occasionally touched upon the bytecode employed by CPython to represent compiled programs, each bytecode instruction being evaluated by the CPython interpreter. I already knew that some program operations were supported by specific bytecodes, and sure enough, it wasn’t long before I encountered a tuple-specific example: the UNPACK_SEQUENCE instruction (and its predecessors in Python 1.5 and earlier, UNPACK_TUPLE and UNPACK_LIST). This instruction is generated when source code like the following is used:

(item, left, right) = tree

The above would translate to something like this:

              0 LOAD_FAST                0 (tree)
              3 UNPACK_SEQUENCE          3
              6 STORE_FAST               1 (item)
              9 STORE_FAST               2 (left)
             12 STORE_FAST               3 (right)

In CPython, you can investigate the generated bytecode instructions using the dis module, putting the code of interest in a function, and running the dis.dis function on the function object, which is how I generated the above output. Here, UNPACK_SEQUENCE makes an appearance, accessing the items in the tree sequence one by one, pushing them onto the evaluation stack, CPython’s interpreter being a stack-based virtual machine. And sure enough, the interpreter capitalises on the assumption that the operand of this instruction will most likely be a tuple, testing it and then using tuple-specific operations to get at the tuple’s items.
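
For instance, the disassembly above can be reproduced with something like the following, the function name here being arbitrary:

import dis

def unpack(tree):
    (item, left, right) = tree

# Show the bytecode for the function, including UNPACK_SEQUENCE
# (the exact output varies between CPython versions).
dis.dis(unpack)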

Meanwhile, the translation of the same source code by the Lichen toolchain was rather less optimal. In the translation code, the unpacking operation from the input program is rewritten as a sequence of assignments, and something like the following was being generated:

item = tree[0]
left = tree[1]
right = tree[2]

This in turn gets processed, rewriting the subscript operations (indicated by the bracketing) to the following:

item = tree.__getitem__(0)
left = tree.__getitem__(1)
right = tree.__getitem__(2)

This in turn was being translated to C for the output program. But this is not particularly efficient: it uses a generic mechanism to access each item in the tree tuple, since it is possible that the only thing we may generally assert about tree is that it may provide the __getitem__ special method. The resulting code has to perform extra work to eventually arrive at the code that actually extracts an item from the tuple, and it will be doing this over and over again.

So, the first thing to try was to see if there was any potential for a speed-up by optimising this unpacking operation. I changed the generated C code emitted for the operations above to use the native tuple-accessing functions instead and re-ran the program. This was promising: the running time decreased from 48 seconds to 23 seconds; I had struck gold! But it was all very well demonstrating the potential. What now needed to be done was to find a general way of introducing something similarly effective that would work automatically for all programs.

Of course, I could change the initial form of the unpacking operations to use the __getitem__ method directly, but this is what was being produced anyway, so there would be no change whatsoever in the resulting program. However, I had introduced a Lichen-specific special method, used within the standard library, that accesses individual items in a given sequence instance. (It should be noted that in Python and Lichen, the __getitem__ method can accept a slice object and thus return a collection of values, not just one.) Here is what the rewritten form of the unpacking would now look like:

item = tree.__get_single_item__(0)
left = tree.__get_single_item__(1)
right = tree.__get_single_item__(2)

Compiling the program and running it gave a time of 34 seconds. We were now at parity with CPython. Ostensibly, the overhead in handling different kinds of item index (integers or slice objects) was responsible for 30% of the original version’s running time. What other overhead might there be, given that 34 seconds is still rather longer than 23 seconds? What does this other special method do that my quick hack does not?

It is always worth considering what the compiler is able to know about the program in these cases. Looking at the __get_single_item__ method for a tuple reveals something of interest:

    def __get_single_item__(self, index):
        "Return the item at the normalised (positive) 'index'."
        self._check_index(index)
        return list_element(self.__data__, index)

In the above, the index used to obtain an item is checked to see if it is valid for the tuple. Then, the list_element native function (also used on tuples) obtains the item from the low-level data structure holding all the items. But is there a need to check each index? Although we do need to make sure that accesses do not try to read “off the end” of the collection of items, accessing items that do not exist, we do not actually need to “normalise” the index.

Such normalisation is the process of interpreting negative index values as referring to items relative to the end of the collection, with -1 referring to the last item, -2 to the next last item, and so on, all the way back to -n referring to the first item (with n being the number of items in the collection). However, the code being generated does not use negative index values, and if we introduce a test to make sure that the tuple is large enough, then we should be able to get away with operations that use the provided index values directly. So I resolved to introduce another special method for this purpose, now rewriting the code as follows:

__builtins__.sequence._test_length(tree, 3)
item = tree.__get_single_item_unchecked__(0)
left = tree.__get_single_item_unchecked__(1)
right = tree.__get_single_item_unchecked__(2)

The _test_length function will raise an exception if the length is inappropriate. Meanwhile, the newly-introduced special method is implemented in a base class of both tuples and lists, and it merely employs a call to list_element for the provided index. Compiling the code with these operations now being generated and running the result yielded a running time of 27 seconds. Some general changes to the code generation, not specific to tuples, brought this down to 24 seconds (and the original version down to 44 seconds, with the object-based version coming down to 16 seconds).
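
To make the difference concrete, here is a rough Python sketch of the two access strategies, with list_element standing in for the native function; the names echo the Lichen methods, but the bodies are illustrative only:

def list_element(data, index):
    "Stand-in for the native function: no checking at all."
    return data[index]

def get_single_item(data, index):
    "Checked access: normalise and validate the index first."
    if index < 0:
        index = len(data) + index   # normalise negative indexes
    if not 0 <= index < len(data):
        raise IndexError(index)
    return list_element(data, index)

def get_single_item_unchecked(data, index):
    "Unchecked access: a prior length test makes checking redundant."
    return list_element(data, index)

tree = (1, 2, 3)
assert len(tree) >= 3               # roughly the _test_length step
item = get_single_item_unchecked(tree, 0)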

So, the progression in performance looks like this:

Program Version (Lichen Strategy)         Lichen                    CPython
Objects                                                             135 seconds
Tuples (__getitem__)                      48 seconds / 44 seconds
Tuples (__get_single_item__)              34 seconds                34 seconds
Tuples (__get_single_item_unchecked__)    27 seconds / 24 seconds
Objects                                   22 seconds / 16 seconds

Here, the added effect of the other, more general optimisations is also shown where measured, appearing as the second of the two values in a cell.

Conclusions

As we saw with the handling of integers in CPython, optimisations also exist to tune tuple performance in that implementation of Python, and these also exist in other implementations such as Jython (see the unpackSequence method in the org.python.core.Py class, found in the org/python/core/Py.java file). Taking advantage of guarantees about accesses to tuples that are written explicitly into the program source, the generated code can avoid incurring unnecessary overhead, thus considerably speeding up the running time of programs employing tuple unpacking operations.

One might still be wondering why the object-based version of the program is faster than the tuple-based version for Lichen. This is most likely due to the ability of the compiler to make the attribute accesses on the tree object efficient based on deductions it has performed. Fewer low-level operations are performed to achieve the same result, and time is saved as a consequence. One might also wonder why the object-based version is slower when run by CPython. That would probably be due to the flexible but costly way objects are represented and accessed in that language implementation, and this was indeed one of my motivations for exploring other language design and implementation approaches with Lichen.

Shared-Mode Executables in L4Re for MIPS-Based Devices

Sunday, July 8th, 2018

I have been meaning to write about my device driver experiments with L4Re, following on from my porting exercises, but that exercise took me along various routes and I haven’t yet got back to documenting all of them. Meanwhile, one thing that did start to bother me was how much space the software was taking up when compiled, linked and ready to deploy.

Since each of my device drivers is a separate program, and since each one may be linked to various libraries, they each started to contribute substantially to the size of the resulting file – the payload – needing to be transferred to the device. At one point, I had to resize the boot partition on the memory card used by the Letux 400 notebook computer to make the payload fit in the available space.

The work done to port L4Re to the MIPS Creator CI20 had already laid the foundations for functioning payloads, and once the final touches were put in place to support the peculiarities of the Ingenic JZ4780 system-on-a-chip, it was possible to run both the conventional “hello” example which is statically linked to its libraries, as well as a “shared-hello” example which is dynamically linked to its libraries. The latter configuration of the program results in a smaller executable program and thus a smaller payload.

So it seemed clear that I might be able to run my own programs on the Letux 400 or Ben NanoNote with similar reductions in payload size. Unfortunately, nothing ever seems to be as straightforward as it ought to be.

Exceptional Obstructions and Observations

Initially, I set about trying one of my own graphical examples with the MODE variable set to “shared” in its Makefile. This, upon powering up, merely indicated that it had not managed to start up properly. Instead of a blank screen, the viewports set up by the graphical multiplexer, Mag, were still active and showing their usual blankness. But these regions did not then change in any way when I pressed keys on the keyboard (which is functionality that I will hopefully get round to describing in another article).

I sought some general advice from the l4-hackers mailing list, but quickly realised that to make any real progress, I would need a decent way of accessing the debugging output produced by the dynamic linker. This took me on a diversion that led to my debugging capabilities being strengthened with the availability of a textual output console on the screen of my devices. I still don’t like the idea of performing hardware modifications to get access to the serial console, so this is a useful and welcome alternative.

Having switched out the “hello” program with the “shared-hello” program in the system configuration and module list demonstrating the framebuffer terminal, I deployed the payload and powered up, but I did not get the satisfying output of the program operating normally. Instead, the framebuffer terminal appeared and rewarded me with the following message:

L4Re: rom/ex_hello_shared: Unhandled exception: PC=0x800000 PFA=8d7a LdrFlgs=0

This isn’t really the kind of thing you want to see. Having not had to debug L4Re or Fiasco.OC in any serious fashion for a couple of months, I was out of practice in considering the next step, but fortunately some encouragement arrived in a private e-mail from Jean Wolter. This brought the suggestion of triggering the kernel debugger, but since this requires serial console access, it wasn’t a viable approach. But another idea that I could use involved writing out a bit more information in the routine that was producing this output.

The message in question originates in the pkg/l4re-core/l4re_kernel/server/src/region.cc file, within the Region_map::op_exception method. The details it produces are rather minimal and generic: the program counter (PC) tells us where the exception occurred; the loader flags (LdrFlgs) presumably tell us about the activity of the library loader; the mysterious “PFA” is supposedly the page fault address, but it actually seemed to be the stack pointer address on these MIPS-based systems.

On their own, these details are not particularly informative, but I suppose that more useful information could quickly become fairly specific to a particular architecture. Jean suggested looking at the structure describing the exception state, l4_exc_regs_t (defined with MIPS-specific members in pkg/l4re-core/l4sys/include/ARCH-mips/utcb.h), to see what else I might dig up. This I did, generating the following:

pc=0x800000
gp=0x82dd30
sp=0x8d7a
ra=0x802f6c
cause=0x1000002c

A few things interested me, thus motivating my choice of registers to dump. The global pointer (gp) register tells us about symbols in the problematic code. Way back in the era of getting the CI20 to run GCC-generated code, I had changed the L4Re sources so that another register (t9) would be initialised correctly, this in turn ensuring that the gp register would be set up correctly within programs, and it seemed entirely possible that I had rather too enthusiastically changed something that was now causing a problem.

The stack pointer (sp) is useful to check, just to see if it is located in a sensible region of memory, and here I discovered that this seemed to be the same as the “PFA” number. Oddly, the “PFA” seems to occupy the same place in the exception structure as any “bad virtual address” featuring in an address exception, and so I started to suspect that maybe the stack pointer was referencing the wrong part of memory. But this was partially ruled out by examining the value of the stack pointer in the “hello” example, which appeared to reference broadly the same part of memory. And, of course, the “hello” example works just fine.

In fact, the cause register indicated another kind of exception entirely, and it was one I was not really expecting: a “coprocessor unusable” exception indicating that coprocessor 1, typically a floating point arithmetic unit, was being illegally requested by an instruction. Here is how I interpreted that register’s value:

hex value   binary value
1000002c == 00010000000000000000000000101100
              --                     -----
              CE                     ExcCode

=> CE == 1; ExcCode == 11 (coprocessor unusable)
=> coprocessor 1 unusable
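
This interpretation can be verified with a few bit operations, as in the following Python sketch (the field positions being those defined by the MIPS architecture):

cause = 0x1000002c

exc_code = (cause >> 2) & 0x1f   # ExcCode: bits 6 to 2
ce = (cause >> 28) & 0x3         # CE: bits 29 and 28

print(exc_code)  # 11 (coprocessor unusable)
print(ce)        # 1 (coprocessor 1)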

Now, as I may have mentioned before, the hardware involved in this exercise does not support floating point instructions itself, and this is why I have configured compilers to use “soft-float” (software-based floating point arithmetic) support. It meant that I had to find places that might have wanted to use floating point instructions and eliminate those instructions somehow. Fortunately, only code generated by the compiler was likely to contain such instructions. But now I wondered if there weren’t some instructions of this nature lurking in places I hadn’t checked.

I had also thought to check the return address (ra) register. This tells us where the processor will jump to when it has finished executing the current routine, and since this is usually a matter of “returning” somewhere, it tells us something about the code that was being executed before the problematic routine was called. I figured that the work being done before the exception was probably going to be more important than the exception itself.

Floating Point Magic

Another debugging suggestion that now became unavoidable was to inspect the erroneous instruction. I noted above that this instruction was causing the processor to signal an illegal attempt to use an unusable – actually completely unavailable – coprocessor. Writing a numeric representation of the instruction to the display provided me with the following hexadecimal (base 16) value:

464c457f

This can be interpreted as follows in binary, with groups of bits defined for interpretation according to the MIPS instruction set architecture, and with tentative interpretations of these groups provided beneath:

010001 10010  01100 01000 10101 111111
COP1   rs/fmt rt/ft rd/fs       C.ABS.NGT

The first group of bits is the opcode field which is interpreted as a coprocessor 1 (COP1) opcode. Should we then wish to consider what the other groups mean, we might then examine the final group which could indicate a comparison instruction. However, this becomes rather hypothetical since the processor will most likely interpret the opcode field and then decide that it cannot handle the instruction.
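
The field extraction itself can be checked mechanically, as in this Python sketch:

insn = 0x464c457f

opcode = (insn >> 26) & 0x3f   # bits 31 to 26: 0b010001 (COP1)
rs     = (insn >> 21) & 0x1f   # bits 25 to 21
rt     = (insn >> 16) & 0x1f   # bits 20 to 16
rd     = (insn >> 11) & 0x1f   # bits 15 to 11
func   = insn & 0x3f           # bits 5 to 0: 0b111111

print(bin(opcode), bin(func))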

So, I started to look for places where the instruction might have been written, but no obvious locations were forthcoming. One peculiar aspect of all this is that the location of the instruction is at a rather “clean” location – 0x800000 – and some investigations indicated that this is where the library containing the problematic code gets loaded. I actually don’t remember precisely how I figured this out, but I think it was as follows.

I had looked at linker scripts that might give some details of the location of program objects, and one of them (pkg/l4re-core/ldscripts/ARCH-mips/main_dyn.ld) seemed to be related. It gave an address for the code of 0x400000. This made me think that some misconfiguration or erroneous operation was putting the observed code somewhere it shouldn’t be. But changing this address in the linker script just gave another exception at 0x400000, meaning that I had disrupted something that was intentional and probably working fine.

Meanwhile, emitting the t9 register’s value from the exception state yielded 0x800000, indicating that the calling routine had most likely jumped straight to that address, not to another address with execution having then proceeded normally until reaching the exception location. I decided to look at the instructions around the return address, these most likely being the ones that had set up the call to the exception location. Writing these locations out gave me some idea about the instructions involved. Below, I provide the stored values and their interpretations as machine instructions:

8f998250 # lw $t9, -32176($gp)
24a55fa8 # addiu $a1, $a1, 0x5fa8
0320f809 # jalr $t9
24844ee4 # addiu $a0, $a0, 0x4ee4
8fbc0010 # lw $gp, 16($sp)

One objective of doing this, apart from confirming that a jump instruction (jalr) was involved, with the t9 register being employed as is the convention with MIPS code, was to use the fragment to identify the library that was causing the error. A brute-force approach was employed here, generating “object dumps” from the library files and writing them out as files in a new directory:

mkdir tmpdir
for FILENAME in mybuild/lib/mips_32/l4f/* ; do
    mipsel-linux-gnu-objdump -d "$FILENAME" > tmpdir/`basename "$FILENAME"`
done

The textual dump files were then searched for the instruction values using grep, narrowing down the places where these instructions were found in consecutive locations. This yielded the following code, found in the libld-l4.so library:

    2f5c:       8f998250        lw      t9,-32176(gp)
    2f60:       24a55fa8        addiu   a1,a1,24488
    2f64:       0320f809        jalr    t9
    2f68:       24844ee4        addiu   a0,a0,20196
    2f6c:       8fbc0010        lw      gp,16(sp)

The integer operands for the addiu instructions are the same, of course, just being shown as decimal rather than hexadecimal values. Now, we previously saw that the return address (ra) register had the value 0x802f6c. When a MIPS processor executes a jump instruction, it will also fetch the following instruction and execute it, this being a consequence of the way the processor architecture is designed.

So, the instruction after the jump, residing in what is known as the “branch delay slot”, is not the instruction that will be visited upon returning from the called routine. Instead, it is the instruction after that. Here, we see that the return address from the jump at location 0x2f64 would be two locations later at 0x2f6c. This provides a kind of realisation that the program object – the libld-l4.so library – is positioned in memory at 0x800000: 0x2f6c added to 0x800000 gives the value of ra, 0x802f6c.

And this means that the location of the problematic instruction – the cause of our exception – is the first location within this object. Anyone with any experience of this kind of software will have realised by now that this doesn’t sound like a healthy situation: the first location within a library is not actually going to be code because these kinds of objects are packaged up in a way that permits their manipulation by other programs.

So what is the first location of a library used for? Since such objects employ the Executable and Linkable Format (ELF), we can take a look at some documentation. And we see that the first location is used to identify the kind of object, employing a “magic number” for the purpose. And that magic number would be…

464c457f

In the little-endian arrangement employed by this processor, the stored bytes are as follows:

7f
45 ('E')
4c ('L')
46 ('F')

The value was not a floating point instruction at all, but the magic number at the start of the library object! It was something of a coincidence that such a value would be interpreted as a floating point instruction, an accidentally convenient way of signalling something going badly wrong.
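
Python’s struct module shows the correspondence directly:

import struct

# Packing the "instruction" as a little-endian 32-bit word
# yields the well-known ELF magic bytes.
print(struct.pack("<I", 0x464c457f))  # b'\x7fELF'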

Missing Entries

The investigation now started to focus on how the code trying to jump to the start of the library had managed to get this incorrect address and what it was trying to do by jumping to it. I started to wonder if the global pointer (gp), whose job it is to reference the list of locations of program routines and other global data, might have been miscalculated such that attempts to load the addresses of routines would then be failing with data being fetched from the wrong places.

But looking around at code fragments where the gp register was being calculated, these fragments seemed set to calculate the correct values based on assumptions about other registers. For example, from the object dump for libld-l4.so:

00002780 <_ftext>:
    2780:       3c1c0003        lui     gp,0x3
    2784:       279cb5b0        addiu   gp,gp,-19024
    2788:       0399e021        addu    gp,gp,t9

Assuming that the processor has t9 set to 0x2780 and then jumps to the value of t9, as is the convention, the following calculation is then performed:

gp = 0x30000 (since lui loads the "upper" half-word)
gp = gp - 19024 = 0x30000 - 19024 = 0x2b5b0
gp = gp + t9 = 0x2b5b0 + 0x2780 = 0x2dd30

Using the nm tool, which tells us about symbols in program objects, it was possible to check this value:

mipsel-linux-gnu-nm -n mybuild/lib/mips_32/l4f/libld-l4.so

This shows the following at the end of the output:

0002dd30 d _gp

Also appearing somewhat earlier in the output is this, telling us where the table of symbols starts (as well as the next thing in the file):

00025d40 a _GLOBAL_OFFSET_TABLE_
00025f90 g __dso_handle

Some digging around in the L4Re source code gave a kind of confirmation that the difference between _gp and _GLOBAL_OFFSET_TABLE_ was to be expected. Here is what I found in the pkg/l4re-core/uclibc/lib/contrib/uclibc/ldso/ldso/mips/elfinterp.c file:

#define OFFSET_GP_GOT 0x7ff0

If gp, when recalculated in other places, ended up getting the same value, there didn’t seem to be anything wrong with it. Some quick inspections of neighbouring calculations indicated that this wasn’t likely to be the problem. But what about the values used in conjunction with gp? Might they be having an effect? In the case of the erroneous jump, the following calculation is involved:

lw t9,-32176(gp) => load word into t9 from the location at gp - 32176
                 => ...               from 0x2dd30 - 32176
                 => ...               from 0x25f80

The calculated address, 0x25f80, is after the start of _GLOBAL_OFFSET_TABLE_ providing entries for program routines and other things, which is a good sign, but what is perhaps more troubling is how far after the start of the table such a value is. In the above output, another symbol (__dso_handle) indicates something that is located at the end of the table. Now, although its address is still greater than the one computed above, meaning that the computation does not cause us to stray off the end of the table, the computed address is suspiciously close to the end.

There was nothing else to do than to have a look at the table contents itself, and here it was rather useful to have a way of displaying a number of values on the screen. At this point, we have to note that the addresses in use in the running system are adjusted according to the start of the loaded object, so that the table is positioned at 0x25d40 in the object dump, but in the running system we would see 0x800000 + 0x25d40 and thus 0x825d40 instead.

What I saw was that the table contained entries that varied in the expected way right up until 0x825f60 (corresponding to 0x25f60 in the object dump) being only 0x30 (or 48 bytes, or 12 entries) before the end of the table, but then all remaining entries starting at 0x825f64 (corresponding to 0x25f64) yielded a value of 0x800000, apart from 0x825f90 (corresponding to 0x25f90, right at the end of the table) which yielded itself.

Since the calculated address above (0x25f80, adjusted to 0x825f80 in the running system) lies in this final region, we now know the origin of this annoying 0x800000: it comes from entries at the end of the table that do not seem to hold meaningful values. Indeed, the object dump for the library seemed to skip over this region of the table entirely, presumably because it was left uninitialised. And using the readelf tool with the --relocs option to show “relocations”, which applies to this table, it appeared that the last entries rather confirmed my observations:

00025d34  00000003 R_MIPS_REL32
00025f90  00000003 R_MIPS_REL32

Clearly, something is missing from this table. But since something has to adjust the contents of the table to add the “base address”, 0x800000, to the entries in order to provide valid addresses within the running program, what started to intrigue me was whether the code that performed this adjustment had any idea about these missing entries, and how this code might be related to the code causing the exception situation.

Routines and Responsibilities

While considering the nature of the code causing the exception, I had been using the objdump utility with the -d (disassemble) and -D (disassemble all) options. These provide details of program sections, code routines and the machine instructions themselves. But Jean pointed out that if I really wanted to find out which part of the source code was responsible for producing certain regions of the program, I might use a combination of options: -d, -l (line numbers) and -S (source code). This was almost a revelation!

However, the code responsible for the jump to the start of the library resisted such measures. A large region of code appeared to have no corresponding source, suggesting that it might be generated. Here is how it starts:

_ftext():
    2dac:       00000000        nop
    2db0:       3c1c0003        lui     gp,0x3
    2db4:       279caf80        addiu   gp,gp,-20608
    2db8:       0399e021        addu    gp,gp,t9
    2dbc:       8f84801c        lw      a0,-32740(gp)
    2dc0:       8f828018        lw      v0,-32744(gp)

There is no function defined in the source code with the name _ftext. However, _ftext is defined in the linker script (in pkg/l4re-core/ldscripts/ARCH-mips/main_rel.ld) as follows:

  .text           :
  {
    _ftext = . ;
    *(.text.unlikely .text.*_unlikely .text.unlikely.*)
    *(.text.exit .text.exit.*)
    *(.text.startup .text.startup.*)
    *(.text.hot .text.hot.*)
    *(.text .stub .text.* .gnu.linkonce.t.*)
    /* .gnu.warning sections are handled specially by elf32.em.  */
    *(.gnu.warning)
    *(.mips16.fn.*) *(.mips16.call.*)
  }

If you haven’t encountered linker scripts before, then you probably don’t want to spend too much time looking at this, linker scripts being frustratingly terse and arcane, but the essence of the above is that a bunch of code is stuffed into the .text section, with _ftext being assigned the address of the start of all this code. Now, _ftext in the linker script corresponds to a particular label in the object dump (which we saw earlier was positioned at 0x2780) whereas the _ftext function in the code occurs later (at 0x2dac, above). After the label but before the function is code whose source is found by objdump.

So I took the approach of removing things from the linker script, ultimately removing everything from the .text section apart from the assignment to _ftext. This removed the annotated regions of the code and left me with only the _ftext function. It really did appear that this was something the compiler might be responsible for. But where would I find the code responsible?

One hint that was present in the _ftext function code was the use of another identified function, __cxa_finalize. Searching the GCC sources for code that might use it led me to the libgcc sources and to code that invokes destructor functions upon program exit. This wasn’t really what I was looking for, but the file containing it (libgcc/crtstuff.c) would prove informative.

Back to the Table

Jean had indicated that there might be a difference in output between compilers, and that certain symbols might be produced by some but not by others. I investigated further by using the readelf tool with the -a option to show almost everything about the library file. Here, the focus was on the global offset table (GOT) and information about the entries. In particular, I wanted to know more about the entry providing the erroneous 0x800000 value, located at (gp - 32176). In my output I saw the following interesting thing:

 Global entries:
   Address     Access  Initial Sym.Val. Type    Ndx Name
  00025f80 -32176(gp) 00000000 00000000 FUNC    UND __register_frame_info

This seems to tell us what the program expects to find at the location in question, and it indicates that the named symbol is presumably undefined. There were some other undefined symbols, too:

_ITM_deregisterTMCloneTable
_ITM_registerTMCloneTable
__deregister_frame_info

Meanwhile, Jean was seeing symbols with other names:

__register_frame_info_base
__deregister_frame_info_base

During my perusal of the libgcc sources, I had noticed some of these symbols being tested to see if they were non-zero. For example:

  if (__register_frame_info)
    __register_frame_info (__EH_FRAME_BEGIN__, &object);

These fragments of code appear to be located in functions related to program initialisation. And it is also interesting to note that back in the library code, after the offending table entry has been accessed, there are tests against zero:

    2f34:       3c1c0003        lui     gp,0x3
    2f38:       279cadfc        addiu   gp,gp,-20996
    2f3c:       0399e021        addu    gp,gp,t9
    2f40:       27bdffe0        addiu   sp,sp,-32
    2f44:       8f828250        lw      v0,-32176(gp)
    2f48:       afbc0010        sw      gp,16(sp)
    2f4c:       afbf001c        sw      ra,28(sp)
    2f50:       10400007        beqz    v0,2f70 <_ftext+0x7f0>

Here, gp gets set up correctly, v0 is set to the value of the table entry, which we now believe refers to __register_frame_info, and the beqz instruction tests this value against zero, skipping ahead if it is zero. Does that not sound a bit like the code shown above? One might think that the libgcc code might handle an uninitialised table entry, and maybe it is intended to do so, but the table entry ends up getting adjusted to 0x800000, presumably as part of the library loading process.

I think that the most relevant function here for the adjustment of these entries is _dl_perform_mips_global_got_relocations which can be found in the pkg/l4re-core/uclibc/lib/contrib/uclibc/ldso/ldso/ldso.c file as part of the L4Re C library code. It may well have changed the entry from zero to this erroneous non-zero value, merely because the entry lies within the table and is assumed to be valid.

So, as a consequence, the libgcc code acts as if it has a genuine __register_frame_info function to call, and doing so causes the jump to the start of the library object and the exception. Maybe the code is supposed to be designed to handle missing symbols, those symbols potentially being deliberately omitted, but it doesn’t function correctly under these particular circumstances.

Symbol Restoration

However, despite identifying this unfortunate interaction between C library and libgcc, the matter of a remedy remained unaddressed. What was I to do about these missing symbols? Were they supposed to be there? Was there a way to tell libgcc not to expect them to be there at all?

In attempting to learn a bit more about the linking process, I had probably been through the different L4Re packages several times, but Jean then pointed me to a file I had seen before, perhaps before I had needed to think about these symbols at all. It contained “empty” definitions for some of the symbols but not for others. Maybe the workaround or even the solution was to just add more definitions corresponding to the symbols the program was expecting? Jean thought so.

So, I added a few things to the file (pkg/l4re-core/ldso/ldso/fixup.c):

void __deregister_frame_info(void);
void __register_frame_info(void);
void _ITM_deregisterTMCloneTable(void);
void _ITM_registerTMCloneTable(void);

void __deregister_frame_info(void) {}
void __register_frame_info(void) {}
void _ITM_deregisterTMCloneTable(void) {}
void _ITM_registerTMCloneTable(void) {}

I wasn’t confident that this would fix the problem. After all the investigation, adding a few lines of trivial code to one file seemed like too easy a way to fix what seemed like a serious problem. But I checked the object dump of the library, and suddenly things looked a bit more reasonable. Instead of referencing an uninitialised table entry, objdump was able to identify the jump target as __register_frame_info:

    2e14:       8f828040        lw      v0,-32704(gp)
    2e18:       afbc0010        sw      gp,16(sp)
    2e1c:       afbf001c        sw      ra,28(sp)
    2e20:       10400007        beqz    v0,2e40 <_ftext+0x7f0>
    2e24:       8f85801c        lw      a1,-32740(gp)
    2e28:       8f84803c        lw      a0,-32708(gp)
    2e2c:       8f998040        lw      t9,-32704(gp)
    2e30:       24a55fa8        addiu   a1,a1,24488
    2e34:       04111c39        bal     9f1c <__register_frame_info>

Of course, the code had been reorganised and so things were no longer in quite the same places, but in the above, (gp - 32704) is actually a reference to __register_frame_info, and although this value gets tested against zero as before, we can see that enough is now known about the previously-missing symbol that a branch directly to the location of the function has been incorporated, rather than a jump to the address stored in the table.

And sure enough, powering up the Letux 400 produced the framebuffer terminal showing the expected output:

Hi World! (shared)

It had been a long journey for such a modest reward, but thanks to Jean’s help and a bit of perseverance, I got there in the end.

Investigating CPython’s Optimisation Trickery for Lichen

Saturday, July 7th, 2018

For those of us old enough to remember how the Python programming language was twenty or so years ago, nostalgic for a simpler and kinder time, looking to escape the hard reality of today’s feature enhancement arguments, controversies, general bitterness and recriminations, it can be informative to consider what was appealing about Python all those years ago. Some of us even take this slightly further and attempt to formulate our own take on the language, casting aside things that do not appeal or that seem superfluous, needlessly confusing, or redundant.

My own Python variant, called Lichen, strips away quite a few things from today’s Python but probably wouldn’t seem so different to twentieth century Python. Since my primary objective with Lichen is to facilitate static analysis so that observations can be made about program behaviour before running the program, certain needlessly-dynamic features have been eliminated. Usually, when such statements about feature elimination are made, people seize upon them to claim that the resulting language is statically typed, but this is deliberately not the case here. Indeed, “duck typing” is still as viable as ever in Lichen.

Ancient Dynamism

An example of needless dynamism in Python can arguably be found with the __getattr__ and __setattr__ methods, introduced as far back as Python 1.1. They allow accesses to attributes via instances to be intercepted and values supposedly provided by these attributes to be computed on demand. In effect, these methods support virtual or dynamic attributes that are not really present on an object. Here’s an extract from one of the Python 1.2 demonstration programs (Demo/pdist/client.py):

        def __getattr__(self, name):
                if name in self._methods:
                        method = _stub(self, name)
                        setattr(self, name, method) # XXX circular reference
                        return method
                raise AttributeError, name

In this code, if an instance of the Client class (from which this method is taken) is used to access an attribute called hello, then this method will see if the string “hello” is found in the instance’s _methods attribute, and if so it produces a special object that is then returned as the value for the hello attribute. Otherwise, it raises an exception to indicate that this virtual attribute is not recognised. (Here, the setattr call stores the special object as a genuine attribute in order to save this method from being called again for the same attribute.)

Admittedly, this is quite neat, and it quickly becomes tempting to use such facilities everywhere – this is very much the story of Python and its development – but such things make reasoning about programs more difficult. We cannot know what attributes the instances of this Client class may have without running the program. Indeed, to find out in this case, running the program is literally unavoidable since the _methods attribute is actually populated using the result of a message received over the network!

But even in simpler cases, it can readily be intuitively understood that finding out the supported attributes of instances whose class offers such a method might involve a complicated exercise looking at practically all the code in a program. Despite all the hard work, this exercise will nevertheless produce unreliable or imprecise results. It says something about the fragility of such facilities that properties were later added to Python 2.2 to offer a more declarative alternative.
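
As a minimal illustration of that alternative, a property declares a computed attribute on the class itself, where static analysis (and human readers) can see it; the example here is invented:

class Temperature:
    def __init__(self, celsius):
        self.celsius = celsius

    @property
    def fahrenheit(self):
        "A computed attribute, declared rather than intercepted."
        return self.celsius * 9 / 5 + 32

t = Temperature(100)
print(t.fahrenheit)  # 212.0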

(It also says something about Python 3 that the cornucopia of special mechanisms for dynamically exposing attributes are apparently still present, despite much having been said about Python 3 remedying such Python 1 and 2 design artefacts.)

Hidden Motives

With static analysis, we might expect to be able to deduce which attributes are provided by class instances without running a program, this potentially allowing us to determine the structure of program objects and to detect errors around their use. But another objective with Lichen is to see how constraints on the language may be used to optimise the performance of programs. I will not deny that performance has always been an interest of mine with respect to Python and its implementations, and I imagine that many compiler and virtual machine implementers have been motivated by such concerns throughout the years.

The deductions made during static analysis can potentially allow us to generate executable programs that perform the same work more efficiently. For example, if it is known that a collection of method calls on an object identify that object as being of a certain type, we can then employ more efficient ways of calling those methods. So, for the following code…

        while number:
            digits.append(_hexdigits[number % base])
            number = number / base

…if we can assert that digits is a list, then considering that we might normally generate code for the append method call as something like this…

__load_via_class(digits, append)(...)

…where the __load_via_class operation has to go and find the append method via the class of digits (or, in some cases, even look for the append attribute on the object first), we might instead be able to generate code like this…

__fn_list_append(digits, ...)

…where __fn_list_append is a genuine C function and the digits instance is passed directly to it, together with the elided arguments. When we can get this kind of thing to happen, it can obviously be very satisfying. Various Python implementations and tools also attempt to make method calls efficient in their own ways, some possibly relying on run-time caches that short-circuit the exercise of finding the method.
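
A run-time cache of the kind alluded to might be sketched in Python as follows, keyed on the class and attribute name (real implementations would also need to invalidate entries when classes are modified):

_method_cache = {}

def load_via_class(obj, name):
    "Find an attribute on an object's class, caching the lookup."
    key = (type(obj), name)
    try:
        return _method_cache[key]
    except KeyError:
        method = _method_cache[key] = getattr(type(obj), name)
        return method

digits = []
append = load_via_class(digits, "append")
append(digits, "f")   # equivalent to digits.append("f")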

Magic Numbers

It can be informative to compare the performance of code generated by the Lichen toolchain and the performance of the same program running in the CPython virtual machine, Python and Lichen being broadly compatible (but not identical). As I noted in my summary of 2017, the performance of generated programs was rather disheartening to see at first. I needed to employ profiling to discover where the time was being spent in my generated code that seemed not to be a comparable burden on CPython.

The practicalities of profiling are definitely beyond the scope of this article, but what I did notice was just how much time was being spent allocating space in memory for integers used by programs. I recalled that Python does some special things with integers itself, and so I set about looking for the details of its memory allocation strategies.

It turns out that integers are allocated in a simplified fashion for performance reasons, instead of using the more general allocator that is compatible with garbage collection. And not just that: a range of “small” integers is also allocated in advance when programs run, so that no time is wasted repeatedly allocating objects for numbers that would likely see common use. The details of this can be found in the Objects/intobject.c file in CPython 1.x and 2.x source distributions. Even CPython 1.0 employs this technique.
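As a rough, self-contained illustration of the technique, and not CPython’s actual code, a miniature allocator might preallocate a table of shared small-integer objects and carve larger blocks into a free list:

#include <stdlib.h>

/* A miniature of the strategy in Objects/intobject.c, not CPython's
   actual code: small values share preallocated objects, while other
   values are served from blocks carved up into a free list. */

typedef struct intobj {
    long value;
    struct intobj *next_free;   /* free-list link when unused */
} intobj;

#define NSMALL 262              /* covers -5..256 inclusive */
#define BLOCK_SIZE 1000

static intobj small_ints[NSMALL];
static intobj *free_list = NULL;

static void init_small_ints(void)
{
    for (long v = -5; v < 257; v++)
        small_ints[v + 5].value = v;
}

static intobj *int_from_long(long v)
{
    if (v >= -5 && v < 257)
        return &small_ints[v + 5];      /* no allocation at all */
    if (free_list == NULL) {
        /* Allocate a whole block and thread it into the free list. */
        intobj *block = malloc(BLOCK_SIZE * sizeof(intobj));
        if (block == NULL)
            abort();
        for (int i = 0; i < BLOCK_SIZE; i++)
            block[i].next_free = (i + 1 < BLOCK_SIZE) ? &block[i + 1] : NULL;
        free_list = block;
    }
    intobj *obj = free_list;
    free_list = obj->next_free;
    obj->value = v;
    return obj;
}

The -5..256 range matches the CPython 2.7 defaults; the real implementation also handles reference counting and returns freed integers to the free list for reuse.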

At first, I thought I had discovered the remedy for my performance problems, but replicating similar allocation arrangements in my run-time code demonstrated that such a happy outcome was not to be so easily achieved. As I looked around for other special treatment CPython applies, I took a closer look at the bytecode interpreter (found in Python/ceval.c), which is the mechanism that takes the compiled form of Python programs (the bytecode) and evaluates the instructions encoded in this form.

My test programs involved simple loops like this:

i = 0
while i < 100000:
    f(i)
    i += 1

And I had a suspicion that apart from allocating new integers, the operations involved in incrementing them were more costly than they were in CPython. Now, in Python versions from 1.1 onwards, the special operator methods are supported for things like the addition, subtraction, multiplication and division operators. This could conceivably lead to integer addition being supported by the following logic in one of the simpler cases:

# c = a + b
c = a.__add__(b)

But from Python 1.5 onwards, some interesting things appear in the CPython source code:

                case BINARY_ADD:
                        w = POP();
                        v = POP();
                        if (PyInt_Check(v) && PyInt_Check(w)) {
                                /* INLINE: int + int */
                                register long a, b, i;
                                a = PyInt_AS_LONG(v);
                                b = PyInt_AS_LONG(w);
                                i = a + b;

Here, when handling the bytecode for the BINARY_ADD instruction, which is generated when the addition operator (plus, “+”) is used in programs, there is a quick test for two integer operands. If the conditions of this test are fulfilled, the result is computed directly (with an additional test for overflow, not shown above, also being performed). So, CPython was special-casing integers in two ways: with allocation tricks, and with “fast paths” in the interpreter for cases involving integers.
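For the curious, the omitted overflow handling amounts to the classic sign-comparison trick: if both operands have the same sign but the result’s sign differs from theirs, the sum has overflowed. Roughly, and reproduced from memory rather than quoted verbatim from any particular CPython version:

if ((i^a) < 0 && (i^b) < 0)
        x = PyNumber_Add(v, w);    /* overflow: use the general machinery */
else
        x = PyInt_FromLong(i);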

The Tag Team

My response to this was similarly twofold: find an efficient way of allocating integers, and introduce faster ways of handling integers when they are presented to operators. One option that the CPython implementers actually acknowledge in their source code is that of employing an entirely different representation for integers. CPython may have too much legacy baggage to make this transition, and Python 3, it would seem, certainly didn’t help the implementers to make the break, but I have a bit more flexibility.

The option in question is the so-called tagged pointer approach, where instead of having a dedicated object for each integer, referenced using a pointer, the integers themselves are represented by a value that would normally act as a pointer. But this value is not actually a valid pointer at all: it has its lowest bit set, which violates an alignment restriction that is imposed by some processor architectures and that can be self-imposed on other systems, merely ruling out the positioning of objects at odd-numbered addresses.

So, we might have the following example representations on a 32-bit architecture:

hex value      31..............................0 (bits)
0x12345678 == 0b00010010001101000101011001111000 => pointer
0x12345679 == 0b00010010001101000101011001111001 => integer

Clearing bit 0 and shifting the other bits one position to the right yields the actual integer value, which in the above case will be 152709948. It is conceivable that in future I might sacrifice another bit for encoding other non-pointer values, especially since various 32-bit architectures require word-aligned addresses, where words are positioned on boundaries that are multiples of four bytes, meaning that the lowest two bits would have to be zero for a pointer to be valid.
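A minimal sketch of this representation in C, assuming pointers are at least two-byte aligned so that bit 0 is free for tagging; the names here are illustrative, not Lichen’s actual ones:

#include <stdint.h>

typedef uintptr_t value;        /* either a pointer or a tagged integer */

static int is_tagged_int(value v)
{
    return (v & 1) != 0;        /* pointers have bit 0 clear */
}

static value tag_int(intptr_t i)
{
    return ((uintptr_t) i << 1) | 1;    /* shift left, set the tag bit */
}

static intptr_t untag_int(value v)
{
    /* The arithmetic right shift discards the tag bit and restores
       the sign, relying on the common behaviour of >> on signed
       values. */
    return (intptr_t) v >> 1;
}

With these definitions, untag_int(0x12345679) yields 152709948, matching the example above.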

With such an approach, albeit at some additional cost when handling pointers, we can distinguish integers from other types rapidly and conveniently, which makes the second part of our strategy more efficient as well. Here, we need to identify and handle integers for the arithmetic operators, but unlike CPython, where this happens to be done in an interpreter loop, we have no such loop. Instead, we generate code for such operators that simply invokes some existing functions (written in the Lichen language and compiled to C code, another goal being to write as much of the language system in Lichen itself, not C).

It would be rather wasteful to generate tests for integers in addition to these operator function calls every time such a call is made, but the tests can certainly reside within those functions instead. So, here is what we might do for the addition operator:

def add(a, b):
    if is_int(a) and is_int(b):
        return int_add(a, b)
    return binary_op(a, b, lambda a: a.__add__, lambda b: b.__radd__)

This code leaves me with a bit of explaining to do! Last things first: the final statement is the general case of dispatching to the operands and calling an appropriate operator method, with the binary_op function performing the logic in conjunction with the operands and some lambda functions that defer access to the special methods until they are really needed. It is probably best just to trust me that this does the job!

Before the generic operator method dispatch, however, is the test of the operands to see if they are both integers, and this should be vaguely familiar from the CPython source code. A special function is then called to add them efficiently. Note that we couldn’t use the addition (plus, “+”) operator because this code is meant to be handling that, and it would most likely send us on an infinitely recursive loop that never gets round to performing the addition! (I don’t treat the operator as a special case in this code, either. This code is compiled exactly like any other code written in the language.)

The is_int function is what I call “native”, meaning that it is implemented using low-level operations, in this case ones that test the representation of the argument to see if it has its lowest bit set, returning a true value if so. Meanwhile, int_add is largely equivalent to the addition operation seen in the CPython source code above, just with different details involved.
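In terms of the tagged representation sketched earlier, these natives might look something like the following. This is a hypothetical rendition, not Lichen’s actual code, with handle_overflow standing in for whatever fallback promotes the result to a larger representation:

static value handle_overflow(intptr_t x, intptr_t y);  /* hypothetical */

static int is_int(value v)
{
    return is_tagged_int(v);    /* just inspect the tag bit */
}

static value int_add(value a, value b)
{
    intptr_t x = untag_int(a);
    intptr_t y = untag_int(b);
    intptr_t r = x + y;

    /* The same sign-comparison overflow test as in CPython; a real
       implementation would also need to check that the result fits
       in the reduced range available to tagged integers. */
    if ((r ^ x) < 0 && (r ^ y) < 0)
        return handle_overflow(x, y);

    return tag_int(r);
}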

Progress and Reflections

Such adjustments made quite a difference to the performance of my generated code. And they make some sense, too. Integers appear a lot in programs: not only in general arithmetic, but also as counters, as index values for things like lists, tuples, strings and other collections, and in a range of other mundane roles whose performance can be overlooked until it proves to be suboptimal. Python has something of a reputation for having slow implementations, but CPython’s trickery here optimises in favour of fast results where they can be obtained, falling back on the slower, general mechanisms only when they are required.

But I discovered that this is not the only optimisation trickery CPython does, as another program with interesting representation choices and some wildly varying running times was to demonstrate. More on that in the next article on this topic!