Paul Boddie's Free Software-related blog


Archive for the ‘Free Software’ Category

An Absence of Strategy?

Wednesday, January 16th, 2019

I keep starting articles but not finishing them. However, after responding to some correspondence recently, where I got into a minor rant about a particular topic, I thought about starting this article and more or less airing the rant for a wider audience. I don’t intend to be negative here, so even if this sounds like me having a moan about how things are, I really do want to see positive and constructive things happen to remedy what I see as deficiencies in the way people go about promoting and supporting Free Software.

The original topic of the correspondence was my brother’s article about submitting “apps” to F-Droid, the Free Software application repository for Android, which somehow got misattributed to me in the FSFE newsletter. As anyone who knows both of us can imagine, it is not particularly unusual that people mix us up, but it does still surprise me how people can be fluid about other people’s names and assume that two people with the same family name are the same person.

Eventually, the correction was made, for which I am grateful, and it must be said that I do also appreciate the effort that goes into writing the newsletter. Having previously had the task of doing some of the Fellowship interviews, I know that such things require more work than people might think, largely go either unnoticed or unremarked, and as a participant in the process it can be easy to wonder afterwards if it was worth the bother. I do actually follow the FSFE Planet and the discussion mailing list, so I’d like to think that I keep up with what other people do, but the newsletter must have some value to those who don’t want to follow a range of channels.

A Rant about Free Software on Mobile

Well, it wasn’t as much of a rant as it was a moan about how there doesn’t seem to be a coherent strategy about Free Software on mobile devices. The FSFE has had some kind of campaign about Android for quite some time. What it amounts to is promoting Free Software applications and Free Software distributions on phones.

This probably isn’t significantly different from the activities promoted by the FSF, whose Defective By Design campaign features a gift guide promoting phones that run Free Software. The FSF also promotes and funds the Replicant project, more of which below.

For all I know, the situation about getting Free Software applications onto a phone probably isn’t all that dire, assuming that Google and phone vendors don’t try and prevent users from installing software that isn’t delivered via Google Play or other officially-sanctioned channels. Or as the Android documentation puts it:

Of course, this is rather reminiscent of the “bad old days” where some people could copy things on and off their phone using Bluetooth (or for those with particularly long memories, infrared communication) whereas others had those features disabled by their provider. So, while some people get to enjoy the benefits of Free Software, others are denied them: another case of divide-and-rule in action, I suppose.

But it is the situation about Free Software distributions, more specifically having a Free Software operating system with Free Software device drivers and Free Software firmware, that is most worrying. The FSFE campaign points people to the two enduring initiatives for putting a different kind of Android distribution on phones: Replicant and LineageOS (previously CyanogenMod).

While LineageOS seems to try and support as many devices as possible, it relies on non-free software to support device hardware. Meanwhile, Replicant employs only Free Software and is therefore limited in which devices it may support and to what extent those device’s features will function.

Although I can’t really muster much enthusiasm for Android and its derivatives, I don’t think there is anything wrong with trying to provide a completely Free Software distribution of that software. Certainly, there will always be challenges with the evolution of the upstream code, being steered this way and that by its corporate parent for maximum corporate benefit, but this isn’t really much different to clinging on to the pace of change (and churn) in an openly-developed project like the Linux kernel.

But ultimately, these initiatives will always be reacting to what other people, specifically those working for large companies, have decided to do. It will always be about chasing the latest release of the upstream software and making it acceptable for a Free Software audience. And it will always be about seeing whether the latest devices can be supported or not and then trying to figure out how. And this is where most people start to wonder why things always have to be like this, spurring my rant.

For the Long Term

To be fair to the FSFE’s Android-related campaign, the advice given does give people some concrete activities to consider: it isn’t simply the “go out and write code” battle cry that sometimes drifts through the air after an acrimonious episode where nobody can agree with each other. Helping F-Droid get more applications published, writing more Free Software applications, helping the operating system projects with their efforts: these are all worthwhile things for people to do.

But we only need look at the FSF’s ethical gift guide to see the severe challenges being faced over the longer term. For yet another year, the only offerings are older, refurbished Samsung smartphones, the most recent being the Galaxy S3 and Galaxy Note 2 from 2012. Now, there is nothing wrong with older hardware or refurbished devices. After all, I have written about older and much less powerful hardware than that which I believe still should have a place in modern life.

Nor should people regard the price of such refurbished units as particularly expensive. Of course you can buy a new phone with better specifications for the same or even less, but that new phone won’t be running only Free Software. Yes, there is always the Fairphone whose creators’ grip on Free Software, software longevity and other matters that weren’t confronted in the beginning is supposedly now rather better, although the hardware drivers remain non-free, so it isn’t really comparable, either.

Here, it is illustrative to consider community-originating efforts to develop smartphones, particularly since there is a perception that such efforts eventually end up pitching “expensive” and “underpowered” devices that “aren’t competitive”. There are obviously a collection of such initiatives ongoing at any given point, but ignoring random crowdfunding campaigns and corporate publicity stunts, we might usefully focus on some more familiar projects: the GTA04 successor to the Openmoko FreeRunner and the Neo900 successor to the Nokia N900.

Both of these projects are in a not-easily-explained state. The GTA04 device was made in a number of incrementally refined versions and sold primarily to people who already had a FreeRunner into which they could install the new hardware. However, difficulties with the last hardware revision meant that it was no longer viable as a product, with the cost of overcoming production problems being rather likely to exceed any money otherwise made on the units.

Meanwhile, the Neo900 project is effectively stalled having experienced several setbacks, notably the freezing of funds by PayPal for no really good reason, and difficulties in finding and retaining qualified people to do things like board layout. Although there are aspirations to get to completion, perhaps with some modification of the original goals, the path to completion remains obscure and uncertain. It is certainly hard to sustain my initial enthusiasm for the project, even if I do sympathise deeply with the struggles and difficulties of those trying to deliver something that they want to see succeed perhaps more than anyone else.

The future is not necessarily entirely bleak for these projects, though. Experience from the GTA04 effort may have been beneficial to the development of the Pyra handheld computer, whose own genesis has been troubled at times and yet forthrightly communicated in an honourably transparent fashion by its initiator, and the CPU board for that device may end up as the basis for a new product known as the GTA15. Given the common architectural heritage of the GTA04, N900 and Neo900, it would not be completely inconceivable that if some kind of way forward could be found for the Neo900, GTA15 might be some kind of contributing element to that.

What these projects should illustrate, however, is that the foundations of a Free Software mobile device are difficult to prepare, subject to sudden and potentially catastrophic setbacks or outright failure, and they require persistence and plenty of resources, not least of the financial kind but also the dedication of people with the right competence and experience. Sadly, these projects never get the attention, the recognition, or the generosity they deserve.

Seeing the Bigger Picture

If we care about being able to support all the different elements of our phones with Free Software, instead of crossing out items in the specification list, sacrificing functionality because nobody knows how it works without signing a non-disclosure agreement and then only being allowed to release a binary blob for loading into specific Linux kernel versions, then we need to be there at the start, when the phone gets designed, and play a role in making sure that everything can be supported by Free Software. Spending time and effort on figuring out someone else’s hardware is not time and effort well spent.

Indeed, from the moment a proprietary product gets into the hands of developers, the clock is ticking. Already, given the pace of product development and shipment, the product is heading for obsolescence and its manufacturer will be tooling up to make its successor. Even if downstream developers work quickly and get as much of the hardware supported as they can, there will be only be a certain period before the product becomes difficult to obtain. And then the process of catch-up starts all over again with the next product.

Of course, product variations always used to happen with desktop and laptop computers. One day you’d get a laptop with one chipset and the next day the “same” laptop would contain something else. The only thing that eased the pain involved was broad hardware support for these kinds of devices, and even then there would be chipsets notorious for their lack of support in things like the Linux kernel.

Such pitfalls cultivated demands for products that could run Free Software and be supplied with such software instead of the usual proprietary products bundled as a consequence of Microsoft’s anticompetitive and coercive business practices. It was no longer enough to accept that we might buy a computer with bundled software and “install over it”, that this might emphasise our Free Software credentials. Credible advocates of Free Software have sought to identify vendors offering systems that are either already supplied with Free Software or that come without any preinstalled software at all, in both cases being fully capable of supporting a Free Software distribution.

But we now find ourselves in the absurd situation with mobile devices where remedial measures comparable to “install over it” are almost the only things people can suggest, that there really aren’t any mobile device vendors who can offer a bare, supportable device or one that is, say, already running Replicant and offering access to all of the device’s features. And although refurbished devices are sold that may run Replicant well enough, we lack another essential guarantee that may not have been so important in the past, one that community-originating hardware projects might be able to help with.

In being involved with the design of these devices, we can seek to dictate how long they remain viable. Instead of having a large corporation decide that now is the time for you to buy their next device and that the product you bought and liked is now deliberately unavailable, we can seek to keep making devices as long as they have a role and have people wanting to keep using them.

If something runs a Free Software distribution well enough, and if that device can still be made, it becomes a safe choice and something we can recommend to others. At last, we get some kind of certainty in a world whose stream of continual change is often fad-driven, exploitative and needless.

The Strategic Vacuum

So it seems obvious to me that if people want Free Software on phones, they need to cultivate the efforts to make that Free Software viable, which means cultivating sustainable hardware design and actually promoting and supporting the projects pursuing it. Otherwise, it is like trying to plant an orchard without paying attention to the soil, cursing that the trees will not grow whilst being oblivious to the fact that the ground is concrete.

And this goes beyond this particular domain. Free Software advocacy is all well and good, but there needs to be practical action that goes beyond pitching in and nudging things towards success. It is wonderful that collective effort and collaboration can take small endeavours and grow them into solutions that are useful for many, but it can be too much to expect everything to just coalesce as if by magic, that people and projects will just discover each other, work together, strengthen each other and multiply the benefits.

There needs to be a strategy, for people to identify real-world problems that are not being addressed by Free Software, to identify the projects that might be applied to those problems, and to propose actual solutions. In the absence of Free Software, proprietary and exploitative solutions are able to stake out their territory, entrench themselves and to thrive.

Here, I struggle to see which Free Software organisations have both the breadth of scope and the depth of focus to make a difference. Developer-driven organisations like Debian and KDE have a lot going on, and they deliver non-trivial software systems, and yet sustaining something like a viable mobile platform (encompassing hardware and software) is seemingly beyond their reach. Neither have a coherent answer to other significant challenges of our time, such as the dominance of proprietary (and destructive) communications platforms.

Meanwhile, advocacy-driven organisations like the FSFE merely give advice to people about how they might help Free Software. The FSF arguably combines this with actual development through the GNU project, and there are those lists of urgent activities, but one cannot help but have the impression that the urgency is reactive and not the product of strategic vision (beyond the goal of the proliferation of Free Software, of course). I would like to think that the Software Freedom Conservancy might combine things more effectively, but it perhaps remains more of an umbrella organisation with a similar emphasis on broad Free Software adoption.

I would like to think that the step of getting involved in Free Software is but the first step towards fairer, kinder, more transparent, more productive and more sustainable societies. Traces of such visions can be seen in the communications of various organisations and yet they largely hold back from suggesting what the next steps might be. And yet, I think, we will in future really need to take those steps in order to protect ourselves, our societies, and the things we care about.

In general, we notice the absence of strategy; in specific cases, we notice it keenly. Which organisations are willing and able to fill this strategic void coherently and decisively, to offer concrete solutions as opposed to suggestions, to have something definite to work towards, and to direct effort and resources into actually realising such goals? Surely, in this age of hoping for the best, those organisations will be the ones that deserve our support the most.

Another Look at VGA Signal Generation with a PIC32 Microcontroller

Thursday, November 8th, 2018

Maybe some people like to see others attempting unusual challenges, things that wouldn’t normally be seen as productive, sensible or a good way to spend time and energy, things that shouldn’t be possible or viable but were nevertheless made to work somehow. And maybe they like the idea of indulging their own curiosity in such things, knowing that for potential future experiments of their own there is a route already mapped out to some kind of success. That might explain why my project to produce a VGA-compatible analogue video signal from a PIC32 microcontroller seems to attract more feedback than some of my other, arguably more useful or deserving, projects.

Nevertheless, I was recently contacted by different people inquiring about my earlier experiments. One was admittedly only interested in using Free Software tools to port his own software to the MIPS-based PIC32 products, and I tried to give some advice about navigating the documentation and to describe some of the issues I had encountered. Another was more concerned with the actual signal generation aspect of the earlier project and the usability of the end result. Previously, I had also had a conversation with someone looking to use such techniques for his project, and although he ended up choosing a different approach, the cumulative effect of my discussions with him and these more recent correspondents persuaded me to take another look at my earlier work and to see if I couldn’t answer some of the questions I had received more definitively.

Picking Over the Pieces

I was already rather aware of some of the demonstrated and potential limitations of my earlier work, these being concerned with generating a decent picture, and although I had attempted to improve the work on previous occasions, I think I just ran out of energy and patience to properly investigate other techniques. The following things were bothersome or a source of concern:

  • The unevenly sized pixels
  • The process of experimentation with the existing code
  • Whether the microcontroller could really do other things while displaying the picture

Although one of my correspondents was very complimentary about the form of my assembly language code, I rather felt that it was holding me back, making me focus on details that should be abstracted away. It should be said that MIPS assembly language is fairly pleasant to write, at least in comparison to certain other architectures.

(I was brought up on 6502 assembly language, where there is an “accumulator” that is the only thing even approaching a general-purpose register in function, and where instructions need to combine this accumulator with other, more limited, registers to do things like accessing “zero page”: an area of memory that supports certain kinds of operations by providing the contents of locations as inputs. Everything needs to be meticulously planned, and despite odd claims that “zero page” is really one big register file and that 6502 is therefore “RISC-like”, the existence of virtual machines such as SWEET16 say rather a lot about how RISC-like the 6502 actually is. Later, I learned ARM assembly language and found it rather liberating with its general-purpose registers and uncomplicated, rather easier to use, memory access instructions. Certain things are even simpler in MIPS assembly language, whereas other conveniences from ARM are absent and demand a bit more effort from the programmer.)

Anyway, I had previously introduced functionality in C in my earlier work, mostly because I didn’t want the challenge of writing graphics routines in assembly language. So with the need to more easily experiment with different peripheral configurations, I decided to migrate the remaining functionality to C, leaving only the lowest-level routines concerned with booting and exception/interrupt handling in assembly language. This effort took me to some frustrating places, making me deal with things like linker scripts and the kind of memory initialisation that one’s compiler usually does for you but which is absent when targeting a “bare metal” environment. I shall spare you those details in this article.

I therefore spent a certain amount of effort in developing some C library functionality for dealing with the hardware. It could be said that I might have used existing libraries instead, but ignoring Microchip’s libraries that will either be proprietary or the subject of numerous complaints by those unwilling to leave that company’s “ecosystem”, I rather felt that the exercise in library design would be useful in getting reacquainted and providing me with something I would actually want to use. An important goal was minimalism, whereas my impression of libraries such as those provided by the Pinguino effort are that they try and bridge the different PIC hardware platforms and consequently accumulate features and details that do not really interest me.

The Wide Pixel Problem

One thing that had bothered me when demonstrating a VGA signal was that the displayed images featured “wide” pixels. These are not always noticeable: one of my correspondents told me that he couldn’t see them in one of my example pictures, but they are almost certainly present because they are a feature of the mechanism used to generate the signal. Here is a crop from the example in question:

Picture detail from VGAPIC32 output

Picture detail from VGAPIC32 output

And here is the same crop with the wide pixels highlighted:

Picture detail with wide pixels highlighted

Picture detail with wide pixels highlighted

I have left the identification of all wide pixel columns to the reader! Nevertheless, it can be stated that these pixels occur in every fourth column and are especially noticeable with things like text, where at such low resolutions, the doubling of pixel widths becomes rather obvious and annoying.

Quite why this increase in pixel width was occurring became a matter I wanted to investigate. As you may recall, the technique I used to output pixels involved getting the direct memory access (DMA) controller in the PIC32 chip to “copy” the contents of memory to a hardware register corresponding to output pins. The signals from these pins were sent along the cable to the monitor. And the DMA controller was transferring data as fast as it could and thus producing pixel colours as fast as it could.

Pixel Output Using DMA Transfer

An overview of the architecture for pixel output using DMA transfer

One of my correspondents looked into the matter and confirmed that we were not imagining this problem, even employing an oscilloscope to check what was happening with the signals from the output pins. The DMA controller would, after starting each fourth pixel, somehow not be able to produce the next pixel in a timely fashion, leaving the current pixel colour unchanged as the monitor traced the picture across the screen. This would cause these pixels to “stretch” until the first pixel from the next group could be emitted.

Initially, I had thought that interrupts were occurring and the CPU, in responding to interrupt conditions and needing to read instructions, was gaining priority over the DMA controller and forcing pixel transfers to wait. Although I had specified a “cell size” of 160 bytes, corresponding to 160 pixels, I was aware that the architecture of the system would be dividing data up into four-byte “words”, and it would be natural at certain points for accesses to memory to be broken up and scheduled in terms of such units. I had therefore wanted to accommodate both the CPU and DMA using an approach where the DMA would not try and transfer everything at once, but without the energy to get this to work, I had left the matter to rest.

A Steady Rhythm

The documentation for these microcontrollers distinguishes between block and cell transfers when describing DMA. Previously, I had noted that these terms could be better described, and I think there are people who are under the impression that cells must be very limited in size and that you need to drive repeated cell transfers using various interrupt conditions to transfer larger amounts. We have seen that this is not the case: a single, large cell transfer is entirely possible, even though the characteristics of the transfer have been less than desirable. (Nevertheless, the documentation focuses on things like copying from one UART peripheral to another, arguably failing to explore the range of possible applications for DMA and to thereby elucidate the mechanisms involved.)

However, given the wide pixel effect, it becomes more interesting to introduce a steady rhythm by using smaller cell sizes and having an external event coordinate each cell’s transfer. With a single, large transfer, only one initiation event needs to occur: that produced by the timer whose period corresponds to that of a single horizontal “scanline”. The DMA channel producing pixels then runs to completion and triggers another channel to turn off the pixel output. In this scheme, the initiating condition for the cell transfer is the timer.

VGA Display Line Structure

The structure of each visible display line in the VGA signal

When using multiple cells to transfer the pixel data, however, it is no longer possible to use the timer in this way. Doing so would cause the initiation of the first cell, but then subsequent cells would only be transferred on subsequent timer events. And since these events only occur once per scanline, this would see a single line’s pixel data being transferred over many scanlines instead (or, since the DMA channel would be updated regularly, we would see only the first pixel on each line being emitted, stretched across the entire line). Since the DMA mechanism apparently does not permit one kind of interrupt condition to enable a channel and another to initiate each cell transfer, we must be slightly more creative.

Fortunately, the solution is to chain two channels, just as we already do with the pixel-producing channel and the one that resets the output. A channel is dedicated to waiting for the line timer event, and it transfers a single black pixel to the screen before handing over to the pixel-producing channel. This channel, now enabled, has its cell transfers regulated by another interrupt condition and proceeds as fast as such a condition may occur. Finally, the reset channel takes over and turns off the output as before.

Pixel Output Using Timed DMA Transfers

An overview of the architecture for pixel output using timed DMA transfers

The nature of the cell transfer interrupt can take various forms, but it is arguably most intuitive to use another timer for this purpose. We may set the limit of such a timer to 1, indicating that it may “wrap around” and thus produce an event almost continuously. And by configuring it to run as quickly as possible, at the frequency of the peripheral clock, it may drive cell transfers at a rate that is quick enough to hopefully produce enough pixels whilst also allowing other activities to occur between each transfer.

VGA Pixel Output (Using Transfer Timer)

Using a timer to initiate pixel transfers for the VGA signal

One thing is worth mentioning here just to be explicit about the mechanisms involved. When configuring interrupts that are used for DMA transfers, it is the actual condition that matters, not the interrupt nor the delivery of the interrupt request to the CPU. So, when using timer events for transfers, it appears to be sufficient to merely configure the timer; it will produce the interrupt condition upon its counter “wrapping around” regardless of whether the interrupt itself is enabled.

With a cell size of a single byte, and with a peripheral clock running at half the speed of the system clock, this approach is sufficient all by itself to yield pixels with consistent widths, with the only significant disadvantage being how few of them can be produced per line: I could only manage in the neighbourhood of 80 pixels! Making the peripheral clock run as fast as the system clock doesn’t help in theory: we actually want the CPU running faster than the transfer rate just to have a chance of doing other things. Nor does it help in practice: picture stability rather suffers.

A picture of the display output from timed DMA transfers

A picture of the display output from timed DMA transfers

Using larger cell sizes, we encounter the wide pixel problem, meaning that the end of a four-byte group is encountered and the transfer hangs on for longer than it should. However, larger cell sizes also introduce byte transfers at a different rate from cell transfers (at the system clock rate) and therefore risk making the last pixel produced by a cell longer than the others, anyway.

Uncovering DMA Transfers

I rather suspect that interruptions are not really responsible for the wide pixels at all, and that it is the DMA controller that causes them. Some discussion with another correspondent explored how the controller might be operating, with transfers perhaps looking something like this:

DMA read from memory
DMA write to display (byte #1)
DMA write to display (byte #2)
DMA write to display (byte #3)
DMA write to display (byte #4)
DMA read from memory
...

This would, by itself, cause a transfer pattern like this:

R____R____R____R____R____R ...
_WWWW_WWWW_WWWW_WWWW_WWWW_ ...

And thus pixel output as follows:

41234412344123441234412344 ...
=***==***==***==***==***== ... (narrow pixels as * and wide pixel components as =)

Even without any extra operations or interruptions, we would get a gap between the write operations that would cause a wider pixel. This would only get worse if the DMA controller had to update the address of the pixel data after every four-byte read and write, not being able to do so concurrently with those operations. And if the CPU were really able to interrupt longer transfers, even to obtain a single instruction to execute, it might then compete with the DMA controller in accessing memory, making the read operations even later every time.

Assuming, then, that wide pixels are the fault of the way the DMA controller works, we might consider how we might want it to work instead:

                     | ...
DMA read from memory | DMA write to display (byte #4)
                 \-> | DMA write to display (byte #1)
                     | DMA write to display (byte #2)
                     | DMA write to display (byte #3)
DMA read from memory | DMA write to display (byte #4)
                 \-> | ...

If only pixel data could be read from memory and written to the output register (and thus the display) concurrently, we might have a continuous stream of evenly-sized pixels. Such things do not seem possible with the microcontroller I happen to be using. Either concurrent reading from memory and writing to a peripheral is not possible or the DMA controller is not able to take advantage of this concurrency. But these observations did give me another idea.

Dual Channel Transfers

If the DMA controller cannot get a single channel to read ahead and get the next four bytes, could it be persuaded to do so using two channels? The intention would be something like the following:

Channel #1:                     Channel #2:
                                ...
DMA read from memory            DMA write to display (byte #4)
DMA write to display (byte #1)
DMA write to display (byte #2)
DMA write to display (byte #3)
DMA write to display (byte #4)  DMA read from memory
                                DMA write to display (byte #1)
...                             ...

This is really nothing different from the above, functionally, but the required operations end up being assigned to different channels explicitly. We would then want these channels to cooperate, interleaving their data so that the result is the combined sequence of pixels for a line:

Channel #1: 1234    1234     ...
Channel #2:     5678    5678 ...
  Combined: 1234567812345678 ...

It would seem that channels might even cooperate within cell transfers, meaning that we can apparently schedule two long transfer cells and have the DMA controller switch between the two channels after every four bytes. Here, I wrote a short test program employing text strings and the UART peripheral to see if the microcontroller would “zip up” the strings, with the following being used for single-byte cells:

Channel #1: "Adoc gi,hlo\r"
Channel #2: "n neaan el!\n"
  Combined: "And once again, hello\r\n"

Then, seeing that it did, I decided that this had a chance of also working with pixel data. Here, every other pixel on a line needs to be presented to each channel, with the first channel being responsible for the pixels in odd-numbered positions, and the second channel being responsible for the even-numbered pixels. Since the DMA controller is unable to step through the data at address increments other than one (which may be a feature of other DMA implementations), this causes us to rearrange each line of pixel data as follows:

 Displayed pixels: 123456......7890
Rearranged pixels: 135...79246...80
                   *       *

Here, the asterisks mark the start of each channel’s data, with each channel only needing to transfer half the usual amount.

Pixel Output Using Timed Dual-Channel Transfers

The architecture involved in employing two pixel data channels with timed transfers

The documentation does, in fact, mention that where multiple channels are active with the same priority, each one is given control in turn with the controller cycling through them repeatedly. The matter of which order they are chosen, which is important for us, seems to be dependent on various factors, only some of which I can claim to understand. For instance, I suspect that if the second channel refers to data that appears before the first channel’s data in memory, it gets scheduled first when both channels are activated. Although this is not a significant concern when just trying to produce a stable picture, it does limit more advanced operations such as horizontal scrolling.

A picture of the display output from timed, dual-channel DMA transfers

A picture of the display output from timed, dual-channel DMA transfers

As you can see, trying this technique out with timed transfers actually made a difference. Instead of only managing something approaching 80 pixels across the screen, more than 90 can be accommodated. Meanwhile, experiments with transfers going as fast as possible seemed to make no real difference, and the fourth pixel in each group was still wider than the others. Still, making the timed transfer mode more usable is a small victory worth having, I suppose.

Parallel Mode Revisited

At the start of my interest in this project, I had it in my mind that I would couple DMA transfers with the parallel mode (or Parallel Master Port) functionality in order to generate a VGA signal. Certain aspects of this, particularly gaps between pixels, made me less than enthusiastic about the approach. However, in considering what might be done to the output signal in other situations, I had contemplated the use of a flip-flop to hold output stable according to a regular tempo, rather like what I managed to achieve, almost inadvertently, when introducing a transfer timer. Until recently, I had failed to apply this idea to where it made most sense: in regulating the parallel mode signal.

Since parallel mode is really intended for driving memory devices and display controllers, various control signals are exposed via pins that can tell these external devices that data is available for their consumption. For our purposes, a flip-flop is just like a memory device: it retains the input values sampled by its input pins, and then exposes these values on its output pins when the inputs are “clocked” into memory using a “clock pulse” signal. The parallel mode peripheral in the microcontroller offers various different signals for such clock and selection pulse purposes.

VGA Output Circuit (Parallel Mode)

The parallel mode circuit showing connections relevant to VGA output (generic connections are not shown)

Employing the PMWR (parallel mode write) signal as the clock pulse, directing the display signals to the flip-flop’s inputs, and routing the flip-flop’s outputs to the VGA circuit solved the pixel gap problem at a stroke. Unfortunately, it merely reminded us that the wide pixel problem also affects parallel mode output, too. Although the clock pulse is able to tell an external component about the availability of a new pixel value, it is up to the external component to regulate the appearance of each pixel. A memory device does not care about the timing of incoming data as long as it knows when such data has arrived, and so such regulation is beyond the capabilities of a flip-flop.

It was observed, however, that since each group of pixels is generated at a regular frequency, the PMWR signalling frequency might be reduced by being scaled by a constant factor. This might allow some pixel data to linger slightly longer in the flip-flop and be slightly stretched. By the time the fourth pixel in a group arrives, the time allocated to that pixel would be the same as those preceding it, thus producing consistently-sized pixels. I imagine that a factor of 8/9 might do the trick, but I haven’t considered what modification to the circuit might be needed or whether it would all be too complicated.

Recognising the Obvious

When people normally experiment with video signals from microcontrollers, one tends to see people writing code to run as efficiently as is absolutely possible – using assembly language if necessary – to generate the video signal. It may only be those of us using microcontrollers with DMA peripherals who want to try and get the DMA hardware to do the heavy lifting. Indeed, those of us with exposure to display peripherals provided by system-on-a-chip solutions feel almost obliged to do things this way.

But recent discussions with one of my correspondents made me reconsider whether an adequate solution might be achieved by just getting the CPU to do the work of transferring pixel data to the display. Previously, another correspondent had indicated that it this was potentially tricky, and that getting the timings right was more difficult than letting the hardware synchronise the different mechanisms – timer and DMA – all by itself. By involving the CPU and making it run code, the display line timer would need to generate an interrupt that would be handled, causing the CPU to start running a loop to copy data from the framebuffer to the output port.

Pixel Output Using CPU-Driven Transfers

An overview of the architecture with the CPU driving transfers of pixel data

This approach puts us at the mercy of how the CPU handles and dispatches interrupts. Being somewhat conservative about the mechanisms more generally available on various MIPS-based products, I tend to choose a single interrupt vector and then test for the different conditions. Since we need as little variation as possible in the latency between a timer event occurring and the pixel data being generated, I test for that particular event before even considering anything else. Then, a loop of the following form is performed:

    for (current = line_data; current < end; current++)
        *output_port = *current;

Here, the line data is copied byte by byte to the output port. Some adornments are necessary to persuade the compiler to generate code that writes the data efficiently and in order, but there is nothing particularly exotic required and GCC does a decent job of doing what we want. After the loop, a black/reset pixel is generated to set the appropriate output level.

One concern that one might have about using the CPU for such long transfers in an interrupt handler is that it ties up the CPU, preventing it from doing other things, and it also prevents other interrupt requests from being serviced. In a system performing a limited range of activities, this can be acceptable: there may be little else going on apart from updating the display and running programs that access the display; even when other activities are incorporated, they may accommodate being relegated to a secondary status, or they may instead take priority over the display in a way that may produce picture distortion but only very occasionally.

Many of these considerations applied to systems of an earlier era. Thinking back to computers like the Acorn Electron – a 6502-based system that formed the basis of my first sustained experiences with computing – it employs a display controller that demands access to the computer’s RAM for a certain amount of the time dedicated to each video frame. The CPU is often slowed down or even paused during periods of this display controller’s activity, making programs slower than they otherwise would be, and making some kinds of input and output slightly less reliable under certain circumstances. Nevertheless, with certain kinds of additional hardware, the possibility is present for such hardware to interrupt the CPU and to override the display controller that would then produce “snow” or noise on the screen as a consquence of this high-priority interruption.

Such issues cause us to consider the role of the DMA controller in our modern experiment. We might well worry about loading the CPU with lots of work, preventing it from doing other things, but what if the DMA controller dominates the system in such a way that it effectively prevents the CPU from doing anything productive anyway? This would be rather similar to what happens with the Electron and its display controller.

So, evaluating a CPU-driven solution seems to be worthwhile just to see whether it produces an acceptable picture and whether it causes unacceptable performance degradation. My recent correspondence also brought up the assertion that the RAM and flash memory provided by PIC32 microcontrollers can be accessed concurrently. This would actually mitigate contention between DMA and any programs running from flash memory, at least until the point that accesses to RAM needed to be performed by those programs, meaning that we might expect some loss of program performance by shifting the transfer burden to the CPU.

(Again, I am reminded of the Electron whose ROM could be accessed at full speed but whose RAM could only be accessed at half speed by the CPU but at full speed by the display controller. This might have been exploited by software running from ROM, or by a special kind of RAM installed and made available at the right place in memory, but the 6502 favours those zero-page instructions mentioned earlier, forcing RAM access and thus contention with the display controller. There were upgrades to mitigate this by providing some dedicated memory for zero page, but all of this is really another story for another time.)

Ultimately, having accepted that the compiler would produce good-enough code and that I didn’t need to try more exotic things with assembly language, I managed to produce a stable picture.

A picture of the display output from CPU-driven pixel data transfers

A picture of the display output from CPU-driven pixel data transfers

Maybe I should have taken this obvious path from the very beginning. However, the presence of DMA support would have eventually caused me to investigate its viability for this application, anyway. And it should be said that the performance differences between the CPU-based approach and the DMA-based approaches might be significant enough to argue in favour of the use of DMA for some purposes.

Observations and Conclusions

What started out as a quick review of my earlier work turned out to be a more thorough study of different techniques and approaches. I managed to get timed transfers functioning, revisited parallel mode and made it work in a fairly acceptable way, and I discovered some optimisations that help to make certain combinations of techniques more usable. But what ultimately matters is which approaches can actually be used to produce a picture on a screen while programs are being run at the same time.

To give the CPU more to do, I decided to implement some graphical operations, mostly copying data to a framebuffer for its eventual transfer as pixels to the display. The idea was to engage the CPU in actual work whilst also exercising access to RAM. If significant contention between the CPU and DMA controller were to occur, the effects would presumably be visible on the screen, potentially making the chosen configuration unusable.

Although some approaches seem promising on paper, and they may even produce promising results when the CPU is doing little more than looping and decrementing a register to introduce a delay, these approaches may produce less than promising results under load. The picture may start to ripple and stretch, and under “real world” load, the picture may seem noisy and like a badly-tuned television (for those who remember the old days of analogue broadcast signals).

Two approaches seem to remain robust, however: the use of timed DMA transfers, and the use of the CPU to do all the transfer work. The former is limited in terms of resolution and introduces complexity around the construction of images in the framebuffer, at least if optimised as much as possible, but it seems to allow more work to occur alongside the update of the display, and the reduction in resolution also frees up RAM for other purposes for those applications that need it. Meanwhile, the latter provides the resolution we originally sought and offers a straightforward framebuffer arrangement, but it demands more effort from the CPU, slowing things down to the extent that animation practically demands double buffering and thus the allocation of even more RAM for display purposes.

But both of these seemingly viable approaches produce consistent pixel widths, which is something of a happy outcome given all the effort to try and fix that particular problem. One can envisage accommodating them both within a system given that various fundamental system properties (how fast the system and peripheral clocks are running, for example) are shared between the two approaches. Again, this is reminiscent of microcomputers where a range of display modes allowed developers and users to choose the trade-off appropriate for them.

A demonstration of text plotting at a resolution of 160x128

A demonstration of text plotting at a resolution of 160x128

Having investigated techniques like hardware scrolling and sprite plotting, it is tempting to keep developing software to demonstrate the techniques described in this article. I am even tempted to design a printed circuit board to tidy up my rather cumbersome breadboard arrangement. And perhaps the library code I have written can be used as the basis for other projects.

It is remarkable that a home-made microcontroller-based solution can be versatile enough to demonstrate aspects of simple computer systems, possibly even making it relevant for those wishing to teach or to learn about such things, particularly since all the components can be connected together relatively easily, with only some magic happening in the microcontroller itself. And with such potential, maybe this seemingly pointless project might have some meaning and value after all!

Update

Although I can’t embed video files of any size here, I have made a “standard definition” video available to demonstrate scrolling and sprites. I hope it is entertaining and also somewhat illustrative of the kind of thing these techniques can achieve.

The Elephant in the Room

Sunday, September 9th, 2018

I recently had my attention drawn towards a blog article about the trials of Free Software development by senior Python core developer, Brett Cannon. Now, I agree with the article’s emphasis on being nice to other people, and I sympathise with those who feel that their community-related activities are wearing them down. However, I would like to point out some aspects of his article that fall rather short of my own expectations about what Free Software, or “open source” as he calls it, should be about.

I should perhaps back up a little and mention where this article was found, which was via the “Planet Python” blog aggregator site. I do not read Planet Python, either in my browser or using a feed reader, any more. Those who would create some kind of buzz or energy around Python have somehow managed to cultivate a channel where it seems that almost every post is promoting something. I might quickly and crudely categorise the posts as follows:

  • “Look at our wonderful integrated development environment which is nice to Python (that is written in Java)! (But wouldn’t you rather use the Java-related language we are heavily promoting instead?)”
  • Stub content featuring someone’s consulting/training/publishing business.
  • Random “beginner” articles either parading the zealotry of the new convert or, of course, promoting someone’s consulting/training/publishing business.

Maybe such themes are merely a reflection of attitudes and preoccupations held amongst an influential section of the Python community, and perhaps there is something to connect those attitudes with the topics discussed below. I do recall other articles exhorting Python enthusiasts to get their name out there by doing work on “open source”, with the aim of getting some company’s attention by improving the software that company has thrown over the wall, and “Python at <insert company name>” blogging is, after all, a common Planet Python theme.

Traces of the Pachyderm

But returning to the article in question, if you read it with a Free Software perspective – that is to say that you consciously refer to “Free Software”, knowing that “open source” was coined by people who, for various reasons, wanted another term to use – then certain things seem to stand out. Most obviously, the article never seems to mention software freedom: it is all about “having fun”, attracting contributors to your projects, giving and receiving “kindnesses”, and participating in “this grand social experiment we call open source”. It is almost as if the Free Software movement and the impetus for its foundation never took place, or if it does have a place in someone’s alternative version of history, then in that false view of reality Richard Stallman was only motivated to start the GNU project because maybe he wanted to “have fun hacking a printer”.

Such omissions are less surprising if you have familiarity with attitudes amongst certain people in various Free Software communities – those typically identifying as “open source”, of course – who bear various grudges against the FSF and Richard Stallman. In the Python core development community, those grudges are sometimes related to some advice given about GPL-compatible licensing back when CPython was changing custodian and there had been concerns, apparently expressed by the entity being abandoned by the core developers, that the original “CWI licence” was not substantial enough. We might wonder whether grudges might be better directed towards those who have left CPython with its current, rather incoherent, licensing paper trail.

A Different Kind of Free

Now, as those of us familiar with the notion of Free Software should know, it is “a matter of freedom, not price”. You can very well sell Free Software, and nobody is actually obliged to distribute their Free Software works at no cost. In fact, the advice from those who formulated the very definition of Free Software is this:

Distributing free software is an opportunity to raise funds for development. Don’t waste it!

Of course, there are obligations about providing the source code for software already distributed in executable form and limitations about the fees or charges to be imposed on recipients, but these do not compel no-cost sharing, publication or distribution. Meanwhile, the Open Source Definition, for those who need an “open source” form of guidance, states the following:

The license shall not require a royalty or other fee for such sale.

This appearing, rather amusingly, in a section entitled “Free Redistribution” where “Free” apparently has the same meaning as the “Free” in Free Software: the label that the “open source” crowd were so vehemently opposed to. It is also interesting that the “Source Code” section of the definition also stipulates similar obligations to those upheld by copyleft licences.

So, in the blog article in question, it is rather interesting to see the following appear:

While open source, by definition, is monetarily free, that does not mean that the production of it is free.

Certainly, the latter part of the sentence is true, and we will return to that in a moment, but the former part is demonstrably false: the Open Source Definition states no such thing. In fact, it states that “open source” is not obliged to cost anything, which as the practitioners of logic amongst us will note is absolutely not the same thing as obliging it to always be cost-free.

Bearing the Cost

Much of the article talks about the cost of developing Free Software from the perspective of those putting in the hours to write, test, maintain and support the code. These “production” costs are acknowledged while the producers are somehow shackled to an economic model – one that is apparently misinformed, as noted above – that demands that the cost of all this work be zero to those wanting to acquire it.

So, how exactly are the production costs going to be met? One of the most useful instruments for doing so has apparently been discarded, and I imagine that a similarly misguided attitude lingers with regard to supporting Free Software produced under such misconceptions. Indeed, much of the article focuses on doing “free work”, that of responding to requests, dealing with feedback, shepherding contributions, and the resulting experience of being “stressed by strangers”.

Normally, when one hears of something of this nature taking place, when the means to live decently and to control one’s own life is being taken away from people, there is a word that springs to mind: exploitation. From what we know about certain perspectives about Free Software and “open source”, it is hardly a surprise that the word “exploitation” does not appear in the article because such words are seen by some as “political”, where “political” takes on the meaning of “something raising awkward ethical questions” that if acknowledged and addressed appropriately would actually result in people not being exploited.

But there is an ideological condition which prevents people from being “political”. According to those with such a condition, we are not supposed to embarrass those who could help us deal with the problems that trouble us because that might be “impolite”, and it might also be questioning just how they made their money, how badly they may have treated people on their way to the top, and whether personal merit had less to do with their current status than good fortune and other things that undermine the myth of their own success. We are supposed to conflate money and power with merit or to play along convincingly enough at least as long as the wallets of such potential benefactors are open.

So there is this evasion of the “political” and a pandering to those who might offer validation and maybe even some rewards for all the efforts that are being undertaken as long as their place is not challenged. And what that leaves us with is a catalogue of commiseration, where one can do no more than appeal to those in the same or similar circumstances to be nicer to each other – not a bad idea, it must be said – but where the divisive and exploitative forces at work will result in more conflict over time as people struggle even harder to keep going.

When the author writes this…

Remember, open source is done by people giving something away for free because they choose to; you could say you’re dealing with a bunch of digital hippies.

…we should also remember that “open source” is also done by people who will gladly take those things and still demand more things for free, these being people who will turn a nice profit for themselves while treating others so abominably.

Selective Sustainability

According to the article “the overall goal of open source is to attract and retain people to help maintain an open source project while enjoying the experience”. I cannot speak for those who advocate “open source”, but this stated goal is effectively orthogonal to the aim of Free Software, which is to empower users by presenting them with the means to take control of the software they use. By neglecting software freedom, the article contemplates matters of sustainability whilst ignoring crucial factors that provide the foundations of sustainability.

For instance, there seems to be some kind of distinction being drawn between Free Software projects that people are paid to work on (“corporate open source”) and those done in their own time (“community open source”). This may be a reflection of attitudes within companies: that there are some things that they may use but which, beyond “donations” and letting people spend a portion of their work time on it, they will never pay for. Maybe such software does not align entirely with the corporate goals and is therefore “something someone else can pay for”, like hospitals, schools, public services, infrastructure, and all the other things that companies of a certain size often seem to be unwilling to fund as they reduce their exposure to taxation.

Free Software, then, becomes almost like the subject of charity. Maybe the initiator of a project will get recognised and hired for complementing and enhancing a company’s proprietary product, just like the individual described in the article’s introduction whose project was picked up by the author’s employer. I find it interesting that the author notes how important people are to the sustainability of a project but then acknowledges that the project illustrating his employer’s engagement with “open source” could do just fine without other people getting involved. Nothing is said about why that might be the case.

So, with misapprehensions about whether anyone can ask for money for their Free Software work, plus cultural factors that encourage permissive licensing, “building a following” and doing things “for exposure”, and with Free Software being seen as something needing “donations”, an unsustainable market is cultivated. Those who wish to find some way of funding their activities must compete with people being misled into working for free. And it goes beyond whether people can afford the time: time is money, as they say, and the result may well be that people who have relatively little end up subsidising “gifts” for people who are substantially better off.

One may well be reminded of other exploitative trends in society where the less well-off have to sacrifice more and work harder for the causes of “productivity” and “the economy”, with the real beneficiaries being the more well-off looking to maximise their own gains and optimise their own life-enriching experiences. Such trends are generally not regarded as sustainable in any way. Ultimately, something has to give, as history may so readily remind us.

Below the Surface

It is certainly important to make sure people keep wanting to do an activity, whether that is Free Software development or anything else, but having enough people who “enjoy doing open source” is far from sufficient to genuinely sustain a Free Software project. It is certainly worthwhile investigating the issues that determine whether people derive enjoyment from such work or not, along with the issues that cause stress, dissatisfaction, disillusionment and worse.

But what good is it if no-one deals with these issues? When taking “a full month off annually from volunteering” is seen as the necessary preventative medicine to avoid “potential burnout”, and when there is even such a notion as “open source detox”, does it not indicate that the symptoms may be seeing some relief but the cause remains untreated? The author of the article seems to think that the main source of misery is the way people treat each other:

It all comes down to how people treat each other in open source.

That in itself is something of a superficial diagnosis given that some people may not experience random angry people criticising their work at all, and yet they may be dissatisfied with their situation nevertheless. Others may experience bad interactions, but these might be the result of factors that are not so readily acknowledged. I do not condone behaviour that might be offensive, let alone abusive, but when people react strongly in their interactions with others, they may be doing so as the consequence of what they perceive as ill-treatment or even a form of oppression or betrayal.

There is much talk of kindness, and I cannot exactly disagree with the recommendation that people be kind to each other. But I also have the feeling that another story is not being told, one of how people with a level of power and influence choose to discharge their responsibilities. And in the context of Python, the matter of Python 3 is never far away. People may have received the “gift” of Python, but they have invested in it, too. In a way, this goes beyond any reciprocation of a mere gift because this investment is also a form of submission to the governance of the technology, as well as a form of validation of it that persuades others of its viability and binds those others to its governance, too.

What then must someone with a substantial investment in that technology think when presented with something like the “Python 2.7 Countdown” clock? Is it a helpful tool for technological planning or a way of celebrating and trivialising disruption to widespread investment in, and commitment to, a mature technology? What about the “Python 3 Statement” with projects being encouraged to pledge to drop support for Python 2 and to deliberately not maintain any such support beyond the official “end of life” date? Is it an encouraging statement of enthusiasm or another way of passive-aggressively shaming those who would continue to use and support Python 2?

I accept that it would be unfair to demand that the Python core developers be made to continue to support Python 2. But I also think it is unfair to see destructive measures being taken to put Python 2 “beyond use”, the now-familiar campaigns of inaccurate or incorrect information to supposedly stir people into action to port their software to Python 3, the denial of the name “Python” to anyone who might step up and continue to support Python 2, the atmosphere of hostility to those who might take on that role. And, well, excuse me if I cannot really take the following statement seriously based on the strategic choices of the Python core developers:

And then there’s the fact that your change may have just made literally tons of physical books in bookstores around the world obsolete; something else I have to consider.

It is intriguing that there is an insistence that people not expect anything when they do something for the benefit of another, that “kindnesses” are more appropriate than “favours”:

I switched to using kindnesses because being kind in the cultures I’m familiar with has no expectation of something in return.

Aside from the fact that it becomes pretty demotivating to send fixes to projects and expect nothing to ever happen to them, to take an example of the article’s author, which after a while amounts to a lot of wasted time and effort, I cannot help but observe that returning the favour was precisely what the Python core developers expected when promoting Python 3. From there, one cannot help but observe that maybe there is one rule for one group and one rule for another group in the stratified realm of “open source”.

The Role of the Elephant

In developing Free Software and acknowledging it as such, we put software freedom – the elephant in this particular room – at the forefront of our efforts. Without it, as we have seen, the narrative is weaker, people’s motivations seem less convincing or unfathomable, and suggestions for improving everybody’s experience, although welcome, fail to properly grasp some of the actual causes of dissatisfaction and unhappiness. This because the usual myths of efficiency, creative exuberancy, and an idealised “gift culture” need to be conjured up to either explain people’s behaviour or to motivate it, the latter often in a deliberately exploitative way.

It is, in fact, software freedom that gives Python 2 users any hope for their plight, even though many of them may be dissatisfied and some of them may end up committing to other languages instead of Python 3 in future. By emphasising software freedom, they and others may be educated about their right to control their technological investment, and they may be reminded that in seeking assistance to exercise that control, they might be advised to pay others to sustain their investment. At no point does the narrative need to slip off into “free stuff”, “gifts” and the like.

Putting software freedom at the centre of Free Software activities might also promote a more respectful environment. When others are obliged to uphold end-user freedoms, they might already be inclined to think about how they treat other people. We have seen a lot written about interpersonal interactions, and it is right to demand that people treat each other with respect, but maybe such respect needs to be cultivated by having people think about higher goals. And maybe such respect is absent if those goals are deliberately ignored, focusing people to consider only each individual transaction in isolation and to wonder why everyone acts so selfishly.

So instead of having an environment where a company might be looking for people to do free work so that they can seal it up, sell a proprietary product to hapless end-users, treat the workers like “digital hippies”, and thus exploit everyone involved, we invoke software freedom to demand fairness and respect. A culture of respecting the rights of others should help participants realise that they have a collective responsibility, that everyone is in it together, that the well-being of others does not come at the cost of each participant’s own well-being.

I realise that some of the language used above is “political” for some, but when those who object to “political” language perpetuate ignorance of the roots of Free Software and marginalise such movements for social change, they also perpetuate a culture of exploitation, whether they have this as their deliberate goal or not. This elephant has been around for some time, and having a long memory as one might expect, it stands as a witness to the perils of neglecting the ethical imperatives for what we do as Free Software developers.

It is, of course, possible to form a more complete, more coherent picture of how Free Software development occurs and how sustainability in such endeavours might be achieved, but evidently this remains out of reach for those still wishing to pretend that there is no elephant in the room.

Sustainable Computing

Monday, September 3rd, 2018

Recent discussions about the purpose and functioning of the FSFE have led me to consider the broader picture of what I would expect Free Software and its developers and advocates to seek to achieve in wider society. It was noted, as one might expect, that as a central component of its work the FSFE seeks to uphold the legal conditions for the use of Free Software by making sure that laws and regulations do not discriminate against Free Software licensing.

This indeed keeps the activities of Free Software developers and advocates viable in the face of selfish and anticompetitive resistance to the notions of collaboration and sharing we hold dear. Advocacy for these notions is also important to let people know what is possible with technology and to be familiar with our rich technological heritage. But it turns out that these things, although rather necessary, are not sufficient for Free Software to thrive.

Upholding End-User Freedoms

Much is rightfully made of the four software freedoms: to use, share, study and modify, and to propagate modified works. But it seems likely that the particular enumeration of these four freedoms was inspired (consciously or otherwise) by those famously stated by President Franklin D. Roosevelt in his 1941 “State of the Union” address.

Although some of Roosevelt’s freedoms are general enough to be applicable in any number of contexts (freedom of speech and freedom from want, for instance), others arguably operate on a specific level appropriate for the political discourse of the era. His freedom from fear might well be generalised to go beyond national aggression and to address the general fears and insecurities that people face in their own societies. Indeed, his freedom of worship might be incorporated into a freedom from persecution or freedom from prejudice, these latter things being specialised but logically consequent forms of a universal freedom from fear.

But what might end-users have to fear? The list is long indeed, but here we might as well make a start. They might fear surveillance, the invasion of their privacy and of being manipulated to their disadvantage, the theft of their data, their identity and their belongings, the loss of their access to technology, be that through vandalism, technological failure or obsolescence, or the needless introduction of inaccessible or unintuitive technology in the service of fad and fashion.

Using technology has always entailed encountering risks, and the four software freedoms are a way of mitigating those risks, but as technology has proliferated it would seem that additional freedoms, or additional ways of asserting these freedoms, are now required. Let us look at some areas where advocacy and policy work fail to reach all by themselves.

Cultivating Free Software Development

Advocating for decent laws and the fair treatment of Free Software is an essential part of organisations like the FSFE. But there also has to be Free Software out in the wider world to be treated fairly, and here we encounter another absent freedom. Proponents of the business-friendly interpretation of “open source” insist that Free Software happens all by itself, that somewhere someone will find the time to develop a solution that is ripe for wider application and commercialisation.

Of course, this neglects the personal experience of any person actually doing Free Software development. Even if people really are doing a lot of development work in their own time, playing out their roles precisely as cast in the “sharing economy” (which rather seems to be about wringing the last drops of productivity out of the lower tiers of the economy than anyone in the upper tiers actually participating in any “sharing”), it is rather likely that someone else is really paying their bills, maybe an employer who pays them to do something else during the day. These people squeeze their Free Software contributions in around the edges, hopefully not burning themselves out in the process.

Freedom from want, then, very much applies to Free Software development. For those who wish to do the right thing and even get paid for it, the task is to find a sympathetic employer. Some companies do indeed value Free Software enough to pay people to develop it, maybe because those companies provide such software themselves. Others may only pay people as a consequence of providing non-free software or services that neglect some of the other freedoms mentioned above. And still others may just treat Free Software as that magical resource that keeps on providing code for nothing.

Making sure that Free Software may actually be developed should be a priority for anyone seriously advocating Free Software adoption. Otherwise, it becomes a hypothetical quantity: something that could be used for serious things but might never actually be observed in such forms, easily dismissed as the work of “hobbyists” and not “professionals”, never mind that the same people can act in either capacity.

Unfortunately, there are no easy solutions to suggest for this need. It is fair to state that with a genuine “business case”, Free Software can get funded and find its audience, but that often entails a mixture of opportunism, the right connections, and an element of good fortune, as well as the mindset needed to hustle for business that many developers either do not have or do not wish to cultivate. It also assumes that all Free Software worth funding needs to have some kind of sales value, whereas much of the real value in Free Software is not to be found in things that deliver specific solutions: it is in the mundane infrastructure code that makes such solutions possible.

Respecting the User

Those of us who have persuaded others to use Free Software have not merely been doing so out of personal conviction that it is the ethically-correct thing for us and those others to use. There are often good practical reasons for using Free Software and asserting control over computing devices, even if it might make a little more work for us when things do not work as they should.

Maybe the risks of malware or experience of such unpleasantness modifies attitudes, combined with a realisation that not much help is actually to be had with supposedly familiar and convenient (and illegally bundled) proprietary software when such malevolence strikes. The very pragmatism that Free Software advocates supposedly do not have – at least if you ask an advocate for proprietary or permissively-licensed software – is, in fact, a powerful motivation for them to embrace Free Software in the first place. They realise that control is an illusion without the four software freedoms.

But the story cannot end with the user being able to theoretically exercise those freedoms. Maybe they do not have the time, skills or resources to do so. Maybe they cannot find someone to do so on their behalf, perhaps because nobody is able to make a living performing such services. And all the while, more software is written, deployed and pushed out globally. Anyone who has seen familiar user interfaces becoming distorted, degraded, unfamiliar, frustrating as time passes, shaped by some unfathomable agenda, knows that only a very well-resourced end-user would be able to swim against such an overpowering current.

To respect the user must involve going beyond acknowledging their software freedoms and also acknowledge their needs: for secure computing environments that remain familiar (even if that seems “boring”), that do not change abruptly (because someone had a brainwave in an airport lounge waiting to go to some “developer summit” or other), that allow sensible customisation that can be reconciled with upstream improvements (as opposed to imposing a “my way or the highway”, “delete your settings” ultimatum). It involves recognising their investment in the right thing, not telling them that they have to work harder, or to buy newer hardware, just to keep up.

And this also means that the Free Software movement has to provide answers beyond those concerning the nature of the software. Free Software licensing does not have enough to say about privacy and security, let alone how those things might be upheld in the real world. Yet such concerns do impact Free Software developers, meaning that some kinds of solutions do exist that might benefit a wider audience. Is it not time to deliver things like properly secure communications where people can trust the medium, verify who it is that sends them messages, ignore the time-wasters, opportunists and criminals, and instead focus on the interactions that are meaningful and important?

And is it not time that those with the knowledge and influence in the Free Software realm offered a more coherent path to achieving this, instead of all the many calls for people to “use encryption” only to be presented with a baffling array of options and a summary that combines “it’s complicated” with “you’re on your own”? To bring the users freedom from the kind of fear they experience through a lack of privacy and security? It requires the application of technical knowledge, certainly, but it also requires us to influence the way in which technology is being driven by forces in wider society.

Doing the Right Thing

Free Software, especially when labelled as “open source”, often has little to say about how the realm of technology should evolve. Indeed, Free Software has typically reacted to technological evolution, responding to the demands of various industries, but not making demands of its own. Of course, software in itself is generally but a mere instrument to achieve other things, and there are some who advocate a form of distinction between the specific ethics of software freedom and ethics applying elsewhere. For them, it seems to be acceptable to promote “open source” while undermining the rights and freedoms of others.

Our standards should be far higher than that! Although there is a logical argument to not combine other considerations with the clearly-stated four software freedoms, it should not stop us from complementing those freedoms with statements of our own values. Those who use or are subject to our software should be respected and their needs safeguarded. And we should seek to influence the development of technology to uphold our ideals.

Let us consider a mundane but useful example. The World Wide Web has had an enormous impact on society, providing people with access to information, knowledge, communication, services on a scale and with a convenience that would have been almost unimaginable only a few decades ago. In the beginning, it was slow (due to bandwidth limitations, even on academic networks), it was fairly primitive (compared to some kinds of desktop applications), and it lacked support for encryption and sophisticated interactions. More functionality was needed to make it more broadly useful for the kinds of things people wanted to see using it.

In the intervening years, a kind of “functional escalation” has turned it into something that is indeed powerful, with sophisticated document rendering and interaction mechanisms, perhaps achieving some of the ambitions of those who were there when the Web first gathered momentum. But it has caused a proliferation of needless complexity, as sites lazily call out to pull down megabytes of data to dress up significantly smaller amounts of content, as “trackers” and “analytics” are added to spy on the user, as absurd visual effects are employed (background videos, animated form fields), with the user’s computer now finding it difficult to bear the weight of all this bloat, and with that user struggling to minimise their exposure to privacy invasions and potential exploitation.

For many years it was a given that people would, even should, upgrade their computers regularly. It was almost phrased as a public duty by those who profited from driving technological progress in such a selfish fashion. As is frequently the case with technology, it is only after people have realised what can be made possible that they start to consider whether it should have been made possible at all. Do we really want to run something resembling an operating system in a Web browser? Is it likely that this will be efficient or secure? Can we trust the people who bring us these things, their visions, their competence?

The unquestioning proliferation of technology poses serious challenges to the well-being of individuals and the ecology of our planet. As people who have some control over the way technology is shaped and deployed, is it not our responsibility to make sure that its use is not damaging to its users, that it does not mandate destructive consumer practices, that people can enjoy the benefits – modest as they often are when the background videos and animated widgets are stripped away – without being under continuous threat of being left behind, isolated, excluded because their phone or computer is not this season’s model?

Strengthening Freedoms

In rather too many words, I have described some of the challenges that need to be confronted by Free Software advocates. We need to augment the four software freedoms with some freedoms or statements of our own. They might say that the software and the solutions we want to develop and to encourage should be…

  • Sustainable to make: developers and their collaborators should be respected, their contributions fairly rewarded, their work acknowledged and sustained by those who use it
  • Sustainable to choose and to use: adopters should have their role recognised, with their choices validated and rewarded through respectful maintenance and evolution of the software on which they have come to depend
  • Encouraging of sustainable outcomes: the sustainability of the production and adoption of the software should encourage sustainability in other ways, promoting longevity, guarding against obsolescence, preventing needless and frivolous consumption, strengthening society and making it fairer and more resilient

It might be said that in order to have a fairer, kinder world there are no shortage of battles to be fought. With such sentiments, the discussion about what more might be done is usually brought to a swift conclusion. In this article, I hope to have made a case that what we can be doing is often not so different from what we are already doing.

And, of course, this brings us back to the awkward matter of why we, or the organisations we support, are not always so enthusiastic about these neglected areas of concern. Wouldn’t we all be better off by adding a dimension of sustainability to the freedoms we already recognise and enjoy?

Tuple Performance Optimisations in CPython and Lichen

Sunday, July 22nd, 2018

One of the nice things about the Python programming language is the selection of built-in data types it offers for common activities. Things like lists (collections of values) and dictionaries (key-value mappings) are very convenient and do not need much further explanation, but there is also the concept of the tuple, which is also a collection of values, like a list, but whose size and values are fixed for its entire lifespan, unlike a list. Here is what a very simple tuple looks like:

(123, "abc")

Normally, the need for a data type like this becomes apparent when programming and needing to return multiple values from a function. In languages that do not support such a convenient way of bundling things together, some extra thought is usually required to send data back to the caller, and in languages like C the technique of mutating function arguments and thus communicating such data via a function’s parameters is often used instead.

Lichen, being a Python-like language, also supports tuples. The parsing of source code, employing various existing Python libraries, involves the identification of tuple literals: occurrences of tuples written directly in the code, as seen above. For such values to have any meaning, they must be supported by a particular program representation of the tuple plus the routines that provide each tuple with their familiar characteristics. In other words, we need to provide a way of translating these values into code that makes them tangible things within a running program.

Here, Lichen differs from various Python implementations somewhat. CPython, for instance, defines practically all of the nature of its tuples in C language code (the “C” in CPython), with the pertinent file in CPython versions from 1.0 all the way to 2.7 being found as Objects/tupleobject.c within the source distribution. Meanwhile, Jython defines its tuples in Java language code, with the pertinent file being found as src/org/python/core/PyTuple.java (without the outermost src directory in very early versions). Lichen, on the other hand, implements the general form of tuples in the Lichen language itself.

Tuples All The Way Down

This seems almost nonsensical! How can Lichen’s tuples be implemented in the language itself? What trickery is involved to pull off such an illusion? Well, it might be worth clarifying what kind of code is involved and which parts of the tuple functionality are really provided by the language framework generally. The Lichen code for tuples is found in lib/__builtins__/tuple.py and has the following outline:

class tuple(sequence, hashable):
    "Implementation of tuple."

    def __init__(self, args=None):
        "Initialise the tuple."

    def __hash__(self):
        "Return a hashable value for the tuple."

    def __add__(self, other):
        "Add this tuple to 'other'."

    def __str__(self):
        "Return a string representation."

    def __bool__(self):
        "Tuples are true if non-empty."

    def __iter__(self):
        "Return an iterator."

    def __get_single_item__(self, index):
        "Return the item at the normalised (positive) 'index'."

Here, the actual code within each method has been omitted, but the outline itself defines the general structure of the data type, described by a class, representing the behaviour of each tuple. As in Python, a collection of special methods are provided to support standard operations: __hash__ supports the hash built-in function and is used when using tuples as dictionary keys; __bool__ permits the truth value testing of tuples so that they may be considered as “true” or not; and so on.

Since this definition of classes (data types) is something that needs to be supported generally, it makes sense to use the mechanisms already in place to allow us to define the tuple class in this way. Particularly notable here is the way that the tuple class inherits from other “base classes” (sequence and hashable). Indeed, why should the tuple class be different from any other class? It still needs to behave like any other class with regard to supporting things like methods, and in Lichen its values (or instances) are fundamentally just like instances of other classes.

It would, of course, be possible for me to define the tuple class in C (it being the language to which Lichen programs are compiled), but another benefit of just using the normal process of parsing and compiling the code written in the Lichen language is that it saves me the bother of having to work with such a lower-level representation and the accompanying need to update it carefully when changing its functionality. The functionality itself, being adequately expressed as Lichen code, would need to be hand-compiled to C: a tedious exercise indeed.

One can turn such questions around and ask why tuples are special things in various Python implementations. A fairly reasonable response is that CPython, at least, has evolved its implementation of types and objects over the years, starting out as a “scripting language” offering access to convenient data structures implemented in C and a type system built using those data structures. It was not until Python 2.2 that “type/class unification” became addressed, meaning that the built-in types implemented at the lowest levels – tuples amongst them – could then be treated more like “user-defined classes”, these classes being implemented in Python code.

Although the outline of a tuple class can be defined in the Lichen language itself, and although operations defining tuple behaviour are provided as Lichen code, this does not mean that everything can be implemented at this level. For example, programs written in Lichen do not manage the memory their objects use but instead delegate this task to “native” code. Moreover, some of the memory being managed may have representations that only sensibly exist at a lower level. We can start to investigate this by considering the method returning the size or length of a tuple, invoked when the len built-in function is called on a tuple:

    def __len__(self):
        "Return the length of the tuple."
        return list_len(self.__data__)

Here, the method delegates practically everything to another function, presenting the __data__ attribute of the instance involved in the method call (self). This other function actually isn’t implemented in Lichen: it is a native function that knows about memory and some low-level structures that support the tuple and list abstractions. It looks like this:

__attr __fn_native_list_list_len(__attr self, __attr _data)
{
    unsigned int size = _data.seqvalue->size;
    return __new_int(size);
}

And what it does is to treat the __data__ attribute as a special sequence structure, obtaining its size and passing that value back as an integer usable within Lichen code. The sequence structure is defined as part of the support library for compiled Lichen programs, along with routines to allocate such structures and to populate them. Other kinds of values are also represented at the native level, such as integers and character strings.

To an extent, such native representations are not so different from the special data types implemented in C within CPython and in Java within Jython. However, the Lichen implementation seeks to minimise the amount of native code dedicated to providing abstractions. Where functionality supporting a basic abstraction such as a tuple does not need to interact directly with native representations or perform “machine-level” operations, it is coded in Lichen, and this code can remain happily oblivious to the nature of the data passing through it.

There are interesting intellectual challenges involved here. One might wonder how minimal the layer of native code might need to be, for instance. With a suitable regime in place for translating Lichen code into native operations, might it be possible to do memory management, low-level arithmetic, character string operations, system calls, and more, all in the same language, not writing any (or hardly writing any) native code by hand? It is an intriguing question but also a distraction, and that leads me back towards the main topic of the article!

The Benchmarking Game

Quite a few years ago now, there was a project being run to benchmark different programming languages in order to compare their performance. It still exists, it would seem. But in the early days of this initiative, the programs were fairly simple translations between languages and the results relatively easy to digest. Later on, however, there seemed to be a choice of results depending on the hardware used to create them, and the programs became more complicated, perhaps as people saw their favourite language falling down the result tables and felt that they needed to employ a few tricks to boost their language’s standing.

I have been interested in Python implementations and their performance for a long time, and one of the programs that I have used from time to time has been the “binary trees” benchmark. You can find a more complicated version of this on the Python Interpreters Benchmarks site as well as on the original project’s site. It would appear that on both these sites, different versions are being run even for the same language implementation, presumably to showcase optimisations.

I prefer to keep things simple, however. As the Wikipedia page notes, the “binary trees” benchmark is presumably a test of memory allocation performance. What I discovered when compiling a modified version of this program, one that I had originally obtained without the adornments of multiprocessing module and generator usage, was perhaps more interesting in its own right. The first thing I found was that my generated C program was actually slower than the original program run using CPython: it took perhaps 140% of the CPython running time (48 seconds versus 34 seconds).

My previous article described various realisations that I had around integer performance optimisations in CPython. But when I first tried to investigate this issue, I was at a loss to explain it. It could be said that I had spent so much effort getting the toolchain and supporting library code into some kind of working order that I had little energy left for optimisation investigations, even though I had realised one of my main objectives and now had the basis for such investigations available to me. Perhaps a quick look at the “binary trees” code is in order, so here is an extract:

def make_tree(item, depth):
    if depth > 0:
        item2 = 2 * item
        depth -= 1
        return (item, make_tree(item2 - 1, depth), make_tree(item2, depth))
    else:
        return (item, None, None)

So, here we have some tuples in action, and in the above function, recursion takes place – the function calls itself – to make the tree, hence the function name. Consequently, we have a lot of tuples being created and can now understand what the Wikipedia page was claiming about the program. The result of this function gets presented to another function which unpacks the return value, inspects it, and then calls itself recursively, too:

def check_tree(tree):
    (item, left, right) = tree
    if left is not None:
        return item + check_tree(left) - check_tree(right)
    else:
        return item

I did wonder about all these tuples, and in the struggle to get the language system into a working state, I had cobbled together a working tuple representation, in which I didn’t really have too much confidence. But I wondered about what the program would look like in the other languages involved in the benchmarking exercise and whether tuples (or some equivalent) were also present in whichever original version that had been written for the exercise, possibly in a language like Java or C. Sure enough, the Java versions (simple version) employ class instances and not things like arrays or other anonymous data structures comparable to tuples.

So I decided to change the program to also use classes and to give these tree nodes a more meaningful form:

class Node:
    def __init__(self, item, left, right):
        self.item = item
        self.left = left
        self.right = right

def make_tree(item, depth):
    if depth > 0:
        item2 = 2 * item
        depth -= 1
        return Node(item, make_tree(item2 - 1, depth), make_tree(item2, depth))
    else:
        return Node(item, None, None)

In fact, this is a somewhat rudimentary attempt at introducing object orientation since we might also make the function a method. Meanwhile, in the function handling the return value of the above function, the tuple unpacking was changed to instead access the attributes of the returned Node instances seen above.

def check_tree(tree):
    if tree.left is not None:
        return tree.item + check_tree(tree.left) - check_tree(tree.right)
    else:
        return tree.item

Now, I expected this to be slower in CPython purely because there is more work being done, and instance creation is probably more costly than tuple creation, but I didn’t expect it to be four times slower (at around 2 minutes 15 seconds), which it was! And curiously, running the same program compiled by Lichen was actually quicker (22 seconds), which is about 65% of the original version’s running time in CPython, half the running time of the original version compiled by Lichen, and nearly an sixth of the revised version’s running time in CPython.

One may well wonder why CPython is so much slower when dealing with instances instead of tuples, and this may have been a motivation for using tuples in the benchmarking exercise, but what was more interesting to me at this point was how the code generated by the Lichen toolchain was managing to be faster for instances, especially since tuples are really just another kind of object in the Lichen implementation. So why were tuples slower, and could there be a way of transferring some of the performance of general objects to tuples?

Unpacking More Performance

The “binary trees” benchmark is supposed to give memory allocation a workout, but after the integer performance investigation, I wasn’t about to fall for the trick of blaming the allocator (provided by the Boehm-Demers-Weiser garbage collection library) whose performance is nothing I would complain about. Instead, I considered how CPython might be optimising tuple operations and paid another visit to the interpreter source code (found in Python/ceval.c in the sources for all the different releases of Python 1 and 2) and searched for tuple-related operations.

My experiments with Python over the years have occasionally touched upon the bytecode employed by CPython to represent compiled programs, each bytecode instruction being evaluated by the CPython interpreter. I already knew that some program operations were supported by specific bytecodes, and sure enough, it wasn’t long before I encountered a tuple-specific example: the UNPACK_SEQUENCE instruction (and its predecessors in Python 1.5 and earlier, UNPACK_TUPLE and UNPACK_LIST). This instruction is generated when source code like the following is used:

(item, left, right) = tree

The above would translate to something like this:

              0 LOAD_FAST                0 (tree)
              3 UNPACK_SEQUENCE          3
              6 STORE_FAST               1 (item)
              9 STORE_FAST               2 (left)
             12 STORE_FAST               3 (right)

In CPython, you can investigate the generated bytecode instructions using the dis module, putting the code of interest in a function, and running the dis.dis function on the function object, which is how I generated the above output. Here, UNPACK_SEQUENCE makes an appearance, accessing the items in the tree sequence one by one, pushing them onto the evaluation stack, CPython’s interpreter being a stack-based virtual machine. And sure enough, the interpreter capitalises on the assumption that the operand of this instruction will most likely be a tuple, testing it and then using tuple-specific operations to get at the tuple’s items.

Meanwhile, the translation of the same source code by the Lichen toolchain was rather less optimal. In the translation code, the unpacking operation from the input program is rewritten as a sequence of assignments, and something like the following was being generated:

item = tree[0]
left = tree[1]
right = tree[2]

This in turn gets processed, rewriting the subscript operations (indicated by the bracketing) to the following:

item = tree.__getitem__(0)
left = tree.__getitem__(1)
right = tree.__getitem__(2)

This in turn was being translated to C for the output program. But this is not particularly efficient: it uses a generic mechanism to access each item in the tree tuple, since it is possible that the only thing we may generally assert about tree is that it may provide the __getitem__ special method. The resulting code has to perform extra work to eventually arrive at the code that actually extracts an item from the tuple, and it will be doing this over and over again.

So, the first thing to try was to see if there was any potential for a speed-up by optimising this unpacking operation. I changed the generated C code emitted for the operations above to use the native tuple-accessing functions instead and re-ran the program. This was promising: the running time decreased from 48 seconds to 23 seconds; I had struck gold! But it was all very well demonstrating the potential. What now needed to be done was to find a general way of introducing something similarly effective that would work automatically for all programs.

Of course, I could change the initial form of the unpacking operations to use the __getitem__ method directly, but this is what was being produced anyway, so there would be no change whatsoever in the resulting program. However, I had introduced a Lichen-specific special method, used within the standard library, that accesses individual items in a given sequence instance. (It should be noted that in Python and Lichen, the __getitem__ method can accept a slice object and thus return a collection of values, not just one.) Here is what the rewritten form of the unpacking would now look like:

item = tree.__get_single_item__(0)
left = tree.__get_single_item__(1)
right = tree.__get_single_item__(2)

Compiling the program and running it gave a time of 34 seconds. We were now at parity with CPython. Ostensibly, the overhead in handling different kinds of item index (integers or slice objects) was responsible for 30% of the original version’s running time. What other overhead might there be, given that 34 seconds is still rather longer than 23 seconds? What does this other special method do that my quick hack does not?

It is always worth considering what the compiler is able to know about the program in these cases. Looking at the __get_single_item__ method for a tuple reveals something of interest:

    def __get_single_item__(self, index):
        "Return the item at the normalised (positive) 'index'."
        self._check_index(index)
        return list_element(self.__data__, index)

In the above, the index used to obtain an item is checked to see if it is valid for the tuple. Then, the list_element native function (also used on tuples) obtains the item from the low-level data structure holding all the items. But is there a need to check each index? Although we do need to make sure that accesses to do not try and read “off the end” of the collection of items, accessing items that do not exist, we do not actually need to “normalise” the index.

Such normalisation is the process of interpreting negative index values as referring to items relative to the end of the collection, with -1 referring to the last item, -2 to the next last item, and so on, all the way back to -n referring to the first item (with n being the number of items in the collection). However, the code being generated does not use negative index values, and if we introduce a test to make sure that the tuple is large enough, then we should be able to get away with operations that use the provided index values directly. So I resolved to introduce another special method for this purpose, now rewriting the code as follows:

__builtins__.sequence._test_length(tree, 3)
item = tree.__get_single_item_unchecked__(0)
left = tree.__get_single_item_unchecked__(1)
right = tree.__get_single_item_unchecked__(2)

The _test_length function will raise an exception if the length is inappropriate. Meanwhile, the newly-introduced special method is implemented in a base class of both tuples and lists, and it merely employs a call to list_element for the provided index. Compiling the code with these operations now being generated and running the result yielded a running time of 27 seconds. Some general changes to the code generation, not specific to tuples, brought this down to 24 seconds (and the original version down to 44 seconds, with the object-based version coming down to 16 seconds).

So, the progression in performance looks like this:

Program Version (Lichen Strategy) Lichen CPython
Objects 135 seconds
Tuples (__getitem__) 48 seconds

44 seconds
Tuples (__get_single_item__) 34 seconds  34 seconds
Tuples (__get_single_item_unchecked__) 27 seconds

24 seconds
Objects 22 seconds

16 seconds

Here, the added effect of these other optimisations is also shown where measured.

Conclusions

As we saw with the handling of integers in CPython, optimisations also exist to tune tuple performance in that implementation of Python, and these also exist in other implementations such as Jython (see the unpackSequence method in the org.python.core.Py class, found in the org/python/core/Py.java file). Taking advantage of guarantees about accesses to tuples that are written explicitly into the program source, the generated code can avoid incurring unnecessary overhead, thus considerably speeding up the running time of programs employing tuple unpacking operations.

One might still be wondering why the object-based version of the program is faster than the tuple-based version for Lichen. This is most likely due to the ability of the compiler to make the attribute accesses on the tree object efficient based on deductions it has performed. Fewer low-level operations are performed to achieve the same result, and time is saved as a consequence. One might also wonder why the object-based version is slower when run by CPython. That would probably be due to the flexible but costly way objects are represented and accessed in that language implementation, and this was indeed one of my motivations for exploring other language design and implementation approaches with Lichen.

Shared-Mode Executables in L4Re for MIPS-Based Devices

Sunday, July 8th, 2018

I have been meaning to write about my device driver experiments with L4Re, following on from my porting exercises, but that exercise took me along various routes and I haven’t yet got back to documenting all of them. Meanwhile, one thing that did start to bother me was how much space the software was taking up when compiled, linked and ready to deploy.

Since each of my device drivers is a separate program, and since each one may be linked to various libraries, they each started to contribute substantially to the size of the resulting file – the payload – needing to be transferred to the device. At one point, I had to resize the boot partition on the memory card used by the Letux 400 notebook computer to make the payload fit in the available space.

The work done to port L4Re to the MIPS Creator CI20 had already laid the foundations for functioning payloads, and once the final touches were put in place to support the peculiarities of the Ingenic JZ4780 system-on-a-chip, it was possible to run both the conventional “hello” example which is statically linked to its libraries, as well as a “shared-hello” example which is dynamically linked to its libraries. The latter configuration of the program results in a smaller executable program and thus a smaller payload.

So it seemed clear that I might be able to run my own programs on the Letux 400 or Ben NanoNote with similar reductions in payload size. Unfortunately, nothing ever seems to be as straightforward as it ought to be.

Exceptional Obstructions and Observations

Initially, I set about trying one of my own graphical examples with the MODE variable set to “shared” in its Makefile. This, upon powering up, merely indicated that it had not managed to start up properly. Instead of a blank screen, the viewports set up by the graphical multiplexer, Mag, were still active and showing their usual blankness. But these regions did not then change in any way when I pressed keys on the keyboard (which is functionality that I will hopefully get round to describing in another article).

I sought some general advice from the l4-hackers mailing list, but quickly realised that to make any real progress, I would need a decent way of accessing the debugging output produced by the dynamic linker. This took me on a diversion that led to my debugging capabilities being strengthened with the availability of a textual output console on the screen of my devices. I still don’t like the idea of performing hardware modifications to get access to the serial console, so this is a useful and welcome alternative.

Having switched out the “hello” program with the “shared-hello” program in the system configuration and module list demonstrating the framebuffer terminal, I deployed the payload and powered up, but I did not get the satisfying output of the program operating normally. Instead, the framebuffer terminal appeared and rewarded me with the following message:

L4Re: rom/ex_hello_shared: Unhandled exception: PC=0x800000 PFA=8d7a LdrFlgs=0

This isn’t really the kind of thing you want to see. Having not had to debug L4Re or Fiasco.OC in any serious fashion for a couple of months, I was out of practice in considering the next step, but fortunately some encouragement arrived in a private e-mail from Jean Wolter. This brought the suggestion of triggering the kernel debugger, but since this requires serial console access, it wasn’t a viable approach. But another idea that I could use involved writing out a bit more information in the routine that was producing this output.

The message in question originates in the pkg/l4re-core/l4re_kernel/server/src/region.cc file, within the Region_map::op_exception method. The details it produces are rather minimal and generic: the program counter (PC) tells us where the exception occurred; the loader flags (LdrFlags) presumably tell us about the activity of the library loader; the mysterious “PFA” is supposedly the page fault address but it actually seemed to be the stack pointer address on these MIPS-based systems.

On their own, these details are not particularly informative, but I suppose that more useful information could quickly become fairly specific to a particular architecture. Jean suggested looking at the structure describing the exception state, l4_exc_regs_t (defined with MIPS-specific members in pkg/l4re-core/l4sys/include/ARCH-mips/utcb.h), to see what else I might dig up. This I did, generating the following:

pc=0x800000
gp=0x82dd30
sp=0x8d7a
ra=0x802f6c
cause=0x1000002c

A few things interested me, thus motivating my choice of registers to dump. The global pointer (gp) register tells us about symbols in the problematic code, and I felt that having once made changes to the L4Re sources – way back in the era of getting the CI20 to run GCC-generated code – so that another register (t9) would be initialised correctly, this so that the gp register would be set up correctly within programs, it was entirely possible that I had rather too enthusiastically changed something that was now causing a problem.

The stack pointer (sp) is useful to check, just to see if it located in a sensible region of memory, and here I discovered that this seemed to be the same as the “PFA” number. Oddly, the “PFA” seems to occupy the same place in the exception structure as any “bad virtual address” featuring in an address exception, and so I started to suspect that maybe the stack pointer was referencing the wrong part of memory. But this was partially ruled out by examining the value of the stack pointer in the “hello” example, which appeared to reference broadly the same part of memory. And, of course, the “hello” example works just fine.

In fact, the cause register indicated another kind of exception entirely, and it was one I was not really expecting: a “coprocessor unusable” exception indicating that coprocessor 1, typically a floating point arithmetic unit, was being illegally requested by an instruction. Here is how I interpreted that register’s value:

hex value   binary value
1000002c == 00010000000000000000000000101100
              --                     -----
              CE                     ExcCode

=> CE == 1; ExcCode == 11 (coprocessor unusable)
=> coprocessor 1 unusable

Now, as I may have mentioned before, the hardware involved in this exercise does not support floating point instructions itself, and this is why I have configured compilers to use “soft-float” (software-based floating point arithmetic) support. It meant that I had to find places that might have wanted to use floating point instructions and eliminate those instructions somehow. Fortunately, only code generated by the compiler was likely to contain such instructions. But now I wondered if there weren’t some instructions of this nature lurking in places I hadn’t checked.

I had also thought to check the return address (ra) register. This tells us where the processor will jump to when it has finished executing the current routine, and since this is usually a matter of “returning” somewhere, it tells us something about the code that was being executed before the problematic routine was called. I figured that the work being done before the exception was probably going to be more important than the exception itself.

Floating Point Magic

Another debugging suggestion that now became unavoidable was to inspect the erroneous instruction. I noted above that this instruction was causing the processor to signal an illegal attempt to use an unusable – actually completely unavailable – coprocessor. Writing a numeric representation of the instruction to the display provided me with the following hexadecimal (base 16) value:

464c457f

This can be interpreted as follows in binary, with groups of bits defined for interpretation according to the MIPS instruction set architecture, and with tentative interpretations of these groups provided beneath:

010001 10010  01100 01000 10101 111111
COP1   rs/fmt rt/ft rd/fs       C.ABS.NGT

The first group of bits is the opcode field which is interpreted as a coprocessor 1 (COP1) opcode. Should we then wish to consider what the other groups mean, we might then examine the final group which could indicate a comparison instruction. However, this becomes rather hypothetical since the processor will most likely interpret the opcode field and then decide that it cannot handle the instruction.

So, I started to look for places where the instruction might have been written, but no obvious locations were forthcoming. One peculiar aspect of all this is that the location of the instruction is at a rather “clean” location – 0x800000 – and some investigations indicated that this is where the library containing the problematic code gets loaded. I actually don’t remember precisely how I figured this out, but I think it was as follows.

I had looked at linker scripts that might give some details of the location of program objects, and one of them (pkg/l4re-core/ldscripts/ARCH-mips/main_dyn.ld) seemed to be related. It gave an address for the code of 0x400000. This made me think that some misconfiguration or erroneous operation was putting the observed code somewhere it shouldn’t be. But changing this address in the linker script just gave another exception at 0x400000, meaning that I had disrupted something that was intentional and probably working fine.

Meanwhile, emitting the t9 register’s value from the exception state yielded 0x800000, indicating that the calling routine had most likely jumped straight to that address, not to another address with execution having then proceeded normally until reaching the exception location. I decided to look at the instructions around the return address, these most likely being the ones that had set up the call to the exception location. Writing these locations out gave me some idea about the instructions involved. Below, I provide the stored values and their interpretations as machine instructions:

8f998250 # lw $t9, -32176($gp)
24a55fa8 # addiu $a1, $a1, 0x5fa8
0320f809 # jalr $t9
24844ee4 # addiu $a0, $a0, 0x4ee4
8fbc0010 # lw $gp, 16($sp)

One objective of doing this, apart from confirming that a jump instruction (jalr) was involved, with the t9 register being employed as is the convention with MIPS code, was to use the fragment to identify the library that was causing the error. A brute-force approach was employed here, generating “object dumps” from the library files and writing them out as files in a new directory:

mkdir tmpdir
for FILENAME in mybuild/lib/mips_32/l4f/* ; do
    mipsel-linux-gnu-objdump -d "$FILENAME" > tmpdir/`basename "$FILENAME"`
done

The textual dump files were then searched for the instruction values using grep, narrowing down the places where these instructions were found in consecutive locations. This yielded the following code, found in the libld-l4.so library:

    2f5c:       8f998250        lw      t9,-32176(gp)
    2f60:       24a55fa8        addiu   a1,a1,24488
    2f64:       0320f809        jalr    t9
    2f68:       24844ee4        addiu   a0,a0,20196
    2f6c:       8fbc0010        lw      gp,16(sp)

The integer operands for the addiu instructions are the same, of course, just being shown as decimal rather than hexadecimal values. Now, we previously saw that the return address (ra) register had the value 0x802f6c. When a MIPS processor executes a jump instruction, it will also fetch the following instruction and execute it, this being a consequence of the way the processor architecture is designed.

So, the instruction after the jump, residing in what is known as the “branch delay slot” is not the instruction that will be visited upon returning from the called routine. Instead, it is the instruction after that. Here, we see that the return address from the jump at location 0x2f64 would be two locations later at 0x2f6c. This provides a kind of realisation that the program object – the libld-l4.so library – is positioned in memory at 0x800000: 0x2f6c added to 0x800000 gives the value of ra, 0x802f6c.

And this means that the location of the problematic instruction – the cause of our exception – is the first location within this object. Anyone with any experience of this kind of software will have realised by now that this doesn’t sound like a healthy situation: the first location within a library is not actually going to be code because these kinds of objects are packaged up in a way that permits their manipulation by other programs.

So what is the first location of a library used for? Since such objects employ the Executable and Linkable Format (ELF), we can take a look at some documentation. And we see that the first location is used to identify the kind of object, employing a “magic number” for the purpose. And that magic number would be…

464c457f

In the little-endian arrangement employed by this processor, the stored bytes are as follows:

7f
45 ('E')
4c ('L')
46 ('F')

The value was not a floating point instruction at all, but the magic number at the start of the library object! It was something of a coincidence that such a value would be interpreted as a floating point instruction, an accidentally convenient way of signalling something going badly wrong.

Missing Entries

The investigation now started to focus on how the code trying to jump to the start of the library had managed to get this incorrect address and what it was trying to do by jumping to it. I started to wonder if the global pointer (gp), whose job it is to reference the list of locations of program routines and other global data, might have been miscalculated such that attempts to load the addresses of routines would then be failing with data being fetched from the wrong places.

But looking around at code fragments where the gp register was being calculated, they seemed to look set to calculate the correct values based on assumptions about other registers. For example, from the object dump for libld-l4.so:

00002780 <_ftext>:
    2780:       3c1c0003        lui     gp,0x3
    2784:       279cb5b0        addiu   gp,gp,-19024
    2788:       0399e021        addu    gp,gp,t9

Assuming that the processor has t9 set to 0x2780 and then jumps to the value of t9, as is the convention, the following calculation is then performed:

gp = 0x30000 (since lui loads the "upper" half-word)
gp = gp - 19024 = 0x30000 - 19024 = 0x2b5b0
gp = gp + t9 = 0x2b5b0 + 0x2780 = 0x2dd30

Using the nm tool, which tells us about symbols in program objects, it was possible to check this value:

mipsel-linux-gnu-nm -n mybuild/lib/mips_32/l4f/libld-l4.so

This shows the following at the end of the output:

0002dd30 d _gp

Also appearing somewhat earlier in the output is this, telling us where the table of symbols starts (as well as the next thing in the file):

00025d40 a _GLOBAL_OFFSET_TABLE_
00025f90 g __dso_handle

Some digging around in the L4Re source code gave a kind of confirmation that the difference between _gp and _GLOBAL_OFFSET_TABLE_ was to be expected. Here is what I found in the pkg/l4re-core/uclibc/lib/contrib/uclibc/ldso/ldso/mips/elfinterp.c file:

#define OFFSET_GP_GOT 0x7ff0

If gp, when recalculated in other places, ended up getting the same value, there didn’t seem to be anything wrong with it. Some quick inspections of neighbouring calculations indicated that this wasn’t likely to be the problem. But what about the values used in conjunction with gp? Might they be having an effect? In the case of the erroneous jump, the following calculation is involved:

lw t9,-32176(gp) => load word into t9 from the location at gp - 32176
                 => ...               from 0x2dd30 - 32176
                 => ...               from 0x25f80

The calculated address, 0x25f80, is after the start of _GLOBAL_OFFSET_TABLE_ providing entries for program routines and other things, which is a good sign, but what is perhaps more troubling is how far after the start of the table such a value is. In the above output, another symbol (__dso_handle) indicates something that is located at the end of the table. Now, although its address is still greater than the one computed above, meaning that the computation does not cause us to stray off the end of the table, the computed address is suspiciously close to the end.

There was nothing else to do than to have a look at the table contents itself, and here it was rather useful to have a way of displaying a number of values on the screen. At this point, we have to note that the addresses in use in the running system are adjusted according to the start of the loaded object, so that the table is positioned at 0x25d40 in the object dump, but in the running system we would see 0x800000 + 0x25d40 and thus 0x825d40 instead.

What I saw was that the table contained entries that varied in the expected way right up until 0x825f60 (corresponding to 0x25f60 in the object dump) being only 0x30 (or 48 bytes, or 12 entries) before the end of the table, but then all remaining entries starting at 0x825f64 (corresponding to 0x25f64) yielded a value of 0x800000, apart from 0x825f90 (corresponding to 0x25f90, right at the end of the table) which yielded itself.

Since the calculated address above (0x25f80, adjusted to 0x825f80 in the running system) lies in this final region, we now know the origin of this annoying 0x800000: it comes from entries at the end of the table that do not seem to hold meaningful values. Indeed, the object dump for the library seemed to skip over this region of the table entirely, presumably because it was left uninitialised. And using the readelf tool with the –relocs option to show “relocations”, which applies to this table, it appeared that the last entries rather confirmed my observations:

00025d34  00000003 R_MIPS_REL32
00025f90  00000003 R_MIPS_REL32

Clearly, something is missing from this table. But since something has to adjust the contents of the table to add the “base address”, 0x800000, to the entries in order to provide valid addresses within the running program, what started to intrigue me was whether the code that performed this adjustment had any idea about these missing entries, and how this code might be related to the code causing the exception situation.

Routines and Responsibilities

While considering the nature of the code causing the exception, I had been using the objdump utility with the -d (disassemble) and -D (disassemble all) options. These provide details of program sections, code routines and the machine instructions themselves. But Jean pointed out that if I really wanted to find out which part of the source code was responsible for producing certain regions of the program, I might use a combination of options: -d, -l (line numbers) and -S (source code). This was almost a revelation!

However, the code responsible for the jump to the start of the library resisted such measures. A large region of code appeared to have no corresponding source, suggesting that it might be generated. Here is how it starts:

_ftext():
    2dac:       00000000        nop
    2db0:       3c1c0003        lui     gp,0x3
    2db4:       279caf80        addiu   gp,gp,-20608
    2db8:       0399e021        addu    gp,gp,t9
    2dbc:       8f84801c        lw      a0,-32740(gp)
    2dc0:       8f828018        lw      v0,-32744(gp)

There is no function defined in the source code with the name _ftext. However, _ftext is defined in the linker script (in pkg/l4re-core/ldscripts/ARCH-mips/main_rel.ld) as follows:

  .text           :
  {
    _ftext = . ;
    *(.text.unlikely .text.*_unlikely .text.unlikely.*)
    *(.text.exit .text.exit.*)
    *(.text.startup .text.startup.*)
    *(.text.hot .text.hot.*)
    *(.text .stub .text.* .gnu.linkonce.t.*)
    /* .gnu.warning sections are handled specially by elf32.em.  */
    *(.gnu.warning)
    *(.mips16.fn.*) *(.mips16.call.*)
  }

If you haven’t encountered linker scripts before, then you probably don’t want to spend too much time looking at this, linker scripts being frustratingly terse and arcane, but the essence of the above is that a bunch of code is stuffed into the .text section, with _ftext being assigned the address of the start of all this code. Now, _ftext in the linker script corresponds to a particular label in the object dump (which we saw earlier was positioned at 0x2780) whereas the _ftext function in the code occurs later (at 0x2dac, above). After the label but before the function is code whose source is found by objdump.

So I took the approach of removing things from the linker script, ultimately removing everything from the .text section apart from the assignment to _ftext. This removed the annotated regions of the code and left me with only the _ftext function. It really did appear that this was something the compiler might be responsible for. But where would I find the code responsible?

One hint that was present in the _ftext function code was the use of another identified function, __cxa_finalize. Searching the GCC sources for code that might use it led me to the libgcc sources and to code that invokes destructor functions upon program exit. This wasn’t really what I was looking for, but the file containing it (libgcc/crtstuff.c) would prove informative.

Back to the Table

Jean had indicated that there might be a difference in output between compilers, and that certain symbols might be produced by some but not by others. I investigated further by using the readelf tool with the -a option to show almost everything about the library file. Here, the focus was on the global offset table (GOT) and information about the entries. In particular, I wanted to know more about the entry providing the erroneous 0x800000 value, located at (gp – 32176). In my output I saw the following interesting thing:

 Global entries:
   Address     Access  Initial Sym.Val. Type    Ndx Name
  00025f80 -32176(gp) 00000000 00000000 FUNC    UND __register_frame_info

This seems to tell us what the program expects to find at the location in question, and it indicates that the named symbol is presumably undefined. There were some other undefined symbols, too:

_ITM_deregisterTMCloneTable
_ITM_registerTMCloneTable
__deregister_frame_info

Meanwhile, Jean was seeing symbols with other names:

__register_frame_info_base
__deregister_frame_info_base

During my perusal of the libgcc sources, I had noticed some of these symbols being tested to see if they were non-zero. For example:

  if (__register_frame_info)
    __register_frame_info (__EH_FRAME_BEGIN__, &object);

These fragments of code appear to be located in functions related to program initialisation. And it is also interesting to note that back in the library code, after the offending table entry has been accessed, there are tests against zero:

    2f34:       3c1c0003        lui     gp,0x3
    2f38:       279cadfc        addiu   gp,gp,-20996
    2f3c:       0399e021        addu    gp,gp,t9
    2f40:       27bdffe0        addiu   sp,sp,-32
    2f44:       8f828250        lw      v0,-32176(gp)
    2f48:       afbc0010        sw      gp,16(sp)
    2f4c:       afbf001c        sw      ra,28(sp)
    2f50:       10400007        beqz    v0,2f70 <_ftext+0x7f0>

Here, gp gets set up correctly, v0 is set to the value of the table entry, which we now believe refers to __register_frame_info, and the beqz instruction tests this value against zero, skipping ahead if it is zero. Does that not sound a bit like the code shown above? One might think that the libgcc code might handle an uninitialised table entry, and maybe it is intended to do so, but the table entry ends up getting adjusted to 0x800000, presumably as part of the library loading process.

I think that the most relevant function here for the adjustment of these entries is _dl_perform_mips_global_got_relocations which can be found in the pkg/l4re-core/uclibc/lib/contrib/uclibc/ldso/ldso/ldso.c file as part of the L4Re C library code. It may well have changed the entry from zero to this erroneous non-zero value, merely because the entry lies within the table and is assumed to be valid.

So, as a consequence, the libgcc code acts as if it has a genuine __register_frame_info function to call, and doing so causes the jump to the start of the library object and the exception. Maybe the code is supposed to be designed to handle missing symbols, those symbols potentially being deliberately omitted, but it doesn’t function correctly under these particular circumstances.

Symbol Restoration

However, despite identifying this unfortunate interaction between C library and libgcc, the matter of a remedy remained unaddressed. What was I to do about these missing symbols? Were they supposed to be there? Was there a way to tell libgcc not to expect them to be there at all?

In attempting to learn a bit more about the linking process, I had probably been through the different L4Re packages several times, but Jean then pointed me to a file I had seen before, perhaps before I had needed to think about these symbols at all. It contained “empty” definitions for some of the symbols but not for others. Maybe the workaround or even the solution was to just add more definitions corresponding to the symbols the program was expecting? Jean thought so.

So, I added a few things to the file (pkg/l4re-core/ldso/ldso/fixup.c):

void __deregister_frame_info(void);
void __register_frame_info(void);
void _ITM_deregisterTMCloneTable(void);
void _ITM_registerTMCloneTable(void);

void __deregister_frame_info(void) {}
void __register_frame_info(void) {}
void _ITM_deregisterTMCloneTable(void) {}
void _ITM_registerTMCloneTable(void) {}

I wasn’t confident that this would fix the problem. After all the investigation, adding a few lines of trivial code to one file seemed like too easy a way to fix what seemed like a serious problem. But I checked the object dump of the library, and suddenly things looked a bit more reasonable. Instead of referencing an uninitialised table entry, objdump was able to identify the jump target as __register_frame_info:

    2e14:       8f828040        lw      v0,-32704(gp)
    2e18:       afbc0010        sw      gp,16(sp)
    2e1c:       afbf001c        sw      ra,28(sp)
    2e20:       10400007        beqz    v0,2e40 <_ftext+0x7f0>
    2e24:       8f85801c        lw      a1,-32740(gp)
    2e28:       8f84803c        lw      a0,-32708(gp)
    2e2c:       8f998040        lw      t9,-32704(gp)
    2e30:       24a55fa8        addiu   a1,a1,24488
    2e34:       04111c39        bal     9f1c <__register_frame_info>

Of course, the code had been reorganised and so things were no longer in quite the same places, but in the above, (gp – 32704) is actually a reference to __register_frame_info, and although this value gets tested against zero as before, we can see that enough is now known about the previously-missing symbol that a branch directly to the location of the function has been incorporated, rather than a jump to the address stored in the table.

And sure enough, powering up the Letux 400 produced the framebuffer terminal showing the expected output:

Hi World! (shared)

It had been a long journey for such a modest reward, but thanks to Jean’s help and a bit of perseverance, I got there in the end.

Investigating CPython’s Optimisation Trickery for Lichen

Saturday, July 7th, 2018

For those of us old enough to remember how the Python programming language was twenty or so years ago, nostalgic for a simpler and kinder time, looking to escape the hard reality of today’s feature enhancement arguments, controversies, general bitterness and recriminations, it can be informative to consider what was appealing about Python all those years ago. Some of us even take this slightly further and attempt to formulate our own take on the language, casting aside things that do not appeal or that seem superfluous, needlessly confusing, or redundant.

My own Python variant, called Lichen, strips away quite a few things from today’s Python but probably wouldn’t seem so different to twentieth century Python. Since my primary objective with Lichen is to facilitate static analysis so that observations can be made about program behaviour before running the program, certain needlessly-dynamic features have been eliminated. Usually, when such statements about feature elimination are made, people seize upon them to claim that the resulting language is statically typed, but this is deliberately not the case here. Indeed, “duck typing” is still as viable as ever in Lichen.

Ancient Dynamism

An example of needless dynamism in Python can arguably be found with the __getattr__ and __setattr__ methods, introduced as far back as Python 1.1. They allow accesses to attributes via instances to be intercepted and values supposedly provided by these attributes to be computed on demand. In effect, these methods support virtual or dynamic attributes that are not really present on an object. Here’s an extract from one of the Python 1.2 demonstration programs (Demo/pdist/client.py):

        def __getattr__(self, name):
                if name in self._methods:
                        method = _stub(self, name)
                        setattr(self, name, method) # XXX circular reference
                        return method
                raise AttributeError, name

In this code, if an instance of the Client class (from which this method is taken) is used to access an attribute called hello, then this method will see if the string “hello” is found in the instance’s _methods attribute, and if so it produces a special object that is then returned as the value for the hello attribute. Otherwise, it raises an exception to indicate that this virtual attribute is not recognised. (Here, the setattr call stores the special object as a genuine attribute in order to save this method from being called again for the same attribute.)

Admittedly, this is quite neat, and it quickly becomes tempting to use such facilities everywhere – this is very much the story of Python and its development – but such things make reasoning about programs more difficult. We cannot know what attributes the instances of this Client class may have without running the program. Indeed, to find out in this case, running the program is literally unavoidable since the _methods attribute is actually populated using the result of a message received over the network!

But even in simpler cases, it can readily be intuitively understood that finding out the supported attributes of instances whose class offers such a method might involve a complicated exercise looking at practically all the code in a program. Despite all the hard work, this exercise will nevertheless produce unreliable or imprecise results. It says something about the fragility of such facilities that properties were later added to Python 2.2 to offer a more declarative alternative.

(It also says something about Python 3 that the cornucopia of special mechanisms for dynamically exposing attributes are apparently still present, despite much having been said about Python 3 remedying such Python 1 and 2 design artefacts.)

Hidden Motives

With static analysis, we might expect to be able to deduce which attributes are provided by class instances without running a program, this potentially allowing us to determine the structure of program objects and to detect errors around their use. But another objective with Lichen is to see how constraints on the language may be used to optimise the performance of programs. I will not deny that performance has always been an interest of mine with respect to Python and its implementations, and I imagine that many compiler and virtual machine implementers have been motivated by such concerns throughout the years.

The deductions made during static analysis can potentially allow us to generate executable programs that perform the same work more efficiently. For example, if it is known that a collection of method calls on an object identify that object as being of a certain type, we can then employ more efficient ways of calling those methods. So, for the following code…

        while number:
            digits.append(_hexdigits[number % base])
            number = number / base

…if we can assert that digits is a list, then considering that we might normally generate code for the append method call as something like this…

__load_via_class(digits, append)(...)

…where the __load_via_class operation has to go and find the append method via the class of digits (or, in some cases, even look for the append attribute on the object first), we might instead be able to generate code like this…

__fn_list_append(digits, ...)

…where __fn_list_append is a genuine C function and the digits instance is passed directly to it, together with the elided arguments. When we can get this kind of thing to happen, it can obviously be very satisfying. Various Python implementations and tools also attempt to make method calls efficient in their own ways, some possibly relying on run-time caches that short-circuit the exercise of finding the method.

Magic Numbers

It can be informative to compare the performance of code generated by the Lichen toolchain and the performance of the same program running in the CPython virtual machine, Python and Lichen being broadly compatible (but not identical). As I noted in my summary of 2017, the performance of generated programs was rather disheartening to see at first. I needed to employ profiling to discover where the time was being spent in my generated code that seemed not to be a comparable burden on CPython.

The practicalities of profiling are definitely beyond the scope of this article, but what I did notice was just how much time was being spent allocating space in memory for integers used by programs. I recalled that Python does some special things with integers itself, and so I set about looking for the details of its memory allocation strategies.

It turns out that integers are allocated in a simplified fashion for performance reasons, instead of using the more general allocator that is compatible with garbage collection. And not just that: a range of “small” integers is also allocated in advance when programs run, so that no time is wasted repeatedly allocating objects for numbers that would likely see common use. The details of this can be found in the Objects/intobject.c file in CPython 1.x and 2.x source distributions. Even CPython 1.0 employs this technique.

At first, I thought I had discovered the remedy for my performance problems, but replicating similar allocation arrangements in my run-time code demonstrated that such a happy outcome was not to be so easily achieved. As I looked around for what other special treatment CPython does, I took a closer look at the bytecode interpreter (found in Python/ceval.c), which is the mechanism that takes the compiled form of Python programs (the bytecode) and evaluates the instructions encoded in this form.

My test programs involved simple loops like this:

i = 0
while i < 100000:
    f(i)
    i += 1

And I had a suspicion that apart from allocating new integers, the operations involved in incrementing them were more costly than they were in CPython. Now, in Python versions from 1.1 onwards, the special operator methods are supported for things like the addition, subtraction, multiplication and division operators. This could conceivably lead to integer addition being supported by the following logic in one of the simpler cases:

# c = a + b
c = a.__add__(b)

But from Python 1.5 onwards, some interesting things appear in the CPython source code:

                case BINARY_ADD:
                        w = POP();
                        v = POP();
                        if (PyInt_Check(v) && PyInt_Check(w)) {
                                /* INLINE: int + int */
                                register long a, b, i;
                                a = PyInt_AS_LONG(v);
                                b = PyInt_AS_LONG(w);
                                i = a + b;

Here, when handling the bytecode for the BINARY_ADD instruction, which is generated when the addition operator (plus, “+”) is used in programs, there is a quick test for two integer operands. If the conditions of this test are fulfilled, the result is computed directly (with some additional tests being performed for overflows not shown above). So, CPython was special-casing integers in two ways: with allocation tricks, and with “fast paths” in the interpreter for cases involving integers.

The Tag Team

My response to this was similarly twofold: find an efficient way of allocating integers, and introduce faster ways of handling integers when they are presented to operators. One option that the CPython implementers actually acknowledge in their source code is that of employing a different representation for integers entirely. CPython may have too much legacy baggage to make this transition, and Python 3 certainly didn’t help the implementers to make the break, it would seem, but I have a bit more flexibility.

The option in question is the so-called tagged pointer approach where instead of having a dedicated object for each integer, with a pointer being used to reference that object, the integers themselves are represented by a value that would normally act as a pointer. But this value is not actually a valid pointer at all since it has its lowest bit set, which violates a restriction that is imposed by some processor architectures, but it can be a self-imposed restriction on other systems as well, merely ruling out the positioning of objects at odd-numbered addresses.

So, we might have the following example representations on a 32-bit architecture:

hex value      31..............................0 (bits)
0x12345678 == 0b00010010001101000101011001111000 => pointer
0x12345679 == 0b00010010001101000101011001111001 => integer

Clearing bit 0 and shifting the other bits one position to the right yields the actual integer value, which in the above case will be 152709948. It is conceivable that in future I might sacrifice another bit for encoding other non-pointer values, especially since various 32-bit architectures require word-aligned addresses, where words are positioned on boundaries that are multiples of four bytes, meaning that the lowest two bits would have to be zero for a pointer to be valid.

Albeit with some additional cost incurred when handling pointers, we can with such an approach distinguish integers from other types rapidly and conveniently, which makes the second part of our strategy more efficient as well. Here, we need to identify and handle integers for the arithmetic operators, but unlike CPython, where this happens to be done in an interpreter loop, we have no such loop. Instead we generate code for such operators that simply invokes some existing functions (written in the Lichen language and compiled to C code, another goal being to write as much of the language system in Lichen itself, not C).

It would be rather wasteful to generate tests for integers in addition to these operator function calls every time such a call is made, but the tests can certainly reside within those functions instead. So, here is what we might do for the addition operator:

def add(a, b):
    if is_int(a) and is_int(b):
        return int_add(a, b)
    return binary_op(a, b, lambda a: a.__add__, lambda b: b.__radd__)

This code leaves me with a bit of explaining to do! Last things first: the final statement is the general case of dispatching to the operands and calling an appropriate operator method, with the binary_op function performing the logic in conjunction with the operands and some lambda functions that defer access to the special methods until they are really needed. It is probably best just to trust me that this does the job!

Before the generic operator method dispatch, however, is the test of the operands to see if they are both integers, and this should be vaguely familiar from the CPython source code. A special function is then called to add them efficiently. Note that we couldn’t use the addition (plus, “+”) operator because this code is meant to be handling that, and it would most likely send us on an infinitely recursive loop that never gets round to performing the addition! (I don’t treat the operator as a special case in this code, either. This code is compiled exactly like any other code written in the language.)

The is_int function is what I call “native”, meaning that it is implemented using low-level operations, in this case ones that test the representation of the argument to see if it has its lowest bit set, returning a true value if so. Meanwhile, int_add is largely equivalent to the addition operation seen in the CPython source code above, just with different details involved.

Progress and Reflections

Such adjustments made quite a difference to the performance of my generated code. They do also make some sense, too. Integers are used a lot in programs, being used not only for general arithmetic, but also for counters, index values for things like lists, tuples, strings and other collections, plus a range of other mundane things whose performance can be overlooked until it proves to be suboptimal. Python has something of a reputation for having slow implementations, but CPython’s trickery here optimises in favour of fast results where they can be obtained, falling back on the slower, general mechanisms when these are required.

But I discovered that this is not the only optimisation trickery CPython does, as another program with interesting representation choices and some wildly varying running times was to demonstrate. More on that in the next article on this topic!

L4Re: Textual Debugging Output on the Framebuffer

Monday, May 21st, 2018

I have actually been in the process of drafting another article about writing device drivers to run within the L4 Runtime Environment (L4Re) on top of the Fiasco.OC microkernel, this being for the Ben NanoNote and Letux 400 notebook computers. That article started to trail behind a lot of the work being done, and there are a few loose ends to be tied up before I can finish it.

Meanwhile, on the way towards some kind of achievement with L4Re, confounded somewhat by the sometimes impenetrable APIs, I managed to eventually get something working that I had thought would have been one of the first things to demonstrate. When initially perusing the range of software in the “pkg” directory within the L4Re distribution, I saw a package called “fbterminal” providing a terminal program that shows itself on the framebuffer (or display).

I imagined being able to launch this on top of the graphical user interface multiplexer, Mag, and then have the “hello” program provide some output to this terminal. I even imagined having the terminal accept input from the keyboard, but we aren’t quite at that point, and that is where my other article comes in. Of course, I initially had no idea how to achieve this, and there needed to be a lot of work put in just to get back to this particular point of entry.

Now, however, the act of launching fbterminal and have it work is fairly straightforward. A few additional packages are required, but the framebuffer works satisfactorily as far as the other components are concerned, and the result will be a blank region of the screen with the terminal showing precisely nothing. Obviously, we want it to show something in order to confirm that it is working. I had to get used to seeing this blank terminal for a while.

The intended companion to fbterminal for testing purposes is the hello program which merely writes output to what might be described as a logging destination. This particular output channel is usually the serial console for the device, which meant that when porting the system to the Ben and the Letux, the hello program was of no use to me. But now, with a framebuffer to show things on, and with a terminal that might be able to accept output from other things, it becomes interesting to see if the hello program can be persuaded to send its output elsewhere.

It was useful to investigate how the output from the hello program actually makes its way to its destination. Since it uses standard C library functions, there has to be a mechanism for those functions to use. As far as I know, these would typically involve various system calls concerning files and streams. A perusal of the sources dredged up an interesting symbol called “__rtld_l4re_env_posix_vfs_ops”. Further investigation led me to the L4Re virtual filesystem (Vfs) functionality and the following interesting files:

  • pkg/l4re-core/l4re_vfs/include/vfs.h
  • pkg/l4re-core/l4re_vfs/include/impl/vfs_impl.h

And these in turn led me to the virtual console (Vcon) functionality:

  • pkg/l4re-core/l4re_vfs/include/impl/vcon_stream.h
  • pkg/l4re-core/l4re_vfs/include/impl/vcon_stream_impl.h

It seems that standard output from the hello program goes via the standard C calls and Vfs functions and is packaged up and sent using the Vcon mechanisms to the logging destination, which is typically provided by the root task, Moe. Given that fbterminal understands the Vcon protocol and acts as a console server, there appeared to be some potential in looking at Vcon mechanisms more closely. It seemed that fbterminal might be able to take the place of Moe.

Indeed, the documentation offers some clues. In the description of the init process, Ned, a mention is made of a program loader configuration parameter called “log_fab” that indicates an object that can create a suitable logging destination. When starting a program, the program loader creates such an object using “log_fab” and presents it to the new program as a capability (or object reference).

However, this is not quite what we want because we don’t need anything else to be created: we already have fbterminal ready for us to use. I suppose something could be conjured up to act as a factory and provide a fbterminal instance, and maybe this is not too arduous in the Lua-based configuration environment used by Ned, but I wanted a more direct solution.

Contemplating this, I went and rediscovered the definitions used by Ned to support its configuration scripting (found in pkg/l4re-core/ned/server/src/ned.lua). Here, the workings of the “log_fab” mechanism can be found and studied. But what I started to favour was a way of just indicating a capability to the hello program and not have the loader create something else. This required a simple edit to one of the functions:

function App_env:log()
  Class.check(self, App_env);
  if self.loader.log_fab == nil or self.loader.log_fab.create == nil then
    error ("Starting an application without valid log factory", 4);
  end
  return self.loader.log_fab:create(Proto.Log, table.unpack(self.log_args));
end

Here, we want to ignore “log_fab” and just have our existing capability used instead. So, I introduced another clause to the if statement:

  if self.log_cap then
    return self.log_cap
  elseif self.loader.log_fab == nil or self.loader.log_fab.create == nil then
    error ("Starting an application without valid log factory", 4);
  end

Now, if we specify “log_cap” when starting a program, it should want to direct logging messages to the referenced object instead. So, with this available to us, it becomes possible to adjust the way the hello program is started. First of all, we define the way fbterminal is set up and started:

local term = l:new_channel();

l:start({
    caps = {
      fb = mag_caps.svc:create(L4.Proto.Goos, "g=320x230+0+0", "barheight=10"),
      term = term:svr(),
    },
  },
  "rom/fbterminal");

Since fbterminal needs to “export” its console abilities using a capability called “term”, this needs to be indicated in the “caps” table. (It doesn’t matter what the local variable for the channel is called.) So, the hello program is defined accordingly:

l:start({
    log_cap = term,
  },
  "rom/hello");

Here, we make use of “log_cap” and allow the output to be directed to the terminal that has already been started. And the result is this:

fbterminal on the Ben NanoNote showing the hello program's output

fbterminal on the Ben NanoNote showing the hello program's output

And at long last, it becomes possible to see what programs are printing out to the log!

Extending L4Re/Fiasco.OC to the Letux 400 Notebook Computer

Wednesday, April 18th, 2018

In my summary of the port of L4Re and Fiasco.OC to the Ben NanoNote, I remarked that progress had been made on supporting other products and hardware peripherals. In fact, such progress occurred more rapidly than I had thought possible, and I have been able to extend the work to support the Letux 400 notebook computer. It is perhaps worth describing the Letux 400 in a bit more detail because it has an interesting place in the history of netbook computers.

Some History

Back in the early 21st century, laptop computers were becoming increasingly popular at the expense of desktop computers, but as laptops began to take the place of desktops in homes and workplaces, this gradually led each successive generation of laptops to sacrifice portability and affordability in favour of larger, faster, higher-resolution screens and general hardware specifications more competitive with the desktop offerings they sought to replace. Laptops were becoming popular but also bigger, heavier and more expensive.

Things took an interesting turn in 2006 with the introduction of the XO-1 from the One Laptop per Child (OLPC) initiative. With rather different goals to those of the mainstream laptop vendors, the focus was to deliver a relatively-inexpensive yet robust portable computer for use by schoolchildren, many of whom might be living in places with limited infrastructure where increasingly power-hungry mainstream laptops would have been unsuitable, even unusable.

One unexpected consequence of the introduction of the XO-1 was the revival in interest in modestly-performing portable computing hardware. People were actually interested in a computer that did the things they needed, rather than having to buy something designed for gamers, software developers, or corporate “power users” (of both the pretend and genuine kinds). Rather than having to haul increasingly big and heavy laptops and all the usual accessories in a big dedicated bag, they liked the idea of slipping a smaller, lighter device into their everyday bag, as had probably been the idea with subnotebooks while they were still a thing.

Thus, the Asus Eee PC came about, regarded as the first widely-available netbook of recent times (acknowledging the earlier Psion netBook, of course), bringing with it the attention of large-volume manufacturers and economies of scale. For “lightweight tasks”, netbooks were enough for many people: a phenomenon that found itself repeating with tablets, particularly as recreational usage of technology became more important to buyers and evolved in certain ways.

Now, one thing that had been a big part of the OLPC initiative’s objectives was a $100 price point. At first, despite fairly radical techniques being used to reduce cost, and despite the involvement of a major original equipment manufacturer in the production of the XO-1, that price point of $100 was out of reach. Even the Eee PC retailed for a few hundred dollars.

This is where a product known as the Skytone Alpha 400 enters the picture. Some vendors, rebranding this product, offered it as possibly the first $100 laptop – or netbook – to be made available for sale. One of the vendors offers it as the Letux 400, and it has been available for as little as €125 during its retail lifespan. Noting that it has rather similar hardware to the Ben NanoNote, but has a more conventional physical profile and four times as much RAM, my brother bought one to investigate a few years ago. That is how I eventually ended up embarking on this experiment.

Extending Recent Work

There are many similarities between the JZ4720 system-on-a-chip (SoC) used in the Ben and the JZ4730 used in the Letux 400. However, it can be said that the JZ4720 is much better understood. The JZ4740 and closely-related devices like the JZ4720 have appeared in a number of different devices, documentation has surfaced for these products, and vendor source code has always been available, typically using or implicitly documenting most of the hardware.

In contrast, limited documentation is known to exist for the JZ4730, and the available vendor source code has not always described every detail of the hardware, even though the essential operations and register details appear to be present. Having looked at the Linux kernel sources that support the JZ4730, together with U-Boot source code, the similarities and differences between the JZ4720 and JZ4730 began to take shape in my mind.

I took an optimistic approach that mostly paid off. The Fiasco.OC kernel needs augmenting with the details of the JZ4730, but these are similar in some ways to the JZ4720 and familiar otherwise. For instance, the JZ4730 has a 32-bit “operating system timer” (OST) that curiously does not appear in the JZ4740 but does appear in more recent products such as the JZ4780. Bearing such things in mind, the timer and interrupt support was easily enough added.

One very different thing about the JZ4730 is that it does not seem to support the “set” and “clear” register locations that are probably common to most modern SoCs. Typically, one might want to update a hardware-related register to change a peripheral’s configuration, and it must have become apparent to hardware designers that such updates mostly want to either set or clear bits. Normally in a program, to achieve such things involves reading a value, performing a logical operation that combines the value with a description of the bits to be set or cleared, and then the value is written back to where it came from. For example:

define bits to set
load value from location (exposing a hardware register, perhaps)
logical-or value with bits
store result in location

Encapsulating this in a single instruction avoids potential issues with different things competing to update the location at the same time, if the hardware permits this, and just offers something that is more efficient and convenient, anyway. Separate locations are provided for “set” and “clear” operations, and the original location is provided to read and to overwrite the hardware register’s value. Sometimes, such registers might only support read-only access, in fact. But the JZ4730 does not support such additional locations, and so we have to do things the hard way when updating registers and doing things like clearing and setting bits.

One odd thing that caught me out was a strange result from the special “exception base” (EBASE) register that does not seem to return zero for the CPU identifier, something that the bootstrap code in L4Re expects. I suppressed this test and made the kernel always return zero when it asks for this identifier. To debug such things, I could not use the screen as I had done with the Ben since the bootloader does not configure it on the Letux. Fortunately, unlike the Ben, the Letux provides a few LEDs to indicate things like keyboard and network status, and these can be configured and activated to communicate simple status information.

Otherwise, the exercise mostly involved me reworking some existing code I had (itself borrowing somewhat from existing driver code) that provides driver support for the Letux hardware peripherals. The clock and power management (CPM) arrangement is familiar but different from the JZ4720; the LCD driver can actually be used as is; the general-purpose input/output (GPIO) arrangement is different from the JZ4720 and, curiously enough once again, perhaps more similar to the JZ4780 in some ways. To support the LCD panel’s backlight, a pulse-width modulation (PWM) driver needed to be added, but this involves very little code.

I also had to deal with the mistakes I made myself when not concentrating hard enough. Lots of testing and re-testing occurred. But in the space of a weekend or so, I had something to show for all the previous effort plus this round’s additional effort.

The Letux 400 and Ben NanoNote running the "spectrum" example

The Letux 400 and Ben NanoNote running the "spectrum" example

Here, you can see what kind of devices we are dealing with! The Letux 400 is less than half the width of a normal-size keyboard (with numeric keypad), and the Ben NanoNote is less than half the width of the Letux 400. Both of them were inexpensive computing devices when they were introduced, and although they may not be capable of running “modern” desktop environments or Web browsers, they offer computing facilities that were, once upon a time, “workstation class” in various respects. And they did, after all, run GNU/Linux when they were introduced.

And that is why it is attractive to consider running other “proper” operating system technologies on them now. Maybe we can revisit the compromises that led to the subnotebook and the netbook, perhaps even the tablet, where devices that are not the most powerful still have a place in fulfilling our computing needs.

Porting L4Re and Fiasco.OC to the Ben NanoNote (Summary)

Monday, April 16th, 2018

As promised, here is a summary of the work involved in porting L4Re and Fiasco.OC to the Ben NanoNote. First of all, a list of all the articles with some brief descriptions of what they cover:

  1. Familiarisation with L4Re and Fiasco.OC on the MIPS Creator CI20, adding some missing pieces
  2. Setting up and introducing a suitable compiler for the Ben, also describing the hardware in the kernel
  3. Handling instructions unsupported by the JZ4720 (the Ben’s SoC) in the kernel
  4. Describing the Ben and dealing with unsupported instructions in the L4Re portion of the system
  5. Configuring the memory layout and attempting to bootstrap the kernel
  6. Making the kernel support the MIPS architecture revision used by the JZ4720, also fixing the interrupt system description
  7. Investigating context/thread switching and fixing an inadvertently-introduced fault in the unsupported instruction handling
  8. Configuring user space examples and getting a simple framebuffer demonstration working
  9. Getting the framebuffer driver, GUI multiplexer, and “spectrum” example working

As I may have noted a few times in the articles, this work just builds on previous work done by a number of people over the years, obviously starting with the whole L4 microkernel effort, the development of Fiasco.OC, L4Re and their predecessors, and the work done to port these components to the MIPS architecture. On the l4-hackers mailing list, Adam Lackorzynski was particularly helpful when I ran into obstacles, and Sarah Hoffman provided some insight into problems with the CI20 just as it was needed.

You really don’t have to read all the articles or even any of them! The point of this article is to summarise the work and perhaps make similar porting efforts a bit more approachable for others in the same position: anyone having a vague level of familiarity with L4Re/Fiasco.OC or similar systems, also having a device that might be supported, and being somewhat familiar with writing code that drives hardware.

Practical Details

It might be useful to give certain practical details here, if only to indicate the nature of the development and testing routine employed in this endeavour. First of all, I have been using a chroot containing the Debian “unstable” distribution for the i386 architecture. Although this was essential for a time when building the software for the CI20 and trying to take advantage of Debian’s cross-compiler packages, any fairly recent version of Debian would probably be fine because I ended up using a Buildroot toolchain to be able to target the Ben. You could probably choose any Free Software distribution and reproduce what I have done.

The distribution of patches contains instructions regarding preparation and the building of the software. It isn’t too useful to repeat that information here, but the following things need doing:

  1. Installing packages for build tools
  2. Obtaining or building a cross-compiler
  3. Checking out the source code for L4Re and Fiasco.OC from its repository
  4. Applying the patches
  5. Configuring and building the kernel
  6. Configuring and building the runtime environment
  7. Copying the payload to a memory card
  8. Booting the device

Some scripts have been included in the patch distribution, one of which should do the tricky job of applying patches to the repository checkout according to the chosen device configuration. Because a centralised version control system (Subversion) has been used to publish the L4Re and Fiasco.OC sources, I had to find a way of working with my own local changes. Consequently, I wrote a few scripts to maintain bundles of changes associated with certain files, and I then managed these bundles in a different version control system. Yes, this effectively meant versioning the changes themselves!

Things would be simpler with a decentralised version control system because local commits would be convenient, and upstream updates would be incorporated into the repository separately and merged with local changes in a controlled fashion. One of the corporate participants has made a Git repository for Fiasco.OC available, which may alleviate some issues, although I am increasingly finding larger Git repositories to be unusable on my modest hardware, and I also tend to disagree with everybody deciding to put everything on GitHub.

Fixing and Building

Needing to repeatedly build, test, go back and fix, I found myself issuing the same command sequences a lot. When working with the kernel, I tended to enter the kernel build directory, which I called “mybuild”, edit the kernel sources, and then re-run the make command:

cd mybuild
vi ../src/kern/mips/exception.S # edit a familiar file with vim
make

Having built a new kernel, I would then need to build a new payload to deploy, which meant ascending the directory hierarchy and building an image in the L4Re section of the sources:

cd ../../../l4
make O=mybuild uimage E=mips-qi_lb60-spectrum-example

Given a previously-built “user space”, this would bundle the new kernel together with code that might be able to test it. Of particular importance is the bootstrap code which launches the kernel: without that, there is no point in even trying to test the kernel!

I found that re-building L4Re components seemed to require a general build to be performed:

make O=mybuild

If that proved successful, an image would then be built and tested. In general, focusing on either the kernel or some user space component meant that there was rarely a need to build a new kernel and then build much of the user space.

Work Summary

The patches accumulated during this process cover a range of different areas of functionality. Looking at them organised by functional area, instead of in the more haphazard fashion presented throughout the series of articles, allows for a more convenient review of the work actually needed to get the job done.

Build System Adjustments and Various Fixes

As early as my experiments with the CI20, I experienced the need to fix some things that didn’t work on my system, either due to some Debian peculiarities or differences in compiler behaviour:

  • l4util-mips-thread.diff (fixes a symbol visibility issue with certain compiler versions)
  • mips-gcc-cpload.diff (fixes the initialisation of certain L4Re components)
  • no-at.diff (allows the build to work on Debian for the i386 architecture)

Other adjustments are required to let the build system do its job, setting paths for other components and for the toolchains:

  • conf-makeconf-boot.diff (lets the L4Re build system find things like the kernel, modules and hardware descriptions)
  • qi_lb60-gcc-buildroot-fiasco.diff (changes the compiler and architecture settings)
  • qi_lb60-gcc-buildroot-l4re.diff (changes the compiler, architecture and soft-float settings)

The build system also needs directing towards new drivers, and various files need to be excluded or changed:

  • ingenic-mips-drivers-top.diff (enables drivers added by this work)
  • qi_lb60-fbdrv.diff (changes the splash image for the framebuffer driver)
  • qi_lb60-l4re.diff (includes a temporary fix disabling a Mag plugin)

The first of these is important to remember when adding drivers since it changes the l4/pkg/drivers/Control file and defines the driver “packages” provided by each of the driver libraries. These package definitions help the build system work out which other parts of the system need to be consulted when building a particular driver.

Supporting MIPS32r1 Devices

Throughout the kernel and L4Re, changes need making to support the earlier architecture version provided by the JZ4720. The bulk of the following patch files deals with such changes:

  • qi_lb60-fiasco.diff
  • qi_lb60-l4re.diff

Maybe I will try and break out the architecture version changes into specific patch files, provided this does not result in the original source files ending up being patched by multiple patch files. My aim has been to avoid patches having to be applied in a particular order, and that starts to happen when multiple patches modify the same file.

Describing the Ben NanoNote

The kernel needs some knowledge of the Ben with regard to timers and interrupts. Meanwhile, L4Re needs to set the Ben up correctly when booting. Both sections of the system need an awareness of how memory is going to be used, and extra configuration options need to be provided to merely allow the selection of the Ben for building. Currently, the following patch files include things concerned with such matters:

  • qi_lb60-fiasco.diff (contains timer, interrupt and memory details, plus configuration system changes)
  • qi_lb60-l4re.diff (contains bootstrap and memory details, plus configuration system changes)
  • qi_lb60-platform.diff (platform definitions for the Ben in L4Re)

One significant objective here is to be able to offer the Ben as a “first class” configuration option and have the build system do the right thing, setting up all the components and code regions that the Ben needs to function.

Introducing Driver Code

To be able to activate the framebuffer on the Ben, driver code needs introducing for a few peripherals provided by the JZ4720: CPM (clock/power management), GPIO (general-purpose input/output) and LCD (liquid crystal display, or similar). A few different patch files cover these areas:

  • ingenic-mips-cpm.diff (CPM support for JZ4720 and JZ4780)
  • ingenic-mips-gpio.diff (GPIO support for JZ4720 and JZ4780)
  • qi_lb60-lcd.diff (LCD support for JZ4720)

The JZ4780 support is intended for the CI20 and will not be used with the Ben. However, it is convenient to incorporate support for these different platforms in the same patch file in each instance.

Meanwhile, the LCD driver should work with a range of JZ4700-series devices (labelled as JZ4740 in the patches). While focusing on getting things working, the only panel supported by this work was that provided by the Ben. Since then, support has been made slightly more general, just as was done with the Linux kernel support for products employing this particular SoC family and, subsequently, for panels in general. (Linux has moved towards a “device tree” approach for specifying things like panels and displays, although this is arguably just restating things that were once C-coded structures in another, rather peculiar, format.)

To support these drivers, some useful code has been copied from elsewhere in L4Re:

  • drivers_frst-register-block.diff

This provides a convenient abstraction for registers that is exposed via an include directive:

#include <l4/drivers/hw_mmio_register_block.h>

Indeed, it is worth focusing on the LCD driver briefly. The code has its origins in existing driver code written for the Ben that I adapted to get working as part of a simple “bare metal” payload. I have maintained a separation between the more intricate hardware configuration and aspects that deal with the surrounding software. As part of L4Re, the latter involves obtaining access to memory using the appropriate API calls and invoking other drivers.

In L4Re, there is a kind of framework for LCD drivers, and the existing drivers seem to be written in C rather than C++. Reminiscent of Linux, there is a mechanism for exporting driver operations using a well-defined data structure, and this permits the “probing” of drivers to see if they can be enabled and if meaningful information can be obtained about things like the supported resolution, colour depth and pixel format. To make the existing code compatible with L4Re, a fair amount of the work involves translating the information already known (and used) in the hardware configuration activity to a form that other L4Re components can understand and use.

Originally, for the GPIO driver, I had intended it to operate as part of the Io server framework. Components using GPIO functionality would then employ the appropriate API to configure and interact with the exposed input and output pins. Unfortunately, this proved rather cumbersome, and so I decided to take a simpler approach of providing the driver as an abstraction that a program would use together with explicitly-requested memory. I did decide to preserve the general form of the API for this relocated abstraction, however, meaning that various classes and methods are provided that behave in the same way as those “left behind” in the Io server framework.

Thus, a program would itself request access to the GPIO-related memory, and it would then use GPIO-related abstractions to “do the right thing” with this memory. One would envisage that such a program would not be a “normal”, unprivileged program as such, but instead be more like a server or driver in its own right. Indeed, the LCD driver employs these abstractions to use SPI-based signalling with the LCD panel, and it uses the same techniques to configure the LCD clock frequencies using the CPM-related memory and CPM-related abstractions.

Although the GPIO driver follows existing conventions, the CPM driver has no obvious precedent in L4Re, but I adopted some of the conventions employed in the GPIO driver, adding more specialised methods and functions to expose functionality specific to the SoC. Since I had previously written a CPM driver for the JZ4780, the main objective was to make the JZ4720/JZ4740 driver resemble the existing driver as much as possible.

Introducing and Configuring Example Programs

Throughout the series of articles, I was working towards running one specific example program, making some new ones on the way for testing purposes. These additional programs are provided together with their configuration, accompanied by suitable configurations for existing examples and components, by the following patch files:

  • ingenic-mips-modules.diff (example program definitions)
  • qi_lb60-examples.diff (example program implementations and configuration details)

The additional programs (defined in l4/conf/modules.list) are as follows:

  • mips-qi_lb60-lcd-example (implemented by qi_lb60_lcd, configured by the mips-qi_lb60-lcd files)
  • mips-qi_lb60-lcd-driver-example (implemented by qi_lb60_lcd_driver, configured by the mips-qi_lb60-lcd-driver files)

Configurations are provided for the existing examples and components as follows:

  • mips-qi_lb60-fbdrv-example (configured by the mips-qi_lb60-fbdrv files)
  • mips-qi_lb60-spectrum-example (configured by the mips-qi_lb60-spectrum files)

All configurations reside in the l4/conf/examples directory. All new examples reside in the l4/pkg/examples/misc directory.

Further Work

In the final article in the series, I mentioned a few ideas for further work based on that described above:

Since completing the above work, I have already made some progress on the first two of these topics. More on that in an upcoming post!

Some Updates

Since writing this article, a few things are worth adding. First of all, the patches produced in the initial effort described by this series of articles are now available in an “initial archive” via the Web page documenting the effort. In contrast, a “current archive” provides patches for the current state of the work, with the aim being to focus these patches only on essential support for these devices within L4Re and Fiasco.OC, and with future development being done elsewhere.

Another couple of observations have been made since completing this initial effort that qualify or correct some information provided here. On the topic of the memory map needed to support the Ben NanoNote and its bootloader, it turned out that a fairly conventional arrangement was feasible after all and that only the “exception base” might be a problem. Here is the more conventional arrangement:

0x80600000 payload load address when copied by bootm
0x802d0000 bootstrap start address
0x80010000 kernel load address
0x80001000 exception handlers

The kernel load address and bootstrap start address are now the same as for other MIPS platforms. The exception handlers are positioned 4 kilobytes above the normal exception base just in case overwriting them might upset the bootloader.

Previously, I had experienced problems with Mag and its mag-input-libinput plugin, causing me to disable that plugin to get things working. This can be avoided by just providing a capability called “vbus” when starting Mag.