Paul Boddie's Free Software-related blog

Archive for the ‘English’ Category

Investigating CPython’s Optimisation Trickery for Lichen

Saturday, July 7th, 2018

For those of us old enough to remember how the Python programming language was twenty or so years ago, nostalgic for a simpler and kinder time, looking to escape the hard reality of today’s feature enhancement arguments, controversies, general bitterness and recriminations, it can be informative to consider what was appealing about Python all those years ago. Some of us even take this slightly further and attempt to formulate our own take on the language, casting aside things that do not appeal or that seem superfluous, needlessly confusing, or redundant.

My own Python variant, called Lichen, strips away quite a few things from today’s Python but probably wouldn’t seem so different to twentieth century Python. Since my primary objective with Lichen is to facilitate static analysis so that observations can be made about program behaviour before running the program, certain needlessly-dynamic features have been eliminated. Usually, when such statements about feature elimination are made, people seize upon them to claim that the resulting language is statically typed, but this is deliberately not the case here. Indeed, “duck typing” is still as viable as ever in Lichen.

Ancient Dynamism

An example of needless dynamism in Python can arguably be found with the __getattr__ and __setattr__ methods, introduced as far back as Python 1.1. They allow accesses to attributes via instances to be intercepted and values supposedly provided by these attributes to be computed on demand. In effect, these methods support virtual or dynamic attributes that are not really present on an object. Here’s an extract from one of the Python 1.2 demonstration programs (Demo/pdist/client.py):

        def __getattr__(self, name):
                if name in self._methods:
                        method = _stub(self, name)
                        setattr(self, name, method) # XXX circular reference
                        return method
                raise AttributeError, name

In this code, if an instance of the Client class (from which this method is taken) is used to access an attribute called hello, then this method will see if the string “hello” is found in the instance’s _methods attribute, and if so it produces a special object that is then returned as the value for the hello attribute. Otherwise, it raises an exception to indicate that this virtual attribute is not recognised. (Here, the setattr call stores the special object as a genuine attribute in order to save this method from being called again for the same attribute.)

Admittedly, this is quite neat, and it quickly becomes tempting to use such facilities everywhere – this is very much the story of Python and its development – but such things make reasoning about programs more difficult. We cannot know what attributes the instances of this Client class may have without running the program. Indeed, to find out in this case, running the program is literally unavoidable since the _methods attribute is actually populated using the result of a message received over the network!

But even in simpler cases, it can readily be intuitively understood that finding out the supported attributes of instances whose class offers such a method might involve a complicated exercise looking at practically all the code in a program. Despite all the hard work, this exercise will nevertheless produce unreliable or imprecise results. It says something about the fragility of such facilities that properties were later added to Python 2.2 to offer a more declarative alternative.

(It also says something about Python 3 that the cornucopia of special mechanisms for dynamically exposing attributes are apparently still present, despite much having been said about Python 3 remedying such Python 1 and 2 design artefacts.)

Hidden Motives

With static analysis, we might expect to be able to deduce which attributes are provided by class instances without running a program, this potentially allowing us to determine the structure of program objects and to detect errors around their use. But another objective with Lichen is to see how constraints on the language may be used to optimise the performance of programs. I will not deny that performance has always been an interest of mine with respect to Python and its implementations, and I imagine that many compiler and virtual machine implementers have been motivated by such concerns throughout the years.

The deductions made during static analysis can potentially allow us to generate executable programs that perform the same work more efficiently. For example, if it is known that a collection of method calls on an object identify that object as being of a certain type, we can then employ more efficient ways of calling those methods. So, for the following code…

        while number:
            digits.append(_hexdigits[number % base])
            number = number / base

…if we can assert that digits is a list, then considering that we might normally generate code for the append method call as something like this…

__load_via_class(digits, append)(...)

…where the __load_via_class operation has to go and find the append method via the class of digits (or, in some cases, even look for the append attribute on the object first), we might instead be able to generate code like this…

__fn_list_append(digits, ...)

…where __fn_list_append is a genuine C function and the digits instance is passed directly to it, together with the elided arguments. When we can get this kind of thing to happen, it can obviously be very satisfying. Various Python implementations and tools also attempt to make method calls efficient in their own ways, some possibly relying on run-time caches that short-circuit the exercise of finding the method.

Magic Numbers

It can be informative to compare the performance of code generated by the Lichen toolchain and the performance of the same program running in the CPython virtual machine, Python and Lichen being broadly compatible (but not identical). As I noted in my summary of 2017, the performance of generated programs was rather disheartening to see at first. I needed to employ profiling to discover where the time was being spent in my generated code that seemed not to be a comparable burden on CPython.

The practicalities of profiling are definitely beyond the scope of this article, but what I did notice was just how much time was being spent allocating space in memory for integers used by programs. I recalled that Python does some special things with integers itself, and so I set about looking for the details of its memory allocation strategies.

It turns out that integers are allocated in a simplified fashion for performance reasons, instead of using the more general allocator that is compatible with garbage collection. And not just that: a range of “small” integers is also allocated in advance when programs run, so that no time is wasted repeatedly allocating objects for numbers that would likely see common use. The details of this can be found in the Objects/intobject.c file in CPython 1.x and 2.x source distributions. Even CPython 1.0 employs this technique.

At first, I thought I had discovered the remedy for my performance problems, but replicating similar allocation arrangements in my run-time code demonstrated that such a happy outcome was not to be so easily achieved. As I looked around for what other special treatment CPython does, I took a closer look at the bytecode interpreter (found in Python/ceval.c), which is the mechanism that takes the compiled form of Python programs (the bytecode) and evaluates the instructions encoded in this form.

My test programs involved simple loops like this:

i = 0
while i < 100000:
    f(i)
    i += 1

And I had a suspicion that apart from allocating new integers, the operations involved in incrementing them were more costly than they were in CPython. Now, in Python versions from 1.1 onwards, the special operator methods are supported for things like the addition, subtraction, multiplication and division operators. This could conceivably lead to integer addition being supported by the following logic in one of the simpler cases:

# c = a + b
c = a.__add__(b)

But from Python 1.5 onwards, some interesting things appear in the CPython source code:

                case BINARY_ADD:
                        w = POP();
                        v = POP();
                        if (PyInt_Check(v) && PyInt_Check(w)) {
                                /* INLINE: int + int */
                                register long a, b, i;
                                a = PyInt_AS_LONG(v);
                                b = PyInt_AS_LONG(w);
                                i = a + b;

Here, when handling the bytecode for the BINARY_ADD instruction, which is generated when the addition operator (plus, “+”) is used in programs, there is a quick test for two integer operands. If the conditions of this test are fulfilled, the result is computed directly (with some additional tests being performed for overflows not shown above). So, CPython was special-casing integers in two ways: with allocation tricks, and with “fast paths” in the interpreter for cases involving integers.

The Tag Team

My response to this was similarly twofold: find an efficient way of allocating integers, and introduce faster ways of handling integers when they are presented to operators. One option that the CPython implementers actually acknowledge in their source code is that of employing a different representation for integers entirely. CPython may have too much legacy baggage to make this transition, and Python 3 certainly didn’t help the implementers to make the break, it would seem, but I have a bit more flexibility.

The option in question is the so-called tagged pointer approach where instead of having a dedicated object for each integer, with a pointer being used to reference that object, the integers themselves are represented by a value that would normally act as a pointer. But this value is not actually a valid pointer at all since it has its lowest bit set, which violates a restriction that is imposed by some processor architectures, but it can be a self-imposed restriction on other systems as well, merely ruling out the positioning of objects at odd-numbered addresses.

So, we might have the following example representations on a 32-bit architecture:

hex value      31..............................0 (bits)
0x12345678 == 0b00010010001101000101011001111000 => pointer
0x12345679 == 0b00010010001101000101011001111001 => integer

Clearing bit 0 and shifting the other bits one position to the right yields the actual integer value, which in the above case will be 152709948. It is conceivable that in future I might sacrifice another bit for encoding other non-pointer values, especially since various 32-bit architectures require word-aligned addresses, where words are positioned on boundaries that are multiples of four bytes, meaning that the lowest two bits would have to be zero for a pointer to be valid.

Albeit with some additional cost incurred when handling pointers, we can with such an approach distinguish integers from other types rapidly and conveniently, which makes the second part of our strategy more efficient as well. Here, we need to identify and handle integers for the arithmetic operators, but unlike CPython, where this happens to be done in an interpreter loop, we have no such loop. Instead we generate code for such operators that simply invokes some existing functions (written in the Lichen language and compiled to C code, another goal being to write as much of the language system in Lichen itself, not C).

It would be rather wasteful to generate tests for integers in addition to these operator function calls every time such a call is made, but the tests can certainly reside within those functions instead. So, here is what we might do for the addition operator:

def add(a, b):
    if is_int(a) and is_int(b):
        return int_add(a, b)
    return binary_op(a, b, lambda a: a.__add__, lambda b: b.__radd__)

This code leaves me with a bit of explaining to do! Last things first: the final statement is the general case of dispatching to the operands and calling an appropriate operator method, with the binary_op function performing the logic in conjunction with the operands and some lambda functions that defer access to the special methods until they are really needed. It is probably best just to trust me that this does the job!

Before the generic operator method dispatch, however, is the test of the operands to see if they are both integers, and this should be vaguely familiar from the CPython source code. A special function is then called to add them efficiently. Note that we couldn’t use the addition (plus, “+”) operator because this code is meant to be handling that, and it would most likely send us on an infinitely recursive loop that never gets round to performing the addition! (I don’t treat the operator as a special case in this code, either. This code is compiled exactly like any other code written in the language.)

The is_int function is what I call “native”, meaning that it is implemented using low-level operations, in this case ones that test the representation of the argument to see if it has its lowest bit set, returning a true value if so. Meanwhile, int_add is largely equivalent to the addition operation seen in the CPython source code above, just with different details involved.

Progress and Reflections

Such adjustments made quite a difference to the performance of my generated code. They do also make some sense, too. Integers are used a lot in programs, being used not only for general arithmetic, but also for counters, index values for things like lists, tuples, strings and other collections, plus a range of other mundane things whose performance can be overlooked until it proves to be suboptimal. Python has something of a reputation for having slow implementations, but CPython’s trickery here optimises in favour of fast results where they can be obtained, falling back on the slower, general mechanisms when these are required.

But I discovered that this is not the only optimisation trickery CPython does, as another program with interesting representation choices and some wildly varying running times was to demonstrate. More on that in the next article on this topic!

Posted in English, Free Software, Lichen, Python | Comments Off on Investigating CPython’s Optimisation Trickery for Lichen

L4Re: Textual Debugging Output on the Framebuffer

Monday, May 21st, 2018

I have actually been in the process of drafting another article about writing device drivers to run within the L4 Runtime Environment (L4Re) on top of the Fiasco.OC microkernel, this being for the Ben NanoNote and Letux 400 notebook computers. That article started to trail behind a lot of the work being done, and there are a few loose ends to be tied up before I can finish it.

Meanwhile, on the way towards some kind of achievement with L4Re, confounded somewhat by the sometimes impenetrable APIs, I managed to eventually get something working that I had thought would have been one of the first things to demonstrate. When initially perusing the range of software in the “pkg” directory within the L4Re distribution, I saw a package called “fbterminal” providing a terminal program that shows itself on the framebuffer (or display).

I imagined being able to launch this on top of the graphical user interface multiplexer, Mag, and then have the “hello” program provide some output to this terminal. I even imagined having the terminal accept input from the keyboard, but we aren’t quite at that point, and that is where my other article comes in. Of course, I initially had no idea how to achieve this, and there needed to be a lot of work put in just to get back to this particular point of entry.

Now, however, the act of launching fbterminal and have it work is fairly straightforward. A few additional packages are required, but the framebuffer works satisfactorily as far as the other components are concerned, and the result will be a blank region of the screen with the terminal showing precisely nothing. Obviously, we want it to show something in order to confirm that it is working. I had to get used to seeing this blank terminal for a while.

The intended companion to fbterminal for testing purposes is the hello program which merely writes output to what might be described as a logging destination. This particular output channel is usually the serial console for the device, which meant that when porting the system to the Ben and the Letux, the hello program was of no use to me. But now, with a framebuffer to show things on, and with a terminal that might be able to accept output from other things, it becomes interesting to see if the hello program can be persuaded to send its output elsewhere.

It was useful to investigate how the output from the hello program actually makes its way to its destination. Since it uses standard C library functions, there has to be a mechanism for those functions to use. As far as I know, these would typically involve various system calls concerning files and streams. A perusal of the sources dredged up an interesting symbol called “__rtld_l4re_env_posix_vfs_ops”. Further investigation led me to the L4Re virtual filesystem (Vfs) functionality and the following interesting files:

pkg/l4re-core/l4re_vfs/include/vfs.h
pkg/l4re-core/l4re_vfs/include/impl/vfs_impl.h

And these in turn led me to the virtual console (Vcon) functionality:

pkg/l4re-core/l4re_vfs/include/impl/vcon_stream.h
pkg/l4re-core/l4re_vfs/include/impl/vcon_stream_impl.h

It seems that standard output from the hello program goes via the standard C calls and Vfs functions and is packaged up and sent using the Vcon mechanisms to the logging destination, which is typically provided by the root task, Moe. Given that fbterminal understands the Vcon protocol and acts as a console server, there appeared to be some potential in looking at Vcon mechanisms more closely. It seemed that fbterminal might be able to take the place of Moe.

Indeed, the documentation offers some clues. In the description of the init process, Ned, a mention is made of a program loader configuration parameter called “log_fab” that indicates an object that can create a suitable logging destination. When starting a program, the program loader creates such an object using “log_fab” and presents it to the new program as a capability (or object reference).

However, this is not quite what we want because we don’t need anything else to be created: we already have fbterminal ready for us to use. I suppose something could be conjured up to act as a factory and provide a fbterminal instance, and maybe this is not too arduous in the Lua-based configuration environment used by Ned, but I wanted a more direct solution.

Contemplating this, I went and rediscovered the definitions used by Ned to support its configuration scripting (found in pkg/l4re-core/ned/server/src/ned.lua). Here, the workings of the “log_fab” mechanism can be found and studied. But what I started to favour was a way of just indicating a capability to the hello program and not have the loader create something else. This required a simple edit to one of the functions:

function App_env:log()
  Class.check(self, App_env);
  if self.loader.log_fab == nil or self.loader.log_fab.create == nil then
    error ("Starting an application without valid log factory", 4);
  end
  return self.loader.log_fab:create(Proto.Log, table.unpack(self.log_args));
end

Here, we want to ignore “log_fab” and just have our existing capability used instead. So, I introduced another clause to the if statement:

  if self.log_cap then
    return self.log_cap
  elseif self.loader.log_fab == nil or self.loader.log_fab.create == nil then
    error ("Starting an application without valid log factory", 4);
  end

Now, if we specify “log_cap” when starting a program, it should want to direct logging messages to the referenced object instead. So, with this available to us, it becomes possible to adjust the way the hello program is started. First of all, we define the way fbterminal is set up and started:

local term = l:new_channel();

l:start({
    caps = {
      fb = mag_caps.svc:create(L4.Proto.Goos, "g=320x230+0+0", "barheight=10"),
      term = term:svr(),
    },
  },
  "rom/fbterminal");

Since fbterminal needs to “export” its console abilities using a capability called “term”, this needs to be indicated in the “caps” table. (It doesn’t matter what the local variable for the channel is called.) So, the hello program is defined accordingly:

l:start({
    log_cap = term,
  },
  "rom/hello");

Here, we make use of “log_cap” and allow the output to be directed to the terminal that has already been started. And the result is this:

fbterminal on the Ben NanoNote showing the hello program's output

And at long last, it becomes possible to see what programs are printing out to the log!

Posted in Ben NanoNote, English, Fiasco.OC, Free Software, L4, Letux 400, MIPS, NanoNote, operating systems | Comments Off on L4Re: Textual Debugging Output on the Framebuffer

Extending L4Re/Fiasco.OC to the Letux 400 Notebook Computer

Wednesday, April 18th, 2018

In my summary of the port of L4Re and Fiasco.OC to the Ben NanoNote, I remarked that progress had been made on supporting other products and hardware peripherals. In fact, such progress occurred more rapidly than I had thought possible, and I have been able to extend the work to support the Letux 400 notebook computer. It is perhaps worth describing the Letux 400 in a bit more detail because it has an interesting place in the history of netbook computers.

Some History

Back in the early 21st century, laptop computers were becoming increasingly popular at the expense of desktop computers, but as laptops began to take the place of desktops in homes and workplaces, this gradually led each successive generation of laptops to sacrifice portability and affordability in favour of larger, faster, higher-resolution screens and general hardware specifications more competitive with the desktop offerings they sought to replace. Laptops were becoming popular but also bigger, heavier and more expensive.

Things took an interesting turn in 2006 with the introduction of the XO-1 from the One Laptop per Child (OLPC) initiative. With rather different goals to those of the mainstream laptop vendors, the focus was to deliver a relatively-inexpensive yet robust portable computer for use by schoolchildren, many of whom might be living in places with limited infrastructure where increasingly power-hungry mainstream laptops would have been unsuitable, even unusable.

One unexpected consequence of the introduction of the XO-1 was the revival in interest in modestly-performing portable computing hardware. People were actually interested in a computer that did the things they needed, rather than having to buy something designed for gamers, software developers, or corporate “power users” (of both the pretend and genuine kinds). Rather than having to haul increasingly big and heavy laptops and all the usual accessories in a big dedicated bag, they liked the idea of slipping a smaller, lighter device into their everyday bag, as had probably been the idea with subnotebooks while they were still a thing.

Thus, the Asus Eee PC came about, regarded as the first widely-available netbook of recent times (acknowledging the earlier Psion netBook, of course), bringing with it the attention of large-volume manufacturers and economies of scale. For “lightweight tasks”, netbooks were enough for many people: a phenomenon that found itself repeating with tablets, particularly as recreational usage of technology became more important to buyers and evolved in certain ways.

Now, one thing that had been a big part of the OLPC initiative’s objectives was a $100 price point. At first, despite fairly radical techniques being used to reduce cost, and despite the involvement of a major original equipment manufacturer in the production of the XO-1, that price point of $100 was out of reach. Even the Eee PC retailed for a few hundred dollars.

This is where a product known as the Skytone Alpha 400 enters the picture. Some vendors, rebranding this product, offered it as possibly the first $100 laptop – or netbook – to be made available for sale. One of the vendors offers it as the Letux 400, and it has been available for as little as €125 during its retail lifespan. Noting that it has rather similar hardware to the Ben NanoNote, but has a more conventional physical profile and four times as much RAM, my brother bought one to investigate a few years ago. That is how I eventually ended up embarking on this experiment.

Extending Recent Work

There are many similarities between the JZ4720 system-on-a-chip (SoC) used in the Ben and the JZ4730 used in the Letux 400. However, it can be said that the JZ4720 is much better understood. The JZ4740 and closely-related devices like the JZ4720 have appeared in a number of different devices, documentation has surfaced for these products, and vendor source code has always been available, typically using or implicitly documenting most of the hardware.

In contrast, limited documentation is known to exist for the JZ4730, and the available vendor source code has not always described every detail of the hardware, even though the essential operations and register details appear to be present. Having looked at the Linux kernel sources that support the JZ4730, together with U-Boot source code, the similarities and differences between the JZ4720 and JZ4730 began to take shape in my mind.

I took an optimistic approach that mostly paid off. The Fiasco.OC kernel needs augmenting with the details of the JZ4730, but these are similar in some ways to the JZ4720 and familiar otherwise. For instance, the JZ4730 has a 32-bit “operating system timer” (OST) that curiously does not appear in the JZ4740 but does appear in more recent products such as the JZ4780. Bearing such things in mind, the timer and interrupt support was easily enough added.

One very different thing about the JZ4730 is that it does not seem to support the “set” and “clear” register locations that are probably common to most modern SoCs. Typically, one might want to update a hardware-related register to change a peripheral’s configuration, and it must have become apparent to hardware designers that such updates mostly want to either set or clear bits. Normally in a program, to achieve such things involves reading a value, performing a logical operation that combines the value with a description of the bits to be set or cleared, and then the value is written back to where it came from. For example:

define bits to set
load value from location (exposing a hardware register, perhaps)
logical-or value with bits
store result in location

Encapsulating this in a single instruction avoids potential issues with different things competing to update the location at the same time, if the hardware permits this, and just offers something that is more efficient and convenient, anyway. Separate locations are provided for “set” and “clear” operations, and the original location is provided to read and to overwrite the hardware register’s value. Sometimes, such registers might only support read-only access, in fact. But the JZ4730 does not support such additional locations, and so we have to do things the hard way when updating registers and doing things like clearing and setting bits.

One odd thing that caught me out was a strange result from the special “exception base” (EBASE) register that does not seem to return zero for the CPU identifier, something that the bootstrap code in L4Re expects. I suppressed this test and made the kernel always return zero when it asks for this identifier. To debug such things, I could not use the screen as I had done with the Ben since the bootloader does not configure it on the Letux. Fortunately, unlike the Ben, the Letux provides a few LEDs to indicate things like keyboard and network status, and these can be configured and activated to communicate simple status information.

Otherwise, the exercise mostly involved me reworking some existing code I had (itself borrowing somewhat from existing driver code) that provides driver support for the Letux hardware peripherals. The clock and power management (CPM) arrangement is familiar but different from the JZ4720; the LCD driver can actually be used as is; the general-purpose input/output (GPIO) arrangement is different from the JZ4720 and, curiously enough once again, perhaps more similar to the JZ4780 in some ways. To support the LCD panel’s backlight, a pulse-width modulation (PWM) driver needed to be added, but this involves very little code.

I also had to deal with the mistakes I made myself when not concentrating hard enough. Lots of testing and re-testing occurred. But in the space of a weekend or so, I had something to show for all the previous effort plus this round’s additional effort.

The Letux 400 and Ben NanoNote running the "spectrum" example

Here, you can see what kind of devices we are dealing with! The Letux 400 is less than half the width of a normal-size keyboard (with numeric keypad), and the Ben NanoNote is less than half the width of the Letux 400. Both of them were inexpensive computing devices when they were introduced, and although they may not be capable of running “modern” desktop environments or Web browsers, they offer computing facilities that were, once upon a time, “workstation class” in various respects. And they did, after all, run GNU/Linux when they were introduced.

And that is why it is attractive to consider running other “proper” operating system technologies on them now. Maybe we can revisit the compromises that led to the subnotebook and the netbook, perhaps even the tablet, where devices that are not the most powerful still have a place in fulfilling our computing needs.

Posted in Ben NanoNote, English, Fiasco.OC, Free Software, L4, Letux 400, MIPS, NanoNote, operating systems | Comments Off on Extending L4Re/Fiasco.OC to the Letux 400 Notebook Computer

Porting L4Re and Fiasco.OC to the Ben NanoNote (Summary)

Monday, April 16th, 2018

As promised, here is a summary of the work involved in porting L4Re and Fiasco.OC to the Ben NanoNote. First of all, a list of all the articles with some brief descriptions of what they cover:

As I may have noted a few times in the articles, this work just builds on previous work done by a number of people over the years, obviously starting with the whole L4 microkernel effort, the development of Fiasco.OC, L4Re and their predecessors, and the work done to port these components to the MIPS architecture. On the l4-hackers mailing list, Adam Lackorzynski was particularly helpful when I ran into obstacles, and Sarah Hoffman provided some insight into problems with the CI20 just as it was needed.

You really don’t have to read all the articles or even any of them! The point of this article is to summarise the work and perhaps make similar porting efforts a bit more approachable for others in the same position: anyone having a vague level of familiarity with L4Re/Fiasco.OC or similar systems, also having a device that might be supported, and being somewhat familiar with writing code that drives hardware.

Practical Details

It might be useful to give certain practical details here, if only to indicate the nature of the development and testing routine employed in this endeavour. First of all, I have been using a chroot containing the Debian “unstable” distribution for the i386 architecture. Although this was essential for a time when building the software for the CI20 and trying to take advantage of Debian’s cross-compiler packages, any fairly recent version of Debian would probably be fine because I ended up using a Buildroot toolchain to be able to target the Ben. You could probably choose any Free Software distribution and reproduce what I have done.

The distribution of patches contains instructions regarding preparation and the building of the software. It isn’t too useful to repeat that information here, but the following things need doing:

Installing packages for build tools
Obtaining or building a cross-compiler
Checking out the source code for L4Re and Fiasco.OC from its repository
Applying the patches
Configuring and building the kernel
Configuring and building the runtime environment
Copying the payload to a memory card
Booting the device

Some scripts have been included in the patch distribution, one of which should do the tricky job of applying patches to the repository checkout according to the chosen device configuration. Because a centralised version control system (Subversion) has been used to publish the L4Re and Fiasco.OC sources, I had to find a way of working with my own local changes. Consequently, I wrote a few scripts to maintain bundles of changes associated with certain files, and I then managed these bundles in a different version control system. Yes, this effectively meant versioning the changes themselves!

Things would be simpler with a decentralised version control system because local commits would be convenient, and upstream updates would be incorporated into the repository separately and merged with local changes in a controlled fashion. One of the corporate participants has made a Git repository for Fiasco.OC available, which may alleviate some issues, although I am increasingly finding larger Git repositories to be unusable on my modest hardware, and I also tend to disagree with everybody deciding to put everything on GitHub.

Fixing and Building

Needing to repeatedly build, test, go back and fix, I found myself issuing the same command sequences a lot. When working with the kernel, I tended to enter the kernel build directory, which I called “mybuild”, edit the kernel sources, and then re-run the make command:

cd mybuild
vi ../src/kern/mips/exception.S # edit a familiar file with vim
make

Having built a new kernel, I would then need to build a new payload to deploy, which meant ascending the directory hierarchy and building an image in the L4Re section of the sources:

cd ../../../l4
make O=mybuild uimage E=mips-qi_lb60-spectrum-example

Given a previously-built “user space”, this would bundle the new kernel together with code that might be able to test it. Of particular importance is the bootstrap code which launches the kernel: without that, there is no point in even trying to test the kernel!

I found that re-building L4Re components seemed to require a general build to be performed:

make O=mybuild

If that proved successful, an image would then be built and tested. In general, focusing on either the kernel or some user space component meant that there was rarely a need to build a new kernel and then build much of the user space.

Work Summary

The patches accumulated during this process cover a range of different areas of functionality. Looking at them organised by functional area, instead of in the more haphazard fashion presented throughout the series of articles, allows for a more convenient review of the work actually needed to get the job done.

Build System Adjustments and Various Fixes

As early as my experiments with the CI20, I experienced the need to fix some things that didn’t work on my system, either due to some Debian peculiarities or differences in compiler behaviour:

l4util-mips-thread.diff (fixes a symbol visibility issue with certain compiler versions)
mips-gcc-cpload.diff (fixes the initialisation of certain L4Re components)
no-at.diff (allows the build to work on Debian for the i386 architecture)

Other adjustments are required to let the build system do its job, setting paths for other components and for the toolchains:

conf-makeconf-boot.diff (lets the L4Re build system find things like the kernel, modules and hardware descriptions)
qi_lb60-gcc-buildroot-fiasco.diff (changes the compiler and architecture settings)
qi_lb60-gcc-buildroot-l4re.diff (changes the compiler, architecture and soft-float settings)

The build system also needs directing towards new drivers, and various files need to be excluded or changed:

ingenic-mips-drivers-top.diff (enables drivers added by this work)
qi_lb60-fbdrv.diff (changes the splash image for the framebuffer driver)
qi_lb60-l4re.diff (includes a temporary fix disabling a Mag plugin)

The first of these is important to remember when adding drivers since it changes the l4/pkg/drivers/Control file and defines the driver “packages” provided by each of the driver libraries. These package definitions help the build system work out which other parts of the system need to be consulted when building a particular driver.

Supporting MIPS32r1 Devices

Throughout the kernel and L4Re, changes need making to support the earlier architecture version provided by the JZ4720. The bulk of the following patch files deals with such changes:

qi_lb60-fiasco.diff
qi_lb60-l4re.diff

Maybe I will try and break out the architecture version changes into specific patch files, provided this does not result in the original source files ending up being patched by multiple patch files. My aim has been to avoid patches having to be applied in a particular order, and that starts to happen when multiple patches modify the same file.

Describing the Ben NanoNote

The kernel needs some knowledge of the Ben with regard to timers and interrupts. Meanwhile, L4Re needs to set the Ben up correctly when booting. Both sections of the system need an awareness of how memory is going to be used, and extra configuration options need to be provided to merely allow the selection of the Ben for building. Currently, the following patch files include things concerned with such matters:

qi_lb60-fiasco.diff (contains timer, interrupt and memory details, plus configuration system changes)
qi_lb60-l4re.diff (contains bootstrap and memory details, plus configuration system changes)
qi_lb60-platform.diff (platform definitions for the Ben in L4Re)

One significant objective here is to be able to offer the Ben as a “first class” configuration option and have the build system do the right thing, setting up all the components and code regions that the Ben needs to function.

Introducing Driver Code

To be able to activate the framebuffer on the Ben, driver code needs introducing for a few peripherals provided by the JZ4720: CPM (clock/power management), GPIO (general-purpose input/output) and LCD (liquid crystal display, or similar). A few different patch files cover these areas:

ingenic-mips-cpm.diff (CPM support for JZ4720 and JZ4780)
ingenic-mips-gpio.diff (GPIO support for JZ4720 and JZ4780)
qi_lb60-lcd.diff (LCD support for JZ4720)

The JZ4780 support is intended for the CI20 and will not be used with the Ben. However, it is convenient to incorporate support for these different platforms in the same patch file in each instance.

Meanwhile, the LCD driver should work with a range of JZ4700-series devices (labelled as JZ4740 in the patches). While focusing on getting things working, the only panel supported by this work was that provided by the Ben. Since then, support has been made slightly more general, just as was done with the Linux kernel support for products employing this particular SoC family and, subsequently, for panels in general. (Linux has moved towards a “device tree” approach for specifying things like panels and displays, although this is arguably just restating things that were once C-coded structures in another, rather peculiar, format.)

To support these drivers, some useful code has been copied from elsewhere in L4Re:

drivers_frst-register-block.diff

This provides a convenient abstraction for registers that is exposed via an include directive:

#include <l4/drivers/hw_mmio_register_block.h>

Indeed, it is worth focusing on the LCD driver briefly. The code has its origins in existing driver code written for the Ben that I adapted to get working as part of a simple “bare metal” payload. I have maintained a separation between the more intricate hardware configuration and aspects that deal with the surrounding software. As part of L4Re, the latter involves obtaining access to memory using the appropriate API calls and invoking other drivers.

In L4Re, there is a kind of framework for LCD drivers, and the existing drivers seem to be written in C rather than C++. Reminiscent of Linux, there is a mechanism for exporting driver operations using a well-defined data structure, and this permits the “probing” of drivers to see if they can be enabled and if meaningful information can be obtained about things like the supported resolution, colour depth and pixel format. To make the existing code compatible with L4Re, a fair amount of the work involves translating the information already known (and used) in the hardware configuration activity to a form that other L4Re components can understand and use.

Originally, for the GPIO driver, I had intended it to operate as part of the Io server framework. Components using GPIO functionality would then employ the appropriate API to configure and interact with the exposed input and output pins. Unfortunately, this proved rather cumbersome, and so I decided to take a simpler approach of providing the driver as an abstraction that a program would use together with explicitly-requested memory. I did decide to preserve the general form of the API for this relocated abstraction, however, meaning that various classes and methods are provided that behave in the same way as those “left behind” in the Io server framework.

Thus, a program would itself request access to the GPIO-related memory, and it would then use GPIO-related abstractions to “do the right thing” with this memory. One would envisage that such a program would not be a “normal”, unprivileged program as such, but instead be more like a server or driver in its own right. Indeed, the LCD driver employs these abstractions to use SPI-based signalling with the LCD panel, and it uses the same techniques to configure the LCD clock frequencies using the CPM-related memory and CPM-related abstractions.

Although the GPIO driver follows existing conventions, the CPM driver has no obvious precedent in L4Re, but I adopted some of the conventions employed in the GPIO driver, adding more specialised methods and functions to expose functionality specific to the SoC. Since I had previously written a CPM driver for the JZ4780, the main objective was to make the JZ4720/JZ4740 driver resemble the existing driver as much as possible.

Introducing and Configuring Example Programs

Throughout the series of articles, I was working towards running one specific example program, making some new ones on the way for testing purposes. These additional programs are provided together with their configuration, accompanied by suitable configurations for existing examples and components, by the following patch files:

ingenic-mips-modules.diff (example program definitions)
qi_lb60-examples.diff (example program implementations and configuration details)

The additional programs (defined in l4/conf/modules.list) are as follows:

mips-qi_lb60-lcd-example (implemented by qi_lb60_lcd, configured by the mips-qi_lb60-lcd files)
mips-qi_lb60-lcd-driver-example (implemented by qi_lb60_lcd_driver, configured by the mips-qi_lb60-lcd-driver files)

Configurations are provided for the existing examples and components as follows:

mips-qi_lb60-fbdrv-example (configured by the mips-qi_lb60-fbdrv files)
mips-qi_lb60-spectrum-example (configured by the mips-qi_lb60-spectrum files)

All configurations reside in the l4/conf/examples directory. All new examples reside in the l4/pkg/examples/misc directory.

Further Work

In the final article in the series, I mentioned a few ideas for further work based on that described above:

Porting to other JZ4700-series devices and products
Writing drivers for other hardware peripherals
Investigating related frameworks such as Genode, given its support for Fiasco.OC
Contemplating the relevance of these technologies to the Hurd

Since completing the above work, I have already made some progress on the first two of these topics. More on that in an upcoming post!

Some Updates

Since writing this article, a few things are worth adding. First of all, the patches produced in the initial effort described by this series of articles are now available in an “initial archive” via the Web page documenting the effort. In contrast, a “current archive” provides patches for the current state of the work, with the aim being to focus these patches only on essential support for these devices within L4Re and Fiasco.OC, and with future development being done elsewhere.

Another couple of observations have been made since completing this initial effort that qualify or correct some information provided here. On the topic of the memory map needed to support the Ben NanoNote and its bootloader, it turned out that a fairly conventional arrangement was feasible after all and that only the “exception base” might be a problem. Here is the more conventional arrangement:

0x80600000 payload load address when copied by bootm
0x802d0000 bootstrap start address
0x80010000 kernel load address
0x80001000 exception handlers

The kernel load address and bootstrap start address are now the same as for other MIPS platforms. The exception handlers are positioned 4 kilobytes above the normal exception base just in case overwriting them might upset the bootloader.

Previously, I had experienced problems with Mag and its mag-input-libinput plugin, causing me to disable that plugin to get things working. This can be avoided by just providing a capability called “vbus” when starting Mag.

Posted in Ben NanoNote, English, Fiasco.OC, Free Software, L4, MIPS, NanoNote, operating systems | Comments Off on Porting L4Re and Fiasco.OC to the Ben NanoNote (Summary)

Porting L4Re and Fiasco.OC to the Ben NanoNote (Part 9)

Thursday, March 29th, 2018

After all my effort modifying the Fiasco.OC kernel, adapting driver code for the Ben NanoNote to work in the L4 Runtime Environment (L4Re), I had managed to demonstrate programs that could access the framebuffer and thus show things on the screen. But my stated goal is to demonstrate the functioning of existing code and a range of different components, specifically the “spectrum” example, GUI multiplexer (Mag), and framebuffer driver (fb-drv), supported by additional driver functionality I have added myself.

Still to do were the tasks of getting the framebuffer driver to access my LCD driver, getting Mag to work on top of that, and then getting the “spectrum” example to work as it had done with Fiasco.OC-UX. Looking back at this exercise now, I can see that it is largely a case of wiring up the components just as we saw earlier when I showed how to get access to the hardware through the Io server. But this didn’t prevent excursions into the code for some debugging operations or a few, more lasting, modifications.

Making a Splash

I had sought to test the entire stack consisting of the example, GUI multiplexer, framebuffer driver, and the rest of the software using a configuration derived from both the UX “spectrum” demonstration (found in l4/conf/examples/x86-fb.cfg) and the corresponding demonstration for ARM-based devices (found in l4/conf/examples/arm-rv-lcd.cfg). Unsurprisingly, this did not work, and so I started out stripping down my own example configuration in order to test the components individually.

On the way, I learned a few things that I wished I’d realised earlier. The first one is pretty mundane but very important. In order to strip down the configuration, I had tried to comment out various components and had done so using the hash symbol (#), which vim had helped to make me believe was a valid comment symbol. In fact, in Lua, if the hash symbol can be used for “program metadata”, perhaps for the usual Unix scripting declaration, then its use may be restricted to such narrow purposes and no others (as far as I could tell from quickly looking through the documentation). Broader use of the symbol appears to involve it acting as the length operator.

By making an assumption almost instinctively, due to the prevalence of the hash character as a comment symbol in Unix scripting, I was now being confronted with “Starting kernel …” and nothing more, making me think that I had really broken something. I had to take several steps back, consider what might really be happening, that the Ned task responsible for executing the configuration had somehow failed, and then come to the realisation that were I able to read the serial console output, I would be staring at a Lua syntax error!

So, removing the configuration details that I didn’t want to test straight away, I set about testing the framebuffer driver all by itself. Here is the configuration file involved (edited slightly):

local io_buses =
  {
    fbdrv = l:new_channel();
  };

l:start({
  caps = {
      fbdrv = io_buses.fbdrv:svr(),
      icu = L4.Env.icu,
      sigma0 = L4.cast(L4.Proto.Factory, L4.Env.sigma0):create(L4.Proto.Sigma0),
    },
  },
  "rom/io -vvvv rom/hw_devices.io rom/mips-qi_lb60-fbdrv.io");

local fbdrv_fb = l:new_channel();

l:startv({
  caps = {
      vbus = io_buses.fbdrv,
      fb   = fbdrv_fb:svr(),
    },
  },
  "rom/fb-drv", "-c", "nanonote"); -- configuration passed to the LCD driver

It was around this time that I discovered the importance of the naming of certain things, noted previously, and so in the accompanying Io configuration, the devices are added to a vbus called “fbdrv”. What this example should do is to run just the framebuffer driver (fb-drv) and the Io server.

Now, you might be wondering what the point of running just the driver, without any client programs, might be. Well, I had noticed that the driver should show a “splash screen”, and so I eagerly anticipated seeing whatever the L4Re developers had decided to show to us upon the driver starting up. Unfortunately, all I got was a blank screen. That made me think some bad things about what might be happening in the code. Fortunately, it didn’t take me long to realise what the problem was, discovering the following (in l4/pkg/fb-drv/server/src/splash.cc):

  if (fb_info->width < SPLASHNAME.width)
    return;
  if (fb_info->height < SPLASHNAME.height)
    return;

Meanwhile, in the source file providing the screen data (l4/pkg/fb-drv/server/data/splash1.c), I saw that the width and height of the image were given as being 480 pixels and 65 pixels respectively. The Ben’s screen is 320 pixels wide and 240 pixels high, thus preventing the supplied image from being shown.

It seemed worthwhile trying to replace this image just to reassure myself that the driver was actually working. The supplied image was exported from GIMP and so I attempted to reproduce this by loading one of my own images into GIMP, cropping to 320×240, and saving as a C source file with run-length encoding, 3 bytes per pixel, just as the supplied image had been created. I then copied it into the appropriate package subdirectory (l4/pkg/fb-drv/server/data) and modified the Makefile (l4/pkg/fb-drv/server/src/Makefile), changing it to reference my image instead of the supplied one.

I also needed to change a few other things in the Makefile, as it turned out, such as the definitions of the sources and libraries used by MIPS architecture devices. It was odd to encounter architecture-specific artefacts at this level in the system, but I suppose that different architectures tend to employ different kinds of display mechanisms: the x86 architecture and derivatives have their “legacy” devices; other architectures used in system-on-a-chip products have other peripherals and devices. Anyway, building the payload and booting the Ben gave me this:

The Ben NanoNote showing my L4Re framebuffer driver splash screen: rainbow lorikeets on a suburban lawn

So, the framebuffer driver will happily run, showing its splash screen until something connects to it and gets something else onto the screen. Here, we just let it run by itself until it is time to switch off!

Missing Inputs

There was still one component to be tested before arriving at the “spectrum” example: the GUI multiplexer, Mag. But testing it alone did not necessarily seem to make sense, because unlike the framebuffer driver, Mag’s role is merely to arbitrate between different framebuffer clients. So testing it and some kind of example together seemed like the only workable strategy.

I tried to test it with the “spectrum” example, but I only ever saw the framebuffer splash screen, so it seemed that either the example or Mag wasn’t working. But if it were Mag that wasn’t working, how would I be able to find my way to the problem? I briefly considered whether Mag had a problem with the display configuration, given that I had already seen such a problem, albeit a minor one, in the framebuffer driver.

I chased references around the source code until I established that Mag was populating a fixed-length array of “factory” objects defined in one place (l4/pkg/mag-gfx/include/factory) whose members were statically defined in another place (l4/pkg/mag/server/src/screen.cc), each one indicating how to deal with a particular pixel format. Such objects being of type Mag_gfx::Mem::Factory are returned in the Mag main program (in pkg/mag/server/src/main.cc) when the following code is executed:

  Screen_factory *f = dynamic_cast<Screen_factory*>(Screen_factory::set.find(view_i.pixel_info));

I had wondered whether f was being presented with a null value and thus stopping Mag right there, but since there was a suitable factory object being created for the Ben’s pixel format, it seemed rather unlikely. So, the only approach I had considered at this point was not a particularly convenient one: I would have to replicate Mag piece by piece until discovering which part of it caused it to fail.

I set out with a simple example borrowing code from Mag that requests a memory region via a capability or channel, displaying some data on the screen. This managed to work. I then expanded the example, adding various data structures, copying functionality into the example directory from Mag, slowly reassembling what Mag had been all along. Things kept working, right until the point where I needed to set the example going as a server rather than have it wait in an infinite loop doing nothing. Then, the screen would very briefly show the splash image, then the bit patterns, and then go blank.

But maybe Mag was going to clear the framebuffer anyway and thus the server loop was working? Perhaps this was what success actually looked like under these circumstances, which is to say that it did not involve seeing two brightly-coloured birds on a lawn. At this point, out of laziness, I had avoided integrating the plugins into my example, and that was the only remaining difference between it and Mag.

I started to realise that maybe it was a matter of removing things from Mag and seeing if I could get it to work, and the obvious candidates for removal were the plugins. So I removed all the plugins as defined in the Makefile (found at pkg/mag/server/src/Makefile) and tested to see if it changed Mag’s behaviour. Sure enough, the screen went blank. I then added plugins back one by one, knowing by now that the mag-client_fb plugin was going to be required to get the example working.

Well, it turned out that there was one plugin causing all the problems: mag-input-libinput. Removing that got the “spectrum” example to work:

The Ben NanoNote showing the "spectrum" framebuffer example

Given that I haven’t addressed the matter of supporting input devices, which for the Ben would mostly concern the keyboard, disabling this errant plugin seemed like a reasonable compromise for now.

Update: a description of a remedy for the problem with the plugin is described in the summary article.

A Door Opens?

There isn’t much left to be said, which perhaps makes the ending something of an anticlimax. But perhaps this is part of the point of the exercise, that the L4Re/Fiasco.OC combination now just seems to work on the Ben NanoNote, and that it could potentially in future be just another thing that this software supports. Of course, the Ben is such a relatively rare device that it isn’t likely to have many potential users, but the SoC family of which the JZ4720 is a part is employed by a number of different devices.

If there haven’t been any privately-maintained ports of this software to those other devices, then this work potentially opens the door to its use on other handheld devices like the GCW Zero or any number of randomly-sourced pocket games consoles, portable media players, and even smartwatches and wearable devices, all of which have been vehicles for the SoC vendor’s products. The Letux 400 could probably be supported given the similarity of its own SoC, the JZ4730, to that used in the Ben.

When the Ben was released, work had been done, first by the SoC vendor, then by the Qi Hardware people, to provide a functioning GNU/Linux system for the device. Clearly, there isn’t an overwhelming need for another kind of system if the intention is to just use the device with existing Free Software. But perhaps this is another component in this exercise: to make other technological solutions possible and to explore avenues that get ignored as everyone follows the mainstream. The Ben is hardly a mainstream product itself; why not use it to make other alternative choices?

It seems interesting to consider writing other drivers for the Ben and to gain experience with microkernel-based systems design. The Genode framework might also be worth investigating, given its focus on becoming a system suitable for deployment to end-users. The Hurd was also ported in an exploratory fashion to one of the L4 implementations some time ago, and revisiting this might be possible, even simplifying the effort thanks to the evolution in features provided by implementations like Fiasco.OC.

In any case, I hope that this account has been vaguely interesting and entertaining. If you managed to read it all, you deserve my sincere thanks for spending the time to do so. A summary of the work will probably follow, and the patches themselves are available on a page dedicated to the effort. Good luck with your own investigations into alternative Free Software operating systems!

Posted in Ben NanoNote, English, Fiasco.OC, Free Software, L4, MIPS, NanoNote, operating systems | Comments Off on Porting L4Re and Fiasco.OC to the Ben NanoNote (Part 9)

Porting L4Re and Fiasco.OC to the Ben NanoNote (Part 8)

Tuesday, March 27th, 2018

With some confidence that the effort put in previously had resulted in a working Fiasco.OC kernel, I now found myself where I had expected to be at the start of this adventure: in the position of trying to get an example program working that could take advantage of the framebuffer on the Ben NanoNote. In fact, there is already an example program called “spectrum” that I had managed to get working with the UX variant of Fiasco.OC (this being a kind of “User Mode Fiasco” which runs as a normal process, or maybe a few processes, on top of a normal GNU/Linux system).

However, if this exercise had taught me nothing else, it had taught me not to try out huge chunks of untested functionality and then expect everything to work. It would instead be a far better idea to incrementally add small chunks of functionality whose correctness (or incorrectness) would be easier to see. Thus, I decided to start with a minimal example, adding pieces of tested functionality from my CI20 experiments that obtain access to various memory regions. In the first instance, I wanted to see if I could exercise control over the LCD panel via the general-purpose input/output (GPIO) memory region.

Wiring Up

At this point, I should make a brief detour into the matter of making peripherals available to programs. Unlike the debugging code employed previously, it isn’t possible to just access some interesting part of memory by taking an address and reading from it or writing to it. Not only are user space programs working with virtual memory, whose addresses need not correspond to the actual physical locations employed by the underlying hardware, but as a matter of protecting the system against program misbehaviour, access to memory regions is strictly controlled.

In L4Re, a component called Io is responsible for granting access to predefined memory regions that describe hardware peripherals or devices. My example needed to have the following resources defined:

A configuration description (placed in l4/conf/examples/mips-qi_lb60-lcd.cfg)
The Io peripheral bus description (placed in l4/conf/examples/mips-qi_lb60-lcd.io)
A list of modules needing deployment (added as an entry to l4/conf/modules.list, described separately in l4/conf/examples/mips-qi_lb60-lcd.list)

It is perhaps clearer to start with the Io-related resource, which is short enough to reproduce here:

local hw = Io.system_bus()

local bus = Io.Vi.System_bus
{
  CPM = wrap(hw:match("jz4740-cpm"));
  GPIO = wrap(hw:match("jz4740-gpio"));
  LCD = wrap(hw:match("jz4740-lcd"));
}

Io.add_vbus("devices", bus)

This is Lua source code, with Lua having been adopted as a scripting language in L4Re. Here, the important things are those appearing within the hw:match function call brackets: each of these values refers to a device defined in the Ben’s hardware description (found in l4/pkg/io/io/config/plat-qi_lb60/hw_devices.io), identified using the “hid” property. The “jz4740-gpio” value is used to locate the GPIO device or peripheral, for instance.

Also important is the name used for the bus, “devices”, in the add_vbus function call at the bottom, as we shall see in a moment. Moving on from this, the configuration description for the example itself contains the following slightly-edited details:

local io_buses =
  {
    devices = l:new_channel();
  };

l:start({
  caps = {
      devices = io_buses.devices:svr(),
      icu = L4.Env.icu,
      sigma0 = L4.cast(L4.Proto.Factory, L4.Env.sigma0):create(L4.Proto.Sigma0),
    },
  },
  "rom/io -vvvv rom/hw_devices.io rom/mips-qi_lb60-lcd.io");

l:start({
  caps = {
      icu = L4.Env.icu,
      vbus = io_buses.devices,
    },
  },
  "rom/ex_qi_lb60_lcd");

Here, the “devices” name seemingly needs to be consistent with the name used in the caps mapping for the Io server, set up in the first l:start invocation. The name used for the io_buses mapping can be something else, but then the references to io_buses.devices must obviously be changed to follow suit. I am sure you get the idea.

The consequence of this wiring up is that the example program, set up at the end, accesses peripherals using the vbus capability or channel, communicating with Io and requesting access to devices, whose memory regions are described in the Ben’s hardware description. Io acts as the server on this channel whereas the example program acts as the client.

Is This Thing On?

There were a few good reasons for trying GPIO access first. The most significant one is related to the framebuffer. As you may recall, as the kernel finished starting up, sigma0 started, and we could then no longer be sure that the framebuffer configuration initialised by the bootloader had been preserved. So, I could no longer rely on any shortcuts in getting information onto the screen.

Another reason for trying GPIO operations was that I had written code for the CI20 that accessed input and output pins, and I had some confidence in the GPIO driver code that I had included in the L4Re package hierarchy actually working as intended. Indeed, when working with the CI20, I had implemented some driver code for both the GPIO and CPM (clock/power management) functionality as provided by the CI20’s SoC, the JZ4780. As part of my preparation for this effort, I had adapted this code for the Ben’s SoC, the JZ4720. It was arguably worthwhile to give this code some exercise.

One drawback with using general-purpose input/output, however, is that the Ben doesn’t really have any conveniently-accessed output pins or indicators, whereas the CI20 is a development board with lots of connectors and some LEDs that can be used to give simple feedback on what a program might be doing. Manipulating the LCD panel, in contrast, offers a very awkward way of communicating program status.

Experiments proved somewhat successful, however. Obtaining access to device memory was as straightforward as it had been in my CI20 examples, providing me with a virtual address corresponding to the GPIO memory region. Inserting some driver code employing GPIO operations directly into the principal source file for the example, in order to keep things particularly simple and avoid linking problems, I was able to tell the panel to switch itself off. This involves bit-banging SPI commands using a number of output pins. The consequence of doing this is that the screen fades to white.

I gradually gained confidence and decided to reserve memory for the framebuffer and descriptors. Instead of attempting to use the LCD driver code that I had prepared, I attempted to set up the descriptors manually by writing values to the structure that I knew would be compatible with the state of the peripheral as it had been configured previously. But once again, I found myself making no progress. No image would appear on screen, and I started to wonder whether I needed to do more work to reinitialise the framebuffer or maybe even the panel itself.

But the screen was at least switched on, indicating that the Ben had not completely hung and might still be doing something. One odd thing was that the screen would often turn a certain colour. Indeed, it was turning purple fairly consistently when I realised my mistake: my program was terminating! And this was, of course, as intended. The program would start, access memory, set up the framebuffer, fill it with a pattern, and then finish. I suspected something was happening and when I started to think about it, I had noticed a transient bit pattern before the screen went blank, but I suppose I had put this down to some kind of ghosting or memory effect, at least instinctively.

When a program terminates, it is most likely that the memory it used gets erased so as to prevent other programs from inheriting data that they should not see. And naturally, this will result in the framebuffer being cleared, maybe even the descriptors getting trashed again. So the trivial solution to my problem was to just stop my program from terminating:

while (1);

That did the trick!

The Build-Up

Accessing the LCD peripheral directly from my example is all very well, but I had always intended to provide a display driver for the Ben within L4Re. Attempting to compile the driver as part of the general build process had already identified some errors that had been present in what effectively had been untested code, migrated from my earlier work and modified for the L4Re APIs. But I wasn’t about to try and get the driver to work in one go!

Instead, I copied more functionality into my example program, giving things like the clock management functionality some exercise, using my newly-enabled framebuffer output to test the results from various enquiries about clock frequencies, also setting various frequencies and seeing if this stopped the program from working. This did lead me on another needlessly-distracting diversion, however. When looking at the clock configuration values, something seemed wrong: they were shifted one bit to the left and therefore provided values twice as large as they should have been.

I emitted some other values and saw that they too were shifted to the left. For a while, I wondered whether the panel configuration was wrong and started adjusting all sorts of timing-related values (front and back porches, sync pulses, video-related things), reading up about how the panel was configured using the SPI communications described above. It occurred to me that I should investigate how the display output was shifted or wrapped, and I used the value 0x80000001 to put bit values at the extreme left and right of the screen. Something was causing the output to be wrapped, apparently displaced by 9 pixels or 36 bytes to the left.

You can probably guess where this is heading! The panel configuration had not changed and there was no reason to believe that it was wrong. Similarly, the framebuffer was being set up correctly and was not somehow moved slightly in memory, not that there is really any mechanism in L4Re or the hardware that would cause such a bizarre displacement. I looked again at my debugging code and saw my mistake; in concise terms, it took this form:

for (i = 0; i < fb_size / 4; i++)
{
    fb[i] = value & mask ? onpix : offpix; // write a pixel
    if ((i % 10) == 0)                     // move to the next bit?
    ...
}

For some reason, I had taken some perfectly acceptable code and changed one aspect of it. Here is what it had looked like on the many occasions I had used it in the past:

i = 0;
while (i < fb_size / 4)
{
    fb[i] = value & mask ? onpix : offpix; // write a pixel
    i++;
    if ((i % 10) == 0)                     // move to the next bit?
    ...
}

That’s right: I had accidentally reordered the increment and “next bit” tests, causing the first bit in the value to occupy only one pixel, prematurely skipping to the next bit; at the end of the value, the mask would wrap around and show the remaining nine pixels of the first bit. Again, I was debugging the debugging and wasting my own time!

But despite such distractions, over time, it became interesting to copy the complete source file that does most of the work configuring the hardware from my driver into the example directory and to use its functions directly, these functions normally being accessed by more general driver code. In effect, I was replicating the driver within an environment that was being incrementally enhanced, up until the point where I might assume that the driver itself could work.

With no obvious problems having occurred, I then created another example without all the memory lookup operations, copied-in functions, and other “instrumentation”: one that merely called the driver API, relying on the general mechanisms to find and activate the hardware…

  struct arm_lcd_ops *ops = arm_lcd_probe("nanonote");

…obtaining a framebuffer address and size…

  fb_vaddr = ops->get_fb();
  fb_size = ops->get_video_mem_size();

…and then writing data to the screen just as I had done before. Indeed, this worked just like the previous example, indicating that I had managed to implement the driver’s side of the mechanisms correctly. I had a working LCD driver!

Adopting Abstractions

If the final objective had been to just get Fiasco.OC running and to have a program accessing the framebuffer, then this would probably be the end of this adventure. But, in fact, my aim is to demonstrate that the Ben NanoNote can take advantage of the various advertised capabilities of L4Re and Fiasco.OC. This means supporting facilities like the framebuffer driver, GUI multiplexer, and potentially other things. And such facilities are precisely those demonstrated by the “spectrum” example which I had promised myself (and my readership) that I would get working.

At this point, I shouldn’t need to write any more of my own code: the framebuffer driver, GUI multiplexer, and “spectrum” example are all existing pieces of L4Re code. But what I did need to figure out was whether my recipe for wiring these things up was still correct and functional, and whether my porting exercise had preserved the functionality of these things or mysteriously broken them!

Posted in Ben NanoNote, English, Fiasco.OC, Free Software, L4, MIPS, NanoNote, operating systems | Comments Off on Porting L4Re and Fiasco.OC to the Ben NanoNote (Part 8)

Porting L4Re and Fiasco.OC to the Ben NanoNote (Part 7)

Tuesday, March 27th, 2018

Having tracked down lots of places in the Fiasco.OC kernel that needed changing to run on the Ben NanoNote, including some places I had already visited, changed, and then changed again to fix my earlier mistakes, I was on the “home straight”, needing only to activate the sigma0 thread and the boot thread before the kernel would enter its normal operational state. But the sigma0 thread was proving problematic: control was apparently passed to it, but the kernel would never get control back.

Could this be something peculiar to sigma0, I wondered? I decided to swap the order of the activation, putting the boot thread first and seeing if the kernel “kept going”. Indeed, it did seem to, showing an indication on screen that I had inserted into the code to test whether the kernel regained control after activating the boot thread. So this indeed meant that something about sigma0 was causing the problem. I decided to trace execution through various method calls to see what might be going wrong.

The Big Switch

The process of activating a thread is rather interesting, involving the following methods:

Context::activate (in kernel/fiasco/src/kern/context.cpp)
Context::switch_to_locked
Context::schedule_switch_to_locked
Context::switch_exec_locked
Context::switch_cpu (in kernel/fiasco/src/kern/mips/context-mips.cpp)

This final, architecture-specific method saves a few registers on the stack belonging to the current thread, doing so because we are now going to switch away from that thread. One of these registers ($31 or $ra) holds the return address which, by convention, is the address the processor jumps to after returning from a subroutine, function, method or something of that sort. Here, the return address, before it is stored, is set to a label just after a jump (or branch) instruction that is coming up. The flow of the code is such that we are to imagine that this jump will occur and then the processor will return to the label immediately following the jump:

Set return address register to refer to the instruction after the impending jump
Store registers on the stack
Take the jump…
Continue after the jump

However, there is a twist. Just before the jump is taken, the stack pointer is switched to one belonging to another thread. On top of this, as the jump is taken, the return address is also replaced by loading a value from this other thread’s stack. The processor then jumps to the Context::switchin_context method. When the processor encounters the end of the generated code for this method, it encounters an instruction that makes it jump to the address in the return address register. Here is a more accurate summary:

Set return address register to refer to the instruction after the impending jump
Store registers on the stack
Switch the stack pointer, referencing another thread’s stack
Switch the return address to the other thread’s value
Take the jump…

Let us consider what might happen if this other thread was, in fact, the one we started with: the original kernel thread. Then the return address register would now hold the address of the label after the jump. Control would now pass back into the Context::switch_cpu method, and back out of that stack of invocations shown above, going from bottom to top.

But this is not what happens at the moment: the return address is something that was conjured up for a new thread and execution will instead proceed in the place indicated by that address. A switch occurs, leaving the old thread dormant while a new thread starts its life somewhere else. The problem I now faced was figuring out where the processor was “returning” to at the end of Context::switchin_context and what was happening next.

Following the Threads

I already had some idea of where the processor would end up, but inserting some debugging code in Context::switchin_context and reading the return address from the $ra register allowed me to see what had been used without chasing down how the value had got there in the first place. Then, there is a useful tool that can help people like me find out the significance of a program address. Indeed, I had already used it a few times by this stage: the sibling of objdump known as addr2line. Here is an example of its use:

mipsel-linux-gnu-addr2line -e mybuild/fiasco.debug 814165f0

This indicated that the processor was “returning” to the Thread::user_invoke method (in kernel/fiasco/src/kern/mips/thread-mips.cpp). In fact, looking at the Thread::Thread constructor function, it becomes apparent that this information is specified using the Context::prepare_switch_to method, setting this particular destination up for the above activation “switch” operation. And here we encounter some more architecture-specific tricks.

One thing that happens of importance in user_invoke is the disabling of interrupts. But if sigma0 is to be activated, they have to be enabled again so that sigma0 is then interrupted when a timer interrupt occurs. And I couldn’t find how this was going to be achieved in the code: nothing was actually setting the interrupt enable (IE) flag in the status register.

The exact mechanism escaped me for a while, and it was only after some intensive debugging interventions that I realised that the status register is also set up way back in the Thread::Thread constructor function. There, using the result of the Cp0_status::status_eret_to_user_ei method (found in kernel/fiasco/src/kern/mips/cp0_status.cpp), the stored status register is set for future deployment. The status_eret_to_user_ei method initially looks like it might do things directly with the status register, but it just provides a useful value for the point at which control is handed over to a “user space” program.

And, indeed, in the routine called ret_from_user_invoke (found in kernel/fiasco/src/kern/mips/exception.S), implemented by a pile of macros, there is a macro called restore_cp0_status that finally sets the status register to this useful value. A few instructions later, the “return from exception” instruction called eret appears and, with that, control passes to user space and hopefully some valid program code.

Finding the Fault

I now wondered whether the eret instruction was “returning” to a valid address. This caused me to take a closer look at the data structure, Entry_frame (found in kernel/fiasco/src/kern/mips/entry_frame-mips.cpp), used to hold the details of thread execution states. Debugging in user_invoke, by invoking the ip method on the current context (itself obtained from the stored stack information) yielded an address of 0x2000e0. I double-checked this by doing some debugging in the different macros implementing the routine returning control to user space.

Looking around, the Makefile for sigma0 (found as l4/pkg/l4re-core/sigma0/server/src/Makefile) provided this important clue as to its potential use of this address:

DEFAULT_RELOC_mips  := 0x00200000

Using our old friend objdump on the sigma0 binary (found as mybuild/bin/mips_32/l4f/sigma0), it was confirmed that 0x2000e0 is the address of the _start routine for sigma0. So we could at least suppose that the eret instruction was passing control to the start of sigma0.

I had a suspicion that some instruction was causing an exception but not getting handled. But I had checked the generated code already and found no new unsupported instructions. This now meant the need to debug an unhandled exception condition. This can be done with care, as always, in the assembly language file containing the various “bare metal” handlers (exception.S, mentioned above), and such an intervention was enough to discover that the cause of the exception was an invalid memory access: an address exception.

Now, the handling of such exceptions can be traced from the handlers into the kernel. There is what is known as a “vector table” which lists the different exception causes and the corresponding actions to be taken when they occur. One of the entries in the table looks like this:

        ENTRY_ADDR(slowtrap)      # AdEL

This indicates that for an address exception upon load or instruction fetch, the slowtrap routine will be used. And this slowtrap routine itself employs a jump to a function in the kernel called thread_handle_trap (found in kernel/fiasco/src/kern/mips/thread-mips.cpp). Here, unsurprisingly, I found that attempts to handle the exception would fail and that the kernel would enter the Thread::halt method (in kernel/fiasco/src/kern/thread.cpp). This was a natural place for my next debugging intervention!

I now emitted bit patterns for several saved registers associated with the thread/context. One looked particularly interesting: the stored exception program counter which contained the value 0x2092ac. I had a look at the code for sigma0 using objdump and saw the following (with some extra annotations added for explanatory purposes):

  209290:       3c02fff3        lui     v0,0xfff3
  209294:       24422000        addiu   v0,v0,8192    // 0xfff32000
  209298:       afc2011c        sw      v0,284(s8)
  20929c:       7c02e83b        0x7c02e83b            // rdhwr v0, $29 (ULR)
  2092a0:       afc20118        sw      v0,280(s8)
  2092a4:       8fc20118        lw      v0,280(s8)
  2092a8:       8fc30158        lw      v1,344(s8)
  2092ac:       8c508ff4        lw      s0,-28684(v0) // 0xfff2aff4

Of particular interest was the instruction at 0x20929c, which I annotated above as corresponding to our old favourite, rdhwr, along with some other key values. Now, the final instruction above is where the error occurs, and it is clear that the cause is the access to an address that is completely invalid (as annotated). The origin of this address information occurs in the second instruction above. You may have realised by now that the rdhwr instruction was failing to set the v0 (or $v0) register with the value retrieved from the indicated hardware register.

How could this be possible?! Support for this rdhwr variant was already present in Fiasco.OC, so could it be something I had done to break it? Perhaps the first rule of debugging is not to assume your own innocence for any fault, particularly if you have touched the code in the recent past. So I invoked my first strategy once again and started to look in the “reserved instruction” handler for likely causes of problems. Sure enough, I found this:

        ASM_INS         $at, zero, 0, THREAD_BLOCK_SHIFT, k1 # TCB addr in $at

I had introduced a macro for the ins instruction that needed a temporary register to do its work, it being specified by the last argument. But k1 (or $k1) still holds the instruction that is being inspected, and it gets used later in the routine. By accidentally overwriting it, the wrong target register gets selected, and then the hardware register value is transferred to the wrong register, leaving the correct register unaffected. This is exactly what we saw above!

What I meant to write was this:

        ASM_INS         $at, zero, 0, THREAD_BLOCK_SHIFT, k0 # TCB addr in $at

We can afford to overwrite k0 (or $k0) since it gets overwritten shortly after, anyway. And sure enough, with this modification, sigma0 started working, with control being returned to the kernel after its activation.

Returning to User Space

Verifying that the kernel boot process completed was slightly tricky in that further debugging interventions seemed to be unreliable. Although I didn’t investigate this thoroughly, I suspect that with sigma0 initialised, memory may have been cleared, and this clearing activity erases the descriptor structure used by the LCD peripheral. So it then becomes necessary to reinitialise the peripheral by choosing somewhere for the descriptor, writing the appropriate members, and enabling the peripheral again. Since we might now be overwriting other things critical for the system’s proper functioning, we cannot then expect execution to continue after we have written something informative to the framebuffer, so an infinite loop deliberately hangs the kernel and lets us see our debugging output on the screen.

I felt confident that the kernel was now booting and going about its normal business of switching between threads, handling interrupts and exceptions, and so on. For testing purposes, I had chosen the “hello” example as a user space program to accompany the kernel in the deployed payload, but this example is useless on the Ben without connecting up the serial console connection, which I hadn’t done. So now, I needed to start preparing examples that would actually show something on the screen, working towards running the “spectrum” example provided in the L4Re distribution.

Posted in Ben NanoNote, English, Fiasco.OC, Free Software, L4, MIPS, NanoNote, operating systems | Comments Off on Porting L4Re and Fiasco.OC to the Ben NanoNote (Part 7)

Porting L4Re and Fiasco.OC to the Ben NanoNote (Part 6)

Monday, March 26th, 2018

With all the previous effort adjusting Fiasco.OC and the L4Re bootstrap code, I had managed to get the Ben NanoNote to run some code, but I had encountered a problem entering the kernel. When an instruction was encountered that attempted to clear the status register, it would seem that the boot process would not get any further. The status register controls things like which mode the processor is operating in, and certain modes offer safety guarantees about how much else can go wrong, which is useful when they are typically entered when something has already gone wrong.

When starting up, a MIPS-based processor typically has the “error level” (ERL) flag set in the status register, meaning that normal operations are suspended until the processor can be configured. The processor will be in kernel mode and it will bypass the memory mapping mechanisms when reading from and writing to memory addresses. As I found out in my earlier experiments with the Ben, running in kernel mode with ERL set can mask problems that may then emerge when unsetting it, and clearing the status register does precisely that.

With the kernel’s _start routine having unset ERL, accesses to lower memory addresses will now need to go through a memory mapping even though kernel mode still applies, and if a memory mapping hasn’t been set up then exceptions will start to occur. This tripped me up for quite some time in my previous experiments until I figured out that a routine accessing some memory in an apparently safe way was in fact using lower memory addresses. This wasn’t a problem with ERL set – the processor wouldn’t care and just translate them directly to physical memory – but with ERL unset, the processor now wanted to know how those addresses should really be translated. And since I wasn’t handling the resulting exception, the Ben would just hang.

Debugging the Debugging

I had a long list of possible causes, some more exotic than others: improperly flushed caches, exception handlers in the wrong places, a bad memory mapping configuration. I must admit that it is difficult now, even looking at my notes, to fully review my decision-making when confronting this problem. I can apply my patches from this time and reproduce a situation where the Ben doesn’t seem to report any progress from within the kernel, but with hindsight and all the progress I have made since, it hardly seems like much of an obstacle at all.

Indeed, I have already given the game away in what I have already written. Taking over the framebuffer involves accessing some memory set up by the bootloader. In the bootstrap code, all of this memory should actually be mapped, but since ERL is set I now doubt that this mapping was even necessary for the duration of the bootstrap code’s execution. And I do wonder whether this mapping was preserved once the kernel was started. But what appeared to happen was that when my debugging code tried to load data from the framebuffer descriptor, it would cause an exception that would appear to cause a hang.

Since I wanted to make progress, I took the easy way out. Rather than try and figure out what kind of memory mapping would be needed to support my debugging activities, I simply wrapped the code accessing the framebuffer descriptor and the framebuffer itself with instructions that would set ERL and then unset it afterwards. This would hopefully ensure that even if things weren’t working, then at least my things weren’t making it all worse.

Lost in Translation

It now started to become possible to track the progress of the processor through the kernel. From the _start routine, the processor jumps to the kernel_main function (in kernel/fiasco/src/kern/mips/main.cpp) and starts setting things up. As I was quite sure that the kernel wasn’t functioning correctly, it made sense to drop my debugging code into various places and see if execution got that far. This is tedious work – almost a form of inefficient “single-stepping” – but it provides similar feedback about how the code is behaving.

Although I had tried to do a reasonable job translating certain operations to work on the JZ4720 used by the Ben, I was aware that I might have made one or two mistakes, also missing areas where work needed doing. One such area appeared in the Cpu class implementation (in kernel/fiasco/src/kern/mips/cpu-mips.cpp): a rich seam of rather frightening-looking processor-related initialisation activities. The Cpu::first_boot method contained this ominous-looking code:

require(c.r<0>().ar() > 0,  "MIPS r1 CPUs are notsupported\n");

I hadn’t noticed this earlier, and its effect is to terminate the kernel if it detects an architecture revision just like the one provided by the JZ4720. There were a few other tests of the capabilities of the processor that needed to be either disabled or reworked, and I spent some time studying the documentation concerning configuration registers and what they can tell programs about the kind of processor they are running on. Amusingly, our old friend the rdhwr instruction is enabled for user programs in this code, but since the JZ4720 has no notion of that instruction, we actually disable the instruction that would switch rdhwr on in this way.

Another area that proved to be rather tricky was that of switching interrupts on and having the system timer do its work. Early on, when laying the groundwork for the Ben in the kernel, I had made a rough translation of the CI20 code for the Ben, and we saw some of these hardware details a few articles ago. Now it was time to start the timer, enable interrupts, and have interrupts delivered as the timer counter reaches its limit. The part of the kernel concerned is Kernel_thread::bootstrap (in kernel/fiasco/src/kern/kernel_thread.cpp), where things are switched on and then we wait in a delay loop for the interrupts to cause a variable to almost magically change by itself (in kernel/fiasco/src/drivers/delayloop.cpp):

  Cpu_time t = k->clock;
  Timer::update_timer(t + 1000); // 1ms
  while (t == (t1 = k->clock))
    Proc::pause();

But the processor just got stuck in this loop forever! Clearly, I hadn’t done everything right. Some investigation confirmed that various timer registers were being set up correctly, interrupts were enabled, but that they were being signalled using the “interrupt pending 2” (IP2) flag of the processor’s “cause” register which reports exception and interrupt conditions. Meanwhile, I had misunderstood the meaning of the number in the last statement of the following code (from kernel/fiasco/src/kern/mips/bsp/qi_lb60/mips_bsp_irqs-qi_lb60.cpp):

  _ic[0] = new Boot_object<Irq_chip_ingenic>(0xb0001000);
  m->add_chip(_ic[0], 0);

  auto *c = new Boot_object<Cascade_irq>(nullptr, ingenic_cascade);
  Mips_cpu_irqs::chip->alloc(c, 1);

Here is how the CI20 had been set up (in kernel/fiasco/src/kern/mips/bsp/ci20/mips_bsp_irqs-ci20.cpp):

  _ic[0] = new Boot_object<Irq_chip_ingenic>(0xb0001000);
  m->add_chip(_ic[0], 0);
  _ic[1] = new Boot_object<Irq_chip_ingenic>(0xb0001020);
  m->add_chip(_ic[1], 32);

  auto *c = new Boot_object<Cascade_irq>(nullptr, ingenic_cascade);
  Mips_cpu_irqs::chip->alloc(c, 2);

With virtually no knowledge of the details, and superficially transcribing the code, editing here and there, I had thought that the argument to the alloc method referred to the number of “chips” (actually the number of registers dedicated to interrupt signals). But in fact, it indicates an adjustment to what the kernel code calls a “pin”. In the Mips_cpu_irq_chip class (in kernel/fiasco/src/kern/mips/mips_cpu_irqs.cpp), we actually see that this number is used to indicate the IP flag that will need to be tested. So, the correct argument to the alloc method is 2, not 1, just as it is for the CI20:

  Mips_cpu_irqs::chip->alloc(c, 2);

This fix led to another problem and the discovery of another area that I had missed: the “exception base” gets redefined in the kernel (in src/kern/mips/cpu-mips.cpp), and so I had to make sure it was consistent with the other places I had defined it by changing its value in the kernel (in src/kern/mips/mem_layout-mips32.cpp). I mentioned this adjustment previously, but it was at this stage that I realised that the exception base gets set in the kernel after all. (I had previously set it in the bootstrap code to override the typical default of 0x80000000.)

Leaving the Kernel

Although we are not quite finished with the kernel, the next significant obstacle involved starting programs that are not part of the kernel. Being a microkernel, Fiasco.OC needs various other things to offer the kind of environment that a “monolithic” operating system kernel might provide. One of these is the sigma0 program which has responsibilities related to “paging” or allocating memory. Attempting to start such programs exercises how the kernel behaves when it hands over control to these other programs. We should expect that timer interrupts should periodically deliver control back to the kernel for it to do work itself, this usually involving giving other programs a chance to run.

I was seeing the kernel in the “home straight” with regard to completing its boot activities. The Kernel_thread::init_workload method (in kernel/fiasco/src/kern/kernel_thread-std.cpp) was pretty much the last thing to be invoked in the Kernel_thread::run method (in kernel/fiasco/src/kern/kernel_thread.cpp) before normal service begins. But it was upon attempting to “activate” a thread to run sigma0 that a problem arose: the kernel never got control back! This would be my last major debugging exercise before embarking on a final excursion into “user space”.

Posted in Ben NanoNote, English, Fiasco.OC, Free Software, L4, MIPS, NanoNote, operating systems | Comments Off on Porting L4Re and Fiasco.OC to the Ben NanoNote (Part 6)

Porting L4Re and Fiasco.OC to the Ben NanoNote (Part 5)

Sunday, March 25th, 2018

We left off last time with the unenviable task of debugging a non-working system. In such a situation the outlook can seem bleak, but I mentioned a couple of strategies that can sometimes rescue the situation. The first of these is to rule out areas of likely problems, which in my case tends to involve reviewing what I have done and seeing if I have made some stupid mistakes. Naturally, it helps to have a certain amount of experience to inform this process; otherwise, practically everything might be a place where such mistakes may be lurking.

One thing that bothered me was the use of the memory map by Fiasco.OC on the Ben NanoNote. When deploying my previous experimental work to the Ben, I had become aware of limitations around where things might be stored, at least while any bootloader might be active. Care must be taken to load new code away from memory already being used, and it seems that the base of memory must also be avoided, at least at first. I wasn’t convinced that this avoidance was going to happen with the default configuration of the different components.

The Memory Map

Of particular concern were the exception vectors – where the processor jumps to if an exception or interrupt occurs – whose defaults in Fiasco.OC situate them at the base of kernel memory: 0x80000000. If the bootloader were to try and copy the code that handles exceptions to this region, I rather suspected that it would immediately cause problems.

I was also unsure whether the bootloader was able to load the payload from the MMC/MicroSD card into memory without overwriting itself or corrupting the payload as it copied things around in memory. According to the boot script that seems to apply to the Ben, it loads the payload into memory at 0x80600000:

#define CONFIG_BOOTCOMMANDFROMSD        "mmc init; ext2load mmc 0 0x80600000 /boot/uImage; bootm"

Meanwhile, the default memory settings for L4Re has code loaded rather low in the kernel address space at 0x802d0000. Without really knowing what happens, I couldn’t be sure that something might get copied to that location, that the copied data might then run past 0x80600000, and this might overwrite some other thing – the kernel, perhaps – that hadn’t been copied yet. Maybe this was just paranoia, but it was at least something that could be dealt with. So I came up with some alternative arrangements:

0x81401000 exception handlers
0x81400000 kernel load address
0x80d00000 bootstrap start address
0x80600000 payload load address when copied by bootm

I wanted to rule out memory conflicts but try and not conjure up more exotic solutions than strictly necessary. So I made some adjustments to the location of the kernel, keeping the exception vectors in the same place relative to the kernel, but moving the vectors far away from the base of memory. It turns out that there are quite a few places that need changing if you do this:

A configuration setting, CONFIG_KERNEL_LOAD_ADDR-32, in the kernel build scripts (in kernel/fiasco/src/Modules.mips)
The exception base, EXC_BASE, in the kernel’s linker script (in kernel/fiasco/src/kernel.mips.ld)
The exception base, Exception_base, in a description of the kernel memory layout (in kernel/fiasco/src/kern/mips/mem_layout-mips32.cpp)
The exception base, if it is mentioned in the bootstrap initialisation (in l4/pkg/bootstrap/server/src/ARCH-mips/crt0.S)

The location of the rest of the payload seems to be configured by just changing DEFAULT_RELOC_mips32 in the bootstrap package’s build scripts (in l4/pkg/bootstrap/server/src/Make.rules).

With this done, I had hoped that I might have “moved the needle” a little and provoked a visible change when attempting to boot the system, but this was always going to be rather optimistic. Having pursued the first strategy, I now decided to pursue the second.

Update: it turns out that a more conventional memory arrangement can be used, and this is described in the summary article.

Getting in at the Start

The second strategy is to use every opportunity to get the device to show what it is doing. But how can we achieve this if we cannot boot the kernel and start some code that sets up the framebuffer? Here, there were two things on my side: the role of the bootstrap code, it being rather similar to code I have written before, and the state of the framebuffer when this code is run.

I had already discovered that provided that the code is loaded into a place that can be started by the bootloader, then the _start routine (in l4/pkg/bootstrap/server/src/ARCH-mips/crt0.S) will be called in kernel mode. And I had already looked at this code for the purposes of identifying instructions that needed rewriting as well as for setting the “exception base”. There were a few other precautions that were worth taking here before we might try and get the code to show its activity.

For instance, the code present that attempts to enable a floating point unit in the processor does not apply to the Ben, so this was disabled. I was also unconvinced that the memory mapping instructions would work on the Ben: the JZ4720 does not seem to support memory pages of 256MB, with the Ben only having 32MB anyway, so I changed this to use 16MB pages instead. This must be set up correctly because any wandering into unmapped memory – visiting bad addresses – cannot be rectified before the kernel is active, and the whole point of the bootstrap code is to get the kernel active!

Now, it wasn’t clear just how far the processor was getting through this code before failing somewhere, but this is where the state of the framebuffer comes in. On the Ben, the bootloader initialises the framebuffer in order to show the state of the device, indicate whether it found a payload to load and boot from, complain about error conditions, and so on. It occurred to me that instead of trying to initialise a framebuffer by programming the LCD peripheral in the JZ4720, set up various structures in memory, decide where these structures should even be situated, I could just read the details of the existing framebuffer from the LCD peripheral’s registers, then find out where the framebuffer resides, and then just write whatever data I liked to the framebuffer in order to communicate with the outside world.

So, I would just need to write a few lines of assembly language, slip it into the bootstrap code, and then see if the framebuffer was changed and the details of interest written to the Ben’s display. Here is a fragment of code in a form that would become rather familiar after a time:

        li $8, 0xb3050040       /* LCD_DA0 */
        lw $9, 0($8)            /* &descriptor */
        lw $10, 4($9)           /* fsadr = descriptor[1] */
        lw $11, 12($9)          /* ldcmd = descriptor[3] */
        li $8, 0x00ffffff
        and $11, $8, $11        /* size = ldcmd & LCD_CMD0_LEN */
        li $9, 0xa5a5a5a5
1:
        sw $9, 0($10)           /* *fsadr = ... */
        addiu $11, $11, -1      /* size -= 1 */
        addiu $10, $10, 4       /* fsadr += 4 */
        bnez $11, 1b            /* until size == 0 */
        nop

To summarise, it loads the address of a “descriptor” from a configuration register provided by the LCD peripheral, this register having been set by the bootloader. It then examines some members of the structure provided by the descriptor, notably the framebuffer address (fsadr) and size (a subset of ldcmd). Just to show some sign of progress, the code loops and fills the screen with a specific value, in this case a shade of grey.

By moving this code around in the bootstrap initialisation routine, I could see whether the processor even managed to get as far as this little debugging fragment. Fortunately for me, it did get run, the screen did turn grey, and I could then start to make deductions about why it only got so far but no further. One enhancement to the above that I had to make after a while was to temporarily change the processor status to “error level” (ERL) when accessing things like the LCD configuration. Not doing so risks causing errors in itself, and there is nothing more frustrating than chasing down errors only to discover that the debugging code caused these errors and introduced them as distractions from the ones that really matter.

Enter the Kernel

The bootstrap code isn’t all assembly language, and at the end of the _start routine, the code attempts to jump to __main. Provided this works, the processor enters code that started out life as C++ source code (in l4/pkg/bootstrap/server/src/ARCH-mips/head.cc) and hopefully proceeds to the startup function (in l4/pkg/bootstrap/server/src/startup.cc) which undertakes a range of activities to prepare for the kernel.

Here, my debugging routine changed form slightly, minimising the assembly language portion and replacing the simple screen-clearing loop with something in C++ that could write bit patterns to the screen. It became interesting to know what address the bootstrap code thought it should be using for the kernel, and by emitting this address’s bit pattern I could check whether the code had understood the structure of the payload. It seemed that the kernel was being entered, but upon executing instructions in the _start routine (in kernel/fiasco/src/kern/mips/crt0.S), it would hang.

The Ben NanoNote showing a bit pattern on the screen with adjacent bits using slightly different colours to help resolve individual bit values; here, the framebuffer address is shown (0x01fb5000), but other kinds of values can be shown, potentially many at a time

This now led to a long and frustrating process of detective work. With a means of at least getting the code to report its status, I had a chance of figuring out what might be wrong, but I also needed to draw on experience and ideas about likely causes. I started to draw up a long list of candidates, suggesting and eliminating things that could have been problems that weren’t. Any relief that a given thing was not the cause of the problem was tempered by the realisation that something else, possibly something obscure or beyond the limit of my own experiences, might be to blame. It was only some consolation that the instruction provoking the failure involved my nemesis from my earlier experiments: the “error level” (ERL) flag in the processor’s status register.

Posted in Ben NanoNote, English, Fiasco.OC, Free Software, L4, MIPS, NanoNote, operating systems | Comments Off on Porting L4Re and Fiasco.OC to the Ben NanoNote (Part 5)

Porting L4Re and Fiasco.OC to the Ben NanoNote (Part 4)

Saturday, March 24th, 2018

As described previously, having hopefully done enough to modify the kernel – Fiasco.OC – for the Ben NanoNote, it then became necessary to investigate the bootstrap package that is responsible for setting up the hardware and starting the kernel. This package resides in the L4Re distribution, which is technically a separate thing, even though both L4Re and Fiasco.OC reside in the same published repository structure.

Before continuing into the details, it is worth noting which things need to be retrieved from the L4Re section of the repository in order to avoid frustration later on with package dependencies. I had previously discovered that the following package installation operation would be required (from inside the l4 directory):

svn update pkg/acpica pkg/bootstrap pkg/cxx_thread pkg/drivers pkg/drivers-frst pkg/examples \
           pkg/fb-drv pkg/hello pkg/input pkg/io pkg/l4re-core pkg/libedid pkg/libevent \
           pkg/libgomp pkg/libirq pkg/libvcpu pkg/loader pkg/log pkg/mag pkg/mag-gfx pkg/x86emu

With the listed packages available, it should be possible to build the examples that will eventually interest us. Some of these appear superfluous – x86emu, for instance – but some of the more obviously-essential packages have dependencies on these other packages, and so we cannot rely on our intuition alone!

Also needed when building a payload is some path definitions in the l4/conf/Makeconf.boot file. Here is what I used:

MODULE_SEARCH_PATH += $(L4DIR_ABS)/../kernel/fiasco/mybuild
MODULE_SEARCH_PATH += $(L4DIR_ABS)/conf/examples
MODULE_SEARCH_PATH += $(L4DIR_ABS)/pkg/io/io/config
BOOTSTRAP_SEARCH_PATH  = $(L4DIR_ABS)/conf/examples
BOOTSTRAP_SEARCH_PATH += $(L4DIR_ABS)/../kernel/fiasco/mybuild
BOOTSTRAP_SEARCH_PATH += $(L4DIR_ABS)/pkg/io/io/config
BOOTSTRAP_MODULES_LIST = $(L4DIR_ABS)/conf/modules.list

This assumes that the build directory used when building the kernel is called mybuild. The Makefile will try and copy the kernel into the final image to be deployed and so needs to know where to find it.

Describing the Ben (Again)

Just as we saw with the kernel, there is a need to describe the Ben and to audit the code to make sure that it stands a chance of working on the Ben. This is done slightly differently in L4Re but the general form of the activity is similar, defining the following:

An architecture version (MIPS32r1) for the JZ4720 (in l4/mk/arch/Kconfig.mips.inc)
A platform configuration for the Ben (in l4/mk/platforms)
Some platform details in the bootstrap package (in l4/pkg/bootstrap/server/src)
Some hardware details related to memory and interrupts (in l4/pkg/io/io/config/plat-qi_lb60)

For the first of these, I introduced a configuration setting (CPU_MIPS_32R1) to allow us to distinguish between the Ben’s SoC (JZ4720) and other processors, just as I did in the kernel code. With this done, the familiar task of hunting down problematic assembly language instructions can begin, and these can be divided into the following categories:

Those that can be rewritten using other instructions that are available to us
Those that must be “trapped” and handled by the kernel

Candidates for the former category include all unprivileged instructions that the JZ4720 doesn’t support, such as ext and ins. Where privileged instructions or ones that “bridge” privileges in some way are used, we can still rewrite them if they appear in the bootstrap code, since this code is also running in privileged mode. Here is an example of such privileged instruction rewriting (from l4/pkg/bootstrap/server/src/ARCH-mips/crt0.S):

#if defined(CONFIG_CPU_MIPS_32R1)
       cache   0x01, 0($a0)     # Index_Writeback_Inv_D
       nop
       cache   0x08, 0($a0)     # Index_Store_Tag_I
#else
       synci   0($a0)
#endif

Candidates for the latter category include all awkward privileged or privilege-escalating instructions outside the bootstrap package. Fortunately, though, we don’t need to worry about them very much at all. Since the kernel will be obliged to trap them, we can just keep them where they are and concede that there is nothing else we can do with them.

However, there is one pitfall: these preserved-but-unsupported instructions will upset the compiler! Consider the use of the now overly-familiar rdhwr instruction. If it is mentioned in an assembly language statement, the compiler will notice that amongst its clean MIPS32r1-compliant output, something is inserting an unrecognised instruction, yielding that error we saw earlier:

Error: opcode not supported on this processor: mips32 (mips32)

But we do know what we’re doing! So how can we persuade the compiler? The solution is to override what the compiler (or assembler) thinks it should be producing by introducing a suitable directive as in the following example (from l4/pkg/l4re-core/l4sys/include/ARCH-mips/cache.h):

  asm volatile (
    ".set push\n"
    ".set mips32r2\n"
    "rdhwr %0, $1\n"
    ".set pop"
    : "=r"(step));

Here, with the .set directives, we switch this little region of code to MIPS32r2 compliance and emit our forbidden instruction into the output. Since the kernel will take care of it in the end, the compiler shouldn’t be made to feel that it has to defend us against it.

In L4Re, there are also issues experienced with the CI20 that will also affect the Ben, such as an awkward and seemingly compiler-related issue affecting the way programs are started. In this regard, I just kept my existing patches for these things applied.

My other platform-related adjustments for the Ben have mostly borrowed from support for the CI20 where it existed. For instance, the bootstrap package’s definition for the Ben (in l4/pkg/bootstrap/server/src/platform/qi_lb60.cc) just takes the CI20 equivalent, eliminates superfluous features, modifies details that are different between the SoCs, and changes identifiers. The general definition for the Ben (in l4/mk/platforms/qi_lb60.conf) merely acknowledges differences in some basic platform details.

The CI20 was not supported with a hardware definition describing memory regions and interrupts used by the io package. Taking other devices as inspiration, I consulted the device documentation and wrote a definition when experimenting with the CI20. For the Ben, the form of this definition (in l4/pkg/io/io/config/plat-qi_lb60/hw_devices.io) remains similar but is obviously adjusted for the SoC differences.

Device Drivers and Output

One topic that I have not really mentioned at all is that pertaining to device drivers. I would not have even started this work if I didn’t feel there was a chance of seeing some signs of success from the Ben. Although the Ben, like the CI20, has the capability of exposing a serial console to the outside world, meaning that it can emit messages via a cable to another computer and receive input from that computer, unlike the CI20, its serial console pins are not particularly convenient to use: they really require wires to be soldered to some tiny pads that are found in the battery compartment of the device.

Now, my soldering skills are not very good, and I also want to be able to put the battery back into the device in future. I did try and experiment by holding wires against the pads, this working once or twice by showing output when booting the Ben into its more typical Linux-based environment. But such experiments proved to be unsustainable and rather uncomfortable, needing some kind of “guitar grip” while juggling cables and holding down buttons. So I quickly knew that I would need to get output from the Ben in other ways.

Having deployed low-level payloads to the Ben before, I knew something about the framebuffer, so I had some confidence about initialising it and getting something on the screen that might tell me what has or hasn’t happened. And I adapted my code from this previous effort, itself being derived from driver code written by the people responsible for the Ben, wrapping it up for L4Re. I tried to keep this code minimally different from its previous incarnation, meaning that I could eliminate certain kinds of mistakes in case the code didn’t manage to do its job. With this in place, I felt that I could now consider trying out my efforts and seeing what, if anything, might happen.

Attempting to Bootstrap

Being in the now-familiar position of believing that enough has been done to make the software run, I now considered an attempt at actually bootstrapping the kernel. It may sound naive, but I almost expected to be able to compile everything – the kernel, L4Re, my drivers – and for them all to work together in harmony and produce at least something on the display. But instead, after “Starting kernel …”, nothing happened.

The Ben NanoNote trying to boot a payload from the memory card

It should be said that in these kinds of exercises, just one source of failure need present itself and the outcome is, of course, failure. And I can confirm that there were many sources of failure at this point. The challenges, then, are to identify all of these and then to eliminate them all. But how can you even know what all of these sources of failure actually are? It seemed disheartening, but then there are two kinds of strategy that can be employed: to investigate areas likely to be causing problems, and to take every opportunity to persuade the device to tell us what is happening. And with this, the debugging would begin.

Posted in Ben NanoNote, English, Fiasco.OC, Free Software, L4, MIPS, NanoNote, operating systems | Comments Off on Porting L4Re and Fiasco.OC to the Ben NanoNote (Part 4)

Archives

Categories