Paul Boddie's Free Software-related blog


Archive for the ‘MIPS’ Category

Porting L4Re and Fiasco.OC to the Ben NanoNote (Part 8)

Tuesday, March 27th, 2018

With some confidence that the effort put in previously had resulted in a working Fiasco.OC kernel, I now found myself where I had expected to be at the start of this adventure: in the position of trying to get an example program working that could take advantage of the framebuffer on the Ben NanoNote. In fact, there is already an example program called “spectrum” that I had managed to get working with the UX variant of Fiasco.OC (this being a kind of “User Mode Fiasco” which runs as a normal process, or maybe a few processes, on top of a normal GNU/Linux system).

However, if this exercise had taught me nothing else, it had taught me not to try out huge chunks of untested functionality and then expect everything to work. It would instead be a far better idea to incrementally add small chunks of functionality whose correctness (or incorrectness) would be easier to see. Thus, I decided to start with a minimal example, adding pieces of tested functionality from my CI20 experiments that obtain access to various memory regions. In the first instance, I wanted to see if I could exercise control over the LCD panel via the general-purpose input/output (GPIO) memory region.

Wiring Up

At this point, I should make a brief detour into the matter of making peripherals available to programs. Unlike the debugging code employed previously, it isn’t possible to just access some interesting part of memory by taking an address and reading from it or writing to it. Not only are user space programs working with virtual memory, whose addresses need not correspond to the actual physical locations employed by the underlying hardware, but as a matter of protecting the system against program misbehaviour, access to memory regions is strictly controlled.

In L4Re, a component called Io is responsible for granting access to predefined memory regions that describe hardware peripherals or devices. My example needed to have the following resources defined:

  • A configuration description (placed in l4/conf/examples/mips-qi_lb60-lcd.cfg)
  • The Io peripheral bus description (placed in l4/conf/examples/mips-qi_lb60-lcd.io)
  • A list of modules needing deployment (added as an entry to l4/conf/modules.list, described separately in l4/conf/examples/mips-qi_lb60-lcd.list)

It is perhaps clearer to start with the Io-related resource, which is short enough to reproduce here:

local hw = Io.system_bus()

local bus = Io.Vi.System_bus
{
  CPM = wrap(hw:match("jz4740-cpm"));
  GPIO = wrap(hw:match("jz4740-gpio"));
  LCD = wrap(hw:match("jz4740-lcd"));
}

Io.add_vbus("devices", bus)

This is Lua source code, with Lua having been adopted as a scripting language in L4Re. Here, the important things are those appearing within the hw:match function call brackets: each of these values refers to a device defined in the Ben’s hardware description (found in l4/pkg/io/io/config/plat-qi_lb60/hw_devices.io), identified using the “hid” property. The “jz4740-gpio” value is used to locate the GPIO device or peripheral, for instance.

Also important is the name used for the bus, “devices”, in the add_vbus function call at the bottom, as we shall see in a moment. Moving on from this, the configuration description for the example itself contains the following slightly-edited details:

local io_buses =
  {
    devices = l:new_channel();
  };

l:start({
  caps = {
      devices = io_buses.devices:svr(),
      icu = L4.Env.icu,
      sigma0 = L4.cast(L4.Proto.Factory, L4.Env.sigma0):create(L4.Proto.Sigma0),
    },
  },
  "rom/io -vvvv rom/hw_devices.io rom/mips-qi_lb60-lcd.io");

l:start({
  caps = {
      icu = L4.Env.icu,
      vbus = io_buses.devices,
    },
  },
  "rom/ex_qi_lb60_lcd");

Here, the “devices” name seemingly needs to be consistent with the name used in the caps mapping for the Io server, set up in the first l:start invocation. The name used for the io_buses mapping can be something else, but then the references to io_buses.devices must obviously be changed to follow suit. I am sure you get the idea.

The consequence of this wiring up is that the example program, set up at the end, accesses peripherals using the vbus capability or channel, communicating with Io and requesting access to devices, whose memory regions are described in the Ben’s hardware description. Io acts as the server on this channel whereas the example program acts as the client.

Is This Thing On?

There were a few good reasons for trying GPIO access first. The most significant one is related to the framebuffer. As you may recall, as the kernel finished starting up, sigma0 started, and we could then no longer be sure that the framebuffer configuration initialised by the bootloader had been preserved. So, I could no longer rely on any shortcuts in getting information onto the screen.

Another reason for trying GPIO operations was that I had written code for the CI20 that accessed input and output pins, and I had some confidence in the GPIO driver code that I had included in the L4Re package hierarchy actually working as intended. Indeed, when working with the CI20, I had implemented some driver code for both the GPIO and CPM (clock/power management) functionality as provided by the CI20’s SoC, the JZ4780. As part of my preparation for this effort, I had adapted this code for the Ben’s SoC, the JZ4720. It was arguably worthwhile to give this code some exercise.

One drawback with using general-purpose input/output, however, is that the Ben doesn’t really have any conveniently-accessed output pins or indicators, whereas the CI20 is a development board with lots of connectors and some LEDs that can be used to give simple feedback on what a program might be doing. Manipulating the LCD panel, in contrast, offers a very awkward way of communicating program status.

Experiments proved somewhat successful, however. Obtaining access to device memory was as straightforward as it had been in my CI20 examples, providing me with a virtual address corresponding to the GPIO memory region. Inserting some driver code employing GPIO operations directly into the principal source file for the example, in order to keep things particularly simple and avoid linking problems, I was able to tell the panel to switch itself off. This involves bit-banging SPI commands using a number of output pins. The consequence of doing this is that the screen fades to white.

I gradually gained confidence and decided to reserve memory for the framebuffer and descriptors. Instead of attempting to use the LCD driver code that I had prepared, I attempted to set up the descriptors manually by writing values to the structure that I knew would be compatible with the state of the peripheral as it had been configured previously. But once again, I found myself making no progress. No image would appear on screen, and I started to wonder whether I needed to do more work to reinitialise the framebuffer or maybe even the panel itself.

But the screen was at least switched on, indicating that the Ben had not completely hung and might still be doing something. One odd thing was that the screen would often turn a certain colour. Indeed, it was turning purple fairly consistently when I realised my mistake: my program was terminating! And this was, of course, as intended. The program would start, access memory, set up the framebuffer, fill it with a pattern, and then finish. I suspected something was happening and when I started to think about it, I had noticed a transient bit pattern before the screen went blank, but I suppose I had put this down to some kind of ghosting or memory effect, at least instinctively.

When a program terminates, it is most likely that the memory it used gets erased so as to prevent other programs from inheriting data that they should not see. And naturally, this will result in the framebuffer being cleared, maybe even the descriptors getting trashed again. So the trivial solution to my problem was to just stop my program from terminating:

while (1);

That did the trick!

The Build-Up

Accessing the LCD peripheral directly from my example is all very well, but I had always intended to provide a display driver for the Ben within L4Re. Attempting to compile the driver as part of the general build process had already identified some errors that had been present in what effectively had been untested code, migrated from my earlier work and modified for the L4Re APIs. But I wasn’t about to try and get the driver to work in one go!

Instead, I copied more functionality into my example program, giving things like the clock management functionality some exercise, using my newly-enabled framebuffer output to test the results from various enquiries about clock frequencies, also setting various frequencies and seeing if this stopped the program from working. This did lead me on another needlessly-distracting diversion, however. When looking at the clock configuration values, something seemed wrong: they were shifted one bit to the left and therefore provided values twice as large as they should have been.

I emitted some other values and saw that they too were shifted to the left. For a while, I wondered whether the panel configuration was wrong and started adjusting all sorts of timing-related values (front and back porches, sync pulses, video-related things), reading up about how the panel was configured using the SPI communications described above. It occurred to me that I should investigate how the display output was shifted or wrapped, and I used the value 0x80000001 to put bit values at the extreme left and right of the screen. Something was causing the output to be wrapped, apparently displaced by 9 pixels or 36 bytes to the left.

You can probably guess where this is heading! The panel configuration had not changed and there was no reason to believe that it was wrong. Similarly, the framebuffer was being set up correctly and was not somehow moved slightly in memory, not that there is really any mechanism in L4Re or the hardware that would cause such a bizarre displacement. I looked again at my debugging code and saw my mistake; in concise terms, it took this form:

for (i = 0; i < fb_size / 4; i++)
{
    fb[i] = value & mask ? onpix : offpix; // write a pixel
    if ((i % 10) == 0)                     // move to the next bit?
    ...
}

For some reason, I had taken some perfectly acceptable code and changed one aspect of it. Here is what it had looked like on the many occasions I had used it in the past:

i = 0;
while (i < fb_size / 4)
{
    fb[i] = value & mask ? onpix : offpix; // write a pixel
    i++;
    if ((i % 10) == 0)                     // move to the next bit?
    ...
}

That’s right: I had accidentally reordered the increment and “next bit” tests, causing the first bit in the value to occupy only one pixel, prematurely skipping to the next bit; at the end of the value, the mask would wrap around and show the remaining nine pixels of the first bit. Again, I was debugging the debugging and wasting my own time!

But despite such distractions, over time, it became interesting to copy the complete source file that does most of the work configuring the hardware from my driver into the example directory and to use its functions directly, these functions normally being accessed by more general driver code. In effect, I was replicating the driver within an environment that was being incrementally enhanced, up until the point where I might assume that the driver itself could work.

With no obvious problems having occurred, I then created another example without all the memory lookup operations, copied-in functions, and other “instrumentation”: one that merely called the driver API, relying on the general mechanisms to find and activate the hardware…

  struct arm_lcd_ops *ops = arm_lcd_probe("nanonote");

…obtaining a framebuffer address and size…

  fb_vaddr = ops->get_fb();
  fb_size = ops->get_video_mem_size();

…and then writing data to the screen just as I had done before. Indeed, this worked just like the previous example, indicating that I had managed to implement the driver’s side of the mechanisms correctly. I had a working LCD driver!

Adopting Abstractions

If the final objective had been to just get Fiasco.OC running and to have a program accessing the framebuffer, then this would probably be the end of this adventure. But, in fact, my aim is to demonstrate that the Ben NanoNote can take advantage of the various advertised capabilities of L4Re and Fiasco.OC. This means supporting facilities like the framebuffer driver, GUI multiplexer, and potentially other things. And such facilities are precisely those demonstrated by the “spectrum” example which I had promised myself (and my readership) that I would get working.

At this point, I shouldn’t need to write any more of my own code: the framebuffer driver, GUI multiplexer, and “spectrum” example are all existing pieces of L4Re code. But what I did need to figure out was whether my recipe for wiring these things up was still correct and functional, and whether my porting exercise had preserved the functionality of these things or mysteriously broken them!

Porting L4Re and Fiasco.OC to the Ben NanoNote (Part 7)

Tuesday, March 27th, 2018

Having tracked down lots of places in the Fiasco.OC kernel that needed changing to run on the Ben NanoNote, including some places I had already visited, changed, and then changed again to fix my earlier mistakes, I was on the “home straight”, needing only to activate the sigma0 thread and the boot thread before the kernel would enter its normal operational state. But the sigma0 thread was proving problematic: control was apparently passed to it, but the kernel would never get control back.

Could this be something peculiar to sigma0, I wondered? I decided to swap the order of the activation, putting the boot thread first and seeing if the kernel “kept going”. Indeed, it did seem to, showing an indication on screen that I had inserted into the code to test whether the kernel regained control after activating the boot thread. So this indeed meant that something about sigma0 was causing the problem. I decided to trace execution through various method calls to see what might be going wrong.

The Big Switch

The process of activating a thread is rather interesting, involving the following methods:

  • Context::activate (in kernel/fiasco/src/kern/context.cpp)
  • Context::switch_to_locked
  • Context::schedule_switch_to_locked
  • Context::switch_exec_locked
  • Context::switch_cpu (in kernel/fiasco/src/kern/mips/context-mips.cpp)

This final, architecture-specific method saves a few registers on the stack belonging to the current thread, doing so because we are now going to switch away from that thread. One of these registers ($31 or $ra) holds the return address which, by convention, is the address the processor jumps to after returning from a subroutine, function, method or something of that sort. Here, the return address, before it is stored, is set to a label just after a jump (or branch) instruction that is coming up. The flow of the code is such that we are to imagine that this jump will occur and then the processor will return to the label immediately following the jump:

  1. Set return address register to refer to the instruction after the impending jump
  2. Store registers on the stack
  3. Take the jump…
  4. Continue after the jump

However, there is a twist. Just before the jump is taken, the stack pointer is switched to one belonging to another thread. On top of this, as the jump is taken, the return address is also replaced by loading a value from this other thread’s stack. The processor then jumps to the Context::switchin_context method. When the processor encounters the end of the generated code for this method, it encounters an instruction that makes it jump to the address in the return address register. Here is a more accurate summary:

  1. Set return address register to refer to the instruction after the impending jump
  2. Store registers on the stack
  3. Switch the stack pointer, referencing another thread’s stack
  4. Switch the return address to the other thread’s value
  5. Take the jump…

Let us consider what might happen if this other thread was, in fact, the one we started with: the original kernel thread. Then the return address register would now hold the address of the label after the jump. Control would now pass back into the Context::switch_cpu method, and back out of that stack of invocations shown above, going from bottom to top.

But this is not what happens at the moment: the return address is something that was conjured up for a new thread and execution will instead proceed in the place indicated by that address. A switch occurs, leaving the old thread dormant while a new thread starts its life somewhere else. The problem I now faced was figuring out where the processor was “returning” to at the end of Context::switchin_context and what was happening next.

Following the Threads

I already had some idea of where the processor would end up, but inserting some debugging code in Context::switchin_context and reading the return address from the $ra register allowed me to see what had been used without chasing down how the value had got there in the first place. Then, there is a useful tool that can help people like me find out the significance of a program address. Indeed, I had already used it a few times by this stage: the sibling of objdump known as addr2line. Here is an example of its use:

mipsel-linux-gnu-addr2line -e mybuild/fiasco.debug 814165f0

This indicated that the processor was “returning” to the Thread::user_invoke method (in kernel/fiasco/src/kern/mips/thread-mips.cpp). In fact, looking at the Thread::Thread constructor function, it becomes apparent that this information is specified using the Context::prepare_switch_to method, setting this particular destination up for the above activation “switch” operation. And here we encounter some more architecture-specific tricks.

One thing that happens of importance in user_invoke is the disabling of interrupts. But if sigma0 is to be activated, they have to be enabled again so that sigma0 is then interrupted when a timer interrupt occurs. And I couldn’t find how this was going to be achieved in the code: nothing was actually setting the interrupt enable (IE) flag in the status register.

The exact mechanism escaped me for a while, and it was only after some intensive debugging interventions that I realised that the status register is also set up way back in the Thread::Thread constructor function. There, using the result of the Cp0_status::status_eret_to_user_ei method (found in kernel/fiasco/src/kern/mips/cp0_status.cpp), the stored status register is set for future deployment. The status_eret_to_user_ei method initially looks like it might do things directly with the status register, but it just provides a useful value for the point at which control is handed over to a “user space” program.

And, indeed, in the routine called ret_from_user_invoke (found in kernel/fiasco/src/kern/mips/exception.S), implemented by a pile of macros, there is a macro called restore_cp0_status that finally sets the status register to this useful value. A few instructions later, the “return from exception” instruction called eret appears and, with that, control passes to user space and hopefully some valid program code.

Finding the Fault

I now wondered whether the eret instruction was “returning” to a valid address. This caused me to take a closer look at the data structure,  Entry_frame (found in kernel/fiasco/src/kern/mips/entry_frame-mips.cpp), used to hold the details of thread execution states. Debugging in user_invoke, by invoking the ip method on the current context (itself obtained from the stored stack information) yielded an address of 0x2000e0. I double-checked this by doing some debugging in the different macros implementing the routine returning control to user space.

Looking around, the Makefile for sigma0 (found as l4/pkg/l4re-core/sigma0/server/src/Makefile) provided this important clue as to its potential use of this address:

DEFAULT_RELOC_mips  := 0x00200000

Using our old friend objdump on the sigma0 binary (found as mybuild/bin/mips_32/l4f/sigma0), it was confirmed that 0x2000e0 is the address of the _start routine for sigma0. So we could at least suppose that the eret instruction was passing control to the start of sigma0.

I had a suspicion that some instruction was causing an exception but not getting handled. But I had checked the generated code already and found no new unsupported instructions. This now meant the need to debug an unhandled exception condition. This can be done with care, as always, in the assembly language file containing the various “bare metal” handlers (exception.S, mentioned above), and such an intervention was enough to discover that the cause of the exception was an invalid memory access: an address exception.

Now, the handling of such exceptions can be traced from the handlers into the kernel. There is what is known as a “vector table” which lists the different exception causes and the corresponding actions to be taken when they occur. One of the entries in the table looks like this:

        ENTRY_ADDR(slowtrap)      # AdEL

This indicates that for an address exception upon load or instruction fetch, the slowtrap routine will be used. And this slowtrap routine itself employs a jump to a function in the kernel called thread_handle_trap (found in kernel/fiasco/src/kern/mips/thread-mips.cpp). Here, unsurprisingly, I found that attempts to handle the exception would fail and that the kernel would enter the Thread::halt method (in kernel/fiasco/src/kern/thread.cpp). This was a natural place for my next debugging intervention!

I now emitted bit patterns for several saved registers associated with the thread/context. One looked particularly interesting: the stored exception program counter which contained the value 0x2092ac. I had a look at the code for sigma0 using objdump and saw the following (with some extra annotations added for explanatory purposes):

  209290:       3c02fff3        lui     v0,0xfff3
  209294:       24422000        addiu   v0,v0,8192    // 0xfff32000
  209298:       afc2011c        sw      v0,284(s8)
  20929c:       7c02e83b        0x7c02e83b            // rdhwr v0, $29 (ULR)
  2092a0:       afc20118        sw      v0,280(s8)
  2092a4:       8fc20118        lw      v0,280(s8)
  2092a8:       8fc30158        lw      v1,344(s8)
  2092ac:       8c508ff4        lw      s0,-28684(v0) // 0xfff2aff4

Of particular interest was the instruction at 0x20929c, which I annotated above as corresponding to our old favourite, rdhwr, along with some other key values. Now, the final instruction above is where the error occurs, and it is clear that the cause is the access to an address that is completely invalid (as annotated). The origin of this address information occurs in the second instruction above. You may have realised by now that the rdhwr instruction was failing to set the v0 (or $v0) register with the value retrieved from the indicated hardware register.

How could this be possible?! Support for this rdhwr variant was already present in Fiasco.OC, so could it be something I had done to break it? Perhaps the first rule of debugging is not to assume your own innocence for any fault, particularly if you have touched the code in the recent past. So I invoked my first strategy once again and started to look in the “reserved instruction” handler for likely causes of problems. Sure enough, I found this:

        ASM_INS         $at, zero, 0, THREAD_BLOCK_SHIFT, k1 # TCB addr in $at

I had introduced a macro for the ins instruction that needed a temporary register to do its work, it being specified by the last argument. But k1 (or $k1) still holds the instruction that is being inspected, and it gets used later in the routine. By accidentally overwriting it, the wrong target register gets selected, and then the hardware register value is transferred to the wrong register, leaving the correct register unaffected. This is exactly what we saw above!

What I meant to write was this:

        ASM_INS         $at, zero, 0, THREAD_BLOCK_SHIFT, k0 # TCB addr in $at

We can afford to overwrite k0 (or $k0) since it gets overwritten shortly after, anyway. And sure enough, with this modification, sigma0 started working, with control being returned to the kernel after its activation.

Returning to User Space

Verifying that the kernel boot process completed was slightly tricky in that further debugging interventions seemed to be unreliable. Although I didn’t investigate this thoroughly, I suspect that with sigma0 initialised, memory may have been cleared, and this clearing activity erases the descriptor structure used by the LCD peripheral. So it then becomes necessary to reinitialise the peripheral by choosing somewhere for the descriptor, writing the appropriate members, and enabling the peripheral again. Since we might now be overwriting other things critical for the system’s proper functioning, we cannot then expect execution to continue after we have written something informative to the framebuffer, so an infinite loop deliberately hangs the kernel and lets us see our debugging output on the screen.

I felt confident that the kernel was now booting and going about its normal business of switching between threads, handling interrupts and exceptions, and so on. For testing purposes, I had chosen the “hello” example as a user space program to accompany the kernel in the deployed payload, but this example is useless on the Ben without connecting up the serial console connection, which I hadn’t done. So now, I needed to start preparing examples that would actually show something on the screen, working towards running the “spectrum” example provided in the L4Re distribution.

Porting L4Re and Fiasco.OC to the Ben NanoNote (Part 6)

Monday, March 26th, 2018

With all the previous effort adjusting Fiasco.OC and the L4Re bootstrap code, I had managed to get the Ben NanoNote to run some code, but I had encountered a problem entering the kernel. When an instruction was encountered that attempted to clear the status register, it would seem that the boot process would not get any further. The status register controls things like which mode the processor is operating in, and certain modes offer safety guarantees about how much else can go wrong, which is useful when they are typically entered when something has already gone wrong.

When starting up, a MIPS-based processor typically has the “error level” (ERL) flag set in the status register, meaning that normal operations are suspended until the processor can be configured. The processor will be in kernel mode and it will bypass the memory mapping mechanisms when reading from and writing to memory addresses. As I found out in my earlier experiments with the Ben, running in kernel mode with ERL set can mask problems that may then emerge when unsetting it, and clearing the status register does precisely that.

With the kernel’s _start routine having unset ERL, accesses to lower memory addresses will now need to go through a memory mapping even though kernel mode still applies, and if a memory mapping hasn’t been set up then exceptions will start to occur. This tripped me up for quite some time in my previous experiments until I figured out that a routine accessing some memory in an apparently safe way was in fact using lower memory addresses. This wasn’t a problem with ERL set – the processor wouldn’t care and just translate them directly to physical memory – but with ERL unset, the processor now wanted to know how those addresses should really be translated. And since I wasn’t handling the resulting exception, the Ben would just hang.

Debugging the Debugging

I had a long list of possible causes, some more exotic than others: improperly flushed caches, exception handlers in the wrong places, a bad memory mapping configuration. I must admit that it is difficult now, even looking at my notes, to fully review my decision-making when confronting this problem. I can apply my patches from this time and reproduce a situation where the Ben doesn’t seem to report any progress from within the kernel, but with hindsight and all the progress I have made since, it hardly seems like much of an obstacle at all.

Indeed, I have already given the game away in what I have already written. Taking over the framebuffer involves accessing some memory set up by the bootloader. In the bootstrap code, all of this memory should actually be mapped, but since ERL is set I now doubt that this mapping was even necessary for the duration of the bootstrap code’s execution. And I do wonder whether this mapping was preserved once the kernel was started. But what appeared to happen was that when my debugging code tried to load data from the framebuffer descriptor, it would cause an exception that would appear to cause a hang.

Since I wanted to make progress, I took the easy way out. Rather than try and figure out what kind of memory mapping would be needed to support my debugging activities, I simply wrapped the code accessing the framebuffer descriptor and the framebuffer itself with instructions that would set ERL and then unset it afterwards. This would hopefully ensure that even if things weren’t working, then at least my things weren’t making it all worse.

Lost in Translation

It now started to become possible to track the progress of the processor through the kernel. From the _start routine, the processor jumps to the kernel_main function (in kernel/fiasco/src/kern/mips/main.cpp) and starts setting things up. As I was quite sure that the kernel wasn’t functioning correctly, it made sense to drop my debugging code into various places and see if execution got that far. This is tedious work – almost a form of inefficient “single-stepping” – but it provides similar feedback about how the code is behaving.

Although I had tried to do a reasonable job translating certain operations to work on the JZ4720 used by the Ben, I was aware that I might have made one or two mistakes, also missing areas where work needed doing. One such area appeared in the Cpu class implementation (in kernel/fiasco/src/kern/mips/cpu-mips.cpp): a rich seam of rather frightening-looking processor-related initialisation activities. The Cpu::first_boot method contained this ominous-looking code:

require(c.r<0>().ar() > 0,  "MIPS r1 CPUs are notsupported\n");

I hadn’t noticed this earlier, and its effect is to terminate the kernel if it detects an architecture revision just like the one provided by the JZ4720. There were a few other tests of the capabilities of the processor that needed to be either disabled or reworked, and I spent some time studying the documentation concerning configuration registers and what they can tell programs about the kind of processor they are running on. Amusingly, our old friend the rdhwr instruction is enabled for user programs in this code, but since the JZ4720 has no notion of that instruction, we actually disable the instruction that would switch rdhwr on in this way.

Another area that proved to be rather tricky was that of switching interrupts on and having the system timer do its work. Early on, when laying the groundwork for the Ben in the kernel, I had made a rough translation of the CI20 code for the Ben, and we saw some of these hardware details a few articles ago. Now it was time to start the timer, enable interrupts, and have interrupts delivered as the timer counter reaches its limit. The part of the kernel concerned is Kernel_thread::bootstrap (in kernel/fiasco/src/kern/kernel_thread.cpp), where things are switched on and then we wait in a delay loop for the interrupts to cause a variable to almost magically change by itself (in kernel/fiasco/src/drivers/delayloop.cpp):

  Cpu_time t = k->clock;
  Timer::update_timer(t + 1000); // 1ms
  while (t == (t1 = k->clock))
    Proc::pause();

But the processor just got stuck in this loop forever! Clearly, I hadn’t done everything right. Some investigation confirmed that various timer registers were being set up correctly, interrupts were enabled, but that they were being signalled using the “interrupt pending 2” (IP2) flag of the processor’s “cause” register which reports exception and interrupt conditions. Meanwhile, I had misunderstood the meaning of the number in the last statement of the following code (from kernel/fiasco/src/kern/mips/bsp/qi_lb60/mips_bsp_irqs-qi_lb60.cpp):

  _ic[0] = new Boot_object<Irq_chip_ingenic>(0xb0001000);
  m->add_chip(_ic[0], 0);

  auto *c = new Boot_object<Cascade_irq>(nullptr, ingenic_cascade);
  Mips_cpu_irqs::chip->alloc(c, 1);

Here is how the CI20 had been set up (in kernel/fiasco/src/kern/mips/bsp/ci20/mips_bsp_irqs-ci20.cpp):

  _ic[0] = new Boot_object<Irq_chip_ingenic>(0xb0001000);
  m->add_chip(_ic[0], 0);
  _ic[1] = new Boot_object<Irq_chip_ingenic>(0xb0001020);
  m->add_chip(_ic[1], 32);

  auto *c = new Boot_object<Cascade_irq>(nullptr, ingenic_cascade);
  Mips_cpu_irqs::chip->alloc(c, 2);

With virtually no knowledge of the details, and superficially transcribing the code, editing here and there, I had thought that the argument to the alloc method referred to the number of “chips” (actually the number of registers dedicated to interrupt signals). But in fact, it indicates an adjustment to what the kernel code calls a “pin”. In the Mips_cpu_irq_chip class (in kernel/fiasco/src/kern/mips/mips_cpu_irqs.cpp), we actually see that this number is used to indicate the IP flag that will need to be tested. So, the correct argument to the alloc method is 2, not 1, just as it is for the CI20:

  Mips_cpu_irqs::chip->alloc(c, 2);

This fix led to another problem and the discovery of another area that I had missed: the “exception base” gets redefined in the kernel (in src/kern/mips/cpu-mips.cpp), and so I had to make sure it was consistent with the other places I had defined it by changing its value in the kernel (in src/kern/mips/mem_layout-mips32.cpp). I mentioned this adjustment previously, but it was at this stage that I realised that the exception base gets set in the kernel after all. (I had previously set it in the bootstrap code to override the typical default of 0x80000000.)

Leaving the Kernel

Although we are not quite finished with the kernel, the next significant obstacle involved starting programs that are not part of the kernel. Being a microkernel, Fiasco.OC needs various other things to offer the kind of environment that a “monolithic” operating system kernel might provide. One of these is the sigma0 program which has responsibilities related to “paging” or allocating memory. Attempting to start such programs exercises how the kernel behaves when it hands over control to these other programs. We should expect that timer interrupts should periodically deliver control back to the kernel for it to do work itself, this usually involving giving other programs a chance to run.

I was seeing the kernel in the “home straight” with regard to completing its boot activities. The Kernel_thread::init_workload method (in kernel/fiasco/src/kern/kernel_thread-std.cpp) was pretty much the last thing to be invoked in the Kernel_thread::run method (in kernel/fiasco/src/kern/kernel_thread.cpp) before normal service begins. But it was upon attempting to “activate” a thread to run sigma0 that a problem arose: the kernel never got control back! This would be my last major debugging exercise before embarking on a final excursion into “user space”.

Porting L4Re and Fiasco.OC to the Ben NanoNote (Part 5)

Sunday, March 25th, 2018

We left off last time with the unenviable task of debugging a non-working system. In such a situation the outlook can seem bleak, but I mentioned a couple of strategies that can sometimes rescue the situation. The first of these is to rule out areas of likely problems, which in my case tends to involve reviewing what I have done and seeing if I have made some stupid mistakes. Naturally, it helps to have a certain amount of experience to inform this process; otherwise, practically everything might be a place where such mistakes may be lurking.

One thing that bothered me was the use of the memory map by Fiasco.OC on the Ben NanoNote. When deploying my previous experimental work to the Ben, I had become aware of limitations around where things might be stored, at least while any bootloader might be active. Care must be taken to load new code away from memory already being used, and it seems that the base of memory must also be avoided, at least at first. I wasn’t convinced that this avoidance was going to happen with the default configuration of the different components.

The Memory Map

Of particular concern were the exception vectors – where the processor jumps to if an exception or interrupt occurs – whose defaults in Fiasco.OC situate them at the base of kernel memory: 0x80000000. If the bootloader were to try and copy the code that handles exceptions to this region, I rather suspected that it would immediately cause problems.

I was also unsure whether the bootloader was able to load the payload from the MMC/MicroSD card into memory without overwriting itself or corrupting the payload as it copied things around in memory. According to the boot script that seems to apply to the Ben, it loads the payload into memory at 0x80600000:

#define CONFIG_BOOTCOMMANDFROMSD        "mmc init; ext2load mmc 0 0x80600000 /boot/uImage; bootm"

Meanwhile, the default memory settings for L4Re has code loaded rather low in the kernel address space at 0x802d0000. Without really knowing what happens, I couldn’t be sure that something might get copied to that location, that the copied data might then run past 0x80600000, and this might overwrite some other thing – the kernel, perhaps – that hadn’t been copied yet. Maybe this was just paranoia, but it was at least something that could be dealt with. So I came up with some alternative arrangements:

0x81401000 exception handlers
0x81400000 kernel load address
0x80d00000 bootstrap start address
0x80600000 payload load address when copied by bootm

I wanted to rule out memory conflicts but try and not conjure up more exotic solutions than strictly necessary. So I made some adjustments to the location of the kernel, keeping the exception vectors in the same place relative to the kernel, but moving the vectors far away from the base of memory. It turns out that there are quite a few places that need changing if you do this:

  • A configuration setting, CONFIG_KERNEL_LOAD_ADDR-32, in the kernel build scripts (in kernel/fiasco/src/Modules.mips)
  • The exception base, EXC_BASE, in the kernel’s linker script (in kernel/fiasco/src/kernel.mips.ld)
  • The exception base, Exception_base, in a description of the kernel memory layout (in kernel/fiasco/src/kern/mips/mem_layout-mips32.cpp)
  • The exception base, if it is mentioned in the bootstrap initialisation (in l4/pkg/bootstrap/server/src/ARCH-mips/crt0.S)

The location of the rest of the payload seems to be configured by just changing DEFAULT_RELOC_mips32 in the bootstrap package’s build scripts (in l4/pkg/bootstrap/server/src/Make.rules).

With this done, I had hoped that I might have “moved the needle” a little and provoked a visible change when attempting to boot the system, but this was always going to be rather optimistic. Having pursued the first strategy, I now decided to pursue the second.

Update: it turns out that a more conventional memory arrangement can be used, and this is described in the summary article.

Getting in at the Start

The second strategy is to use every opportunity to get the device to show what it is doing. But how can we achieve this if we cannot boot the kernel and start some code that sets up the framebuffer? Here, there were two things on my side: the role of the bootstrap code, it being rather similar to code I have written before, and the state of the framebuffer when this code is run.

I had already discovered that provided that the code is loaded into a place that can be started by the bootloader, then the _start routine (in l4/pkg/bootstrap/server/src/ARCH-mips/crt0.S) will be called in kernel mode. And I had already looked at this code for the purposes of identifying instructions that needed rewriting as well as for setting the “exception base”. There were a few other precautions that were worth taking here before we might try and get the code to show its activity.

For instance, the code present that attempts to enable a floating point unit in the processor does not apply to the Ben, so this was disabled. I was also unconvinced that the memory mapping instructions would work on the Ben: the JZ4720 does not seem to support memory pages of 256MB, with the Ben only having 32MB anyway, so I changed this to use 16MB pages instead. This must be set up correctly because any wandering into unmapped memory – visiting bad addresses – cannot be rectified before the kernel is active, and the whole point of the bootstrap code is to get the kernel active!

Now, it wasn’t clear just how far the processor was getting through this code before failing somewhere, but this is where the state of the framebuffer comes in. On the Ben, the bootloader initialises the framebuffer in order to show the state of the device, indicate whether it found a payload to load and boot from, complain about error conditions, and so on. It occurred to me that instead of trying to initialise a framebuffer by programming the LCD peripheral in the JZ4720, set up various structures in memory, decide where these structures should even be situated, I could just read the details of the existing framebuffer from the LCD peripheral’s registers, then find out where the framebuffer resides, and then just write whatever data I liked to the framebuffer in order to communicate with the outside world.

So, I would just need to write a few lines of assembly language, slip it into the bootstrap code, and then see if the framebuffer was changed and the details of interest written to the Ben’s display. Here is a fragment of code in a form that would become rather familiar after a time:

        li $8, 0xb3050040       /* LCD_DA0 */
        lw $9, 0($8)            /* &descriptor */
        lw $10, 4($9)           /* fsadr = descriptor[1] */
        lw $11, 12($9)          /* ldcmd = descriptor[3] */
        li $8, 0x00ffffff
        and $11, $8, $11        /* size = ldcmd & LCD_CMD0_LEN */
        li $9, 0xa5a5a5a5
1:
        sw $9, 0($10)           /* *fsadr = ... */
        addiu $11, $11, -1      /* size -= 1 */
        addiu $10, $10, 4       /* fsadr += 4 */
        bnez $11, 1b            /* until size == 0 */
        nop

To summarise, it loads the address of a “descriptor” from a configuration register provided by the LCD peripheral, this register having been set by the bootloader. It then examines some members of the structure provided by the descriptor, notably the framebuffer address (fsadr) and size (a subset of ldcmd). Just to show some sign of progress, the code loops and fills the screen with a specific value, in this case a shade of grey.

By moving this code around in the bootstrap initialisation routine, I could see whether the processor even managed to get as far as this little debugging fragment. Fortunately for me, it did get run, the screen did turn grey, and I could then start to make deductions about why it only got so far but no further. One enhancement to the above that I had to make after a while was to temporarily change the processor status to “error level” (ERL) when accessing things like the LCD configuration. Not doing so risks causing errors in itself, and there is nothing more frustrating than chasing down errors only to discover that the debugging code caused these errors and introduced them as distractions from the ones that really matter.

Enter the Kernel

The bootstrap code isn’t all assembly language, and at the end of the _start routine, the code attempts to jump to __main. Provided this works, the processor enters code that started out life as C++ source code (in l4/pkg/bootstrap/server/src/ARCH-mips/head.cc) and hopefully proceeds to the startup function (in l4/pkg/bootstrap/server/src/startup.cc) which undertakes a range of activities to prepare for the kernel.

Here, my debugging routine changed form slightly, minimising the assembly language portion and replacing the simple screen-clearing loop with something in C++ that could write bit patterns to the screen. It became interesting to know what address the bootstrap code thought it should be using for the kernel, and by emitting this address’s bit pattern I could check whether the code had understood the structure of the payload. It seemed that the kernel was being entered, but upon executing instructions in the _start routine (in kernel/fiasco/src/kern/mips/crt0.S), it would hang.

The Ben NanoNote showing a bit pattern on the screen

The Ben NanoNote showing a bit pattern on the screen with adjacent bits using slightly different colours to help resolve individual bit values; here, the framebuffer address is shown (0x01fb5000), but other kinds of values can be shown, potentially many at a time

This now led to a long and frustrating process of detective work. With a means of at least getting the code to report its status, I had a chance of figuring out what might be wrong, but I also needed to draw on experience and ideas about likely causes. I started to draw up a long list of candidates, suggesting and eliminating things that could have been problems that weren’t. Any relief that a given thing was not the cause of the problem was tempered by the realisation that something else, possibly something obscure or beyond the limit of my own experiences, might be to blame. It was only some consolation that the instruction provoking the failure involved my nemesis from my earlier experiments: the “error level” (ERL) flag in the processor’s status register.

Porting L4Re and Fiasco.OC to the Ben NanoNote (Part 4)

Saturday, March 24th, 2018

As described previously, having hopefully done enough to modify the kernel – Fiasco.OC – for the Ben NanoNote, it then became necessary to investigate the bootstrap package that is responsible for setting up the hardware and starting the kernel.  This package resides in the L4Re distribution, which is technically a separate thing, even though both L4Re and Fiasco.OC reside in the same published repository structure.

Before continuing into the details, it is worth noting which things need to be retrieved from the L4Re section of the repository in order to avoid frustration later on with package dependencies. I had previously discovered that the following package installation operation would be required (from inside the l4 directory):

svn update pkg/acpica pkg/bootstrap pkg/cxx_thread pkg/drivers pkg/drivers-frst pkg/examples \
           pkg/fb-drv pkg/hello pkg/input pkg/io pkg/l4re-core pkg/libedid pkg/libevent \
           pkg/libgomp pkg/libirq pkg/libvcpu pkg/loader pkg/log pkg/mag pkg/mag-gfx pkg/x86emu

With the listed packages available, it should be possible to build the examples that will eventually interest us. Some of these appear superfluous – x86emu, for instance – but some of the more obviously-essential packages have dependencies on these other packages, and so we cannot rely on our intuition alone!

Also needed when building a payload is some path definitions in the l4/conf/Makeconf.boot file. Here is what I used:

MODULE_SEARCH_PATH += $(L4DIR_ABS)/../kernel/fiasco/mybuild
MODULE_SEARCH_PATH += $(L4DIR_ABS)/conf/examples
MODULE_SEARCH_PATH += $(L4DIR_ABS)/pkg/io/io/config
BOOTSTRAP_SEARCH_PATH  = $(L4DIR_ABS)/conf/examples
BOOTSTRAP_SEARCH_PATH += $(L4DIR_ABS)/../kernel/fiasco/mybuild
BOOTSTRAP_SEARCH_PATH += $(L4DIR_ABS)/pkg/io/io/config
BOOTSTRAP_MODULES_LIST = $(L4DIR_ABS)/conf/modules.list

This assumes that the build directory used when building the kernel is called mybuild. The Makefile will try and copy the kernel into the final image to be deployed and so needs to know where to find it.

Describing the Ben (Again)

Just as we saw with the kernel, there is a need to describe the Ben and to audit the code to make sure that it stands a chance of working on the Ben. This is done slightly differently in L4Re but the general form of the activity is similar, defining the following:

  • An architecture version (MIPS32r1) for the JZ4720 (in l4/mk/arch/Kconfig.mips.inc)
  • A platform configuration for the Ben (in l4/mk/platforms)
  • Some platform details in the bootstrap package (in l4/pkg/bootstrap/server/src)
  • Some hardware details related to memory and interrupts (in l4/pkg/io/io/config/plat-qi_lb60)

For the first of these, I introduced a configuration setting (CPU_MIPS_32R1) to allow us to distinguish between the Ben’s SoC (JZ4720) and other processors, just as I did in the kernel code. With this done, the familiar task of hunting down problematic assembly language instructions can begin, and these can be divided into the following categories:

  • Those that can be rewritten using other instructions that are available to us
  • Those that must be “trapped” and handled by the kernel

Candidates for the former category include all unprivileged instructions that the JZ4720 doesn’t support, such as ext and ins. Where privileged instructions or ones that “bridge” privileges in some way are used, we can still rewrite them if they appear in the bootstrap code, since this code is also running in privileged mode. Here is an example of such privileged instruction rewriting (from l4/pkg/bootstrap/server/src/ARCH-mips/crt0.S):

#if defined(CONFIG_CPU_MIPS_32R1)
       cache   0x01, 0($a0)     # Index_Writeback_Inv_D
       nop
       cache   0x08, 0($a0)     # Index_Store_Tag_I
#else
       synci   0($a0)
#endif

Candidates for the latter category include all awkward privileged or privilege-escalating instructions outside the bootstrap package. Fortunately, though, we don’t need to worry about them very much at all. Since the kernel will be obliged to trap them, we can just keep them where they are and concede that there is nothing else we can do with them.

However, there is one pitfall: these preserved-but-unsupported instructions will upset the compiler! Consider the use of the now overly-familiar rdhwr instruction. If it is mentioned in an assembly language statement, the compiler will notice that amongst its clean MIPS32r1-compliant output, something is inserting an unrecognised instruction, yielding that error we saw earlier:

Error: opcode not supported on this processor: mips32 (mips32)

But we do know what we’re doing! So how can we persuade the compiler? The solution is to override what the compiler (or assembler) thinks it should be producing by introducing a suitable directive as in the following example (from l4/pkg/l4re-core/l4sys/include/ARCH-mips/cache.h):

  asm volatile (
    ".set push\n"
    ".set mips32r2\n"
    "rdhwr %0, $1\n"
    ".set pop"
    : "=r"(step));

Here, with the .set directives, we switch this little region of code to MIPS32r2 compliance and emit our forbidden instruction into the output. Since the kernel will take care of it in the end, the compiler shouldn’t be made to feel that it has to defend us against it.

In L4Re, there are also issues experienced with the CI20 that will also affect the Ben, such as an awkward and seemingly compiler-related issue affecting the way programs are started. In this regard, I just kept my existing patches for these things applied.

My other platform-related adjustments for the Ben have mostly borrowed from support for the CI20 where it existed. For instance, the bootstrap package’s definition for the Ben (in l4/pkg/bootstrap/server/src/platform/qi_lb60.cc) just takes the CI20 equivalent, eliminates superfluous features, modifies details that are different between the SoCs, and changes identifiers. The general definition for the Ben (in l4/mk/platforms/qi_lb60.conf) merely acknowledges differences in some basic platform details.

The CI20 was not supported with a hardware definition describing memory regions and interrupts used by the io package. Taking other devices as inspiration, I consulted the device documentation and wrote a definition when experimenting with the CI20. For the Ben, the form of this definition (in l4/pkg/io/io/config/plat-qi_lb60/hw_devices.io) remains similar but is obviously adjusted for the SoC differences.

Device Drivers and Output

One topic that I have not really mentioned at all is that pertaining to device drivers. I would not have even started this work if I didn’t feel there was a chance of seeing some signs of success from the Ben. Although the Ben, like the CI20, has the capability of exposing a serial console to the outside world, meaning that it can emit messages via a cable to another computer and receive input from that computer, unlike the CI20, its serial console pins are not particularly convenient to use: they really require wires to be soldered to some tiny pads that are found in the battery compartment of the device.

Now, my soldering skills are not very good, and I also want to be able to put the battery back into the device in future. I did try and experiment by holding wires against the pads, this working once or twice by showing output when booting the Ben into its more typical Linux-based environment. But such experiments proved to be unsustainable and rather uncomfortable, needing some kind of “guitar grip” while juggling cables and holding down buttons. So I quickly knew that I would need to get output from the Ben in other ways.

Having deployed low-level payloads to the Ben before, I knew something about the framebuffer, so I had some confidence about initialising it and getting something on the screen that might tell me what has or hasn’t happened. And I adapted my code from this previous effort, itself being derived from driver code written by the people responsible for the Ben, wrapping it up for L4Re. I tried to keep this code minimally different from its previous incarnation, meaning that I could eliminate certain kinds of mistakes in case the code didn’t manage to do its job. With this in place, I felt that I could now consider trying out my efforts and seeing what, if anything, might happen.

Attempting to Bootstrap

Being in the now-familiar position of believing that enough has been done to make the software run, I now considered an attempt at actually bootstrapping the kernel. It may sound naive, but I almost expected to be able to compile everything – the kernel, L4Re, my drivers – and for them all to work together in harmony and produce at least something on the display. But instead, after “Starting kernel …”, nothing happened.

The Ben NanoNote trying to boot a payload from the memory card

The Ben NanoNote trying to boot a payload from the memory card

It should be said that in these kinds of exercises, just one source of failure need present itself and the outcome is, of course, failure. And I can confirm that there were many sources of failure at this point. The challenges, then, are to identify all of these and then to eliminate them all. But how can you even know what all of these sources of failure actually are? It seemed disheartening, but then there are two kinds of strategy that can be employed: to investigate areas likely to be causing problems, and to take every opportunity to persuade the device to tell us what is happening. And with this, the debugging would begin.

Porting L4Re and Fiasco.OC to the Ben NanoNote (Part 3)

Friday, March 23rd, 2018

So far, in this exercise of porting L4Re and Fiasco.OC to the Ben NanoNote, we have toured certain parts of the kernel, made adjustments for the compiler to generate suitable code, and added some descriptions of the device itself. But, as we saw, the Ben needs some additional changes to be made to the software in places where certain instructions are used that it doesn’t support. Attempting to compile the kernel will most likely end with an error if we ignore such matters, because although the C and C++ code will produce acceptable instructions, upon encountering an assembly language statement containing an unacceptable instruction, the compiler will probably report something like this:

Error: opcode not supported on this processor: mips32 (mips32)

So, we find ourselves in a situation where the compiler is doing the right thing for the code it is generating, but it also notices when the programmer has chosen to do what is now the wrong thing. We must therefore track down these instructions and offer a supported alternative. Previously, we introduced a special configuration setting that might be used to indicate to the compiler when to choose these alternative sequences of instructions: CPU_MIPS32_R1. This gets expanded to CONFIG_CPU_MIPS32_R1 by the build system and it is this identifier that gets used in the program code.

Those Unsupported Instructions

I have put off giving out the details so far, but now is as good a time as any to provide some about the instructions that the JZ4720 (the SoC in the Ben NanoNote) doesn’t seem to support. Some of them are just conveniences, offering a single instruction where many would otherwise be needed. Others offer functionality that is not always trivially replicated.

Instructions Description Privileges
di, ei Disable, enable interrupts Privileged
ext Extract bits from register Unprivileged
ins Insert bits into register Unprivileged
rdhwr Read hardware register Unprivileged, accesses privileged information
synci Synchronise instruction cache Unprivileged, performs privileged operations

We have already mentioned rdhwr, and this is precisely the kind of instruction that can pose problems, these mostly being concerned with it offering access to some (supposedly) privileged information from an unprivileged processor mode. However, since the kernel runs in a privileged mode, typically referred to as “kernel mode”, we won’t see rdhwr when doing our modifications to the kernel. And since the need to provide rdhwr also applied to the JZ4780 (the SoC in the MIPS Creator CI20), it turned out that I didn’t need to do much in addition to what others had already done in supporting it.

Another instruction that requires a bridging of privilege levels is synci. If we discover synci being used in the kernel, it is possible to rewrite it in terms of the equivalent cache instructions. However, outside the kernel in unprivileged mode, those cache instructions cannot be used and we would not wish to support them either, because “user mode” programs are not meant to be playing around with such aspects of the hardware. The solution for such situations is to “trap” synci when it gets used in unprivileged code and to handle it using the same mechanism as that employed to handle rdhwr: to treat it as a “reserved instruction”.

Thus, some extra code is added in the kernel to support this “trap” mechanism, but where we can just replace the instructions, we do so as in this example (from kernel/fiasco/src/kern/mips/alternatives.cpp):

#ifdef CONFIG_CPU_MIPS32_R1
    asm volatile ("cache 0x01, %0\n"
                  "nop\n"
                  "cache 0x08, %0"
                  : : "R"(orig_insn[i]));
#else
    asm volatile ("synci %0" : : "R"(orig_insn[i]));
#endif

We could choose not to bother doing this even in the kernel, instead just trapping all usage of synci. But this would have a performance impact, and L4 is ostensibly very much about performance, and so the opportunity is taken to maximise it by going round and fixing up the code in all these places instead. (Note that I’ve used the nop instruction above, but maybe I should use ehb. It’s probably something to take another look at, perhaps more generally with regard to which instruction I use in these situations.)

The other unsupported instructions don’t create as many problems. The di (disable interrupts) and ei (enable interrupts) instructions are really shorthand for modifications to the processor’s status register, albeit performing those modifications “atomically”. In principle, in cases where I have written out the equivalent sequence of instructions but not done anything to “guard” these instructions from untimely interruptions or exceptions, something bad could happen that wouldn’t have happened with the di or ei instructions themselves.

Maybe I will revisit this, too, and see what the risks might actually be, but for the purposes of getting the kernel working – which is where these instructions appear – the minimal solution seemed reasonably adequate. Here is an extract from a statement employing the ei instruction (from kernel/fiasco/src/drivers/mips/processor-mips.cpp):

#ifdef CONFIG_CPU_MIPS32_R1
    ASM_MFC0 " $t0, $12\n"
    "ehb\n"
    "or $t0, $t0, %[ie]\n"
    ASM_MTC0 " $t0, $12\n"
#else
    "ei\n"
#endif

Meanwhile, the ext (extract) and ins (insert) instructions have similar properties in that they too access parts of registers, replacing sequences of instructions that do the work piece by piece. One challenge that they pose is that they appear in potentially many different places, some with minimal register use, and the equivalent instruction sequence may end up needing an extra register to get the same work done. Fortunately, though, those equivalent instructions are all perfectly usable at whichever privilege level happens to be involved. Here is an extract from a statement employing the ins instruction (from kernel/fiasco/src/kern/mips/thread-mips.cpp):

#ifdef CONFIG_CPU_MIPS32_R1
       "  andi  $t0, %[status], 0xff  \n"
       "  li    $t1, 0xffffff00       \n"
       "  and   $t2, $t2, $t1         \n"
       "  or    $t2, $t2, $t0         \n"
#else
       "  ins   $t2, %[status], 0, 8  \n"
#endif

Note how temporary registers are employed to isolate the bits from the status register and to erase bits in the $t2 register before these two things are combined and stored in $t2.

Bridging the Privilege Gap

The rdhwr instruction has been mentioned quite a few times already. In the kernel, it is handled in the kernel/fiasco/src/kern/mips/exception.S file, specifically in the routine called “reserved_insn”. When the processor encounters an instruction it doesn’t understand, the kernel should have been configured to send it here. I will admit that I knew little to nothing about what to do to handle such situations, but the people who did the MIPS port of the kernel had laid the foundations by supporting one rdhwr variant, and I adapted their work to handle another.

In essence, what happens is that the processor “shows up” in the reserved_insn routine with the location of the bad instruction in its “exception program counter” register. By loading the value stored at that location, we obtain the instruction – or its value, at least – and can then inspect this value to see if we recognise it and can do anything with it. Here is the general representation of rdhwr with an example of its use:

SPECIAL3 _____ t s _____ RDHWR
011111 00000 01000 00001 00000 111011

The first and last portions of the above representation identify the instruction in general, with the bits for the second and next-to-last portions being set to zero presumably because they are either not needed to encode an instruction in this category, or they encode two parameters that are not needed by this particular instruction. To be honest, I haven’t checked which explanation applies, but I suspect it is the latter.

This leaves the remaining portions to indicate specific registers: the target (t) and source (s). With t=8, the result is written to register $8, which is normally known as $t0 (or just t0) in MIPS assembly language. Meanwhile, with s=1, the source register has been given as $1, which is the SYNCI_Step hardware register. So, the above is equivalent to the following:

rdhwr $t0, $1

To reproduce this same realisation in code, we must isolate the parts of the value that identify the instruction. For rdhwr accessing the SYNCI_Step hardware register, this means using a mask that preserves the SPECIAL3, RDHWR, s and blank regions, ignoring the target register value t because it will change according to specific circumstances. Applying this mask to the instruction value and comparing it to an expected value is done rather like this:

li $k0, 0x7c00083b # $k0 = SPECIAL3, blank, s=1, blank, RDHWR
li $at, 0xffe0ffff # $at = define a mask to mask out t
and $at, $at, $k1  # $at = the mask applied to the instruction value

Now, if $at is equal to $k0, the instruction value is identified as encoding rdhwr accessing SYNCI_Step, with the target register being masked out so as not to confuse things. Later on, the target register is itself selected and some trickery is employed to get the appropriate data into that register before returning from this routine.

For the above case and for the synci instruction, the work that needs doing once such an instruction has been identified is equivalent to what would have happened had it been possible to just insert into the code the alternative sequence of instructions that achieves the same thing. So, for synci, the equivalent cache instructions are executed before control is returned to the instruction after synci in the program where it appeared. Thus, upon encountering an unsupported instruction, control is handed over from an unprivileged program to the kernel, the instruction is identified and handled using the necessary privileged instructions, and then control is handed back to the unprivileged program again.

In fact, most of my efforts in exception.S were not really directed towards these two awkward instructions. Instead I had to deal with the use of quite a number of ext and ins instructions. Although it seems tempting to just trap those as well and to provide handlers for them, that would add considerable overhead, and so I added some macros to provide the same functionality when building the kernel for the Ben.

Prepare for Launch

Looking at my patches for the kernel now, I can see that there isn’t much else to cover. One or two details are rather important in the context of the Ben and how it manages to boot, however, and the process of figuring out those details was, like much else in this exercise, time-consuming, slightly frustrating, and left surprisingly little trace once the solution was found. At this stage, not everything was perfectly transcribed or expressed, leaving a degree of debugging activity that would also need to be performed in the future.

So, with a kernel that might be runnable, I considered what it would take to actually launch that kernel. This led me into the L4 Runtime Environment (L4Re) code and specifically to the bootstrap package. It turns out that the kernel distribution delegates such concerns to other software, and the bootstrap package sits uneasily alongside other packages, it being perhaps the only one amongst them that can exercise as much privilege as the kernel because its code actually runs at boot time before the kernel is started up.

Porting L4Re and Fiasco.OC to the Ben NanoNote (Part 2)

Thursday, March 22nd, 2018

Having undertaken some initial investigations into running L4Re and Fiasco.OC on the MIPS Creator CI20, I envisaged attempting to get this software running on the Ben NanoNote, too. For a while, I put this off, feeling confident that when I finally got round to it, it would probably be a matter of just choosing the right compiler options and then merely fixing all the mistakes I had made in my own driver code. Little did I know that even the most trivial activities would prove more complicated than anticipated.

As you may recall, I had noted that a potentially viable approach to porting the software would merely involve setting the appropriate compiler switches for “soft-float” code, thus avoiding the generation of floating point instructions that the JZ4720 – the SoC on the Ben NanoNote – would not be able to execute. A quick check of the GCC documentation indicated the availability of the -msoft-float switch. And since I have a working cross-compiler for MIPS as provided by Debian, there didn’t seem to be much more to it than that. Until I discovered that the compiler doesn’t seem to support soft-float output at all.

I had hoped to avoid building my own cross-compiler, and apart from enthusiastic (and occasionally successful) attempts to build the Debian ones before they became more generally available, the last time I really had anything to do with this was when I first developed software for the Ben. As part of the general support for the device an OpenWrt distribution had been made available. Part of that was the recipe for building the cross-compiler and other tools, needed for building a kernel and all the software one would deploy on a device. I am sure that this would still be a good place to look for a solution, but I had heard things about Buildroot and so set off to investigate that instead.

So although Buildroot, like OpenWrt, is promoted as a way of building an entire system, it too offers help in building just the toolchain if that is all you need. Getting it to build the appropriately-configured cross-compiler is a matter of the familiar “make menuconfig” seen from the Linux kernel source distribution, choosing things in a menu – for us, asking for a soft-float toolchain, also enabling C++ support – and then running “make toolchain”. As a result, I got a range of tools in the output/host/bin directory prefixed with mipsel-buildroot-linux-uclibc.

Some Assembly Required

Changing the compiler settings for Fiasco.OC (in kernel/fiasco/src/Makeconf.mips) and L4Re (in l4/mk/arch/Makeconf.mips), and making sure not to enable any floating point support in Fiasco.OC, and recompiling the code to produce soft-float output was straightforward enough. However, despite the portability of this software, it isn’t completely C and C++ code: lurking in various places (typically in mips or ARCH-mips directories) are assembly language source files with the .S prefix, and in some C and C++ files one can also find “asm” statements which embed assembly language instructions within higher-level code.

With the assumption that by specifying the right compiler switches, no floating point instructions will be produced from C or C++ source code, all that remains is to determine whether any of these other code sections mention such forbidden instructions. It was asserted that Fiasco.OC doesn’t use any floating point instructions at all. Meanwhile, I couldn’t find any floating point instructions in the generated code: “mipsel-linux-gnu-objdump -D some-output-file” (or, indeed, “mipsel-buildroot-linux-uclibc-objdump -D some-output-file”) now started to become a familiar acquaintance if not exactly a friend!

In fact, the assembly language files and statements would provide other challenges in the form of instructions unsupported by the JZ4720. Again, I had the choice of either trying to support MIPS32r2 instructions, like rdhwr, by providing “reserved instruction” handlers, or to rewrite these instructions in forms suitable for the JZ4720. At least within Fiasco.OC – the “kernel” – where the environment for executing instructions is generally privileged, it is possible to reformulate MIPS32r2 instructions in terms of others. I will return to the details of these instructions later on.

Where to Find Things

Having spent all this time looking around in the L4Re and Fiasco.OC code, it is perhaps worth briefly mentioning where certain things can be found. The heart of the action in the kernel is found in these places:

Directory Significance
kernel/fiasco/src The top-level directory of the kernel sources, having some MIPS-specific files
kernel/fiasco/src/drivers/mips Various hardware abstractions related to MIPS
kernel/fiasco/src/jdb/mips MIPS-specific support code for the kernel debugger (which I don’t use)
kernel/fiasco/src/kern/mips MIPS-specific support code for the kernel itself
kernel/fiasco/src/templates Device configuration details

As noted above, I don’t use the kernel debugger, but I still made some edits that might make it possible to use it later on. For the most part, the bulk of my time and effort was spent in the src/kern/mips hierarchy, occasionally discovering things in src/drivers/mips that also needed some attention.

Describing the Ben

So it started to make sense to consider how the Ben might be described in terms of a kernel configuration, and whether we might want to indicate a less sophisticated revision of the architecture so that we could test for it in the code and offer alternative sequences of instructions where possible. There are a few different places where hardware platforms are described within Fiasco.OC, and I ended up defining the following:

  • An architecture version (MIPS32r1) for the JZ4720 (in kernel/fiasco/src/kern/mips/Kconfig)
  • A definition for the Ben itself (in kernel/fiasco/src/templates/globalconfig.out.mips-qi_lb60)
  • A board entry for the Ben (in kernel/fiasco/src/kern/mips/bsp/qi_lb60/Kconfig) as part of a board-specific collection of functionality

This is not by any means enough, even disregarding any code required to do things specific to the Ben. But with the additional configuration setting for the JZ4720, which I called CPU_MIPS32_R1, it becomes possible to go around inside the kernel code and start to mark up places which need different instruction sequences for the Ben, using CONFIG_CPU_MIPS32_R1 as the symbol corresponding to this setting in the code itself. There are places where this new setting will also change the compiler’s behaviour: in kernel/fiasco/src/Makeconf.mips, the -march=mips32 compiler switch is activated by the setting, preventing the compiler from generating instructions we do not want.

For the board-specific functionality (found in kernel/fiasco/src/kern/mips/bsp/qi_lb60), I took the CI20’s collection of files as a starting point. Fortunately for me, the Ben’s JZ4720 and the CI20’s JZ4780 are so similar that I could, with reference to Linux kernel code and other sources of documentation, make a first effort at support for the Ben by transcribing and editing these files. Some things I didn’t understand straight away, and I only later discovered what some parameters to certain methods really mean.

But generally, this work was simply a matter of seeing what peripheral registers were mentioned in the CI20 version, figuring out whether those registers were present in the earlier SoC, and determining whether their locations were the same or whether they had been moved around from one product to the next. Let us take a brief look at the registers associated with the timer/counter unit (TCU) in the JZ4720 and JZ4780 (with apologies for WordPress converting “x” into a multiplication symbol in some places):

JZ4720 (Ben NanoNote) JZ4780 (MIPS Creator CI20)
Registers Offsets Size Registers Offsets Size
TER, TESR, TECR (timer enable, set, clear) 0x10, 0x14, 0x18 8-bit TER, TESR, TECR (timer enable, set, clear) 0x10, 0x14, 0x18 16-bit
TFR, TFSR, TFCR (timer flag, set, clear) 0x20, 0x24, 0x28 32-bit TFR, TFSR, TFCR (timer flags, set, clear) 0x20, 0x24, 0x28 32-bit
TMR, TMSR, TMCR (timer mask, set, clear) 0x30, 0x34, 0x38 32-bit TMR, TMSR, TMCR (timer mask, set, clear) 0x30, 0x34, 0x38 32-bit
TDFR0, TDHR0, TCNT0, TCSR0 (timer data full match, half match, counter, control) 0x40, 0x44, 0x48, 0x4c 16-bit TDFR0, TDHR0, TCNT0, TCSR0 (timer data full match, half match, counter, control) 0x40, 0x44, 0x48, 0x4c 16-bit
TSR, TSSR, TSCR (timer stop, set, clear) 0x1c, 0x2c, 0x3c 8-bit TSR, TSSR, TSCR (timer stop, set, clear) 0x1c, 0x2c, 0x3c 32-bit

We can see how the later product (JZ4780) has evolved from the earlier one (JZ4720), with some registers supporting more bits, exposing control over an increased number of timers. A lot of the details are the same, which was fortunate for me! Even the oddly-located timer stop registers, separated by intervals of 16 bytes (0x10) instead of 4 bytes, have been preserved between the products.

One interesting difference is the absence of the “operating system timer” in the JZ4720. This is a 64-bit counter provided by the JZ4780, but for the Ben it seems that we have to make do with the standard 16-bit timers provided by both products. Otherwise, for this part of the hardware, it is a matter of making sure the fundamental operations look reasonable – whether the registers are initialised sensibly – and then seeing how this functionality is used elsewhere. A file called tcu_jz4740.cpp in the board-specific directory for the Ben preserves this information. (Note that the JZ4720 is largely the same as the JZ4740 which can be considered as a broader product category that includes the JZ4720 as a variant with slightly reduced functionality.)

In the same directory, there is a file covering timer functionality from the perspective of the kernel: timer-jz4740.cpp. Here, the above registers are manipulated to realise certain operations – enabling and disabling timers, reading them, indicating which interrupt they may cause – and the essence of this work again involves checking documentation sources, register layouts, and making sure that the intent of the code is preserved. It may be mundane work, but any little detail that is not correct may prevent the kernel from working.

Covering the Ground

At this point, the essential hardware has mostly been described, building on all the work done by others to port the kernel to the MIPS architecture and to the CI20, merely adding a description of the differences presented by the Ben. When I made these changes, I was slowly immersing myself in the code, writing things that I felt I mostly understood from having previously seen code accessing certain hardware features of the Ben. But I knew that there will still some way to go before being able to expect anything to actually work.

From this point, I would now need to confront the unimplemented instructions, deal with the memory layout, and figure out how the kernel actually gets launched in the first place. This would also mean that I could no longer keep just adding and changing code and feeling like progress was being made: I would actually have to try and get the Ben to run something. And as those of us who write software know very well, there can be nothing more punishing than being confronted with the behaviour of a program that is incorrect, with the computer caring not about intentions or aspirations but only about executing the logic whether it is correct or not.

Porting L4Re and Fiasco.OC to the Ben NanoNote (Part 1)

Wednesday, March 21st, 2018

For quite some time, I have been interested in alternative operating system technologies, particularly kernels beyond the likes of Linux. Things like the Hurd and technologies associated with it, such as Mach, seem like worthy initiatives, and contrary to largely ignorant and conveniently propagated myths, they are available and usable today for anyone bothered to take a look. Indeed, Mach has had quite an active life despite being denigrated for being an older-generation microkernel with questionable performance credentials.

But one technological branch that has intrigued me for a while has been the L4 family of microkernels. Starting out with the motivation to improve microkernel performance, particularly with regard to interprocess communication, different “flavours” of L4 have seen widespread use and, like Mach, have been ported to different hardware architectures. One of these L4 implementations, Fiasco.OC, appeared particularly interesting in this latter regard, in addition to various other features it offers over earlier L4 implementations.

Meanwhile, I have had some success with software and hardware experiments with the Ben NanoNote. As you may know or remember, the Ben NanoNote is a “palmtop” computer based on an existing design (apparently for a pocket dictionary product) that was intended to offer a portable computing experience supported entirely by Free Software, not needing any proprietary drivers or firmware whatsoever. Had the Free Software Foundation been certifying devices at the time of its introduction, I imagine that it would have received the “Respects Your Freedom” certification. So, it seems to me that it is a worthy candidate for a Free Software porting exercise.

The Starting Point

Now, it so happened that Fiasco.OC received some attention with regards to being able to run on the MIPS architecture. The Ben NanoNote employs a system-on-a-chip (SoC) whose own architecture closely (and deliberately) resembles the MIPS architecture, but all information about the JZ4720 SoC specifies “XBurst” as the architecture name. In fact, one can regard XBurst as a clone of a particular version of the MIPS architecture with some additional instructions.

Indeed, the vendor, Ingenic, subsequently licensed the MIPS architecture, produced some SoCs that are officially MIPS-labelled, culminating in the production of the MIPS Creator CI20 product: a development board commissioned by the then-owners of the MIPS portfolio, Imagination Technologies, utilising the Ingenic JZ4780 SoC to presumably showcase the suitability of the MIPS architecture for various applications. It was apparently for this product that an effort was made to port Fiasco.OC to MIPS, and it was this effort that managed to attract my attention.

The MIPS Creator CI20 single-board computer

The MIPS Creator CI20 single-board computer

It was just as well others had done this hard work. Although I have been gradually immersing myself in the details of how MIPS-based CPUs function, having written some code that can boot the Ben, run a few things concurrently, map memory for different processes, read the keyboard and show things on the screen, I doubt that my knowledge is anywhere near comprehensive enough to tackle porting an existing operating system kernel. But knowing that not only had others done this work, but they had also targeted a rather similar system, gave me some confidence that I might be able to perform the relatively minor porting exercise to target the Ben.

But first I felt that I had to gain experience with Fiasco.OC on MIPS in a more convenient fashion. Although I had muddled through the development of code on the Ben, reusing existing framebuffer driver code and hacking away until I managed to get some output on the display, I felt that if I were to continue my experiments, a more efficient way of debugging my code would be required. With this in mind, I purchased a MIPS Creator CI20 and, after doing things with the pre-installed Debian image plus installing a newer version of Debian, I set out to try Fiasco.OC on the hardware.

The Missing Pieces

According to the Fiasco.OC features page, the “Ci20” is supported. Unfortunately, this assertion of support is not entirely true, as we will come to see. Previously, I mentioned that the JZ4720 in the Ben NanoNote largely implements the instructions of a certain version of the MIPS architecture. Although the JZ4780 in the CI20 introduces some new features over the JZ4720, such as a floating point arithmetic unit, it still lacks various instructions that are present in commonly-used MIPS versions that might be taken as the “baseline” for software support: MIPS32 Release 2 (MIPS32r2), for instance.

Upon trying to get Fiasco.OC to start up, I soon encountered one of these instructions, or at least a particular variant of it: rdhwr (read hardware register) accessing SYNCI_Step (the instruction cache line size). This sounds quite fearsome, but I had been somewhat exposed to cache management operations when conjuring up my own code to run on the Ben. In fact, all this instruction variant does is to ask how big the step size has to be in a loop that invalidates the instruction cache, instead of stuffing such a value into the program when compiling it and thus making an executable that will then be specific to a particular processor.

Fortunately, those hardworking people who had already ported the code to MIPS had previously encountered another rdhwr variant and had written code to “trap” it in the “reserved instruction” handler. That provided some essential familiarisation with the kernel code, saving me the effort of having to identify the right place to modify, as well as providing a template for how such handlers should operate. I feel fairly competent writing MIPS assembly language, although I would manage to make an easy mistake in this code that would impede progress much later on.

There were one or two other things that also needed fixing up, mentioned briefly in my review of the year article, generally involving position-independent code that was not called correctly and may have been related to me using a generic version of GCC instead of some vendor-modified version. But as I described in that article, I finally managed to boot Fiasco.OC and run a program on top of it, writing the output via the serial connection to my personal computer.

The End of the Very Beginning

I realised that compiling such code for the Ben would either require the complete avoidance of floating point instructions, due to the lack of that floating point unit in the JZ4720, or that I would need to provide implementations of those instructions in software. Fortunately, GCC provides a mode to compile “soft-float” versions of C and C++ programs, and so this looked like the next step. And so, apart from polishing support for features of the Ben like the framebuffer, input/output pins, the clock circuitry, it didn’t really seem that there would be so much to do.

As it so often turns out with technology, optimism can lead to unrealistic estimates of how much time and effort remains in a project. I now know that a description of all this effort would be just too much for a single article. So, I will wrap this article up with a promise that the next one will descend into the details of compilers, assembly language, the SoC, and before too long, we will get to see the inconvenience of debugging low-level software with nothing more than a framebuffer.

2017 in Review

Thursday, December 7th, 2017

On Planet Debian there seems to be quite a few regularly-posted articles summarising the work done by various people in Free Software over the month that has most recently passed. I thought it might be useful, personally at least, to review the different things I have been doing over the past year. The difference between this article and many of those others is that the work I describe is not commissioned or generally requested by others, instead relying mainly on my own motivation for it to happen. The rate of progress can vary somewhat as a result.

Learning KiCad

Over the years, I have been playing around with Arduino boards, sensors, displays and things of a similar nature. Although I try to avoid buying more things to play with, sometimes I manage to acquire interesting items regardless, and these aren’t always ready to use with the hardware I have. Last December, I decided to buy a selection of electronics-related items for interfacing and experimentation. Some of these items have yet to be deployed, but others were bought with the firm intention of putting different “spare” pieces of hardware to use, or at least to make them usable in future.

One thing that sits in this category of spare, potentially-usable hardware is a display circuit board that was once part of a desk telephone, featuring a two-line, bitmapped character display, driven by the Hitachi HD44780 LCD controller. It turns out that this hardware is so common and mundane that the Arduino libraries already support it, but the problem for me was being able to interface it to the Arduino. The display board uses a cable with a connector that needs a special kind of socket, and so some research is needed to discover the kind of socket needed and how this might be mounted on something else to break the connections out for use with the Arduino.

Fortunately, someone else had done all this research quite some time ago. They had even designed a breakout board to hold such a socket, making it available via the OSH Park board fabricating service. So, to make good on my plan, I ordered the mandatory minimum of three boards, also ordering some connectors from Mouser. When all of these different things arrived, I soldered the socket to the board along with some headers, wired up a circuit, wrote a program to use the LiquidCrystal Arduino library, and to my surprise it more or less worked straight away.

Breakout board for the Molex 52030 connector

Breakout board for the Molex 52030 connector

Hitachi HD44780 LCD display boards driven by an Arduino

Hitachi HD44780 LCD display boards driven by an Arduino

This satisfying experience led me to consider other boards that I might design and get made. Previously, I had only made a board for the Arduino using Fritzing and the Fritzing Fab service, and I had held off looking at other board design solutions, but this experience now encouraged me to look again. After some evaluation of the gEDA tools, I decided that I might as well give KiCad a try, given that it seems to be popular in certain “open source hardware” circles. And after a fair amount of effort familiarising myself with it, with a degree of frustration finding out how to do certain things (and also finding up-to-date documentation), I managed to design my own rather simple board: a breakout board for the Acorn Electron cartridge connector.

Acorn Electron cartridge breakout board (in 3D-printed case section)

Acorn Electron cartridge breakout board (in 3D-printed case section)

In the back of my mind, I have vague plans to do other boards in future, but doing this kind of work can soak up a lot of time and be rather frustrating: you almost have to get into some modified mental state to work efficiently in KiCad. And it isn’t as if I don’t have other things to do. But at least I now know something about what this kind of work involves.

Retro and Embedded Hardware

With the above breakout board in hand, a series of experiments were conducted to see if I could interface various circuits to the Acorn Electron microcomputer. These mostly involved 7400-series logic chips (ICs, integrated circuits) and featured various logic gates and counters. Previously, I had re-purposed an existing ROM cartridge design to break out signals from the computer and make it access a single flash memory chip instead of two ROM chips.

With a dedicated prototyping solution, I was able to explore the implementation of that existing board, determine various aspects of the signal timings that remained rather unclear (despite being successfully handled by the existing board’s logic), and make it possible to consider a dedicated board for a flash memory cartridge. In fact, my brother, David, also wanting to get into board design, later adapted the prototyping cartridge to make such a board.

But this experimentation also encouraged me to tackle some other items in the electronics shipment: the PIC32 microcontrollers that I had acquired because they were MIPS-based chips, with somewhat more built-in RAM than the Atmel AVR-based chips used by the average Arduino, that could also be used on a breadboard. I hoped that my familiarity with the SoC (system-on-a-chip) in the Ben NanoNote – the Ingenic JZ4720 – might confer some benefits when writing low-level code for the PIC32.

PIC32 on breadboard with Arduino programming circuit

PIC32 on breadboard with Arduino programming circuit (and some LEDs for diagnostic purposes)

I do not need to reproduce an account of my activities here, given that I wrote about the effort involved in getting started with the PIC32 earlier in the year, and subsequently described an unusual application of such a microcontroller that seemed to complement my retrocomputing interests. I have since tried to make that particular piece of work more robust, but deducing the actual behaviour of the hardware has been frustrating, the documentation can be vague when it needs to be accurate, and much of the community discussion is focused on proprietary products and specific software tools rather than techniques. Maybe this will finally push me towards investigating programmable logic solutions in the future.

Compiling a Python-like Language

As things actually happened, the above hardware activities were actually distractions from something I have been working on for a long time. But at this point in the article, this can be a diversion from all the things that seem to involve hardware or low-level software development. Many years ago, I started writing software in Python. Over the years since, alternative implementations of the Python language (the main implementation being CPython) have emerged and seen some use, some continuing to be developed to this day. But around fifteen years ago, it became a bit more common for people to consider whether Python could be compiled to something that runs more efficiently (and more quickly).

I followed some of these projects enthusiastically for a while. Starkiller promised compilation to C++ but never delivered any code for public consumption, although the associated academic thesis might have prompted the development of Shed Skin which does compile a particular style of Python program to C++ and is available as Free Software. Meanwhile, PyPy elevated to prominence the notion of writing a language and runtime library implementation in the language itself, previously seen with language technologies like Slang, used to implement Squeak/Smalltalk.

Although other projects have also emerged and evolved to attempt the compilation of Python to lower-level languages (Pyrex, Cython, Nuitka, and so on), my interests have largely focused on the analysis of programs so that we may learn about their structure and behaviour before we attempt to run them, this alongside any benefits that might be had in compiling them to something potentially faster to execute. But my interests have also broadened to consider the evolution of the Python language since the point fifteen years ago when I first started to think about the analysis and compilation of Python. The near-mythical Python 3000 became a real thing in the form of the Python 3 development branch, introducing incompatibilities with Python 2 and fragmenting the community writing software in Python.

With the risk of perfectly usable software becoming neglected, its use actively (and destructively) discouraged, it becomes relevant to consider how one might take control of one’s software tools for long-term stability, where tools might be good for decades of use instead of constantly changing their behaviour and obliging their users to constantly change their software. I expressed some of my thoughts about this earlier in the year having finally reached a point where I might be able to reflect on the matter.

So, the result of a great deal of work, informed by experiences and conversations over the years related to previous projects of my own and those of others, is a language and toolchain called Lichen. This language resembles Python in many ways but does not try to be a Python implementation. The toolchain compiles programs to C which can then be compiled and executed like “normal” binaries. Programs can be trivially cross-compiled by any available C cross-compilers, too, which is something that always seems to be a struggle elsewhere in the software world. Unlike other Python compilers or implementations, it does not use CPython’s libraries, nor does it generate in “longhand” the work done by the CPython virtual machine.

One might wonder why anyone should bother developing such a toolchain given its incompatibility with Python and a potential lack of any other compelling reason for people to switch. Given that I had to accept some necessary reductions in the original scope of the project and to limit my level of ambition just to feel remotely capable of making something work, one does need to ask whether the result is too compromised to be attractive to others. At one point, programs manipulating integers were slower when compiled than when they were run by CPython, and this was incredibly disheartening to see, but upon further investigation I noticed that CPython effectively special-cases integer operations. The design of my implementation permitted me to represent integers as tagged references – a classic trick of various language implementations – and this overturned the disadvantage.

For me, just having the possibility of exploring alternative design decisions is interesting. Python’s design is largely done by consensus, with pronouncements made to settle disagreements and to move the process forward. Although this may have served the language well, depending on one’s perspective, it has also meant that certain paths of exploration have not been followed. Certain things have been improved gradually but not radically due to backwards compatibility considerations, this despite the break in compatibility between the Python 2 and 3 branches where an opportunity was undoubtedly lost to do greater things. Lichen is an attempt to explore those other paths without having to constantly justify it to a group of people who may regard such exploration as hostile to their own interests.

Lichen is not really complete: it needs floating point number and other useful types; its library is minimal; it could be made more robust; it could be made more powerful. But I find myself surprised that it works at all. Maybe I should have more confidence in myself, especially given all the preparation I did in trying to understand the good and bad aspects of my previous efforts before getting started on this one.

Developing for MIPS-based Platforms

A couple of years ago I found myself wondering if I couldn’t write some low-level software for the Ben NanoNote. One source of inspiration for doing this was “The CI20 bare-metal project“: a series of blog articles discussing the challenges of booting the MIPS Creator CI20 single-board computer. The Ben and the CI20 use CPUs (or SoCs) from the same family: the Ingenic JZ4720 and JZ4780 respectively.

For the Ben, I looked at the different boot payloads, principally those written to support booting from a USB host, but also the version of U-Boot deployed on the Ben. I combined elements of these things with the framebuffer driver code from the Linux kernel supporting the Ben, and to my surprise I was able to get the device to boot up and show a pattern on the screen. Progress has not always been steady, though.

For a while, I struggled to make the CPU leave its initial exception state without hanging, and with the screen as my only debugging tool, it was hard to see what might have been going wrong. Some careful study of the code revealed the problem: the code I was using to write to the framebuffer was using the wrong address region, meaning that as soon as an attempt was made to update the contents of the screen, the CPU would detect a bad memory access and an exception would occur. Such exceptions will not be delivered in the initial exception state, but with that state cleared, the CPU will happily trigger a new exception when the program accesses memory it shouldn’t be touching.

Debugging low-level code on the Ben NanoNote (the hard way)

Debugging low-level code on the Ben NanoNote (the hard way)

I have since plodded along introducing user mode functionality, some page table initialisation, trying to read keypresses, eventually succeeding after retracing my steps and discovering my errors along the way. Maybe this will become a genuinely useful piece of software one day.

But one useful purpose this exercise has served is that of familiarising myself with the way these SoCs are organised, the facilities they provide, how these may be accessed, and so on. My brother has the Letux 400 notebook containing yet another SoC in the same family, the JZ4730, which seems to be almost entirely undocumented. This notebook has proven useful under certain circumstances. For instance, it has been used as a kind of appliance for document scanning, driving a multifunction scanner/printer over USB using the enduring SANE project’s software.

However, the Letux 400 is already an old machine, with products based on this hardware platform being almost ten years old, and when originally shipped it used a 2.4 series Linux kernel instead of a more recent 2.6 series kernel. Like many products whose software is shipped as “finished”, this makes the adoption of newer software very difficult, especially if the kernel code is not “upstreamed” or incorporated into the official Linux releases.

As software distributions such as Debian evolve, they depend on newer kernel features, but if a device is stuck on an older kernel (because the special functionality that makes it work on that device is specific to that kernel) then the device, unable to run the newer kernels, gradually becomes unable to run newer versions of the distribution as well. Thus, Debian Etch was the newest distribution version that would work on the 2.4 kernel used by the Letux 400 as shipped.

Fortunately, work had been done to make a 2.6 series kernel work on the Letux 400, and this made Debian Lenny functional. But time passes and even this is now considered ancient. Although David was running some software successfully, there was other software that really needed a newer distribution to be able to run, and this meant considering what it might take to support Debian Squeeze on the hardware. So he set to work adding patches to the 2.6.24 kernel to try and take it within the realm of Squeeze support, making it beyond the bare minimum of 2.6.29 and into the “release candidate” territory of 2.6.30. And this was indeed enough to run Squeeze on the notebook, at least supporting the devices needed to make the exercise worthwhile.

Now, at a much earlier stage in my own experiments with the Ben NanoNote, I had tried without success to reproduce my results on the Letux 400. And I had also made a rather tentative effort at modifying Ben NanoNote kernel drivers to potentially work with the Letux 400 from some 3.x kernel version. David’s success in updating the kernel version led me to look again at the tasks of familiarising myself with kernel drivers, machine details and of supporting the Letux 400 in even newer kernels.

The outcome of this is uncertain at present. Most of the work on updating the drivers and board support has been done, but actual testing of my work still needs to be done, something that I cannot really do myself. That might seem strange: why start something I cannot finish by myself? But how I got started in this effort is also rather related to the topic of the next section.

The MIPS Creator CI20 and L4/Fiasco.OC

Low-level programming on the Ben NanoNote is frustrating unless you modify the device and solder the UART connections to the exposed pads in the battery compartment, thereby enabling a serial connection and allowing debugging information to be sent to a remote display for perusal. My soldering skills are not that great, and I don’t want to damage my device. So debugging was a frustrating exercise. Since I felt that I needed a bit more experience with the MIPS architecture and the Ingenic SoCs, it occurred to me that getting a CI20 might be the way to go.

I am not really a supporter of Imagination Technologies, producer of the CI20, due to the company’s rather hostile attitude towards Free Software around their PowerVR technologies, meaning that of the different graphics acceleration chipsets, PowerVR has been increasingly isolated as a technology that is consistently unsupportable by Free Software drivers. However, the CI20 is well-documented and has been properly supported with Free Software, apart from the PowerVR parts of the hardware, of course. Ingenic were seemingly persuaded to make the programming manual for the JZ4780 used by the CI20 publicly available, unlike the manuals for other SoCs in that family. And the PowerVR hardware is not actually needed to be able to use the CI20.

The MIPS Creator CI20 single-board computer

The MIPS Creator CI20 single-board computer

I had hoped that the EOMA68 campaign would have offered a JZ4775 computer card, and that the campaign might have delivered such a card by now, but with both of these things not having happened I took the plunge and bought a CI20. There were a few other reasons for doing so: I wanted to see how a single-board computer with a decent amount of RAM (1GB) might perform as a working desktop machine; having another computer to offload certain development and testing tasks, rather than run virtual machines, would be useful; I also wanted to experiment with and even attempt to port other operating systems, loosening my dependence on the Linux monoculture.

One of these other operating systems involves two components: the Fiasco.OC microkernel and the L4 Runtime Environment (L4Re). Over the years, microkernels in the L4 family have seen widespread use, and at one point people considered porting GNU Hurd to one of the L4 family microkernels from the Mach microkernel it then used (and still uses). It seems to me like something worth looking at more closely, and fortunately it also seemed that this software combination had been ported to the CI20. However, it turned out that my expectations of building an image, testing the result, and then moving on to developing interesting software were a little premature.

The first real problem was that GCC produced position-independent code that was not called correctly. This meant that upon trying to get the addresses of functions, the program would end up loading garbage addresses and trying to call any code that might be there at those addresses. So some fixes were required. Then, it appeared that the JZ4780 doesn’t support a particular MIPS instruction, meaning that the CPU would encounter this instruction and cause an exception. So, with some guidance, I wrote a handler to decode the instruction and generate the rather trivial result that the instruction should produce. There were also some more generic problems with the microkernel code that had previously been patched but which had not appeared in the upstream repository. But in the end, I got the “hello” program to run.

With a working foundation I tried to explore the hardware just as I had done with the Ben NanoNote, attempting to understand things like the clock and power management hardware, general purpose input/output (GPIO) peripherals, and also the Inter-Integrated Circuit (I2C) peripherals. Some assistance was available in the form of Linux kernel driver code, although the style of code can vary significantly, and it also takes time to “decode” various mechanisms in the Linux code and to unpick the useful bits related to the hardware. I had hoped to get further, but in trying to use the I2C peripherals to talk to my monitor using the DDC protocol, I found that the data being returned was not entirely reliable. This was arguably a distraction from the more interesting task of enabling the display, given that I know what resolutions my monitor supports.

However, all this hardware-related research and detective work at least gave me an insight into mechanisms – software and hardware – that would inform the effort to “decode” the vendor-written code for the Letux 400, making certain things seem a lot more familiar and increasing my confidence that I might be understanding the things I was seeing. For example, the JZ4720 in the Ben NanoNote arranges its hardware registers for GPIO configuration and access in a particular way, but the code written by the vendor for the JZ4730 in the Letux 400 accesses GPIO registers in a different way.

Initially, I might have thought that I was missing some important detail: are the two products really so different, and if not, then why is the code so different? But then, looking at the JZ4780, I encountered another scheme for GPIO register organisation that is different again, but which does have similarities to the JZ4730. With the JZ4780 being publicly documented, the code for the Letux 400 no longer seemed quite so bizarre or unfathomable. With more experience, it is possible to have a little more confidence in one’s understanding of the mechanisms at work.

I would like to spend a bit more time looking at microkernels and alternatives to Linux. While many people presumably think that Linux is running on everything and has “won”, it is increasingly likely that the Linux one sees on devices does not completely control the hardware and is, in fact, virtualised or confined by software systems like L4/Fiasco.OC. I also have reservations about the way Linux is developed and how well it is able to handle the demands of its proliferation onto every kind of device, many of them hooked up to the Internet and being left to fend for themselves.

Developing imip-agent

Alongside Lichen, a project that has been under development for the last couple of years has been imip-agent, allowing calendar-based scheduling activities to be integrated with mail transport agents. I haven’t been able to spend quite as much time on imip-agent this year as I might have liked, although I will also admit that I haven’t always been motivated to spend much time on it, either. Still, there have been brief periods of activity tidying up, fixing, or improving the code. And some interest in packaging the software led me to reconsider some of the techniques used to deploy the software, in particular the way scheduling extensions are discovered, and the way the system configuration is processed (since Debian does not want “executable scripts” in places like /etc, even if those scripts just contain some simple configuration setting definitions).

It is perhaps fairly typical that a project that tries to assess the feasibility of a concept accumulates the necessary functionality in order to demonstrate that it could do a particular task. After such an initial demonstration, the effort of making the code easier to work with, more reliable, more extensible, must occur if further progress is to be made. One intervention that kept imip-agent viable as a project was the introduction of a test suite to ensure that the basic functionality did indeed work. There were other architectural details that I felt needed remedying or improving for the code to remain manageable.

Recently, I have been refining the parts of the code that support editing of calendar objects and the exchange of updates caused by changes to calendar events. Such work is intended to make the Web client easier to understand and to expose such functionality to proper testing. One side-effect of this may be the introduction of a text-based client for people using e-mail programs like Mutt, as well as a potentially usable library for other mail clients. Such tidying up and fixing does not show off fancy new features or argue the case for developing such software in the first place, but I suppose it makes me feel better about the software I have written.

Whither Moin?

There are probably plenty of other little projects of my own that I have started or at least contemplated this year. And there are also projects that are not mine but which I use and which have had contributions from me over the years. One of these is the MoinMoin wiki software that powers a number of Free Software and other Web sites where collaborative editing is made available to the communities involved. I use MoinMoin – or Moin for short – to publish content on the Web myself, and I have encouraged others to use it in the past. However, it worries me now that the level of maintenance it is receiving has fallen to a level where updates for faults in the software are not likely to be forthcoming and where it is no longer clear where such updates should be coming from.

Earlier in the year, having previously read queries about the static export output from Moin, which can be rather basic and not necessarily resemble the appearance of the wiki such output has come from, I spent some time considering my own use of Moin for documentation publishing. For some of my projects, I don’t take advantage of the “through the Web” editing of the solution when publishing the public documentation. Instead, I use Moin locally, store the pages in a separate repository, and then make page packages that get installed on a public instance of Moin. This means that I do not have to worry about Web-based authentication and can just have a wiki as a read-only resource.

Obviously, the parts of Moin that I really need here are just the things that parse the wiki formatting (which I regard as more usable than other document markup formats in various respects) and that format the content as HTML. If I could format it as static content with some pages, some stylesheets, some images, with some Web server magic to make the URLs look nice, then that would probably be sufficient. For some things like the automatic generation of SVG from Graphviz-format files, I would also need to have the relevant parsers available, too. Having a complete Web framework, which is what Moin really is, is rather unnecessary with these diminished requirements.

But I do use Moin as a full wiki solution as well, and so it made me wonder whether I shouldn’t try and bring it up to date. Of course, there is already the MoinMoin 2.0 effort that was intended to modernise and tidy up the software, but since this effort made a clean break from Moin 1.x, it was never an attractive choice for those people already using Moin in anything more than a basic sense. Since there wasn’t an established API for extensions, it was not readily usable for many existing sites that rely on such extensions. In a way, Moin 2 has suffered from something that Python 3 only avoided by having a lot more people working on it, including people being paid to work on it, together with a policy of openly shaming those people who had made Python 2 viable – by developing software for it – into spending time migrating their code to Python 3.

I don’t have an obvious plan of action here. Moin perhaps illustrates the fundamental problem facing many Free Software projects, this being a theme that I have discussed regularly this year: how they may remain viable by having people able to dedicate their time to writing and maintaining Free Software without this work being squeezed in around the edges of people’s “actual work” and thus burdening them with yet another obligation in their lives, particularly one that is not rewarded by a proper appreciation of the sacrifice being made.

Plenty of individuals and organisations benefit from Moin, but we live in an age of “comparison shopping” where people will gladly drop one thing if someone offers them something newer and shinier. This is, after all, how everyone ends up using “free” services where the actual costs are hidden. To their credit, when Moin needed to improve its password management, the Python Software Foundation stepped up and funded this work rather than dropping Moin, which is what I had expected given certain Python community attitudes. Maybe other, more well-known organisations that use Moin also support its development, but I don’t really see much evidence of it.

Maybe they should consider doing so. The notion that something else will always come along, developed by some enthusiastic developer “scratching their itch”, is misguided and exploitative. And a failure to sustain Free Software development can only undermine Free Software as a resource, as an activity or a cause, and as the basis of many of those organisations’ continued existence. Many of us like developing Free Software, as I hope this article has shown, but motivation alone does not keep that software coming forever.

VGA Signal Generation with the PIC32

Monday, May 22nd, 2017

It all started after I had designed – and received from fabrication – a circuit board for prototyping cartridges for the Acorn Electron microcomputer. Although some prototyping had already taken place with an existing cartridge, with pins intended for ROM components being routed to drive other things, this board effectively “breaks out” all connections available to a cartridge that has been inserted into the computer’s Plus 1 expansion unit.

Acorn Electron cartridge breakout board

The Acorn Electron cartridge breakout board being used to drive an external circuit

One thing led to another, and soon my brother, David, was interfacing a microcontroller to the Electron in order to act as a peripheral being driven directly by the system’s bus signals. His approach involved having a program that would run and continuously scan the signals for read and write conditions and then interpret the address signals, sending and receiving data on the bus when appropriate.

Having acquired some PIC32 devices out of curiosity, with the idea of potentially interfacing them with the Electron, I finally took the trouble of looking at the datasheet to see whether some of the hard work done by David’s program might be handled by the peripheral hardware in the PIC32. The presence of something called “Parallel Master Port” was particularly interesting.

Operating this function in the somewhat insensitively-named “slave” mode, the device would be able to act like a memory device, with the signalling required by read and write operations mostly being dealt with by the hardware. Software running on the PIC32 would be able to read and write data through this port and be able to get notifications about new data while getting on with doing other things.

So began my journey into PIC32 experimentation, but this article isn’t about any of that, mostly because I put that particular investigation to one side after a degree of experience gave me perhaps a bit too much confidence, and I ended up being distracted by something far more glamorous: generating a video signal using the PIC32!

The Precedents’ Hall of Fame

There are plenty of people who have written up their experiments generating VGA and other video signals with microcontrollers. Here are some interesting examples:

And there are presumably many more pages on the Web with details of people sending pixel data over a cable to a display of some sort, often trying to squeeze every last cycle out of their microcontroller’s instruction processing unit. But, given an awareness of how microcontrollers should be able to take the burden off the programs running on them, employing peripheral hardware to do the grunt work of switching pins on and off at certain frequencies, maybe it would be useful to find examples of projects where such advantages of microcontrollers had been brought to bear on the problem.

In fact, I was already aware of the Maximite “single chip computer” partly through having seen the cloned version of the original being sold by Olimex – something rather resented by the developer of the Maximite for reasons largely rooted in an unfortunate misunderstanding of Free Software licensing on his part – and I was aware that this computer could generate a VGA signal. Indeed, the method used to achieve this had apparently been written up in a textbook for the PIC32 platform, albeit generating a composite video signal using one of the on-chip SPI peripherals. The Colour Maximite uses three SPI channels to generate one red, one green, and one blue channel of colour information, thus supporting eight-colour graphical output.

But I had been made aware of the Parallel Master Port (PMP) and its “master” mode, used to drive LCD panels with eight bits of colour information per pixel (or, using devices with many more pins than those I had acquired, with sixteen bits of colour information per pixel). Would it surely not be possible to generate 256-colour graphical output at the very least?

Information from people trying to use PMP for this purpose was thin on the ground. Indeed, reading again one article that mentioned an abandoned attempt to get PMP working in this way, using the peripheral to emit pixel data for display on a screen instead of a panel, I now see that it actually mentions an essential component of the solution that I finally arrived at. But the author had unfortunately moved away from that successful component in an attempt to get the data to the display at a rate regarded as satisfactory.

Direct Savings

It is one thing to have the means to output data to be sent over a cable to a display. It is another to actually send the data efficiently from the microcontroller. Having contemplated such issues in the past, it was not a surprise that the Maximite and other video-generating solutions use direct memory access (DMA) to get the hardware, as opposed to programs, to read through memory and to write its contents to a destination, which in most cases seemed to be the memory address holding output data to be emitted via a data pin using the SPI mechanism.

I had also envisaged using DMA and was still fixated on using PMP to emit the different data bits to the output circuit producing the analogue signals for the display. Indeed, Microchip promotes the PMP and DMA combination as a way of doing “low-cost controllerless graphics solutions” involving LCD panels, so I felt that there surely couldn’t be much difference between that and getting an image on my monitor via a few resistors on the breadboard.

And so, a tour of different PIC32 features began, trying to understand the DMA documentation, the PMP documentation, all the while trying to get a grasp of what the VGA signal actually looks like, the timing constraints of the various synchronisation pulses, and battle various aspects of the MIPS architecture and the PIC32 implementation of it, constantly refining my own perceptions and understanding and learning perhaps too often that there may have been things I didn’t know quite enough about before trying them out!

Using VGA to Build a Picture

Before we really start to look at a VGA signal, let us first look at how a picture is generated by the signal on a screen:

VGA Picture Structure

The structure of a display image or picture produced by a VGA signal

The most important detail at this point is the central area of the diagram, filled with horizontal lines representing the colour information that builds up a picture on the display, with the actual limits of the screen being represented here by the bold rectangle outline. But it is also important to recognise that even though there are a number of visible “display lines” within which the colour information appears, the entire “frame” sent to the display actually contains yet more lines, even though they will not be used to produce an image.

Above and below – really before and after – the visible display lines are the vertical back and front porches whose lines are blank because they do not appear on the screen or are used to provide a border at the top and bottom of the screen. Such extra lines contribute to the total frame period and to the total number of lines dividing up the total frame period.

Figuring out how many lines a display will have seems to involve messing around with something called the “generalised timing formula”, and if you have an X server like Xorg installed on your system, you may even have a tool called “gtf” that will attempt to calculate numbers of lines and pixels based on desired screen resolutions and frame rates. Alternatively, you can look up some common sets of figures on sites providing such information.

What a VGA Signal Looks Like

Some sources show diagrams attempting to describe the VGA signal, but many of these diagrams are open to interpretation (in some cases, very much so). They perhaps show the signal for horizontal (display) lines, then other signals for the entire image, but they either do not attempt to combine them, or they instead combine these details ambiguously.

For instance, should the horizontal sync (synchronisation) pulse be produced when the vertical sync pulse is active or during the “blanking” period when no pixel information is being transmitted? This could be deduced from some diagrams but only if you share their authors’ unstated assumptions and do not consider other assertions about the signal structure. Other diagrams do explicitly show the horizontal sync active during vertical sync pulses, but this contradicts statements elsewhere such as “during the vertical sync period the horizontal sync must also be held low”, for instance.

After a lot of experimentation, I found that the following signal structure was compatible with the monitor I use with my computer:

VGA Signal Structure

The basic structure of a VGA signal, or at least a signal that my monitor can recognise

There are three principal components to the signal:

  • Colour information for the pixel or line data forms the image on the display and it is transferred within display lines during what I call the visible display period in every frame
  • The horizontal sync pulse tells the display when each horizontal display line ends, or at least the frequency of the lines being sent
  • The vertical sync pulse tells the display when each frame (or picture) ends, or at least the refresh rate of the picture

The voltage levels appear to be as follows:

  • Colour information should be at 0.7V (although some people seem to think that 1V is acceptable as “the specified peak voltage for a VGA signal”)
  • Sync pulses are supposed to be at “TTL” levels, which apparently can be from 0V to 0.5V for the low state and from 2.7V to 5V for the high state

Meanwhile, the polarity of the sync pulses is also worth noting. In the above diagram, they have negative polarity, meaning that an active pulse is at the low logic level. Some people claim that “modern VGA monitors don’t care about sync polarity”, but since it isn’t clear to me what determines the polarity, and since most descriptions and demonstrations of VGA signal generation seem to use negative polarity, I chose to go with the flow. As far as I can tell, the gtf tool always outputs the same polarity details, whereas certain resources provide signal characteristics with differing polarities.

It is possible, and arguably advisable, to start out trying to generate sync pulses and just grounding the colour outputs until your monitor (or other VGA-capable display) can be persuaded that it is receiving a picture at a certain refresh rate and resolution. Such confirmation can be obtained on a modern display by seeing a blank picture without any “no signal” or “input not supported” messages and by being able to activate the on-screen menu built into the device, in which an option is likely to exist to show the picture details.

How the sync and colour signals are actually produced will be explained later on. This section was merely intended to provide some background and gather some fairly useful details into one place.

Counting Lines and Generating Vertical Sync Pulses

The horizontal and vertical sync pulses are each driven at their own frequency. However, given that there are a fixed number of lines in every frame, it becomes apparent that the frequency of vertical sync pulse occurrences is related to the frequency of horizontal sync pulses, the latter occurring once per line, of course.

With, say, 622 lines forming a frame, the vertical sync will occur once for every 622 horizontal sync pulses, or at a rate that is 1/622 of the horizontal sync frequency or “line rate”. So, if we can find a way of generating the line rate, we can not only generate horizontal sync pulses, but we can also count cycles at this frequency, and every 622 cycles we can produce a vertical sync pulse.

But how do we calculate the line rate in the first place? First, we decide what our refresh rate should be. The “classic” rate for VGA output is 60Hz. Then, we decide how many lines there are in the display including those extra non-visible lines. We multiply the refresh rate by the number of lines to get the line rate:

60Hz * 622 = 37320Hz = 37.320kHz

On a microcontroller, the obvious way to obtain periodic events is to use a timer. Given a particular frequency at which the timer is updated, a quick calculation can be performed to discover how many times a timer needs to be incremented before we need to generate an event. So, let us say that we have a clock frequency of 24MHz, and a line rate of 37.320kHz, we calculate the number of timer increments required to produce the latter from the former:

24MHz / 37.320kHz = 24000000Hz / 37320Hz = 643

So, if we set up a timer that counts up to 642 and then upon incrementing again to 643 actually starts again at zero, with the timer sending a signal when this “wraparound” occurs, we can have a mechanism providing a suitable frequency and then make things happen at that frequency. And this includes counting cycles at this particular frequency, meaning that we can increment our own counter by 1 to keep track of display lines. Every 622 display lines, we can initiate a vertical sync pulse.

One aspect of vertical sync pulses that has not yet been mentioned is their duration. Various sources suggest that they should last for only two display lines, although the “gtf” tool specifies three lines instead. Our line-counting logic therefore needs to know that it should enable the vertical sync pulse by bringing it low at a particular starting line and then disable it by bringing it high again after two whole lines.

Generating Horizontal Sync Pulses

Horizontal sync pulses take place within each display line, have a specific duration, and they must start at the same time relative to the start of each line. Some video output demonstrations seem to use lots of precisely-timed instructions to achieve such things, but we want to use the peripherals of the microcontroller as much as possible to avoid wasting CPU time. Having considered various tricks involving specially formulated data that might be transferred from memory to act as a pulse, I was looking for examples of DMA usage when I found a mention of something called the Output Compare unit on the PIC32.

What the Output Compare (OC) units do is to take a timer as input and produce an output signal dependent on the current value of the timer relative to certain parameters. In clearer terms, you can indicate a timer value at which the OC unit will cause the output to go high, and you can indicate another timer value at which the OC unit will cause the output to go low. It doesn’t take much imagination to realise that this sounds almost perfect for generating the horizontal sync pulse:

  1. We take the timer previously set up which counts up to 643 and thus divides the display line period into units of 1/643.
  2. We identify where the pulse should be brought low and present that as the parameter for taking the output low.
  3. We identify where the pulse should be brought high and present that as the parameter for taking the output high.

Upon combining the timer and the OC unit, then configuring the output pin appropriately, we end up with a low pulse occurring at the line rate, but at a suitable offset from the start of each line.

VGA Display Line Structure

The structure of each visible display line in the VGA signal

In fact, the OC unit also proves useful in actually generating the vertical sync pulses, too. Although we have a timer that can tell us when it has wrapped around, we really need a mechanism to act upon this signal promptly, at least if we are to generate a clean signal. Unfortunately, handling an interrupt will introduce a delay between the timer wrapping around and the CPU being able to do something about it, and it is not inconceivable that this delay may vary depending on what the CPU has been doing.

So, what seems to be a reasonable solution to this problem is to count the lines and upon seeing that the vertical sync pulse should be initiated at the start of the next line, we can enable another OC unit configured to act as soon as the timer value is zero. Thus, upon wraparound, the OC unit will spring into action and bring the vertical sync output low immediately. Similarly, upon realising that the next line will see the sync pulse brought high again, we can reconfigure the OC unit to do so as soon as the timer value again wraps around to zero.

Inserting the Colour Information

At this point, we can test the basic structure of the signal and see if our monitor likes it. But none of this is very interesting without being able to generate a picture, and so we need a way of getting pixel information from the microcontroller’s memory to its outputs. We previously concluded that Direct Memory Access (DMA) was the way to go in reading the pixel data from what is usually known as a framebuffer, sending it to another place for output.

As previously noted, I thought that the Parallel Master Port (PMP) might be the right peripheral to use. It provides an output register, confusingly called the PMDIN (parallel master data in) register, that lives at a particular address and whose value is exposed on output pins. On the PIC32MX270, only the least significant eight bits of this register are employed in emitting data to the outside world, and so a DMA destination having a one-byte size, located at the address of PMDIN, is chosen.

The source data is the framebuffer, of course. For various retrocomputing reasons hinted at above, I had decided to generate a picture 160 pixels in width, 256 lines in height, and with each byte providing eight bits of colour depth (specifying how many distinct colours are encoded for each pixel). This requires 40 kilobytes and can therefore reside in the 64 kilobytes of RAM provided by the PIC32MX270. It was at this point that I learned a few things about the DMA mechanisms of the PIC32 that didn’t seem completely clear from the documentation.

Now, the documentation talks of “transactions”, “cells” and “blocks”, but I don’t think it describes them as clearly as it could do. Each “transaction” is just a transfer of a four-byte word. Each “cell transfer” is a collection of transactions that the DMA mechanism performs in a kind of batch, proceeding with these as quickly as it can until it either finishes the batch or is told to stop the transfer. Each “block transfer” is a collection of cell transfers. But what really matters is that if you want to transfer a certain amount of data and not have to keep telling the DMA mechanism to keep going, you need to choose a cell size that defines this amount. (When describing this, it is hard not to use the term “block” rather than “cell”, and I do wonder why they assigned these terms in this way because it seems counter-intuitive.)

You can perhaps use the following template to phrase your intentions:

I want to transfer <cell size> bytes at a time from a total of <block size> bytes, reading data starting from <source address>, having <source size>, and writing data starting at <destination address>, having <destination size>.

The total number of bytes to be transferred – the block size – is calculated from the source and destination sizes, with the larger chosen to be the block size. If we choose a destination size less than the source size, the transfers will not go beyond the area of memory defined by the specified destination address and the destination size. What actually happens to the “destination pointer” is not immediately obvious from the documentation, but for our purposes, where we will use a destination size of one byte, the DMA mechanism will just keep writing source bytes to the same destination address over and over again. (One might imagine the pointer starting again at the initial start address, or perhaps stopping at the end address instead.)

So, for our purposes, we define a “cell” as 160 bytes, being the amount of data in a single display line, and we only transfer one cell in a block. Thus, the DMA source is 160 bytes long, and even though the destination size is only a single byte, the DMA mechanism will transfer each of the source bytes into the destination. There is a rather unhelpful diagram in the documentation that perhaps tries to communicate too much at once, leading one to believe that the cell size is a factor in how the destination gets populated by source data, but the purpose of the cell size seems only to be to define how much data is transferred at once when a transfer is requested.

DMA Transfer Mechanism

The transfer of framebuffer data to PORTB using DMA cell transfers (noting that this hints at the eventual approach which uses PORTB and not PMDIN)

In the matter of requesting a transfer, we have already described the mechanism that will allow us to make this happen: when the timer signals the start of a new line, we can use the wraparound event to initiate a DMA transfer. It would appear that the transfer will happen as fast as both the source and the destination will allow, at least as far as I can tell, and so it is probably unlikely that the data will be sent to the destination too quickly. Once the transfer of a line’s pixel data is complete, we can do some things to set up the transfer for the next line, like changing the source data address to point to the next 160 bytes representing the next display line.

(We could actually set the block size to the length of the entire framebuffer – by setting the source size – and have the DMA mechanism automatically transfer each line in turn, updating its own address for the current line. However, I intend to support hardware scrolling, where the address of the first line of the screen can be adjusted so that the display starts part way through the framebuffer, reaches the end of the framebuffer part way down the screen, and then starts again at the beginning of the framebuffer in order to finish displaying the data at the bottom of the screen. The DMA mechanism doesn’t seem to support the necessary address wraparound required to manage this all by itself.)

Output Complications

Having assumed that the PMP peripheral would be an appropriate choice, I soon discovered some problems with the generated output. Although the data that I had stored in the RAM seemed to be emitted as pixels in appropriate colours, there were gaps between the pixels on the screen. Yet the documentation seemed to vaguely indicate that the PMDIN register was more or less continuously updated. That meant that the actual output signals were being driven low between each pixel, causing black-level gaps and ruining the result.

I wondered if anything could be done about this issue. PMP is really intended as some kind of memory interface, and it isn’t unreasonable for it to only maintain valid data for certain periods of time, modifying control signals to define this valid data period. That PMP can be used to drive LCD panels is merely a result of those panels themselves upholding this kind of interface. For those of you familiar with microcontrollers, the solution to my problem was probably obvious several paragraphs ago, but it needed me to reconsider my assumptions and requirements before I realised what I should have been doing all along.

Unlike SPI, which concerns itself with the bit-by-bit serial output of data, PMP concerns itself with the multiple-bits-at-once parallel output of data, and all I wanted to do was to present multiple bits to a memory location and have them translated to a collection of separate signals. But, of course, this is exactly how normal I/O (input/output) pins are provided on microcontrollers! They all seem to provide “PORT” registers whose bits correspond to output pins, and if you write a value to those registers, all the pins can be changed simultaneously. (This feature is obscured by platforms like Arduino where functions are offered to manipulate only a single pin at once.)

And so, I changed the DMA destination to be the PORTB register, which on the PIC32MX270 is the only PORT register with enough bits corresponding to I/O pins to be useful enough for this application. Even then, PORTB does not have a complete mapping from bits to pins: some pins that are available in other devices have been dedicated to specific functions on the PIC32MX270F256B and cannot be used for I/O. So, it turns out that we can only employ at most seven bits of our pixel data in generating signal data:

PORTB Pin Availability on the PIC32MX270F256B
Pins
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
RPB15 RPB14 RPB13 RPB11 RPB10 RPB9 RPB8 RPB7 RPB5 RPB4 RPB3 RPB2 RPB1 RPB0

We could target the first byte of PORTB (bits 0 to 7) or the second byte (bits 8 to 15), but either way we will encounter an unmapped bit. So, instead of choosing a colour representation making use of eight bits, we have to make do with only seven.

Initially, not noticing that RPB6 was not available, I was using a “RRRGGGBB” or “332” representation. But persuaded by others in a similar predicament, I decided to choose a representation where each colour channel gets two bits, and then a separate intensity bit is used to adjust the final intensity of the basic colour result. This also means that greyscale output is possible because it is possible to balance the channels.

The 2-bit-per-channel plus intensity colours

The colours employing two bits per channel plus one intensity bit, perhaps not shown completely accurately due to circuit inadequacies and the usual white balance issues when taking photographs

It is worth noting at this point that since we have left the 8-bit limitations of the PMP peripheral far behind us now, we could choose to populate two bytes of PORTB at once, aiming for sixteen bits per pixel but actually getting fourteen bits per pixel once the unmapped bits have been taken into account. However, this would double our framebuffer memory requirements for the same resolution, and we don’t have that much memory. There may be devices with more than sixteen bits mapped in the 32-bit PORTB register (or in one of the other PORT registers), but they had better have more memory to be useful for greater colour depths.

Back in Black

One other matter presented itself as a problem. It is all very well generating a colour signal for the pixels in the framebuffer, but what happens at the end of each DMA transfer once a line of pixels has been transmitted? For the portions of the display not providing any colour information, the channel signals should be held at zero, yet it is likely that the last pixel on any given line is not at the lowest possible (black) level. And so the DMA transfer will have left a stray value in PORTB that could then confuse the monitor, producing streaks of colour in the border areas of the display, making the monitor unsure about the black level in the signal, and also potentially confusing some monitors about the validity of the picture, too.

As with the horizontal sync pulses, we need a prompt way of switching off colour information as soon as the pixel data has been transferred. We cannot really use an Output Compare unit because that only affects the value of a single output pin, and although we could wire up some kind of blanking in our external circuit, it is simpler to look for a quick solution within the capabilities of the microcontroller. Fortunately, such a quick solution exists: we can “chain” another DMA channel to the one providing the pixel data, thereby having this new channel perform a transfer as soon as the pixel data has been sent in its entirety. This new channel has one simple purpose: to transfer a single byte of black pixel data. By doing this, the monitor will see black in any borders and beyond the visible regions of the display.

Wiring Up

Of course, the microcontroller still has to be connected to the monitor somehow. First of all, we need a way of accessing the pins of a VGA socket or cable. One reasonable approach is to obtain something that acts as a socket and that breaks out the different signals from a cable, connecting the microcontroller to these broken-out signals.

Wanting to get something for this task quickly and relatively conveniently, I found a product at a local retailer that provides a “male” VGA connector and screw-adjustable terminals to break out the different pins. But since the VGA cable also has a male connector, I also needed to get a “gender changer” for VGA that acts as a “female” connector in both directions, thus accommodating the VGA cable and the male breakout board connector.

Wiring up to the broken-out VGA connector pins is mostly a matter of following diagrams and the pin numbering scheme, illustrated well enough in various resources (albeit with colour signal transposition errors in some resources). Pins 1, 2 and 3 need some special consideration for the red, green and blue signals, and we will look at them in a moment. However, pins 13 and 14 are the horizontal and vertical sync pins, respectively, and these can be connected directly to the PIC32 output pins in this case, since the 3.3V output from the microcontroller is supposedly compatible with the “TTL” levels. Pins 5 through 10 can be connected to ground.

We have seen mentions of colour signals with magnitudes of up to 0.7V, but no explicit mention of how they are formed has been presented in this article. Fortunately, everyone is willing to show how they converted their digital signals to an analogue output, with most of them electing to use a resistor network to combine each output pin within a channel to produce a hopefully suitable output voltage.

Here, with two bits per channel, I take the most significant bit for a channel and send it through a 470ohm resistor. Meanwhile, the least significant bit for the channel is sent through a 1000ohm resistor. Thus, the former contributes more to the magnitude of the signal than the latter. If we were only dealing with channel information, this would be as much as we need to do, but here we also employ an intensity bit whose job it is to boost the channels by a small amount, making sure not to allow the channels to pollute each other via this intensity sub-circuit. Here, I feed the intensity output through a 2200ohm resistor and then to each of the channel outputs via signal diodes.

VGA Output Circuit

The circuit showing connections relevant to VGA output (generic connections are not shown)

The Final Picture

I could probably go on and cover other aspects of the solution, but the fundamental aspects are probably dealt with sufficiently above to help others reproduce this experiment themselves. Populating memory with usable image data, at least in this solution, involves copying data to RAM, and I did experience problems with accessing RAM that are probably related to CPU initialisation (as covered in my previous article) and to synchronising the memory contents with what the CPU has written via its cache.

As for the actual picture data, the RGB-plus-intensity representation is not likely to be the format of most images these days. So, to prepare data for output, some image processing is needed. A while ago, I made a program to perform palette optimisation and dithering on images for the Acorn Electron, and I felt that it was going to be easier to adapt the dithering code than it was to figure out the necessary techniques required for software like ImageMagick or the Python Imaging Library. The pixel data is then converted to assembly language data definition statements and incorporated into my PIC32 program.

VGA output from a PIC32 microcontroller

VGA output from a PIC32 microcontroller, featuring a picture showing some Oslo architecture, with the PIC32MX270 being powered (and programmed) by the Arduino Duemilanove, and with the breadboards holding the necessary resistors and diodes to supply the VGA breakout and, beyond that, the cable to the monitor

To demonstrate control over the visible region, I deliberately adjusted the display frequencies so that the monitor would consider the signal to be carrying an image 800 pixels by 600 pixels at a refresh rate of 60Hz. Since my framebuffer is only 256 lines high, I double the lines to produce 512 lines for the display. It would seem that choosing a line rate to try and produce 512 lines has the monitor trying to show something compatible with the traditional 640×480 resolution and thus lines are lost off the screen. I suppose I could settle for 480 lines or aim for 300 lines instead, but I actually don’t mind having a border around the picture.

The on-screen menu showing the monitor's interpretation of the signal

The on-screen menu showing the monitor's interpretation of the signal

It is worth noting that I haven’t really mentioned a “pixel clock” or “dot clock” so far. As far as the display receiving the VGA signal is concerned, there is no pixel clock in that signal. And as far as we are concerned, the pixel clock is only important when deciding how quickly we can get our data into the signal, not in actually generating the signal. We can generate new colour values as slowly (or as quickly) as we might like, and the result will be wider (or narrower) pixels, but it shouldn’t make the actual signal invalid in any way.

Of course, it is important to consider how quickly we can generate pixels. Previously, I mentioned a 24MHz clock being used within the PIC32, and it is this clock that is used to drive peripherals and this clock’s frequency that will limit the transfer speed. As noted elsewhere, a pixel clock frequency of 25MHz is used to support the traditional VGA resolution of 640×480 at 60Hz. With the possibilities of running the “peripheral clock” in the PIC32MX270 considerably faster than this, it becomes a matter of experimentation as to how many pixels can be supported horizontally.

Some Oslo street art being displayed by the PIC32

Some Oslo street art being displayed by the PIC32

For my own purposes, I have at least reached the initial goal of generating a stable and usable video signal. Further work is likely to involve attempting to write routines to modify the framebuffer, maybe support things like scrolling and sprites, and even consider interfacing with other devices.

Naturally, this project is available as Free Software from its own repository. Maybe it will inspire or encourage you to pursue something similar, knowing that you absolutely do not need to be any kind of “expert” to stubbornly persist and to eventually get results!