Paul Boddie's Free Software-related blog

Paul's activities and perspectives around Free Software

Porting L4Re and Fiasco.OC to the Ben NanoNote (Part 5)

We left off last time with the unenviable task of debugging a non-working system. In such a situation the outlook can seem bleak, but I mentioned a couple of strategies that can sometimes rescue the situation. The first of these is to rule out areas of likely problems, which in my case tends to involve reviewing what I have done and seeing if I have made some stupid mistakes. Naturally, it helps to have a certain amount of experience to inform this process; otherwise, practically everything might be a place where such mistakes may be lurking.

One thing that bothered me was the use of the memory map by Fiasco.OC on the Ben NanoNote. When deploying my previous experimental work to the Ben, I had become aware of limitations around where things might be stored, at least while any bootloader might be active. Care must be taken to load new code away from memory already being used, and it seems that the base of memory must also be avoided, at least at first. I wasn’t convinced that this avoidance was going to happen with the default configuration of the different components.

The Memory Map

Of particular concern were the exception vectors – where the processor jumps to if an exception or interrupt occurs – whose defaults in Fiasco.OC situate them at the base of kernel memory: 0x80000000. If the bootloader were to try and copy the code that handles exceptions to this region, I rather suspected that it would immediately cause problems.

I was also unsure whether the bootloader was able to load the payload from the MMC/MicroSD card into memory without overwriting itself or corrupting the payload as it copied things around in memory. According to the boot script that seems to apply to the Ben, it loads the payload into memory at 0x80600000:

#define CONFIG_BOOTCOMMANDFROMSD        "mmc init; ext2load mmc 0 0x80600000 /boot/uImage; bootm"

Meanwhile, the default memory settings for L4Re has code loaded rather low in the kernel address space at 0x802d0000. Without really knowing what happens, I couldn’t be sure that something might get copied to that location, that the copied data might then run past 0x80600000, and this might overwrite some other thing – the kernel, perhaps – that hadn’t been copied yet. Maybe this was just paranoia, but it was at least something that could be dealt with. So I came up with some alternative arrangements:

0x81401000 exception handlers
0x81400000 kernel load address
0x80d00000 bootstrap start address
0x80600000 payload load address when copied by bootm

I wanted to rule out memory conflicts but try and not conjure up more exotic solutions than strictly necessary. So I made some adjustments to the location of the kernel, keeping the exception vectors in the same place relative to the kernel, but moving the vectors far away from the base of memory. It turns out that there are quite a few places that need changing if you do this:

  • A configuration setting, CONFIG_KERNEL_LOAD_ADDR-32, in the kernel build scripts (in kernel/fiasco/src/Modules.mips)
  • The exception base, EXC_BASE, in the kernel’s linker script (in kernel/fiasco/src/kernel.mips.ld)
  • The exception base, Exception_base, in a description of the kernel memory layout (in kernel/fiasco/src/kern/mips/mem_layout-mips32.cpp)
  • The exception base, if it is mentioned in the bootstrap initialisation (in l4/pkg/bootstrap/server/src/ARCH-mips/crt0.S)

The location of the rest of the payload seems to be configured by just changing DEFAULT_RELOC_mips32 in the bootstrap package’s build scripts (in l4/pkg/bootstrap/server/src/Make.rules).

With this done, I had hoped that I might have “moved the needle” a little and provoked a visible change when attempting to boot the system, but this was always going to be rather optimistic. Having pursued the first strategy, I now decided to pursue the second.

Update: it turns out that a more conventional memory arrangement can be used, and this is described in the summary article.

Getting in at the Start

The second strategy is to use every opportunity to get the device to show what it is doing. But how can we achieve this if we cannot boot the kernel and start some code that sets up the framebuffer? Here, there were two things on my side: the role of the bootstrap code, it being rather similar to code I have written before, and the state of the framebuffer when this code is run.

I had already discovered that provided that the code is loaded into a place that can be started by the bootloader, then the _start routine (in l4/pkg/bootstrap/server/src/ARCH-mips/crt0.S) will be called in kernel mode. And I had already looked at this code for the purposes of identifying instructions that needed rewriting as well as for setting the “exception base”. There were a few other precautions that were worth taking here before we might try and get the code to show its activity.

For instance, the code present that attempts to enable a floating point unit in the processor does not apply to the Ben, so this was disabled. I was also unconvinced that the memory mapping instructions would work on the Ben: the JZ4720 does not seem to support memory pages of 256MB, with the Ben only having 32MB anyway, so I changed this to use 16MB pages instead. This must be set up correctly because any wandering into unmapped memory – visiting bad addresses – cannot be rectified before the kernel is active, and the whole point of the bootstrap code is to get the kernel active!

Now, it wasn’t clear just how far the processor was getting through this code before failing somewhere, but this is where the state of the framebuffer comes in. On the Ben, the bootloader initialises the framebuffer in order to show the state of the device, indicate whether it found a payload to load and boot from, complain about error conditions, and so on. It occurred to me that instead of trying to initialise a framebuffer by programming the LCD peripheral in the JZ4720, set up various structures in memory, decide where these structures should even be situated, I could just read the details of the existing framebuffer from the LCD peripheral’s registers, then find out where the framebuffer resides, and then just write whatever data I liked to the framebuffer in order to communicate with the outside world.

So, I would just need to write a few lines of assembly language, slip it into the bootstrap code, and then see if the framebuffer was changed and the details of interest written to the Ben’s display. Here is a fragment of code in a form that would become rather familiar after a time:

        li $8, 0xb3050040       /* LCD_DA0 */
        lw $9, 0($8)            /* &descriptor */
        lw $10, 4($9)           /* fsadr = descriptor[1] */
        lw $11, 12($9)          /* ldcmd = descriptor[3] */
        li $8, 0x00ffffff
        and $11, $8, $11        /* size = ldcmd & LCD_CMD0_LEN */
        li $9, 0xa5a5a5a5
1:
        sw $9, 0($10)           /* *fsadr = ... */
        addiu $11, $11, -1      /* size -= 1 */
        addiu $10, $10, 4       /* fsadr += 4 */
        bnez $11, 1b            /* until size == 0 */
        nop

To summarise, it loads the address of a “descriptor” from a configuration register provided by the LCD peripheral, this register having been set by the bootloader. It then examines some members of the structure provided by the descriptor, notably the framebuffer address (fsadr) and size (a subset of ldcmd). Just to show some sign of progress, the code loops and fills the screen with a specific value, in this case a shade of grey.

By moving this code around in the bootstrap initialisation routine, I could see whether the processor even managed to get as far as this little debugging fragment. Fortunately for me, it did get run, the screen did turn grey, and I could then start to make deductions about why it only got so far but no further. One enhancement to the above that I had to make after a while was to temporarily change the processor status to “error level” (ERL) when accessing things like the LCD configuration. Not doing so risks causing errors in itself, and there is nothing more frustrating than chasing down errors only to discover that the debugging code caused these errors and introduced them as distractions from the ones that really matter.

Enter the Kernel

The bootstrap code isn’t all assembly language, and at the end of the _start routine, the code attempts to jump to __main. Provided this works, the processor enters code that started out life as C++ source code (in l4/pkg/bootstrap/server/src/ARCH-mips/head.cc) and hopefully proceeds to the startup function (in l4/pkg/bootstrap/server/src/startup.cc) which undertakes a range of activities to prepare for the kernel.

Here, my debugging routine changed form slightly, minimising the assembly language portion and replacing the simple screen-clearing loop with something in C++ that could write bit patterns to the screen. It became interesting to know what address the bootstrap code thought it should be using for the kernel, and by emitting this address’s bit pattern I could check whether the code had understood the structure of the payload. It seemed that the kernel was being entered, but upon executing instructions in the _start routine (in kernel/fiasco/src/kern/mips/crt0.S), it would hang.

The Ben NanoNote showing a bit pattern on the screen

The Ben NanoNote showing a bit pattern on the screen with adjacent bits using slightly different colours to help resolve individual bit values; here, the framebuffer address is shown (0x01fb5000), but other kinds of values can be shown, potentially many at a time

This now led to a long and frustrating process of detective work. With a means of at least getting the code to report its status, I had a chance of figuring out what might be wrong, but I also needed to draw on experience and ideas about likely causes. I started to draw up a long list of candidates, suggesting and eliminating things that could have been problems that weren’t. Any relief that a given thing was not the cause of the problem was tempered by the realisation that something else, possibly something obscure or beyond the limit of my own experiences, might be to blame. It was only some consolation that the instruction provoking the failure involved my nemesis from my earlier experiments: the “error level” (ERL) flag in the processor’s status register.