Paul Boddie's Free Software-related blog


Archive for the ‘MIPS’ Category

Experiments with a Screen

Sunday, November 19th, 2023

Not much to report, really. Plenty of ongoing effort to overhaul my L4Re-based software support for the MIPS-based Ingenic SoC products, plus the slow resumption of some kind of focus on my more general framework to provide a demand-paged system on top of L4Re. And then various distractions and obligations on top of that.

Anyway, here is a picture of some kind of result:

MIPS Creator CI20 and Pirate Audio Mini Speaker board

The MIPS Creator CI20 driving the Pirate Audio Mini Speaker board’s screen.

It shows the MIPS Creator CI20 using a Raspberry Pi “hat”, driving the screen using the SPI peripheral built into the CI20’s JZ4780 SoC. Although the original Raspberry Pi had a 26-pin expansion header that the CI20 adopted for compatibility, the Pi range then adopted a 40-pin header instead. Hopefully, there weren’t too many unhappy vendors of accessories as a result of this change.

What it means for the CI20 is that its primary expansion header cannot satisfy the requirements of the expansion connector provided by this “hat” or board in its entirety. Instead, 14 pins of the board’s connector are left unconnected, with the board hanging over the side of the CI20 if mounted directly. Another issue is that the pinout of the board employs a pin as a data/command pin instead of as its designated function as a SPI data input pin. Whether the Raspberry Pi can configure itself to utilise this pin efficiently in this way might help to explain the configuration, but it isn’t compatible with the way such pins are assigned on the CI20.

Fortunately, the CI20’s designers exposed a SPI peripheral via a secondary header, including a dedicated data/command pin, meaning that a few jumper wires can connect the relevant pins to the appropriate connector pins. After some tedious device driver implementation and accompanying frustration, the screen could be persuaded to show an image. With the SPI peripheral being used instead of “bit banging”, or driving the data transfer to the screen controller directly in software, it became possible to use DMA to have the screen image repeatedly sent. And with that, the screen can be used to continuously reflect the contents of a generic framebuffer, becoming like a tiny monitor.

The board also has a speaker that can be driven using I2S communication. The CI20 doesn’t expose I2S signals via the header pins, instead routing I2S audio via the HDMI connector, analogue audio via the headphone socket, and PCM audio via the Wi-Fi/Bluetooth chip, presumably supporting Bluetooth audio. Fortunately, I have another means of testing the speaker, so I didn’t waste half of my money buying this board!

Gradual Explorations of Filesystems, Paging and L4Re

Thursday, June 30th, 2022

A surprising three years have passed since my last article about my efforts to make a general-purpose filesystem accessible to programs running in the L4 (or L4Re) Runtime Environment. Some of that delay was due to a lack of enthusiasm about blogging for various reasons, much more was due to having much of my time occupied by full-time employment involving other technologies (Python and Django mostly, since you ask) that limited the amount of time and energy that could be spent focusing on finding my way around the intricacies of L4Re.

In fact, various other things I looked into in 2019 (or maybe 2018) also went somewhat unreported. I looked into trying to port the “user mode” (UX) variant of the Fiasco.OC microkernel to the MIPS architecture used by the MIPS Creator CI20. This would have allowed me to conveniently develop and test L4Re programs in the GNU/Linux environment on that hardware. I did gain some familiarity with the internals of that software, together with the Linux ptrace mechanism, making some progress but not actually getting to a usable conclusion. Recommendations to use QEMU instead led me to investigate the situation with KVM on MIPS, simply to try and get half-way reasonable performance: emulation is otherwise rather slow.

You wouldn’t think that running KVM on anything other than Intel/AMD or ARM architectures were possible if you only read the summary on the KVM project page or the Debian Wiki’s KVM page. In fact, KVM is supported on multiple architectures including MIPS, but the latest (and by now very old 3.18) “official” kernel for the CI20 turned out to be too old to support what I needed. Or at least, I tried to get it to work but even with all the necessary configuration to support “trap and emulate” on a CPU without virtualisation support, it seemed to encounter instructions it did not emulate. As the hot summer of 2019 (just like 2018) wound down, I switched back to using my main machine at the time: an ancient Pentium 4 system that I didn’t want heating the apartment; one that could run QEMU rather slowly, albeit faster than the CI20, but which gave me access to Fiasco.OC-UX once again.

Since then, the hard yards of upstreaming Linux kernel device support for the CI20 has largely been pursued by the ever-patient Nikolaus Schaller, vendor of the Letux 400 mini-notebook and hardware designer of the Pyra, and a newer kernel capable of running KVM satisfactorily might now be within reach. That is something to be investigated somewhere in the future.

Back to the Topic

In my last article on the topic of this article, I had noted that to take advantage of various features that L4Re offers, I would need to move on from providing a simple mechanism to access files through read and write operations, instead embracing the memory mapping paradigm that is pervasive in L4Re, adopting such techniques to expose file content to programs. This took us through a tour of dataspaces, mapping, pages, flexpages, virtual memory and so on. Ultimately, I had a few simple programs working that could still read and write to files, but they would be doing so via a region of memory where pages of this memory would be dynamically “mapped” – made available – and populated with file content. I even integrated the filesystem “client” library with the Newlib C library implementation, but that is another story.

Nothing is ever simple, though. As I stressed the test programs, introducing concurrent access to files, crashes would occur in the handling of the pages issued to the clients. Since I had ambitiously decided that programs accessing the same files would be able to share memory regions assigned to those files, with two or more programs being issued with access to the same memory pages if they happened to be accessing the same areas of the underlying file, I had set myself up for the accompanying punishment: concurrency errors! Despite the heroic help of l4-hackers mailing list regulars (Frank and Jean), I had to concede that a retreat, some additional planning, and then a new approach would be required. (If nothing else, I hope this article persuades some l4-hackers readers that their efforts in helping me are not entirely going to waste!)

Prototyping an Architecture

In some spare time a couple of years ago, I started sketching out what I needed to consider when implementing such an architecture. Perhaps bizarrely, given the nature of the problem, my instinct was to prototype such an architecture in Python, running as a normal program on my GNU/Linux system. Now, Python is not exactly celebrated for its concurrency support, given the attention its principal implementation, CPython, has often had for a lack of scalability. However, whether or not the Python implementation supports running code in separate threads simultaneously, or whether it merely allows code in threads to take turns running sequentially, the most important thing was that I could have code happily running along being interrupted at potentially inconvenient moments by some other code that could conceivably ruin everything.

Fortunately, Python has libraries for threading and provides abstractions like semaphores. Such facilities would be all that was needed to introduce concurrency control in the different program components, allowing the simulation of the mechanisms involved in acquiring memory pages, populating them, issuing them to clients, and revoking them. It may sound strange to even consider simulating memory pages in Python, which operates at another level entirely, and the issuing of pages via a simulated interprocess communication (IPC) mechanism might seem unnecessary and subject to inaccuracy, but I found it to be generally helpful in refining my approach and even deepening my understanding of concepts such as flexpages, which I had applied in a limited way previously, making me suspect that I had not adequately tested the limits of my understanding.

Naturally, L4Re development is probably never done in Python, so I then had the task of reworking my prototype in C++. Fortunately, this gave me the opportunity to acquaint myself with the more modern support in the C++ standard libraries for threading and concurrency, allowing me to adopt constructs such as mutexes, condition variables and lock guards. Some of this exercise was frustrating: C++ is, after all, a lower-level language that demands more attention to various mundane details than Python does. It did suggest potential improvements to Python’s standard library, however, although I don’t really pay any attention to Python core development any more, so unless someone else has sought to address such issues, I imagine that Python will gain even more in the way of vanity features while such genuine deficiencies and omissions remain unrecognised.

Transplanting the Prototype

Upon reintroducing this prototype functionality into L4Re, I decided to retain the existing separation of functionality into various libraries within the L4Re build system – ones for filesystem clients, servers, IPC – also making a more general memory abstractions library, but I ultimately put all these libraries within a single package. At some point, it is this package that I will be making available, and I think that it will be easier to evaluate with all the functionality in a single bundle. The highest priority was then to test the mechanisms employed by the prototype using the same concurrency stress test program, this originally being written in Python, then ported to C++, having been used in my GNU/Linux environment to loosely simulate the conditions under L4Re.

This stress testing exercise eventually ended up working well enough, but I did experience issues with resource limits within L4Re as well as some concurrency issues with capability management that I should probably investigate further. My test program opens a number of files in a number of threads and attempts to read various regions of these files over and over again. I found that I would run out of capability slots, these tracking the references to other components available to a task in L4Re, and since each open file descriptor or session would require a slot, as would each thread, I had to be careful not to exceed the default budget of such slots. Once again, with help from another l4-hackers participant (Philipp), I realised that I wasn’t releasing some of the slots in my own code, but I also learned that above a certain number of threads, open files, and so on, I would need to request more resources from the kernel. The concurrency issue with allocating individual capability slots remains unexplored, but since I already wrap the existing L4Re functionality in my own library, I just decided to guard the allocation functionality with semaphores.

With some confidence in the test program, which only accesses simulated files with computed file content, I then sought to restore functionality accessing genuine files, these being the read-only files already exposed within L4Re along with ext2-resident files previously supported by my efforts. The former kind of file was already simulated in the prototype in the form of “host” files, although L4Re unhelpfully gives an arbitary (counter) value for the inode identifiers for each file, so some adjustments were required. Meanwhile, introducing support for the latter kind of file led me to update the bundled version of libext2fs I am using, refine various techniques for adapting the upstream code, introduce more functionality to help use libext2fs from my own code (since libext2fs can be rather low-level), and to consider the broader filesystem support architecture.

Here is the general outline of the paging mechanism supporting access to filesystem content:

Paging data structures

The data structures employed to provide filesystem content to programs.

It is rather simplistic, and I have practically ignored complicated page replacement algorithms. In practice, pages are obtained for use when a page fault occurs in a program requesting a particular region of file content, and fulfilment of this request will move a page to the end of a page queue. Any independent requests for pages providing a particular file region will also reset the page’s position in the queue. However, since successful accesses to pages will not cause programs to repeatedly request those pages, eventually those pages will move to the front of the queue and be reclaimed.

Without any insight into how much programs are accessing a page successfully, relying purely on the frequency of page faults, I imagine that various approaches can be adopted to try and assess the frequency of accesses, extrapolating from the page fault frequency and seeking to “bias” or “weight” pages with a high frequency of requests so that they move through the queue more slowly or, indeed, move through a queue that provides pages less often. But all of this is largely a distraction from getting a basic mechanism working, so I haven’t directed any more time to it than I have just now writing this paragraph!

Files and File Sessions

While I am quite sure that I ended up arriving at a rather less than optimal approach for the paging architecture, I found that the broader filesystem architecture also needed to be refined further as I restored the functionality that I had previously attempted to provide. When trying to support shared access to file content, it is appropriate to implement some kind of registry of open files, these retaining references to objects that are providing access to each of the open files. Previously, this had been done in a fairly simple fashion, merely providing a thread-safe map or dictionary yielding the appropriate file-related objects when present, otherwise permitting new objects to be registered.

Again, concurrency issues needed closer consideration. When one program requests access to a file, it is highly undesirable for another program to interfere during the process of finding the file, if it exists already, or creating the file, if it does not. Therefore, there must be some kind of “gatekeeper” for the file, enforcing sequential access to filesystem operations involving it and ensuring that any preparatory activities being undertaken to make a file available, or to remove a file, are not interrupted or interfered with. I came up with an architecture looking like this, with a resource registry being the gatekeeper, resources supporting file sessions, providers representing open files, and accessors transferring data to and from files:

Filesystem access data structures

The data structures employed to provide access to the underlying filesystem objects.

I became particularly concerned with the behaviour of the system around file deletion. On Unix systems, it is fairly well understood that one can “unlink” an existing file and keep accessing it, as long as a file descriptor has been retained to access that file. Opening a file with the same name as the unlinked file under such circumstances will create a new file, provided that the appropriate options are indicated, or otherwise raise a non-existent file error, and yet the old file will still exist somewhere. Any new file with the same name can be unlinked and retained similarly, and so on, building up a catalogue of old files that ultimately will be removed when the active file descriptors are closed.

I thought I might have to introduce general mechanisms to preserve these Unix semantics, but the way the ext2 filesystem works largely encodes them to some extent in its own abstractions. In fact, various questions that I had about Unix filesystem semantics and how libext2fs might behave were answered through the development of various test programs, some being normal programs accessing files in my GNU/Linux environment, others being programs that would exercise libext2fs in that environment. Having some confidence that libext2fs would do the expected thing leads me to believe that I can rely on it at least for some of the desired semantics of the eventual system.

The only thing I really needed to consider was how the request to remove a file when that file was still open would affect the “provider” abstraction permitting continued access to the file contents. Here, I decided to support a kind of deferred removal: if a program requested the removal of a file, the provider and the file itself would be marked for removal upon the final closure of the file, but the provider for the file would no longer be available for new usage, and the file would be unlinked; programs already accessing the file would continue to operate, but programs opening a file of the same name would obtain a new file and a new provider.

The key to this working satisfactorily is that libext2fs will assign a new inode identifier when opening a new file, whereas an unlinked file retains its inode identifier. Since providers are indexed by inode identifier, and since libext2fs translates the path of a file to the inode identifier associated with the file in its directory entry, attempts to access a recreated file will always yield the new inode identifier and thus the new file provider.

Pipes, Listings and Notifications

In the previous implementation of this filesystem functionality, I had explored some other aspects of accessing a filesystem. One of these was the ability to obtain directory listings, usually exposed in Unix operating systems by the opendir and readdir functions. The previous implementation sought to expose such listings as files, this in an attempt to leverage the paging mechanisms already built, but the way that libext2fs provides such listing information is not particularly compatible with the random-access file model: instead, it provides something more like an iterator that involves the repeated invocation of a callback function, successively supplying each directory entry for the callback function to process.

For this new implementation, I decided to expose directory listings via pipes, with a server thread accessing the filesystem and, in that callback function, writing directory entries to one end of a pipe, and with a client thread reading from the other end of the pipe. Of course, this meant that I needed to have an implementation of pipes! In my previous efforts, I did implement pipes as a kind of investigation, and one can certainly make this very complicated indeed, but I deliberately kept this very simple in this current round of development, merely having a couple of memory regions, one being used by the reader and one being used by the writer, with each party transferring the regions to the other (and blocking) if they find themselves respectively running out of content or running out of space.

One necessary element in making pipes work is that of coordinating the reading and writing parties involved. If we restrict ourselves to a pipe that will not expand (or even not expand indefinitely) to accommodate more data, at some point a writer may fill the pipe and may then need to block, waiting for more space to become available again. Meanwhile, a reader may find itself encountering an empty pipe, perhaps after having read all available data, and it may need to block and wait for more content to become available again. Readers and writers both need a way of waiting efficiently and requesting a notification for when they might start to interact with the pipe again.

To support such efficient blocking, I introduced a notifier abstraction for use in programs that could be instantiated and a reference to such an instance (in the form of a capability) presented in a subscription request to the pipe endpoint. Upon invoking the wait operation on a notifier, the notifier will cause the program (or a thread within a program) to wait for the delivery of a notification from the pipe, this being efficient because the thread will then sleep, only to awaken if a message is sent to it. Here is how pipes make use of notifiers to support blocking reads and writes:

Communication via pipes employing notifications

The use of notifications when programs communicate via a pipe.

A certain amount of plumbing is required behind the scenes to support notifications. Since programs accessing files will have their own sessions, there needs to be a coordinating object representing each file itself, this being able to propagate notification events to the users of the file concerned. Fortunately, I introduced the notion of a “provider” object in my architecture that can act in such a capacity. When an event occurs, the provider will send a notification to each of the relevant notifier endpoints, also providing some indication of the kind of event occurring. Previously, I had employed L4Re’s IRQ (interrupt request) objects as a means of delivering notifications to programs, but these appear to be very limited and do not allow additional information to be conveyed, as far as I can tell.

One objective I had with a client-side notifier was to support waiting for events from multiple files or streams collectively, instead of requiring a program to have threads that wait for events from each file individually, thus attempting to support the functionality provided by Unix functions like select and poll. Such functionality relies on additional information indicating the kind of event that has occurred. The need to wait for events from numerous sources also inverts the roles of client and server, with a notifier effectively acting like a server but residing in a client program, waiting for messages from its clients, these typically residing in the filesystem server framework.

Testing and Layering

Previously, I found that it was all very well developing functionality, but only through a commitment to testing it would I discover its flaws. When having to develop functionality at a number of levels in a system at the same time, testing generally starts off in a fairly limited fashion. Initially, I reintroduced a “block” server that merely provides access to a contiguous block of data, this effectively simulating storage device access that will hopefully be written at some point, and although genuine filesystem support utilises this block server, it is reassuring to be able to know whether it is behaving correctly. Meanwhile, for programs to access servers, they must send requests to those servers, assisted by a client library that provides support for such interprocess communication at a fairly low level. Thus, initial testing focused on using this low-level support to access the block server and verify that it provides access to the expected data.

On top of the lowest-level library functionality is a more usable level of “client” functions that automates the housekeeping needing to be done so that programs may expect an experience familiar to that provided by traditional C library functionality. Again, testing of file operations at that level helped to assess whether library and server functionality was behaving in line with expectations. With some confidence, the previously-developed ext2 filesystem functionality was reintroduced and updated. By layering the ext2 filesystem server on top of the block server, the testing activity is actually elevated to another level: libext2fs absolutely depends on properly functioning access to the block device; otherwise, it will not be able to perform even the simplest operations on files.

When acquainting myself with libext2fs, I developed a convenience library called libe2access that encapsulates some of the higher-level operations, and I made a tool called e2access that is able to populate a filesystem image from a normal program. This tool, somewhat reminiscent of the mtools suite that was popular at one time to allow normal users to access floppy disks on a system, is actually a fairly useful thing to have, and I remain surprised that there isn’t anything like it in common use. In any case, e2access allows me to populate images for use in L4Re, but I then thought that an equivalent to it would also be useful in L4Re for testing purposes. Consequently, a tool called fsaccess was created, but unlike e2access it does not use libe2access or libext2fs directly: instead, it uses the “client” filesystem library, exercising filesystem access via the IPC system and filesystem server architecture.

Ultimately, testing will be done completely normally using C library functions, these wrapping the “client” library. At that point, there will be no distinction between programs running within L4Re and within Unix. To an extent, L4Re already supports normal Unix-like programs using C library functions, this being particularly helpful when developing all this functionality, but of course it doesn’t support “proper” filesystems or Unix-like functionality in a particularly broad way, with various common C library or POSIX functions being stubs that do nothing. Of course, all this effort started out precisely to remedy these shortcomings.

Paging, Loading and Running Programs

Beyond explicitly performed file access, the next level of mutually-reinforcing testing and development came about through the simple desire to have a more predictable testing environment. In wanting to be able to perform tests sequentially, I needed control over the initiation of programs and to be able to rely on their completion before initiating successive programs. This may well be possible within L4Re’s Lua-based scripting environment, but I generally find the details to be rather thin on the ground. Besides, the problem provided some motivation to explore and understand the way that programs are launched in the environment.

There is some summary-level information about how programs (or tasks) are started in L4Re – for example, pages 41 onwards of “Memory, IPC, and L4Re” – but not much in the way of substantial documentation otherwise. Digging into the L4Re libraries yielded a confusing array of classes and apparent interactions which presumably make sense to anyone who is already very familiar with the specific approach being taken, as well as the general techniques being applied, but it seems difficult for outsiders to distinguish between the specifics and the generalities.

Nevertheless, some ideas were gained from looking at the code for various L4Re components including Moe (the root task), Ned (the init program), the loader and utilities libraries, and the oddly-named l4re_kernel component, this actually providing the l4re program which itself hosts actual programs by providing the memory management functionality necessary for those programs to work. In fact, we will eventually be looking at a solution that replicates that l4re program.

A substantial amount of investigation and testing took place to explore the topic. There were a number of steps required to initialise a new program:

  1. Open the program executable file and obtain details of the different program segments and the program’s start address, this requiring some knowledge of ELF binaries.
  2. Initialise a stack for the program containing the arguments to be presented to it, plus details of the program’s environment. The environment is of particular concern.
  3. Create a task for the program together with a thread to begin execution at the start address, setting the stack pointer to the appropriate place in where the stack should be made available.
  4. Initialise a control block for the thread.
  5. Start the thread. This should immediately generate a page fault because the memory at the start address is not yet available within the task.
  6. Service page faults for the program, providing pages for the program code – thus resolving that initial page fault – as well as for the stack and other regions of memory.

Naturally, each of these steps entails a lot more work than is readily apparent. Particularly the last step is something of an understatement in terms of what is required: the mechanism by which demand paging of the program is to be achieved.

L4Re provides some support for inspecting ELF binaries in its utilities library, but I found the ELF specification to be very useful in determining the exact purposes of various program header fields. For more practical guidance, the OSDev wiki page about ELF provides an explanation of the program loading process, along with how the different program segments are to be applied in the initialisation of a new program or process. With this information to hand, together with similar descriptions in the context of L4Re, it became possible to envisage how the address space of a new program might be set up, determining which various parts of the program file might be installed and where they might be found. I wrote some test programs, making some use of the structures in the utilities library, but wrote my own functions to extract the segment details from an ELF binary.

I found a couple of helpful resources describing the initialisation of the program stack: “Linux x86 Program Start Up” and “How statically linked programs run on Linux”. These mainly demystify the code that is run when a program starts up, setting up a program before the user’s main function is called, giving a degree of guidance about the work required to set up the stack so that such code may perform as expected. I was, of course, also able to study what the various existing L4Re components were doing in this respect, although I found the stack abstractions used to be idiomatic C/C++ bordering on esoteric. Nevertheless, the exercise involves obtaining some memory that can eventually be handed over to the new program, populating that memory, and then making it available to the new program, either immediately or on request.

Although I had already accumulated plenty of experience passing object capabilities around in L4Re, as well as having managed to map memory between tasks by sending the appropriate message items, the exact methods of setting up another task with memory and capabilities had remained mysterious to me, and so began another round of experimentation. What I wanted to do was to take a fairly easy route to begin with: create a task, populate some memory regions containing the program code and stack, transfer these things to the new task (using the l4_task_map function), and then start the thread to run the program, just to see what happened. Transferring capabilities was fairly easily achieved, and the L4Re libraries and frameworks do employ the equivalent of l4_task_map in places like the Remote_app_model class found in libloader, albeit obfuscated by the use of the corresponding C++ abstractions.

Frustratingly, this simple approach did not seem to work for the memory, and I could find only very few cases of anyone trying to use l4_task_map (or its equivalent C++ incantations) to transfer memory. Despite the memory apparently being transferred to the new task, the thread would immediately cause a page fault. Eventually, a page fault is what we want, but that would only occur because no memory would be made available initially, precisely because we would be implementing a demand paging solution. In the case of using l4_task_map to set up program memory, there should be no new “demand” for pages of such memory, this demand having been satisfied in advance. Nevertheless, I decided to try and get a page fault handler to supply flexpages to resolve these faults, also without success.

Having checked and double-checked my efforts, an enquiry on the l4-hackers list yielded the observation that the memory I had reserved and populated had not been configured as “executable”, for use by code in running programs. And indeed, since I had relied on the plain posix_memalign function to allocate that memory, it wasn’t set up for such usage. So, I changed my memory allocation strategy to permit the allocation of appropriately executable memory, and fortunately the problem was solved. Further small steps were then taken. I sought to introduce a region mapper that would attempt to satisfy requests for memory regions occurring in the new program, these occurring because a program starting up in L4Re will perform some setting up activities of its own. These new memory regions would be recognised by the page fault handler, with flexpages supplied to resolve page faults involving those regions. Eventually, it became possible to run a simple, statically-linked program in its own task.

Supporting program loading with an external page fault handler

When loading and running a new program, an external page fault handler makes sure that accesses to memory are supported by memory regions that may be populated with file content.

Up to this point, the page fault handler had been external to the new task and had been supplying memory pages from its own memory regions. Requests for data from the program file were being satisfied by accessing the appropriate region of the file, this bringing in the data using the file’s paging mechanism, and then supplying a flexpage for that part of memory to the program running in the new task. This particular approach compels the task containing the page fault handler to have a memory region dedicated to the file. However, the more elegant solution involves having a page fault handler communicating directly with the file’s pager component which will itself supply flexpages to map the requested memory pages into the new task. And to be done most elegantly, the page fault handler needs to be resident in the same task as the actual program.

Putting the page fault handler and the actual program in the same task demanded some improvements in the way I was setting up tasks and threads, providing capabilities to them, and so on. Separate stacks need to be provided for the handler and the program, and these will run in different threads. Moving the page fault handler into the new task is all very well, but we still need to be able to handle page faults that the “internal” handler might cause, so this requires us to retain an “external” handler. So, the configuration of the handler and program are slightly different.

Another tricky aspect of this arrangement is how the program is configured to send its page faults to the handler running alongside it – the internal handler – instead of the one servicing the handler itself. This requires an IPC gate to be created for the internal handler, presented to it via its configuration, and then the handler will bind to this IPC gate when it starts up. The program may then start up using a reference to this IPC gate capability as its “pager” or page fault handler. You would be forgiven for thinking that all of this can be quite difficult to orchestrate correctly!

Configuring the communication between program and page fault handler

An IPC gate must be created and presented to the page fault handler for it to bind to before it is presented to the program as its “pager”.

Although I had previously been sending flexpages in messages to satisfy map requests, the other side of such transactions had not been investigated. Senders of map requests will specify a “receive window” to localise the placement of flexpages returned from such requests, this being an intrinsic part of the flexpage concept. Here, some aspects of the IPC system became more prominent and I needed to adjust the code generated by my interface description language tool which had mostly ignored the use of message buffer registers, employing them only to control the reception of object capabilities.

More testing was required to ensure that I was successfully able to request the mapping of memory in a particular region and that the supplied memory did indeed get mapped into the appropriate place. With that established, I was then able to modify the handler deployed to the task. Since the flexpage returned by the dataspace (or resource) providing access to file content effectively maps the memory into the receiving task, the page fault handler does not need to explicitly return a valid flexpage: the mapping has already been done. The semantics here were not readily apparent, but this approach appears to work correctly.

The use of an internal page fault handler with a new program

An internal page fault handler satisfies accesses to memory from the program running in the same task, providing it with access to memory regions that may be populated with file content.

One other detail that proved to be important was that of mapping file content to memory regions so that they would not overlap somehow and prevent the correct region from being used to satisfy page faults. Consider the following regions of the test executable file described by the readelf utility (with the -l option):

  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  LOAD           0x0000000000000000 0x0000000001000000 0x0000000001000000
                 0x00000000000281a6 0x00000000000281a6  R E    0x1000
  LOAD           0x0000000000028360 0x0000000001029360 0x0000000001029360
                 0x0000000000002058 0x0000000000008068  RW     0x1000

Here, we need to put the first region providing the program code at a virtual address of 0x1000000, having a size of at least 0x281a6, populated with exactly that amount of content from the file. Meanwhile, we need to put the second region at address 0x1029360, having a size of 0x8068, but only filled with 0x2058 bytes of data. Both regions need to be aligned to addresses that are multiples of 0x1000, but their contents must be available at the stated locations. Such considerations brought up two apparently necessary enhancements to the provision of file content: the masking of content so that undefined areas of each region are populated with zero bytes, this being important in the case of the partially filled data region; the ability to support writes to a region without those writes being propagated to the original file.

The alignment details help to avoid the overlapping of regions, and the matter of populating the regions can be managed in a variety of ways. I found that since file content was already being padded at the extent of a file, I could introduce a variation of the page mapper already used to manage the population of memory pages that would perform such padding at the limits of regions defined within files. For read-only file regions, such a “masked” page mapper would issue a single read-only page containing only zero bytes for any part of a file completely beyond the limits of such regions, thus avoiding the allocation of lots of identical pages. For writable regions that are not to be committed to the actual files, a “copied” page mapper abstraction was introduced, this providing copy-on-write functionality where write accesses cause new memory pages to be allocated and used to retain the modified data.

Some packaging up of the functionality into library routines and abstractions was undertaken, although as things stand more of that still needs to be done. I haven’t even looked into support for dynamic library loading, nor am I handling any need to extend the program stack when that is necessary, amongst other things, and I also need to make the process of creating tasks as simple as a function call and probably also expose the process via IPC in order to have a kind of process server. I still need to get back to addressing the lack of convenient support for the sequential testing of functionality.

But I hope that much of the hard work has now already been done. Then again, I often find myself climbing one particular part of this mountain, thinking that the next part of the ascent will be easier, only to find myself confronted with another long and demanding stretch that brings me only marginally closer to the top! This article is part of a broader consolidation process, along with writing some documentation, and this process will continue with the packaging of this work for the historical record if nothing else.

Conclusions and Reflections

All of this has very much been a learning exercise covering everything from the nuts and bolts of L4Re, with its functions and abstractions, through the design of a component architecture to support familiar, intuitive but hard-to-define filesystem functionality, this requiring a deeper understanding of Unix filesystem behaviour, all the while considering issues of concurrency and resource management that are not necessarily trivial. With so much going on at so many levels, progress can be slow and frustrating. I see that similar topics and exercises are pursued in some university courses, and I am sure that these courses produce highly educated people who are well equipped to go out into the broader world, developing systems like these using far less effort than I seem to be applying.

That leads to the usual question of whether such systems are worth developing when one can just “use Linux” or adopt something already under development and aimed at a particular audience. As I note above, maybe people are routinely developing such systems for proprietary use and don’t see any merit in doing the same thing openly. The problem with such attitudes is that experience with the development of such systems is then not broadly cultivated, the associated expertise and the corresponding benefits of developing and deploying such systems are not proliferated, and the average user of technology only gets benefits from such systems in a limited sense, if they even encounter them at all, and then only for a limited period of time, most likely, before the products incorporating such technologies wear out or become obsolete.

In other words, it is all very well developing proprietary systems and celebrating achievements made decades ago, but having reviewed decades of computing history, it is evident to me that achievements that are not shared will need to be replicated over and over again. That such replication is not cutting-edge development or, to use the odious term prevalent in academia, “novel” is not an indictment of those seeking to replicate past glories: it is an indictment of the priorities of those who commercialised them on every prior occasion. As mundane as the efforts described in this article may be, I would hope that by describing them and the often frustrating journey involved in pursuing them, people may be motivated to explore the development of such systems and that techniques that would otherwise be kept as commercial secrets or solutions to assessment exercises might hopefully be brought to a broader audience.

Making a Board for the PIC32 VGA Signal Generation Project

Tuesday, February 22nd, 2022

Although I have tried to maintain an interest in computing hardware over the last two years or so, both contemporary and somewhat older technologies, I haven’t really had much time or energy to devote to electronics projects. However, I was becoming increasingly bothered by my existing VGA signal generation project taking up space on a couple of solderless breadboards, demanding lots of jumper wires, and generally not helping me organise and consolidate my collection of electronics-related acquisitions, components, and so on. I was also feeling a bit bad for not moving the VGA project to the next step, which is what I had virtually promised to do. Having acquired a new computer in 2020 but not having really looked at KiCad since migrating to the new machine, I finally picked up the courage to set out translating my existing notes into a proper circuit diagram, reacquainting myself with KiCad’s peculiarities, and then translating this into an actual board layout.

My priorities for a circuit board were to support signal generation and programming, gathering together all the different passive components associated with signal generation – the resistors for combining colour intensity levels – plus all the components associated with programming the microcontroller – more resistors – and putting them on the board with a socket for the microcontroller. Unlike boards trying to replicate the Arduino experience, this board would not seek to offer programming facilities itself: that would remain the job of another board, with the Arduino Duemilanove already doing this job adequately under the control of the Nanu Nanu software. Headers would therefore expose the programming pins for connection to the Arduino or other device.

Nor would the board offer its own VGA connector for the VGA signals: I already have a VGA connector terminal block, and mounting a DE-15 connector on a board raises issues of selecting an appropriate component, defining a usable component footprint, making room on the board for it, and ensuring that the connector is securely attached without risking stress on the board or solder joints. Besides, the idea was to make use of my existing parts, and by retaining use of the terminal block, I could just use a header for the VGA signal outputs which would also require a modest amount of space. I envisaged that any future solution with a proper socket for a VGA cable might well involve a separate board securely mounted to a case, with the case taking the strain of cable insertions and removals, and with cables connecting this separate board to the board being designed here.

One issue when making a board that I could easily imagine stalling the process is that of the size and shape of any given board. With something resembling a miniature computing system, one might get tempted by selecting a broadly-adopted circuit board profile like PC/104 or one of the ITX family, or perhaps leveraging the deluge of Raspberry Pi cases available for one of those products (despite their variations and incompatibilities). However, my choice was practically made for me: I already had one Arduino case that I hadn’t used for anything, and it seemed to me that the traditional Arduino board profile was appropriately sized and would not be particularly inconvenient to adopt, although the actual physical characteristics were not adequately defined, leaving it to others to do the work of documentation that should have been done by the Arduino initiative right from the start. (There are subtle differences between Duemilanove models that have actually made some cases incompatible with some versions of that board, so this was a real problem.)

Anyway, with the physical and electrical constraints mostly figured out, I laid out the board, put headers in places I considered sensible enough, not aiming for Arduino shield compatibility since I needed more flexibility than that would demand. I also wanted to add support for various other useful features including some that I had already tested, such as UART communication, together with some I had not been able to adequately investigate, such as the driving of peripherals (or the microcontroller itself) over the parallel bus and USB connectivity. The latter involved messing around with the differential pair routing features of KiCad, but it is entirely possible that such features are largely superfluous at the transmission frequencies this board would end up using. After much checking, rechecking, agonising, and so on, I uploaded the board design to OSHPark and placed an order.

Several weeks later, I received the boards in the post, thankfully without any customs fees or other charges. In the meantime, I had also ordered components that I needed from a supplier in the UK, Technobots, who happen to sell logic chips and other things that suppliers targeting the “maker” community tend not to bother selling. Thankfully also in this case, the components arrived without incurring fees and charges, although I had kept the value of my order fairly low. In Norway, the industrial lobby hate to see people importing things, often taking the tone that people should buy locally produced goods, but since nobody makes these things here, anyone “local” that sells them has to import them, and the result is often just the cheapest stuff at ridiculously high prices and some middlemen making all the money, rather than anyone earning a decent living out of actually producing anything. But anyway.

The boards themselves were well made, as I have seen before, although I was disappointed with the “tabs” around the edge. When boards get made, people will tell you that they are all put on a big panel together and that they therefore need to be attached to each other somehow. The manufacturer will then typically spell out that apart from separating the boards by snapping them apart, they don’t do any further finishing work on the edges because we, the customers, aren’t paying for that level of service. However, I don’t remember the remnants of these inter-board connections – the tabs – being so awkward on previous boards from OSHPark. If I had to do anything before, I’m sure it just involved some gentle trimming, but these needed the application of glasspaper to grind down the tabs to be level with the actual board edge. Given the sub-millimetre tolerances involved in board fabrication, I find it perverse that such finishing would fall to someone – me or someone in the factory – to do this by hand.

The fabricated boards

The fabricated boards, with the annoying “tabs” visible around the edges.

Having tested the continuity between different pads using a simple power plus LED plus resistor arrangement, all that remained was to get the soldering equipment out and to build myself up to the usually arduous task of fitting all the headers, resistors, capacitors and the socket. This was something of a trial in itself, but I eventually remembered the various tricks needed to coerce my soldering iron to apply the solder in a barely decent way.

The finished board

The finished board connected to the VGA terminal block and Arduino Duemilanove.

One thing I realised very quickly was that I had chosen the wrong component footprint for the resistors. In navigating the rather unreadable list provided by the inflexible component assignment dialogue within KiCad, I had chosen footprints with the pads not being sufficiently separated for the resistors I happen to have (and ones that are likely to be sold for these purposes). So, I had to raise the resistors up from the board and tuck the leads under them slightly. Despite the additional height occupied by the resistors, the capacitors and headers are already rather tall and so this workaround causes no other problems. I also discovered that placing the resistors in the compact arrangement shown, while very neat and tidy, made all the unsoldered leads rather crowded on the underside of the board and rather increased the risk of accidental solder bridging between components, at least with my soldering technique. The capacitors had appropriate footprints, however, and were less of a challenge.

What followed was another trial, mostly caused by a simple wiring up error, and lots of troubleshooting involving the programming software and circuits both on this board and on the breadboard, chasing down a non-existent programming issue. The programming circuit was actually fine, and the programming was actually being performed as normal, but in maintaining a level of doubt about the outcome of the exercise, I had run the verification operation of the programming software and had seen that it was consistently complaining about a programming error. Worryingly, this error now seemed to occur both on the manufactured board and on the breadboard, for both a microcontroller that I had used before and a completely unused one. I started to test and investigate different versions of the programming software to see if any regressions had occurred.

Eventually, I realised that the error was related to the last thing the software would do when programming the microcontroller and was related to setting a configuration register. Ignoring this and reminding myself of the UART configuration, I sought to test the example program demonstrating UART communication and found, eventually, that it worked on the breadboard. It also worked on the manufactured board, too, raising my confidence slightly. And then, while perusing my previous blog post, I realised that I had connected the VGA sync signal pins the wrong way round! Programming the VGA example program and fixing the wiring finally got me to where I should have been to begin with.

The board generating a VGA signal.

The board generating a VGA signal via a terminal block and cable, with the Arduino supplying power and acting as programmer.

The arrangement of boards required here is a bit cumbersome, although the Arduino could be replaced by a simple 3.3V power source if the appropriate software has already been programmed into the PIC32 microcontroller. As noted above, an auxiliary board could provide a VGA socket for a cable and thus eliminate the bulky terminal connector.

The three-board arrangement.

The three-board arrangement of VGA terminal block, VGA signal board, and Arduino Duemilanove.

As far as using an existing Arduino case is concerned, there are some obvious limitations and possibilities for improvement and refinement. The size and shape of the manufactured board is compatible with the case I happen to have, but where an Arduino shield would be stacked on top of the Duemilanove directly, this board necessitates that the connections be routed using wires or cables between the Arduino’s analogue header and the programming header on this board. Having mounted female headers on this board, this involves routing cables around the edge of the board, and the lack of clearance between the Arduino’s headers and the underside of this board means that there is not enough space to use jumper wires. Fortunately, breadboard jumper cables can do this job, if not entirely elegantly.

The board and the Duemilanove in a case.

The board and the Duemilanove in a case, with the cables routing programming and power signals between the boards.

Introducing a degree of shield compatibility might be a solution to such problems, but perhaps I should not be too hard on myself. The new computer I bought in 2020 has its own compartment for cables, and this compartment is rammed full of seemingly needless, bulky cabling, presumably demonstrating the designers’ aptitude for cable management as they neglected other basic functionality one would expect from a computer case in the third decade of the twenty-first century.

There are other modifications that I might make in a second version of this board. Challenged by the layout issues, I neglected to realise that if the VGA signal resistors are fitted, some of the parallel bus pins will have their own signals mixed with others: I should have realised this and specified diodes for the various signal lines concerned. Still, with spare boards and another microcontroller to play with, I could possibly just make up a board without those resistors if I wanted to experiment with parallel mode. Since this would most likely involve driving a screen, it wouldn’t make sense to try and support that and VGA output on the same board, although I could imagine some kind of switching solution disabling one and enabling the other as appropriate.

In reflection, the outcome is satisfactory. I did spend rather too long designing, building and – especially – troubleshooting the board, and there are definitely things that are obviously wrong with it, but it has now allowed me to dismantle the original breadboard circuit, free up some jumper wires and components, and to put all of those things away properly. It has also helped me become familiar with KiCad once again, encouraged me to document more schematics, and opened the door to doing more circuit design in the future. In any case, I hope that this was a somewhat interesting view into the realisation of a long overdue project.

Another Look at VGA Signal Generation with a PIC32 Microcontroller

Thursday, November 8th, 2018

Maybe some people like to see others attempting unusual challenges, things that wouldn’t normally be seen as productive, sensible or a good way to spend time and energy, things that shouldn’t be possible or viable but were nevertheless made to work somehow. And maybe they like the idea of indulging their own curiosity in such things, knowing that for potential future experiments of their own there is a route already mapped out to some kind of success. That might explain why my project to produce a VGA-compatible analogue video signal from a PIC32 microcontroller seems to attract more feedback than some of my other, arguably more useful or deserving, projects.

Nevertheless, I was recently contacted by different people inquiring about my earlier experiments. One was admittedly only interested in using Free Software tools to port his own software to the MIPS-based PIC32 products, and I tried to give some advice about navigating the documentation and to describe some of the issues I had encountered. Another was more concerned with the actual signal generation aspect of the earlier project and the usability of the end result. Previously, I had also had a conversation with someone looking to use such techniques for his project, and although he ended up choosing a different approach, the cumulative effect of my discussions with him and these more recent correspondents persuaded me to take another look at my earlier work and to see if I couldn’t answer some of the questions I had received more definitively.

Picking Over the Pieces

I was already rather aware of some of the demonstrated and potential limitations of my earlier work, these being concerned with generating a decent picture, and although I had attempted to improve the work on previous occasions, I think I just ran out of energy and patience to properly investigate other techniques. The following things were bothersome or a source of concern:

  • The unevenly sized pixels
  • The process of experimentation with the existing code
  • Whether the microcontroller could really do other things while displaying the picture

Although one of my correspondents was very complimentary about the form of my assembly language code, I rather felt that it was holding me back, making me focus on details that should be abstracted away. It should be said that MIPS assembly language is fairly pleasant to write, at least in comparison to certain other architectures.

(I was brought up on 6502 assembly language, where there is an “accumulator” that is the only thing even approaching a general-purpose register in function, and where instructions need to combine this accumulator with other, more limited, registers to do things like accessing “zero page”: an area of memory that supports certain kinds of operations by providing the contents of locations as inputs. Everything needs to be meticulously planned, and despite odd claims that “zero page” is really one big register file and that 6502 is therefore “RISC-like”, the existence of virtual machines such as SWEET16 say rather a lot about how RISC-like the 6502 actually is. Later, I learned ARM assembly language and found it rather liberating with its general-purpose registers and uncomplicated, rather easier to use, memory access instructions. Certain things are even simpler in MIPS assembly language, whereas other conveniences from ARM are absent and demand a bit more effort from the programmer.)

Anyway, I had previously introduced functionality in C in my earlier work, mostly because I didn’t want the challenge of writing graphics routines in assembly language. So with the need to more easily experiment with different peripheral configurations, I decided to migrate the remaining functionality to C, leaving only the lowest-level routines concerned with booting and exception/interrupt handling in assembly language. This effort took me to some frustrating places, making me deal with things like linker scripts and the kind of memory initialisation that one’s compiler usually does for you but which is absent when targeting a “bare metal” environment. I shall spare you those details in this article.

I therefore spent a certain amount of effort in developing some C library functionality for dealing with the hardware. It could be said that I might have used existing libraries instead, but ignoring Microchip’s libraries that will either be proprietary or the subject of numerous complaints by those unwilling to leave that company’s “ecosystem”, I rather felt that the exercise in library design would be useful in getting reacquainted and providing me with something I would actually want to use. An important goal was minimalism, whereas my impression of libraries such as those provided by the Pinguino effort are that they try and bridge the different PIC hardware platforms and consequently accumulate features and details that do not really interest me.

The Wide Pixel Problem

One thing that had bothered me when demonstrating a VGA signal was that the displayed images featured “wide” pixels. These are not always noticeable: one of my correspondents told me that he couldn’t see them in one of my example pictures, but they are almost certainly present because they are a feature of the mechanism used to generate the signal. Here is a crop from the example in question:

Picture detail from VGAPIC32 output

Picture detail from VGAPIC32 output

And here is the same crop with the wide pixels highlighted:

Picture detail with wide pixels highlighted

Picture detail with wide pixels highlighted

I have left the identification of all wide pixel columns to the reader! Nevertheless, it can be stated that these pixels occur in every fourth column and are especially noticeable with things like text, where at such low resolutions, the doubling of pixel widths becomes rather obvious and annoying.

Quite why this increase in pixel width was occurring became a matter I wanted to investigate. As you may recall, the technique I used to output pixels involved getting the direct memory access (DMA) controller in the PIC32 chip to “copy” the contents of memory to a hardware register corresponding to output pins. The signals from these pins were sent along the cable to the monitor. And the DMA controller was transferring data as fast as it could and thus producing pixel colours as fast as it could.

Pixel Output Using DMA Transfer

An overview of the architecture for pixel output using DMA transfer

One of my correspondents looked into the matter and confirmed that we were not imagining this problem, even employing an oscilloscope to check what was happening with the signals from the output pins. The DMA controller would, after starting each fourth pixel, somehow not be able to produce the next pixel in a timely fashion, leaving the current pixel colour unchanged as the monitor traced the picture across the screen. This would cause these pixels to “stretch” until the first pixel from the next group could be emitted.

Initially, I had thought that interrupts were occurring and the CPU, in responding to interrupt conditions and needing to read instructions, was gaining priority over the DMA controller and forcing pixel transfers to wait. Although I had specified a “cell size” of 160 bytes, corresponding to 160 pixels, I was aware that the architecture of the system would be dividing data up into four-byte “words”, and it would be natural at certain points for accesses to memory to be broken up and scheduled in terms of such units. I had therefore wanted to accommodate both the CPU and DMA using an approach where the DMA would not try and transfer everything at once, but without the energy to get this to work, I had left the matter to rest.

A Steady Rhythm

The documentation for these microcontrollers distinguishes between block and cell transfers when describing DMA. Previously, I had noted that these terms could be better described, and I think there are people who are under the impression that cells must be very limited in size and that you need to drive repeated cell transfers using various interrupt conditions to transfer larger amounts. We have seen that this is not the case: a single, large cell transfer is entirely possible, even though the characteristics of the transfer have been less than desirable. (Nevertheless, the documentation focuses on things like copying from one UART peripheral to another, arguably failing to explore the range of possible applications for DMA and to thereby elucidate the mechanisms involved.)

However, given the wide pixel effect, it becomes more interesting to introduce a steady rhythm by using smaller cell sizes and having an external event coordinate each cell’s transfer. With a single, large transfer, only one initiation event needs to occur: that produced by the timer whose period corresponds to that of a single horizontal “scanline”. The DMA channel producing pixels then runs to completion and triggers another channel to turn off the pixel output. In this scheme, the initiating condition for the cell transfer is the timer.

VGA Display Line Structure

The structure of each visible display line in the VGA signal

When using multiple cells to transfer the pixel data, however, it is no longer possible to use the timer in this way. Doing so would cause the initiation of the first cell, but then subsequent cells would only be transferred on subsequent timer events. And since these events only occur once per scanline, this would see a single line’s pixel data being transferred over many scanlines instead (or, since the DMA channel would be updated regularly, we would see only the first pixel on each line being emitted, stretched across the entire line). Since the DMA mechanism apparently does not permit one kind of interrupt condition to enable a channel and another to initiate each cell transfer, we must be slightly more creative.

Fortunately, the solution is to chain two channels, just as we already do with the pixel-producing channel and the one that resets the output. A channel is dedicated to waiting for the line timer event, and it transfers a single black pixel to the screen before handing over to the pixel-producing channel. This channel, now enabled, has its cell transfers regulated by another interrupt condition and proceeds as fast as such a condition may occur. Finally, the reset channel takes over and turns off the output as before.

Pixel Output Using Timed DMA Transfers

An overview of the architecture for pixel output using timed DMA transfers

The nature of the cell transfer interrupt can take various forms, but it is arguably most intuitive to use another timer for this purpose. We may set the limit of such a timer to 1, indicating that it may “wrap around” and thus produce an event almost continuously. And by configuring it to run as quickly as possible, at the frequency of the peripheral clock, it may drive cell transfers at a rate that is quick enough to hopefully produce enough pixels whilst also allowing other activities to occur between each transfer.

VGA Pixel Output (Using Transfer Timer)

Using a timer to initiate pixel transfers for the VGA signal

One thing is worth mentioning here just to be explicit about the mechanisms involved. When configuring interrupts that are used for DMA transfers, it is the actual condition that matters, not the interrupt nor the delivery of the interrupt request to the CPU. So, when using timer events for transfers, it appears to be sufficient to merely configure the timer; it will produce the interrupt condition upon its counter “wrapping around” regardless of whether the interrupt itself is enabled.

With a cell size of a single byte, and with a peripheral clock running at half the speed of the system clock, this approach is sufficient all by itself to yield pixels with consistent widths, with the only significant disadvantage being how few of them can be produced per line: I could only manage in the neighbourhood of 80 pixels! Making the peripheral clock run as fast as the system clock doesn’t help in theory: we actually want the CPU running faster than the transfer rate just to have a chance of doing other things. Nor does it help in practice: picture stability rather suffers.

A picture of the display output from timed DMA transfers

A picture of the display output from timed DMA transfers

Using larger cell sizes, we encounter the wide pixel problem, meaning that the end of a four-byte group is encountered and the transfer hangs on for longer than it should. However, larger cell sizes also introduce byte transfers at a different rate from cell transfers (at the system clock rate) and therefore risk making the last pixel produced by a cell longer than the others, anyway.

Uncovering DMA Transfers

I rather suspect that interruptions are not really responsible for the wide pixels at all, and that it is the DMA controller that causes them. Some discussion with another correspondent explored how the controller might be operating, with transfers perhaps looking something like this:

DMA read from memory
DMA write to display (byte #1)
DMA write to display (byte #2)
DMA write to display (byte #3)
DMA write to display (byte #4)
DMA read from memory
...

This would, by itself, cause a transfer pattern like this:

R____R____R____R____R____R ...
_WWWW_WWWW_WWWW_WWWW_WWWW_ ...

And thus pixel output as follows:

41234412344123441234412344 ...
=***==***==***==***==***== ... (narrow pixels as * and wide pixel components as =)

Even without any extra operations or interruptions, we would get a gap between the write operations that would cause a wider pixel. This would only get worse if the DMA controller had to update the address of the pixel data after every four-byte read and write, not being able to do so concurrently with those operations. And if the CPU were really able to interrupt longer transfers, even to obtain a single instruction to execute, it might then compete with the DMA controller in accessing memory, making the read operations even later every time.

Assuming, then, that wide pixels are the fault of the way the DMA controller works, we might consider how we might want it to work instead:

                     | ...
DMA read from memory | DMA write to display (byte #4)
                 \-> | DMA write to display (byte #1)
                     | DMA write to display (byte #2)
                     | DMA write to display (byte #3)
DMA read from memory | DMA write to display (byte #4)
                 \-> | ...

If only pixel data could be read from memory and written to the output register (and thus the display) concurrently, we might have a continuous stream of evenly-sized pixels. Such things do not seem possible with the microcontroller I happen to be using. Either concurrent reading from memory and writing to a peripheral is not possible or the DMA controller is not able to take advantage of this concurrency. But these observations did give me another idea.

Dual Channel Transfers

If the DMA controller cannot get a single channel to read ahead and get the next four bytes, could it be persuaded to do so using two channels? The intention would be something like the following:

Channel #1:                     Channel #2:
                                ...
DMA read from memory            DMA write to display (byte #4)
DMA write to display (byte #1)
DMA write to display (byte #2)
DMA write to display (byte #3)
DMA write to display (byte #4)  DMA read from memory
                                DMA write to display (byte #1)
...                             ...

This is really nothing different from the above, functionally, but the required operations end up being assigned to different channels explicitly. We would then want these channels to cooperate, interleaving their data so that the result is the combined sequence of pixels for a line:

Channel #1: 1234    1234     ...
Channel #2:     5678    5678 ...
  Combined: 1234567812345678 ...

It would seem that channels might even cooperate within cell transfers, meaning that we can apparently schedule two long transfer cells and have the DMA controller switch between the two channels after every four bytes. Here, I wrote a short test program employing text strings and the UART peripheral to see if the microcontroller would “zip up” the strings, with the following being used for single-byte cells:

Channel #1: "Adoc gi,hlo\r"
Channel #2: "n neaan el!\n"
  Combined: "And once again, hello\r\n"

Then, seeing that it did, I decided that this had a chance of also working with pixel data. Here, every other pixel on a line needs to be presented to each channel, with the first channel being responsible for the pixels in odd-numbered positions, and the second channel being responsible for the even-numbered pixels. Since the DMA controller is unable to step through the data at address increments other than one (which may be a feature of other DMA implementations), this causes us to rearrange each line of pixel data as follows:

 Displayed pixels: 123456......7890
Rearranged pixels: 135...79246...80
                   *       *

Here, the asterisks mark the start of each channel’s data, with each channel only needing to transfer half the usual amount.

Pixel Output Using Timed Dual-Channel Transfers

The architecture involved in employing two pixel data channels with timed transfers

The documentation does, in fact, mention that where multiple channels are active with the same priority, each one is given control in turn with the controller cycling through them repeatedly. The matter of which order they are chosen, which is important for us, seems to be dependent on various factors, only some of which I can claim to understand. For instance, I suspect that if the second channel refers to data that appears before the first channel’s data in memory, it gets scheduled first when both channels are activated. Although this is not a significant concern when just trying to produce a stable picture, it does limit more advanced operations such as horizontal scrolling.

A picture of the display output from timed, dual-channel DMA transfers

A picture of the display output from timed, dual-channel DMA transfers

As you can see, trying this technique out with timed transfers actually made a difference. Instead of only managing something approaching 80 pixels across the screen, more than 90 can be accommodated. Meanwhile, experiments with transfers going as fast as possible seemed to make no real difference, and the fourth pixel in each group was still wider than the others. Still, making the timed transfer mode more usable is a small victory worth having, I suppose.

Parallel Mode Revisited

At the start of my interest in this project, I had it in my mind that I would couple DMA transfers with the parallel mode (or Parallel Master Port) functionality in order to generate a VGA signal. Certain aspects of this, particularly gaps between pixels, made me less than enthusiastic about the approach. However, in considering what might be done to the output signal in other situations, I had contemplated the use of a flip-flop to hold output stable according to a regular tempo, rather like what I managed to achieve, almost inadvertently, when introducing a transfer timer. Until recently, I had failed to apply this idea to where it made most sense: in regulating the parallel mode signal.

Since parallel mode is really intended for driving memory devices and display controllers, various control signals are exposed via pins that can tell these external devices that data is available for their consumption. For our purposes, a flip-flop is just like a memory device: it retains the input values sampled by its input pins, and then exposes these values on its output pins when the inputs are “clocked” into memory using a “clock pulse” signal. The parallel mode peripheral in the microcontroller offers various different signals for such clock and selection pulse purposes.

VGA Output Circuit (Parallel Mode)

The parallel mode circuit showing connections relevant to VGA output (generic connections are not shown)

Employing the PMWR (parallel mode write) signal as the clock pulse, directing the display signals to the flip-flop’s inputs, and routing the flip-flop’s outputs to the VGA circuit solved the pixel gap problem at a stroke. Unfortunately, it merely reminded us that the wide pixel problem also affects parallel mode output, too. Although the clock pulse is able to tell an external component about the availability of a new pixel value, it is up to the external component to regulate the appearance of each pixel. A memory device does not care about the timing of incoming data as long as it knows when such data has arrived, and so such regulation is beyond the capabilities of a flip-flop.

It was observed, however, that since each group of pixels is generated at a regular frequency, the PMWR signalling frequency might be reduced by being scaled by a constant factor. This might allow some pixel data to linger slightly longer in the flip-flop and be slightly stretched. By the time the fourth pixel in a group arrives, the time allocated to that pixel would be the same as those preceding it, thus producing consistently-sized pixels. I imagine that a factor of 8/9 might do the trick, but I haven’t considered what modification to the circuit might be needed or whether it would all be too complicated.

Recognising the Obvious

When people normally experiment with video signals from microcontrollers, one tends to see people writing code to run as efficiently as is absolutely possible – using assembly language if necessary – to generate the video signal. It may only be those of us using microcontrollers with DMA peripherals who want to try and get the DMA hardware to do the heavy lifting. Indeed, those of us with exposure to display peripherals provided by system-on-a-chip solutions feel almost obliged to do things this way.

But recent discussions with one of my correspondents made me reconsider whether an adequate solution might be achieved by just getting the CPU to do the work of transferring pixel data to the display. Previously, another correspondent had indicated that it this was potentially tricky, and that getting the timings right was more difficult than letting the hardware synchronise the different mechanisms – timer and DMA – all by itself. By involving the CPU and making it run code, the display line timer would need to generate an interrupt that would be handled, causing the CPU to start running a loop to copy data from the framebuffer to the output port.

Pixel Output Using CPU-Driven Transfers

An overview of the architecture with the CPU driving transfers of pixel data

This approach puts us at the mercy of how the CPU handles and dispatches interrupts. Being somewhat conservative about the mechanisms more generally available on various MIPS-based products, I tend to choose a single interrupt vector and then test for the different conditions. Since we need as little variation as possible in the latency between a timer event occurring and the pixel data being generated, I test for that particular event before even considering anything else. Then, a loop of the following form is performed:

    for (current = line_data; current < end; current++)
        *output_port = *current;

Here, the line data is copied byte by byte to the output port. Some adornments are necessary to persuade the compiler to generate code that writes the data efficiently and in order, but there is nothing particularly exotic required and GCC does a decent job of doing what we want. After the loop, a black/reset pixel is generated to set the appropriate output level.

One concern that one might have about using the CPU for such long transfers in an interrupt handler is that it ties up the CPU, preventing it from doing other things, and it also prevents other interrupt requests from being serviced. In a system performing a limited range of activities, this can be acceptable: there may be little else going on apart from updating the display and running programs that access the display; even when other activities are incorporated, they may accommodate being relegated to a secondary status, or they may instead take priority over the display in a way that may produce picture distortion but only very occasionally.

Many of these considerations applied to systems of an earlier era. Thinking back to computers like the Acorn Electron – a 6502-based system that formed the basis of my first sustained experiences with computing – it employs a display controller that demands access to the computer’s RAM for a certain amount of the time dedicated to each video frame. The CPU is often slowed down or even paused during periods of this display controller’s activity, making programs slower than they otherwise would be, and making some kinds of input and output slightly less reliable under certain circumstances. Nevertheless, with certain kinds of additional hardware, the possibility is present for such hardware to interrupt the CPU and to override the display controller that would then produce “snow” or noise on the screen as a consquence of this high-priority interruption.

Such issues cause us to consider the role of the DMA controller in our modern experiment. We might well worry about loading the CPU with lots of work, preventing it from doing other things, but what if the DMA controller dominates the system in such a way that it effectively prevents the CPU from doing anything productive anyway? This would be rather similar to what happens with the Electron and its display controller.

So, evaluating a CPU-driven solution seems to be worthwhile just to see whether it produces an acceptable picture and whether it causes unacceptable performance degradation. My recent correspondence also brought up the assertion that the RAM and flash memory provided by PIC32 microcontrollers can be accessed concurrently. This would actually mitigate contention between DMA and any programs running from flash memory, at least until the point that accesses to RAM needed to be performed by those programs, meaning that we might expect some loss of program performance by shifting the transfer burden to the CPU.

(Again, I am reminded of the Electron whose ROM could be accessed at full speed but whose RAM could only be accessed at half speed by the CPU but at full speed by the display controller. This might have been exploited by software running from ROM, or by a special kind of RAM installed and made available at the right place in memory, but the 6502 favours those zero-page instructions mentioned earlier, forcing RAM access and thus contention with the display controller. There were upgrades to mitigate this by providing some dedicated memory for zero page, but all of this is really another story for another time.)

Ultimately, having accepted that the compiler would produce good-enough code and that I didn’t need to try more exotic things with assembly language, I managed to produce a stable picture.

A picture of the display output from CPU-driven pixel data transfers

A picture of the display output from CPU-driven pixel data transfers

Maybe I should have taken this obvious path from the very beginning. However, the presence of DMA support would have eventually caused me to investigate its viability for this application, anyway. And it should be said that the performance differences between the CPU-based approach and the DMA-based approaches might be significant enough to argue in favour of the use of DMA for some purposes.

Observations and Conclusions

What started out as a quick review of my earlier work turned out to be a more thorough study of different techniques and approaches. I managed to get timed transfers functioning, revisited parallel mode and made it work in a fairly acceptable way, and I discovered some optimisations that help to make certain combinations of techniques more usable. But what ultimately matters is which approaches can actually be used to produce a picture on a screen while programs are being run at the same time.

To give the CPU more to do, I decided to implement some graphical operations, mostly copying data to a framebuffer for its eventual transfer as pixels to the display. The idea was to engage the CPU in actual work whilst also exercising access to RAM. If significant contention between the CPU and DMA controller were to occur, the effects would presumably be visible on the screen, potentially making the chosen configuration unusable.

Although some approaches seem promising on paper, and they may even produce promising results when the CPU is doing little more than looping and decrementing a register to introduce a delay, these approaches may produce less than promising results under load. The picture may start to ripple and stretch, and under “real world” load, the picture may seem noisy and like a badly-tuned television (for those who remember the old days of analogue broadcast signals).

Two approaches seem to remain robust, however: the use of timed DMA transfers, and the use of the CPU to do all the transfer work. The former is limited in terms of resolution and introduces complexity around the construction of images in the framebuffer, at least if optimised as much as possible, but it seems to allow more work to occur alongside the update of the display, and the reduction in resolution also frees up RAM for other purposes for those applications that need it. Meanwhile, the latter provides the resolution we originally sought and offers a straightforward framebuffer arrangement, but it demands more effort from the CPU, slowing things down to the extent that animation practically demands double buffering and thus the allocation of even more RAM for display purposes.

But both of these seemingly viable approaches produce consistent pixel widths, which is something of a happy outcome given all the effort to try and fix that particular problem. One can envisage accommodating them both within a system given that various fundamental system properties (how fast the system and peripheral clocks are running, for example) are shared between the two approaches. Again, this is reminiscent of microcomputers where a range of display modes allowed developers and users to choose the trade-off appropriate for them.

A demonstration of text plotting at a resolution of 160x128

A demonstration of text plotting at a resolution of 160x128

Having investigated techniques like hardware scrolling and sprite plotting, it is tempting to keep developing software to demonstrate the techniques described in this article. I am even tempted to design a printed circuit board to tidy up my rather cumbersome breadboard arrangement. And perhaps the library code I have written can be used as the basis for other projects.

It is remarkable that a home-made microcontroller-based solution can be versatile enough to demonstrate aspects of simple computer systems, possibly even making it relevant for those wishing to teach or to learn about such things, particularly since all the components can be connected together relatively easily, with only some magic happening in the microcontroller itself. And with such potential, maybe this seemingly pointless project might have some meaning and value after all!

Update

Although I can’t embed video files of any size here, I have made a “standard definition” video available to demonstrate scrolling and sprites. I hope it is entertaining and also somewhat illustrative of the kind of thing these techniques can achieve.

Shared-Mode Executables in L4Re for MIPS-Based Devices

Sunday, July 8th, 2018

I have been meaning to write about my device driver experiments with L4Re, following on from my porting exercises, but that exercise took me along various routes and I haven’t yet got back to documenting all of them. Meanwhile, one thing that did start to bother me was how much space the software was taking up when compiled, linked and ready to deploy.

Since each of my device drivers is a separate program, and since each one may be linked to various libraries, they each started to contribute substantially to the size of the resulting file – the payload – needing to be transferred to the device. At one point, I had to resize the boot partition on the memory card used by the Letux 400 notebook computer to make the payload fit in the available space.

The work done to port L4Re to the MIPS Creator CI20 had already laid the foundations for functioning payloads, and once the final touches were put in place to support the peculiarities of the Ingenic JZ4780 system-on-a-chip, it was possible to run both the conventional “hello” example which is statically linked to its libraries, as well as a “shared-hello” example which is dynamically linked to its libraries. The latter configuration of the program results in a smaller executable program and thus a smaller payload.

So it seemed clear that I might be able to run my own programs on the Letux 400 or Ben NanoNote with similar reductions in payload size. Unfortunately, nothing ever seems to be as straightforward as it ought to be.

Exceptional Obstructions and Observations

Initially, I set about trying one of my own graphical examples with the MODE variable set to “shared” in its Makefile. This, upon powering up, merely indicated that it had not managed to start up properly. Instead of a blank screen, the viewports set up by the graphical multiplexer, Mag, were still active and showing their usual blankness. But these regions did not then change in any way when I pressed keys on the keyboard (which is functionality that I will hopefully get round to describing in another article).

I sought some general advice from the l4-hackers mailing list, but quickly realised that to make any real progress, I would need a decent way of accessing the debugging output produced by the dynamic linker. This took me on a diversion that led to my debugging capabilities being strengthened with the availability of a textual output console on the screen of my devices. I still don’t like the idea of performing hardware modifications to get access to the serial console, so this is a useful and welcome alternative.

Having switched out the “hello” program with the “shared-hello” program in the system configuration and module list demonstrating the framebuffer terminal, I deployed the payload and powered up, but I did not get the satisfying output of the program operating normally. Instead, the framebuffer terminal appeared and rewarded me with the following message:

L4Re: rom/ex_hello_shared: Unhandled exception: PC=0x800000 PFA=8d7a LdrFlgs=0

This isn’t really the kind of thing you want to see. Having not had to debug L4Re or Fiasco.OC in any serious fashion for a couple of months, I was out of practice in considering the next step, but fortunately some encouragement arrived in a private e-mail from Jean Wolter. This brought the suggestion of triggering the kernel debugger, but since this requires serial console access, it wasn’t a viable approach. But another idea that I could use involved writing out a bit more information in the routine that was producing this output.

The message in question originates in the pkg/l4re-core/l4re_kernel/server/src/region.cc file, within the Region_map::op_exception method. The details it produces are rather minimal and generic: the program counter (PC) tells us where the exception occurred; the loader flags (LdrFlags) presumably tell us about the activity of the library loader; the mysterious “PFA” is supposedly the page fault address but it actually seemed to be the stack pointer address on these MIPS-based systems.

On their own, these details are not particularly informative, but I suppose that more useful information could quickly become fairly specific to a particular architecture. Jean suggested looking at the structure describing the exception state, l4_exc_regs_t (defined with MIPS-specific members in pkg/l4re-core/l4sys/include/ARCH-mips/utcb.h), to see what else I might dig up. This I did, generating the following:

pc=0x800000
gp=0x82dd30
sp=0x8d7a
ra=0x802f6c
cause=0x1000002c

A few things interested me, thus motivating my choice of registers to dump. The global pointer (gp) register tells us about symbols in the problematic code, and I felt that having once made changes to the L4Re sources – way back in the era of getting the CI20 to run GCC-generated code – so that another register (t9) would be initialised correctly, this so that the gp register would be set up correctly within programs, it was entirely possible that I had rather too enthusiastically changed something that was now causing a problem.

The stack pointer (sp) is useful to check, just to see if it located in a sensible region of memory, and here I discovered that this seemed to be the same as the “PFA” number. Oddly, the “PFA” seems to occupy the same place in the exception structure as any “bad virtual address” featuring in an address exception, and so I started to suspect that maybe the stack pointer was referencing the wrong part of memory. But this was partially ruled out by examining the value of the stack pointer in the “hello” example, which appeared to reference broadly the same part of memory. And, of course, the “hello” example works just fine.

In fact, the cause register indicated another kind of exception entirely, and it was one I was not really expecting: a “coprocessor unusable” exception indicating that coprocessor 1, typically a floating point arithmetic unit, was being illegally requested by an instruction. Here is how I interpreted that register’s value:

hex value   binary value
1000002c == 00010000000000000000000000101100
              --                     -----
              CE                     ExcCode

=> CE == 1; ExcCode == 11 (coprocessor unusable)
=> coprocessor 1 unusable

Now, as I may have mentioned before, the hardware involved in this exercise does not support floating point instructions itself, and this is why I have configured compilers to use “soft-float” (software-based floating point arithmetic) support. It meant that I had to find places that might have wanted to use floating point instructions and eliminate those instructions somehow. Fortunately, only code generated by the compiler was likely to contain such instructions. But now I wondered if there weren’t some instructions of this nature lurking in places I hadn’t checked.

I had also thought to check the return address (ra) register. This tells us where the processor will jump to when it has finished executing the current routine, and since this is usually a matter of “returning” somewhere, it tells us something about the code that was being executed before the problematic routine was called. I figured that the work being done before the exception was probably going to be more important than the exception itself.

Floating Point Magic

Another debugging suggestion that now became unavoidable was to inspect the erroneous instruction. I noted above that this instruction was causing the processor to signal an illegal attempt to use an unusable – actually completely unavailable – coprocessor. Writing a numeric representation of the instruction to the display provided me with the following hexadecimal (base 16) value:

464c457f

This can be interpreted as follows in binary, with groups of bits defined for interpretation according to the MIPS instruction set architecture, and with tentative interpretations of these groups provided beneath:

010001 10010  01100 01000 10101 111111
COP1   rs/fmt rt/ft rd/fs       C.ABS.NGT

The first group of bits is the opcode field which is interpreted as a coprocessor 1 (COP1) opcode. Should we then wish to consider what the other groups mean, we might then examine the final group which could indicate a comparison instruction. However, this becomes rather hypothetical since the processor will most likely interpret the opcode field and then decide that it cannot handle the instruction.

So, I started to look for places where the instruction might have been written, but no obvious locations were forthcoming. One peculiar aspect of all this is that the location of the instruction is at a rather “clean” location – 0x800000 – and some investigations indicated that this is where the library containing the problematic code gets loaded. I actually don’t remember precisely how I figured this out, but I think it was as follows.

I had looked at linker scripts that might give some details of the location of program objects, and one of them (pkg/l4re-core/ldscripts/ARCH-mips/main_dyn.ld) seemed to be related. It gave an address for the code of 0x400000. This made me think that some misconfiguration or erroneous operation was putting the observed code somewhere it shouldn’t be. But changing this address in the linker script just gave another exception at 0x400000, meaning that I had disrupted something that was intentional and probably working fine.

Meanwhile, emitting the t9 register’s value from the exception state yielded 0x800000, indicating that the calling routine had most likely jumped straight to that address, not to another address with execution having then proceeded normally until reaching the exception location. I decided to look at the instructions around the return address, these most likely being the ones that had set up the call to the exception location. Writing these locations out gave me some idea about the instructions involved. Below, I provide the stored values and their interpretations as machine instructions:

8f998250 # lw $t9, -32176($gp)
24a55fa8 # addiu $a1, $a1, 0x5fa8
0320f809 # jalr $t9
24844ee4 # addiu $a0, $a0, 0x4ee4
8fbc0010 # lw $gp, 16($sp)

One objective of doing this, apart from confirming that a jump instruction (jalr) was involved, with the t9 register being employed as is the convention with MIPS code, was to use the fragment to identify the library that was causing the error. A brute-force approach was employed here, generating “object dumps” from the library files and writing them out as files in a new directory:

mkdir tmpdir
for FILENAME in mybuild/lib/mips_32/l4f/* ; do
    mipsel-linux-gnu-objdump -d "$FILENAME" > tmpdir/`basename "$FILENAME"`
done

The textual dump files were then searched for the instruction values using grep, narrowing down the places where these instructions were found in consecutive locations. This yielded the following code, found in the libld-l4.so library:

    2f5c:       8f998250        lw      t9,-32176(gp)
    2f60:       24a55fa8        addiu   a1,a1,24488
    2f64:       0320f809        jalr    t9
    2f68:       24844ee4        addiu   a0,a0,20196
    2f6c:       8fbc0010        lw      gp,16(sp)

The integer operands for the addiu instructions are the same, of course, just being shown as decimal rather than hexadecimal values. Now, we previously saw that the return address (ra) register had the value 0x802f6c. When a MIPS processor executes a jump instruction, it will also fetch the following instruction and execute it, this being a consequence of the way the processor architecture is designed.

So, the instruction after the jump, residing in what is known as the “branch delay slot” is not the instruction that will be visited upon returning from the called routine. Instead, it is the instruction after that. Here, we see that the return address from the jump at location 0x2f64 would be two locations later at 0x2f6c. This provides a kind of realisation that the program object – the libld-l4.so library – is positioned in memory at 0x800000: 0x2f6c added to 0x800000 gives the value of ra, 0x802f6c.

And this means that the location of the problematic instruction – the cause of our exception – is the first location within this object. Anyone with any experience of this kind of software will have realised by now that this doesn’t sound like a healthy situation: the first location within a library is not actually going to be code because these kinds of objects are packaged up in a way that permits their manipulation by other programs.

So what is the first location of a library used for? Since such objects employ the Executable and Linkable Format (ELF), we can take a look at some documentation. And we see that the first location is used to identify the kind of object, employing a “magic number” for the purpose. And that magic number would be…

464c457f

In the little-endian arrangement employed by this processor, the stored bytes are as follows:

7f
45 ('E')
4c ('L')
46 ('F')

The value was not a floating point instruction at all, but the magic number at the start of the library object! It was something of a coincidence that such a value would be interpreted as a floating point instruction, an accidentally convenient way of signalling something going badly wrong.

Missing Entries

The investigation now started to focus on how the code trying to jump to the start of the library had managed to get this incorrect address and what it was trying to do by jumping to it. I started to wonder if the global pointer (gp), whose job it is to reference the list of locations of program routines and other global data, might have been miscalculated such that attempts to load the addresses of routines would then be failing with data being fetched from the wrong places.

But looking around at code fragments where the gp register was being calculated, they seemed to look set to calculate the correct values based on assumptions about other registers. For example, from the object dump for libld-l4.so:

00002780 <_ftext>:
    2780:       3c1c0003        lui     gp,0x3
    2784:       279cb5b0        addiu   gp,gp,-19024
    2788:       0399e021        addu    gp,gp,t9

Assuming that the processor has t9 set to 0x2780 and then jumps to the value of t9, as is the convention, the following calculation is then performed:

gp = 0x30000 (since lui loads the "upper" half-word)
gp = gp - 19024 = 0x30000 - 19024 = 0x2b5b0
gp = gp + t9 = 0x2b5b0 + 0x2780 = 0x2dd30

Using the nm tool, which tells us about symbols in program objects, it was possible to check this value:

mipsel-linux-gnu-nm -n mybuild/lib/mips_32/l4f/libld-l4.so

This shows the following at the end of the output:

0002dd30 d _gp

Also appearing somewhat earlier in the output is this, telling us where the table of symbols starts (as well as the next thing in the file):

00025d40 a _GLOBAL_OFFSET_TABLE_
00025f90 g __dso_handle

Some digging around in the L4Re source code gave a kind of confirmation that the difference between _gp and _GLOBAL_OFFSET_TABLE_ was to be expected. Here is what I found in the pkg/l4re-core/uclibc/lib/contrib/uclibc/ldso/ldso/mips/elfinterp.c file:

#define OFFSET_GP_GOT 0x7ff0

If gp, when recalculated in other places, ended up getting the same value, there didn’t seem to be anything wrong with it. Some quick inspections of neighbouring calculations indicated that this wasn’t likely to be the problem. But what about the values used in conjunction with gp? Might they be having an effect? In the case of the erroneous jump, the following calculation is involved:

lw t9,-32176(gp) => load word into t9 from the location at gp - 32176
                 => ...               from 0x2dd30 - 32176
                 => ...               from 0x25f80

The calculated address, 0x25f80, is after the start of _GLOBAL_OFFSET_TABLE_ providing entries for program routines and other things, which is a good sign, but what is perhaps more troubling is how far after the start of the table such a value is. In the above output, another symbol (__dso_handle) indicates something that is located at the end of the table. Now, although its address is still greater than the one computed above, meaning that the computation does not cause us to stray off the end of the table, the computed address is suspiciously close to the end.

There was nothing else to do than to have a look at the table contents itself, and here it was rather useful to have a way of displaying a number of values on the screen. At this point, we have to note that the addresses in use in the running system are adjusted according to the start of the loaded object, so that the table is positioned at 0x25d40 in the object dump, but in the running system we would see 0x800000 + 0x25d40 and thus 0x825d40 instead.

What I saw was that the table contained entries that varied in the expected way right up until 0x825f60 (corresponding to 0x25f60 in the object dump) being only 0x30 (or 48 bytes, or 12 entries) before the end of the table, but then all remaining entries starting at 0x825f64 (corresponding to 0x25f64) yielded a value of 0x800000, apart from 0x825f90 (corresponding to 0x25f90, right at the end of the table) which yielded itself.

Since the calculated address above (0x25f80, adjusted to 0x825f80 in the running system) lies in this final region, we now know the origin of this annoying 0x800000: it comes from entries at the end of the table that do not seem to hold meaningful values. Indeed, the object dump for the library seemed to skip over this region of the table entirely, presumably because it was left uninitialised. And using the readelf tool with the –relocs option to show “relocations”, which applies to this table, it appeared that the last entries rather confirmed my observations:

00025d34  00000003 R_MIPS_REL32
00025f90  00000003 R_MIPS_REL32

Clearly, something is missing from this table. But since something has to adjust the contents of the table to add the “base address”, 0x800000, to the entries in order to provide valid addresses within the running program, what started to intrigue me was whether the code that performed this adjustment had any idea about these missing entries, and how this code might be related to the code causing the exception situation.

Routines and Responsibilities

While considering the nature of the code causing the exception, I had been using the objdump utility with the -d (disassemble) and -D (disassemble all) options. These provide details of program sections, code routines and the machine instructions themselves. But Jean pointed out that if I really wanted to find out which part of the source code was responsible for producing certain regions of the program, I might use a combination of options: -d, -l (line numbers) and -S (source code). This was almost a revelation!

However, the code responsible for the jump to the start of the library resisted such measures. A large region of code appeared to have no corresponding source, suggesting that it might be generated. Here is how it starts:

_ftext():
    2dac:       00000000        nop
    2db0:       3c1c0003        lui     gp,0x3
    2db4:       279caf80        addiu   gp,gp,-20608
    2db8:       0399e021        addu    gp,gp,t9
    2dbc:       8f84801c        lw      a0,-32740(gp)
    2dc0:       8f828018        lw      v0,-32744(gp)

There is no function defined in the source code with the name _ftext. However, _ftext is defined in the linker script (in pkg/l4re-core/ldscripts/ARCH-mips/main_rel.ld) as follows:

  .text           :
  {
    _ftext = . ;
    *(.text.unlikely .text.*_unlikely .text.unlikely.*)
    *(.text.exit .text.exit.*)
    *(.text.startup .text.startup.*)
    *(.text.hot .text.hot.*)
    *(.text .stub .text.* .gnu.linkonce.t.*)
    /* .gnu.warning sections are handled specially by elf32.em.  */
    *(.gnu.warning)
    *(.mips16.fn.*) *(.mips16.call.*)
  }

If you haven’t encountered linker scripts before, then you probably don’t want to spend too much time looking at this, linker scripts being frustratingly terse and arcane, but the essence of the above is that a bunch of code is stuffed into the .text section, with _ftext being assigned the address of the start of all this code. Now, _ftext in the linker script corresponds to a particular label in the object dump (which we saw earlier was positioned at 0x2780) whereas the _ftext function in the code occurs later (at 0x2dac, above). After the label but before the function is code whose source is found by objdump.

So I took the approach of removing things from the linker script, ultimately removing everything from the .text section apart from the assignment to _ftext. This removed the annotated regions of the code and left me with only the _ftext function. It really did appear that this was something the compiler might be responsible for. But where would I find the code responsible?

One hint that was present in the _ftext function code was the use of another identified function, __cxa_finalize. Searching the GCC sources for code that might use it led me to the libgcc sources and to code that invokes destructor functions upon program exit. This wasn’t really what I was looking for, but the file containing it (libgcc/crtstuff.c) would prove informative.

Back to the Table

Jean had indicated that there might be a difference in output between compilers, and that certain symbols might be produced by some but not by others. I investigated further by using the readelf tool with the -a option to show almost everything about the library file. Here, the focus was on the global offset table (GOT) and information about the entries. In particular, I wanted to know more about the entry providing the erroneous 0x800000 value, located at (gp – 32176). In my output I saw the following interesting thing:

 Global entries:
   Address     Access  Initial Sym.Val. Type    Ndx Name
  00025f80 -32176(gp) 00000000 00000000 FUNC    UND __register_frame_info

This seems to tell us what the program expects to find at the location in question, and it indicates that the named symbol is presumably undefined. There were some other undefined symbols, too:

_ITM_deregisterTMCloneTable
_ITM_registerTMCloneTable
__deregister_frame_info

Meanwhile, Jean was seeing symbols with other names:

__register_frame_info_base
__deregister_frame_info_base

During my perusal of the libgcc sources, I had noticed some of these symbols being tested to see if they were non-zero. For example:

  if (__register_frame_info)
    __register_frame_info (__EH_FRAME_BEGIN__, &object);

These fragments of code appear to be located in functions related to program initialisation. And it is also interesting to note that back in the library code, after the offending table entry has been accessed, there are tests against zero:

    2f34:       3c1c0003        lui     gp,0x3
    2f38:       279cadfc        addiu   gp,gp,-20996
    2f3c:       0399e021        addu    gp,gp,t9
    2f40:       27bdffe0        addiu   sp,sp,-32
    2f44:       8f828250        lw      v0,-32176(gp)
    2f48:       afbc0010        sw      gp,16(sp)
    2f4c:       afbf001c        sw      ra,28(sp)
    2f50:       10400007        beqz    v0,2f70 <_ftext+0x7f0>

Here, gp gets set up correctly, v0 is set to the value of the table entry, which we now believe refers to __register_frame_info, and the beqz instruction tests this value against zero, skipping ahead if it is zero. Does that not sound a bit like the code shown above? One might think that the libgcc code might handle an uninitialised table entry, and maybe it is intended to do so, but the table entry ends up getting adjusted to 0x800000, presumably as part of the library loading process.

I think that the most relevant function here for the adjustment of these entries is _dl_perform_mips_global_got_relocations which can be found in the pkg/l4re-core/uclibc/lib/contrib/uclibc/ldso/ldso/ldso.c file as part of the L4Re C library code. It may well have changed the entry from zero to this erroneous non-zero value, merely because the entry lies within the table and is assumed to be valid.

So, as a consequence, the libgcc code acts as if it has a genuine __register_frame_info function to call, and doing so causes the jump to the start of the library object and the exception. Maybe the code is supposed to be designed to handle missing symbols, those symbols potentially being deliberately omitted, but it doesn’t function correctly under these particular circumstances.

Symbol Restoration

However, despite identifying this unfortunate interaction between C library and libgcc, the matter of a remedy remained unaddressed. What was I to do about these missing symbols? Were they supposed to be there? Was there a way to tell libgcc not to expect them to be there at all?

In attempting to learn a bit more about the linking process, I had probably been through the different L4Re packages several times, but Jean then pointed me to a file I had seen before, perhaps before I had needed to think about these symbols at all. It contained “empty” definitions for some of the symbols but not for others. Maybe the workaround or even the solution was to just add more definitions corresponding to the symbols the program was expecting? Jean thought so.

So, I added a few things to the file (pkg/l4re-core/ldso/ldso/fixup.c):

void __deregister_frame_info(void);
void __register_frame_info(void);
void _ITM_deregisterTMCloneTable(void);
void _ITM_registerTMCloneTable(void);

void __deregister_frame_info(void) {}
void __register_frame_info(void) {}
void _ITM_deregisterTMCloneTable(void) {}
void _ITM_registerTMCloneTable(void) {}

I wasn’t confident that this would fix the problem. After all the investigation, adding a few lines of trivial code to one file seemed like too easy a way to fix what seemed like a serious problem. But I checked the object dump of the library, and suddenly things looked a bit more reasonable. Instead of referencing an uninitialised table entry, objdump was able to identify the jump target as __register_frame_info:

    2e14:       8f828040        lw      v0,-32704(gp)
    2e18:       afbc0010        sw      gp,16(sp)
    2e1c:       afbf001c        sw      ra,28(sp)
    2e20:       10400007        beqz    v0,2e40 <_ftext+0x7f0>
    2e24:       8f85801c        lw      a1,-32740(gp)
    2e28:       8f84803c        lw      a0,-32708(gp)
    2e2c:       8f998040        lw      t9,-32704(gp)
    2e30:       24a55fa8        addiu   a1,a1,24488
    2e34:       04111c39        bal     9f1c <__register_frame_info>

Of course, the code had been reorganised and so things were no longer in quite the same places, but in the above, (gp – 32704) is actually a reference to __register_frame_info, and although this value gets tested against zero as before, we can see that enough is now known about the previously-missing symbol that a branch directly to the location of the function has been incorporated, rather than a jump to the address stored in the table.

And sure enough, powering up the Letux 400 produced the framebuffer terminal showing the expected output:

Hi World! (shared)

It had been a long journey for such a modest reward, but thanks to Jean’s help and a bit of perseverance, I got there in the end.

L4Re: Textual Debugging Output on the Framebuffer

Monday, May 21st, 2018

I have actually been in the process of drafting another article about writing device drivers to run within the L4 Runtime Environment (L4Re) on top of the Fiasco.OC microkernel, this being for the Ben NanoNote and Letux 400 notebook computers. That article started to trail behind a lot of the work being done, and there are a few loose ends to be tied up before I can finish it.

Meanwhile, on the way towards some kind of achievement with L4Re, confounded somewhat by the sometimes impenetrable APIs, I managed to eventually get something working that I had thought would have been one of the first things to demonstrate. When initially perusing the range of software in the “pkg” directory within the L4Re distribution, I saw a package called “fbterminal” providing a terminal program that shows itself on the framebuffer (or display).

I imagined being able to launch this on top of the graphical user interface multiplexer, Mag, and then have the “hello” program provide some output to this terminal. I even imagined having the terminal accept input from the keyboard, but we aren’t quite at that point, and that is where my other article comes in. Of course, I initially had no idea how to achieve this, and there needed to be a lot of work put in just to get back to this particular point of entry.

Now, however, the act of launching fbterminal and have it work is fairly straightforward. A few additional packages are required, but the framebuffer works satisfactorily as far as the other components are concerned, and the result will be a blank region of the screen with the terminal showing precisely nothing. Obviously, we want it to show something in order to confirm that it is working. I had to get used to seeing this blank terminal for a while.

The intended companion to fbterminal for testing purposes is the hello program which merely writes output to what might be described as a logging destination. This particular output channel is usually the serial console for the device, which meant that when porting the system to the Ben and the Letux, the hello program was of no use to me. But now, with a framebuffer to show things on, and with a terminal that might be able to accept output from other things, it becomes interesting to see if the hello program can be persuaded to send its output elsewhere.

It was useful to investigate how the output from the hello program actually makes its way to its destination. Since it uses standard C library functions, there has to be a mechanism for those functions to use. As far as I know, these would typically involve various system calls concerning files and streams. A perusal of the sources dredged up an interesting symbol called “__rtld_l4re_env_posix_vfs_ops”. Further investigation led me to the L4Re virtual filesystem (Vfs) functionality and the following interesting files:

  • pkg/l4re-core/l4re_vfs/include/vfs.h
  • pkg/l4re-core/l4re_vfs/include/impl/vfs_impl.h

And these in turn led me to the virtual console (Vcon) functionality:

  • pkg/l4re-core/l4re_vfs/include/impl/vcon_stream.h
  • pkg/l4re-core/l4re_vfs/include/impl/vcon_stream_impl.h

It seems that standard output from the hello program goes via the standard C calls and Vfs functions and is packaged up and sent using the Vcon mechanisms to the logging destination, which is typically provided by the root task, Moe. Given that fbterminal understands the Vcon protocol and acts as a console server, there appeared to be some potential in looking at Vcon mechanisms more closely. It seemed that fbterminal might be able to take the place of Moe.

Indeed, the documentation offers some clues. In the description of the init process, Ned, a mention is made of a program loader configuration parameter called “log_fab” that indicates an object that can create a suitable logging destination. When starting a program, the program loader creates such an object using “log_fab” and presents it to the new program as a capability (or object reference).

However, this is not quite what we want because we don’t need anything else to be created: we already have fbterminal ready for us to use. I suppose something could be conjured up to act as a factory and provide a fbterminal instance, and maybe this is not too arduous in the Lua-based configuration environment used by Ned, but I wanted a more direct solution.

Contemplating this, I went and rediscovered the definitions used by Ned to support its configuration scripting (found in pkg/l4re-core/ned/server/src/ned.lua). Here, the workings of the “log_fab” mechanism can be found and studied. But what I started to favour was a way of just indicating a capability to the hello program and not have the loader create something else. This required a simple edit to one of the functions:

function App_env:log()
  Class.check(self, App_env);
  if self.loader.log_fab == nil or self.loader.log_fab.create == nil then
    error ("Starting an application without valid log factory", 4);
  end
  return self.loader.log_fab:create(Proto.Log, table.unpack(self.log_args));
end

Here, we want to ignore “log_fab” and just have our existing capability used instead. So, I introduced another clause to the if statement:

  if self.log_cap then
    return self.log_cap
  elseif self.loader.log_fab == nil or self.loader.log_fab.create == nil then
    error ("Starting an application without valid log factory", 4);
  end

Now, if we specify “log_cap” when starting a program, it should want to direct logging messages to the referenced object instead. So, with this available to us, it becomes possible to adjust the way the hello program is started. First of all, we define the way fbterminal is set up and started:

local term = l:new_channel();

l:start({
    caps = {
      fb = mag_caps.svc:create(L4.Proto.Goos, "g=320x230+0+0", "barheight=10"),
      term = term:svr(),
    },
  },
  "rom/fbterminal");

Since fbterminal needs to “export” its console abilities using a capability called “term”, this needs to be indicated in the “caps” table. (It doesn’t matter what the local variable for the channel is called.) So, the hello program is defined accordingly:

l:start({
    log_cap = term,
  },
  "rom/hello");

Here, we make use of “log_cap” and allow the output to be directed to the terminal that has already been started. And the result is this:

fbterminal on the Ben NanoNote showing the hello program's output

fbterminal on the Ben NanoNote showing the hello program's output

And at long last, it becomes possible to see what programs are printing out to the log!

Extending L4Re/Fiasco.OC to the Letux 400 Notebook Computer

Wednesday, April 18th, 2018

In my summary of the port of L4Re and Fiasco.OC to the Ben NanoNote, I remarked that progress had been made on supporting other products and hardware peripherals. In fact, such progress occurred more rapidly than I had thought possible, and I have been able to extend the work to support the Letux 400 notebook computer. It is perhaps worth describing the Letux 400 in a bit more detail because it has an interesting place in the history of netbook computers.

Some History

Back in the early 21st century, laptop computers were becoming increasingly popular at the expense of desktop computers, but as laptops began to take the place of desktops in homes and workplaces, this gradually led each successive generation of laptops to sacrifice portability and affordability in favour of larger, faster, higher-resolution screens and general hardware specifications more competitive with the desktop offerings they sought to replace. Laptops were becoming popular but also bigger, heavier and more expensive.

Things took an interesting turn in 2006 with the introduction of the XO-1 from the One Laptop per Child (OLPC) initiative. With rather different goals to those of the mainstream laptop vendors, the focus was to deliver a relatively-inexpensive yet robust portable computer for use by schoolchildren, many of whom might be living in places with limited infrastructure where increasingly power-hungry mainstream laptops would have been unsuitable, even unusable.

One unexpected consequence of the introduction of the XO-1 was the revival in interest in modestly-performing portable computing hardware. People were actually interested in a computer that did the things they needed, rather than having to buy something designed for gamers, software developers, or corporate “power users” (of both the pretend and genuine kinds). Rather than having to haul increasingly big and heavy laptops and all the usual accessories in a big dedicated bag, they liked the idea of slipping a smaller, lighter device into their everyday bag, as had probably been the idea with subnotebooks while they were still a thing.

Thus, the Asus Eee PC came about, regarded as the first widely-available netbook of recent times (acknowledging the earlier Psion netBook, of course), bringing with it the attention of large-volume manufacturers and economies of scale. For “lightweight tasks”, netbooks were enough for many people: a phenomenon that found itself repeating with tablets, particularly as recreational usage of technology became more important to buyers and evolved in certain ways.

Now, one thing that had been a big part of the OLPC initiative’s objectives was a $100 price point. At first, despite fairly radical techniques being used to reduce cost, and despite the involvement of a major original equipment manufacturer in the production of the XO-1, that price point of $100 was out of reach. Even the Eee PC retailed for a few hundred dollars.

This is where a product known as the Skytone Alpha 400 enters the picture. Some vendors, rebranding this product, offered it as possibly the first $100 laptop – or netbook – to be made available for sale. One of the vendors offers it as the Letux 400, and it has been available for as little as €125 during its retail lifespan. Noting that it has rather similar hardware to the Ben NanoNote, but has a more conventional physical profile and four times as much RAM, my brother bought one to investigate a few years ago. That is how I eventually ended up embarking on this experiment.

Extending Recent Work

There are many similarities between the JZ4720 system-on-a-chip (SoC) used in the Ben and the JZ4730 used in the Letux 400. However, it can be said that the JZ4720 is much better understood. The JZ4740 and closely-related devices like the JZ4720 have appeared in a number of different devices, documentation has surfaced for these products, and vendor source code has always been available, typically using or implicitly documenting most of the hardware.

In contrast, limited documentation is known to exist for the JZ4730, and the available vendor source code has not always described every detail of the hardware, even though the essential operations and register details appear to be present. Having looked at the Linux kernel sources that support the JZ4730, together with U-Boot source code, the similarities and differences between the JZ4720 and JZ4730 began to take shape in my mind.

I took an optimistic approach that mostly paid off. The Fiasco.OC kernel needs augmenting with the details of the JZ4730, but these are similar in some ways to the JZ4720 and familiar otherwise. For instance, the JZ4730 has a 32-bit “operating system timer” (OST) that curiously does not appear in the JZ4740 but does appear in more recent products such as the JZ4780. Bearing such things in mind, the timer and interrupt support was easily enough added.

One very different thing about the JZ4730 is that it does not seem to support the “set” and “clear” register locations that are probably common to most modern SoCs. Typically, one might want to update a hardware-related register to change a peripheral’s configuration, and it must have become apparent to hardware designers that such updates mostly want to either set or clear bits. Normally in a program, to achieve such things involves reading a value, performing a logical operation that combines the value with a description of the bits to be set or cleared, and then the value is written back to where it came from. For example:

define bits to set
load value from location (exposing a hardware register, perhaps)
logical-or value with bits
store result in location

Encapsulating this in a single instruction avoids potential issues with different things competing to update the location at the same time, if the hardware permits this, and just offers something that is more efficient and convenient, anyway. Separate locations are provided for “set” and “clear” operations, and the original location is provided to read and to overwrite the hardware register’s value. Sometimes, such registers might only support read-only access, in fact. But the JZ4730 does not support such additional locations, and so we have to do things the hard way when updating registers and doing things like clearing and setting bits.

One odd thing that caught me out was a strange result from the special “exception base” (EBASE) register that does not seem to return zero for the CPU identifier, something that the bootstrap code in L4Re expects. I suppressed this test and made the kernel always return zero when it asks for this identifier. To debug such things, I could not use the screen as I had done with the Ben since the bootloader does not configure it on the Letux. Fortunately, unlike the Ben, the Letux provides a few LEDs to indicate things like keyboard and network status, and these can be configured and activated to communicate simple status information.

Otherwise, the exercise mostly involved me reworking some existing code I had (itself borrowing somewhat from existing driver code) that provides driver support for the Letux hardware peripherals. The clock and power management (CPM) arrangement is familiar but different from the JZ4720; the LCD driver can actually be used as is; the general-purpose input/output (GPIO) arrangement is different from the JZ4720 and, curiously enough once again, perhaps more similar to the JZ4780 in some ways. To support the LCD panel’s backlight, a pulse-width modulation (PWM) driver needed to be added, but this involves very little code.

I also had to deal with the mistakes I made myself when not concentrating hard enough. Lots of testing and re-testing occurred. But in the space of a weekend or so, I had something to show for all the previous effort plus this round’s additional effort.

The Letux 400 and Ben NanoNote running the "spectrum" example

The Letux 400 and Ben NanoNote running the "spectrum" example

Here, you can see what kind of devices we are dealing with! The Letux 400 is less than half the width of a normal-size keyboard (with numeric keypad), and the Ben NanoNote is less than half the width of the Letux 400. Both of them were inexpensive computing devices when they were introduced, and although they may not be capable of running “modern” desktop environments or Web browsers, they offer computing facilities that were, once upon a time, “workstation class” in various respects. And they did, after all, run GNU/Linux when they were introduced.

And that is why it is attractive to consider running other “proper” operating system technologies on them now. Maybe we can revisit the compromises that led to the subnotebook and the netbook, perhaps even the tablet, where devices that are not the most powerful still have a place in fulfilling our computing needs.

Porting L4Re and Fiasco.OC to the Ben NanoNote (Summary)

Monday, April 16th, 2018

As promised, here is a summary of the work involved in porting L4Re and Fiasco.OC to the Ben NanoNote. First of all, a list of all the articles with some brief descriptions of what they cover:

  1. Familiarisation with L4Re and Fiasco.OC on the MIPS Creator CI20, adding some missing pieces
  2. Setting up and introducing a suitable compiler for the Ben, also describing the hardware in the kernel
  3. Handling instructions unsupported by the JZ4720 (the Ben’s SoC) in the kernel
  4. Describing the Ben and dealing with unsupported instructions in the L4Re portion of the system
  5. Configuring the memory layout and attempting to bootstrap the kernel
  6. Making the kernel support the MIPS architecture revision used by the JZ4720, also fixing the interrupt system description
  7. Investigating context/thread switching and fixing an inadvertently-introduced fault in the unsupported instruction handling
  8. Configuring user space examples and getting a simple framebuffer demonstration working
  9. Getting the framebuffer driver, GUI multiplexer, and “spectrum” example working

As I may have noted a few times in the articles, this work just builds on previous work done by a number of people over the years, obviously starting with the whole L4 microkernel effort, the development of Fiasco.OC, L4Re and their predecessors, and the work done to port these components to the MIPS architecture. On the l4-hackers mailing list, Adam Lackorzynski was particularly helpful when I ran into obstacles, and Sarah Hoffman provided some insight into problems with the CI20 just as it was needed.

You really don’t have to read all the articles or even any of them! The point of this article is to summarise the work and perhaps make similar porting efforts a bit more approachable for others in the same position: anyone having a vague level of familiarity with L4Re/Fiasco.OC or similar systems, also having a device that might be supported, and being somewhat familiar with writing code that drives hardware.

Practical Details

It might be useful to give certain practical details here, if only to indicate the nature of the development and testing routine employed in this endeavour. First of all, I have been using a chroot containing the Debian “unstable” distribution for the i386 architecture. Although this was essential for a time when building the software for the CI20 and trying to take advantage of Debian’s cross-compiler packages, any fairly recent version of Debian would probably be fine because I ended up using a Buildroot toolchain to be able to target the Ben. You could probably choose any Free Software distribution and reproduce what I have done.

The distribution of patches contains instructions regarding preparation and the building of the software. It isn’t too useful to repeat that information here, but the following things need doing:

  1. Installing packages for build tools
  2. Obtaining or building a cross-compiler
  3. Checking out the source code for L4Re and Fiasco.OC from its repository
  4. Applying the patches
  5. Configuring and building the kernel
  6. Configuring and building the runtime environment
  7. Copying the payload to a memory card
  8. Booting the device

Some scripts have been included in the patch distribution, one of which should do the tricky job of applying patches to the repository checkout according to the chosen device configuration. Because a centralised version control system (Subversion) has been used to publish the L4Re and Fiasco.OC sources, I had to find a way of working with my own local changes. Consequently, I wrote a few scripts to maintain bundles of changes associated with certain files, and I then managed these bundles in a different version control system. Yes, this effectively meant versioning the changes themselves!

Things would be simpler with a decentralised version control system because local commits would be convenient, and upstream updates would be incorporated into the repository separately and merged with local changes in a controlled fashion. One of the corporate participants has made a Git repository for Fiasco.OC available, which may alleviate some issues, although I am increasingly finding larger Git repositories to be unusable on my modest hardware, and I also tend to disagree with everybody deciding to put everything on GitHub.

Fixing and Building

Needing to repeatedly build, test, go back and fix, I found myself issuing the same command sequences a lot. When working with the kernel, I tended to enter the kernel build directory, which I called “mybuild”, edit the kernel sources, and then re-run the make command:

cd mybuild
vi ../src/kern/mips/exception.S # edit a familiar file with vim
make

Having built a new kernel, I would then need to build a new payload to deploy, which meant ascending the directory hierarchy and building an image in the L4Re section of the sources:

cd ../../../l4
make O=mybuild uimage E=mips-qi_lb60-spectrum-example

Given a previously-built “user space”, this would bundle the new kernel together with code that might be able to test it. Of particular importance is the bootstrap code which launches the kernel: without that, there is no point in even trying to test the kernel!

I found that re-building L4Re components seemed to require a general build to be performed:

make O=mybuild

If that proved successful, an image would then be built and tested. In general, focusing on either the kernel or some user space component meant that there was rarely a need to build a new kernel and then build much of the user space.

Work Summary

The patches accumulated during this process cover a range of different areas of functionality. Looking at them organised by functional area, instead of in the more haphazard fashion presented throughout the series of articles, allows for a more convenient review of the work actually needed to get the job done.

Build System Adjustments and Various Fixes

As early as my experiments with the CI20, I experienced the need to fix some things that didn’t work on my system, either due to some Debian peculiarities or differences in compiler behaviour:

  • l4util-mips-thread.diff (fixes a symbol visibility issue with certain compiler versions)
  • mips-gcc-cpload.diff (fixes the initialisation of certain L4Re components)
  • no-at.diff (allows the build to work on Debian for the i386 architecture)

Other adjustments are required to let the build system do its job, setting paths for other components and for the toolchains:

  • conf-makeconf-boot.diff (lets the L4Re build system find things like the kernel, modules and hardware descriptions)
  • qi_lb60-gcc-buildroot-fiasco.diff (changes the compiler and architecture settings)
  • qi_lb60-gcc-buildroot-l4re.diff (changes the compiler, architecture and soft-float settings)

The build system also needs directing towards new drivers, and various files need to be excluded or changed:

  • ingenic-mips-drivers-top.diff (enables drivers added by this work)
  • qi_lb60-fbdrv.diff (changes the splash image for the framebuffer driver)
  • qi_lb60-l4re.diff (includes a temporary fix disabling a Mag plugin)

The first of these is important to remember when adding drivers since it changes the l4/pkg/drivers/Control file and defines the driver “packages” provided by each of the driver libraries. These package definitions help the build system work out which other parts of the system need to be consulted when building a particular driver.

Supporting MIPS32r1 Devices

Throughout the kernel and L4Re, changes need making to support the earlier architecture version provided by the JZ4720. The bulk of the following patch files deals with such changes:

  • qi_lb60-fiasco.diff
  • qi_lb60-l4re.diff

Maybe I will try and break out the architecture version changes into specific patch files, provided this does not result in the original source files ending up being patched by multiple patch files. My aim has been to avoid patches having to be applied in a particular order, and that starts to happen when multiple patches modify the same file.

Describing the Ben NanoNote

The kernel needs some knowledge of the Ben with regard to timers and interrupts. Meanwhile, L4Re needs to set the Ben up correctly when booting. Both sections of the system need an awareness of how memory is going to be used, and extra configuration options need to be provided to merely allow the selection of the Ben for building. Currently, the following patch files include things concerned with such matters:

  • qi_lb60-fiasco.diff (contains timer, interrupt and memory details, plus configuration system changes)
  • qi_lb60-l4re.diff (contains bootstrap and memory details, plus configuration system changes)
  • qi_lb60-platform.diff (platform definitions for the Ben in L4Re)

One significant objective here is to be able to offer the Ben as a “first class” configuration option and have the build system do the right thing, setting up all the components and code regions that the Ben needs to function.

Introducing Driver Code

To be able to activate the framebuffer on the Ben, driver code needs introducing for a few peripherals provided by the JZ4720: CPM (clock/power management), GPIO (general-purpose input/output) and LCD (liquid crystal display, or similar). A few different patch files cover these areas:

  • ingenic-mips-cpm.diff (CPM support for JZ4720 and JZ4780)
  • ingenic-mips-gpio.diff (GPIO support for JZ4720 and JZ4780)
  • qi_lb60-lcd.diff (LCD support for JZ4720)

The JZ4780 support is intended for the CI20 and will not be used with the Ben. However, it is convenient to incorporate support for these different platforms in the same patch file in each instance.

Meanwhile, the LCD driver should work with a range of JZ4700-series devices (labelled as JZ4740 in the patches). While focusing on getting things working, the only panel supported by this work was that provided by the Ben. Since then, support has been made slightly more general, just as was done with the Linux kernel support for products employing this particular SoC family and, subsequently, for panels in general. (Linux has moved towards a “device tree” approach for specifying things like panels and displays, although this is arguably just restating things that were once C-coded structures in another, rather peculiar, format.)

To support these drivers, some useful code has been copied from elsewhere in L4Re:

  • drivers_frst-register-block.diff

This provides a convenient abstraction for registers that is exposed via an include directive:

#include <l4/drivers/hw_mmio_register_block.h>

Indeed, it is worth focusing on the LCD driver briefly. The code has its origins in existing driver code written for the Ben that I adapted to get working as part of a simple “bare metal” payload. I have maintained a separation between the more intricate hardware configuration and aspects that deal with the surrounding software. As part of L4Re, the latter involves obtaining access to memory using the appropriate API calls and invoking other drivers.

In L4Re, there is a kind of framework for LCD drivers, and the existing drivers seem to be written in C rather than C++. Reminiscent of Linux, there is a mechanism for exporting driver operations using a well-defined data structure, and this permits the “probing” of drivers to see if they can be enabled and if meaningful information can be obtained about things like the supported resolution, colour depth and pixel format. To make the existing code compatible with L4Re, a fair amount of the work involves translating the information already known (and used) in the hardware configuration activity to a form that other L4Re components can understand and use.

Originally, for the GPIO driver, I had intended it to operate as part of the Io server framework. Components using GPIO functionality would then employ the appropriate API to configure and interact with the exposed input and output pins. Unfortunately, this proved rather cumbersome, and so I decided to take a simpler approach of providing the driver as an abstraction that a program would use together with explicitly-requested memory. I did decide to preserve the general form of the API for this relocated abstraction, however, meaning that various classes and methods are provided that behave in the same way as those “left behind” in the Io server framework.

Thus, a program would itself request access to the GPIO-related memory, and it would then use GPIO-related abstractions to “do the right thing” with this memory. One would envisage that such a program would not be a “normal”, unprivileged program as such, but instead be more like a server or driver in its own right. Indeed, the LCD driver employs these abstractions to use SPI-based signalling with the LCD panel, and it uses the same techniques to configure the LCD clock frequencies using the CPM-related memory and CPM-related abstractions.

Although the GPIO driver follows existing conventions, the CPM driver has no obvious precedent in L4Re, but I adopted some of the conventions employed in the GPIO driver, adding more specialised methods and functions to expose functionality specific to the SoC. Since I had previously written a CPM driver for the JZ4780, the main objective was to make the JZ4720/JZ4740 driver resemble the existing driver as much as possible.

Introducing and Configuring Example Programs

Throughout the series of articles, I was working towards running one specific example program, making some new ones on the way for testing purposes. These additional programs are provided together with their configuration, accompanied by suitable configurations for existing examples and components, by the following patch files:

  • ingenic-mips-modules.diff (example program definitions)
  • qi_lb60-examples.diff (example program implementations and configuration details)

The additional programs (defined in l4/conf/modules.list) are as follows:

  • mips-qi_lb60-lcd-example (implemented by qi_lb60_lcd, configured by the mips-qi_lb60-lcd files)
  • mips-qi_lb60-lcd-driver-example (implemented by qi_lb60_lcd_driver, configured by the mips-qi_lb60-lcd-driver files)

Configurations are provided for the existing examples and components as follows:

  • mips-qi_lb60-fbdrv-example (configured by the mips-qi_lb60-fbdrv files)
  • mips-qi_lb60-spectrum-example (configured by the mips-qi_lb60-spectrum files)

All configurations reside in the l4/conf/examples directory. All new examples reside in the l4/pkg/examples/misc directory.

Further Work

In the final article in the series, I mentioned a few ideas for further work based on that described above:

Since completing the above work, I have already made some progress on the first two of these topics. More on that in an upcoming post!

Some Updates

Since writing this article, a few things are worth adding. First of all, the patches produced in the initial effort described by this series of articles are now available in an “initial archive” via the Web page documenting the effort. In contrast, a “current archive” provides patches for the current state of the work, with the aim being to focus these patches only on essential support for these devices within L4Re and Fiasco.OC, and with future development being done elsewhere.

Another couple of observations have been made since completing this initial effort that qualify or correct some information provided here. On the topic of the memory map needed to support the Ben NanoNote and its bootloader, it turned out that a fairly conventional arrangement was feasible after all and that only the “exception base” might be a problem. Here is the more conventional arrangement:

0x80600000 payload load address when copied by bootm
0x802d0000 bootstrap start address
0x80010000 kernel load address
0x80001000 exception handlers

The kernel load address and bootstrap start address are now the same as for other MIPS platforms. The exception handlers are positioned 4 kilobytes above the normal exception base just in case overwriting them might upset the bootloader.

Previously, I had experienced problems with Mag and its mag-input-libinput plugin, causing me to disable that plugin to get things working. This can be avoided by just providing a capability called “vbus” when starting Mag.

Porting L4Re and Fiasco.OC to the Ben NanoNote (Part 9)

Thursday, March 29th, 2018

After all my effort modifying the Fiasco.OC kernel, adapting driver code for the Ben NanoNote to work in the L4 Runtime Environment (L4Re), I had managed to demonstrate programs that could access the framebuffer and thus show things on the screen. But my stated goal is to demonstrate the functioning of existing code and a range of different components, specifically the “spectrum” example, GUI multiplexer (Mag), and framebuffer driver (fb-drv), supported by additional driver functionality I have added myself.

Still to do were the tasks of getting the framebuffer driver to access my LCD driver, getting Mag to work on top of that, and then getting the “spectrum” example to work as it had done with Fiasco.OC-UX. Looking back at this exercise now, I can see that it is largely a case of wiring up the components just as we saw earlier when I showed how to get access to the hardware through the Io server. But this didn’t prevent excursions into the code for some debugging operations or a few, more lasting, modifications.

Making a Splash

I had sought to test the entire stack consisting of the example, GUI multiplexer, framebuffer driver, and the rest of the software using a configuration derived from both the UX “spectrum” demonstration (found in l4/conf/examples/x86-fb.cfg) and the corresponding demonstration for ARM-based devices (found in l4/conf/examples/arm-rv-lcd.cfg). Unsurprisingly, this did not work, and so I started out stripping down my own example configuration in order to test the components individually.

On the way, I learned a few things that I wished I’d realised earlier. The first one is pretty mundane but very important. In order to strip down the configuration, I had tried to comment out various components and had done so using the hash symbol (#), which vim had helped to make me believe was a valid comment symbol. In fact, in Lua, if the hash symbol can be used for “program metadata”, perhaps for the usual Unix scripting declaration, then its use may be restricted to such narrow purposes and no others (as far as I could tell from quickly looking through the documentation). Broader use of the symbol appears to involve it acting as the length operator.

By making an assumption almost instinctively, due to the prevalence of the hash character as a comment symbol in Unix scripting, I was now being confronted with “Starting kernel …” and nothing more, making me think that I had really broken something. I had to take several steps back, consider what might really be happening, that the Ned task responsible for executing the configuration had somehow failed, and then come to the realisation that were I able to read the serial console output, I would be staring at a Lua syntax error!

So, removing the configuration details that I didn’t want to test straight away, I set about testing the framebuffer driver all by itself. Here is the configuration file involved (edited slightly):

local io_buses =
  {
    fbdrv = l:new_channel();
  };

l:start({
  caps = {
      fbdrv = io_buses.fbdrv:svr(),
      icu = L4.Env.icu,
      sigma0 = L4.cast(L4.Proto.Factory, L4.Env.sigma0):create(L4.Proto.Sigma0),
    },
  },
  "rom/io -vvvv rom/hw_devices.io rom/mips-qi_lb60-fbdrv.io");

local fbdrv_fb = l:new_channel();

l:startv({
  caps = {
      vbus = io_buses.fbdrv,
      fb   = fbdrv_fb:svr(),
    },
  },
  "rom/fb-drv", "-c", "nanonote"); -- configuration passed to the LCD driver

It was around this time that I discovered the importance of the naming of certain things, noted previously, and so in the accompanying Io configuration, the devices are added to a vbus called “fbdrv”. What this example should do is to run just the framebuffer driver (fb-drv) and the Io server.

Now, you might be wondering what the point of running just the driver, without any client programs, might be. Well, I had noticed that the driver should show a “splash screen”, and so I eagerly anticipated seeing whatever the L4Re developers had decided to show to us upon the driver starting up. Unfortunately, all I got was a blank screen. That made me think some bad things about what might be happening in the code. Fortunately, it didn’t take me long to realise what the problem was, discovering the following (in l4/pkg/fb-drv/server/src/splash.cc):

  if (fb_info->width < SPLASHNAME.width)
    return;
  if (fb_info->height < SPLASHNAME.height)
    return;

Meanwhile, in the source file providing the screen data (l4/pkg/fb-drv/server/data/splash1.c), I saw that the width and height of the image were given as being 480 pixels and 65 pixels respectively. The Ben’s screen is 320 pixels wide and 240 pixels high, thus preventing the supplied image from being shown.

It seemed worthwhile trying to replace this image just to reassure myself that the driver was actually working. The supplied image was exported from GIMP and so I attempted to reproduce this by loading one of my own images into GIMP, cropping to 320×240, and saving as a C source file with run-length encoding, 3 bytes per pixel, just as the supplied image had been created. I then copied it into the appropriate package subdirectory (l4/pkg/fb-drv/server/data) and modified the Makefile (l4/pkg/fb-drv/server/src/Makefile), changing it to reference my image instead of the supplied one.

I also needed to change a few other things in the Makefile, as it turned out, such as the definitions of the sources and libraries used by MIPS architecture devices. It was odd to encounter architecture-specific artefacts at this level in the system, but I suppose that different architectures tend to employ different kinds of display mechanisms: the x86 architecture and derivatives have their “legacy” devices; other architectures used in system-on-a-chip products have other peripherals and devices. Anyway, building the payload and booting the Ben gave me this:

The Ben NanoNote showing my L4Re framebuffer driver splash screen

The Ben NanoNote showing my L4Re framebuffer driver splash screen: rainbow lorikeets on a suburban lawn

So, the framebuffer driver will happily run, showing its splash screen until something connects to it and gets something else onto the screen. Here, we just let it run by itself until it is time to switch off!

Missing Inputs

There was still one component to be tested before arriving at the “spectrum” example: the GUI multiplexer, Mag. But testing it alone did not necessarily seem to make sense, because unlike the framebuffer driver, Mag’s role is merely to arbitrate between different framebuffer clients. So testing it and some kind of example together seemed like the only workable strategy.

I tried to test it with the “spectrum” example, but I only ever saw the framebuffer splash screen, so it seemed that either the example or Mag wasn’t working. But if it were Mag that wasn’t working, how would I be able to find my way to the problem? I briefly considered whether Mag had a problem with the display configuration, given that I had already seen such a problem, albeit a minor one, in the framebuffer driver.

I chased references around the source code until I established that Mag was populating a fixed-length array of “factory” objects defined in one place (l4/pkg/mag-gfx/include/factory) whose members were statically defined in another place (l4/pkg/mag/server/src/screen.cc), each one indicating how to deal with a particular pixel format. Such objects being of type Mag_gfx::Mem::Factory are returned in the Mag main program (in pkg/mag/server/src/main.cc) when the following code is executed:

  Screen_factory *f = dynamic_cast<Screen_factory*>(Screen_factory::set.find(view_i.pixel_info));

I had wondered whether f was being presented with a null value and thus stopping Mag right there, but since there was a suitable factory object being created for the Ben’s pixel format, it seemed rather unlikely. So, the only approach I had considered at this point was not a particularly convenient one: I would have to replicate Mag piece by piece until discovering which part of it caused it to fail.

I set out with a simple example borrowing code from Mag that requests a memory region via a capability or channel, displaying some data on the screen. This managed to work. I then expanded the example, adding various data structures, copying functionality into the example directory from Mag, slowly reassembling what Mag had been all along. Things kept working, right until the point where I needed to set the example going as a server rather than have it wait in an infinite loop doing nothing. Then, the screen would very briefly show the splash image, then the bit patterns, and then go blank.

But maybe Mag was going to clear the framebuffer anyway and thus the server loop was working? Perhaps this was what success actually looked like under these circumstances, which is to say that it did not involve seeing two brightly-coloured birds on a lawn. At this point, out of laziness, I had avoided integrating the plugins into my example, and that was the only remaining difference between it and Mag.

I started to realise that maybe it was a matter of removing things from Mag and seeing if I could get it to work, and the obvious candidates for removal were the plugins. So I removed all the plugins as defined in the Makefile (found at pkg/mag/server/src/Makefile) and tested to see if it changed Mag’s behaviour. Sure enough, the screen went blank. I then added plugins back one by one, knowing by now that the mag-client_fb plugin was going to be required to get the example working.

Well, it turned out that there was one plugin causing all the problems: mag-input-libinput. Removing that got the “spectrum” example to work:

The Ben NanoNote showing the "spectrum" framebuffer example

The Ben NanoNote showing the "spectrum" framebuffer example

Given that I haven’t addressed the matter of supporting input devices, which for the Ben would mostly concern the keyboard, disabling this errant plugin seemed like a reasonable compromise for now.

Update: a description of a remedy for the problem with the plugin is described in the summary article.

A Door Opens?

There isn’t much left to be said, which perhaps makes the ending something of an anticlimax. But perhaps this is part of the point of the exercise, that the L4Re/Fiasco.OC combination now just seems to work on the Ben NanoNote, and that it could potentially in future be just another thing that this software supports. Of course, the Ben is such a relatively rare device that it isn’t likely to have many potential users, but the SoC family of which the JZ4720 is a part is employed by a number of different devices.

If there haven’t been any privately-maintained ports of this software to those other devices, then this work potentially opens the door to its use on other handheld devices like the GCW Zero or any number of randomly-sourced pocket games consoles, portable media players, and even smartwatches and wearable devices, all of which have been vehicles for the SoC vendor’s products. The Letux 400 could probably be supported given the similarity of its own SoC, the JZ4730, to that used in the Ben.

When the Ben was released, work had been done, first by the SoC vendor, then by the Qi Hardware people, to provide a functioning GNU/Linux system for the device. Clearly, there isn’t an overwhelming need for another kind of system if the intention is to just use the device with existing Free Software. But perhaps this is another component in this exercise: to make other technological solutions possible and to explore avenues that get ignored as everyone follows the mainstream. The Ben is hardly a mainstream product itself; why not use it to make other alternative choices?

It seems interesting to consider writing other drivers for the Ben and to gain experience with microkernel-based systems design. The Genode framework might also be worth investigating, given its focus on becoming a system suitable for deployment to end-users. The Hurd was also ported in an exploratory fashion to one of the L4 implementations some time ago, and revisiting this might be possible, even simplifying the effort thanks to the evolution in features provided by implementations like Fiasco.OC.

In any case, I hope that this account has been vaguely interesting and entertaining. If you managed to read it all, you deserve my sincere thanks for spending the time to do so. A summary of the work will probably follow, and the patches themselves are available on a page dedicated to the effort. Good luck with your own investigations into alternative Free Software operating systems!

Porting L4Re and Fiasco.OC to the Ben NanoNote (Part 8)

Tuesday, March 27th, 2018

With some confidence that the effort put in previously had resulted in a working Fiasco.OC kernel, I now found myself where I had expected to be at the start of this adventure: in the position of trying to get an example program working that could take advantage of the framebuffer on the Ben NanoNote. In fact, there is already an example program called “spectrum” that I had managed to get working with the UX variant of Fiasco.OC (this being a kind of “User Mode Fiasco” which runs as a normal process, or maybe a few processes, on top of a normal GNU/Linux system).

However, if this exercise had taught me nothing else, it had taught me not to try out huge chunks of untested functionality and then expect everything to work. It would instead be a far better idea to incrementally add small chunks of functionality whose correctness (or incorrectness) would be easier to see. Thus, I decided to start with a minimal example, adding pieces of tested functionality from my CI20 experiments that obtain access to various memory regions. In the first instance, I wanted to see if I could exercise control over the LCD panel via the general-purpose input/output (GPIO) memory region.

Wiring Up

At this point, I should make a brief detour into the matter of making peripherals available to programs. Unlike the debugging code employed previously, it isn’t possible to just access some interesting part of memory by taking an address and reading from it or writing to it. Not only are user space programs working with virtual memory, whose addresses need not correspond to the actual physical locations employed by the underlying hardware, but as a matter of protecting the system against program misbehaviour, access to memory regions is strictly controlled.

In L4Re, a component called Io is responsible for granting access to predefined memory regions that describe hardware peripherals or devices. My example needed to have the following resources defined:

  • A configuration description (placed in l4/conf/examples/mips-qi_lb60-lcd.cfg)
  • The Io peripheral bus description (placed in l4/conf/examples/mips-qi_lb60-lcd.io)
  • A list of modules needing deployment (added as an entry to l4/conf/modules.list, described separately in l4/conf/examples/mips-qi_lb60-lcd.list)

It is perhaps clearer to start with the Io-related resource, which is short enough to reproduce here:

local hw = Io.system_bus()

local bus = Io.Vi.System_bus
{
  CPM = wrap(hw:match("jz4740-cpm"));
  GPIO = wrap(hw:match("jz4740-gpio"));
  LCD = wrap(hw:match("jz4740-lcd"));
}

Io.add_vbus("devices", bus)

This is Lua source code, with Lua having been adopted as a scripting language in L4Re. Here, the important things are those appearing within the hw:match function call brackets: each of these values refers to a device defined in the Ben’s hardware description (found in l4/pkg/io/io/config/plat-qi_lb60/hw_devices.io), identified using the “hid” property. The “jz4740-gpio” value is used to locate the GPIO device or peripheral, for instance.

Also important is the name used for the bus, “devices”, in the add_vbus function call at the bottom, as we shall see in a moment. Moving on from this, the configuration description for the example itself contains the following slightly-edited details:

local io_buses =
  {
    devices = l:new_channel();
  };

l:start({
  caps = {
      devices = io_buses.devices:svr(),
      icu = L4.Env.icu,
      sigma0 = L4.cast(L4.Proto.Factory, L4.Env.sigma0):create(L4.Proto.Sigma0),
    },
  },
  "rom/io -vvvv rom/hw_devices.io rom/mips-qi_lb60-lcd.io");

l:start({
  caps = {
      icu = L4.Env.icu,
      vbus = io_buses.devices,
    },
  },
  "rom/ex_qi_lb60_lcd");

Here, the “devices” name seemingly needs to be consistent with the name used in the caps mapping for the Io server, set up in the first l:start invocation. The name used for the io_buses mapping can be something else, but then the references to io_buses.devices must obviously be changed to follow suit. I am sure you get the idea.

The consequence of this wiring up is that the example program, set up at the end, accesses peripherals using the vbus capability or channel, communicating with Io and requesting access to devices, whose memory regions are described in the Ben’s hardware description. Io acts as the server on this channel whereas the example program acts as the client.

Is This Thing On?

There were a few good reasons for trying GPIO access first. The most significant one is related to the framebuffer. As you may recall, as the kernel finished starting up, sigma0 started, and we could then no longer be sure that the framebuffer configuration initialised by the bootloader had been preserved. So, I could no longer rely on any shortcuts in getting information onto the screen.

Another reason for trying GPIO operations was that I had written code for the CI20 that accessed input and output pins, and I had some confidence in the GPIO driver code that I had included in the L4Re package hierarchy actually working as intended. Indeed, when working with the CI20, I had implemented some driver code for both the GPIO and CPM (clock/power management) functionality as provided by the CI20’s SoC, the JZ4780. As part of my preparation for this effort, I had adapted this code for the Ben’s SoC, the JZ4720. It was arguably worthwhile to give this code some exercise.

One drawback with using general-purpose input/output, however, is that the Ben doesn’t really have any conveniently-accessed output pins or indicators, whereas the CI20 is a development board with lots of connectors and some LEDs that can be used to give simple feedback on what a program might be doing. Manipulating the LCD panel, in contrast, offers a very awkward way of communicating program status.

Experiments proved somewhat successful, however. Obtaining access to device memory was as straightforward as it had been in my CI20 examples, providing me with a virtual address corresponding to the GPIO memory region. Inserting some driver code employing GPIO operations directly into the principal source file for the example, in order to keep things particularly simple and avoid linking problems, I was able to tell the panel to switch itself off. This involves bit-banging SPI commands using a number of output pins. The consequence of doing this is that the screen fades to white.

I gradually gained confidence and decided to reserve memory for the framebuffer and descriptors. Instead of attempting to use the LCD driver code that I had prepared, I attempted to set up the descriptors manually by writing values to the structure that I knew would be compatible with the state of the peripheral as it had been configured previously. But once again, I found myself making no progress. No image would appear on screen, and I started to wonder whether I needed to do more work to reinitialise the framebuffer or maybe even the panel itself.

But the screen was at least switched on, indicating that the Ben had not completely hung and might still be doing something. One odd thing was that the screen would often turn a certain colour. Indeed, it was turning purple fairly consistently when I realised my mistake: my program was terminating! And this was, of course, as intended. The program would start, access memory, set up the framebuffer, fill it with a pattern, and then finish. I suspected something was happening and when I started to think about it, I had noticed a transient bit pattern before the screen went blank, but I suppose I had put this down to some kind of ghosting or memory effect, at least instinctively.

When a program terminates, it is most likely that the memory it used gets erased so as to prevent other programs from inheriting data that they should not see. And naturally, this will result in the framebuffer being cleared, maybe even the descriptors getting trashed again. So the trivial solution to my problem was to just stop my program from terminating:

while (1);

That did the trick!

The Build-Up

Accessing the LCD peripheral directly from my example is all very well, but I had always intended to provide a display driver for the Ben within L4Re. Attempting to compile the driver as part of the general build process had already identified some errors that had been present in what effectively had been untested code, migrated from my earlier work and modified for the L4Re APIs. But I wasn’t about to try and get the driver to work in one go!

Instead, I copied more functionality into my example program, giving things like the clock management functionality some exercise, using my newly-enabled framebuffer output to test the results from various enquiries about clock frequencies, also setting various frequencies and seeing if this stopped the program from working. This did lead me on another needlessly-distracting diversion, however. When looking at the clock configuration values, something seemed wrong: they were shifted one bit to the left and therefore provided values twice as large as they should have been.

I emitted some other values and saw that they too were shifted to the left. For a while, I wondered whether the panel configuration was wrong and started adjusting all sorts of timing-related values (front and back porches, sync pulses, video-related things), reading up about how the panel was configured using the SPI communications described above. It occurred to me that I should investigate how the display output was shifted or wrapped, and I used the value 0x80000001 to put bit values at the extreme left and right of the screen. Something was causing the output to be wrapped, apparently displaced by 9 pixels or 36 bytes to the left.

You can probably guess where this is heading! The panel configuration had not changed and there was no reason to believe that it was wrong. Similarly, the framebuffer was being set up correctly and was not somehow moved slightly in memory, not that there is really any mechanism in L4Re or the hardware that would cause such a bizarre displacement. I looked again at my debugging code and saw my mistake; in concise terms, it took this form:

for (i = 0; i < fb_size / 4; i++)
{
    fb[i] = value & mask ? onpix : offpix; // write a pixel
    if ((i % 10) == 0)                     // move to the next bit?
    ...
}

For some reason, I had taken some perfectly acceptable code and changed one aspect of it. Here is what it had looked like on the many occasions I had used it in the past:

i = 0;
while (i < fb_size / 4)
{
    fb[i] = value & mask ? onpix : offpix; // write a pixel
    i++;
    if ((i % 10) == 0)                     // move to the next bit?
    ...
}

That’s right: I had accidentally reordered the increment and “next bit” tests, causing the first bit in the value to occupy only one pixel, prematurely skipping to the next bit; at the end of the value, the mask would wrap around and show the remaining nine pixels of the first bit. Again, I was debugging the debugging and wasting my own time!

But despite such distractions, over time, it became interesting to copy the complete source file that does most of the work configuring the hardware from my driver into the example directory and to use its functions directly, these functions normally being accessed by more general driver code. In effect, I was replicating the driver within an environment that was being incrementally enhanced, up until the point where I might assume that the driver itself could work.

With no obvious problems having occurred, I then created another example without all the memory lookup operations, copied-in functions, and other “instrumentation”: one that merely called the driver API, relying on the general mechanisms to find and activate the hardware…

  struct arm_lcd_ops *ops = arm_lcd_probe("nanonote");

…obtaining a framebuffer address and size…

  fb_vaddr = ops->get_fb();
  fb_size = ops->get_video_mem_size();

…and then writing data to the screen just as I had done before. Indeed, this worked just like the previous example, indicating that I had managed to implement the driver’s side of the mechanisms correctly. I had a working LCD driver!

Adopting Abstractions

If the final objective had been to just get Fiasco.OC running and to have a program accessing the framebuffer, then this would probably be the end of this adventure. But, in fact, my aim is to demonstrate that the Ben NanoNote can take advantage of the various advertised capabilities of L4Re and Fiasco.OC. This means supporting facilities like the framebuffer driver, GUI multiplexer, and potentially other things. And such facilities are precisely those demonstrated by the “spectrum” example which I had promised myself (and my readership) that I would get working.

At this point, I shouldn’t need to write any more of my own code: the framebuffer driver, GUI multiplexer, and “spectrum” example are all existing pieces of L4Re code. But what I did need to figure out was whether my recipe for wiring these things up was still correct and functional, and whether my porting exercise had preserved the functionality of these things or mysteriously broken them!