Paul Boddie's Free Software-related blog

Archive for the ‘L4’ Category

Experiments with a Screen

Sunday, November 19th, 2023

Not much to report, really. Plenty of ongoing effort to overhaul my L4Re-based software support for the MIPS-based Ingenic SoC products, plus the slow resumption of some kind of focus on my more general framework to provide a demand-paged system on top of L4Re. And then various distractions and obligations on top of that.

Anyway, here is a picture of some kind of result:

MIPS Creator CI20 and Pirate Audio Mini Speaker board

The MIPS Creator CI20 driving the Pirate Audio Mini Speaker board’s screen.

It shows the MIPS Creator CI20 using a Raspberry Pi “hat”, driving the screen using the SPI peripheral built into the CI20’s JZ4780 SoC. Although the original Raspberry Pi had a 26-pin expansion header that the CI20 adopted for compatibility, the Pi range then adopted a 40-pin header instead. Hopefully, there weren’t too many unhappy vendors of accessories as a result of this change.

What it means for the CI20 is that its primary expansion header cannot satisfy the requirements of the expansion connector provided by this “hat” or board in its entirety. Instead, 14 pins of the board’s connector are left unconnected, with the board hanging over the side of the CI20 if mounted directly. Another issue is that the pinout of the board employs a pin as a data/command pin instead of as its designated function as a SPI data input pin. Whether the Raspberry Pi can configure itself to utilise this pin efficiently in this way might help to explain the configuration, but it isn’t compatible with the way such pins are assigned on the CI20.

Fortunately, the CI20’s designers exposed a SPI peripheral via a secondary header, including a dedicated data/command pin, meaning that a few jumper wires can connect the relevant pins to the appropriate connector pins. After some tedious device driver implementation and accompanying frustration, the screen could be persuaded to show an image. With the SPI peripheral being used instead of “bit banging”, or driving the data transfer to the screen controller directly in software, it became possible to use DMA to have the screen image repeatedly sent. And with that, the screen can be used to continuously reflect the contents of a generic framebuffer, becoming like a tiny monitor.

The board also has a speaker that can be driven using I2S communication. The CI20 doesn’t expose I2S signals via the header pins, instead routing I2S audio via the HDMI connector, analogue audio via the headphone socket, and PCM audio via the Wi-Fi/Bluetooth chip, presumably supporting Bluetooth audio. Fortunately, I have another means of testing the speaker, so I didn’t waste half of my money buying this board!

Posted in English, Free Software, hardware, L4, MIPS | Comments Off on Experiments with a Screen

Continuing Explorations into Filesystems and Paging with L4Re

Saturday, April 8th, 2023

Towards the end of last year, I spent a fair amount of time trying to tidy up and document the work I had been doing on integrating a conventional filesystem into the L4 Runtime Environment (or L4Re Operating System Framework, as it now seems to be called). Some of that effort was purely administrative, such as giving the work a more meaningful name and changing references to the naming in various places, whereas other aspects were concerned with documenting mundane things like how the software might be obtained, built and used. My focus had shifted somewhat towards sharing the work and making it slightly more accessible to anyone who might be interested (even if this is probably a very small audience).

Previously, in seeking to demonstrate various mechanisms such as the way programs might be loaded and run, with their payloads paged into memory on demand, I had deferred other work that I felt was needed to make the software framework more usable. For example, I was not entirely happy with the way that my “client” library for filesystem access hid the underlying errors, making troubleshooting less convenient than it could be. Instead of perpetuating the classic Unix “errno” practice, I decided to give file data structures their own error member to retain any underlying error, meaning that a global variable would not be involved in any error reporting.

Other matters needed attending to, as well. Since acquiring a new computer in 2020 based on the x86-64 architecture, the primary testing environment for this effort has been a KVM/QEMU instance invoked by the L4Re build process. When employing the same x86-64 architecture for the instance as the host system, the instance should in theory be very efficient, but for some reason the startup time of such x86-64 instances is currently rather long. This was not the case at some point in the past, but having adopted the Git-based L4Re distribution, this performance regression made an appearance. Maybe at some stage in the future I will discover why it sits there for half a minute spinning up at the “Booting ROM” stage, but for now a reasonable workaround is to favour QEMU instances for other architectures when testing my development efforts.

Preserving Portability

Having long been aware of the necessity of software portability, I have therefore been testing the software in QEMU instances emulating the classic 32-bit x86 architecture as well as MIPS32, in which I have had a personal interest for several years. Surprisingly, testing on x86 revealed a few failures that were not easily explained, but I eventually tracked them down to interoperability problems with the L4Re IPC library, where that library was leaving parts of IPC message values uninitialised and causing my own IPC library to misinterpret the values being sent. This investigation also led me to discover that the x86 Application Binary Interface is rather different in character to the ABI for other architectures. On those other architectures, the alignment of members in structures (and of parameters in parameter lists) needs to be done more carefully due to the way values in memory are accessed. On x86, meanwhile, it seems that values of different sizes can be more readily packed together.

In any case, I came to believe that the L4Re IPC library is not following the x86 ABI specification in the way IPC messages are prepared. I did wonder whether this was deliberate, but I think that it is actually inadvertent. One of my helpful correspondents confirmed that there was indeed a discrepancy between the L4Re code and the ABI, but nothing came of any enquiries into the matter, so I imagine that in any L4Re systems deployed on x86 (although I doubt that there can be many), the use of the L4Re code on both sides of any given IPC transaction manages to conceal this apparent deficiency. The consequence for me was that I had to introduce a workaround in the cases where my code needs to interact with various existing L4Re components.

Several other portability changes were made to resolve a degree of ambiguity around the sizes of various types. This is where the C language family and various related standards and technologies can be infuriating, with care required when choosing data types and then using these in conjunction with libraries that might have their own ideas about which types should be used. Although there probably are good reasons for some types to be equivalent to a “machine word” in size, such types sit uncomfortably with types of other, machine-independent sizes. I am sure I will have to revisit these choices over and over again in future.

Enhancing Component Interface Descriptions

One thing I had been meaning to return to was the matter of my interface description language (IDL) tool and its lack of support for composing interfaces. For example, a component providing file content might expose several different interfaces for file operations, dataspace operations, and so on. These compound interfaces had been defined by specifying arguments for each invocation of the IDL tool that indicate all the interfaces involved, and thus the knowledge of each compound interface ended up being encoded as definitions within Makefiles like this:

mapped_file_object_INTERFACES = dataspace file flush mapped_file notification

A more natural approach involved defining these interfaces in the interface description language itself, but this was going to require putting in the effort to extend the tool, which would not be particularly pleasant, being written in C using Flex and Bison.

Eventually, I decided to just get on with remedying the situation, adding the necessary tool support, and thus tidying up and simplifying the Makefiles in my L4Re build system package. This did raise the complexity level in the special Makefiles provided to support the IDL tool – nothing in the realm of Makefiles is ever truly easy – but it hopefully confines such complexity out of sight and keeps the main project Makefiles as concise as can reasonably be expected. For reference, here is how a file component interface looks with this new tool support added:

interface MappedFileObject composes Dataspace, File, Flush, MappedFile, Notification;

And for reference, here is what one of the constituent interfaces looks like:

interface Flush
{
  /* Flush data and update the size, if appropriate. */

  [opcode(5)] void flush(in offset_t populated_size, out offset_t size);
};

I decided to diverge from previous languages of this kind and to use “composes” instead of language like “inherits”. These compound interface descriptions deliberately do not seek to combine interfaces in a way that entirely resembles inheritance as supported by various commonly used programming languages, and an interface composing other interfaces cannot also add operations of its own: it can merely combine other interfaces. The main reason for such limitations is the deliberate simplicity or lack of capability of the tool: it only really transcribes the input descriptions to equivalent forms in C or C++ and neglects to impose many restrictions of its own. One day, maybe I will revisit this and at least formalise these limitations instead of allowing them to emerge from the current state of the implementation.

A New Year

I had hoped to deliver something for broader perusal late last year, but the end of the year arrived and with it some intriguing but increasingly time-consuming distractions. Having written up the effective conclusion of those efforts, I was able to turn my attention to this work again. To start with, that involved reminding myself where I had got to with it, which underscores the need for some level of documentation, because documentation not only communicates the nature of a work to others but it also communicates it to one’s future self. So, I had to spend some time rediscovering the finer detail and reminding myself what the next steps were meant to be.

My previous efforts had demonstrated the ability to launch new programs from my own programs, reproducing some of what L4Re already provides but in a form more amenable to integrating with my own framework. If the existing L4Re code had been more obviously adaptable in a number of different ways throughout my long process of investigation and development for it, I might have been able to take some significant shortcuts and save myself a lot of effort. I suppose, however, that I am somewhat wiser about the technologies and techniques involved, which might be beneficial in its own way. The next step, then, was to figure out how to detect and handle the termination of programs that I had managed to launch.

In the existing L4Re framework, a component called Ned is capable of launching programs, although not being able to see quite how I might use it for my own purposes – that being to provide a capable enough shell environment for testing – had led me along my current path of development. It so happens that Ned supports an interface for “parent” tasks that is used by created or “child” tasks, and when a program terminates, the general support code for the program that is brought along by the C library includes the invocation of an operation on this parent interface before the program goes into a “wait forever” state. Handling this operation and providing this interface seemed to be the most suitable approach for replicating this functionality in my own code.

Consolidation and Modularisation

Before going any further, I wanted to consolidate my existing work which had demonstrated program launching in a program written specifically for that purpose, bringing along some accompanying abstractions that were more general in nature. First of all, I decided to try and make a library from the logic of the demonstration program I had written, so that the work involved in setting up the environment and resources for a new program could be packaged up and re-used. I also wanted the functionality to be available through a separate server component, so that programs wanting to start other programs would not need to incorporate this functionality but could instead make a request to this separate “process server” component to do the work, obtaining a reference to the new program in response.

One might wonder why one might bother introducing a separate component to start programs on another program’s behalf. As always when considering the division of functionality between components in a microkernel-based system, it is important to remember that components can have different configurations that afford them different levels of privilege within a system. We might want to start programs with one level of privilege from other programs with a different level of privilege. Another benefit of localising program launching in one particular component is that it might provide an overview of such activities across a number of programs, thus facilitating support for things like job and process control.

Naturally, an operating system does not need to consolidate all knowledge about running programs or processes in one place, and in a modular microkernel-based system, there need not even be a single process server. In fact, it seems likely that if we preserve the notion of a user of the system, each user might have their own process server, and maybe even more than one of them. Such a server would be configured to launch new programs in a particular way, having access only to resources available to a particular user. One interesting possibility is that of being able to run programs provided by one filesystem that then operate on data provided by another filesystem. A program would not be able to see the filesystem from which it came, but it would be able to see the contents of a separate, designated filesystem.

Region Mapper Deficiencies

A few things conspired to make the path of progress rather less direct than it might have been. Having demonstrated the launching of trivial programs, I had decided to take a welcome break from the effort. Returning to the effort, I decided to test access to files served up by my filesystem infrastructure, and this caused programs to fail. In order to support notification events when accessing files, I employ a notification thread to receive such events from other components, but the initialisation of threading in the C library was failing. This turned out to be due to the use of a region mapper operation that I had not yet supported, so I had to undertake a detour to implement an appropriate data structure in the region mapper, which in C++ is not a particularly pleasant experience.

Later on, the region mapper caused me some other problems. I had neglected to implement the detach operation, which I rely on quite heavily for my file access library. Attempting to remedy these problems involved reacquainting myself with the region mapper interface description which is buried in one of the L4Re packages, not to be confused with plenty of other region mapper abstractions which do not describe the actual interface employed by the IPC mechanism. The way that L4Re has abandoned properly documented interface descriptions is very annoying, requiring developers to sift through pages of barely commented code and to be fully aware of the role of that code. I implemented something that seemed to work, quite sure that I still did not have all the details correct in my implementation, and this suspicion would prove correct later on.

Local and Non-Local Capabilities

Another thing that I had not fully understood, when trying to put together a library handling IPC that I could tolerate working with, was the way that capabilities may be transferred in IPC messages within tasks. Capabilities are references to components in the system, and when transferred between tasks, the receiving task is meant to allocate a “slot” for each received capability. By choosing a slot denoted by an index, the task (or the program running in it) can tell the kernel where to record the capability in its own registry for the task, and by employing this index in its own registry, the program will be able to maintain a record of available capabilities consistent with that of the kernel.

The practice of allocating capability slots for received capabilities is necessary for transfers between tasks, but when the transfer occurs within a task, there is no need to allocate a new slot: the received capability is already recorded within the task, and so the item describing the capability in the message will actually encode the capability index known to the task. Previously, I was not generally sending capabilities in messages within tasks, and so I had not knowingly encountered any issues with my simplistic “general case” support for capability transfers, but having implemented a region mapper that resides in the same task as a program being run, it became necessary to handle the capabilities presented to the region mapper from within the same task.

One counterintuitive consequence of the capability management scheme arises from the general, inter-task transfer case. When a task receives a capability from another task, it will assign a new index to the capability ahead of time, since the kernel needs to process this transfer as it propagates the message. This leaves the task with a new capability without any apparent notion of whether it has seen that capability before. Maybe there is a way of asking the kernel if two capabilities refer to the same object, but it might be worthwhile just not relying on such facilities and designing frameworks around such restrictions instead.

Starting and Stopping

So, back to the exercise of stopping programs that I had been able to start! It turned out that receiving the notification that a program had finished was only the start; what then needed to happen was something of a mystery. Intuitively, I knew that the task hosting the program’s threads would need to be discarded, but I envisaged that the threads themselves probably needed to be discarded first, since they are assigned to the task and probably cannot have that task removed from under them, even if they are suspended in some sense.

But what about everything else referenced by the task? After all, the task will have capabilities for things like dataspaces that provide access to regions of files and to the program stack, for things like the filesystem for opening other files, for semaphore and IRQ objects, and so on. I cannot honestly say that I have the definitive solution, and I could not easily find much in the way of existing guidance, so I decided in the end to just try and tidy all the resources up as best I could, hopefully doing enough to make it possible to release the task and have the kernel dispose of it. This entailed a fairly long endeavour that also encouraged me to evolve the way that the monitoring of the process termination is performed.

When the started program eventually reaches the end and sends a message to its “parent” component, that component needs to record any termination state communicated in the message so that it may be passed on to the program’s creator or initiator, and then it also needs to commence the work of wrapping up the program. Here, I decided on a distinct component separate from one responsible for any paging activities to act as the contact point for the creating or initiating program. When receiving a termination message or signal, this component disconnects the terminating program from its internal pager by freeing the capability, and this then causes the internal pager to terminate, itself sending a signal to its own parent.

One important aspect of starting and terminating processes is that of notifying the party that sought to start a process in the first place. For filesystem operations, I had already implemented support for certain notification events related to opening, modifying and closing files and pipes, with these being particularly important for pipes. I wanted to extend this support to processes so that it might be possible to monitor files, pipes and processes together using a kind of select or poll operation. This led to a substantial detour where I became dissatisfied with the existing support, modified it, had to debug it, and remain somewhat concerned that it might need more work in the future.

Testing on the different architectures under QEMU also revealed that I would need to handle the possibility that a program might be started and run to completion before its initiator had even received a reference to the program for notification purposes. Fortunately, a similar kind of vanishing resource problem arose when I was developing the file paging system, and so I had a technique available to communicate the reference to the process monitor component to the initiator of the program, ensuring that the process monitor becomes established in the kernel’s own records, before the program itself gets started, runs and completes, avoiding the process monitor being tidied up before its existence becomes known to the wider system.

Wrapping Up Again

A few concerns remain with the state of the work so far. I experienced problems with filesystem access that I traced to the activity of repeatedly attaching and detaching dataspaces, which is something my filesystem access library does deliberately, but the error suggested that the L4Re region mapper had somehow failed to attach the appropriate region. This may well be caused by issues within my own code, and my initial investigation did indeed uncover a problem in my own code where the size of the attached region of a file would gradually increase over time. With this mistake fixed, the situation was improved, but the underlying problem was not completely eliminated, judging from occasional errors. A workaround has been adopted for now.

Various other problems arose and were hopefully resolved. I would say that some of them were due to oversights when getting things done takes precedence over a more complete consideration of all the issues, particularly when working in a language like C++ where lower-level chores like manual memory management enter the picture. The differing performance when emulating various architectures under QEMU also revealed a deficiency with my region mapper implementation. It turned out that detach operations were not returning successfully, leading the L4Re library function to return without invalidating memory pages, and so my file access operations were returning pages of incorrect content instead of the expected file content for the first few accesses until the correct pages had been paged in and were almost continuously resident.

Here, yet more digging around in the L4Re code revealed an apparent misunderstanding about the return value associated with one of the parameters to the detach operation, that of the detached dataspace. I had concluded that a genuine capability was meant to be returned, but it seems that a simple index value is returned in a message word instead of a message item, and so there is no actual capability transferred to the caller, not even a local one. The L4Re IPC framework does not really make the typing semantics very clear, or at least not to me, and the code involved is quite unfathomable. Again, a formal interface specification written in a clearly expressed language would have helped substantially.

Next Steps

I suppose progress of sorts has been made in the last month or so, for which I can be thankful. Although tidying up the detritus of my efforts will remain an ongoing task, I can now initiate programs and wait for them to finish, meaning that I can start building up test suites within the environment, combining programs with differing functionality in a Unix-like fashion to hopefully validate the behaviour of the underlying frameworks and mechanisms.

Now, I might have tried much of this with L4Re’s Lua-based scripting, but it is not as straightforward as a more familiar shell environment, appearing rather more low-level in some ways, and it is employed in a way that seems to favour parallel execution instead of the sequential execution that I might desire when composing tests: I want tests to involve programs whose results feed into subsequent programs, as opposed to just running a load of programs at once. Also, without more extensive documentation, the Lua-based scripting support remains a less attractive choice than just building something where I get to control the semantics. Besides, I also need to introduce things like interprocess pipes, standard input and output, and such things familiar from traditional software platforms. Doing that for a simple shell-like environment would be generally beneficial, anyway.

Should I continue to make progress, I would like to explore some of the possibilities hinted at above. The modular architecture of a microkernel-based system should allow a more flexible approach in partitioning the activities of different users, along with the configuration of their programs. These days, so much effort is spent in “orchestration” and the management of containers, with a veritable telephone directory of different technologies and solutions competing for the time and attention of developers who are now also obliged to do the work of deployment specialists and systems administrators. Arguably, much of that involves working around the constraints of traditional systems instead of adapting to those systems, with those systems themselves slowly adapting in not entirely convincing or satisfactory ways.

I also think back to my bachelor’s degree dissertation about mobile software agents where the idea was that untrusted code might be transmitted between systems to carry out work in a safe and harmless fashion. Reducing the execution environment of such agent programs to a minimum and providing decent support for monitoring and interacting with them would be something that might be more approachable using the techniques explored in this endeavour. Pervasive, high-speed, inexpensively-accessed networks undermined the envisaged use-cases for mobile agents in general, although the practice of issuing SQL queries to database servers or having your browser run JavaScript programs deployed in Web pages demonstrates that the general paradigm is far from obsolete.

In any case, my “to do” list for this project will undoubtedly remain worryingly long for the foreseeable future, but I will hopefully be able to remedy shortcomings, expand the scope and ambition of the effort, and continue to communicate my progress. Thank you to those who have made it to the end of this rather dry article!

Posted in English, filesystems, Free Software, L4, libext2fs, operating systems | Comments Off on Continuing Explorations into Filesystems and Paging with L4Re

Gradual Explorations of Filesystems, Paging and L4Re

Thursday, June 30th, 2022

A surprising three years have passed since my last article about my efforts to make a general-purpose filesystem accessible to programs running in the L4 (or L4Re) Runtime Environment. Some of that delay was due to a lack of enthusiasm about blogging for various reasons, much more was due to having much of my time occupied by full-time employment involving other technologies (Python and Django mostly, since you ask) that limited the amount of time and energy that could be spent focusing on finding my way around the intricacies of L4Re.

In fact, various other things I looked into in 2019 (or maybe 2018) also went somewhat unreported. I looked into trying to port the “user mode” (UX) variant of the Fiasco.OC microkernel to the MIPS architecture used by the MIPS Creator CI20. This would have allowed me to conveniently develop and test L4Re programs in the GNU/Linux environment on that hardware. I did gain some familiarity with the internals of that software, together with the Linux ptrace mechanism, making some progress but not actually getting to a usable conclusion. Recommendations to use QEMU instead led me to investigate the situation with KVM on MIPS, simply to try and get half-way reasonable performance: emulation is otherwise rather slow.

You wouldn’t think that running KVM on anything other than Intel/AMD or ARM architectures were possible if you only read the summary on the KVM project page or the Debian Wiki’s KVM page. In fact, KVM is supported on multiple architectures including MIPS, but the latest (and by now very old 3.18) “official” kernel for the CI20 turned out to be too old to support what I needed. Or at least, I tried to get it to work but even with all the necessary configuration to support “trap and emulate” on a CPU without virtualisation support, it seemed to encounter instructions it did not emulate. As the hot summer of 2019 (just like 2018) wound down, I switched back to using my main machine at the time: an ancient Pentium 4 system that I didn’t want heating the apartment; one that could run QEMU rather slowly, albeit faster than the CI20, but which gave me access to Fiasco.OC-UX once again.

Since then, the hard yards of upstreaming Linux kernel device support for the CI20 has largely been pursued by the ever-patient Nikolaus Schaller, vendor of the Letux 400 mini-notebook and hardware designer of the Pyra, and a newer kernel capable of running KVM satisfactorily might now be within reach. That is something to be investigated somewhere in the future.

Back to the Topic

In my last article on the topic of this article, I had noted that to take advantage of various features that L4Re offers, I would need to move on from providing a simple mechanism to access files through read and write operations, instead embracing the memory mapping paradigm that is pervasive in L4Re, adopting such techniques to expose file content to programs. This took us through a tour of dataspaces, mapping, pages, flexpages, virtual memory and so on. Ultimately, I had a few simple programs working that could still read and write to files, but they would be doing so via a region of memory where pages of this memory would be dynamically “mapped” – made available – and populated with file content. I even integrated the filesystem “client” library with the Newlib C library implementation, but that is another story.

Nothing is ever simple, though. As I stressed the test programs, introducing concurrent access to files, crashes would occur in the handling of the pages issued to the clients. Since I had ambitiously decided that programs accessing the same files would be able to share memory regions assigned to those files, with two or more programs being issued with access to the same memory pages if they happened to be accessing the same areas of the underlying file, I had set myself up for the accompanying punishment: concurrency errors! Despite the heroic help of l4-hackers mailing list regulars (Frank and Jean), I had to concede that a retreat, some additional planning, and then a new approach would be required. (If nothing else, I hope this article persuades some l4-hackers readers that their efforts in helping me are not entirely going to waste!)

Prototyping an Architecture

In some spare time a couple of years ago, I started sketching out what I needed to consider when implementing such an architecture. Perhaps bizarrely, given the nature of the problem, my instinct was to prototype such an architecture in Python, running as a normal program on my GNU/Linux system. Now, Python is not exactly celebrated for its concurrency support, given the attention its principal implementation, CPython, has often had for a lack of scalability. However, whether or not the Python implementation supports running code in separate threads simultaneously, or whether it merely allows code in threads to take turns running sequentially, the most important thing was that I could have code happily running along being interrupted at potentially inconvenient moments by some other code that could conceivably ruin everything.

Fortunately, Python has libraries for threading and provides abstractions like semaphores. Such facilities would be all that was needed to introduce concurrency control in the different program components, allowing the simulation of the mechanisms involved in acquiring memory pages, populating them, issuing them to clients, and revoking them. It may sound strange to even consider simulating memory pages in Python, which operates at another level entirely, and the issuing of pages via a simulated interprocess communication (IPC) mechanism might seem unnecessary and subject to inaccuracy, but I found it to be generally helpful in refining my approach and even deepening my understanding of concepts such as flexpages, which I had applied in a limited way previously, making me suspect that I had not adequately tested the limits of my understanding.

Naturally, L4Re development is probably never done in Python, so I then had the task of reworking my prototype in C++. Fortunately, this gave me the opportunity to acquaint myself with the more modern support in the C++ standard libraries for threading and concurrency, allowing me to adopt constructs such as mutexes, condition variables and lock guards. Some of this exercise was frustrating: C++ is, after all, a lower-level language that demands more attention to various mundane details than Python does. It did suggest potential improvements to Python’s standard library, however, although I don’t really pay any attention to Python core development any more, so unless someone else has sought to address such issues, I imagine that Python will gain even more in the way of vanity features while such genuine deficiencies and omissions remain unrecognised.

Transplanting the Prototype

Upon reintroducing this prototype functionality into L4Re, I decided to retain the existing separation of functionality into various libraries within the L4Re build system – ones for filesystem clients, servers, IPC – also making a more general memory abstractions library, but I ultimately put all these libraries within a single package. At some point, it is this package that I will be making available, and I think that it will be easier to evaluate with all the functionality in a single bundle. The highest priority was then to test the mechanisms employed by the prototype using the same concurrency stress test program, this originally being written in Python, then ported to C++, having been used in my GNU/Linux environment to loosely simulate the conditions under L4Re.

This stress testing exercise eventually ended up working well enough, but I did experience issues with resource limits within L4Re as well as some concurrency issues with capability management that I should probably investigate further. My test program opens a number of files in a number of threads and attempts to read various regions of these files over and over again. I found that I would run out of capability slots, these tracking the references to other components available to a task in L4Re, and since each open file descriptor or session would require a slot, as would each thread, I had to be careful not to exceed the default budget of such slots. Once again, with help from another l4-hackers participant (Philipp), I realised that I wasn’t releasing some of the slots in my own code, but I also learned that above a certain number of threads, open files, and so on, I would need to request more resources from the kernel. The concurrency issue with allocating individual capability slots remains unexplored, but since I already wrap the existing L4Re functionality in my own library, I just decided to guard the allocation functionality with semaphores.

With some confidence in the test program, which only accesses simulated files with computed file content, I then sought to restore functionality accessing genuine files, these being the read-only files already exposed within L4Re along with ext2-resident files previously supported by my efforts. The former kind of file was already simulated in the prototype in the form of “host” files, although L4Re unhelpfully gives an arbitary (counter) value for the inode identifiers for each file, so some adjustments were required. Meanwhile, introducing support for the latter kind of file led me to update the bundled version of libext2fs I am using, refine various techniques for adapting the upstream code, introduce more functionality to help use libext2fs from my own code (since libext2fs can be rather low-level), and to consider the broader filesystem support architecture.

Here is the general outline of the paging mechanism supporting access to filesystem content:

The data structures employed to provide filesystem content to programs.

It is rather simplistic, and I have practically ignored complicated page replacement algorithms. In practice, pages are obtained for use when a page fault occurs in a program requesting a particular region of file content, and fulfilment of this request will move a page to the end of a page queue. Any independent requests for pages providing a particular file region will also reset the page’s position in the queue. However, since successful accesses to pages will not cause programs to repeatedly request those pages, eventually those pages will move to the front of the queue and be reclaimed.

Without any insight into how much programs are accessing a page successfully, relying purely on the frequency of page faults, I imagine that various approaches can be adopted to try and assess the frequency of accesses, extrapolating from the page fault frequency and seeking to “bias” or “weight” pages with a high frequency of requests so that they move through the queue more slowly or, indeed, move through a queue that provides pages less often. But all of this is largely a distraction from getting a basic mechanism working, so I haven’t directed any more time to it than I have just now writing this paragraph!

Files and File Sessions

While I am quite sure that I ended up arriving at a rather less than optimal approach for the paging architecture, I found that the broader filesystem architecture also needed to be refined further as I restored the functionality that I had previously attempted to provide. When trying to support shared access to file content, it is appropriate to implement some kind of registry of open files, these retaining references to objects that are providing access to each of the open files. Previously, this had been done in a fairly simple fashion, merely providing a thread-safe map or dictionary yielding the appropriate file-related objects when present, otherwise permitting new objects to be registered.

Again, concurrency issues needed closer consideration. When one program requests access to a file, it is highly undesirable for another program to interfere during the process of finding the file, if it exists already, or creating the file, if it does not. Therefore, there must be some kind of “gatekeeper” for the file, enforcing sequential access to filesystem operations involving it and ensuring that any preparatory activities being undertaken to make a file available, or to remove a file, are not interrupted or interfered with. I came up with an architecture looking like this, with a resource registry being the gatekeeper, resources supporting file sessions, providers representing open files, and accessors transferring data to and from files:

The data structures employed to provide access to the underlying filesystem objects.

I became particularly concerned with the behaviour of the system around file deletion. On Unix systems, it is fairly well understood that one can “unlink” an existing file and keep accessing it, as long as a file descriptor has been retained to access that file. Opening a file with the same name as the unlinked file under such circumstances will create a new file, provided that the appropriate options are indicated, or otherwise raise a non-existent file error, and yet the old file will still exist somewhere. Any new file with the same name can be unlinked and retained similarly, and so on, building up a catalogue of old files that ultimately will be removed when the active file descriptors are closed.

I thought I might have to introduce general mechanisms to preserve these Unix semantics, but the way the ext2 filesystem works largely encodes them to some extent in its own abstractions. In fact, various questions that I had about Unix filesystem semantics and how libext2fs might behave were answered through the development of various test programs, some being normal programs accessing files in my GNU/Linux environment, others being programs that would exercise libext2fs in that environment. Having some confidence that libext2fs would do the expected thing leads me to believe that I can rely on it at least for some of the desired semantics of the eventual system.

The only thing I really needed to consider was how the request to remove a file when that file was still open would affect the “provider” abstraction permitting continued access to the file contents. Here, I decided to support a kind of deferred removal: if a program requested the removal of a file, the provider and the file itself would be marked for removal upon the final closure of the file, but the provider for the file would no longer be available for new usage, and the file would be unlinked; programs already accessing the file would continue to operate, but programs opening a file of the same name would obtain a new file and a new provider.

The key to this working satisfactorily is that libext2fs will assign a new inode identifier when opening a new file, whereas an unlinked file retains its inode identifier. Since providers are indexed by inode identifier, and since libext2fs translates the path of a file to the inode identifier associated with the file in its directory entry, attempts to access a recreated file will always yield the new inode identifier and thus the new file provider.

Pipes, Listings and Notifications

In the previous implementation of this filesystem functionality, I had explored some other aspects of accessing a filesystem. One of these was the ability to obtain directory listings, usually exposed in Unix operating systems by the opendir and readdir functions. The previous implementation sought to expose such listings as files, this in an attempt to leverage the paging mechanisms already built, but the way that libext2fs provides such listing information is not particularly compatible with the random-access file model: instead, it provides something more like an iterator that involves the repeated invocation of a callback function, successively supplying each directory entry for the callback function to process.

For this new implementation, I decided to expose directory listings via pipes, with a server thread accessing the filesystem and, in that callback function, writing directory entries to one end of a pipe, and with a client thread reading from the other end of the pipe. Of course, this meant that I needed to have an implementation of pipes! In my previous efforts, I did implement pipes as a kind of investigation, and one can certainly make this very complicated indeed, but I deliberately kept this very simple in this current round of development, merely having a couple of memory regions, one being used by the reader and one being used by the writer, with each party transferring the regions to the other (and blocking) if they find themselves respectively running out of content or running out of space.

One necessary element in making pipes work is that of coordinating the reading and writing parties involved. If we restrict ourselves to a pipe that will not expand (or even not expand indefinitely) to accommodate more data, at some point a writer may fill the pipe and may then need to block, waiting for more space to become available again. Meanwhile, a reader may find itself encountering an empty pipe, perhaps after having read all available data, and it may need to block and wait for more content to become available again. Readers and writers both need a way of waiting efficiently and requesting a notification for when they might start to interact with the pipe again.

To support such efficient blocking, I introduced a notifier abstraction for use in programs that could be instantiated and a reference to such an instance (in the form of a capability) presented in a subscription request to the pipe endpoint. Upon invoking the wait operation on a notifier, the notifier will cause the program (or a thread within a program) to wait for the delivery of a notification from the pipe, this being efficient because the thread will then sleep, only to awaken if a message is sent to it. Here is how pipes make use of notifiers to support blocking reads and writes:

Communication via pipes employing notifications

The use of notifications when programs communicate via a pipe.

A certain amount of plumbing is required behind the scenes to support notifications. Since programs accessing files will have their own sessions, there needs to be a coordinating object representing each file itself, this being able to propagate notification events to the users of the file concerned. Fortunately, I introduced the notion of a “provider” object in my architecture that can act in such a capacity. When an event occurs, the provider will send a notification to each of the relevant notifier endpoints, also providing some indication of the kind of event occurring. Previously, I had employed L4Re’s IRQ (interrupt request) objects as a means of delivering notifications to programs, but these appear to be very limited and do not allow additional information to be conveyed, as far as I can tell.

One objective I had with a client-side notifier was to support waiting for events from multiple files or streams collectively, instead of requiring a program to have threads that wait for events from each file individually, thus attempting to support the functionality provided by Unix functions like select and poll. Such functionality relies on additional information indicating the kind of event that has occurred. The need to wait for events from numerous sources also inverts the roles of client and server, with a notifier effectively acting like a server but residing in a client program, waiting for messages from its clients, these typically residing in the filesystem server framework.

Testing and Layering

Previously, I found that it was all very well developing functionality, but only through a commitment to testing it would I discover its flaws. When having to develop functionality at a number of levels in a system at the same time, testing generally starts off in a fairly limited fashion. Initially, I reintroduced a “block” server that merely provides access to a contiguous block of data, this effectively simulating storage device access that will hopefully be written at some point, and although genuine filesystem support utilises this block server, it is reassuring to be able to know whether it is behaving correctly. Meanwhile, for programs to access servers, they must send requests to those servers, assisted by a client library that provides support for such interprocess communication at a fairly low level. Thus, initial testing focused on using this low-level support to access the block server and verify that it provides access to the expected data.

On top of the lowest-level library functionality is a more usable level of “client” functions that automates the housekeeping needing to be done so that programs may expect an experience familiar to that provided by traditional C library functionality. Again, testing of file operations at that level helped to assess whether library and server functionality was behaving in line with expectations. With some confidence, the previously-developed ext2 filesystem functionality was reintroduced and updated. By layering the ext2 filesystem server on top of the block server, the testing activity is actually elevated to another level: libext2fs absolutely depends on properly functioning access to the block device; otherwise, it will not be able to perform even the simplest operations on files.

When acquainting myself with libext2fs, I developed a convenience library called libe2access that encapsulates some of the higher-level operations, and I made a tool called e2access that is able to populate a filesystem image from a normal program. This tool, somewhat reminiscent of the mtools suite that was popular at one time to allow normal users to access floppy disks on a system, is actually a fairly useful thing to have, and I remain surprised that there isn’t anything like it in common use. In any case, e2access allows me to populate images for use in L4Re, but I then thought that an equivalent to it would also be useful in L4Re for testing purposes. Consequently, a tool called fsaccess was created, but unlike e2access it does not use libe2access or libext2fs directly: instead, it uses the “client” filesystem library, exercising filesystem access via the IPC system and filesystem server architecture.

Ultimately, testing will be done completely normally using C library functions, these wrapping the “client” library. At that point, there will be no distinction between programs running within L4Re and within Unix. To an extent, L4Re already supports normal Unix-like programs using C library functions, this being particularly helpful when developing all this functionality, but of course it doesn’t support “proper” filesystems or Unix-like functionality in a particularly broad way, with various common C library or POSIX functions being stubs that do nothing. Of course, all this effort started out precisely to remedy these shortcomings.

Paging, Loading and Running Programs

Beyond explicitly performed file access, the next level of mutually-reinforcing testing and development came about through the simple desire to have a more predictable testing environment. In wanting to be able to perform tests sequentially, I needed control over the initiation of programs and to be able to rely on their completion before initiating successive programs. This may well be possible within L4Re’s Lua-based scripting environment, but I generally find the details to be rather thin on the ground. Besides, the problem provided some motivation to explore and understand the way that programs are launched in the environment.

There is some summary-level information about how programs (or tasks) are started in L4Re – for example, pages 41 onwards of “Memory, IPC, and L4Re” – but not much in the way of substantial documentation otherwise. Digging into the L4Re libraries yielded a confusing array of classes and apparent interactions which presumably make sense to anyone who is already very familiar with the specific approach being taken, as well as the general techniques being applied, but it seems difficult for outsiders to distinguish between the specifics and the generalities.

Nevertheless, some ideas were gained from looking at the code for various L4Re components including Moe (the root task), Ned (the init program), the loader and utilities libraries, and the oddly-named l4re_kernel component, this actually providing the l4re program which itself hosts actual programs by providing the memory management functionality necessary for those programs to work. In fact, we will eventually be looking at a solution that replicates that l4re program.

A substantial amount of investigation and testing took place to explore the topic. There were a number of steps required to initialise a new program:

Open the program executable file and obtain details of the different program segments and the program’s start address, this requiring some knowledge of ELF binaries.
Initialise a stack for the program containing the arguments to be presented to it, plus details of the program’s environment. The environment is of particular concern.
Create a task for the program together with a thread to begin execution at the start address, setting the stack pointer to the appropriate place in where the stack should be made available.
Initialise a control block for the thread.
Start the thread. This should immediately generate a page fault because the memory at the start address is not yet available within the task.
Service page faults for the program, providing pages for the program code – thus resolving that initial page fault – as well as for the stack and other regions of memory.

Naturally, each of these steps entails a lot more work than is readily apparent. Particularly the last step is something of an understatement in terms of what is required: the mechanism by which demand paging of the program is to be achieved.

L4Re provides some support for inspecting ELF binaries in its utilities library, but I found the ELF specification to be very useful in determining the exact purposes of various program header fields. For more practical guidance, the OSDev wiki page about ELF provides an explanation of the program loading process, along with how the different program segments are to be applied in the initialisation of a new program or process. With this information to hand, together with similar descriptions in the context of L4Re, it became possible to envisage how the address space of a new program might be set up, determining which various parts of the program file might be installed and where they might be found. I wrote some test programs, making some use of the structures in the utilities library, but wrote my own functions to extract the segment details from an ELF binary.

I found a couple of helpful resources describing the initialisation of the program stack: “Linux x86 Program Start Up” and “How statically linked programs run on Linux”. These mainly demystify the code that is run when a program starts up, setting up a program before the user’s main function is called, giving a degree of guidance about the work required to set up the stack so that such code may perform as expected. I was, of course, also able to study what the various existing L4Re components were doing in this respect, although I found the stack abstractions used to be idiomatic C/C++ bordering on esoteric. Nevertheless, the exercise involves obtaining some memory that can eventually be handed over to the new program, populating that memory, and then making it available to the new program, either immediately or on request.

Although I had already accumulated plenty of experience passing object capabilities around in L4Re, as well as having managed to map memory between tasks by sending the appropriate message items, the exact methods of setting up another task with memory and capabilities had remained mysterious to me, and so began another round of experimentation. What I wanted to do was to take a fairly easy route to begin with: create a task, populate some memory regions containing the program code and stack, transfer these things to the new task (using the l4_task_map function), and then start the thread to run the program, just to see what happened. Transferring capabilities was fairly easily achieved, and the L4Re libraries and frameworks do employ the equivalent of l4_task_map in places like the Remote_app_model class found in libloader, albeit obfuscated by the use of the corresponding C++ abstractions.

Frustratingly, this simple approach did not seem to work for the memory, and I could find only very few cases of anyone trying to use l4_task_map (or its equivalent C++ incantations) to transfer memory. Despite the memory apparently being transferred to the new task, the thread would immediately cause a page fault. Eventually, a page fault is what we want, but that would only occur because no memory would be made available initially, precisely because we would be implementing a demand paging solution. In the case of using l4_task_map to set up program memory, there should be no new “demand” for pages of such memory, this demand having been satisfied in advance. Nevertheless, I decided to try and get a page fault handler to supply flexpages to resolve these faults, also without success.

Having checked and double-checked my efforts, an enquiry on the l4-hackers list yielded the observation that the memory I had reserved and populated had not been configured as “executable”, for use by code in running programs. And indeed, since I had relied on the plain posix_memalign function to allocate that memory, it wasn’t set up for such usage. So, I changed my memory allocation strategy to permit the allocation of appropriately executable memory, and fortunately the problem was solved. Further small steps were then taken. I sought to introduce a region mapper that would attempt to satisfy requests for memory regions occurring in the new program, these occurring because a program starting up in L4Re will perform some setting up activities of its own. These new memory regions would be recognised by the page fault handler, with flexpages supplied to resolve page faults involving those regions. Eventually, it became possible to run a simple, statically-linked program in its own task.

Supporting program loading with an external page fault handler

When loading and running a new program, an external page fault handler makes sure that accesses to memory are supported by memory regions that may be populated with file content.

Up to this point, the page fault handler had been external to the new task and had been supplying memory pages from its own memory regions. Requests for data from the program file were being satisfied by accessing the appropriate region of the file, this bringing in the data using the file’s paging mechanism, and then supplying a flexpage for that part of memory to the program running in the new task. This particular approach compels the task containing the page fault handler to have a memory region dedicated to the file. However, the more elegant solution involves having a page fault handler communicating directly with the file’s pager component which will itself supply flexpages to map the requested memory pages into the new task. And to be done most elegantly, the page fault handler needs to be resident in the same task as the actual program.

Putting the page fault handler and the actual program in the same task demanded some improvements in the way I was setting up tasks and threads, providing capabilities to them, and so on. Separate stacks need to be provided for the handler and the program, and these will run in different threads. Moving the page fault handler into the new task is all very well, but we still need to be able to handle page faults that the “internal” handler might cause, so this requires us to retain an “external” handler. So, the configuration of the handler and program are slightly different.

Another tricky aspect of this arrangement is how the program is configured to send its page faults to the handler running alongside it – the internal handler – instead of the one servicing the handler itself. This requires an IPC gate to be created for the internal handler, presented to it via its configuration, and then the handler will bind to this IPC gate when it starts up. The program may then start up using a reference to this IPC gate capability as its “pager” or page fault handler. You would be forgiven for thinking that all of this can be quite difficult to orchestrate correctly!

Configuring the communication between program and page fault handler

An IPC gate must be created and presented to the page fault handler for it to bind to before it is presented to the program as its “pager”.

Although I had previously been sending flexpages in messages to satisfy map requests, the other side of such transactions had not been investigated. Senders of map requests will specify a “receive window” to localise the placement of flexpages returned from such requests, this being an intrinsic part of the flexpage concept. Here, some aspects of the IPC system became more prominent and I needed to adjust the code generated by my interface description language tool which had mostly ignored the use of message buffer registers, employing them only to control the reception of object capabilities.

More testing was required to ensure that I was successfully able to request the mapping of memory in a particular region and that the supplied memory did indeed get mapped into the appropriate place. With that established, I was then able to modify the handler deployed to the task. Since the flexpage returned by the dataspace (or resource) providing access to file content effectively maps the memory into the receiving task, the page fault handler does not need to explicitly return a valid flexpage: the mapping has already been done. The semantics here were not readily apparent, but this approach appears to work correctly.

The use of an internal page fault handler with a new program

An internal page fault handler satisfies accesses to memory from the program running in the same task, providing it with access to memory regions that may be populated with file content.

One other detail that proved to be important was that of mapping file content to memory regions so that they would not overlap somehow and prevent the correct region from being used to satisfy page faults. Consider the following regions of the test executable file described by the readelf utility (with the -l option):

  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  LOAD           0x0000000000000000 0x0000000001000000 0x0000000001000000
                 0x00000000000281a6 0x00000000000281a6  R E    0x1000
  LOAD           0x0000000000028360 0x0000000001029360 0x0000000001029360
                 0x0000000000002058 0x0000000000008068  RW     0x1000

Here, we need to put the first region providing the program code at a virtual address of 0x1000000, having a size of at least 0x281a6, populated with exactly that amount of content from the file. Meanwhile, we need to put the second region at address 0x1029360, having a size of 0x8068, but only filled with 0x2058 bytes of data. Both regions need to be aligned to addresses that are multiples of 0x1000, but their contents must be available at the stated locations. Such considerations brought up two apparently necessary enhancements to the provision of file content: the masking of content so that undefined areas of each region are populated with zero bytes, this being important in the case of the partially filled data region; the ability to support writes to a region without those writes being propagated to the original file.

The alignment details help to avoid the overlapping of regions, and the matter of populating the regions can be managed in a variety of ways. I found that since file content was already being padded at the extent of a file, I could introduce a variation of the page mapper already used to manage the population of memory pages that would perform such padding at the limits of regions defined within files. For read-only file regions, such a “masked” page mapper would issue a single read-only page containing only zero bytes for any part of a file completely beyond the limits of such regions, thus avoiding the allocation of lots of identical pages. For writable regions that are not to be committed to the actual files, a “copied” page mapper abstraction was introduced, this providing copy-on-write functionality where write accesses cause new memory pages to be allocated and used to retain the modified data.

Some packaging up of the functionality into library routines and abstractions was undertaken, although as things stand more of that still needs to be done. I haven’t even looked into support for dynamic library loading, nor am I handling any need to extend the program stack when that is necessary, amongst other things, and I also need to make the process of creating tasks as simple as a function call and probably also expose the process via IPC in order to have a kind of process server. I still need to get back to addressing the lack of convenient support for the sequential testing of functionality.

But I hope that much of the hard work has now already been done. Then again, I often find myself climbing one particular part of this mountain, thinking that the next part of the ascent will be easier, only to find myself confronted with another long and demanding stretch that brings me only marginally closer to the top! This article is part of a broader consolidation process, along with writing some documentation, and this process will continue with the packaging of this work for the historical record if nothing else.

Conclusions and Reflections

All of this has very much been a learning exercise covering everything from the nuts and bolts of L4Re, with its functions and abstractions, through the design of a component architecture to support familiar, intuitive but hard-to-define filesystem functionality, this requiring a deeper understanding of Unix filesystem behaviour, all the while considering issues of concurrency and resource management that are not necessarily trivial. With so much going on at so many levels, progress can be slow and frustrating. I see that similar topics and exercises are pursued in some university courses, and I am sure that these courses produce highly educated people who are well equipped to go out into the broader world, developing systems like these using far less effort than I seem to be applying.

That leads to the usual question of whether such systems are worth developing when one can just “use Linux” or adopt something already under development and aimed at a particular audience. As I note above, maybe people are routinely developing such systems for proprietary use and don’t see any merit in doing the same thing openly. The problem with such attitudes is that experience with the development of such systems is then not broadly cultivated, the associated expertise and the corresponding benefits of developing and deploying such systems are not proliferated, and the average user of technology only gets benefits from such systems in a limited sense, if they even encounter them at all, and then only for a limited period of time, most likely, before the products incorporating such technologies wear out or become obsolete.

In other words, it is all very well developing proprietary systems and celebrating achievements made decades ago, but having reviewed decades of computing history, it is evident to me that achievements that are not shared will need to be replicated over and over again. That such replication is not cutting-edge development or, to use the odious term prevalent in academia, “novel” is not an indictment of those seeking to replicate past glories: it is an indictment of the priorities of those who commercialised them on every prior occasion. As mundane as the efforts described in this article may be, I would hope that by describing them and the often frustrating journey involved in pursuing them, people may be motivated to explore the development of such systems and that techniques that would otherwise be kept as commercial secrets or solutions to assessment exercises might hopefully be brought to a broader audience.

Posted in English, Fiasco.OC, filesystems, Free Software, L4, libext2fs, MIPS, operating systems | Comments Off on Gradual Explorations of Filesystems, Paging and L4Re

Dataspaces and Paging in L4Re

Tuesday, March 19th, 2019

The experiments covered by my recent articles about filesystems and L4Re managed to lead me along another path in the past few weeks. I had defined a mechanism for providing access to files in a filesystem via a programming interface employing interprocess communication within the L4Re system. In doing so, I had defined calls or operations that would read from and write to a file, observing that some kind of “memory-mapped” file support might also be possible. At the time, I had no clear idea of how this would actually be made to work, however.

As can often be the case, once some kind of intellectual challenge emerges, it can become almost impossible to resist the urge to consider it and to formulate some kind of solution. Consequently, I started digging deeper into a number of things: dataspaces, pagers, page faults, and the communication that happens within L4Re via the kernel to support all of these things.

Dataspaces and Memory

Because the L4Re developers have put a lot of effort into making a system where one can compile a fairly portable program and probably expect it to work, matters like the allocation of memory within programs, the use of functions like malloc, and other things we take for granted need no special consideration in the context of describing general development for an L4Re-based system. In principle, if our program wants more memory for its own use, then the use of things like malloc will probably suffice. It is where we have other requirements that some of the L4Re abstractions become interesting.

In my previous efforts to support MIPS-based systems, these other requirements have included the need to access memory with a fixed and known location so that the hardware can be told about it, thus supporting things like framebuffers that retain stored data for presentation on a display device. But perhaps most commonly in a system like L4Re, it is the need to share memory between processes or tasks that causes us to look beyond traditional memory allocation techniques at what L4Re has to offer.

Indeed, the filesystem work so far employs what are known as dataspaces to allow filesystem servers and client applications to exchange larger quantities of information conveniently via shared buffers. First, the client requests a dataspace representing a region of memory. It then associates it with an address so that it may access the memory. Then, the client shares this with the server by sending it a reference to the dataspace (known as a capability) in a message.

Opening a file using shared memory containing file information

The kernel, in propagating the message and the capability, makes the dataspace available to the server so that both the client and the server may access the memory associated with the dataspace and that these accesses will just work without any further effort. At this level of sophistication we can get away with thinking of dataspaces as being blocks of memory that can be plugged into tasks. Upon obtaining access to such a block, reads and writes (or loads and stores) to addresses in the block will ultimately touch real memory locations.

Even in this simple scheme, there will be some address translation going on because each task has its own way of arranging its view of memory: its virtual address space. The virtual memory addresses used by a task may very well be different from the physical memory addresses indicating the actual memory locations involved in accesses.

An illustration of virtual memory corresponding to physical memory

Such address translation is at the heart of operating systems like those supported by the L4 family of microkernels. But the system will make sure that when a task tries to access a virtual address available to it, the access will be translated to a physical address and supported by some memory location.

Mapping and Paging

With some knowledge of the underlying hardware architecture, we can say that each task will need support from the kernel and the hardware to be able to treat its virtual address space as a way of accessing real memory locations. In my experiments with simple payloads to run on MIPS-based hardware, it was sufficient to define very simple tables that recorded correspondences between virtual and physical addresses. Processes or tasks would access memory addresses, and where the need arose to look up such a virtual address, the table would be consulted and the hardware configured to map the virtual address to a physical address.

Naturally, proper operating systems go much further than this, and systems built on L4 technologies go as far as to expose the mechanisms for normal programs to interact with. Instead of all decisions about how memory is mapped for each task being taken in the kernel, with the kernel being equipped with all the necessary policy and information, such decisions are delegated to entities known as pagers.

When a task needs an address translated, the kernel pushes the translation activity over to the designated pager for a decision to be made. And the event that demands an address translation is known as a page fault since it occurs when a task accesses a memory page that is not yet supported by a mapping to physical memory. Pagers are therefore present to receive page fault notifications and to respond in a way that causes the kernel to perform the necessary privileged actions to configure the hardware, this being one of the few responsibilities of the kernel.

The role of a pager in managing access to the contents of a dataspace

Treating a dataspace as an abstraction for memory accessed by a task or application, the designated pager for the dataspace acts as a dataspace manager, ensuring that memory accesses within the dataspace can be satisfied. If an access causes a page fault, the pager must act to provide a mapping for the accessed page, leaving the application mostly oblivious to the work going on to present the dataspace and its memory as a continuously present resource.

An Aside

It is rather interesting to consider the act of delegation in the context of processor architecture. It would seem to be fairly common that the memory management units provided by various architectures feature built-in support for consulting various forms of data structures describing the virtual memory layout of a process or task. So, when a memory access fails, the information about the actual memory address involved can be retrieved from such a predefined structure.

However, the MIPS architecture largely delegates such matters to software: a processor exception is raised when a “bad” virtual address is used, and the job of doing something about it falls immediately to a software routine. So, there seems to be some kind of parallel between processor architecture and operating system architecture, L4 taking a MIPS-like approach of eager delegation to a software component for increased flexibility and functionality.

Messages and Flexpages

So, the high-level view so far is as follows:

Dataspaces represent regions of virtual memory
Virtual memory is mapped to physical memory where the data actually resides
When a virtual memory access cannot be satisfied, a page fault occurs
Page faults are delivered to pagers (acting as dataspace managers) for resolution
Pagers make data available and indicate the necessary mapping to satisfy the failing access

To get to the level of actual implementation, some familiarisation with other concepts is needed. Previously, my efforts have exposed me to the interprocess communications (IPC) central in L4Re as a microkernel-based system. I had even managed to gain some level of understanding around sending references or capabilities between processes or tasks. And it was apparent that this mechanism would be used to support paging.

Unfortunately, the main L4Re documentation does not seem to emphasise the actual message details or protocols involved in these fundamental activities. Instead, the library code is described in reference documentation with some additional explanation. However, some investigation of the code yielded some insights as to the kind of interfaces the existing dataspace implementations must support, and I also tracked down some message sending activities in various components.

When a page fault occurs, the first thing to know about it is the kernel because the fault occurs at the fundamental level of instruction execution, and it is the kernel’s job to deal with such low-level events in the first instance. Notification of the fault is then sent out of the kernel to the page fault handler for the affected task. The page fault handler then contacts the task’s pager to request a resolution to the problem.

Page fault handling in detail

In L4Re, this page fault handler is likely to be something called a region mapper (or perhaps a region manager), and so it is not completely surprising that the details of invoking the pager was located in some region mapper code. Putting together both halves of the interaction yielded the following details of the message:

map: offset, hot spot, flags → flexpage

Here, the offset is the position of the failing memory access relative to the start of the dataspace; the flags describe the nature of the memory to be accessed. The “hot spot” and “flexpage” need slightly more explanation, the latter being an established term in L4 circles, the former being almost arbitrarily chosen and not particularly descriptive.

The term “flexpage” may have its public origins in the “Flexible-Sized Page-Objects” paper whose title describes the term. For our purposes, the significance of the term is that it allows for the consideration of memory pages in a range of sizes instead of merely considering a single system-wide page size. These sizes start at the smallest page size supported by the system (but not necessarily the absolute smallest supported by the hardware, but anyway…) and each successively larger size is double the size preceding it. For example:

4096 (2¹²) bytes
8192 (2¹³) bytes
16384 (2¹⁴) bytes
32768 (2¹⁵) bytes
65536 (2¹⁶) bytes
…

When a page fault occurs, the handler identifies a region of memory where the failing access is occurring. Although it could merely request that memory be made available for a single page (of the smallest size) in which the access is situated, there is the possibility that a larger amount of memory be made available that encompasses this access page. The flexpage involved in a map request represents such a region of memory, having a size not necessarily decided in advance, being made available to the affected task.

This brings us to the significance of the “hot spot” and some investigation into how the page fault handler and pager interact. I must admit that I find various educational materials to be a bit vague on this matter, at least with regard to explicitly describing the appropriate behaviour. Here, the flexpage paper was helpful in providing slightly different explanations, albeit employing the term “fraction” instead of “hot spot”.

Since the map request needs to indicate the constraints applying to the region in which the failing access occurs, without demanding a particular size of region and yet still providing enough useful information to the pager for the resulting flexpage to be useful, an efficient way is needed of describing the memory landscape in the affected task. This is apparently where the “hot spot” comes in. Consider a failing access in page #3 of a memory region in a task, with the memory available in the pager to satisfy the request being limited to two pages:

Mapping available memory to pages in a task experiencing a page fault

Here, the “hot spot” would reference page #3, and this information would be received by the pager. The significance of the “hot spot” appears to be the location of the failing access within a flexpage, and if the pager could provide it then a flexpage of four system pages would map precisely to the largest flexpage expected by the handler for the task.

However, with only two system pages to spare, the pager can only send a flexpage consisting of those two pages, the “hot spot” being localised in page #1 of the flexpage to be sent, and the base of this flexpage being the base of page #0. Fortunately, the handler is smart enough to fit this smaller flexpage onto the “receive window” by using the original “hot spot” information, mapping page #2 in the receive window to page #0 of the received flexpage and thus mapping the access page #3 to page #1 of that flexpage.

So, the following seems to be considered and thus defined by the page fault handler:

The largest flexpage that could be used to satisfy the failing access.
The base of this flexpage.
The page within this flexpage where the access occurs: the “hot spot”.
The offset within the broader dataspace of the failing access, it indicating the data that would be expected in this page.

(Given this phrasing of the criteria, it becomes apparent that “flexpage offset” might be a better term than “hot spot”.)

With these things transferred to the pager using a map request, the pager’s considerations are as follows:

How flexpages of different sizes may fit within the memory available to satisfy the request.
The base of the most appropriate flexpage, where this might be the largest that fits within the available memory.
The population of the available memory with data from the dataspace.

To respond to the request, the pager sends a special flexpage item in its response message. Consequently, this flexpage is mapped into the task’s address space, and the execution of the task may resume with the missing data now available.

Practicalities and Pitfalls

If the dataspace being provided by a pager were merely a contiguous region of memory containing the data, there would probably be little else to say on the matter, but in the above I hint at some other applications. In my example, the pager only uses a certain amount of memory with which it responds to map requests. Evidently, in providing a dataspace representing a larger region, the data would have to be brought in from elsewhere, which raises some other issues.

Firstly, if data is to be copied into the limited region of memory available for satisfying map requests, then the appropriate portion of the data needs to be selected. This is mostly a matter of identifying how the available memory pages correspond to the data, then copying the data into the pages so that the accessed location ultimately provides the expected data. It may also be the case that the amount of data available does not fill the available memory pages; this should cause the rest of those pages to be filled with zeros so that data cannot leak between map requests.

Secondly, if the available memory pages are to be used to satisfy the current map request, then what happens when we re-use them in each new map request? It turns out that the mappings made for previous requests remain active! So if a task traverses a sequence of pages, and if each successive page encountered in that traversal causes a page fault, then it will seem that new data is being made available in each of those pages. But if that task inspects the earlier pages, it will find that the newest data is exposed through those pages, too, banishing the data that we might have expected.

Of course, what is happening is that all of the mapped pages in the task’s dataspace now refer to the same collection of pages in the pager, these being dedicated to satisfying the latest map request. And so, they will all reflect the contents of those available memory pages as they currently are after this latest map request.

The effect of mapping the same page repeatedly

One solution to this problem is to try and make the task forget the mappings for pages it has visited previously. I wondered if this could be done automatically, by sending a flexpage from the pager with a flag set to tell the kernel to invalidate prior mappings to the pager’s memory. After a time looking at the code, I ended up asking on the l4-hackers mailing list and getting a very helpful response that was exactly what I had been looking for!

There is, in fact, a special way of telling the kernel to “unmap” memory used by other tasks (l4_task_unmap), and it is this operation that I ended up using to invalidate the mappings previously sent to the task. Thus the task, upon backtracking to earlier pages, finds that the mappings from virtual addresses to the physical memory holding the latest data are absent once again, and page faults are needed to restore the data in those pages. The result is a form of multiplexing access to a resource via a limited region of memory.

Applications of Flexible Paging

Given the context of my investigations, it goes almost without saying that the origin of data in such a dataspace could be a file in a filesystem, but it could equally be anything that exposes data in some kind of backing store. And with this backing store not necessarily being an area of random access memory (RAM), we enter the realm of a more restrictive definition of paging where processes running in a system can themselves be partially resident in RAM and partially resident in some other kind of storage, with the latter portions being converted to the former by being fetched from wherever they reside, depending on the demands made on the system at any given point in time.

One observation worth making is that a dataspace does not need to be a dedicated component in the system in that it is not a separate and special kind of entity. Anything that is able to respond to the messages understood by dataspaces – the paging “protocol” – can provide dataspaces. A filesystem object can therefore act as a dataspace, exposing itself in a region of memory and responding to map requests that involve populating that region from the filesystem storage.

It is also worth mentioning that dataspaces and flexpages exist at different levels of abstraction. Dataspaces can be considered as control mechanisms for accessing regions of virtual memory, and the Fiasco.OC kernel does not appear to employ the term at all. Meanwhile, flexpages are abstractions for memory pages existing within or even independently of dataspaces. (If you wish, think of the frame of Banksy’s work “Love is in the Bin” as a dataspace, with the shredded pieces being flexpages that are mapped in and out.)

One can envisage more exotic forms of dataspace. Consider an image whose pixels need to be computed, like a ray-traced image, for instance. If it exposed those pixels as a dataspace, then a task reading from pages associated with that dataspace might cause computations to be initiated for an area of the image, with the task being suspended until those computations are performed and then being resumed with the pixel data ready to read, with all of this happening largely transparently.

I started this exercise out of somewhat idle curiosity, but it now makes me wonder whether I might introduce memory-mapped access to filesystem objects and then re-implement operations like reading and writing using this particular mechanism. Not being familiar with how systems like GNU/Linux provide these operations, I can only speculate as to whether similar decisions have been taken elsewhere.

But certainly, this exercise has been informative, even if certain aspects of it were frustrating. I hope that this account of my investigations proves useful to anyone else wondering about microkernel-based systems and L4Re in particular, especially if they too wish there were more discussion, reflection and collaboration on the design and implementation of software for these kinds of systems.

Posted in English, Fiasco.OC, Free Software, L4, operating systems | Comments Off on Dataspaces and Paging in L4Re

Integrating libext2fs with a Filesystem Framework

Wednesday, February 20th, 2019

Given the content covered by my previous articles, there probably doesn’t seem to be too much that needs saying about the topic covered by this article. Previously, I described the work involved in building libext2fs for L4Re and testing the library, and I described a framework for separating filesystem providers from programs that want to use files. But, as always, there are plenty of little details, detours and learning experiences that help to make the tale longer than it otherwise might have been.

Although this file access framework sounds intimidating, it is always worth remembering that the only exotic thing about the software being written is that it needs to request system resources and to communicate with other programs. That can be tricky in itself in many programming environments, and I have certainly spent enough time trying to figure out how to use the types and functions provided by the many L4Re libraries so that these operations may actually work.

But in the end, these are programs that are run just like any other. We aren’t building things into the kernel and having to conform to a particularly restricted environment. And although it can still be tiresome to have to debug things, particularly interprocess communication (IPC) problems, many familiar techniques for debugging and inspecting program behaviour remain available to us.

A Quick Translation

The test program I had written for libext2fs simply opened a file located in the “rom” filesystem, exposed it to libext2fs, and performed operations to extract content. In my framework, I had directed my attention towards opening and reading files, so it made sense to concentrate on providing this functionality in a filesystem server or “provider”.

Accessing a filesystem server employing a "rom" file for the data

The user of the framework (shielded from the details by a client library) would request the opening of a file (thus obtaining a file descriptor able to communicate with a dedicated resource object) and then read from the file (causing communication with the resource object and some transfers of data). These operations, previously done in a single program employing libext2fs directly, would now require collaboration by two separate programs.

So, I would need to insert the appropriate code in the right places in my filesystem server and its objects to open a filesystem, search for a file of the given name, and to provide the file data. For the first of these, the test program was doing something like this in the main function:

retval = ext2fs_open(devname, EXT2_FLAG_RW, 0, 0, unix_io_manager, &fs);

In the main function of the filesystem server program, something similar needs to be done. A reference to the filesystem (fs) is then passed to the server object for it to use:

Fs_server server_obj(fs, devname);

When a request is made to open a file, the filesystem server needs to locate the file just as the test program needed to. The code to achieve this is tedious, employing the ext2fs_lookup function and traversing the directory hierarchy. Ultimately, something like this needs to be done to obtain a structure for accessing the file contents:

retval = ext2fs_file_open(_fs, ino_file, ext2flags, &file);

Here, the _fs variable is our reference in the server object to the filesystem structure, the ino_file variable refers to the place in the filesystem where the file is found (the inode), some flags indicate things like whether we are reading and/or writing, and a supplied file variable is set upon the successful opening of the file. In the filesystem server, we want to create a specific object to conduct access to the file:

Fs_object *obj = new Fs_object(file, EXT2_I_SIZE(&inode_file), fsobj, irq);

Here, this resource object is initialised with the file access structure, an indication of the file size, something encapsulating the state of the communication between client and server, and the IRQ object needed for cleaning up (as described in the last article). Meanwhile, in the resource object, the read operation is supported by a pair of libext2fs functions:

ext2fs_file_lseek(_file, _obj.position, EXT2_SEEK_SET, 0);
ext2fs_file_read(_file, _obj.buffer, to_transfer, &read);

These don’t appear next to each other in the actual code, but the first call is used to seek to the indicated position in the file, this having been specified by the client. The second call appears in a loop to read into a buffer an indicated amount of data, returning the amount that was actually read.

In summary, the work done by a collection of function calls appearing together in a single function is now spread out over three places in the filesystem server program:

The initialisation is done in the main function as the server starts up
The locating and opening of a file in the filesystem is done in the general filesystem server object
Reading and writing is done in the file-specific resource object

After initialisation, the performance of each part of the work only occurs upon receiving a distinct kind of message from a client program, of which more details are given below.

The Client Library

Although we cannot yet use the familiar C library functions for accessing files (fopen, fread, fwrite, fclose, and so on), we can employ functions that try to be as friendly. Thus, the following form of program may be used:

char buffer[80];
file_descriptor_t *desc = client_open("test.txt", O_RDONLY);

available = client_read(desc, buffer, 80);
if (available)
    fwrite((void *) buffer, sizeof(char), available, stdout); /* using existing fwrite function */
client_close(desc);

As noted above, the existing fwrite function in L4Re may be used to write file data out to the console. Ultimately, we would want our modified version of the function to be doing this job.

These client library functions resemble lower-level C library functions such as open, read, write, close, and so on. By targeting this particular level of functionality, it is hoped that much of the logic in functions like fopen can be preserved, this logic having to deal with things like mode strings (“r”, “r+”, “w”, and so on) which have little to do with the actual job of transmitting file content around the system.

In order to do their work, the client library functions need to send and receive IPC messages, or at least need to get other functions to deal with this particular work. My approach has been to write a layer of functions that only deals with messaging and that hides the L4-specific details from the rest of the code.

This lower-level layer of functions allows us to treat interprocess interactions like normal function calls, and in this framework those calls would have the following signatures, with the inputs arriving at the server and the outputs arriving back at the client:

fs_open: flags, buffer → file size, resource object
fs_flush: (no parameters) → (no return values)
fs_read: position → available
fs_write: position, available → written, file size

Here, the aim is to keep the interprocess interactions as simple and as infrequent as possible, with data buffered in the indicated buffer dataspace, and with reading and writing only occurring when the buffer is read or has been filled by writing. The more friendly semantics therefore need to be supported in the client library functions resting on top of these even-lower-level IPC messaging functions.

The responsibilities of the client library functions can be summarised as follows:

client_open: allocate memory for the buffer, obtain a server reference (“capability”) from the program’s environment
client_close: deallocate the allocated resources
client_flush: invoke fs_flush with any available data, resetting the buffer status
client_read: provide data to the caller from its buffer, invoking fs_read whenever the buffer is empty
client_write: commit data from the caller into the buffer, invoking fs_write whenever the buffer is full, also flushing the buffer when appropriate

The lack of a fs_close function might seem surprising, but as described in the previous article, the server process is designed to receive a notification when the client process discards a reference to the resource object dedicated to a particular file. So in client_close, we should be able to merely throw away the things acquired by client_open, and the system together with the server will hopefully handle the consequences.

Switching the Backend

Using a conventional file as the repository for file content is convenient, but since the aim is to replace the existing filesystem mechanisms, it would seem necessary to try and get libext2fs to use other ways of accessing the underlying storage. Previously, my considerations had led me to provide a “block” storage layer underneath the filesystem layer. So it made sense to investigate how libext2fs might communicate with a “block server” or “block device” in order to read and write raw filesystem data.

Employing a separate server to provide filesystem data

Changing the way libext2fs accesses its storage sounds like an ominous task, but fortunately some thought has evidently gone into accommodating different storage types and platforms. Indeed, the library code includes support for things like DOS and Windows, with this functionality evidently being used by various applications on those platforms (or, these days, the latter one, at least) to provide some kind of file browser support for ext2-family filesystems.

The kind of component involved in providing this variety of support is known as an “I/O manager”, and the one that we have been using is known as the “Unix” I/O manager, this employing POSIX or standard C library calls to access files and devices. Now, this may have been adequate until now, but with the requirement that we use the replacement IPC mechanisms to access a block server, we need to consider how a different kind of I/O manager might be implemented to use the client library functions instead of the C library functions.

This exercise turned out to be relatively straightforward and perhaps a little less work than envisaged once the requirements of initialising an io_channel object had been understood, this involving the allocation of memory and the population of a structure to indicate things like the block size, error status, and so on. Beyond this, the principal operations needing support are as follows:

open: initialises the io_channel and calls client_open
close: calls client_close
set block size: sets the block size for transfers, something that gets done at various points in the opening of a filesystem
read block: calls client_seek and client_read to obtain data from the block server
write block: calls client_seek and client_write to commit data to the block server

It should be noted that the block server largely acts like a single-file filesystem, so the same interface supported by the filesystem server is also supported by the block server. This is how we get away with using the client libraries.

Meanwhile, in the filesystem server code, the only changes required are to declare the new I/O manager, implemented in a separate library package, and to use it instead of the previous one:

retval = ext2fs_open(devname, ext2flags, 0, 0, blockserver_io_manager, &fs);

The Final Trick

By pushing use of the “rom” filesystem further down in the system, use of the new file access mechanisms can be introduced and tested, with the only “unauthentic” aspect of the arrangement being that a parallel set of file access functions is being used instead of the conventional ones. The only thing left to do would be to change the C library to incorporate the new style of file access, probably by incorporating the client library internally, thus switching the C library away from its previous method of accessing files.

With the conventional file abstractions reimplemented, access to files would go via the virtual filesystem and hopefully end up encountering block devices that are able to serve up the needed data directly. And ultimately, we could end up switching back to using the Unix I/O manager with libext2fs.

Introducing the new IPC mechanisms at the C library level

Changing things so drastically would also force us to think about maintaining access to the “rom” filesystem through the revised architecture, at least at first, because it happens to provide a very convenient way of getting access to data for use as storage. We could try and implement storage hardware support in order to get round this problem, but that probably isn’t convenient – or would be a distraction – when running L4Re on Fiasco.OC-UX as a kind of hosted version of the software.

Indeed, tackling the C library is probably too much of a challenge at this early stage. Fortunately, there are plenty of other issues to be considered first, with the use of non-standard file access functions being only a minor inconvenience in the broader scheme of things. For instance, how are permissions and user identities to be managed? What about concurrent access to the filesystem? And what mechanisms would need to be provided for grafting filesystems onto a larger virtual filesystem hierarchy? I hope to try and discuss some of these things in future articles.

Posted in English, Fiasco.OC, filesystems, Free Software, L4, libext2fs, operating systems | Comments Off on Integrating libext2fs with a Filesystem Framework

Filesystem Abstractions for L4Re

Tuesday, February 12th, 2019

In my previous posts, I discussed the possibility of using “real world” filesystems in L4Re, initially considering the nature of code to access an ext2-based filesystem using the library known as libext2fs, then getting some of that code working within L4Re itself. Having previously investigated the nature of abstractions for providing filesystems and file objects to applications, it was inevitable that I would now want to incorporate libext2fs into those abstractions and to try and access files residing in an ext2 filesystem using those abstractions.

It should be remembered that L4Re already provides a framework for filesystem access, known as Vfs or “virtual file system”. This appears to be the way the standard file access functions are supported, with the “rom” part of the filesystem hierarchy being supported by a “namespace filesystem” library that understands the way that the deployed payload modules are made available as files. To support other kinds of filesystem, other libraries must apparently be registered and made available to programs.

Although I am sure that the developers of the existing Vfs framework are highly competent, I found the mechanisms difficult to follow and quite unlike what I expected, which was to see a clear separation between programs accessing files and other programs providing those files. Indeed, this is what one sees when looking at other systems such as Minix 3 and its virtual filesystem. I must also admit that I became tired of having to dig into the code to understand the abstractions in order to supplement the reference documentation for the Vfs framework in L4Re.

An Alternative Framework

It might be too soon to label what I have done as a framework, but at the very least I needed to decide upon a few mechanisms to implement an alternative approach to providing file-like abstractions to programs within L4Re. There were a few essential characteristics to be incorporated:

A way of requesting access to a named file
The provision of objects maintaining the state of access to an opened file
The transmission of file content to file readers and from file writers
A way of cleaning up when programs are no longer accessing files

One characteristic that I did want to uphold in any solution was to make programs largely oblivious to the nature of the accessed filesystems. They would navigate a virtual filesystem hierarchy, just as one does in Unix-like systems, with certain directories acting as gateways to devices exposing potentially different filesystems with superficially similar semantics.

Requesting File Access

In a system like L4Re, with notions of clients and servers already prevalent, it seems natural to support a mechanism for requesting access to files that sees a client – a program wanting to access a file – delegating the task of locating that file to a server. How the server performs this task may be as simple or as complicated as we wish, depending on what kind of architecture we choose to support. In an operating system with a “monolithic” kernel, like GNU/Linux, we also see such delegation occurring, with the kernel being the entity having to support the necessary wiring up of filesystems contributing to the virtual filesystem.

So, it makes sense to support an “open” system call just like in other operating systems. The difference here, however, is that since L4Re is a microkernel-based environment, both the caller and the target of the call are mere programs, with the kernel only getting involved to route the call or message between the programs concerned. We would first need to make sure that the program accessing files has a reference (known as a “capability”) to another program that provides a filesystem and can respond to this “open” message. This wiring up of programs is a task for the system’s configuration file, as featured in some of my previous articles.

We may now consider what the filesystem-providing program or filesystem “server” needs to do when receiving an “open” message. Let us consider the failure to find the requested file: the filesystem server would, in such an event, probably just return a response indicating failure without any real need to do anything else. It is in the case of a successful lookup that the response to the caller or client needs some more consideration: the server could indicate success, but what is the client going to do with such information? And how should the server then facilitate further access to the file itself?

Providing Objects for File Access

It becomes gradually clearer that the filesystem server will need to allocate some resources for the client to conduct its activities, to hold information read from the filesystem itself and to hold data sent for writing back to the opened file. The server could manage this within a single abstraction and support a range of different operations, accommodating not only requests to open files but also operations on the opened files themselves. However, this might make the abstraction complicated and also raise issues around things like concurrency.

What if this server object ends up being so busy that while waiting for reading or writing operations to complete on a file for one program, it leaves other programs queuing up to ask about or gain access to other files? It all starts to sound like another kind of abstraction would be beneficial for access to specific files for specific clients. Consequently, we end up with an arrangement like this:

Accessing a filesystem and a resource

When a filesystem server receives an “open” message and locates a file, it allocates a separate object to act as a contact point for subsequent access to that file. This leaves the filesystem object free to service other requests, with these separate resource objects dealing with the needs of the programs wanting to read from and write to each individual file.

The underlying mechanisms by which a separate resource object is created and exposed are as follows:

The instantiation of an object in the filesystem server program holding the details of the accessed file.
The creation of a new thread of execution in which the object will run. This permits it to handle incoming messages concurrently with the filesystem object.
The creation of an “IPC gate” for the thread. This effectively exposes the object to the wider environment as what often appears to be known as a “kernel object” (rather confusingly, but it simply means that the kernel is aware of it and has to do some housekeeping for it).

Once activated, the thread created for the resource is dedicated to listening for incoming messages and handling them, invoking methods on the resource object as a proxy for the client sending those messages to achieve the same effect.

Transmitting File Content

Although we have looked at how files manifest themselves and may be referenced, the matter of obtaining their contents has not been examined too closely so far. A program might be able to obtain a reference to a resource object and to send it messages and receive responses, but this is not likely to be sufficient for transferring content to and from the file. The reason for this is that the messages sent between programs – or processes, since this is how we usually call programs that are running – are deliberately limited in size.

Thus, another way of exchanging this data is needed. In a situation where we are reading from a file, what we would most likely want to see is a read operation populate some memory for us in our process. Indeed, in a system like GNU/Linux, I imagine that the Linux kernel shuttles the file data from the filesystem module responsible to an area of memory that it has reserved and exposed to the process. In a microkernel-based system, things have to be done more “collaboratively”.

The answer, it would seem to me, is to have dedicated memory that is shared between processes. Fortunately, and arguably unsurprisingly, L4Re provides an abstraction known as a “dataspace” that provides the foundation for such sharing. My approach, then, involves requesting a dataspace to act as a conduit for data, making the dataspace available to the file-accessing client and the file-providing server object, and then having a protocol to notify each other about data being sent in each direction.

I considered whether it would be most appropriate for the client to request the memory or whether the server should do so, eventually concluding that the client could decide how much space it would want as a buffer, handing this over to the server to use to whatever extent it could. A benefit of doing things this way is that the client may communicate initialisation details when it contacts the server, and so it becomes possible to transfer a filesystem path – the location of a file from the root of the filesystem hierarchy – without it being limited to the size of an interprocess message.

Opening a file using a path written to shared memory

So, the “open” message references the newly-created dataspace, and the filesystem server reads the path written to the dataspace’s memory so that it may use it to locate the requested file. The dataspace is not retained by the filesystem object but is instead passed to the resource object which will then share the memory with the application or client. As described above, a reference to the resource object is returned in the response to the “open” message.

It is worthwhile to consider the act of reading from a file exposed in this way. Although both client (the application in the above diagram) and server (resource object) should be able to access the shared “buffer”, it would not be a good idea to let them do so freely. Instead, their roles should be defined by the protocol employed for communication with one another. For a simple synchronous approach it would look like this:

Reading data from a resource via a shared buffer

Here, upon the application or client invoking the “read” operation (in other words, sending the “read” message) on the resource object, the resource is able to take control of the buffer, obtaining data from the file and writing it to the buffer memory. When it is done, its reply or response needs to indicate the updated state of the buffer so that the client will know how much data there is available, potentially amongst other things of interest.

Cleaning Up

Many of us will be familiar with the workflow of opening, reading and writing, and closing files. This final activity is essential not only for our own programs but also for the system, so that it does not tie up resources for activities that are no longer taking place. In the filesystem server, for the resource object, a “close” operation can be provided that causes the allocated memory to be freed and the resource object to be discarded.

However, merely providing a “close” operation does not guarantee that a program would use it, and we would not want a situation where a program exits or crashes without having invoked this operation, leaving the server holding resources that it cannot safely discard. We therefore need a way of cleaning up after a program regardless of whether it sees the need to do so itself.

In my earliest experiments with L4Re on the MIPS Creator CI20, I had previously encountered the use of interrupt request (IRQ) objects, in that case signalling hardware-initiated events such as the pressing of physical switches. But I also knew that the IRQ abstraction is employed more widely in L4Re to allow programs to participate in activities that would normally be the responsibility of the kernel in a monolithic architecture. It made me wonder whether there might be interrupts communicating the termination of a process that could then be used to clean up afterwards.

One area of interest is that concerning the “IPC gate” mentioned above. This provides the channel through which messages are delivered to a particular running program, and up to this point, we have considered how a resource object has its own IPC gate for the delivery of messages intended for it. But it also turns out that we can also enable notifications with regard to the status of the IPC gate via the same mechanism.

By creating an IRQ object and associating it with a thread as the “deletion IRQ”, when the kernel decides that the IPC gate is no longer needed, this IRQ will be delivered. And the kernel will make this decision when nothing in the system needs to use the IPC gate any more. Since the IPC gate was only created to service messages from a single client, it is precisely when that client terminates that the kernel will realise that the IPC gate has no other users.

Resource deletion upon the termination of a client

To enable this to actually work, however, a little trick is required: the server must indicate that it is ready to dispose of the IPC gate whenever necessary, doing so by decreasing the “reference count” which tracks how many things in the system are using the IPC gate. So this is what happens:

The IPC gate is created for the resource thread, and its details are passed to the client, exposing the resource object.
An IRQ object is bound to the thread and associated with the IPC gate deletion event.
The server decreases its reference count, relinquishing the IPC gate and allowing its eventual destruction.
The client and server communicate as desired.
Upon the client terminating, the kernel disassociates the client from the IPC gate, decreasing the reference count.
The kernel notices that the reference count is zero and sends an IRQ telling the server about the impending IPC gate deletion.
The resource thread in the server deallocates the resource object and terminates.
The IPC gate is deleted.

Using the “gate label”, the thread handling communications for the resource object is able to distinguish between the interrupt condition and normal messages from the client. Consequently, it is able to invoke the appropriate cleaning up routine, to discard the resource object, and to terminate the thread. Hopefully, with this approach, resource objects will no longer be hanging around long after their clients have disappeared.

Other Approaches

Another approach to providing file content did also occur to me, and I wondered whether this might have been a component of the “namespace filesystem” in L4Re. One technique for accessing files involves mapping the entire file into memory using a “mmap” function. This could be supported by requesting a dataspace of a suitable size, but only choosing to populate a region of it initially.

The file-accessing program would attempt to access the memory associated with the file, and upon straying outside the populated region, some kind of “fault” would occur. A filesystem server would have the job of handling this fault, fetching more data, allocating more memory pages, mapping them into the file’s memory area, and disposing of unwanted pages, potentially writing modified pages to the appropriate parts of the file.

In effect, the filesystem server would act as a pager, as far as I can tell, and I believe it to be the case that Moe – the root task – acts in such a way to provide the “rom” files from the deployed payload modules. Currently, I don’t find it particularly obvious from the documentation how I might implement a pager, and I imagine that if I choose to support such things, I will end up having to trawl the code for hints on how it might be done.

Client Library Functions

To present a relatively convenient interface to programs wanting to use files, some client library functions need to be provided. The intention with these is to support the traditional C library paradigms and for these functions to behave like those that C programmers are generally familiar with. This means performing interprocess communications using the “open”, “read”, “write”, “close” and other messages when necessary, hiding the act of sending such messages from the library user.

The details of such a client library are probably best left to another article. With some kind of mechanism in place for accessing files, it becomes a matter of experimentation to see what the demands of the different operations are, and how they may be tuned to reduce the need for interactions with server objects, hopefully allowing file-accessing programs to operate efficiently.

The next article on this topic is likely to consider the integration of libext2fs with this effort, along with the library functionality required to exercise and test it. And it will hopefully be able to report some real experiences of accessing ext2-resident files in relatively understandable programs.

Posted in English, Fiasco.OC, filesystems, Free Software, L4, operating systems | Comments Off on Filesystem Abstractions for L4Re

Using ext2 Filesystems with L4Re

Tuesday, February 5th, 2019

Previously, I described my initial investigations into libext2fs and the development of programs to access and populate ext2/3/4 filesystems. With a program written and now successfully using libext2fs in my normal GNU/Linux environment, the next step appeared to be the task of getting this library to work within the L4Re system. The following steps were envisaged:

Figuring out the code that would be needed, this hopefully being supportable within L4Re.
Introducing the software as a package within L4Re.
Discovering the configuration required to build the code for L4Re.
Actually generating a library file.
Testing the library with a program.

This process is not properly completed in that I do not yet have a good way of integrating with the L4Re configuration and using its details to configure the libext2fs code. I felt somewhat lazy with regard to reconciling the use of autotools with the rather different approach taken to build L4Re, which is somewhat reminiscent of things like Buildroot and OpenWrt in certain respects.

So, instead, I built the Debian package from source in my normal environment, grabbed the config.h file that was produced, and proceeded to use it with a vastly simplified Makefile arrangement, also in my normal environment, until I was comfortable with building the library. Indeed, this exercise of simplified building also let me consider which portions of the libext2fs distribution would really be needed for my purposes. I did not really fancy having to struggle to build files that would ultimately be superfluous.

Still, as I noted, this work isn’t finished. However, it is useful to document what I have done so far so that I can subsequently describe other, more definitive, work.

Making a Package

With a library that seemed to work with the archiving program, written to populate filesystems for eventual deployment, I then set about formulating this simplified library distribution as a package within L4Re. This involves a few things:

Structuring the files so that the build system may process them.
Persuading the build system to install things in places for other packages to find.
Formulating the appropriate definitions to build the source files (and thus producing the right compiler and linker invocations).

Here are some notes about the results.

The Package Structure

Currently, I have the following arrangement inside the pkg/libext2fs directory:

include
include/libblkid
include/libe2p
include/libet
include/libext2fs
include/libsupport
include/libuuid
lib
lib/libblkid
lib/libe2p
lib/libet
lib/libext2fs
lib/libsupport
lib/libuuid

To follow L4Re conventions, public header files have been moved into the include hierarchy. This breaks assumptions in the code, with header files being referenced without a prefix (like “ext2fs”, “et”, “e2p”, and so on) in some places, but being referenced with such a prefix in others. The original build system for the code gets away with this by using the “ext2fs” and other prefixes as the directory names containing the code for the different libraries. It then indicates the parent “lib” directory of these directories as the place to start looking for headers.

But I thought it worthwhile to try and map out the header usage and distinguish between public and private headers. At the very least, it helps me to establish the relationships between the different components involved. And I may end up splitting the different components into their own packages, requiring some formalisation of their interactions.

Meanwhile, I defined a Control file to indicate what the package provides:

provides: libblkid libe2p libet libext2fs libsupport libuuid

This appears to be used in dependency resolution, causing the package to be built if another package requires one of the named entities in its own Control file.

Header File Locations

In each include subdirectory (such as include/libext2fs) is a Makefile indicating a couple of things, the following being used for libext2fs:

PKGNAME = libext2fs
CONTRIB_HEADERS = 1

The effect of this is to install the headers into a include/contrib/libext2fs directory in the build output.

In the corresponding lib subdirectory (which is lib/libext2fs), the following seems to be needed:

CONTRIB_INCDIR = libext2fs

Hopefully, with this, other packages can depend on libext2fs and have the headers made available to it by an include statement like this:

#include <ext2fs/ext2fs.h>

(The ext2fs prefix is provided by a directory inside include/libext2fs.)

Otherwise, headers may end up being put in a special “l4” hierarchy, and then code would need changing to look something like this:

#include <l4/ext2fs/ext2fs.h>

So, avoiding this and having the original naming seems to be the benefit of the “contrib” settings, as far as I can tell.

Defining Build Files

The Makefile in each specific lib subdirectory employs the usual L4Re build system definitions:

TARGET          = libext2fs.a libext2fs.so
PC_FILENAME     = libext2fs

The latter of these is used to identify the build products so that the appropriate compiler and linker options can be retrieved by the build system when this library is required by another. Here, PC is short for “package config” but the notion of “package” is different from that otherwise used in this article: it just refers to the specific library being built in this case.

An important aspect related to “package config” involves the requirements or dependencies of this library. These are specified as follows for libext2fs:

REQUIRES_LIBS   = libet libe2p

We saw these things in the Control file. By indicating these other libraries, the compiler and linker options to find and use these other libraries will be brought in when something else requires libext2fs. This should help to prevent build failures caused by missing headers or libraries, and it should also permit more concise declarations of requirements by allowing those declarations to omit libet and libe2p in this case.

Meanwhile, the actual source files are listed using a SRC_C definition, and the PRIVATE_INCDIR definition lists the different paths to be used to search for header files within this package. Moving the header files around complicates this latter definition substantially.

There are other complications with libext2fs, notably the building of a tool that generates a file to be used when building the library itself. I will try and return to this matter at some point and figure out a way of doing this within the build system. Such generation of binaries for use in build processes can be problematic, particularly if there is some kind of assumption that the build system is the same as the target system, but such assumptions are probably not being made here.

Building the Library

Fortunately, the build system mostly takes care of everything else, and a command like this should see the package being built and libraries produced:

make O=mybuild S=pkg/libext2fs

The “S” option is a real time saver, and I wish I had made more use of it before. Use of the “V” option can be helpful in debugging command options, since the normal output is abridged:

make O=mybuild S=pkg/libext2fs V=1

I will admit that since certain header files are not provided by L4Re, a degree of editing of the config.h file was required. Things like HAVE_LINUX_FD_H, indicating the availability of Linux-specific headers, needed to be removed.

Testing the Library

An appropriate program for testing the library is really not much different from one used in a GNU/Linux environment. Indeed, I just took some code from my existing program that lists a directory inside a filesystem image. Since L4Re should provide enough of a POSIX-like environment to support such unambitious programs, practically no changes were needed and no special header files were included.

A suitable Makefile is needed, of course, but the examples package in L4Re provides plenty of guidance. The most important part is this, however:

REQUIRES_LIBS   = libext2fs

A Control file requiring libext2fs is actually not necessary for an example in the examples hierarchy, it would seem, but such a file would otherwise be advisible. The above library requirements pull in the necessary compiler and linker flags from the “package config” universe. (It also means that the libext2fs headers are augmented by the libe2p and libet headers, as defined in the required libraries for libext2fs itself.)

As always, deploying requires a suitable configuration description and a list of modules to be deployed. The former looks like this:

local L4 = require("L4");

local l = L4.default_loader;

l:startv({
    log = { "ext2fstest", "g" },
  },
  "rom/ex_ext2fstest", "rom/ext2fstest.fs", "/");

The interesting part is right at the end: a program called ex_ext2fstest is run with two arguments: the name of a file containing a filesystem image, and the directory inside that image that we want the program to show us. Here, we will be using the built-in “rom” filesystem in L4Re to serve up the data that we will be decoding with libext2fs in the program. In effect, we use one filesystem to bootstrap access to another!

Since the “rom” filesystem is merely a way of exposing modules as files, the filesystem image therefore needs to be made available as a module in the module list provided in the conf/modules.list file, the appropriate section starting off like this:

entry ext2fstest
roottask moe rom/ext2fstest.cfg
module ext2fstest.cfg
module ext2fstest.fs
module l4re
module ned
module ex_ext2fstest
# plus lots of library modules

All these experiments are being conducted with L4Re running on the UX configuration of Fiasco.OC, meaning that the system runs on top of GNU/Linux: a sort of “user mode L4”. Running the set of modules for the above test is a matter of running something like this:

make O=mybuild ux E=ext2fstest

This produces a lot of output and then some “logged” output for the test program:

ext2fste| Opened rom/ext2fstest.fs.
ext2fste| /
ext2fste| drwxr-xr-x-       0     0        1024 .
ext2fste| drwxr-xr-x-       0     0        1024 ..
ext2fste| drwx-------       0     0       12288 lost+found
ext2fste| -rw-r--r---    1000  1000       11449 e2access.c
ext2fste| -rw-r--r---    1000  1000        1768 file.c
ext2fste| -rw-r--r---    1000  1000        1221 format.c
ext2fste| -rw-r--r---    1000  1000        6504 image.c
ext2fste| -rw-r--r---    1000  1000        1510 path.c

It really isn’t much to look at, but this indicates that we have managed to access an ext2 filesystem within L4Re using a program that calls the libext2fs library functions. If nothing else, the possibility of porting a library to L4Re and using it has been demonstrated.

But we want to do more than that, of course. The next step is to provide access to an ext2 filesystem via a general interface that hides the specific nature of the filesystem, one that separates the work into a different program from those wanting to access files. To do so involves integrating this effort into my existing filesystem framework, then attempting to re-use a generic file-accessing program to obtain its data from ext2-resident files. Such activities will probably form the basis of the next article on this topic.

Posted in English, Fiasco.OC, filesystems, Free Software, L4, libext2fs, operating systems | Comments Off on Using ext2 Filesystems with L4Re

Filesystem Familiarisation

Tuesday, January 29th, 2019

I previously noted that accessing filesystems would be a component in my work with microkernel-based systems, and towards the end of last year I began an exercise in developing a simple “toy” filesystem that could hold file-like entities. Combining this with some L4Re-based components that implement seemingly reasonable mechanisms for providing access to files, I was able to write simple test programs that open and access these files.

The starting point for all this was the observation that a normal system file – that is, something stored in the filesystem in my GNU/Linux environment – can be treated like an archive containing multiple files and therefore be regarded as providing a filesystem itself. Such a file can then be embedded in a payload providing a L4Re system by specifying it as a “module” in conf/modules.list for a particular payload entry:

module image_root.fs

Since L4Re provides a rudimentary “rom” filesystem that exposes the modules embedded in the payload, I could open this “toy” filesystem module as a file within L4Re using the normal file access functions.

fp = fopen("rom/image_root.fs", "r");

And with that, I could then use my own functions to access the files stored within. Some additional effort went into exposing file access via interprocess communication, which forms the basis of those mechanisms mentioned above, those mechanisms being needed if such filesystems are to be generally usable in the broader environment rather than by just a single program.

Preparing Filesystems

The first step in any such work is surely to devise how a filesystem is to be represented. Then, code must be written to access the filesystem, firstly to write files and directories to it, and then to be able to perform the necessary task of reading that file and directory information back out. At some point, an actual filesystem image needs to be prepared, and here it helps a lot if a convenient tool can be developed to speed up testing and further development.

I won’t dwell on the “toy” representation I used, mostly because it was merely chosen to let me explore the mechanisms and interfaces to be provided as L4Re components. The intention was always to switch to a “real world” filesystem and to use that instead. But in order to avoid being overwhelmed with learning about existing filesystems alongside learning about L4Re and developing file access mechanisms, I chose some very simple representations that I thought might resemble “real world” filesystems sufficiently enough to make the exercise realistic.

With the basic proof of concept somewhat validated, my attentions have now turned to “real world” filesystems, and here some interesting observations can be made about tools and libraries. If you were to ask someone about how they might prepare a filesystem, particularly a GNU/Linux user, it would be unsurprising to me if they suggested preparing a file…

dd if=/dev/zero of=image_root.fs bs=1024 count=1 seek=$SIZE_IN_KB

…then a filesystem in the file…

/sbin/mkfs.ext2 image_root.fs

…and then mounting it as follows:

sudo mount image_root.fs $MOUNTPOINT

Here, an ext2 filesystem is prepared in a normal system file, and then the operating system is asked to mount the filesystem and to expose it via a mountpoint, this being a directory in the general hierarchy of files and filesystems. But this last step requires special privileges and for the kernel to get involved, and yet all we are doing is accessing a file with the data inside it stored in a particular way. So why is there not a more straightforward, unprivileged way of writing data to that file in the required format?

Indeed, other projects of mine have needed to initialise filesystems, and such mounting operations have been a necessary aspect of those, given the apparent shortage of other methods. It really seemed that filesystems and kernel mechanisms were bound to each other, requiring us to always get the kernel involved. But it turns out that there are other solutions.

A History Lesson

I am reminded of the mtools suite of programs for accessing floppy disks. Once upon a time, when I was in my first year of university studies, practically all of our class’s programming was performed on a collection of DECstations. Although networked, each of these also provided a floppy drive capable of supporting 2.88MB disks: an uncommon sight, for me at least, with the availability of media and compatibility concerns dictating the use of 720KB and 1.44MB disks instead.

Presumably, within the Ultrix environment we were using, normal users were granted access to the floppy drive when logged in. With a disk inserted, mtools could then be used to access the disk as one big file, interpreting the contents and presenting the user with a view onto files and directories. Of course, mtools exposes a DOS-like interface to the disk, with DOS-like commands providing DOS-like output, and it does not attempt to integrate the contents of a disk within the general Unix filesystem hierarchy.

Indeed, the mechanisms of integrating such foreign data into the general filesystem hierarchy are denied to mere programs, this being a motivation for pursuing alternative operating system architectures like GNU Hurd which support such integration. But the point here is that filesystems – in this example, DOS-based filesystems on floppy disks – can readily be interpreted with the appropriate tools and without “operator” privileges.

Decoding Filesystem Data

Since filesystems are really just data structures encoded in storage, there should really be no magic involved in decoding and accessing them. After all, the code in the Linux kernel and in other operating system kernels has to do just that, and these things are just programs that happen to run under certain special conditions. So it would make sense if some of the knowledge encoded in these kernels had been extracted and made available as library code for other purposes. After all, it might come in useful elsewhere.

Fortunately, it is likely that such library code is already installed on your system, at least if you are using the ext2 family of filesystems. A search for some common utilities can be informative in this respect. Here is a query being issued for the appropriate filesystem checking utility on a Debian system:

$ dpkg -S e2fsck
e2fsprogs: /usr/share/man/man5/e2fsck.conf.5.gz
e2fsprogs: /sbin/e2fsck
e2fsprogs: /usr/share/man/man8/e2fsck.8.gz

And for the filesystem initialisation utility mentioned above:

$ dpkg -S mkfs.ext2
e2fsprogs: /sbin/mkfs.ext2
e2fsprogs: /usr/share/man/man8/mkfs.ext2.8.gz

The e2fsprogs package itself depends on a package called libext2fs2 – or e2fslibs on earlier distribution versions – and ultimately one discovers that these tools and their libraries are provided by a software distribution, e2fsprogs, whose aim is to provide programs and libraries for general access to the ext2/3/4 filesystem format. So it turns out to be possible and indeed feasible to write programs accessing filesystems without needing to make use of code residing in some kernel or other.

Tooling Up

Had I bothered to investigate further, I might have discovered another useful package. Running one or both of the following commands on a Debian system lets us see which other packages make use of the library functionality of e2fsprogs:

apt-cache rdepends e2fslibs
apt-cache rdepends libext2fs2

Amongst those listed is e2tools which offers a suite of commands resembling those provided by mtools, albeit with a Unix flavour instead of a DOS flavour. Investigating this, I discovered that these tools inherit somewhat from the utilities provided by e2fsprogs, particularly the debugfs utility.

However, investigating e2fsprogs by myself gave me a chance to become familiar with the details of libext2fs and how the different utilities managed to use it. Since it is not always obvious to me how the library should be used, and I find myself missing some good documentation for it, the more program code I can find to demonstrate its use, the better.

For my purposes, accessing individual files and directories is not particularly interesting: I really just want to treat an ext2 filesystem like an archive when preparing my L4Re payload; it is only within L4Re that I actually want to access individual things. Outside L4Re, having an equivalent to the tar command, but with the output being a filesystem image instead of a tar file, would be most useful for me. For example:

e2archive --create image_root.fs $ROOTFS

Currently, this can be made to populate a filesystem for eventual deployment, although the breadth of support for the filesystem features is rather limited. It is possible that I might adopt e2tools as the basis of this archiving program, given that it is merely a shell script that calls another program. Then again, it might be useful to gain direct experience with libext2fs for my other activities.

Future Directions

And so, in the GNU/Linux environment, the creation of such archives has been the focus of my experiments. Meanwhile, I need to develop library functions to support filesystem operations within L4Re, which means writing code to support things like file descriptor abstractions and the appropriate functions for accessing and manipulating files and directories. The basics of some of this is already done for the “toy” filesystem, but it will be a matter of figuring out which libext2fs functions and abstractions need to be used to achieve the same thing for ext2 and its derivatives.

Hopefully, once I can demonstrate file access via the same interprocess communications mechanisms, I can then make a start in replacing the existing conventional file access functions with versions that use my mechanisms instead of those provided in L4Re. This will most likely involve work on the C library support in L4Re, which is a daunting prospect, but some familiarity with that is probably beneficial if a more ambitious project to replace the C library is to be undertaken.

But if I can just manage to get the dynamic linker to be able to read shared libraries from an ext2 filesystem, then a rather satisfying milestone will have been reached. And this will then motivate work to support storage devices on various hardware platforms of interest, permitting the hosting of filesystems and giving those systems some potential as L4Re-based general-purpose computing devices, too.

Posted in English, filesystems, L4, libext2fs, operating systems | Comments Off on Filesystem Familiarisation

Some Ideas for 2019

Wednesday, January 23rd, 2019

Well, after my last article moaning about having wishes and goals while ignoring the preconditions for, and contributing factors in, the realisation of such wishes and goals, I thought I might as well be constructive and post some ideas I could imagine working on this year. It would be a bonus to get paid to work on such things, but I don’t hold out too much hope in that regard.

In a way, this is to make up for not writing an article summarising what I managed to look at in 2018. But then again, it can be a bit wearing to have to read through people’s catalogues of work even if I do try and make my own more approachable and not just list tons of work items, which is what one tends to see on a monthly basis in other channels.

In any case, 2018 saw a fair amount of personal focus on the L4Re ecosystem, as one can tell from looking at my article history. Having dabbled with L4Re and Fiasco.OC a bit in 2017 with the MIPS Creator CI20, I finally confronted certain aspects of the software and got it working on various devices, which had been something of an ambition for at least a couple of years. I also got back into looking at PIC32 hardware and software experiments, tidying up and building on earlier work, and I keep nudging along my Python-like language and toolchain, Lichen.

Anyway, here are a few ideas I have been having for supporting a general strategy of building flexible, sustainable and secure computing environments that respect the end-user. Such respect not being limited to software freedom, but also extending to things like privacy, affordability and longevity that are often disregarded in the narrow focus on only one set of end-user rights.

Building a General-Purpose System with L4Re

Apart from writing unfinished articles about supporting hardware devices on the Ben NanoNote and Letux 400, I also spent some time last year considering the mechanisms supporting filesystems in L4Re. For an outsider like myself, the picture isn’t particularly clear, but the mechanisms don’t really seem particularly well documented, and I am not convinced that the filesystem support is what people might expect from a microkernel-based system.

Like L4Re’s device support, the way filesystems are made available to tasks appears to use libraries extensively, whereas I would expect more use of individual programs, with interprocess messages and shared memory being employed to move the data around. To evaluate my expectations, I have been writing programs that operate in such a way, employing a “toy” filesystem to test the viability of such an approach. The plan is to make use of libext2fs to expose ext2/3/4 filesystems to L4Re components, then to try and replace the existing file access mechanisms with ones that access these file-serving components.

It is unfortunate that systems like these no longer seem to get much attention from the wider Free Software community. There was once a project to port GNU Hurd to L4-family microkernels, but with the state of the art having seemingly been regarded as insufficient or inappropriate, the focus drifted off onto other things, and now there doesn’t seem to be much appetite to resume such work. Even the existing Hurd implementation doesn’t get that much interest these days, either. And yet there are plenty of technical, social and practical reasons for not just settling for using the Linux kernel and systems based on it for every last application and device.

Extending Hardware Support within L4Re

It is all very well developing filesystem support, but there also has to be support for the things that provide the storage on which those filesystems reside. This is something I didn’t bother to look at when getting L4Re working on various devices because the priority was to have something to show, meaning that the display had to work, along with testing and demonstrating other well-understood peripherals, with the keyboard or keypad being something that could be supported with relative ease and then used to demonstrate other aspects of the solution.

It seems perverse that one must implement support for SD or microSD card storage all over again when the software being run is already being loaded from such storage, but this is rather like the way that “live CD” versions of GNU/Linux would load an environment directly from a CD, yet an installed version of such an environment might not have the capability to access the CD drive. Still, this is an unavoidable step along the path to make a practical system.

It might also be nice to make the CI20 support a bit better. Although a device notionally supported by L4Re, various missing pieces of hardware support needed to be added, and the HDMI output capability remains unavailable. Here, the mystery hardware left undocumented by the datasheet happens to be used in other chipsets and has been supported in the Linux kernel for many of them for quite some time. Hopefully, the exercise will not be too frustrating.

Another device that might be a good candidate for L4Re is the Efika MX Smartbook. Although having a modest specification by today’s bloated-Web and pointless-eye-candy standards, it has a nice keyboard (with a more sensible layout than the irritating HP Elitebook, as I recently discovered) and is several times more powerful than the Letux 400. My brother, David, has already investigated certain aspects of the hardware, and it might make the basis of a nice portable system. And since support in Linux and other more commonly-used technologies has been left to rot, why not look into developing a more lasting alternative?

Reviving Mail-Based Communication

It is tiresome to see the continuing neglect of the health of e-mail, despite it being used as the bedrock of the Internet’s activities, while proprietary and predatory social media platforms enjoy undeserved attention and promotion in mass media and in society at large. Governmental and corporate negligence mean that the average person is subjected to an array of undesirable, disturbing and generally unwanted communications from chancers and criminals through their e-mail accounts which, if it had ever happened to the same degree with postal mail, would have seen people routinely arrested and convicted for their activities.

It is tempting to think that “they” do not want “us” to have a more reliable form of mail-based communication because that would involve things like encryption and digital signatures. Even when these things are deemed necessary, they always seem to be rolled out as part of a special service that hosts “our” encryption and signing keys for us, through which we must go to access our messages. It is, I suppose, yet another example of the abandonment of common infrastructure: that when something better is needed, effort and attention is siphoned off from the “commons” and ploughed into something separate that might make someone a bit of money.

There are certainly challenges involved in making e-mail better, with any discussion of managing identities, vouching for and recognising correspondents, and the management of keys most likely to lead to dispute about the “best” way of doing things. But in the end, we probably find ourselves pursuing perfect solutions that do everything whilst proprietary competitors just focus on doing a handful of things effectively. So I envisage turning this around and evaluating whether a more limited form of mail-based communication can be done in the way that most people would need.

I did look fairly recently at a selection of different projects seeking to advise and support people on providing their own e-mail infrastructure. That is perhaps worth an article in its own right. And on the subject of mail-based communication, I hope to look a bit more at imip-agent again after neglecting it for so long.

Sustaining a Python Alternative

One motivation for developing my Python-like language and toolchain, Lichen, was to explore ways in which Python might have been developed to better suit my own needs and preferences. Although I still use Python extensively, I remain mindful of the need to write conservative, uncomplicated code without the kind of needless tricks and flourishes that Python’s expanding feature set can tempt the developer with, and thus I almost always consider the possibility of being able to use the Lichen toolchain with my projects one day.

Lichen may still be a proof of concept, but there has been work done on gradually pushing it towards being genuinely usable. I spent some time considering the way floating point numbers might be represented, and this raised some interesting issues around how they might be stored within instances. Like the tuple performance optimisations, I hope to introduce floating point support into the established feature set of Lichen and hopefully offer decent-enough performance, with the latter aspect being yet another path of investigation this year.

Documenting and Publishing

A continuing worry I have is whether I should still be running MoinMoin sites or even sites derived from MoinMoin “export dumps” for published information that is merely static. I have spent some time developing a parsing and formatting tool to generate static content from Moin content, thus avoiding running Moin altogether and also avoiding having to run a script acting as a URL-preserving front-end to exported Moin content (which is unfortunately required because of how the “export dump” seems to work).

Currently, this tool supports HTML and Moin output, the latter to test the parsing activity, with Graphviz content rendered as inline SVG or in other supported formats (although inline SVG is really what you want). Some macros are supported, but I need to add support for others, like the search macros which are useful for generating page listings. Themes are also supported, but I need to make sure that things like tables of contents – already supported with a macro – can be styled appropriately.

Already, I can generate the bulk of my existing project documentation, and the aim here is to be able to focus on improving that documentation, particularly for things like Lichen that really need explanations to be written before I need to start reviewing the code from scratch as if I were a total newcomer to the work. I have also considered using this tool as the basis for a decentralised wiki solution, but that can probably wait for a while given how many other things I have said I want to do!

Anything More?

There are probably other things that are worth looking at, but these are perhaps the ones I feel are most pressing. It could be said that pursuing all these at once would spread me and my efforts very thin, but I tend to cycle through projects periodically and hope that I can catch up with what I was previously doing, hence the mention above of documenting my work.

I wonder how much other people think about the year ahead, whether it is a productive and ultimately rewarding exercise to state aspirations and goals in this kind of way. New Year’s resolutions are a familiar practice, of course, but here I make no promises!

Nevertheless, a belated Happy New Year to anyone still reading! I hope we can all pursue our objectives enthusiastically over the year ahead and make a real and positive difference to computing, its users and to our societies.

Posted in Ben NanoNote, email, English, Fiasco.OC, Free Software, imip-agent, L4, Letux 400, Lichen, mail, MoinMoin, NanoNote, operating systems | Comments Off on Some Ideas for 2019

Shared-Mode Executables in L4Re for MIPS-Based Devices

Sunday, July 8th, 2018

I have been meaning to write about my device driver experiments with L4Re, following on from my porting exercises, but that exercise took me along various routes and I haven’t yet got back to documenting all of them. Meanwhile, one thing that did start to bother me was how much space the software was taking up when compiled, linked and ready to deploy.

Since each of my device drivers is a separate program, and since each one may be linked to various libraries, they each started to contribute substantially to the size of the resulting file – the payload – needing to be transferred to the device. At one point, I had to resize the boot partition on the memory card used by the Letux 400 notebook computer to make the payload fit in the available space.

The work done to port L4Re to the MIPS Creator CI20 had already laid the foundations for functioning payloads, and once the final touches were put in place to support the peculiarities of the Ingenic JZ4780 system-on-a-chip, it was possible to run both the conventional “hello” example which is statically linked to its libraries, as well as a “shared-hello” example which is dynamically linked to its libraries. The latter configuration of the program results in a smaller executable program and thus a smaller payload.

So it seemed clear that I might be able to run my own programs on the Letux 400 or Ben NanoNote with similar reductions in payload size. Unfortunately, nothing ever seems to be as straightforward as it ought to be.

Exceptional Obstructions and Observations

Initially, I set about trying one of my own graphical examples with the MODE variable set to “shared” in its Makefile. This, upon powering up, merely indicated that it had not managed to start up properly. Instead of a blank screen, the viewports set up by the graphical multiplexer, Mag, were still active and showing their usual blankness. But these regions did not then change in any way when I pressed keys on the keyboard (which is functionality that I will hopefully get round to describing in another article).

I sought some general advice from the l4-hackers mailing list, but quickly realised that to make any real progress, I would need a decent way of accessing the debugging output produced by the dynamic linker. This took me on a diversion that led to my debugging capabilities being strengthened with the availability of a textual output console on the screen of my devices. I still don’t like the idea of performing hardware modifications to get access to the serial console, so this is a useful and welcome alternative.

Having switched out the “hello” program with the “shared-hello” program in the system configuration and module list demonstrating the framebuffer terminal, I deployed the payload and powered up, but I did not get the satisfying output of the program operating normally. Instead, the framebuffer terminal appeared and rewarded me with the following message:

L4Re: rom/ex_hello_shared: Unhandled exception: PC=0x800000 PFA=8d7a LdrFlgs=0

This isn’t really the kind of thing you want to see. Having not had to debug L4Re or Fiasco.OC in any serious fashion for a couple of months, I was out of practice in considering the next step, but fortunately some encouragement arrived in a private e-mail from Jean Wolter. This brought the suggestion of triggering the kernel debugger, but since this requires serial console access, it wasn’t a viable approach. But another idea that I could use involved writing out a bit more information in the routine that was producing this output.

The message in question originates in the pkg/l4re-core/l4re_kernel/server/src/region.cc file, within the Region_map::op_exception method. The details it produces are rather minimal and generic: the program counter (PC) tells us where the exception occurred; the loader flags (LdrFlags) presumably tell us about the activity of the library loader; the mysterious “PFA” is supposedly the page fault address but it actually seemed to be the stack pointer address on these MIPS-based systems.

On their own, these details are not particularly informative, but I suppose that more useful information could quickly become fairly specific to a particular architecture. Jean suggested looking at the structure describing the exception state, l4_exc_regs_t (defined with MIPS-specific members in pkg/l4re-core/l4sys/include/ARCH-mips/utcb.h), to see what else I might dig up. This I did, generating the following:

pc=0x800000
gp=0x82dd30
sp=0x8d7a
ra=0x802f6c
cause=0x1000002c

A few things interested me, thus motivating my choice of registers to dump. The global pointer (gp) register tells us about symbols in the problematic code, and I felt that having once made changes to the L4Re sources – way back in the era of getting the CI20 to run GCC-generated code – so that another register (t9) would be initialised correctly, this so that the gp register would be set up correctly within programs, it was entirely possible that I had rather too enthusiastically changed something that was now causing a problem.

The stack pointer (sp) is useful to check, just to see if it located in a sensible region of memory, and here I discovered that this seemed to be the same as the “PFA” number. Oddly, the “PFA” seems to occupy the same place in the exception structure as any “bad virtual address” featuring in an address exception, and so I started to suspect that maybe the stack pointer was referencing the wrong part of memory. But this was partially ruled out by examining the value of the stack pointer in the “hello” example, which appeared to reference broadly the same part of memory. And, of course, the “hello” example works just fine.

In fact, the cause register indicated another kind of exception entirely, and it was one I was not really expecting: a “coprocessor unusable” exception indicating that coprocessor 1, typically a floating point arithmetic unit, was being illegally requested by an instruction. Here is how I interpreted that register’s value:

hex value   binary value
1000002c == 00010000000000000000000000101100
              --                     -----
              CE                     ExcCode

=> CE == 1; ExcCode == 11 (coprocessor unusable)
=> coprocessor 1 unusable

Now, as I may have mentioned before, the hardware involved in this exercise does not support floating point instructions itself, and this is why I have configured compilers to use “soft-float” (software-based floating point arithmetic) support. It meant that I had to find places that might have wanted to use floating point instructions and eliminate those instructions somehow. Fortunately, only code generated by the compiler was likely to contain such instructions. But now I wondered if there weren’t some instructions of this nature lurking in places I hadn’t checked.

I had also thought to check the return address (ra) register. This tells us where the processor will jump to when it has finished executing the current routine, and since this is usually a matter of “returning” somewhere, it tells us something about the code that was being executed before the problematic routine was called. I figured that the work being done before the exception was probably going to be more important than the exception itself.

Floating Point Magic

Another debugging suggestion that now became unavoidable was to inspect the erroneous instruction. I noted above that this instruction was causing the processor to signal an illegal attempt to use an unusable – actually completely unavailable – coprocessor. Writing a numeric representation of the instruction to the display provided me with the following hexadecimal (base 16) value:

464c457f

This can be interpreted as follows in binary, with groups of bits defined for interpretation according to the MIPS instruction set architecture, and with tentative interpretations of these groups provided beneath:

010001 10010  01100 01000 10101 111111
COP1   rs/fmt rt/ft rd/fs       C.ABS.NGT

The first group of bits is the opcode field which is interpreted as a coprocessor 1 (COP1) opcode. Should we then wish to consider what the other groups mean, we might then examine the final group which could indicate a comparison instruction. However, this becomes rather hypothetical since the processor will most likely interpret the opcode field and then decide that it cannot handle the instruction.

So, I started to look for places where the instruction might have been written, but no obvious locations were forthcoming. One peculiar aspect of all this is that the location of the instruction is at a rather “clean” location – 0x800000 – and some investigations indicated that this is where the library containing the problematic code gets loaded. I actually don’t remember precisely how I figured this out, but I think it was as follows.

I had looked at linker scripts that might give some details of the location of program objects, and one of them (pkg/l4re-core/ldscripts/ARCH-mips/main_dyn.ld) seemed to be related. It gave an address for the code of 0x400000. This made me think that some misconfiguration or erroneous operation was putting the observed code somewhere it shouldn’t be. But changing this address in the linker script just gave another exception at 0x400000, meaning that I had disrupted something that was intentional and probably working fine.

Meanwhile, emitting the t9 register’s value from the exception state yielded 0x800000, indicating that the calling routine had most likely jumped straight to that address, not to another address with execution having then proceeded normally until reaching the exception location. I decided to look at the instructions around the return address, these most likely being the ones that had set up the call to the exception location. Writing these locations out gave me some idea about the instructions involved. Below, I provide the stored values and their interpretations as machine instructions:

8f998250 # lw $t9, -32176($gp)
24a55fa8 # addiu $a1, $a1, 0x5fa8
0320f809 # jalr $t9
24844ee4 # addiu $a0, $a0, 0x4ee4
8fbc0010 # lw $gp, 16($sp)

One objective of doing this, apart from confirming that a jump instruction (jalr) was involved, with the t9 register being employed as is the convention with MIPS code, was to use the fragment to identify the library that was causing the error. A brute-force approach was employed here, generating “object dumps” from the library files and writing them out as files in a new directory:

mkdir tmpdir
for FILENAME in mybuild/lib/mips_32/l4f/* ; do
    mipsel-linux-gnu-objdump -d "$FILENAME" > tmpdir/`basename "$FILENAME"`
done

The textual dump files were then searched for the instruction values using grep, narrowing down the places where these instructions were found in consecutive locations. This yielded the following code, found in the libld-l4.so library:

    2f5c:       8f998250        lw      t9,-32176(gp)
    2f60:       24a55fa8        addiu   a1,a1,24488
    2f64:       0320f809        jalr    t9
    2f68:       24844ee4        addiu   a0,a0,20196
    2f6c:       8fbc0010        lw      gp,16(sp)

The integer operands for the addiu instructions are the same, of course, just being shown as decimal rather than hexadecimal values. Now, we previously saw that the return address (ra) register had the value 0x802f6c. When a MIPS processor executes a jump instruction, it will also fetch the following instruction and execute it, this being a consequence of the way the processor architecture is designed.

So, the instruction after the jump, residing in what is known as the “branch delay slot” is not the instruction that will be visited upon returning from the called routine. Instead, it is the instruction after that. Here, we see that the return address from the jump at location 0x2f64 would be two locations later at 0x2f6c. This provides a kind of realisation that the program object – the libld-l4.so library – is positioned in memory at 0x800000: 0x2f6c added to 0x800000 gives the value of ra, 0x802f6c.

And this means that the location of the problematic instruction – the cause of our exception – is the first location within this object. Anyone with any experience of this kind of software will have realised by now that this doesn’t sound like a healthy situation: the first location within a library is not actually going to be code because these kinds of objects are packaged up in a way that permits their manipulation by other programs.

So what is the first location of a library used for? Since such objects employ the Executable and Linkable Format (ELF), we can take a look at some documentation. And we see that the first location is used to identify the kind of object, employing a “magic number” for the purpose. And that magic number would be…

464c457f

In the little-endian arrangement employed by this processor, the stored bytes are as follows:

7f
45 ('E')
4c ('L')
46 ('F')

The value was not a floating point instruction at all, but the magic number at the start of the library object! It was something of a coincidence that such a value would be interpreted as a floating point instruction, an accidentally convenient way of signalling something going badly wrong.

Missing Entries

The investigation now started to focus on how the code trying to jump to the start of the library had managed to get this incorrect address and what it was trying to do by jumping to it. I started to wonder if the global pointer (gp), whose job it is to reference the list of locations of program routines and other global data, might have been miscalculated such that attempts to load the addresses of routines would then be failing with data being fetched from the wrong places.

But looking around at code fragments where the gp register was being calculated, they seemed to look set to calculate the correct values based on assumptions about other registers. For example, from the object dump for libld-l4.so:

00002780 <_ftext>:
    2780:       3c1c0003        lui     gp,0x3
    2784:       279cb5b0        addiu   gp,gp,-19024
    2788:       0399e021        addu    gp,gp,t9

Assuming that the processor has t9 set to 0x2780 and then jumps to the value of t9, as is the convention, the following calculation is then performed:

gp = 0x30000 (since lui loads the "upper" half-word)
gp = gp - 19024 = 0x30000 - 19024 = 0x2b5b0
gp = gp + t9 = 0x2b5b0 + 0x2780 = 0x2dd30

Using the nm tool, which tells us about symbols in program objects, it was possible to check this value:

mipsel-linux-gnu-nm -n mybuild/lib/mips_32/l4f/libld-l4.so

This shows the following at the end of the output:

0002dd30 d _gp

Also appearing somewhat earlier in the output is this, telling us where the table of symbols starts (as well as the next thing in the file):

00025d40 a _GLOBAL_OFFSET_TABLE_
00025f90 g __dso_handle

Some digging around in the L4Re source code gave a kind of confirmation that the difference between _gp and _GLOBAL_OFFSET_TABLE_ was to be expected. Here is what I found in the pkg/l4re-core/uclibc/lib/contrib/uclibc/ldso/ldso/mips/elfinterp.c file:

#define OFFSET_GP_GOT 0x7ff0

If gp, when recalculated in other places, ended up getting the same value, there didn’t seem to be anything wrong with it. Some quick inspections of neighbouring calculations indicated that this wasn’t likely to be the problem. But what about the values used in conjunction with gp? Might they be having an effect? In the case of the erroneous jump, the following calculation is involved:

lw t9,-32176(gp) => load word into t9 from the location at gp - 32176
                 => ...               from 0x2dd30 - 32176
                 => ...               from 0x25f80

The calculated address, 0x25f80, is after the start of _GLOBAL_OFFSET_TABLE_ providing entries for program routines and other things, which is a good sign, but what is perhaps more troubling is how far after the start of the table such a value is. In the above output, another symbol (__dso_handle) indicates something that is located at the end of the table. Now, although its address is still greater than the one computed above, meaning that the computation does not cause us to stray off the end of the table, the computed address is suspiciously close to the end.

There was nothing else to do than to have a look at the table contents itself, and here it was rather useful to have a way of displaying a number of values on the screen. At this point, we have to note that the addresses in use in the running system are adjusted according to the start of the loaded object, so that the table is positioned at 0x25d40 in the object dump, but in the running system we would see 0x800000 + 0x25d40 and thus 0x825d40 instead.

What I saw was that the table contained entries that varied in the expected way right up until 0x825f60 (corresponding to 0x25f60 in the object dump) being only 0x30 (or 48 bytes, or 12 entries) before the end of the table, but then all remaining entries starting at 0x825f64 (corresponding to 0x25f64) yielded a value of 0x800000, apart from 0x825f90 (corresponding to 0x25f90, right at the end of the table) which yielded itself.

Since the calculated address above (0x25f80, adjusted to 0x825f80 in the running system) lies in this final region, we now know the origin of this annoying 0x800000: it comes from entries at the end of the table that do not seem to hold meaningful values. Indeed, the object dump for the library seemed to skip over this region of the table entirely, presumably because it was left uninitialised. And using the readelf tool with the –relocs option to show “relocations”, which applies to this table, it appeared that the last entries rather confirmed my observations:

00025d34  00000003 R_MIPS_REL32
00025f90  00000003 R_MIPS_REL32

Clearly, something is missing from this table. But since something has to adjust the contents of the table to add the “base address”, 0x800000, to the entries in order to provide valid addresses within the running program, what started to intrigue me was whether the code that performed this adjustment had any idea about these missing entries, and how this code might be related to the code causing the exception situation.

Routines and Responsibilities

While considering the nature of the code causing the exception, I had been using the objdump utility with the -d (disassemble) and -D (disassemble all) options. These provide details of program sections, code routines and the machine instructions themselves. But Jean pointed out that if I really wanted to find out which part of the source code was responsible for producing certain regions of the program, I might use a combination of options: -d, -l (line numbers) and -S (source code). This was almost a revelation!

However, the code responsible for the jump to the start of the library resisted such measures. A large region of code appeared to have no corresponding source, suggesting that it might be generated. Here is how it starts:

_ftext():
    2dac:       00000000        nop
    2db0:       3c1c0003        lui     gp,0x3
    2db4:       279caf80        addiu   gp,gp,-20608
    2db8:       0399e021        addu    gp,gp,t9
    2dbc:       8f84801c        lw      a0,-32740(gp)
    2dc0:       8f828018        lw      v0,-32744(gp)

There is no function defined in the source code with the name _ftext. However, _ftext is defined in the linker script (in pkg/l4re-core/ldscripts/ARCH-mips/main_rel.ld) as follows:

  .text           :
  {
    _ftext = . ;
    *(.text.unlikely .text.*_unlikely .text.unlikely.*)
    *(.text.exit .text.exit.*)
    *(.text.startup .text.startup.*)
    *(.text.hot .text.hot.*)
    *(.text .stub .text.* .gnu.linkonce.t.*)
    /* .gnu.warning sections are handled specially by elf32.em.  */
    *(.gnu.warning)
    *(.mips16.fn.*) *(.mips16.call.*)
  }

If you haven’t encountered linker scripts before, then you probably don’t want to spend too much time looking at this, linker scripts being frustratingly terse and arcane, but the essence of the above is that a bunch of code is stuffed into the .text section, with _ftext being assigned the address of the start of all this code. Now, _ftext in the linker script corresponds to a particular label in the object dump (which we saw earlier was positioned at 0x2780) whereas the _ftext function in the code occurs later (at 0x2dac, above). After the label but before the function is code whose source is found by objdump.

So I took the approach of removing things from the linker script, ultimately removing everything from the .text section apart from the assignment to _ftext. This removed the annotated regions of the code and left me with only the _ftext function. It really did appear that this was something the compiler might be responsible for. But where would I find the code responsible?

One hint that was present in the _ftext function code was the use of another identified function, __cxa_finalize. Searching the GCC sources for code that might use it led me to the libgcc sources and to code that invokes destructor functions upon program exit. This wasn’t really what I was looking for, but the file containing it (libgcc/crtstuff.c) would prove informative.

Back to the Table

Jean had indicated that there might be a difference in output between compilers, and that certain symbols might be produced by some but not by others. I investigated further by using the readelf tool with the -a option to show almost everything about the library file. Here, the focus was on the global offset table (GOT) and information about the entries. In particular, I wanted to know more about the entry providing the erroneous 0x800000 value, located at (gp – 32176). In my output I saw the following interesting thing:

 Global entries:
   Address     Access  Initial Sym.Val. Type    Ndx Name
  00025f80 -32176(gp) 00000000 00000000 FUNC    UND __register_frame_info

This seems to tell us what the program expects to find at the location in question, and it indicates that the named symbol is presumably undefined. There were some other undefined symbols, too:

_ITM_deregisterTMCloneTable
_ITM_registerTMCloneTable
__deregister_frame_info

Meanwhile, Jean was seeing symbols with other names:

__register_frame_info_base
__deregister_frame_info_base

During my perusal of the libgcc sources, I had noticed some of these symbols being tested to see if they were non-zero. For example:

  if (__register_frame_info)
    __register_frame_info (__EH_FRAME_BEGIN__, &object);

These fragments of code appear to be located in functions related to program initialisation. And it is also interesting to note that back in the library code, after the offending table entry has been accessed, there are tests against zero:

    2f34:       3c1c0003        lui     gp,0x3
    2f38:       279cadfc        addiu   gp,gp,-20996
    2f3c:       0399e021        addu    gp,gp,t9
    2f40:       27bdffe0        addiu   sp,sp,-32
    2f44:       8f828250        lw      v0,-32176(gp)
    2f48:       afbc0010        sw      gp,16(sp)
    2f4c:       afbf001c        sw      ra,28(sp)
    2f50:       10400007        beqz    v0,2f70 <_ftext+0x7f0>

Here, gp gets set up correctly, v0 is set to the value of the table entry, which we now believe refers to __register_frame_info, and the beqz instruction tests this value against zero, skipping ahead if it is zero. Does that not sound a bit like the code shown above? One might think that the libgcc code might handle an uninitialised table entry, and maybe it is intended to do so, but the table entry ends up getting adjusted to 0x800000, presumably as part of the library loading process.

I think that the most relevant function here for the adjustment of these entries is _dl_perform_mips_global_got_relocations which can be found in the pkg/l4re-core/uclibc/lib/contrib/uclibc/ldso/ldso/ldso.c file as part of the L4Re C library code. It may well have changed the entry from zero to this erroneous non-zero value, merely because the entry lies within the table and is assumed to be valid.

So, as a consequence, the libgcc code acts as if it has a genuine __register_frame_info function to call, and doing so causes the jump to the start of the library object and the exception. Maybe the code is supposed to be designed to handle missing symbols, those symbols potentially being deliberately omitted, but it doesn’t function correctly under these particular circumstances.

Symbol Restoration

However, despite identifying this unfortunate interaction between C library and libgcc, the matter of a remedy remained unaddressed. What was I to do about these missing symbols? Were they supposed to be there? Was there a way to tell libgcc not to expect them to be there at all?

In attempting to learn a bit more about the linking process, I had probably been through the different L4Re packages several times, but Jean then pointed me to a file I had seen before, perhaps before I had needed to think about these symbols at all. It contained “empty” definitions for some of the symbols but not for others. Maybe the workaround or even the solution was to just add more definitions corresponding to the symbols the program was expecting? Jean thought so.

So, I added a few things to the file (pkg/l4re-core/ldso/ldso/fixup.c):

void __deregister_frame_info(void);
void __register_frame_info(void);
void _ITM_deregisterTMCloneTable(void);
void _ITM_registerTMCloneTable(void);

void __deregister_frame_info(void) {}
void __register_frame_info(void) {}
void _ITM_deregisterTMCloneTable(void) {}
void _ITM_registerTMCloneTable(void) {}

I wasn’t confident that this would fix the problem. After all the investigation, adding a few lines of trivial code to one file seemed like too easy a way to fix what seemed like a serious problem. But I checked the object dump of the library, and suddenly things looked a bit more reasonable. Instead of referencing an uninitialised table entry, objdump was able to identify the jump target as __register_frame_info:

    2e14:       8f828040        lw      v0,-32704(gp)
    2e18:       afbc0010        sw      gp,16(sp)
    2e1c:       afbf001c        sw      ra,28(sp)
    2e20:       10400007        beqz    v0,2e40 <_ftext+0x7f0>
    2e24:       8f85801c        lw      a1,-32740(gp)
    2e28:       8f84803c        lw      a0,-32708(gp)
    2e2c:       8f998040        lw      t9,-32704(gp)
    2e30:       24a55fa8        addiu   a1,a1,24488
    2e34:       04111c39        bal     9f1c <__register_frame_info>

Of course, the code had been reorganised and so things were no longer in quite the same places, but in the above, (gp – 32704) is actually a reference to __register_frame_info, and although this value gets tested against zero as before, we can see that enough is now known about the previously-missing symbol that a branch directly to the location of the function has been incorporated, rather than a jump to the address stored in the table.

And sure enough, powering up the Letux 400 produced the framebuffer terminal showing the expected output:

Hi World! (shared)

It had been a long journey for such a modest reward, but thanks to Jean’s help and a bit of perseverance, I got there in the end.

Posted in Ben NanoNote, English, Fiasco.OC, Free Software, hardware, L4, Letux 400, MIPS, NanoNote, operating systems | Comments Off on Shared-Mode Executables in L4Re for MIPS-Based Devices

Archives

Categories