Paul Boddie's Free Software-related blog


Archive for the ‘retrocomputing’ Category

VGA Signal Generation with the PIC32

Monday, May 22nd, 2017

It all started after I had designed – and received from fabrication – a circuit board for prototyping cartridges for the Acorn Electron microcomputer. Although some prototyping had already taken place with an existing cartridge, with pins intended for ROM components being routed to drive other things, this board effectively “breaks out” all connections available to a cartridge that has been inserted into the computer’s Plus 1 expansion unit.

Acorn Electron cartridge breakout board

The Acorn Electron cartridge breakout board being used to drive an external circuit

One thing led to another, and soon my brother, David, was interfacing a microcontroller to the Electron in order to act as a peripheral being driven directly by the system’s bus signals. His approach involved having a program that would run and continuously scan the signals for read and write conditions and then interpret the address signals, sending and receiving data on the bus when appropriate.

Having acquired some PIC32 devices out of curiosity, with the idea of potentially interfacing them with the Electron, I finally took the trouble of looking at the datasheet to see whether some of the hard work done by David’s program might be handled by the peripheral hardware in the PIC32. The presence of something called “Parallel Master Port” was particularly interesting.

Operating this function in the somewhat insensitively-named “slave” mode, the device would be able to act like a memory device, with the signalling required by read and write operations mostly being dealt with by the hardware. Software running on the PIC32 would be able to read and write data through this port and be able to get notifications about new data while getting on with doing other things.

So began my journey into PIC32 experimentation, but this article isn’t about any of that, mostly because I put that particular investigation to one side after a degree of experience gave me perhaps a bit too much confidence, and I ended up being distracted by something far more glamorous: generating a video signal using the PIC32!

The Precedents’ Hall of Fame

There are plenty of people who have written up their experiments generating VGA and other video signals with microcontrollers. Here are some interesting examples:

And there are presumably many more pages on the Web with details of people sending pixel data over a cable to a display of some sort, often trying to squeeze every last cycle out of their microcontroller’s instruction processing unit. But, given an awareness of how microcontrollers should be able to take the burden off the programs running on them, employing peripheral hardware to do the grunt work of switching pins on and off at certain frequencies, maybe it would be useful to find examples of projects where such advantages of microcontrollers had been brought to bear on the problem.

In fact, I was already aware of the Maximite “single chip computer” partly through having seen the cloned version of the original being sold by Olimex – something rather resented by the developer of the Maximite for reasons largely rooted in an unfortunate misunderstanding of Free Software licensing on his part – and I was aware that this computer could generate a VGA signal. Indeed, the method used to achieve this had apparently been written up in a textbook for the PIC32 platform, albeit generating a composite video signal using one of the on-chip SPI peripherals. The Colour Maximite uses three SPI channels to generate one red, one green, and one blue channel of colour information, thus supporting eight-colour graphical output.

But I had been made aware of the Parallel Master Port (PMP) and its “master” mode, used to drive LCD panels with eight bits of colour information per pixel (or, using devices with many more pins than those I had acquired, with sixteen bits of colour information per pixel). Would it surely not be possible to generate 256-colour graphical output at the very least?

Information from people trying to use PMP for this purpose was thin on the ground. Indeed, reading again one article that mentioned an abandoned attempt to get PMP working in this way, using the peripheral to emit pixel data for display on a screen instead of a panel, I now see that it actually mentions an essential component of the solution that I finally arrived at. But the author had unfortunately moved away from that successful component in an attempt to get the data to the display at a rate regarded as satisfactory.

Direct Savings

It is one thing to have the means to output data to be sent over a cable to a display. It is another to actually send the data efficiently from the microcontroller. Having contemplated such issues in the past, it was not a surprise that the Maximite and other video-generating solutions use direct memory access (DMA) to get the hardware, as opposed to programs, to read through memory and to write its contents to a destination, which in most cases seemed to be the memory address holding output data to be emitted via a data pin using the SPI mechanism.

I had also envisaged using DMA and was still fixated on using PMP to emit the different data bits to the output circuit producing the analogue signals for the display. Indeed, Microchip promotes the PMP and DMA combination as a way of doing “low-cost controllerless graphics solutions” involving LCD panels, so I felt that there surely couldn’t be much difference between that and getting an image on my monitor via a few resistors on the breadboard.

And so, a tour of different PIC32 features began, trying to understand the DMA documentation, the PMP documentation, all the while trying to get a grasp of what the VGA signal actually looks like, the timing constraints of the various synchronisation pulses, and battle various aspects of the MIPS architecture and the PIC32 implementation of it, constantly refining my own perceptions and understanding and learning perhaps too often that there may have been things I didn’t know quite enough about before trying them out!

Using VGA to Build a Picture

Before we really start to look at a VGA signal, let us first look at how a picture is generated by the signal on a screen:

VGA Picture Structure

The structure of a display image or picture produced by a VGA signal

The most important detail at this point is the central area of the diagram, filled with horizontal lines representing the colour information that builds up a picture on the display, with the actual limits of the screen being represented here by the bold rectangle outline. But it is also important to recognise that even though there are a number of visible “display lines” within which the colour information appears, the entire “frame” sent to the display actually contains yet more lines, even though they will not be used to produce an image.

Above and below – really before and after – the visible display lines are the vertical back and front porches whose lines are blank because they do not appear on the screen or are used to provide a border at the top and bottom of the screen. Such extra lines contribute to the total frame period and to the total number of lines dividing up the total frame period.

Figuring out how many lines a display will have seems to involve messing around with something called the “generalised timing formula”, and if you have an X server like Xorg installed on your system, you may even have a tool called “gtf” that will attempt to calculate numbers of lines and pixels based on desired screen resolutions and frame rates. Alternatively, you can look up some common sets of figures on sites providing such information.

What a VGA Signal Looks Like

Some sources show diagrams attempting to describe the VGA signal, but many of these diagrams are open to interpretation (in some cases, very much so). They perhaps show the signal for horizontal (display) lines, then other signals for the entire image, but they either do not attempt to combine them, or they instead combine these details ambiguously.

For instance, should the horizontal sync (synchronisation) pulse be produced when the vertical sync pulse is active or during the “blanking” period when no pixel information is being transmitted? This could be deduced from some diagrams but only if you share their authors’ unstated assumptions and do not consider other assertions about the signal structure. Other diagrams do explicitly show the horizontal sync active during vertical sync pulses, but this contradicts statements elsewhere such as “during the vertical sync period the horizontal sync must also be held low”, for instance.

After a lot of experimentation, I found that the following signal structure was compatible with the monitor I use with my computer:

VGA Signal Structure

The basic structure of a VGA signal, or at least a signal that my monitor can recognise

There are three principal components to the signal:

  • Colour information for the pixel or line data forms the image on the display and it is transferred within display lines during what I call the visible display period in every frame
  • The horizontal sync pulse tells the display when each horizontal display line ends, or at least the frequency of the lines being sent
  • The vertical sync pulse tells the display when each frame (or picture) ends, or at least the refresh rate of the picture

The voltage levels appear to be as follows:

  • Colour information should be at 0.7V (although some people seem to think that 1V is acceptable as “the specified peak voltage for a VGA signal”)
  • Sync pulses are supposed to be at “TTL” levels, which apparently can be from 0V to 0.5V for the low state and from 2.7V to 5V for the high state

Meanwhile, the polarity of the sync pulses is also worth noting. In the above diagram, they have negative polarity, meaning that an active pulse is at the low logic level. Some people claim that “modern VGA monitors don’t care about sync polarity”, but since it isn’t clear to me what determines the polarity, and since most descriptions and demonstrations of VGA signal generation seem to use negative polarity, I chose to go with the flow. As far as I can tell, the gtf tool always outputs the same polarity details, whereas certain resources provide signal characteristics with differing polarities.

It is possible, and arguably advisable, to start out trying to generate sync pulses and just grounding the colour outputs until your monitor (or other VGA-capable display) can be persuaded that it is receiving a picture at a certain refresh rate and resolution. Such confirmation can be obtained on a modern display by seeing a blank picture without any “no signal” or “input not supported” messages and by being able to activate the on-screen menu built into the device, in which an option is likely to exist to show the picture details.

How the sync and colour signals are actually produced will be explained later on. This section was merely intended to provide some background and gather some fairly useful details into one place.

Counting Lines and Generating Vertical Sync Pulses

The horizontal and vertical sync pulses are each driven at their own frequency. However, given that there are a fixed number of lines in every frame, it becomes apparent that the frequency of vertical sync pulse occurrences is related to the frequency of horizontal sync pulses, the latter occurring once per line, of course.

With, say, 622 lines forming a frame, the vertical sync will occur once for every 622 horizontal sync pulses, or at a rate that is 1/622 of the horizontal sync frequency or “line rate”. So, if we can find a way of generating the line rate, we can not only generate horizontal sync pulses, but we can also count cycles at this frequency, and every 622 cycles we can produce a vertical sync pulse.

But how do we calculate the line rate in the first place? First, we decide what our refresh rate should be. The “classic” rate for VGA output is 60Hz. Then, we decide how many lines there are in the display including those extra non-visible lines. We multiply the refresh rate by the number of lines to get the line rate:

60Hz * 622 = 37320Hz = 37.320kHz

On a microcontroller, the obvious way to obtain periodic events is to use a timer. Given a particular frequency at which the timer is updated, a quick calculation can be performed to discover how many times a timer needs to be incremented before we need to generate an event. So, let us say that we have a clock frequency of 24MHz, and a line rate of 37.320kHz, we calculate the number of timer increments required to produce the latter from the former:

24MHz / 37.320kHz = 24000000Hz / 37320Hz = 643

So, if we set up a timer that counts up to 642 and then upon incrementing again to 643 actually starts again at zero, with the timer sending a signal when this “wraparound” occurs, we can have a mechanism providing a suitable frequency and then make things happen at that frequency. And this includes counting cycles at this particular frequency, meaning that we can increment our own counter by 1 to keep track of display lines. Every 622 display lines, we can initiate a vertical sync pulse.

One aspect of vertical sync pulses that has not yet been mentioned is their duration. Various sources suggest that they should last for only two display lines, although the “gtf” tool specifies three lines instead. Our line-counting logic therefore needs to know that it should enable the vertical sync pulse by bringing it low at a particular starting line and then disable it by bringing it high again after two whole lines.

Generating Horizontal Sync Pulses

Horizontal sync pulses take place within each display line, have a specific duration, and they must start at the same time relative to the start of each line. Some video output demonstrations seem to use lots of precisely-timed instructions to achieve such things, but we want to use the peripherals of the microcontroller as much as possible to avoid wasting CPU time. Having considered various tricks involving specially formulated data that might be transferred from memory to act as a pulse, I was looking for examples of DMA usage when I found a mention of something called the Output Compare unit on the PIC32.

What the Output Compare (OC) units do is to take a timer as input and produce an output signal dependent on the current value of the timer relative to certain parameters. In clearer terms, you can indicate a timer value at which the OC unit will cause the output to go high, and you can indicate another timer value at which the OC unit will cause the output to go low. It doesn’t take much imagination to realise that this sounds almost perfect for generating the horizontal sync pulse:

  1. We take the timer previously set up which counts up to 643 and thus divides the display line period into units of 1/643.
  2. We identify where the pulse should be brought low and present that as the parameter for taking the output low.
  3. We identify where the pulse should be brought high and present that as the parameter for taking the output high.

Upon combining the timer and the OC unit, then configuring the output pin appropriately, we end up with a low pulse occurring at the line rate, but at a suitable offset from the start of each line.

VGA Display Line Structure

The structure of each visible display line in the VGA signal

In fact, the OC unit also proves useful in actually generating the vertical sync pulses, too. Although we have a timer that can tell us when it has wrapped around, we really need a mechanism to act upon this signal promptly, at least if we are to generate a clean signal. Unfortunately, handling an interrupt will introduce a delay between the timer wrapping around and the CPU being able to do something about it, and it is not inconceivable that this delay may vary depending on what the CPU has been doing.

So, what seems to be a reasonable solution to this problem is to count the lines and upon seeing that the vertical sync pulse should be initiated at the start of the next line, we can enable another OC unit configured to act as soon as the timer value is zero. Thus, upon wraparound, the OC unit will spring into action and bring the vertical sync output low immediately. Similarly, upon realising that the next line will see the sync pulse brought high again, we can reconfigure the OC unit to do so as soon as the timer value again wraps around to zero.

Inserting the Colour Information

At this point, we can test the basic structure of the signal and see if our monitor likes it. But none of this is very interesting without being able to generate a picture, and so we need a way of getting pixel information from the microcontroller’s memory to its outputs. We previously concluded that Direct Memory Access (DMA) was the way to go in reading the pixel data from what is usually known as a framebuffer, sending it to another place for output.

As previously noted, I thought that the Parallel Master Port (PMP) might be the right peripheral to use. It provides an output register, confusingly called the PMDIN (parallel master data in) register, that lives at a particular address and whose value is exposed on output pins. On the PIC32MX270, only the least significant eight bits of this register are employed in emitting data to the outside world, and so a DMA destination having a one-byte size, located at the address of PMDIN, is chosen.

The source data is the framebuffer, of course. For various retrocomputing reasons hinted at above, I had decided to generate a picture 160 pixels in width, 256 lines in height, and with each byte providing eight bits of colour depth (specifying how many distinct colours are encoded for each pixel). This requires 40 kilobytes and can therefore reside in the 64 kilobytes of RAM provided by the PIC32MX270. It was at this point that I learned a few things about the DMA mechanisms of the PIC32 that didn’t seem completely clear from the documentation.

Now, the documentation talks of “transactions”, “cells” and “blocks”, but I don’t think it describes them as clearly as it could do. Each “transaction” is just a transfer of a four-byte word. Each “cell transfer” is a collection of transactions that the DMA mechanism performs in a kind of batch, proceeding with these as quickly as it can until it either finishes the batch or is told to stop the transfer. Each “block transfer” is a collection of cell transfers. But what really matters is that if you want to transfer a certain amount of data and not have to keep telling the DMA mechanism to keep going, you need to choose a cell size that defines this amount. (When describing this, it is hard not to use the term “block” rather than “cell”, and I do wonder why they assigned these terms in this way because it seems counter-intuitive.)

You can perhaps use the following template to phrase your intentions:

I want to transfer <cell size> bytes at a time from a total of <block size> bytes, reading data starting from <source address>, having <source size>, and writing data starting at <destination address>, having <destination size>.

The total number of bytes to be transferred – the block size – is calculated from the source and destination sizes, with the larger chosen to be the block size. If we choose a destination size less than the source size, the transfers will not go beyond the area of memory defined by the specified destination address and the destination size. What actually happens to the “destination pointer” is not immediately obvious from the documentation, but for our purposes, where we will use a destination size of one byte, the DMA mechanism will just keep writing source bytes to the same destination address over and over again. (One might imagine the pointer starting again at the initial start address, or perhaps stopping at the end address instead.)

So, for our purposes, we define a “cell” as 160 bytes, being the amount of data in a single display line, and we only transfer one cell in a block. Thus, the DMA source is 160 bytes long, and even though the destination size is only a single byte, the DMA mechanism will transfer each of the source bytes into the destination. There is a rather unhelpful diagram in the documentation that perhaps tries to communicate too much at once, leading one to believe that the cell size is a factor in how the destination gets populated by source data, but the purpose of the cell size seems only to be to define how much data is transferred at once when a transfer is requested.

DMA Transfer Mechanism

The transfer of framebuffer data to PORTB using DMA cell transfers (noting that this hints at the eventual approach which uses PORTB and not PMDIN)

In the matter of requesting a transfer, we have already described the mechanism that will allow us to make this happen: when the timer signals the start of a new line, we can use the wraparound event to initiate a DMA transfer. It would appear that the transfer will happen as fast as both the source and the destination will allow, at least as far as I can tell, and so it is probably unlikely that the data will be sent to the destination too quickly. Once the transfer of a line’s pixel data is complete, we can do some things to set up the transfer for the next line, like changing the source data address to point to the next 160 bytes representing the next display line.

(We could actually set the block size to the length of the entire framebuffer – by setting the source size – and have the DMA mechanism automatically transfer each line in turn, updating its own address for the current line. However, I intend to support hardware scrolling, where the address of the first line of the screen can be adjusted so that the display starts part way through the framebuffer, reaches the end of the framebuffer part way down the screen, and then starts again at the beginning of the framebuffer in order to finish displaying the data at the bottom of the screen. The DMA mechanism doesn’t seem to support the necessary address wraparound required to manage this all by itself.)

Output Complications

Having assumed that the PMP peripheral would be an appropriate choice, I soon discovered some problems with the generated output. Although the data that I had stored in the RAM seemed to be emitted as pixels in appropriate colours, there were gaps between the pixels on the screen. Yet the documentation seemed to vaguely indicate that the PMDIN register was more or less continuously updated. That meant that the actual output signals were being driven low between each pixel, causing black-level gaps and ruining the result.

I wondered if anything could be done about this issue. PMP is really intended as some kind of memory interface, and it isn’t unreasonable for it to only maintain valid data for certain periods of time, modifying control signals to define this valid data period. That PMP can be used to drive LCD panels is merely a result of those panels themselves upholding this kind of interface. For those of you familiar with microcontrollers, the solution to my problem was probably obvious several paragraphs ago, but it needed me to reconsider my assumptions and requirements before I realised what I should have been doing all along.

Unlike SPI, which concerns itself with the bit-by-bit serial output of data, PMP concerns itself with the multiple-bits-at-once parallel output of data, and all I wanted to do was to present multiple bits to a memory location and have them translated to a collection of separate signals. But, of course, this is exactly how normal I/O (input/output) pins are provided on microcontrollers! They all seem to provide “PORT” registers whose bits correspond to output pins, and if you write a value to those registers, all the pins can be changed simultaneously. (This feature is obscured by platforms like Arduino where functions are offered to manipulate only a single pin at once.)

And so, I changed the DMA destination to be the PORTB register, which on the PIC32MX270 is the only PORT register with enough bits corresponding to I/O pins to be useful enough for this application. Even then, PORTB does not have a complete mapping from bits to pins: some pins that are available in other devices have been dedicated to specific functions on the PIC32MX270F256B and cannot be used for I/O. So, it turns out that we can only employ at most seven bits of our pixel data in generating signal data:

PORTB Pin Availability on the PIC32MX270F256B
Pins
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
RPB15 RPB14 RPB13 RPB11 RPB10 RPB9 RPB8 RPB7 RPB5 RPB4 RPB3 RPB2 RPB1 RPB0

We could target the first byte of PORTB (bits 0 to 7) or the second byte (bits 8 to 15), but either way we will encounter an unmapped bit. So, instead of choosing a colour representation making use of eight bits, we have to make do with only seven.

Initially, not noticing that RPB6 was not available, I was using a “RRRGGGBB” or “332″ representation. But persuaded by others in a similar predicament, I decided to choose a representation where each colour channel gets two bits, and then a separate intensity bit is used to adjust the final intensity of the basic colour result. This also means that greyscale output is possible because it is possible to balance the channels.

The 2-bit-per-channel plus intensity colours

The colours employing two bits per channel plus one intensity bit, perhaps not shown completely accurately due to circuit inadequacies and the usual white balance issues when taking photographs

It is worth noting at this point that since we have left the 8-bit limitations of the PMP peripheral far behind us now, we could choose to populate two bytes of PORTB at once, aiming for sixteen bits per pixel but actually getting fourteen bits per pixel once the unmapped bits have been taken into account. However, this would double our framebuffer memory requirements for the same resolution, and we don’t have that much memory. There may be devices with more than sixteen bits mapped in the 32-bit PORTB register (or in one of the other PORT registers), but they had better have more memory to be useful for greater colour depths.

Back in Black

One other matter presented itself as a problem. It is all very well generating a colour signal for the pixels in the framebuffer, but what happens at the end of each DMA transfer once a line of pixels has been transmitted? For the portions of the display not providing any colour information, the channel signals should be held at zero, yet it is likely that the last pixel on any given line is not at the lowest possible (black) level. And so the DMA transfer will have left a stray value in PORTB that could then confuse the monitor, producing streaks of colour in the border areas of the display, making the monitor unsure about the black level in the signal, and also potentially confusing some monitors about the validity of the picture, too.

As with the horizontal sync pulses, we need a prompt way of switching off colour information as soon as the pixel data has been transferred. We cannot really use an Output Compare unit because that only affects the value of a single output pin, and although we could wire up some kind of blanking in our external circuit, it is simpler to look for a quick solution within the capabilities of the microcontroller. Fortunately, such a quick solution exists: we can “chain” another DMA channel to the one providing the pixel data, thereby having this new channel perform a transfer as soon as the pixel data has been sent in its entirety. This new channel has one simple purpose: to transfer a single byte of black pixel data. By doing this, the monitor will see black in any borders and beyond the visible regions of the display.

Wiring Up

Of course, the microcontroller still has to be connected to the monitor somehow. First of all, we need a way of accessing the pins of a VGA socket or cable. One reasonable approach is to obtain something that acts as a socket and that breaks out the different signals from a cable, connecting the microcontroller to these broken-out signals.

Wanting to get something for this task quickly and relatively conveniently, I found a product at a local retailer that provides a “male” VGA connector and screw-adjustable terminals to break out the different pins. But since the VGA cable also has a male connector, I also needed to get a “gender changer” for VGA that acts as a “female” connector in both directions, thus accommodating the VGA cable and the male breakout board connector.

Wiring up to the broken-out VGA connector pins is mostly a matter of following diagrams and the pin numbering scheme, illustrated well enough in various resources (albeit with colour signal transposition errors in some resources). Pins 1, 2 and 3 need some special consideration for the red, green and blue signals, and we will look at them in a moment. However, pins 13 and 14 are the horizontal and vertical sync pins, respectively, and these can be connected directly to the PIC32 output pins in this case, since the 3.3V output from the microcontroller is supposedly compatible with the “TTL” levels. Pins 5 through 10 can be connected to ground.

We have seen mentions of colour signals with magnitudes of up to 0.7V, but no explicit mention of how they are formed has been presented in this article. Fortunately, everyone is willing to show how they converted their digital signals to an analogue output, with most of them electing to use a resistor network to combine each output pin within a channel to produce a hopefully suitable output voltage.

Here, with two bits per channel, I take the most significant bit for a channel and send it through a 470ohm resistor. Meanwhile, the least significant bit for the channel is sent through a 1000ohm resistor. Thus, the former contributes more to the magnitude of the signal than the latter. If we were only dealing with channel information, this would be as much as we need to do, but here we also employ an intensity bit whose job it is to boost the channels by a small amount, making sure not to allow the channels to pollute each other via this intensity sub-circuit. Here, I feed the intensity output through a 2200ohm resistor and then to each of the channel outputs via signal diodes.

VGA Output Circuit

The circuit showing connections relevant to VGA output (generic connections are not shown)

The Final Picture

I could probably go on and cover other aspects of the solution, but the fundamental aspects are probably dealt with sufficiently above to help others reproduce this experiment themselves. Populating memory with usable image data, at least in this solution, involves copying data to RAM, and I did experience problems with accessing RAM that are probably related to CPU initialisation (as covered in my previous article) and to synchronising the memory contents with what the CPU has written via its cache.

As for the actual picture data, the RGB-plus-intensity representation is not likely to be the format of most images these days. So, to prepare data for output, some image processing is needed. A while ago, I made a program to perform palette optimisation and dithering on images for the Acorn Electron, and I felt that it was going to be easier to adapt the dithering code than it was to figure out the necessary techniques required for software like ImageMagick or the Python Imaging Library. The pixel data is then converted to assembly language data definition statements and incorporated into my PIC32 program.

VGA output from a PIC32 microcontroller

VGA output from a PIC32 microcontroller, featuring a picture showing some Oslo architecture, with the PIC32MX270 being powered (and programmed) by the Arduino Duemilanove, and with the breadboards holding the necessary resistors and diodes to supply the VGA breakout and, beyond that, the cable to the monitor

To demonstrate control over the visible region, I deliberately adjusted the display frequencies so that the monitor would consider the signal to be carrying an image 800 pixels by 600 pixels at a refresh rate of 60Hz. Since my framebuffer is only 256 lines high, I double the lines to produce 512 lines for the display. It would seem that choosing a line rate to try and produce 512 lines has the monitor trying to show something compatible with the traditional 640×480 resolution and thus lines are lost off the screen. I suppose I could settle for 480 lines or aim for 300 lines instead, but I actually don’t mind having a border around the picture.

The on-screen menu showing the monitor's interpretation of the signal

The on-screen menu showing the monitor's interpretation of the signal

It is worth noting that I haven’t really mentioned a “pixel clock” or “dot clock” so far. As far as the display receiving the VGA signal is concerned, there is no pixel clock in that signal. And as far as we are concerned, the pixel clock is only important when deciding how quickly we can get our data into the signal, not in actually generating the signal. We can generate new colour values as slowly (or as quickly) as we might like, and the result will be wider (or narrower) pixels, but it shouldn’t make the actual signal invalid in any way.

Of course, it is important to consider how quickly we can generate pixels. Previously, I mentioned a 24MHz clock being used within the PIC32, and it is this clock that is used to drive peripherals and this clock’s frequency that will limit the transfer speed. As noted elsewhere, a pixel clock frequency of 25MHz is used to support the traditional VGA resolution of 640×480 at 60Hz. With the possibilities of running the “peripheral clock” in the PIC32MX270 considerably faster than this, it becomes a matter of experimentation as to how many pixels can be supported horizontally.

Some Oslo street art being displayed by the PIC32

Some Oslo street art being displayed by the PIC32

For my own purposes, I have at least reached the initial goal of generating a stable and usable video signal. Further work is likely to involve attempting to write routines to modify the framebuffer, maybe support things like scrolling and sprites, and even consider interfacing with other devices.

Naturally, this project is available as Free Software from its own repository. Maybe it will inspire or encourage you to pursue something similar, knowing that you absolutely do not need to be any kind of “expert” to stubbornly persist and to eventually get results!

Making Python Programs Faster with Shedskin

Tuesday, February 2nd, 2016

A few months ago, I had the opportunity to combine two of my interests: retrocomputing and Python programming. The latter needs little additional explanation, but the former perhaps requires a few more words. Retrocomputing is the study and use of computing equipment from an earlier era in computing, where such equipment is typically no longer in use, or is no longer in widespread use. My own experiences with microcomputers began in the 1980s and are largely centred upon those manufactured by Acorn Computers, such as the Acorn Electron and BBC Microcomputer.

Some History

One of my earlier initiatives was to attempt to document and understand the functioning of the Acorn Electron’s ULA integrated circuit: the Uncommitted Logic Array employed in the computer to generate video and perform input/output tasks. When Acorn decided to make the Electron as a variant of the BBC Microcomputer, the engineers merged many of the functions performed by separate chips into a single one, for several good and not-so-good reasons:

  • To reduce system complexity: having to connect several components and make sure that they all work correctly and in time with each other can be challenging, and it is arguably best to reduce the number of things that can go wrong by just reducing the number of things involved in the first place.
  • To reduce system cost: production becomes less complicated and savings can potentially be realised by combining discrete components.
  • To deepen the organisation’s experience with integrated circuit design: this being the company that developed the ARM architecture and eventually brought the first ARM chipset (CPU, audio/video controller, memory management unit, and input/output controller) to market.
  • To make a proprietary component that others could not readily clone: this being something of an obsession in the early 1980s marketplace.

Now, the ULA has the job of reading from memory and translating what it reads into a sequence of colour values, thus generating a picture on the screen. However, it has the annoying limitation of locking the CPU out of the memory (in fact, only the RAM which resides in a certain region) while it generates each line of the displayed image. Depending on which “screen mode” is selected (determining the resolution and colour depth), it may still let the CPU access the RAM at a lower speed, or it may effectively suspend the CPU for the entire time taken to generate a single horizontal display line.

Consequently, the Acorn Electron is considerably slower than the BBC Microcomputer for this reason: the BBC Micro employs faster memory and lets the CPU and its own video circuitry take turns with the memory while they both run at full speed; the Electron used cheaper memory as another cost-saving measure; reviews of the Electron, while generally positive about getting BBC Micro features for less, tended to note this performance degradation with some disappointment.

The Acorn Electron ULA

The Acorn Electron ULA (with socket cover/heatsink removed)

Some Background

Normally, the matter of the CPU stalling for much of its time would be considered a disadvantage, but my brother and I were having a discussion about software running from ROM (or, in fact, in the area of memory not affected by the ULA), with the pitfalls of such software needing to access the RAM and potentially becoming stalled as the CPU finds itself waiting for the ULA to do its work. Somehow the notion arose that the ULA effectively provides a kind of synchronisation mechanism for software needing to run in the period between display lines.

So, a program can read instructions from ROM and then, when it needs to know that the display is not being updated, it can attempt to access RAM. Whether or not the program stalls remains unknown to the program itself, but when it gets the result of accessing the RAM – perhaps immediately, perhaps after a few microseconds – it can be certain that the ULA is not accessing the RAM because it just did so itself.

A screen image showing the update region

An image showing the display update region (the "test card" picture) when the ULA accesses memory, with the black border indicating the region (or time period) during which the CPU may access the lower region of the Electron's memory.

(See the BBC Test Cards for the origin of the above picture.)

One might wonder what kind of program would benefit from synchronising itself to the display line periods. Well, one thing that tended to happen back in the microcomputing era, was the trick of reprogramming the display palette – the selection of colours shown on screen – so that a greater number of colours can be displayed simultaneously on the screen than would normally be the case for that particular display configuration. For example, a screen mode normally offering only four colours could instead offer the full range – eight “proper” colours on an Electron – if it switched the palette during screen updates. Even a screen mode offering only two colours could offer eight by employing palette switching.

And with a certain amount of experimentation, a working solution was eventually delivered (with only limited input needed from my side). By running software in a ROM (or from RAM mapped into the same area of memory as a ROM), it became possible to reliably change the palette on a line-by-line basis, bridging the gap between a medium-resolution four-colour mode and a hypothetical eight-colour version that would only become available on Acorn’s ARM-based microcomputers. Although only four colours can be used per display line, each of the 256 display lines can employ four-colour combinations of the full eight colours, not quite making an eight-colour mode – which would, of course, allow eight colours on every display line – but still permitting graphical output much closer to a true eight-colour mode than can be contemplated with a restrictive four-colour mode.

Rainbow lorikeets on a lawn

Rainbow lorikeets on a lawn (left: 4 colours from 8 per display line; right: 8 colours per display line)

Palette Optimisation

With the trick in place to switch the palette on demand, all that remained was to make a program that could take an input image and to optimise the colours so that…

  1. Only the eight permitted colours (black, white, primary and secondary colours) are used in the image.
  2. No pixel row (display line) employs more than four different colours.

Expectations were rather low at first. First of all, in this era of bountiful quantities of fresh imagery, most pictures appear to have plenty of colours and are photographs, and so the easiest images to use are the ones that first need to be reduced in both resolution and colour depth. And once this is done, the job of optimising the colours to meet the second of the above criteria is required. So, initially, very simple techniques were employed to do both of these things.

Later, after some discussions, it appeared that integrating the two processing activities and, crucially, applying basic dithering and error propagation techniques, made it possible to represent complicated “deep-colour” images within the limited display representation.

Magnified details of the two representations

Magnified details of the two representations

Closer examination of the above images reveals how the algorithm attempts to spread the responsibility of representing colours across rows, being restricted to the choice of four colours (in the left image) to produce the appropriate tones that are more readily encoded using eight colours (in the right image).

Tuning the Implementation

To perform the optimisation process, I wrote a program which took an input image, rotated and scaled it to the target resolution, and processed the colours by inspecting the pixel data on each horizontal line (or row) of the image, calculating the appropriate four-colour combination for a line and generating suitable pixel data appropriate for this restricted palette. Since I am probably most comfortable with Python, and since Python also has various convenient image-processing libraries, I found myself developing a short Python program to do the work.

Now, Python doesn’t usually exhibit the highest performance, particularly for tasks such as this, and as the experimentation with different approaches started to lessen, with the most rewarding approaches making themselves evident, and with the temptation to convert lots of images just to see the results, my attentions turned to speeding up the program. Since I had been involved with packaging the Shedskin Python-to-C++ compiler for Debian, and thus the package was right there for me to use, it made sense to give it a try on this code.

At this point, the program was taking around 40 seconds to convert a 16 megapixel photographic image into the appropriate output representation. Despite using the Python Imaging Library to do the rotation and scaling, visiting each pixel in a Python program and doing some simple statistical and arithmetic operations was taking up rather a lot of time. In the past, I have used various “numeric” Python extensions – usually the ones supported by pygame – but things have moved on somewhat incoherently since then, and I also wanted to keep my code straightforward and readable, which is often something that is lost when making code “numeric”.

Time for Shedskin

I have used Shedskin before, and the first thing to think about when using it is how it will interact with non-Python code. Shedskin takes Python code and generates C++ which must then be compiled. If a whole program is translated to C++, the resulting executable can be run with no further work necessary. However, the program of interest here uses libraries that are actually implemented in C and are delivered as shared libraries that are loaded by CPython (the Python virtual machine implementation written in the C programming language). Shedskin cannot generally translate such a program in its entirety.

From the earliest days of Python, it was (and has remained) a common practice to first write library code in Python, desire better performance, and to then rewrite much of that code in the C programming language against the Python/C API (or CPython API) as an “extension module”, which is what these problematic shared libraries are. Such libraries act as part of the CPython virtual machine, more or less, and with suitable implementation choices made for performance, everything using such libraries will run much more quickly. Unfortunately, but understandably, Shedskin doesn’t seek to interact closely with the CPython implementation: it produces C++ code that works with its own runtime libraries.

So, instead of working with a single program file, I split my program up into a main program which deals with CPython extensions such as the Python Imaging Library, whose JPEG and PNG manipulation facilities are too convenient to abandon, along with a library module that does all the computation for this particular application. The library would provide its own image abstraction for pixel-level accesses, but be given the pixel data obtained by the main program, returning the processed data to the main program for saving to a file. By only compiling the library, Shedskin can produce a standalone library file containing hopefully high-performance implementations of the original Python code.

But how can such a library constructed by Shedskin be used? Would we now need to somehow translate the main program into C or C++ in order to be able to use it? Fortunately, but slightly confusingly, Shedskin can also generate the necessary CPython API wrapper around the translated code, making it possible for CPython to load this newly-created library after all.

(One might wonder how this is possible, but if one considers that arbitrary C or C++ code can be wrapped using the CPython API, with the values being sent in and out of a library only being converted at the interface, Shedskin has the luxury of generating something that lives by its own rules internally, with the wrapper doing the necessary “marshalling” of the values as they go in and come out. Such a wrapped library might be frowned upon in the Python world, not being a sophisticated extension module that takes full advantage of the CPython API, but such libraries were, and probably still are, the “bread and butter” of Python’s access to a wide array of tools and technologies.)

Knowing that this is a reasonable approach, I experienced a moment of excessive ambition. I tried to take the newly-broken-out library and compile it using the appropriate option:

shedskin -e optimiser.py

This took quite some time, and things did not go well…

*** SHED SKIN Python-to-C++ Compiler 0.9.2 ***
Copyright 2005-2011 Mark Dufour; License GNU GPL version 3 (See LICENSE)

[analyzing types..]
****************************88%
*WARNING* reached maximum number of iterations
********************************100%
[generating c++ code..]
*WARNING* 'get_combinations' function not exported (cannot convert argument 'c')
*WARNING* 'get_colours' function not exported (cannot convert return value)
*WARNING* 'balance' function not exported (cannot convert argument 'd')
*WARNING* optimiser.py: expression has dynamic (sub)type: {None, int, tuple}
*WARNING* optimiser.py: expression has dynamic (sub)type: {float, int, tuple2}
*WARNING* optimiser.py: expression has dynamic (sub)type: {float, int, tuple}
*WARNING* optimiser.py: expression has dynamic (sub)type: {int, tuple}
*WARNING* optimiser.py: variable 'data' has dynamic (sub)type: {None, int, tuple}
*WARNING* optimiser.py: variable (class SimpleImage, 'data') has dynamic (sub)type: {None, int, tuple}

And then there were many more lines of a more specific nature:

*WARNING* optimiser.py:41: function distance not called!
*WARNING* optimiser.py:50: expression has dynamic (sub)type: {None, int, tuple}
*WARNING* optimiser.py:53: expression has dynamic (sub)type: {None, int, tuple}
*WARNING* optimiser.py:57: expression has dynamic (sub)type: {float, int, tuple2}
*WARNING* optimiser.py:57: expression has dynamic (sub)type: {int, tuple}

And so on. Now, if you are thinking that this will probably not end well, you would be right. Running make gives plenty of errors looking as scary as this…

optimiser.cpp: In function '__shedskin__::list<__shedskin__::tuple2<__shedskin__::tuple2<int, int>*, double>*>* __optimiser__::list_comp_0(__shedskin__::pyiter<double>*)':
optimiser.cpp:76:17: error: base operand of '->' is not a pointer
optimiser.cpp:77:21: error: base operand of '->' is not a pointer
optimiser.cpp:78:68: error: invalid conversion from '__shedskin__::tuple2<int, int>*' to 'int' [-fpermissive]
In file included from /usr/share/shedskin/lib/builtin.hpp:1204:0,
                 from optimiser.cpp:1:
/usr/share/shedskin/lib/builtin/tuple.hpp:211:28: error:   initializing argument 2 of '__shedskin__::tuple2<A, B>::tuple2(int, A, B) [with A = int; B = double]' [-fpermissive]
optimiser.cpp:78:70: error: no matching function for call to '__shedskin__::list<__shedskin__::tuple2<__shedskin__::tuple2<int, int>*, double>*>::append(__shedskin__::tuple2<int, double>*)'
optimiser.cpp:78:70: note: candidate is:
In file included from /usr/share/shedskin/lib/builtin.hpp:1203:0,
                 from optimiser.cpp:1:
/usr/share/shedskin/lib/builtin/list.hpp:96:25: note: void* __shedskin__::list<T>::append(T) [with T = __shedskin__::tuple2<__shedskin__::tuple2<int, int>*, double>*]
/usr/share/shedskin/lib/builtin/list.hpp:96:25: note:   no known conversion for argument 1 from '__shedskin__::tuple2<int, double>*' to '__shedskin__::tuple2<__shedskin__::tuple2<int, int>*, double>*'

Of course, it would be unfair to expect this to have worked: Shedskin did indeed give us plenty of warnings! But we should at least try and understand such warnings if we are to make progress. After a few more iterations of this bold strategy, I realised that I had started out in the wrong fashion, presenting Shedskin with code that is too dynamic (and ambiguous) for it to reasonably infer sensible types and produce a compilable C++ representation.

First Things First

I started out by reintroducing some necessary changes gradually. First of all, I made a branch of my code committing the special image abstraction to be used in any future Shedskin-compiled version. Meanwhile, I looked at some things that might generally be performance-degrading in Python, and which might also help Shedskin deduce the program types more readily.

One thing that I had done in the initial implementation was to freely use sequences to represent colour triplets: the red, green and blue values. I was using code like this:

def invert(srgb):
    return tuple(map(lambda x: 1.0 - x, srgb))

This might be elegant – there’s no separate treatment of each element in the triplet – but there is likely to be more overhead in treating the triplet as a sequence, iterating over it, and so on. Moreover, Shedskin is likely to see objects appearing from a collection without any idea of what was put into that collection to start with. So, more mundane code was introduced in its place:

def invert(srgb):
    r, g, b = srgb
    return 1.0 - r, 1.0 - g, 1.0 - b

Such changes reduced the processing time from around 35 seconds to around 25 seconds. After various other performance modifications, such as avoiding the repeated computation of common values, this became around 18 to 19 seconds. Interestingly, merging such modifications into the branch with the special image abstraction reduced the running time to around 16 to 17 seconds. But at this point, obvious optimisation opportunities were more or less exhausted.

It then became time for another attempt to present this to Shedskin. This time, instead of calling the main program main.py and the library optimiser.py, which is not particularly intuitive, I retained the name of the main program as optimiser.py and used the traditional optimiserlib.py naming for the library:

shedskin -e optimiserlib.py

This completed without any warnings, and running make produced a library file, optimiserlib.so, that Python recognises as an extension module. Running the program produced a palette-optimised image in just under 9 seconds!

Making the Difference

So, what were the differences that made Shedskin accept the library code? Since the splitting up of the code into two files makes comparisons between versions awkward, we can first compare the version of the program before our preparatory work with the version of the program that fed into our Shedskin version. There are four areas of changes:

  • The introduction of that image abstraction class – SimpleImage – to be used to manipulate images instead of using Python Imaging Library objects. (In fact, this is only used initially as a container for the raw pixel data, with such data being passed to library functions, as described below.)
  • Modifications to access the different components of colour triplets explicitly.
  • Some “common sense” performance modifications, avoiding the repeated computation of things that always produce the same result anyway.
  • Some local name adjustments.

This final point is something that the Shedskin documentation mentions prominently, but it can be easy to forget. Consider the following code (in the balance function of the library module):

dd = dict([(value, f) for f, value in d])

Originally, this looked like this:

d = dict([(value, f) for f, value in d])

This earlier version just reverses a dictionary mapping and assigns the result to the name of the original dictionary. Although this seems harmless enough, Shedskin’s analysis would be complicated substantially by having to deal with a name being reassigned and potentially being associated with completely different kinds of objects, even if in this case we are only dealing with dictionaries. (Even if the same type were only ever involved, as we see here, it would also prevent any analysis being done on the nature of the keys and values in the dictionaries.)

We can see the effect of this by trying to compile the functioning version with the earlier naming scheme reintroduced. First, we see a warning like this:

*WARNING* 'balance' function not exported (cannot convert argument 'd')

Since Shedskin propagates type information around the program, this leads to other problems:

*WARNING* optimiserlib.py: expression has dynamic (sub)type: {float, int, tuple2}
*WARNING* optimiserlib.py: expression has dynamic (sub)type: {float, int, tuple}
*WARNING* optimiserlib.py: expression has dynamic (sub)type: {int, tuple}
*WARNING* optimiserlib.py: variable (function (function balance, 'list_comp_0'), 'value') has dynamic (sub)type: {int, tuple}
*WARNING* optimiserlib.py: variable (function (function balance, 'list_comp_1'), 'f') has dynamic (sub)type: {float, int, tuple2}
*WARNING* optimiserlib.py: variable (function (function balance, 'list_comp_1'), 'value') has dynamic (sub)type: {int, tuple}

And later in the messages, the more specific errors indicate the problem and its consequences:

*WARNING* optimiserlib.py:100: expression has dynamic (sub)type: {float, int, tuple2}
*WARNING* optimiserlib.py:100: expression has dynamic (sub)type: {float, int, tuple}
*WARNING* optimiserlib.py:100: expression has dynamic (sub)type: {int, tuple}
*WARNING* optimiserlib.py:102: expression has dynamic (sub)type: {float, int, tuple2}
*WARNING* optimiserlib.py:102: expression has dynamic (sub)type: {float, int, tuple}
*WARNING* optimiserlib.py:103: expression has dynamic (sub)type: {float, int, tuple2}
*WARNING* optimiserlib.py:103: expression has dynamic (sub)type: {float, int, tuple}
*WARNING* optimiserlib.py:104: expression has dynamic (sub)type: {float, int, tuple2}
*WARNING* optimiserlib.py:104: expression has dynamic (sub)type: {float, int, tuple}
*WARNING* optimiserlib.py:105: class 'list' has no method 'items'
*WARNING* optimiserlib.py:105: expression has dynamic (sub)type: {float, int, tuple2}
*WARNING* optimiserlib.py:105: expression has dynamic (sub)type: {float, int, tuple}
*WARNING* optimiserlib.py:105: expression has dynamic (sub)type: {int, tuple}

And so on. Such warnings, which will cause errors if one tries to compile the result, can seem very intimidating, but within all these details the cause should be identifiable.

But another important aspect of the successful translation of the library module should not be forgotten: it is the matter of choosing the right initial functionality to give to Shedskin. Instead of trying to compile everything, it makes sense to concentrate on things like functions which are fairly self-contained, which do not call potentially vast regions of other program functionality, and which are invoked often, especially in loops that take a long time. Migrating a small piece of functionality at a time helps to keep the troubleshooting activity manageable.

By using the SimpleImage class as a convenient staging post for pixel data, simple library functions could be migrated first, and the library could be kept ignorant of the abstractions being maintained in the main program. Later on, as we shall see below, such abstractions and functions that require them could then be migrated themselves.

Broadening the Scope

With some core functionality migrated to a compilable extension module, it becomes possible to give other things the Shedskin treatment. At the start of the second attempt to get Shedskin to compile the code, quite a few functions were still being run by the CPython virtual machine, notably various image operations. With the functions called by these operations already migrated, it becomes possible to move each of these operations over into the extension module. The expectation here is that since these functions can now call each other without the overhead of the CPython API, as well as avoiding running code in the virtual machine, further performance improvements will occur.

And since Shedskin’s own runtime library supports certain Python standard library modules in accelerated form, it becomes interesting to migrate operations that employ such module functionality. For example, the get_combinations function can be moved to take advantage of Shedskin’s itertools implementation. And in addition to just moving functions, classes can also be compiled by Shedskin and potentially accessed more quickly, while remaining accessible to normal Python code that hasn’t been compiled.

Not all of these optimisations give the boost in performance that might be hoped for, perhaps even bringing performance penalties when first introduced. But it is important to look beyond any current, small optimisation to future optimisations that the accumulation of those small optimisations will permit. Moving the SimpleImage class into the extension module permits the migration of other code, with the consequence that the program when run takes around 6 seconds instead of 9 seconds: the apparent optimisation barrier is overcome, leading to yet more gains!

Some Conclusions

It is possible to make very useful performance gains just by revisiting code and writing it in a way that suits the language implementation. Starting out with an execution time of around 40 seconds, it was possible to make the program run in closer to 15 seconds, and that might have been enough for occasional conversion of images. But by adopting Shedskin and applying it to a subset of the program’s functionality, an execution time of 9 seconds was initially obtained. This might not be quite as impressive as the earlier two-and-a-half fold reduction in time (or, of course, a two-and-a-half fold increase in throughput), but it was obtained with little additional effort, with admittedly some planning for it being incorporated into the earlier optimisation work.

Ultimately, reaching around 6 seconds of execution time means that Shedskin was indeed able to more or less match earlier performance gains, but again with little actual optimisation effort. And one can argue that merely using Shedskin helped to inform those earlier gains, anyway. Regardless of what should take the credit, the program ended up running almost seven times faster than when we started out on this journey.

Tools like Shedskin have a slightly controversial place in the Python world. People like to point out that Shedskin really only compiles a “restricted subset” of Python: one that Shedskin is able to analyse. And as the above exercise demonstrates, modifications to programs are likely to be required to take advantage of it. But the resulting programs are still Python programs, and the CPython virtual machine will still run them, just slower – three times slower in this case – than if they were compiled.

Authors of tools like Shedskin and alternative implementations of the Python language have a tough decision to make. They must either deal with the continuously changing “full-fat” version of the language, with new features being added all the time that they have to support, with the risk that they will never catch up and never be considered a “proper” implementation. Or they must impose limitations on the form of the language and the features they support, knowing that even if their software accepts programs written in Python, it will be marginalised and regarded as a distraction from “proper” Python programming. And yet, Python as delivered by CPython offers little in the way of things that Shedskin and similar tools seek to offer, so it should be no wonder that such tools have come to exist.

Instead, one might wonder why it is that the language must continue to evolve in ways that frustrate static analysis and enhancements to program performance and scalability. Indeed, my own interests in the analysis of Python code have been somewhat rekindled by this exercise, but unlike the valiant efforts of other developers (such as the author of Nuitka and his continuing quest to support the compilation of “full-fat” Python), I have no intention of using Python as my gold standard. In essence, Python is and has been a productive language to use – that is why I have used it here and many times before – but the very nature of the language should be open to question and review.

If, by discarding features, something like Python can be more predictable and better-performing, why would exploring this avenue of inquiry be a bad thing? Shedskin shows us that there are benefits in doing so, and I hope that this article has also shown some practical techniques for using and understanding the tool. And maybe this will encourage me and others to make some more progress with our own Python program analysis efforts as well, if only to offer alternatives that seek to deliver on some of the unrealised promise and potential that Python seemed to offer back when we first discovered it, so many years ago now.

Oslo architecture (4 from 8 colours with scaled original)

An architectural picture from Oslo with the left version using 4 colours from 8 on every display line, the right version being a scaled version of the original.