Paul Boddie's Free Software-related blog


Archive for the ‘data preservation’ Category

Migrating the Mailman Wiki from Confluence to MoinMoin

Tuesday, May 12th, 2015

Having recently seen an article about the closure of a project featuring that project’s usage of proprietary tools from Atlassian – specifically JIRA and Confluence – I thought I would share my own experiences from the migration of another project’s wiki site that had been using Confluence as a collaborative publishing tool.

Quite some time ago now, a call for volunteers was posted to the FSF blog, asking for people familiar with Python to help out with a migration of the Mailman Wiki from Confluence to MoinMoin. Subsequently, Barry Warsaw followed up on the developers’ mailing list for Mailman with a similar message. Unlike the project at the start of this article, GNU Mailman was (and remains) a vibrant Free Software project, but a degree of dissatisfaction with Confluence, combined with the realisation that such a project should be using, benefiting from, and contributing to Free Software tools, meant that such a migration was seen as highly desirable, if not essential.

Up and Away

Initially, things started off rather energetically, and Bradley Dean initiated the process of fact-finding around Confluence and the Mailman project’s usage of it. But within a few months, things apparently became noticeably quieter. My own involvement probably came about through seeing the ConfluenceConverter page on the MoinMoin Wiki, looking at the development efforts, and seeing if I couldn’t nudge the project along by pitching in with notes about representing Confluence markup format features in the somewhat more conventional MoinMoin wiki syntax. Indeed, it appears that my first contribution to this work occurred as early as late May 2011, but I was more or less content to let the project participants get on with their efforts to understand how Confluence represents its data, how Confluence exposes resources on a wiki, and so on.

But after a while, it occurred to me that the volunteers probably had other things to do and that progress had largely stalled. Although there wasn’t very much code available to perform concrete migration-related tasks, Bradley had gained some solid experience with the XML format employed by exported Confluence data, and such experience when combined with my own experiences dealing with very large XML files in my day job suggested an approach that had worked rather well with such large files: performing an extraction of the essential information, including identifiers and references that communicate the actual structure of the information, as opposed to the hierarchical structure of the XML data itself. With the data available in a more concise and flexible form, it can then be processed in a more convenient fashion, and within a few weeks I had something ready to play with.

With a day job and other commitments, it isn’t usually possible to prioritise volunteer projects like this, and I soon discovered that some other factors were involved: technological progress, and the tendency for proprietary software and services to be upgraded. What had initially involved the conversion of textual content from one markup format to another now seemed to involve the conversion from two rather different markup formats. All the effort documenting the original Confluence format now seemed to be almost peripheral if not superfluous: any current content on the Mailman Wiki would now be in a completely different format. And volunteer energy seemed to have run out.

A Revival

Time passed. And then the Mailman developers noticed that the Confluence upgrade had made the wiki situation even less bearable (as indeed other Confluence users had noticed and complained about), and that the benefits of such a solution were being outweighed by the inconveniences of the platform. And it was at this point that I realised that it was worthwhile continuing the migration effort: it is bad enough that people feel constrained by a proprietary platform over which they have little control, but it is even worse when it appears that they will have to abandon their content and start over with little or no benefit from all the hard work they have invested in creating and maintaining that content in the first place.

And with that, I started the long process of trying to support not only both markup formats, but also all the features likely to have been used by the Mailman project and those using its wiki. Some might claim that Confluence is “powerful” by supporting a multitude of seemingly exotic features (page relationships, comments, “spaces”, blogs, as well as various kinds of extensions), but many of these features are rarely used or never actually used at all. Meanwhile, as many migration projects can attest, if one inadvertently omits some minor feature that someone regards as essential, one risks never hearing the end of it, especially if the affected users have been soaking up the propaganda from their favourite proprietary vendor (which was thankfully never a factor in this particular situation).

Despite the “long tail” of feature support, we were able to end 2012 with some kind of overview of the scope of the remaining work. And once again I was able to persuade the concerned parties that we should focus on MoinMoin 1.x not 2.x, which has proved to be the correct decision given the still-ongoing status of the latter even now in 2015. Of course, I didn’t at that point anticipate how much longer the project would take…

Iterating

Over the next few months, I found time to do more work and to keep the Mailman development community informed again and again, which is a seemingly minor aspect of such efforts but is essential to reassure people that things really are happening: the Mailman community had, in fact, forgotten about the separate mailing list for this project long before activity on it had subsided. One benefit of this was to get feedback on how things were looking as each iteration of the converted content was made available, and with something concrete to look at, people tend to remember things that matter to them that they wouldn’t otherwise think of in any abstract discussion about the content.

In such processes, other things tend to emerge that initially aren’t priorities but which have to be dealt with eventually. One of the stated objectives was to have a full history, meaning that all the edits made to the original content would need to be preserved, and for an authentic record, these edits would need to preserve both timestamp and author information. This introduced complications around the import of converted content – it being no longer sufficient to “replay” edits and have them assume the timestamp of the moment they were added to the new wiki – as well as the migration and management of user profiles. Particularly this latter area posed a problem: the exported data from Confluence only contained page (and related) content, not user profile information.

Now, one might not have expected user details to be exportable anyway due to potential security issues with people having sufficient privileges to request a data dump directly from Confluence and then to be able to obtain potentially sensitive information about other users, but this presented another challenge where the migration of an entire site is concerned. On this matter, a very pragmatic approach was taken: any user profile pages (of which there were thankfully very few) were retrieved directly over the Web from the existing site; the existence of user profiles was deduced from the author metadata present in the actual exported wiki content. Since we would be asking existing users to re-enable their accounts on the new wiki once it became active, and since we would be avoiding the migration of spammer accounts, this approach seemed to be a reasonable compromise between convenience and completeness.

Getting There

By November 2013, the end was in sight, with coverage of various “actions” supported by Confluence also supported in the migrated wiki. Such actions are a good example of how things that are on the edges of a migration can demand significant amounts of time. For instance, Confluence supports a PDF export action, and although one might suggest that people just print a page to file from their browser, choosing PDF as the output format, there are reasonable arguments to be made that a direct export might also be desirable. Thus, after a brief survey of existing options for MoinMoin, I decided it would be useful to provide one myself. The conversion of Confluence content had also necessitated the use of more expressive table syntax. Had I not been sufficiently interested in implementing improved table facilities in MoinMoin prior to this work, I would have needed to invest quite a bit of effort in this seemingly peripheral activity.

Again, time passed. Much of the progress occurred off-list at this point. In fact, a degree of confusion, miscommunication and elements of other factors conspired to delay the availability of the infrastructure on which the new wiki would be deployed. Already in October 2013 there had been agreement about hosting within the python.org infrastructure, but the matter seemed to stall despite Barry Warsaw trying to push it along in February and April 2014. Eventually, after complaining from me on the PSF members’ mailing list at the end of May, some motion occurred on the matter and in July the task of provisioning the necessary resources began.

After returning from a long vacation in August, the task of performing the final migration and actually deploying the content could finally begin. Here, I was able to rely on expert help from Mark Sapiro who not only checked the results of the migration thoroughly, but also configured various aspects of the mail system functionality (one of the benefits of participating in a mail-oriented project, I guess), and even enhanced my code to provide features that I had overlooked. By September, we were already playing with the migrated content and discussing how the site would be administered and protected from spam and vandalism. By October, Barry was already confident enough to pre-announce the migrated site!

At Long Last

Alas, things stalled again for a while, perhaps due to other commitments of some of the volunteers needed to make the final transition occur, but in January the new Mailman Wiki was finally announced. But things didn’t stop there. One long-standing volunteer, Jim Tittsler, decided that the visual theme of the new wiki would be improved if it were made to match the other Mailman Web resources, and so he went and figured out how to make a MoinMoin theme to do the job! The new wiki just wouldn’t look as good, despite all the migrated content and the familiarity of MoinMoin, if it weren’t for the special theme that Jim put together.

The all-new Mailman Wiki with Jim Tittsler's theme

The all-new Mailman Wiki with Jim Tittsler's theme

There have been a few things to deal with after deploying the new wiki. Spam and vandalism have not been a problem because we have implemented a very strict editing policy where people have to request editing access. However, this does not prevent people from registering accounts, even if they never get to use them to do anything. To deal with this, we enabled textcha support for new account registrations, and we also enabled e-mail verification of new accounts. As a result, the considerable volume of new user profiles that were being created (potentially hundreds every hour) has been more or less eliminated.

It has to be said that throughout the process, once it got started in earnest, the Mailman development community has been fantastic, with constructive feedback and encouragement throughout. I have had disappointing things to say about the experience of being a volunteer with regard to certain projects and initiatives, but the Mailman project is not that kind of project. Within the limits of their powers, the Mailman custodians have done their best to enable this work and to see it through to the end.

Some Lessons

I am sure that offers of “for free” usage of certain proprietary tools and services are made in a genuinely generous way by companies like Atlassian who presumably feel that they are helping to make Free Software developers more productive. And I can only say that those interactions I experienced with Contegix, who were responsible for hosting the Confluence instance through which the old Mailman Wiki was deployed, were both constructive and polite. Nevertheless, proprietary solutions are ultimately disempowering: they take away the control over the working environment that users and developers need to have; they direct improvement efforts towards themselves and away from Free Software solutions; they also serve as a means of dissuading people from adopting competing Free Software products by giving an indication that only they can meet the rigorous demands of the activity concerned.

I saw a position in the Norwegian public sector not so long ago for someone who would manage and enhance a Confluence installation. While it is not for me to dictate the tools people choose to do their work, it seems to me that such effort would be better spent enhancing Free Software products and infrastructure instead of remedying the deficiencies of a tool over which the customer ultimately has no control, to which the customer is bound, and where the expertise being cultivated is relevant only to a single product for as long as that product is kept viable by its vendor. Such strategic mistakes occur all too frequently in the Norwegian public sector, with its infatuation with proprietary products and services, but those of us not constrained by such habits can make better choices when choosing tools for our own endeavours.

I encourage everyone to support Free Software tools when choosing solutions for your projects. It may be the case that such tools may not at first offer precisely the features you might be looking for, and you might be tempted to accept an offer of a “for free” product or to use a no-cost “cloud” service, and such things may appear to offer an easier path when you might otherwise be confronted with a choice of hosting solutions and deployment issues. But there are whole communities out there who can offer advice and will support their Free Software project, and there are Free Software organisations who will help you deploy your choice of tools, perhaps even having it ready to use as part of their existing infrastructure.

In the end, by embracing Free Software, you get the control you need over your content in order to manage it sustainably. Surely that is better than having some random company in charge, with the ever-present risk of them one day deciding to discontinue their service and/or, with barely enough notice, discard your data.

International Day Against DRM

Wednesday, May 6th, 2015

A discussion on the International Day Against DRM got my attention, and instead of replying on the site in question, I thought I’d write something about it here. The assertion was that “this war has been lost“, to which it was noted that “ownership isn’t for everyone”.

True enough: people are becoming conditioned to accept that they can enjoy nice things but not have any control of them or, indeed, any right to secure them for themselves. So, you effectively have the likes of Spotify effectively reinventing commercial radio where the interface is so soul-crushingly awful that it’s almost more convenient to have to call the radio station to request that they play a track. Or at least it was when I was confronted with it on someone’s smartphone fairly recently.

Meanwhile, the ignorant will happily trumpet the corporate propaganda claiming that those demanding digital rights are “communists”, when the right to own things to enjoy on your own terms has actually been taken away by those corporations and their pocket legislators. Maybe people should remember that when they’re next out shopping for gadgets or, heaven forbid, voting in a public election.

An Aside on Music

Getting older means that one can happily and justifiably regard a lot of new cultural output as inferior to what came before, which means that if one happened to stop buying music when DRM got imposed, deciding not to bother with new music doesn’t create such a big problem after all. I have plenty of legitimately purchased music to listen to already, and I didn’t need to have the potential enjoyment of any new work inconvenienced by only being able to play that work on certain devices or on somebody else’s terms.

Naturally, the music industry blames the decline in new music sales on “piracy”, but in fact people just got used to getting their music in more convenient ways, or they decided that they already have enough music and don’t really need any more. I remember how some people would buy a CD or two every weekend just as a treat or to have something new to listen to, and the music industry made a very nice living from this convenient siphoning of society’s disposable income, but that was just a bubble: the prices were low enough for people to not really miss the money, but the prices were also high enough and provided generous-enough margins for the music industry to make a lot of money from such casual purchasers while they could.

Note that I emphasised “potential” above. That’s another thing that the music business got away with for years: the loyalty of their audiences. How many people bought new material from an artist they liked only to discover that it wasn’t as good as they’d hoped? After a while, people just lose interest. This despite the effective state subsidy of the music business through public broadcasters endlessly and annoyingly playing and promoting that industry’s proprietary content. And there is music from even a few years ago that you wouldn’t be able to persuade anyone to sell you any more. It is like they don’t want your money, or at least if it is not being handed over on precisely their terms, which for a long time now has seemed to involve the customer going back and paying them again and again for something they already bought (under threat of legal penalties for “format shifting” in order to compel such repeat business).

It isn’t a surprise that the focus now is on music (and video) streaming and that actually buying media to play offline is becoming harder and harder. The focus of the content industries is on making it more and more difficult to acquire their content in ways that make it possible to experience that content on sustainable terms. Just as standard music CDs became corrupted with DRM mechanisms that bring future access to the content into doubt, so have newer technologies been encumbered with inconvenient and illegitimate mechanisms to deny people legitimate access. And as the campaign against DRM notes, some of the outcomes are simply discriminatory and shameful.

Our Legacy

Even content that has not been “protected” has proven difficult to recover simply due to technological progress and material, cultural and intellectual decay. It would appal many people that anyone would put additional barriers around content just to maximise revenues when the risk is that the “protectors” of such content will either inadvertently (their competence not being particularly noted) or deliberately (their vindictiveness being especially noted) consign that content to the black hole of prehistory just to stop anyone else actually enjoying it without them benefiting from the act. In some cases, one would think that content destruction is really what the supposed guardians of the content actually want, especially when there’s no more easy money to be made.

Of course, such profiteers don’t actually care about things like cultural legacy or the historical record, but society should care about such things. Regardless of who paid for something to be made – and frequently it was the artist, with the publishers only really offering financing that would most appropriately be described as “predatory” – such content is part of our culture and our legacy. That is why we should resist DRM, we should not support its proponents when buying devices and content or when electing our representatives, and it is why we should try and limit copyright terms so that legacy content may stand a chance of being recovered and responsibly archived.

We owe it to ourselves and to future generations to resist DRM.