Bobulate


Posts Tagged ‘server’

Murphy’s Day

Wednesday, November 10th, 2010

If something weird is happening with a server, never think “It’ll just be an hour or two.” Never think “If I’m going to be in the server room anyway, I might as well do foo to another box while I’m at it.” I thought both of these foolish things, and the day showed that there are definitely areas of Linux system administration that I’m no good at and that are needlessly complicated, and that I’m an inveterate optimist when it comes to these things.

The CodeYard server — a five-year-old IBM x306 with hard drives showing over 30000 hours of continuous operation, and which has had uptimes of over 500 days — slowed to a crawl, then rebooted yesterday. Sjors pinged me by phone, so I biked to the University to take a look with him. While I was en route, the box had another kernel panic while running fsck(8). Ugh.

Now, working on a server that has two partially-mirrored 250GB SATA-150 hard drives and only 1GB of RAM (seriously, when we got this machine it was a sensible box for supporting medium-sized workgroups; now my phone has more oomph) just takes forever. It never takes just an hour or two to wait for the GEOM mirror to finish rebuilding and then for fsck(8) to wind up, and then .. bam, another kernel panic. By the end of the day we hadn’t really pinned down what was causing the problem, but a memory test seems in order.
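For reference, babysitting that rebuild looks roughly like this (a minimal sketch; the gm0 mirror name and the partition are placeholders for whatever the box actually uses):

    # watch the rebuild; DEGRADED plus a percentage means it is still syncing
    gmirror status gm0
    # once the mirror reports COMPLETE, check the filesystems by hand
    fsck -y /dev/mirror/gm0s1a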

All the data on the machine — students’ SVN and git repositories — seems safe, but we’ve pretty much turned off all the services that the box offers through its various service jails until we get things sorted out.

So one failure doesn’t a Murphy’s day make. The second is that my laptop — which worked in the morning and didn’t when I got to the server room — has suddenly forgotten that it has a display panel attached to it, so I don’t see a thing, not even BIOS POST messages. It still seems to boot into Fedora OK, and I can even log in to my wonderful pink desktop (now there’s a blessing in disguise), but I can’t see any of it. This particularly puts a crimp in the plan to use the laptop as a KDE demonstration machine during the NLUUG fall conference. I might end up lugging a desktop machine along instead.

In parallel with all this I did some upgrades on the EBN machine, which was foolish of me. That server had been running off of a spare laptop drive for some time — a situation that was bound to come crashing down at some point. So the plan was simple: add a 500GB data disk, put back the Sun 10kRPM SAS disk that came out of the machine some time ago, copy the boot stuff to the SAS disk, reboot, done.

Yeah, right.

Three things I’d forgotten: dump + restore no longer works, making disks bootable is non-trivial, and initrd is some brain-dead invention intended to prevent you from moving things around effectively. Give me FreeBSD, which will at least boot (quickly), then complain, and let you type in the root filesystem for single-user mode in a human-friendly fashion.

In the end I dd’ed the old disk onto the new disk, then did a chroot and mkinitrd. It just doesn’t seem right. Maybe I’ve missed a really obvious manpage somewhere explaining how the boot process works nowadays and how to migrate an installation to a different disk (lazyweb!). Tracking down the remaining references to the old disk took a bit longer, but the machine is up and running again. Now my next challenge is to convince the disk subsystem that I have hot-attached a new drive (which would be /dev/sdf) that is physically identical to /dev/sde, and then dd everything over again so there’s a spare boot disk.
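For my own notes, the sequence boiled down to something like the following sketch. The device names are placeholders, and the exact mkinitrd and grub-install invocations depend on the distribution (newer systems use dracut or update-initramfs instead):

    # clone the old disk onto the new one (sda and sdb are placeholders)
    dd if=/dev/sda of=/dev/sdb bs=1M conv=noerror,sync
    # mount the copy and chroot into it to regenerate the initrd
    mount /dev/sdb1 /mnt
    mount --bind /dev /mnt/dev
    mount --bind /proc /mnt/proc
    mount --bind /sys /mnt/sys
    chroot /mnt
    mkinitrd -f /boot/initrd-$(uname -r).img $(uname -r)
    # make the new disk bootable again
    grub-install /dev/sdb
    exit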

Plenty of things to go wrong. In retrospect, the old NetHack adage (e.g. for going down stairs while burdened with a cockatrice corpse) serves best: “just don’t do that.”

EBN up^Wdown for the moment

Sunday, May 30th, 2010

I wrote this on Saturday afternoon, but didn’t hit “post”: after running fsck repeatedly until it finally stopped finding unreferenced files, I’m hopefully calling the disk array on the EBN fixed. I’m still going to reconfigure the server in some way, but for the time being I’ve restarted the VM for api.kde.org and the EBN. That means that api.kde.org is accessible again and the EBN is running. I’m hoping that the NIC and RAID will hold out. It will take a while for API regeneration to finish and for a new round of Krazy checks to run.

And I would add this today: the RAID array didn’t survive the night, with new read timeouts on the disk followed by mirror corruption and end of story. So I rebooted, dropped that VM again and the machine is currently running only essential services. We’re in the process of moving things off of the machine now so that it is easier to reinstall, with fewer tasks. Then we’ll hopefully have something usable again.

EBN, api.kde.org, vizzzion.org and others back at last

Tuesday, May 18th, 2010

Well, it took ages, but the EBN and the different VMs it hosts are back. Add “sysadmin” to the list of occupations I probably shouldn’t attempt without (1) more training and (2) a stricter schedule. The NLUUG spring conference on systems administration was quite educational — and fun, too, chatting with various companies and learning about NanoBSD and ZFS — but it didn’t give me any magical beans to fix what ailed the EBN.

So what was the problem? Well, the whole thing started (yay, placing the blame!) with Bertjan, who wanted a newer Qt version on the EBN for his software quality checking tools. The EBN ran 6.2-RELEASE, and the necessary Qt versions and related ports are not supported on that release anymore. While the EOL for FreeBSD 6 is still six months away, the ports maintainers don’t necessarily want to keep supporting it. So we needed to update the OS to something newer.

There are tools to do that now, but I’ve never used them, and anyway I don’t think they support FreeBSD 6. So that means lots of “make buildworld buildkernel installkernel installworld” kinds of steps. First off, I found that the compilations took a lot longer than I expected (or hoped). So where I planned to go 6.2-6.4-7.3-8.0 in one day, the fact was that just compiling was going to take longer than that. I couldn’t pre-compile everything with the machine still up either, because FreeBSD 8 doesn’t compile in a FreeBSD 6 environment; hence the multiple steps. Note to self: update more frequently to avoid this kind of large upgrade.
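Each hop in that chain is essentially the sequence from the FreeBSD Handbook, which for my own reference looks roughly like this (mergemaster prompting and the single-user-mode detour omitted for brevity):

    cd /usr/src              # sources for the next release checked out here
    make buildworld
    make buildkernel
    make installkernel
    shutdown -r now          # reboot, ideally into single-user mode
    mergemaster -p           # merge config files needed before installworld
    make installworld
    mergemaster              # merge the rest of /etc
    make delete-old          # optionally remove obsolete files
    shutdown -r now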

The second problem was that the jails (virtual machines) on the server were poorly set up: they all had their own copies of the world. I hadn’t realized that a 6.2 jail wouldn’t work on a 7.3 host (for instance, ps fails and lots of other system tools don’t like it). If I had spent more time thinking, I would have realized that I could installworld into each jail again and things would be OK. Note to self: set up jails with an easily upgradeable world, as described in lots of best-practices documents on jails.
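In hindsight the fix is a one-liner per jail; something along these lines, with the jail path being a placeholder:

    cd /usr/src
    make installworld DESTDIR=/usr/jails/ebn   # reuse the world already built for the host
    mergemaster -D /usr/jails/ebn              # bring the jail's /etc up to date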

So I upgraded the host onwards to FreeBSD 8.0. Another long, long compile, with no GNU screen to make it easier to deal with. Thank goodness for the ILOM and the system console redirection it provides.
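For reference (and hedged, since the exact CLI varies a bit between ILOM firmware versions), getting at that redirected console goes something like:

    ssh root@ebn-ilom.example.org    # the service processor; hostname made up here
    # at the ILOM "->" prompt:
    start /SP/console                # attach to the host's redirected serial console
    # Esc ( detaches from the console again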

Of course, then I went on to make delete-old-libs, which meant that the ports on the system — all of which were compiled against the 6.2 libraries — didn’t work anymore. Note to self: see that little note “in case no 3rd party program uses them anymore”? Keep it in mind next time.
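The relevant targets, for next time (run the check first; the destructive ones only after all the ports have been rebuilt):

    cd /usr/src
    make check-old          # list obsolete files and libraries, changes nothing
    make delete-old         # remove obsolete files
    make delete-old-libs    # remove old libraries, once nothing links against them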

So, after about two days, I had a base system updated to 8.0, no working jails at all, and all ports — both in the host and in the jails — broken. At this point, I started doing two things in parallel. Note to self: don’t. I started rebuilding the ports in the host system, and reconfiguring the jails to have a single base installation with just /home, /etc, /var and /usr/local local to each jail, using nullfs mounts; I also decided to drop the starting of jails in /etc/rc.local and to use the jail-launching support that is now built in (but which wasn’t, as far as I know, available in 6.0, which is when I first configured the machine). Note to self: that was actually a good idea, and thanks also to Sjors, who reminded me of the jail_* variables.
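Roughly, the new setup means rc.conf entries along these lines (the jail name, paths and address are made up for illustration), plus a per-jail fstab that nullfs-mounts the shared base read-only:

    # /etc/rc.conf (hypothetical jail "ebn")
    jail_enable="YES"
    jail_list="ebn"
    jail_ebn_rootdir="/usr/jails/ebn"
    jail_ebn_hostname="ebn.example.org"
    jail_ebn_ip="192.0.2.10"
    jail_ebn_devfs_enable="YES"
    jail_ebn_mount_enable="YES"
    jail_ebn_fstab="/etc/fstab.ebn"

    # /etc/fstab.ebn -- shared base mounted read-only; /home, /etc, /var and
    # /usr/local live in the jail's own tree
    /usr/jails/base    /usr/jails/ebn/basejail    nullfs    ro    0    0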

So, rebuilding ports after a big step like that was complicated by the fact that perl, ruby, php and python all needed to be recompiled, and portupgrade -apP sometimes doesn’t quite get it right. In any case I needed to rebuild the ruby stack first to get a working portupgrade. The other three languages were a mess, with some of their modules disappearing at inopportune points along the upgrade path. Basically I did portupgrade -apP ; pkgdb -F ; portinstall <something missing> an awful lot until things were working again. This morning I finally got rid of the last missing PHP 5.3 modules, which brought the EBN parts back to life. Note to self: read UPDATING twice before doing this again.

Of course, all that would have been less problematic if the disk array hadn’t given out twice during the whole operation. Once the ridiculously heavy load on the machine caused a panic, and once the power on one of the disks fluctuated enough to cause another. Running fsck on a 600GB filesystem with 14M inodes is not quick (especially if there are a few directories with 1M files each, as is the case with KDE SVN mirrors). Note to self: badger more people about a better disk array for KDE.

Combine all that with sickness and family time, and that’s why it took a week. I’m blogging this for the notes-to-self for the next time I run an upgrade (resolution: when FreeBSD 8.1 comes out) and to notify folks that things should be back to normal. (If not, drop me a note in the comments.) On the positive side, the server is better organized now, disk usage is down a little bit, and future upgrades should be much easier.

Rains, pours, downtime

Thursday, May 13th, 2010

There are a few things I learned today: one is that FreeBSD with UFS2 is a little slow when dealing with directories with over a million files in them. The KDE SVN — created way back in the SVN 1.4 or earlier days — is set up like that, with one flat directory structure. As a consequence, copying an SVN repo mirror from one place on the disk to another is rather slow. Moving it (within the same filesystem) would be a lot faster, but I wanted a copy. Second is that the EBN machine has grown SVN mirrors and experiments and KDE checkouts (of the whole thing) like mushrooms after rain. I’ll have to clean some of that up, not so much for the disk space, but for tidiness. Third is that while copying three distinct million-file trees in parallel, your disk array will have a power hiccup, panicking the machine and leading to another two days of fsck. So more waiting for the EBN to come back — particularly annoying since I had the other virtual machines on the system back up and running, so that Sebas had his website back, the KDE4-Solaris packages were available again, and Claudia could share documents with the rest of the board.

Fourth is that Mystic Kriek is really quite tasty, in a pink-and-foamy-cherry-coke-with-alcohol kind of way.

Speaking of pink, I got word that my talk for Akademy has been accepted, with the condition that I must bring my pink whip. Paul Adams has nothing to do with that, I’m sure. However, I need to point out that I got a new whip in Kano last month, made of rolled up goat hide. White, plumed, a little bit more floppy than the nylon-core things we’re used to at Akademy. We’ll see how that turns out as a speaker motivational tool.

Extended downtime for EBN

Tuesday, May 11th, 2010

Overconfidence goeth before a fall, they say. The software upgrades on the EBN machine are taking a good deal longer than intended. Part of that is some unexpected trips I had to make, but mostly it’s just that 6.2-6.4-7.3-8.0 is a lot of buildworlds (which take surprisingly long!) and reboots, followed by the realization that the jails need upgrading as well to work at all (although the ports may continue to work, the base system doesn’t).

All in all it’s just taking a lot longer than intended, but it’s coming back bit-by-bit, rest assured.

Also, I should say that Sun’s ILOM really rocks for remote management, since I needed a great deal of console access to get this done, and I don’t feel like sitting in a cold and drafty datacentre to do so — not with this ongoing hacking cough and headache, no sirree.

Planned downtime on the EBN

Wednesday, May 5th, 2010

The server running the EBN (a Sun X4200 running FreeBSD — soon to be running OpenSolaris in a VM) is getting a bit long in the tooth, software-wise, and it turns out that it can no longer even run all of the software needed for improvements to the EBN. Bertjan has been bugging me to update it, which I can’t do until I update the whole machine from 6-CRUFTY to 8-STABLE, so I’m going to plan some downtime for the EBN machine: this weekend, 8 and 9 May 2010, from 12 noon (GMT) on the 8th until midnight (GMT) on the 9th. That should give me enough time to bring the machine down, make additional backups, upgrade the heck out of it (all except the hardware, unless someone cares to donate a pair of ECC Registered DDR2-800 DIMMs) and bring it back up. There may be some brief additional downtime on Monday as some disks are swapped and I correct some historical mistakes in the machine’s hardware configuration regarding disk layout and management.

Sites affected: the EBN itself (www.englishbreakfastnetwork.org), the KDE4-OpenSolaris package site (solaris.bionicmutton.org), and some personal sites, including Sebas’ vizzzion.org, bionicmutton.org and euroquis.nl.

EBN up again

Monday, April 12th, 2010

The EBN machine is up again. Will Stephenson pointed out how the API documentation can be generated locally, which takes away one reason for the machine (the machine serves up api.kde.org). The value of the on-line version comes from multiple versions, indexing, searchability (although the PHP stuff that drives the search is rather poor, so that’s been on the stuff-to-fix-sometime list for a long time), and saving you the time of generating the whole shebang yourself. The scripts and styling get occasional tweaks, too, to improve the API documentation. As always, comments are welcome; patches are welcomer.

The machine does some other things, such as serving up my personal website (not updated in a gazillion years since I blog on FSFE’s Fellowship blogging platform) and that of Sebastian Kügler.

The main CPU load on the machine is running Krazy, the KDE code quality checker. It still produces lots of warnings and small items to fix in the KDE codebase. Sometimes I see people committing whole bunches of Krazy fixes. I recently saw it referred to as “KDE’s level grind”, which is a pretty good description in the sense that it was originally intended to find and explain the kind of low-hanging fruit you can fix on a Friday afternoon.

One last thing the machine does is provide the OpenSolaris packages for KDE4. For this I ported pkg.depotd(8) to FreeBSD some time ago, but it’s starting to show its age and I think I’ll have to start running an actual OpenSolaris on the machine at some point. That means fiddling around with the available virtualization options and possibly updating and rebooting the machine repeatedly. If VirtualBox gives any measure of performance under FreeBSD, that’ll be what I use (it would also simplify updating from home, where I use it as well).

So, you may ask: what was the cause of this extended downtime? How can we prevent it from happening again? Well, one thing that would fix it would be donating a 1U 4×3.5″ SATA disk-server chassis with redundant power. Right now — thanks to being hosted in the “random crap” rack — there’s a little mini-tower thing doing the job, and it turned out to have a dodgy power cord. Some shifting and pushing-around of cables in the rack caused a momentary disconnect, which panicked the server and left it at an fsck(8) prompt for a while. This means, perhaps, that I should configure the system with “noauto” on the affected file systems, so that it comes up with reduced functionality even if the disk array is toast. Two other really important points: configure serial console support so that the ILOM can get at it, and remember the ILOM password. Cue jumpering the server to reset the password and all the Fun that entails.
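Concretely, that is one option in /etc/fstab per data filesystem, plus the usual serial-console setup so the ILOM redirection actually shows something useful (a sketch; the device and mount point are made up):

    # /etc/fstab -- the data array only mounts when asked, and is not fsck'ed at boot
    /dev/mirror/gm1s1d    /data    ufs    rw,noauto    0    0

    # /boot/loader.conf -- send loader and kernel output to the serial port
    console="comconsole"

    # /etc/ttys -- allow logins on the serial console (ttyu0 on FreeBSD 8)
    ttyu0   "/usr/libexec/getty std.9600"   vt100   on   secure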

Anyway, it comes down to this: a few decisions made three years ago caused this downtime to be longer than expected. The machine is up again, the decisions have been amended, and we’re good to go for another three years, I hope.

EBN Down

Wednesday, March 31st, 2010

The English Breakfast Network — which hosts the KDE code checking site, vizzzion.org, an anonsvn mirror for KDE, my irssi-in-screen instance and a bunch of other stuff — is down following a power outage at the university. While the older CodeYard machine came back up with no problems (yay FreeBSD 6.1! Too bad about the three years of uptime, maybe), the EBN is stuck somewhere. I’ll have to fiddle around to find the ILOM password, I guess, or in the worst case go over there and sort it out at the console (not a trip I look forward to, as it’s hella stormy today). Expect medium to major delays.