November 24, 2008
Linux disaster recovery - From upgrade, installation, re-installation to salvaging data
Until a few weeks ago, I'd never had a big issue with Linux. I've installed and worked on more than four workstations with different distros over the years, and all of them had been stable: no lost data, no sudden reboots, no frozen screens or application crashes. Well, there is such a thing as 'segmentation fault hell', and even the prospect of losing data on a Linux workstation/server, as I realized all too well just recently. So if you're a Linux user, this post contains some preventive measures for avoiding a panic situation or, if you're already in one, some steps to recover your data from what appear to be irrecoverable hardware errors.
I'll divide this post into two sections. The first centers on what not to do in order to prevent a Linux system from going haywire; the other covers what to do if your Linux system has already gone haywire, to the point of just wanting to recover some old files and trash the hardware after that.
Upgrading after installation: The kiss of death
If you've had a stable Linux system for say 1 or 2 months after installation, you should avoid any upgrade if at all possible, and I mean any. If you're like most Linux users, this temptation will probably be difficult to avoid since most of us like to try out new stuff, but even if you're not too worried about upgrading there are some things that are better left untouched.
What about distro sanctioned upgrades? Most distros nowadays offer pre-packaged applications in formats like .rpm or .deb in order to ease the installation and upgrade process. In fact, distros like Ubuntu even have an 'Automatic update' icon flashing before a user's eyes, which makes upgrading all the more tempting and simple, just like Windows. BEWARE: Just because an upgrade is sanctioned by a distro doesn't make it entirely safe; it can cause as much havoc as installing an application from source if you don't know exactly what is happening.
What about security? I don't want to leave my system exposed! I hear you. While it's true upgrades are split between security fixes and new features, the issue with most packages is also in their dependencies. Installing or upgrading a package on Linux is like pulling a thread: a click might upgrade your browser for security patches, but it might also upgrade, in the background, a few more packages/libraries needed by this package, and those can cause problems.
There are certain packages you should never upgrade no matter what, unless you want to experience the process of 'settling in' as if it were a new workstation/server, with the potential of losing pre-existing data. Here is my list of packages you should never attempt to upgrade (from crashes I've had throughout the years):
*libc*/glibc* libraries: Most if not all Linux applications have some type of dependency on these libraries. If you're brazen enough to upgrade them, be prepared for the possibility of 'segmentation fault hell' on the next reboot. Whatever new application or feature requires upgrading one of these libraries, it can likely wait, believe me.
*gtk* libraries: This applies to those using workstations that rely on GUIs (Gnome). An upgrade to one of these libraries is likely to cause all types of application windows to start crashing for no apparent reason.
Linux Kernel: This should be the most obvious; after all, the whole system works off the kernel. It's curiously the easiest upgrade to revert, since most installations allow you to boot different kernels, including the older (read: stable) ones you might have used prior to upgrading.
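To make the list above a bit more concrete, here is a minimal sketch, under stated assumptions, of screening a list of pending upgrades before accepting them. The glob patterns and package names are hypothetical examples (real names vary by distro), and the patterns are deliberately narrower than a bare `libc*`, which would also catch unrelated libraries such as libcairo.

```python
from fnmatch import fnmatch

# Hypothetical hold patterns loosely based on the list above.
# Adjust them to the package names your distro actually uses.
HOLD_PATTERNS = ["libc6*", "glibc*", "libgtk*", "gtk*", "linux-image*", "kernel*"]


def split_upgrades(pending, patterns=HOLD_PATTERNS):
    """Split pending package names into (risky, routine) lists."""
    risky, routine = [], []
    for name in pending:
        if any(fnmatch(name, pattern) for pattern in patterns):
            risky.append(name)
        else:
            routine.append(name)
    return risky, routine


if __name__ == "__main__":
    # Example package names only; not taken from a real update run.
    pending = ["firefox", "libc6", "libcairo2", "libgtk2.0-0", "kernel-2.6.27"]
    risky, routine = split_upgrades(pending)
    print("Review before upgrading:", risky)
    print("Routine:", routine)
```

The sketch only sorts names; actually holding a package back is done with your distro's own package manager.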
These are all software related issues, but in fact even issues with your hardware can cause a once stable Linux system to start acting strange.
Segmentation Faults and Kernel Panics: It may be the hardware
If you've made minimal software upgrades on a server or workstation, and you start getting 'Kernel Panic' messages at start-up or applications suddenly start going haywire with segmentation faults, it may be related to your hardware - and not the peripheral kind (CD-ROMs) but the motherboard, chip-set, memory (RAM) kind.
The easiest one to diagnose is faulty memory(RAM), which is actually pretty common in older systems, though it really needs to be acting up for it to show up as application errors which range from segmentation faults to sudden re-boots.
I found the memtest option included in Ubuntu Linux Live-CD especially helpful. You will just need to burn this CD and enter the memtest option and the memory(RAM) will be tested. You will see some red boxes if the memory is faulty. It won't tell you which memory(RAM) stick is faulty, but once you know this you can open up your box and start a trial and error process until you remove the flaky memory stick.
Once the memory (RAM) test runs cleanly, other hardware problems can be traced back to the motherboard, chip set and even the processor. These more often present themselves as 'Kernel Panic' errors, though they can also include segmentation faults. If the errors continue to be sudden and you haven't made any software upgrades, then it's likely your hardware is now junk. Pulling out or replacing memory (RAM) sticks is feasible, unlike replacing a motherboard, chip set or processor, where given the cost/benefit of today's hardware it's likely better to replace everything.
If you're still getting kernel panics or random segmentation faults after running a memory test and you've performed a software upgrade, check for the possibility of a kernel upgrade that is not compatible with your hardware. On this last system I had, I upgraded from one prepackaged kernel version to another prepackaged version, and a minor one at that (the last digits in the kernel version). Well, the system started acting up with kernel panics and segmentation faults.
Though most distros make every attempt to test kernels on a variety of hardware, it's often impossible to do so on every possible variation. Last time I checked, compiling a Linux Kernel had over 200 options! Most use conservative defaults, but in an era with dual-core processors, 64-bits, RAID, SATA, APIC and other 'fun stuff', a particular kernel version might just have the wrong or missing option that causes havoc on a once stable system.
Correcting these errors is a case of compiling your own kernel or passing certain flags at boot time that make the kernel work with your hardware. This is a very common scenario for laptop users and 'strange' hardware setups like those in Dell hardware, which have exotic combinations of chip-sets, hard-drives and processors, which in turn require the activation or deactivation of certain kernel options to make a system run smoothly. I've even read of cases where a system's instability was traced to its temperature, with the solution being to compile a new kernel with some esoteric temperature control kernel options disabled. On laptops I've also read that some Wi-Fi modules compiled with the kernel can make certain systems unstable, so it's necessary to re-compile a kernel or use some flag at boot to keep them from being activated.
Your best bet here is to consult the manufacturer for a stable and tested kernel, or if they don't have one, check online forums to see what kernel runs best - or has problems - with a particular hardware model. Some kernel versions can be passed a series of flags to avoid certain features when they're booted; other kernel versions may just be unusable for certain types of hardware unless compiled from scratch.
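When experimenting with boot flags, it helps to confirm which ones the running kernel actually received. The following is a minimal sketch, assuming a Linux system that exposes the boot command line at /proc/cmdline; the sample flags (`noapic`, `acpi=off`) are only illustrations of the kind of options discussed above, not a recommendation for any particular hardware.

```python
def parse_cmdline(cmdline):
    """Return kernel boot parameters as a dict; bare flags map to True.

    Simplification: quoted values containing spaces are not handled.
    """
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else True
    return params


def read_running_cmdline(path="/proc/cmdline"):
    """Parse the command line the current kernel was booted with."""
    with open(path) as fh:
        return parse_cmdline(fh.read())


if __name__ == "__main__":
    # Illustrative command line, not taken from a real machine.
    sample = "root=/dev/sda1 ro quiet noapic acpi=off"
    print(parse_cmdline(sample))
```

Comparing this output between a stable boot and an unstable one is a quick way to verify that a flag you typed at the boot prompt really took effect.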
Unfortunately, you may have to do this process so many times that it may cause another type of error, one you may have had from the start: a failure to access your hard drive in any shape or form, which is the biggest nightmare imaginable if you have some type of valuable data on it.
Oops! No more partitions! Salvaging data and important files
I'll let you in on how I got to this point, innocently enough and step by step, to illustrate how this is a slippery slope. After the automatic kernel upgrade, I started getting kernel panics and segmentation faults. Needless to say, it took me a few reboots and log inspections to trace this to the newly upgraded kernel.
During this discovery process, though, each 'kernel panic' froze the system, which in turn required a hard reboot (a.k.a. unplugging the system). I knew this was not good, since every time I rebooted the sort-of-stable system it started checking the file system for errors, which I had experienced before but never so many times (consecutively at least).
At this point, I had a system that was booting but after 30 minutes to 1 hour started acting up, and was downright freezing after 2 hours. So: hard reboot again, file system checked again, and try to correct the problem asap with a kernel downgrade. (What I didn't know until later was that my problem was both an inadvertent kernel upgrade and a faulty memory (RAM) stick on a 1-year-old machine... but the lesson still continues.)
After close to 10 hard reboots, I started getting a S.M.A.R.T message, which is presented by the system BIOS prior to anything related to the operating system. Things started looking pretty grim after I read in summary that any S.M.A.R.T message indicates your hard drive is about to die.
At this point I was getting nervous, because I had some important data on the hard drive from the existing stable installation. But still S.M.A.R.T was a message that a disk was 'about to die' not 'already dead'. I realized this was probably related to the hard reboots and re-checking the file system on each boot, so I knew the clock was ticking just to salvage my existing data.
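The S.M.A.R.T status can also be queried from a running system or a Live-CD rather than waiting for the BIOS warning. The sketch below is one possible way to do it, assuming the smartmontools package (which provides the `smartctl` command) is installed and that the drive is /dev/sda; both are assumptions, not details from this story. It looks for the verdict line that `smartctl -H` prints for ATA drives; other drive types report health in a different wording, in which case the function returns None.

```python
import subprocess


def parse_smart_health(report):
    """Return True for PASSED, False for any other verdict, None if no verdict line."""
    for line in report.splitlines():
        if "overall-health self-assessment test result" in line:
            return line.split(":", 1)[1].strip() == "PASSED"
    return None


def check_drive(device="/dev/sda"):
    """Run `smartctl -H` on a device (usually needs root) and parse its verdict."""
    # smartctl uses its exit status to encode drive state, so a non-zero
    # code is not treated as a failure to run the command here.
    result = subprocess.run(
        ["smartctl", "-H", device], capture_output=True, text=True
    )
    return parse_smart_health(result.stdout)
```

A False result is the same 'about to die' warning described above: copy the data off first and investigate afterwards.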
I then booted the system from an Ubuntu Live-CD and started mailing out individual files (a very easy salvage process) from the mounted hard drive. But after 30 minutes *crash*, even from the Live-CD, so hard reboot AGAIN, S.M.A.R.T message, re-check file system AGAIN, load Live-CD. (At this point I realized it wasn't the kernel upgrade to the existing system but a hardware problem, though I didn't know exactly which type of hardware problem on a 1-year-old system.)
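When the clock is ticking like this, one alternative to mailing files out one by one is to copy a prioritized list of files to another disk and keep going when a read fails. This is only a sketch of that idea: the mount points and file names in the usage comment are hypothetical, and it is not what was done in this story.

```python
import os
import shutil


def salvage(src_root, dst_root, wanted):
    """Copy each relative path in `wanted` from src_root to dst_root, in order.

    List the most important files first. Read errors are recorded and
    skipped so one bad file does not stop the rest of the salvage.
    Returns (copied, failed), where failed holds (path, error message) pairs.
    """
    copied, failed = [], []
    for rel in wanted:
        src = os.path.join(src_root, rel)
        dst = os.path.abspath(os.path.join(dst_root, rel))
        try:
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            shutil.copy2(src, dst)
            copied.append(rel)
        except OSError as err:
            failed.append((rel, str(err)))
    return copied, failed


# Hypothetical usage from a Live-CD session:
# copied, failed = salvage("/mnt/olddisk/home/me", "/media/usbstick/rescue",
#                          ["thesis/final.odt", "photos/2008/trip.jpg"])
```

Ordering the list by importance matters more than speed here, since the drive may stop responding at any moment.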
After approx 3 more hard-boots and loading the Live-CD to salvage data from the hard disk, I didn't see a S.M.A.R.T message.......and my worst fears came true: I couldn't see the hard disk from the Live-CD. Panic started setting in, the hard drive with part of my unsalvaged data might have just died. Onto the desperation options.
If a Live-CD couldn't see the hard drive, maybe it was just the master boot record or some other thing. I still had my data in separate partitions which I could copy; a pain, but probably one of the biggest advantages of using partitions. So I used GParted, which is another Live-CD, only this one comes with a bunch of partition utilities. Loaded up GParted......NO PARTITION IN SIGHT, time to call mommy and the salvage specialist to recover data.
I knew the data was there, but would I need to take the hard disk to a salvage specialist? The data was important, but it was just a few files I hadn't managed to salvage. So I took one more shot, searching for something/anything that could let me salvage some files. After trying a few trial software packages, I came upon TestDisk and PhotoRec.
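To see what 'no partition in sight' means at the byte level, here is a minimal sketch that parses the classic MBR layout: four 16-byte partition entries starting at offset 446 of the first 512-byte sector, followed by the 0x55AA signature at offset 510. It assumes a traditional MBR-partitioned disk, and the /dev/sda path in the usage comment is a hypothetical example. It only reads and never writes, so it is not a repair tool.

```python
import struct

SECTOR_SIZE = 512


def read_mbr_partitions(sector):
    """Parse the four primary partition entries of a 512-byte MBR sector.

    Returns None when the 0x55AA boot signature is missing, otherwise a
    list with one dict per used entry (possibly empty).
    """
    if len(sector) < SECTOR_SIZE or sector[510:512] != b"\x55\xaa":
        return None
    entries = []
    for slot in range(4):
        raw = sector[446 + 16 * slot : 446 + 16 * (slot + 1)]
        # Entry layout: status, CHS start (skipped), type, CHS end (skipped),
        # first sector (LBA), sector count; integers are little-endian.
        status, ptype, lba_start, sectors = struct.unpack("<B3xB3xII", raw)
        if ptype != 0:  # type 0 marks an unused slot
            entries.append({
                "slot": slot,
                "bootable": status == 0x80,
                "type": ptype,
                "lba_start": lba_start,
                "sectors": sectors,
            })
    return entries


# Hypothetical usage (needs root, read-only):
# with open("/dev/sda", "rb") as disk:
#     print(read_mbr_partitions(disk.read(SECTOR_SIZE)))
```

An empty list or None from a disk that used to hold partitions suggests the table itself is damaged while the data may still be intact behind it.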
Eureka! Not only was I able to rebuild the partition table from what seemed irrecoverable disk drive damage with TestDisk, but it was not even a trial version, it turned out to be open source. A truly amazing piece of software that you can also run on a Live-CD. After running what the program calls a 'deep-scan', the entire partition table was re-built making it possible to recover my remaining files.
I didn't use PhotoRec since they suggest you attempt to use TestDisk first, which worked for me. But if PhotoRec works like TestDisk, it's probably your last option for recovering files from a damaged hard drive before taking it to a salvage specialist.
Linux is stable under most circumstances, but if an upgrade or re-installation surprises you in some way, I hope this entry helps you get through the process in a shorter amount of time than I did or at least have a shorter route to salvage data from a broken workstation or server.
Posted by Daniel at November 24, 2008 10:41 PM