what a wild ride.
this post contains some technical information.
so, first of all, i apologize for the downtime. this was both unforeseen and foreseen, and would have been preventable if i had money a year ago rather than now. on november 5th, i updated our database server to freebsd 6.0, and began the process of backing up the whatthefuck.com databases to begin the upgrade to mysql 5. sometime during the copy, i lost my ssh connection to the server, and did a few remote reboots and reconnects only to find i could not maintain a stable connection. the next day, i was finally able to head in, only to find that the hard drive was spewing dma errors and read/write failures, grinding everything to a halt. turns out, all that extra writing of data brought out an issue in the drive itself as well as in the cheap-ass sata controller.
(fuck you, silicon image.)
i took most of the machines home to begin work on them. i couldn't work on the system itself as it wouldn't stay active for more than a minute or two. killing the background fsck processes didn't do much of anything, as invariably some os process would read from or write to the /usr partition, bringing the whole thing down. using a handy ubuntu live cd and a freesbie cd, i managed to at least look at some data. i pulled off the dns zone files and moved them to another server so the site would resolve again, and began work on the drive. luckily, i had a spare drive to work with.
i ended up doing a complete disk duplication onto another drive, telling it to fill in the bad blocks. after ten hours, the process was complete. i fsck'ed the new drive and found the data directory missing. lo and behold, all the files were in /lost+found. those of you familiar with ufs-derived file systems know that when fsck finds a file whose directory entry can't be read, it names the file after its inode number and throws it into lost+found.
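for the curious, here's roughly what "duplicate the disk and fill in the bad blocks" means. this is a minimal python sketch, not the tool i actually ran; the function name `rescue_copy` and its parameters are made up for illustration. it's the same idea as dd with `conv=noerror,sync`: read a block at a time, and when a read fails, write zeros in its place so everything after it stays at the right offset.

```python
import io


def rescue_copy(src, dst, total_size, block_size=512):
    """Copy total_size bytes from src to dst, one block at a time.

    src and dst are binary file-like objects; src must support seek().
    Any block that raises OSError on read is replaced with zero bytes so
    later data keeps its original offset. Returns the offsets of the
    blocks that could not be read.
    """
    bad_offsets = []
    offset = 0
    while offset < total_size:
        want = min(block_size, total_size - offset)
        try:
            # Seek explicitly: after a failed read the file position
            # is not something we can trust.
            src.seek(offset)
            chunk = src.read(want)
        except OSError:
            chunk = b""
            bad_offsets.append(offset)
        # Pad failed or short reads with zeros up to the block size.
        dst.write(chunk.ljust(want, b"\x00"))
        offset += want
    return bad_offsets
```

a smaller block size loses less data around each bad spot but takes longer, which is part of why this kind of copy can run for hours.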
all of the database files ended up in lost+found that way, as well as other nibblets on that partition. 814 files, all named something like "#10849128".
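sorting through a pile like that mostly means peeking at the first few bytes of each file and guessing what it used to be. here's a hedged python sketch of that triage, not the exact process i used. the names `inode_from_name`, `guess_kind` and `triage` are hypothetical, and the signature table is only an example: the gzip magic is standard, but the myisam index bytes are my assumption and should be checked against your own known-good files before you trust them.

```python
import os
import re

# fsck names orphaned files "#<inode number>", e.g. "#10849128".
LOST_NAME = re.compile(r"^#(\d+)$")

# Example signature table. Extend it with whatever you expect to find.
SIGNATURES = {
    b"\x1f\x8b": "gzip",
    b"\xfe\xfe\x07\x01": "myisam index (assumed signature)",
}


def inode_from_name(name):
    """Return the inode number encoded in a lost+found name, or None."""
    match = LOST_NAME.match(name)
    return int(match.group(1)) if match else None


def guess_kind(header, signatures=SIGNATURES):
    """Guess a file type from its leading bytes."""
    for magic, kind in signatures.items():
        if header.startswith(magic):
            return kind
    return "unknown"


def triage(directory, signatures=SIGNATURES):
    """Return (inode, size, kind) for each orphaned file, sorted by inode."""
    rows = []
    for name in os.listdir(directory):
        inode = inode_from_name(name)
        path = os.path.join(directory, name)
        if inode is None or not os.path.isfile(path):
            continue
        with open(path, "rb") as handle:
            header = handle.read(8)
        rows.append((inode, os.path.getsize(path), guess_kind(header, signatures)))
    return sorted(rows)
```

sorting by inode number is a guess that files created together got neighboring inodes, which can help group the pieces of one table. anything that comes back "unknown" still has to be matched up by size and by hand.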
i've sorted through most of it, and a few files are flat out gone. we have it back to a point that everything's working again. the database server is still at my home, being served out of a cable connection with a shitty upstream speed, so that's why some things take longer than others. i will have the machine back in the colo facility as soon as i make sure everything's okay. i appreciate everyone's patience here, and thank you for coming back to check on us.
if you have any questions, post them to 'questions'. thanks!