
Lessons learned after a major system crash

I wanted to title this something snide about co-location clowns again, but I won’t. At this hour, the anger won’t do any good. My apologies if this isn’t as coherent as it could be.

This evening, one of my client’s data centers had a major power outage. (No, they’re not in Queens, NYC.) I found out about it right after the Sabbath ended by a phone call from one of my client’s clients, whose own monitoring was going bananas—it happened in between the Saturday coverage’s most recent check and my first check post-Sabbath.
(Yes, they have UPSes... allegedly. No, I do not know why the UPSes did not kick in. We’re waiting to hear back from them with an RCA [root cause analysis] to determine what needs to be done.)

After a major outage, you learn a lot of things about your system:

  1. You learn whether your system will boot up cleanly, with all the services you expect coming up. (Ours did not. I knew about a few that would not come up, but was surprised when INND did not come up cleanly.)
  2. You learn that even though you’re using a supposedly resilient journaling filesystem (ext3), your applications may not cope if things go south in the middle of a write. MySQL, even with InnoDB, appears to be vulnerable to this, and INND is so susceptible to it that the problem has its own FAQ entry (a quick Google search turned up this very nice page on it; search for the word “Problems”).
  3. Even though you might have backup machines, they might be just as corrupted or unavailable as your main machines if they were hit by the same problems.
  4. Debugging is still a dark art. I had many problems chasing down MySQL replication; I thought it was due to some kind of mismatch in replication file names (MySQL, out of the box, names its binary logs with the name of the machine). Instead, I found the following error in the error log:
    060722 22:15:31  InnoDB: Page checksum 2206980548, prior-to-4.0.14-form checksum 2406998997
    InnoDB: stored checksum 3308267039, prior-to-4.0.14-form stored checksum 3334416017
    InnoDB: Page lsn 36 2991543943, low 4 bytes of lsn at page end 2977609415
    InnoDB: Page may be an update undo log page
    InnoDB: Page may be an index page where index id is 0 2662
    InnoDB: and table cldpmaster/News index PRIMARY
    InnoDB: Database page corruption on disk or a failed
    InnoDB: file read of page 62009.
    InnoDB: You may have to recover from a backup.
    InnoDB: It is also possible that your operating
    InnoDB: system has corrupted its own file cache
    InnoDB: and rebooting your computer removes the
    InnoDB: error.
    InnoDB: If the corrupt page is an index page
    InnoDB: you can also try to fix the corruption
    InnoDB: by dumping, dropping, and reimporting
    InnoDB: the corrupt table. You can use CHECK
    InnoDB: TABLE to scan your table for corruption.
    InnoDB: Look also at section 6.1 of
    InnoDB: http://www.innodb.com/ibman.html about
    InnoDB: forcing recovery.
    InnoDB: Database page corruption on disk or a failed
    InnoDB: file read of page 62010.
    InnoDB: You may have to recover from a backup.
    
    This one took me well over an hour to fix; I ended up closely following the instructions on the MySQL website, then dumping the table using mysqldump, dropping it, and recreating it:
    [jbaltz@HOSTNAME mysql]$ mysqldump --single-transaction -e -u root -p DATABASENAME News >/tmp/News.sql
    Enter password:
    InnoDB: Database page corruption on disk or a failed
    InnoDB: file read of page 62009.
    [...many lines of errors...]
    mysql> drop table News;
    ERROR 2006: MySQL server has gone away
    No connection. Trying to reconnect...
    Connection id:    1
    Current database: DATABASENAME

    Query OK, 0 rows affected (3.84 sec)

    mysql> \. /tmp/News.sql

  5. You get to learn about all those little “this won’t affect the system at all” changes that you’ve made that, indeed, will affect the system upon reboot (like “temporary” hostname changes).

  6. Having a good production checklist (“runbook”) is really invaluable. Anything that reduces the amount of thinking you have to do in a stressful situation is a good thing. HOWEVER, the runbook is only as good as it is accurate. It is much worse to be complete and inaccurate than to be incomplete and accurate. (Thankfully, my own system is more of the latter than the former.)

  7. As angry as you are at other people’s incompetence, try not to let it cloud what you’re doing, lest your employer direct his anger at you; it will come back to bite you.

  8. It always takes longer than you think it will.
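
The InnoDB errors in lesson 4 follow a regular enough pattern that a small script can summarize them, which may save some squinting at 2 a.m. What follows is a minimal sketch in Python, not something taken from the actual recovery. The function name `summarize_innodb_corruption` and the `SAMPLE` text are illustrative assumptions, and the regular expressions match only the message wording shown in the error log above; other MySQL versions may phrase these lines differently.

```python
import re

# Patterns based only on the message wording in the error log quoted above
# (assumption: other MySQL/InnoDB versions may word these lines differently).
PAGE_RE = re.compile(r"InnoDB: file read of page (\d+)\.")
TABLE_RE = re.compile(r"InnoDB: and table (\S+) index (\S+)")


def summarize_innodb_corruption(log_text):
    """Return (sorted unique corrupt page numbers,
    sorted unique (table, index) pairs) found in log_text."""
    pages = sorted({int(m.group(1)) for m in PAGE_RE.finditer(log_text)})
    tables = sorted({(m.group(1), m.group(2))
                     for m in TABLE_RE.finditer(log_text)})
    return pages, tables


# Hypothetical sample, cut down from the log excerpt in lesson 4.
SAMPLE = """\
InnoDB: Page may be an index page where index id is 0 2662
InnoDB: and table cldpmaster/News index PRIMARY
InnoDB: Database page corruption on disk or a failed
InnoDB: file read of page 62009.
InnoDB: You may have to recover from a backup.
InnoDB: Database page corruption on disk or a failed
InnoDB: file read of page 62010.
"""

if __name__ == "__main__":
    pages, tables = summarize_innodb_corruption(SAMPLE)
    print("corrupt pages:", pages)
    for table, index in tables:
        print("affected table:", table, "index:", index)
```

Pointing something like this at the real error log (read in as a string) would give a short list of which tables to run CHECK TABLE against first, instead of scrolling through pages of repeated messages. It only reports what the log already says; it does not repair anything.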
