Lessons learned after a major system crash
I wanted to title this something snide about co-location clowns again, but I won’t. At this hour, the anger won’t do any good. My apologies if this isn’t as coherent as it could be.
This evening, one of my client’s data centers had a major power outage. (No, they’re not in Queens, NYC.) I found out about it right after the Sabbath ended by a phone call from one of my client’s clients, whose own monitoring was going bananas—it happened in between the Saturday coverage’s most recent check and my first check post-Sabbath.
(Yes, they have UPSes...allegedly. No, I do not know why the UPSes did not kick in. We’re waiting to hear back from them with an RCA [root cause analysis] to determine what needs to be done.)
After a major outage, you learn a lot of things about your system:
- You learn whether your system will boot cleanly, with every service you expect actually coming up. (Ours did not. I knew about a few that would not come up, but was surprised when INND did not come up cleanly.)
- You learn that even though you’re using a supposedly resilient journaling filesystem (ext3), your applications can still end up with damaged data if things went south right in the middle of a write. MySQL, even with InnoDB, appears to be vulnerable to this, and INND is so susceptible to it that the problem is covered in an FAQ (a quick Google search turned up this very nice page on it; search for the word “Problems”).
- Even though you might have backup machines, they might be just as corrupted or unavailable as your main machines if they suffered the same problems.
- Debugging is still a dark art. I had many problems chasing down MySQL replication; I thought it was due to some kind of mismatch in replication file names (MySQL, out of the box, names its binary logs with the name of the machine). Instead, I found the following error in the error log:

    060722 22:15:31 InnoDB: Page checksum 2206980548, prior-to-4.0.14-form checksum 2406998997
    InnoDB: stored checksum 3308267039, prior-to-4.0.14-form stored checksum 3334416017
    InnoDB: Page lsn 36 2991543943, low 4 bytes of lsn at page end 2977609415
    InnoDB: Page may be an update undo log page
    InnoDB: Page may be an index page where index id is 0 2662
    InnoDB: and table cldpmaster/News index PRIMARY
    InnoDB: Database page corruption on disk or a failed
    InnoDB: file read of page 62009.
    InnoDB: You may have to recover from a backup.
    InnoDB: It is also possible that your operating
    InnoDB: system has corrupted its own file cache
    InnoDB: and rebooting your computer removes the
    InnoDB: error.
    InnoDB: If the corrupt page is an index page
    InnoDB: you can also try to fix the corruption
    InnoDB: by dumping, dropping, and reimporting
    InnoDB: the corrupt table. You can use CHECK
    InnoDB: TABLE to scan your table for corruption.
    InnoDB: Look also at section 6.1 of
    InnoDB: http://www.innodb.com/ibman.html about
    InnoDB: forcing recovery.
    InnoDB: Database page corruption on disk or a failed
    InnoDB: file read of page 62010.
    InnoDB: You may have to recover from a backup.

  This one took me well over an hour to fix; I ended up following closely the instructions on the MySQL website, followed by dumping the table using mysqldump, dropping it and recreating it.

    [jbaltz@HOSTNAME mysql]$ mysqldump --single-transaction -e -u root -p DATABASENAME News >/tmp/News.sql
    Enter password:
    InnoDB: Database page corruption on disk or a failed
    InnoDB: file read of page 62009.
    [...many lines of errors...]

    mysql> drop table News;
    ERROR 2006: MySQL server has gone away
    No connection. Trying to reconnect...
    Connection id: 1
    Current database: DATABASENAME

    Query OK, 0 rows affected (3.84 sec)

    mysql> \. /tmp/News.sql

- You get to learn about all those little “this won’t affect the system at all” changes that you’ve made that, indeed, will affect the system upon reboot (like “temporary” hostname changes).
- Having a good production checklist (“runbook”) is really invaluable. Anything that reduces the amount of thinking you have to do in a stressful situation is a good thing. HOWEVER, the runbook is only as good as it is accurate. It is much worse to be complete and inaccurate than to be incomplete and accurate. (Thankfully my own system is more of the latter than the former.)
- As angry as you are at other people’s incompetence, try not to let it cloud what you’re doing, lest your employer direct his anger at you. It will come back to bite you.
- It always takes longer than you think it will.
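
The first lesson above (finding out the hard way which services did not come back) is the kind of thing a small post-boot check in the runbook could catch sooner. What follows is only a rough sketch, not what I actually ran that night: it tries a TCP connection to each service the runbook says should be listening and reports the ones that refuse. The service names, the `EXPECTED_SERVICES` table, the ports (the usual defaults for MySQL and NNTP), and both function names are my own illustrative assumptions.

```python
import socket

# Hypothetical runbook entries: service name -> (host, TCP port).
# These names and ports are illustrative assumptions, not the real system's.
EXPECTED_SERVICES = {
    "mysql": ("127.0.0.1", 3306),
    "innd": ("127.0.0.1", 119),
}


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False


def services_down(expected, timeout=2.0):
    """Return the sorted names of expected services that do not accept connections."""
    return sorted(
        name
        for name, (host, port) in expected.items()
        if not port_is_open(host, port, timeout)
    )


if __name__ == "__main__":
    down = services_down(EXPECTED_SERVICES)
    if down:
        print("NOT accepting connections:", ", ".join(down))
    else:
        print("All expected services are accepting connections.")
```

Keep in mind that an open port only tells you the daemon is listening. It says nothing about whether the data underneath is intact, which is exactly the InnoDB lesson above: the server came up and then fell over when it touched the corrupt page. And, like the runbook itself, a check like this is only as good as its list is accurate.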