Can’t these clowns get anything right?

(Two entries in one day? egad. Names deleted to protect the guilty, although they deserve being shown. I’m that mad.)

So this morning at 0530 LCL I’m getting up and about to head out to an early-morning karate workout. I check my cellphone and find that the monitoring system for one of my clients is reporting that a machine has gone down.

Great.

So I get on the phone with the co-location center NOC and have them find our machine. (Of course our machines aren’t marked with anything, and they have no record of which machine is which, even though we’ve asked them for this before!) The NOC tech finally finds it and reports to me:

“It’s turned off.”

Excuse me? Turned off? I certainly didn’t tell it to shut down.
I dutifully reported to my client:
I saw these [outage messages—ed.] this morning. I called the $CLOWN_COLO NOC, who told me that the “machine was off” and they restarted it for me.

I do not know *why* it was down; there are no indications in any error logs of any panicking or anything like that. It looks like the power simply went off -- maybe someone was working back there and pulled a plug?

Our NOC technician responds to my client directly with (and this is a verbatim quote of the message):

yes the system got unplug

Sweet Baby Cthulhu what is going on?!
This machine is in a co-location center; what are they doing going around in cabinets and tweaking power cables?
Thank goodness that the machine was not a centerpiece of the client’s production system! (It is a production machine, but a not-frequently-used one, and so the 4-hour downtime experienced likely (hopefully!) caused no major issues.)

I don’t want to even get into the fact that this co-location center boasts that it has multiple internet feeds, but what they don’t tell you is that they don’t aggregate these feeds: customers are either on one or the other internet feed, and if that internet provider goes down, well, that's just too bad!

I guess I should mention that prudent network engineering would be to aggregate all the feeds into border routers, announce via BGP to all upstreams the co-lo’s entire netblocks, and provide resiliency for customers—if one feed goes down, the other feed is still available!
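
To make the difference concrete, here is a toy sketch in Python. It is not a model of real BGP or of this co-lo's actual network. The feed names, client names, and both helper functions are made up for illustration. It only compares which customers stay reachable when each one is pinned to a single feed versus when every feed lands on shared border routers.

```python
# Toy model only: feed names, client names, and both helpers are hypothetical.

def reachable_pinned(assignments, feeds_up):
    """Each customer is pinned to exactly one feed; if it dies, so do they."""
    return {cust for cust, feed in assignments.items() if feed in feeds_up}


def reachable_aggregated(assignments, feeds_up):
    """All feeds terminate on shared border routers; any live feed serves everyone."""
    return set(assignments) if feeds_up else set()


# Which feed each customer was assigned to (only matters in the pinned case).
assignments = {"client_a": "feed_1", "client_b": "feed_2", "client_c": "feed_1"}

# Suppose feed_1's provider goes down and only feed_2 is still up.
feeds_up = {"feed_2"}

pinned = reachable_pinned(assignments, feeds_up)
aggregated = reachable_aggregated(assignments, feeds_up)

print("pinned:    ", sorted(pinned))
print("aggregated:", sorted(aggregated))
```

In the pinned case only the customer who happened to be on the surviving feed stays up; in the aggregated case everybody does, as long as at least one feed is alive.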
