Saturday, 8 October 2011

(RESOLVED) Unexpected outage

Update at 19:15: KeepMeBooked came back up at about 19:06 UK time, after downtime of roughly 1hr 40mins. That's the first time in two years we've had a full-on outage affecting all users. See the final update below for a detailed explanation of what happened.

We are aware that KeepMeBooked has been unreachable since about 17:30 UK time (20mins ago). There is a problem at our host, BrightBox. They are aware and investigating.

Updates soon.

17:55 Update from BrightBox: "It seems as if a switch stack has crashed – we’re getting it rebooted. ETA is currently 30mins." http://status.brightbox.co.uk/2011/10/internet-outage/

18:24 Still waiting for more info from datacentre

18:46 Update from our hosting partner, BrightBox: "We’ve ruled out the possibility of this being due to a DoS or a power failure. We’re still waiting on the switch stack reboot, which we now expect to be done within another 30 minutes (due to a communication failure regarding availability of the engineer – our apologies for the miscommunication)." http://status.brightbox.co.uk/2011/10/internet-outage/

19:07 Back up!

19:09 OK, connectivity has been restored. Two firewalls had crashed at the datacentre. More details to follow when we have them.

Update Monday 10th October:
Here's what happened, and why it took so long to resolve:

One firewall crashed. Ordinarily, that shouldn't be a big deal, because there is a secondary firewall in place to take over if the primary firewall fails.

But the secondary firewall crashed too.

Again, that shouldn't be a huge disaster, because BrightBox engineers should be able to access it remotely and restart it.

But BrightBox didn't have "out of band" access to the firewall. That is to say, their only way of remotely accessing the firewall to work on it was through the firewall. No firewall, no remote access.

(With other parts of our hosting system, BrightBox do have "out of band" access, so even if the primary internet connection to the servers fails, they can still get in remotely through a backup connection. Not so with the firewalls.)

So BrightBox needed to rely on an on-site engineer in the datacentre to restart the firewall. That's what caused the delay - raising someone on-site to go and manually hit the restart button.

BrightBox will now put in place out-of-band access to the firewalls, so if both firewalls crash again, they will be able to remotely access the firewalls and fix them quickly, without needing to wait for someone physically at the datacentre to do that.
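For the technically curious, here is a rough sketch in Python of why out-of-band access matters. It is only an illustration: the node names and the network layout are made up for the example and are not BrightBox's actual setup. It treats the network as a set of links and checks whether an engineer can still reach the firewall's management console when some devices are down.

```python
from collections import deque

def can_reach(links, start, target, down=frozenset()):
    """Return True if `target` is reachable from `start` over `links`,
    never passing through a node listed in `down`."""
    # Build an undirected adjacency map from the list of links.
    graph = {}
    for a, b in links:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    # Plain breadth-first search that skips crashed nodes.
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for nxt in graph.get(node, ()):
            if nxt not in seen and nxt not in down:
                seen.add(nxt)
                queue.append(nxt)
    return False

# Hypothetical topology: the only route to the firewall console
# runs through the firewalls themselves ("in band").
IN_BAND = [
    ("engineer", "internet"),
    ("internet", "firewall_primary"),
    ("internet", "firewall_secondary"),
    ("firewall_primary", "internal_network"),
    ("firewall_secondary", "internal_network"),
    ("internal_network", "firewall_console"),
]

# Same topology plus a hypothetical backup link that bypasses the firewalls.
OUT_OF_BAND = IN_BAND + [
    ("engineer", "backup_link"),
    ("backup_link", "firewall_console"),
]

one_down = {"firewall_primary"}
both_down = {"firewall_primary", "firewall_secondary"}

print(can_reach(IN_BAND, "engineer", "firewall_console", one_down))
print(can_reach(IN_BAND, "engineer", "firewall_console", both_down))
print(can_reach(OUT_OF_BAND, "engineer", "firewall_console", both_down))
```

With one firewall down the console is still reachable through the other. With both down, the in-band layout leaves no route at all, while the layout with a backup link still does. That is the gap an out-of-band connection is meant to close.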

They have also updated some software on the firewalls, but are still investigating what caused them to crash in the first place.

(BrightBox's own explanation is here.)
