Thursday, 26 July 2012

[Resolved] We are currently experiencing unplanned outages

We are currently experiencing unexpected, intermittent outages. We apologise for this and are investigating urgently.

Update 0924: Problem was resolved at around 0830, details to follow.

Update 1040: One of our two application servers went down at around 1am UK time on July 26 (that's 8pm on July 25 in New York and 10am on July 26 in Sydney). That left the remaining healthy server handling all our users, and it was frequently overloaded until we brought the failed server back up at around 8.30am UK time.

KMB wasn't totally down for that whole time. Many users will have been able to connect, but service was slow and often not available at all for several minutes at a time. I am very sorry for this; I understand how much you all rely on KMB to run your businesses.

The reason it took so long to resolve was, embarrassingly enough, that we hadn't realised one of our servers had gone down, despite a myriad of alerts telling us that it had.

We use Pingdom to make a simple "Is KeepMeBooked reachable?" check every five minutes. And we have NewRelic to give us detailed insight into what's going on on each of our servers.

The problem arose because we also use Pingdom to monitor SiteMinder PMSXchange, which is an external service that many of our users use in conjunction with KeepMeBooked. (We use Pingdom to check SiteMinder so that if our own application cannot reach SiteMinder, and nor can Pingdom, then we know there is a SiteMinder issue, not an issue with our application.)
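
To make that reasoning concrete, here is a minimal sketch in Python of the cross-check described above. It is an illustration only, not our production code: the names `Diagnosis` and `diagnose` are hypothetical, and the case where our application can reach SiteMinder is simply treated as "no problem seen from our side", which is an assumption.

```python
from enum import Enum


class Diagnosis(Enum):
    NO_PROBLEM_SEEN = "our application can reach SiteMinder"
    SITEMINDER_ISSUE = "SiteMinder itself appears to be down"
    OUR_ISSUE = "the problem is on our side"


def diagnose(app_reaches_siteminder: bool, pingdom_reaches_siteminder: bool) -> Diagnosis:
    """Combine our own view of SiteMinder with Pingdom's independent view."""
    if app_reaches_siteminder:
        # Assumption: if our application gets through, there is nothing to diagnose.
        return Diagnosis.NO_PROBLEM_SEEN
    if not pingdom_reaches_siteminder:
        # Neither we nor Pingdom can reach SiteMinder: it is a SiteMinder issue.
        return Diagnosis.SITEMINDER_ISSUE
    # Pingdom gets through but we do not: the fault lies with our application.
    return Diagnosis.OUR_ISSUE
```

For example, `diagnose(False, False)` returns `Diagnosis.SITEMINDER_ISSUE`, while `diagnose(False, True)` points the finger at us.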


Our developer in Indonesia got an alert from Pingdom saying that KeepMeBooked was down, but misread it and thought it related to SiteMinder's PMSXchange. SiteMinder being unavailable would not have been a big deal, because a short SiteMinder outage doesn't really matter.

Only when I got up and saw a stack of Pingdom alerts on my phone did we know we had a problem. We then restarted the dead server and got back on track pretty quickly.

We've now set up SMS alerts that go to our development team in Indonesia whenever Pingdom reports KeepMeBooked as down. That means there will be much less risk of confusing a run-of-the-mill Pingdom alert with a panic-now-site-is-down Pingdom alert.
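
The idea is to route alerts by severity so the critical ones arrive on a channel that is hard to misread. A rough sketch in Python follows. It is not Pingdom's API or our actual configuration: the check names, the `Alert` type, the `channels_for` function, and the assumption that ordinary alerts go out by email are all hypothetical.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical: checks whose failure means "the site is down, panic now".
CRITICAL_CHECKS = {"KeepMeBooked"}


@dataclass(frozen=True)
class Alert:
    check_name: str  # which monitored service the alert is about
    is_down: bool    # True if the check reports the service as unreachable


def channels_for(alert: Alert) -> List[str]:
    """Decide which notification channels an alert should go out on."""
    if not alert.is_down:
        return []  # nothing to send for a healthy check
    if alert.check_name in CRITICAL_CHECKS:
        # Critical outage: send an SMS on top of the usual channel.
        return ["email", "sms"]
    # Run-of-the-mill alert (e.g. an external service): usual channel only.
    return ["email"]
```

With this split, an alert for `Alert("KeepMeBooked", True)` goes out by SMS as well, whereas `Alert("SiteMinder PMSXchange", True)` does not, so the two kinds are harder to mix up.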

Again, I am very sorry for the trouble and uncertainty this will have caused, especially to our Australian users who suffered most of the day with terrible service.
