A new best practice in dealing with alerts.

Filed under: Application Monitoring, Best Practices, How-to, Tips & Troubleshooting.

Eleanor Roosevelt is reputed to have said “Learn from the mistakes of others. You can’t live long enough to make them all yourself.” In that spirit, we’re sharing a mistake we made so that you may learn.

Last weekend we had a service-impacting issue for about 90 minutes that affected a subset of customers on the East Coast. This happened despite the fact that, as you may imagine, we monitor our servers very thoroughly: error-level alerts (which are routed to people's pagers) were triggered repeatedly during the issue; we have multiple stages of escalation for error alerts; and we always have on-call staff responsible for reacting to alerts, who are always reachable.

All these conditions were true this weekend, and yet no one was alerted until over an hour after the first alerts were triggered. How was this possible?

It was (as most failures are) the result of multiple conditions interacting.

First, the primary on-call engineer for the week (stage 1 in the escalation chain) was going to be unreachable for 24 hours. He knew this, but instead of moving himself to stage 2, he simply informed his stage 2 counterpart that he'd be out of touch and relied on the escalations of LogicMonitor's alerting system. This meant that alerts would be delayed by 5 minutes, waiting for the 5-minute resend interval to escalate to the next stage, but he considered this an acceptable risk: it is rare that our services have issues, and an extra 5 minutes did not seem too long a delay in the unlikely event of an alert.

Second, the error that occurred was a 'brown-out' condition. An error was triggered and an alert sent to stage 1, but before the 5-minute escalation interval passed and the alert could be sent to the stage 2 engineer, the condition cleared and service recovered for a few minutes. Then the condition recurred, an alert was sent, and it cleared again. This pattern, a few minutes of service impact that cleared before the alert could be escalated to stage 2, repeated over and over. Only once an alert happened to persist long enough to escalate to stage 2 was someone who could respond finally notified. (Which he did within minutes, restarting the problematic service and restoring service.)

Third, this occurred on a Sunday. Had it occurred mid-week, it would have been noticed regardless, via our dashboards or our HipChat integration, which posts error- and critical-level alerts to our Tech Ops chat room. (More on that later.)

How do we prevent this in the future? This is where you get to learn from our mistakes. There are two ways to ensure that short alert/alert-clear cycles don't prevent alerts from reaching the right people at the right time. The first is a simple process fix: ensure your stage 1 alert recipients are actually going to be available. (Kind of a 'duh' measure, in retrospect...) The second: for service-affecting alerts, ensure that the Alert Clear Interval is set long enough that the alert will not clear before the escalation interval passes.

Had we set the Alert Clear Interval to at least 6 minutes, then even when the raw data was no longer triggering the alert, the alert would still have been in effect at the 5-minute escalation point, and the stage 2 engineer would have received the very first alert. (If he was not in front of a computer, he might well not have taken any action, as he'd have received an alert clear a minute later; but he would then have received a second alert the next time the issue occurred, a short time later, and would certainly have investigated and resolved the issue.)

Setting your Alert Clear Interval at least as long as your escalation interval ensures that if your stage 1 drops the ball (or is even intentionally planning on dropping it), short 'flapping' alerts will still be brought to the attention of stage 2. Of course, if stage 1 acknowledges the alerts, schedules downtime, or otherwise deals with them, then stage 2 can remain unaware, as the alerts won't be escalated to them at all.
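To make the timing concrete, here is a minimal sketch (not LogicMonitor's actual alerting engine; the flap timings are assumptions for illustration) of a brown-out condition that is active for 3 minutes and clear for 4, over and over, against a 5-minute escalation interval:

```python
# Sketch of why the Alert Clear Interval must be at least as long as the
# escalation interval. An alert reaches stage 2 only if it stays open for
# the full 5-minute escalation interval.

ESCALATION_INTERVAL = 5  # minutes until an unacknowledged alert escalates

def escalates(clear_interval, flap_up=3, flap_down=4, horizon=60):
    """Return the first minute the alert reaches stage 2, or None if it never does."""
    alert_open_since = None   # minute the current alert was triggered
    last_bad_minute = None    # most recent minute the raw condition was true
    for t in range(horizon):
        condition = (t % (flap_up + flap_down)) < flap_up  # flapping error
        if condition:
            last_bad_minute = t
            if alert_open_since is None:
                alert_open_since = t          # new alert triggered; stage 1 paged
        elif alert_open_since is not None:
            # The alert clears only once the condition has been gone this long.
            if t - last_bad_minute >= clear_interval:
                alert_open_since = None
        if alert_open_since is not None and t - alert_open_since >= ESCALATION_INTERVAL:
            return t                          # alert survived long enough to escalate
    return None

print(escalates(clear_interval=1))  # None: every alert clears before escalating
print(escalates(clear_interval=6))  # 5: the very first alert reaches stage 2
```

With a 1-minute clear interval, no alert ever lives the 5 minutes needed to escalate; with a 6-minute clear interval, the very first alert is still open at the 5-minute mark and stage 2 is paged.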

(Another option: always ensure that multiple people are contacted at each stage. But that introduces the problem of diffusion of responsibility, so we don't recommend it in general.)

We’ll be updating our Alert Response Best Practices document to reflect this, but figured we’d get the word out sooner via the blog.


One Response to “A new best practice in dealing with alerts.”

  1. Robert Barth says:

    I'm sure you've changed your process by now; I happened upon this page after a web search for something related, was intrigued by the story, and wanted to relate an idea for resolving the alerting problem you had. I believe the problem was solely in your alerting system, not in the alert clear timeout.

    You should consider adding a counter to the alert error condition, with a configurable maximum-occurrence property and a time span within which to apply the count. That way, if the same condition occurs x number of times within some reasonable time span, the alert gets escalated to the next level. That general rule should be in place for every error at every level below the maximum. It's something that would have to be tuned, but it could have prevented the problem as described.

    IMO, simply setting the timeout of the error condition higher doesn't really fix anything, because now you're going to get a chatty alert system (if this is something that may occur occasionally with no ill effect), which is going to annoy the people who monitor it and possibly cause it to be ignored. This kind of counting monitor also guards against future problems in which the timeout may not be high enough for escalation, without having to set the timeout for every subordinate escalation small enough that it will get escalated. You can have a reasonable timeout for the individual error to clear, while guaranteeing that it will escalate if something like this occurs again.
